[vLLM] Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

코딩/vLLM

[vLLM] Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

khseon7 2025. 12. 15. 21:50

https://blog.vllm.ai/2025/12/14/halugate.html

Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right in front of them.

blog.vllm.ai

LLM 모델이 도구를 호출하여 정확한 데이터를 받았음에도 불구하고 extrinsic hallucination(외적 환각) 때문에 잘못된 답을 내놓는다. vLLM은 Signal-Decision Architecture를 기반으로, 지원되지 않는 claim이 사용자에게 도달하기 전에 이를 감지하는 conditional, token-level hallucination detection pipeline인 HaluGate를 소개한다.

The Problem: Hallucinations Block Production Deployment

Hallucination은 LLM을 실제 현장에 적용하는 데 있어 가장 큰 걸림돌이 되고 있다. 법률, 의료, 금융, 고객 서비스 등 다양한 산업 분야에서 이러한 현상은 공통적으로 나타난다. hallucination의 문제는 정확한 답변 속에 교묘하게 숨겨진 오류, 즉 해당 분야 전문가의 지식이나 외부 검증을 통해서만 발견할 수 있는 오류이다. 기업에게 있어 이러한 불확실성 때문에 LLM 도입은 자산이 아닌 부담이 된다.

The Scenario: When Tools Work But Models Don’t

User: "When was the Eiffel Tower built?"
Tool Call: get_landmark_info("Eiffel Tower")
Tool Response: {"name": "Eiffel Tower", "built": "1887-1889", "height": "330 meters", "location": "Paris, France"}
LLM Response: "The Eiffel Tower was built in 1950 and stands at 500 meters tall in Paris, France."}

해당 도구는 정확한 데이터를 반환했고, 모델의 응답에는 사실이 포함되어 있다. 그러나 그 "사실" 중 두 가지는 조작된 것, 즉 제공된 맥락과 정면으로 모순되는 extrinsic hallucinations이다.

이 시나리오의 문제는 다음과 같다.

사용자들은 도구의 이름 확인하고 결과를 신뢰한다.
기존 필터는 해로운 내용이 없기 때문에 이를 걸러내지 못한다.
다른 LLM에게 평가를 맡기면 비용이 많이 든다.

만약? 우리가 이러한 오류를 밀리초 단위의 지연 시간으로 실시간 자동 감지할 수 있다면 어떨까?

The Insight: Function Calling as Ground Truth

최신 function-calling APIs는 이미 기본적인 맥락 정보를 제공한다. 사용자가 사실에 관한 질문을 하면 모델은 DB 조회, API 호출, 문서 검색과 같은 도구를 호출한다. 이러한 방식으로 별도의 검색 인프라를 구축할 필요가 없다. 그저 문제는 그 답이 문맥에 부합하는가 하는 것이다.

그렇다고 다른 LLM을 호출하여 확인하는 방식은 실제 운영 환경에서 근본적인 문제를 가진다.

Latency (지연)
Cost (비용)
Explainability (설명 가능성)
Position bias (입장 편향)
Verbosity bias (장황함 편향)
Self-preference (자기 선호)
Inconsistency (불일치)

HaluGate: A Two-Stage Detection Pipeline

HaluGate는 효율성과 정확성의 균형을 맞춘 조건부 2단계 파이프라인을 구현한다.

Stage 1: HaluGate Sentinel (Prompt Classification)

다음 예시들을 보면 모든 쿼리에 hallucination detection 기능이 필요한 것은 아니다.

Prompt	Needs Fact-Check?	Reason
“When was Einstein born?”	✅ Yes	Verifiable fact
“Write a poem about autumn”	❌ No	Creative task
“Debug this Python code”	❌ No	Technical assistance
“What’s your opinion on AI?”	❌ No	Opinion request
“Is the Earth round?”	✅ Yes	Factual claim

창작물이나 코드 리뷰에 token-level detection을 실행하는 것은 낭비이며, 잠재적으로 오탐을 발생할 수 있다.

사전 분류가 중요한 이유: token-level detection은 context length에 비례하여 확장된다. 4,000개 토큰으로 구성된 RAG context의 경우 detection에 약 125ms가 소요되지만, 16,000개 토큰의 경우 약 365ms가 소요된다. 실제 운영 환경에서 쿼리의 약 35%가 사실에 기반하지 않은 쿼리인 경우, 사전 분류를 통해 효율성을 72.2% 향상시킬 수 있다. 즉, 불필요한 쿼리에 대해서 비용이 많이 드는 detection 과정을 완전히 생략할 수 있다.

이 모델은 다음과 같은 데이터 조합으로 학습되었다.

사실 확인 필요

Question Answering: SQuAD, TriviaQA, Natural Questions, HotpotQA
Truthfulness: TruthfulQA
Hallucination Benchmarks: HaluEval, FactCHD
Information-Seeking Dialogue: FaithDial, CoQA
RAG Datasets: neural-bridge/rag-dataset-12000

사실 확인 불필요

Creative Writing: WritingPrompts, story generation
Code: CodeSearchNet docstrings, programming tasks
Opinion/Instruction: Dolly non-factual, Alpaca creative

이 이진 분류는 Rust/Candle 네이티브 통합을 통해 약 12ms의 추론 지연 시간으로 96.4%의 검증 정확도를 달성한다.

Stage 2: Token-Level Detection + NLI Explanation

사실 확인 요청으로 분류된 프롬프트의 경우, 두 가지 모델을 사용하는 탐지 파이프라인을 실행한다.

Token-Level Hallucination Detection

sentence-level classifier는 "hallucinated/not hallucinated"라는 단일 label을 출력하는 반면, token-level detection은 문맥에 의해 뒷받침되지 않는 토큰을 정확하게 식별한다.

The Model Architecture

Input: [CLS] context [SEP] question [SEP] answer [SEP]
        ↓
   ModernBERT Encoder
     ↓
   Token Classification Head (Binary per token)
      ↓
    Label: 0 = Supported, 1 = Hallucinated (for answer tokens only)

NLI Explanation Layer

NLI (Natural Language Inference) 모델은 감지된 각 범위를 문맥에 따라 분류한다.

NLI Label	Meaning	Severity	Action
CONTRADICTION	Claim conflicts with context	4 (High)	Flag as error
NEUTRAL	Claim not supported by context	2 (Medium)	Flag as unverifiable
ENTAILMENT	Context supports the claim	0	Filter false positive

Integration with Signal-Decision Architecture

HaluGate는 독립적으로 작동하는 것이 아니라, 새로운 Signal-Decision Architecture에 긴밀하게 통합되어 있다.

fact_check as a Signal Type

fact_check는 이제 핵심 신호 유형으로 자리 잡았다. 이를 통해 질문이 사실 확인을 위한 것인지 여부에 따라 결정을 내릴 수 있다.

Request-Response Context Propagation

핵심 과제는 classification은 요청 시점에 이루어지지만, detection은 응답 시점에 이루어진다는 점이다.

RequestContext 구조체는 필요한 모든 상태 정보를 담고 있다.

RequestContext:
  # Classification results (set at request time)
  FactCheckNeeded: true
  FactCheckConfidence: 0.87

  # Tool context (extracted at request time)
  HasToolsForFactCheck: true
  ToolResultsContext: "Built 1887-1889, 330 meters..."
  UserContent: "When was the Eiffel Tower built?"

  # Detection results (set at response time)
  HallucinationDetected: true
  HallucinationSpans: ["1950", "500 meters"]
  HallucinationConfidence: 0.92

The Complete Pipeline: Three Paths

Path	Condition	Latency Added	Action
Path 1	Non-factual prompt	~12ms (classifier only)	Pass through
Path 2	Factual + No tools	~12ms	Add warning headers
Path 3	Factual + Tools available	76-162ms	Full detection + headers

Model Architecture Deep Dive

HaluGate Sentinel: Binary Prompt Classification
→ Architecture: ModernBERT-base + LoRA adapter + binary classification head
HaluGate Detector: Token-Level Binary Classification
→ Architecture: ModernBERT-base + token classification head
HaluGate Explainer: Three-Way NLI Classification
→ Architecture: ModernBERT-base fine-tuned on NLI

Why Native Rust/Candle Matters

세 가지 모델 모두 Hugging Face의 Rust 기반 머신러닝 프레임워크인 Candle을 통해 native로 실행되며, Go 언어용 CGO 바인딩을 제공한다.

What HaluGate Cannot Detect

HaluGate는 도구/RAG context가 검증의 근거를 제공하는 외인성 환각을 대상으로 한다. 하지만 다음과 같은 알려진 한계점이 있다.

Limitation	Example	Reason
Intrinsic hallucinations	Model says “Einstein was born in 1900” without any tool call	No context to verify against
No-context scenarios	User asks factual question, no tools defined	Missing ground truth

현재글[vLLM] Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

khseon7 님의 블로그

인공지능과 관련된 이것저것 정리해보는 블로그

리눅스, 강화 학습, Linux, rmok, OOM, k8s, grpo, URM, minikube, vllm, 강화학습, TurboQuant, servicemesh, Rag, dapo, Terminal-bench, k3d, 심층 강화 학습, LLM, benchmark,

Today :
Yesterday :

khseon7 님의 블로그