Can we detect when language models confabulate?
Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
Standard entropy estimation for LLM outputs is misleading because the same correct answer can be worded in many different ways, and each wording counts as a separate outcome, which inflates apparent uncertainty. Semantic entropy addresses this by operating at the level of meaning rather than tokens.
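A small worked example makes the inflation concrete. This is an illustrative sketch only: the sampled answers and the hand-assigned meaning labels are hypothetical, and entropy is estimated by simple counting rather than from model probabilities.

```python
import math
from collections import Counter


def empirical_entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Hypothetical samples for the question "What is the capital of France?"
samples = [
    "Paris",
    "It's Paris.",
    "The capital of France is Paris.",
    "Paris is the capital.",
]

# Surface level: every distinct string is its own outcome, so four
# paraphrases of one answer look maximally uncertain (entropy = log 4).
surface_entropy = empirical_entropy(samples)

# Meaning level: labels assigned by hand here (an assumption for the
# example). All four samples say the same thing, so entropy is zero.
meaning_labels = ["paris", "paris", "paris", "paris"]
meaning_entropy = empirical_entropy(meaning_labels)

print(f"surface entropy: {surface_entropy:.3f} nats")
print(f"meaning entropy: {meaning_entropy:.3f} nats")
```

The model is equally certain in both views; only the unit of counting changes.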
The method: sample multiple answers to a question, cluster them by bidirectional entailment (if A entails B and B entails A, they share a semantic cluster), then compute entropy over the clusters. High semantic entropy, where the samples scatter across many incompatible meaning clusters, signals confabulation. Low semantic entropy, where the answers converge on the same meaning despite different wording, signals that the answer is not arbitrary, although a model can still be consistently wrong.
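The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_entails` is a hypothetical stand-in for a real entailment model, cluster probabilities are estimated by counting samples, and each new answer is compared only against the first member of each existing cluster.

```python
import math
from typing import Callable, List

# Hypothetical vocabulary used only by the toy entailment check below.
CITIES = ("paris", "lyon", "marseille")


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for an NLI model: true when both mention the same cities."""
    def mentioned(text: str) -> set:
        return {city for city in CITIES if city in text.lower()}
    return mentioned(premise) == mentioned(hypothesis)


def cluster_by_meaning(
    answers: List[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Group answers so that members of a cluster entail each other."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional entailment: A entails B and B entails A.
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster shares this meaning, so start a new one.
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: List[str], entails: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) over meaning clusters, with count-based probabilities."""
    clusters = cluster_by_meaning(answers, entails)
    n = len(answers)
    return sum((len(c) / n) * math.log(n / len(c)) for c in clusters)


consistent = ["Paris", "It's Paris.", "The capital is Paris."]
scattered = ["Paris", "Lyon", "Marseille", "It's Lyon."]

low = semantic_entropy(consistent, toy_entails)   # one cluster
high = semantic_entropy(scattered, toy_entails)   # three clusters
print(f"consistent: {low:.3f} nats, scattered: {high:.3f} nats")
```

In practice the `entails` argument would wrap an entailment classifier or a prompted LLM; keeping it as a plain callable lets the clustering and entropy logic stay independent of that choice.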
Key properties:
- Works across datasets and tasks without a priori knowledge of the task
- Requires no task-specific data
- Robustly generalizes to unseen tasks
- Improves question-answering accuracy on the questions the model does answer, by flagging high-entropy questions where the model should not be trusted
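The last property amounts to selective answering: answer only when semantic entropy is below a threshold, and abstain otherwise. The sketch below illustrates the bookkeeping. The records and the threshold are made-up values for illustration, not results from the paper, and `accuracy_on_answered` is a hypothetical helper name.

```python
import math
from typing import List, Optional, Tuple


def accuracy_on_answered(
    records: List[Tuple[bool, float]], threshold: float
) -> Tuple[Optional[float], float]:
    """Return (accuracy over answered questions, fraction of questions answered).

    Each record is (is_correct, semantic_entropy) for one question. A question
    is answered only when its semantic entropy is at or below the threshold.
    """
    answered = [is_correct for is_correct, entropy in records if entropy <= threshold]
    if not answered:
        return None, 0.0
    return sum(answered) / len(answered), len(answered) / len(records)


# Made-up records for illustration only: (is_correct, semantic_entropy).
records = [
    (True, 0.0),
    (True, 0.3),
    (False, 1.1),
    (True, 0.2),
    (False, 0.9),
    (False, 0.1),  # consistently wrong: low entropy does not guarantee correctness
]

# Answer everything versus abstain on high-entropy questions.
baseline_acc, baseline_cov = accuracy_on_answered(records, threshold=math.inf)
filtered_acc, filtered_cov = accuracy_on_answered(records, threshold=0.5)

print(f"answer all:      accuracy={baseline_acc:.2f}, coverage={baseline_cov:.2f}")
print(f"abstain if >0.5: accuracy={filtered_acc:.2f}, coverage={filtered_cov:.2f}")
```

The threshold trades coverage for accuracy, so it would normally be tuned on held-out questions rather than fixed in advance.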
The paper draws a precise distinction: not all hallucinations are confabulations. Confabulations are "arbitrary and incorrect generations": outputs where the model could just as easily have produced a different, incompatible answer, for instance under a different random seed. Semantic entropy detects this specific failure mode, inconsistency at the meaning level, and does not catch errors the model repeats consistently.
This is practically valuable because it is self-referential — the model's own output distribution provides the uncertainty signal, requiring no external ground truth. When a model confabulates, it typically does so inconsistently across samples: different runs produce semantically incompatible answers. This inconsistency, invisible at the token level, becomes measurable at the semantic level.
Inquiring lines that use this note as a source (23)
This note is a source for these synthesized inquiries.
- How much does ROUGE metric choice inflate hallucination detection claims?
- Does inevitable LLM hallucination make detection metric validity critical?
- Can novelty detection alone distinguish grounded synthesis from hallucinated restatement?
- What linguistic markers distinguish longer incorrect traces from correct ones?
- What role does entity salience play in detecting incoherence?
- What semantic classifier design avoids lexical variation without genuine conceptual distinctness?
- How do you verify whether your context distribution satisfies covariate diversity?
- How should designers measure and explain semantic uncertainty to users?
- Is confabulation inevitable in large language models regardless of training?
- Can measuring semantic entropy help us detect unreliable generations?
- How does the Word Novelty Rate metric measure convention formation?
- Do high-disagreement items signal contested values or measurement noise?
- Why does output alignment fail to catch internally incoherent reasoning?
- Why do models confabulate inconsistently across different samples?
- How does semantic entropy compare to confidence scores from internal model probabilities?
- Why do NLP benchmarks treat annotation disagreement as noise rather than signal?
- Can models distinguish between ambiguous and incomplete information inputs?
- What makes out-of-band monitoring better than in-band verification loops?
- What breaks when a mis-synthesized verifier runs with high confidence?
- Why does model confidence fail to detect hallucinations on rare entity pairs?
- Why does model confidence fail to detect hallucinations about rare entities?
- Can learned verifiers detect structural near-misses that pooled retrievers miss?
- How does linguistic calibration differ from token probability calibration?
Related concepts in this collection (3)
- Does calling LLM errors hallucinations point us toward the wrong fixes?
  Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts. The terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem.
  Semantic entropy operationalizes the detection of one class of fabrication: semantically inconsistent generation.
- Can model confidence work as a reward signal for reasoning?
  Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
  Semantic entropy is an alternative confidence signal; both use self-referential measures, but semantic entropy operates over sampled outputs rather than internal probabilities.
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Calibration and confabulation detection are related: well-calibrated models should have lower semantic entropy on questions they answer correctly.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Detecting hallucinations in large language models using semantic entropy
- Fine-grained Hallucination Detection and Editing for Language Models
- The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
- Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
- Chain-of-Verification Reduces Hallucination in Large Language Models
- Hallucination is Inevitable: An Innate Limitation of Large Language Models
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- Sources of Hallucination by Large Language Models on Inference Tasks
Original note title
semantic entropy detects confabulations by computing uncertainty over meanings rather than tokens