Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?

Synthesis note · 2026-02-23 · sourced from MechInterp

Token-level entropy estimates for LLM outputs can mislead: the same correct answer can be phrased in many different ways, and each phrasing counts as a separate outcome, which inflates apparent uncertainty. Semantic entropy addresses this by measuring uncertainty over meanings rather than over token sequences.
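
To make the contrast concrete, the two quantities can be written side by side. This is a sketch in assumed notation: the symbols x (question), s (sampled sequence) and c (a group of sequences that mean the same thing) are introduced here for illustration and are not taken from the note. Logs are natural logs.

```latex
% Naive entropy over sampled sequences s for a question x
H_{\text{naive}}(x) = -\sum_{s} p(s \mid x)\,\log p(s \mid x)

% Semantic entropy over meaning classes c, where each class pools
% the probability of every sequence that expresses that meaning
p(c \mid x) = \sum_{s \in c} p(s \mid x), \qquad
H_{\text{sem}}(x) = -\sum_{c} p(c \mid x)\,\log p(c \mid x)

% Worked case: three equally likely samples, worded differently,
% all meaning "Paris"
H_{\text{naive}} = -3 \cdot \tfrac{1}{3}\log\tfrac{1}{3} = \log 3 \approx 1.10, \qquad
H_{\text{sem}} = -1 \cdot \log 1 = 0
```

In the worked case the naive estimate reports substantial uncertainty even though every sample gives the same answer, while the semantic estimate is zero.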

The method: sample multiple answers to a question, cluster them by bidirectional entailment (two answers share a cluster if each entails the other), then compute entropy over the clusters. High semantic entropy, where the samples spread across many incompatible meaning clusters, signals confabulation. Low semantic entropy, where the answers converge on one meaning despite different wording, signals that the model is consistent about its answer.
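
The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it estimates cluster probabilities from sample counts rather than from model likelihoods, it compares each answer only against the first member of each existing cluster, and `toy_entails` is a hypothetical stand-in for a real entailment (NLI) model.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    answers: List[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group answers: two answers share a cluster if each entails the other."""
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member only
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no match: start a new meaning cluster
    return clusters


def semantic_entropy(
    answers: List[str], entails: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) over meaning clusters, using sample frequencies."""
    clusters = cluster_by_meaning(answers, entails)
    probs = [len(c) / len(answers) for c in clusters]
    return sum(-p * math.log(p) for p in probs)


def toy_entails(a: str, b: str) -> bool:
    """Hypothetical stand-in for an NLI model: true when both name the same city."""
    cities = ("paris", "lyon", "rome")
    named_a = {c for c in cities if c in a.lower()}
    named_b = {c for c in cities if c in b.lower()}
    return bool(named_a) and named_a == named_b


consistent = ["Paris", "It's Paris.", "The capital is Paris"]
scattered = ["Paris", "It's Paris.", "The capital is Paris", "Lyon", "Rome"]
print(semantic_entropy(consistent, toy_entails))  # one cluster
print(semantic_entropy(scattered, toy_entails))   # three clusters
```

In practice the `entails` argument would wrap a call to an entailment classifier, and the answers would come from sampling the model several times at non-zero temperature.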

Key properties:

Works across datasets and tasks without a priori knowledge of the task
Requires no task-specific data
Robustly generalizes to unseen tasks
Improves question-answering accuracy on the questions the model does answer, by flagging high-entropy questions where its output should not be trusted
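
The last property amounts to selective answering: decline when semantic entropy is high, and measure accuracy on what remains. A minimal sketch follows, assuming a hypothetical fixed threshold. The function name, the threshold value and the `toy_records` data are invented for illustration and are not results from the paper.

```python
from typing import List, Tuple


def selective_accuracy(
    records: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (accuracy on answered questions, fraction of questions answered).

    Each record is (semantic_entropy, was_correct). A question is answered
    only when its semantic entropy is at or below the threshold.
    """
    answered = [correct for entropy, correct in records if entropy <= threshold]
    if not answered:
        return 0.0, 0.0
    return sum(answered) / len(answered), len(answered) / len(records)


# Invented illustrative data: (semantic entropy, whether the answer was correct)
toy_records = [
    (0.0, True),
    (0.1, True),
    (0.3, False),
    (0.9, False),
    (1.0, True),
    (1.1, False),
]

print(selective_accuracy(toy_records, float("inf")))  # answer everything
print(selective_accuracy(toy_records, 0.5))           # abstain above 0.5 nats
```

The trade-off is coverage: a lower threshold raises accuracy on answered questions only to the extent that entropy actually tracks errors, and it always reduces the share of questions answered.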

The paper draws a precise distinction: not all hallucinations are confabulations. Confabulations are "arbitrary and incorrect generations", outputs where the model could just as easily have produced a different, incompatible answer. Semantic entropy detects this specific failure mode: inconsistency at the meaning level. It does not flag errors the model repeats consistently across samples, since those yield low semantic entropy.

This is practically valuable because it is self-referential — the model's own output distribution provides the uncertainty signal, requiring no external ground truth. When a model confabulates, it typically does so inconsistently across samples: different runs produce semantically incompatible answers. This inconsistency, invisible at the token level, becomes measurable at the semantic level.

Related concepts in this collection 3

Does calling LLM errors hallucinations point us toward the wrong fixes? Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts. The terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem.
Semantic entropy operationalizes the detection of one class of fabrication: semantically inconsistent generation.

Can model confidence work as a reward signal for reasoning? Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
Semantic entropy is an alternative confidence signal; both use self-referential measures, but semantic entropy operates over sampled outputs rather than internal probabilities.

Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
Calibration and confabulation detection are related: well-calibrated models should have lower semantic entropy on questions they answer correctly.

Original note title

semantic entropy detects confabulations by computing uncertainty over meanings rather than tokens
