SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Model Architecture and Internals · Language, Text, and Discourse

Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?

Synthesis note · 2026-02-23 · sourced from MechInterp

Standard entropy estimation for LLM outputs can be misleading: the same correct answer can be expressed in many syntactically different ways, and token-level measures treat each phrasing as a distinct outcome, which inflates apparent uncertainty. Semantic entropy addresses this by operating at the level of meaning rather than tokens.
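The inflation effect can be illustrated with a small sketch. It is a simplification under stated assumptions: entropy over exact strings stands in for a token-level measure (real token-level metrics use sequence likelihoods), and the `samples` and hand-assigned `meanings` labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical samples: four phrasings of the same correct answer.
samples = ["Paris", "It's Paris.", "The capital is Paris", "Paris, France"]

# Hand-assigned meaning labels (an assumption made for illustration only).
meanings = ["paris", "paris", "paris", "paris"]

# Every string is distinct, so surface-level entropy is maximal for 4 samples.
string_entropy = entropy(samples)

# All samples share one meaning, so entropy over meanings is zero.
meaning_entropy = entropy(meanings)

print(f"surface-level entropy: {string_entropy:.3f} nats")
print(f"meaning-level entropy: {meaning_entropy:.3f} nats")
```

The four samples agree completely in meaning, yet the surface-level measure reports the maximum possible uncertainty for four samples.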

The method: sample multiple answers to a question, cluster them by bidirectional entailment (if A entails B and B entails A, they share a semantic cluster), then compute entropy over the clusters. High semantic entropy, where the samples spread across many incompatible meaning clusters, signals likely confabulation. Low semantic entropy, where the answers converge on the same meaning despite different wording, signals consistency, which is necessary but not sufficient for reliability: a model can also be consistently wrong.
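The three steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_entails` and the `CITIES` set are hypothetical stand-ins for a real entailment model (such as an NLI classifier), each answer is compared only against the first member of each cluster, and cluster probabilities are estimated from sample counts, which is an assumption the note does not confirm.

```python
import math
from typing import Callable, List

Entails = Callable[[str, str], bool]

def cluster_by_meaning(answers: List[str], entails: Entails) -> List[List[str]]:
    """Greedily group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional entailment: A entails B and B entails A.
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched, so this answer starts a new one.
            clusters.append([answer])
    return clusters

def semantic_entropy(answers: List[str], entails: Entails) -> float:
    """Entropy (in nats) over meaning clusters, using count-based probabilities."""
    clusters = cluster_by_meaning(answers, entails)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Hypothetical stand-in for an entailment model: answer a "entails" answer b
# if every known city mentioned in b is also mentioned in a.
CITIES = {"paris", "lyon", "marseille"}

def mentioned(text: str) -> set:
    return {word.strip(".,!?").lower() for word in text.split()} & CITIES

def toy_entails(a: str, b: str) -> bool:
    return mentioned(b) <= mentioned(a)

consistent = ["Paris", "It is Paris.", "The capital is Paris"]
scattered = ["Paris", "Lyon", "Marseille"]

print(semantic_entropy(consistent, toy_entails))  # one cluster
print(semantic_entropy(scattered, toy_entails))   # three clusters
```

With the toy entailment check, the consistent samples collapse into a single cluster and the scattered samples form three. In practice the quality of the signal depends on how well the entailment model judges equivalence.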

Key properties:

The paper draws a precise distinction: not all hallucinations are confabulations. Confabulations are "arbitrary and incorrect generations": outputs where resampling could just as easily have produced a different, incompatible answer. Semantic entropy detects this specific failure mode, inconsistency at the meaning level.

This is practically valuable because it is self-referential — the model's own output distribution provides the uncertainty signal, requiring no external ground truth. When a model confabulates, it typically does so inconsistently across samples: different runs produce semantically incompatible answers. This inconsistency, invisible at the token level, becomes measurable at the semantic level.

Original note title

semantic entropy detects confabulations by computing uncertainty over meanings rather than tokens