Why do models hallucinate when retrieval heads fail despite having information in context?
This explores the mechanism question: when the information is sitting right there in the context window, why does a model still make things up — and what the 'retrieval head' failure tells us about it.
The corpus has a surprisingly concrete answer to the first half. A small set of attention heads, under 5% in every model family studied, does the actual work of pulling a fact out of long context and carrying it to the output. These are the 'retrieval heads,' and they're causally necessary: prune them and the model hallucinates even though the answer is verbatim present in its input (What mechanism enables models to retrieve from long context?). So 'the information is in context' and 'the model can use the information' are two different facts. Retrieval is a sparse, fragile mechanism, not a guarantee that comes free with a long context window.
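To make 'a head that pulls a fact out of context' concrete, here is a minimal sketch of one way such heads could be scored. It assumes a copy-style criterion: a head gets credit for a needle token when, at the step that token is generated, the head's strongest attention lands on that same token inside the needle. The scoring rule, the 0.5 threshold, the head names, and the toy attention traces are all illustrative assumptions, not details given in the notes.

```python
from typing import Dict, List, Sequence


def retrieval_score(
    attn_argmax: Sequence[int],
    context_tokens: Sequence[str],
    generated_tokens: Sequence[str],
    needle_span: range,
) -> float:
    """Fraction of needle positions this head copied during generation.

    attn_argmax[t] is the context position the head attended to most
    strongly while the model produced generated_tokens[t] (assumed input).
    """
    needle_positions = set(needle_span)
    if not needle_positions:
        return 0.0
    copied = set()
    for step, token in enumerate(generated_tokens):
        pos = attn_argmax[step]
        # Credit the head only if it looked inside the needle AND the
        # token it looked at is the token that was actually generated.
        if pos in needle_positions and context_tokens[pos] == token:
            copied.add(pos)
    return len(copied) / len(needle_positions)


def find_retrieval_heads(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Heads whose score clears a (hypothetical) threshold."""
    return sorted(name for name, score in scores.items() if score >= threshold)


# Toy data: the needle is the three digits at context positions 3..5.
context = ["The", "code", "is", "7", "4", "1", ".", "Bye"]
needle = range(3, 6)
generated = ["7", "4", "1"]

# Hypothetical per-head attention traces, one argmax position per step.
head_attention = {
    "L12.H3": [3, 4, 5],  # tracks the needle token by token
    "L2.H7": [0, 0, 7],   # never looks at the needle
    "L9.H1": [3, 0, 0],   # copies only the first needle token
}

scores = {
    name: retrieval_score(trace, context, generated, needle)
    for name, trace in head_attention.items()
}
retrieval_heads = find_retrieval_heads(scores)
```

In this toy setup only one of the three heads clears the threshold. In a real model, the pruning experiment described above would mean masking the selected heads and checking whether the answer still comes out.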
But head failure is only one route to the same symptom, and the more interesting lesson is that having information in context is routinely not enough. One line of work shows context integration fails when the model's training-time associations are simply stronger than what the prompt says: parametric memory overrides the in-context evidence, and no amount of textual instruction ('use only the provided text') reliably fixes it; you have to intervene in the representations themselves (Why do language models ignore information in their context?). A related failure: models accommodate false presuppositions baked into a question even when a direct factual query proves they know better, so knowledge being present doesn't mean it gets triggered (Why do language models accept false assumptions they know are wrong?). The unifying point is that 'knowing' and 'retrieving-and-using' are separate competences, and the gap between them is where hallucination lives.
There's a deeper reframe worth sitting with. Several notes argue the word 'hallucination' itself misdirects us, because accurate and inaccurate outputs run through the identical statistical machinery: the model isn't perceiving wrong, it's generating plausible tokens, and 'fabrication' is the more honest term (Should we call LLM errors hallucinations or fabrications?; Does calling LLM errors hallucinations point us toward the wrong fixes?). Under that view, a retrieval-head failure isn't a malfunction in a normally-grounded system; it's the absence of the one mechanism that was temporarily making the statistical process look grounded. Pull the prop and you see what was always underneath.
That said, models do carry some internal signal about their own knowledge. Sparse autoencoders reveal entity-recognition circuits that track whether the model actually knows facts about an entity, and these causally steer it toward either answering or refusing (Do models know what they don't know?). So the self-knowledge exists; the failure is often that it doesn't get wired to the retrieval step, so the model confidently fabricates instead of flagging uncertainty. And if you're chasing a permanent fix: one set of formal theorems argues hallucination is mathematically inevitable for any computable model, meaning internal self-correction can never fully close the gap and external safeguards are mandatory rather than optional (Can any computable LLM truly avoid hallucinating?).
Which points at what the corpus offers as the practical exits. Rather than trusting the model's confidence (which stays high even when wrong), one approach watches the *training data*, flagging entity combinations the model likely never saw and triggering retrieval on that signal instead (Can pretraining data statistics detect hallucinations better than model confidence?). And rather than reasoning in a closed loop where errors compound, interleaving each reasoning step with a real external lookup injects ground truth before the fabrication can snowball (Can interleaving reasoning with real-world feedback prevent hallucination?). Both treat grounding as something you have to engineer around the model, not something the context window provides on its own.
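The data-side trigger can be sketched in a few lines. This is a rough illustration under stated assumptions, not the method from the cited note: it assumes entities have already been extracted per training document, that co-occurrence is counted per unordered pair, and that a pair seen fewer than `min_count` times should trigger retrieval. The function names, the threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def build_cooccurrence(docs: Iterable[List[str]]) -> Counter:
    """Count how many documents mention each unordered entity pair."""
    counts: Counter = Counter()
    for entities in docs:
        # Sort so that (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def should_retrieve(question_entities: List[str], counts: Counter, min_count: int = 1) -> bool:
    """True if any entity pair in the question is rare or unseen in the corpus.

    The decision ignores model confidence entirely; it only asks whether
    the training data plausibly covered this combination of entities.
    With fewer than two entities there is no pair to check, so this
    sketch returns False.
    """
    pairs = combinations(sorted(set(question_entities)), 2)
    return any(counts[pair] < min_count for pair in pairs)


# Toy stand-in for entity lists extracted from pretraining documents.
training_docs = [
    ["Marie Curie", "Sorbonne"],
    ["Marie Curie", "Nobel Prize"],
    ["Sorbonne", "Paris"],
]
cooccurrence = build_cooccurrence(training_docs)

seen_combo = should_retrieve(["Marie Curie", "Sorbonne"], cooccurrence)
unseen_combo = should_retrieve(["Marie Curie", "Stanford"], cooccurrence)
```

Here the first question is answered from the model alone, while the second, whose entity pair never appears in the toy corpus, would be routed to an external lookup before generation.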
Sources (9 notes)
Fewer than 5% of attention heads in the model families studied function as retrieval heads. These heads are already present in short-context models, activate dynamically depending on the context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.