Is hallucination mechanistically identical to generalization across datasets?
This explores whether the internal process that lets a model generalize to new inputs is the same process that produces hallucinations — so that a 'wrong' answer and a 'right' one are made the exact same way, and only the outcome differs.
The corpus comes down surprisingly firmly on 'yes,' and the most direct evidence is the argument that LLMs don't have two modes at all. Two notes make the case that accurate and inaccurate outputs are produced by *identical* statistical machinery: the model is always predicting likely tokens, never perceiving or recalling. That is why the authors want to retire the word 'hallucination' in favor of 'fabrication' (Should we call LLM errors hallucinations or fabrications?; Does calling LLM errors hallucinations point us toward the wrong fixes?). The payoff of that reframing is practical: if there's no separate 'hallucination process' to detect or suppress, then fixes aimed at the model's 'perception' are aimed at a layer that doesn't exist, and you have to move to external verification instead.
If the mechanism is shared, where does the failure actually come from? A second strand locates it in the data, not the model's confidence. One note shows that a better signal of fabrication risk than the model's own uncertainty is whether the combination of entities is statistically *unseen* in pretraining, which is exactly the regime where the model is forced to generalize beyond what it observed (Can pretraining data statistics detect hallucinations better than model confidence?). That dovetails with work showing models build dense internal representations for familiar data and fall back to sparse ones for unfamiliar inputs (Is representational sparsity learned or intrinsic to neural networks?). Read together, these say the same predictive machinery that produces a confident correct answer on familiar territory produces a confident wrong one off it: generalization and hallucination are two readings of the same extrapolation.
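The data-side check described above can be sketched in a few lines. This is a minimal illustration, not the QuCo-RAG implementation: the function names `build_cooccurrence` and `needs_retrieval`, the `min_count` threshold, and the toy corpus are all assumptions made for the example, and a real system would count over pretraining-scale data with proper entity extraction.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs):
    """Count how often each unordered entity pair shares a document."""
    counts = Counter()
    for entities in docs:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def needs_retrieval(query_entities, counts, min_count=1):
    """Flag a query when any of its entity pairs was seen fewer than min_count times."""
    pairs = combinations(sorted(set(query_entities)), 2)
    return any(counts[pair] < min_count for pair in pairs)


# Toy corpus (hypothetical): each document is reduced to the entities it mentions.
corpus = [
    ["Marie Curie", "Nobel Prize", "Paris"],
    ["Marie Curie", "radium"],
    ["Alan Turing", "Bletchley Park"],
]
counts = build_cooccurrence(corpus)

# A pair the corpus has seen together versus a pair it has never seen together.
seen_pair_flagged = needs_retrieval(["Marie Curie", "Nobel Prize"], counts)
unseen_pair_flagged = needs_retrieval(["Alan Turing", "Nobel Prize"], counts)
print(seen_pair_flagged, unseen_pair_flagged)
```

Note that the decision never consults the model's confidence. Both entities in the flagged query are individually familiar; only their combination is new, which is the case a confidence threshold would tend to miss.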
The theory side closes the loop: hallucination is formally *inevitable* for any computable LLM, on infinitely many inputs, and no internal trick removes it (Can any computable LLM truly avoid hallucinating?). That's what you'd expect if hallucination were a structural consequence of generalizing from finite data rather than a separable defect: you can't delete it without deleting generalization itself.
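The flavor of such an inevitability result can be conveyed with a diagonalization sketch. The notation here ($s_i$, $h_i$, $f$) is assumed for illustration and is not taken from the note; it is a schematic of the general style of argument, not a restatement of the three theorems.

```latex
% Assumed notation: s_1, s_2, \dots enumerates input strings,
% h_1, h_2, \dots enumerates computable LLMs, f is a ground-truth function.
\[
  h \text{ hallucinates on } s \iff h(s) \neq f(s).
\]
% Diagonal construction: pick a ground truth that disagrees with h_i on s_i.
\[
  f(s_i) \neq h_i(s_i) \ \text{ for every } i \in \mathbb{N}
  \;\Longrightarrow\;
  \text{every } h_i \text{ hallucinates on at least } s_i .
\]
```

This simple version only yields one failing input per model; getting to infinitely many inputs takes further steps that the sketch omits.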
Where the corpus complicates the clean 'identical' answer is in the cases where the mechanism diverges. RLHF can drive a model toward stating things it internally represents as false. Here belief probes show the model still 'knows' the truth, so this failure is about indifference to truth, not a generalization error (Does RLHF make language models indifferent to truth?). And the fractured-representations work warns that two networks with identical outputs can have radically different internals (Can identical outputs hide broken internal representations?), so 'same output behavior' doesn't guarantee 'same underlying structure.' The honest synthesis: the *base* generative acts of fabrication and of correct generalization look mechanistically identical, but training pressures (like RLHF) and internal representational quality add distinct failure modes layered on top.
The thing you didn't know you wanted to know: the most effective interventions in this corpus don't try to fix the mechanism at all; they sidestep it. ReAct interleaves reasoning with live external lookups so errors get caught by reality before they compound (Can interleaving reasoning with real-world feedback prevent hallucination?). That's a tell that hallucination isn't being treated as a separable bug: if it were, you'd patch the model. Instead, the winning move is to assume the model will always extrapolate and to build verification around it.
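The interleaving pattern can be sketched as a loop that alternates thought, action, and observation. This is a toy under stated assumptions, not the ReAct authors' code: `KNOWLEDGE` stands in for an external source such as a Wikipedia API, and `scripted_policy` is a hypothetical hard-coded replacement for the language model so the example runs on its own.

```python
# Hypothetical stand-in for an external source such as a Wikipedia API.
KNOWLEDGE = {
    "author of Dune": "Frank Herbert",
    "birthplace of Frank Herbert": "Tacoma, Washington",
}


def lookup(query):
    """External tool: returns an observation, or a miss the agent can react to."""
    return KNOWLEDGE.get(query, "no result")


def scripted_policy(question, trace):
    """Toy replacement for the LLM: chooses the next (thought, action, argument)."""
    observations = [step[3] for step in trace]
    if not observations:
        return ("I need the author first.", "lookup", "author of Dune")
    if len(observations) == 1:
        author = observations[0]
        return (f"The author is {author}; now the birthplace.",
                "lookup", f"birthplace of {author}")
    return ("I have what I need.", "finish", observations[1])


def react(question, policy, max_steps=5):
    """Alternate thought -> action -> observation until the policy finishes."""
    trace = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        if action == "finish":
            return arg, trace
        observation = lookup(arg)  # external check before the next thought
        trace.append((thought, action, arg, observation))
    return None, trace


answer, trace = react("Where was the author of Dune born?", scripted_policy)
print(answer)
```

The design point is that the second lookup is built from the first observation rather than from the model's own guess, so a wrong intermediate belief is replaced by the external result before it can propagate.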
Sources (8 notes)
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) reduces error propagation by injecting real-world feedback at each step. On knowledge-intensive tasks it curbs the hallucination and error propagation seen in pure chain-of-thought; on interactive tasks (ALFWorld and WebShop) it outperforms imitation and reinforcement learning baselines by 34% and 10% absolute success rate, respectively.