Is confabulation inevitable in large language models regardless of training?
This note explores whether confabulation (confident, fluent fabrication) is a fixable training problem or a permanent feature of how LLMs generate text, and what the corpus says we can do about it either way.
This reads the question as: can better training ever eliminate confabulation, or is it baked into the machinery? The strongest answer in the collection is uncomfortable: confabulation isn't a bug you train out. One result proves, in three formal theorems, that any computable LLM must hallucinate on infinitely many inputs and that internal fixes like self-correction cannot remove this; it's a mathematical constraint, not an engineering shortfall (see "Can any computable LLM truly avoid hallucinating?"). The conclusion the authors draw is the interesting part: if you can't eliminate it from the inside, external safeguards become necessary rather than optional.
Why is it structural? Other notes point to the same root cause from different angles. A model doesn't commit to a single answer or persona. It holds a superposition of plausible continuations and samples one at generation time, so regenerating from the same prompt yields different outputs, each internally consistent (see "Do large language models actually commit to a single character?"). Confabulation is what that sampling looks like when no continuation is actually grounded. It also shows up as a tug-of-war: when a model's training-time associations are strong, they override the information sitting right in the context, and prompting alone can't fix it; you'd need to intervene in the representations themselves (see "Why do language models ignore information in their context?"). And the fluency that makes confabulation convincing is only surface-deep: top-tier models such as Llama3-70b consistently misparse embedded clauses and other complex grammar, capturing surface patterns rather than deep rules (see "Why do large language models fail at complex linguistic tasks?").
There's a deeper clue about *when* confabulation strikes. Reasoning failures are triggered not by task complexity but by instance-level unfamiliarity: models fit patterns from training instances rather than general algorithms, so they fabricate confidently exactly where they've seen nothing similar (see "Do language models fail at reasoning due to complexity or novelty?"). Interestingly, the model's internals sometimes register that it's in unfamiliar territory: hidden states sparsify systematically under out-of-distribution shift, a signal that correlates with unfamiliarity (see "Do language models sparsify their activations under difficult tasks?"). The fabrication isn't blind. The uncertainty is there in the internals, just not surfaced in the words.
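To make the sparsification signal concrete, here is a minimal sketch of one way to quantify how sparse a hidden-state vector is. The metric (fraction of activations with magnitude below a small threshold), the function name `near_zero_fraction`, the `eps` default, and the example vectors are all assumptions made for illustration; the note does not say which sparsity measure the underlying work uses, and the vectors are invented, not measurements from any model.

```python
from typing import Sequence


def near_zero_fraction(hidden_state: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of activations whose magnitude is below eps.

    Hypothetical sparsity proxy: 0.0 means fully dense, 1.0 means every
    activation is (near) zero.
    """
    if not hidden_state:
        raise ValueError("hidden_state must not be empty")
    near_zero = sum(1 for x in hidden_state if abs(x) < eps)
    return near_zero / len(hidden_state)


# Made-up vectors for illustration only (not real model activations).
dense_example = [0.8, -0.5, 0.3, 0.9, -0.2, 0.4]
sparse_example = [0.0, 0.0, 1.7, 0.0, 0.0, -0.0004]

dense_score = near_zero_fraction(dense_example)
sparse_score = near_zero_fraction(sparse_example)
```

In practice one would compare such a score on in-distribution prompts against the score on the prompt in question; a marked increase would be the kind of internal "unfamiliar territory" signal described above, under the assumption that this proxy tracks the effect at all.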
That's where the collection turns from diagnosis to handling. Since the model's uncertainty exists but isn't visible in any single output, you can measure it across samples: semantic entropy draws many answers to the same prompt, clusters them by meaning, and computes uncertainty over the clusters, catching confabulations invisible at the token level without task-specific training (see "Can we detect when language models confabulate?"). This is the practical reconciliation. If confabulation is formally inevitable, the win isn't a confabulation-free model; it's a model whose confabulations you can *detect* and gate.
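The detect-and-gate idea can be sketched in a few lines. This is an illustrative sketch, not the published implementation: it groups sampled answers by bidirectional entailment (as the source summary describes) and takes the entropy of the cluster frequencies. The names `cluster_by_meaning`, `semantic_entropy`, `should_gate`, and the `threshold` parameter are hypothetical, and `toy_entails` is a crude word-overlap stand-in for what would normally be a natural language inference model.

```python
import math
from typing import Callable, List

Entails = Callable[[str, str], bool]

# Words ignored by the toy entailment check (an arbitrary, tiny list).
_STOP_WORDS = {"it", "is", "the", "a"}


def _content_words(text: str) -> set:
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w and w not in _STOP_WORDS}


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model: premise 'entails' hypothesis when
    every content word of the hypothesis appears in the premise."""
    return _content_words(hypothesis) <= _content_words(premise)


def cluster_by_meaning(answers: List[str], entails: Entails) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose
    representative it entails and is entailed by; otherwise it starts a new one."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: List[str], entails: Entails) -> float:
    """Entropy (in nats) over meaning clusters, using cluster frequencies."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    clusters = cluster_by_meaning(answers, entails)
    total = len(answers)
    return sum(
        (len(c) / total) * math.log(total / len(c)) for c in clusters
    )


def should_gate(answers: List[str], entails: Entails, threshold: float) -> bool:
    """Flag the response for an external check when the samples disagree in meaning."""
    return semantic_entropy(answers, entails) > threshold


# Samples that agree in meaning versus samples that scatter.
consistent = ["Paris", "It is Paris.", "paris"]
scattered = ["Paris", "Lyon", "Marseille"]
```

With these inputs, the `consistent` samples collapse into a single cluster and score zero entropy, while the `scattered` samples form three clusters and score `log(3)`. Where to set `threshold` is a deployment decision that this sketch leaves open; the point is only that the gate sits outside the model, which matches the note's case for external safeguards.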
So the honest answer is: yes, regardless of training. But that's not the end of the story. The corpus reframes the goal from elimination to detection and external containment. One thing worth knowing beyond the question asked: the model's own activations often carry a usable 'I'm guessing' signal even as it confidently fabricates, and sampling-based checks like semantic entropy expose that uncertainty without any retraining.
Sources (7 notes)
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.