Why do models confabulate inconsistently across different samples?

This explores why a model's made-up answers vary from one run to the next — why the same prompt yields a confident fabrication one time and a different (or correct) answer the next — rather than treating confabulation as a fixed flaw.

The corpus points to a simple root: every answer a model gives is one *draw* from a probability distribution, not a lookup of a stored fact. Pinning temperature to zero (and fixing the seed) does not change that; it only replays the same roll of a loaded die. The output is still a single draw that happens to repeat, which is why deterministic settings produce consistency without producing reliability (see "Does setting temperature to zero actually make LLM outputs reliable?"). When the model actually *knows* something, that distribution is sharply peaked and nearly every sample lands in the same place. When it doesn't, the distribution is diffuse, and each sample wanders somewhere different. Confabulation is what diffuse sampling looks like from the outside.
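The peaked-versus-diffuse contrast can be sketched in a few lines of Python. This is a toy illustration, not a real model: the candidate answers, the logit values, and the helper names `softmax` and `sample_answers` are all assumptions made up for the example.

```python
import math
import random
from collections import Counter

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_answers(answers, logits, n, seed, temperature=1.0):
    """Draw n independent answers from the distribution implied by the logits."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return rng.choices(answers, weights=probs, k=n)

# Hypothetical candidate answers and scores, invented for illustration.
answers = ["Paris", "Lyon", "Marseille", "Nice"]
known_logits = [12.0, 0.0, 0.0, 0.0]    # peaked: the model "knows"
unknown_logits = [0.2, 0.1, 0.0, 0.1]   # diffuse: the model is guessing

known = Counter(sample_answers(answers, known_logits, n=20, seed=0))
unknown = Counter(sample_answers(answers, unknown_logits, n=20, seed=0))

# Fixing the seed replays the same draws: consistent, but no better grounded.
replay_a = sample_answers(answers, unknown_logits, n=5, seed=42)
replay_b = sample_answers(answers, unknown_logits, n=5, seed=42)

print("peaked :", dict(known))
print("diffuse:", dict(unknown))
print("replay identical:", replay_a == replay_b)
```

The peaked distribution keeps returning one answer, the diffuse one scatters across several, and the seeded replay shows that repeating an output says nothing about whether it was a good draw.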

A vivid version of this comes from the 'simulator' framing: an LLM doesn't commit to one character or one belief but holds a superposition of plausible continuations, sampling a fresh one each time you regenerate (see "Does an LLM commit to a single character or maintain many?"). That's why persona prompts buckle: when you run the same persona repeatedly, the spread across runs matches or exceeds the spread across *different* personas, revealing that raw model uncertainty, not any stable knowledge, is steering the output (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). The inconsistency isn't noise on top of a real answer; it *is* the signal that there was no settled answer underneath.
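One way to make the persona comparison concrete is to compare the variance within repeated runs of one persona against the variance between persona averages. The sketch below assumes each run yields a numeric rating. The scores are invented for illustration and are not data from the cited note, and the function names are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical 1-to-5 ratings from five repeated runs per persona.
# These numbers are made up purely to illustrate the comparison.
runs_by_persona = {
    "persona_a": [2, 5, 3, 1, 4],
    "persona_b": [3, 1, 5, 2, 5],
    "persona_c": [4, 2, 1, 5, 2],
}

def within_persona_variance(runs):
    """Average variance across repeated runs of the same persona."""
    return mean([pvariance(scores) for scores in runs.values()])

def between_persona_variance(runs):
    """Variance of the per-persona mean scores."""
    return pvariance([mean(scores) for scores in runs.values()])

within = within_persona_variance(runs_by_persona)
between = between_persona_variance(runs_by_persona)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
```

If the within-persona figure is as large as or larger than the between-persona figure, the persona label is explaining little of the output and run-to-run sampling is doing most of the work.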

This is why inconsistency turns out to be *useful* rather than merely annoying. Semantic entropy detects confabulations by sampling several answers, clustering them by whether they mean the same thing, and measuring how scattered the meanings are. High scatter flags a fabrication, with no task-specific training required (see "Can we detect when language models confabulate?"). The cross-sample variance you're asking about is itself the detector. The deeper cause of *where* the distribution goes diffuse shows up in work on reasoning failure: models break not at some complexity threshold but at instance-level *unfamiliarity*. They pattern-match to training instances rather than running a general algorithm, so an unfamiliar input drops them into the high-uncertainty regime where samples diverge (see "Do language models fail at reasoning due to complexity or novelty?"). Relatedly, the risk concentrates in entity *combinations* unseen in the pretraining data: combinations the model never saw co-occur are where it improvises, and improvisation samples differently each time (see "Can pretraining data statistics detect hallucinations better than model confidence?").
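A minimal sketch of the sample, cluster, and measure steps might look like the following. Two simplifications are assumptions of this example and not part of the source method. First, `same_meaning` is a crude string-normalization stand-in for the bidirectional entailment check, which in practice needs an entailment model. Second, cluster probabilities are estimated from raw sample counts.

```python
import math

def normalize(answer):
    """Lowercase and drop punctuation so trivially different strings match."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def same_meaning(a, b):
    """ASSUMPTION: placeholder for a bidirectional entailment check."""
    return normalize(a) == normalize(b)

def cluster_by_meaning(answers, equivalent=same_meaning):
    """Greedily group answers that are equivalent to a cluster's first member."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters

def semantic_entropy(answers, equivalent=same_meaning):
    """Shannon entropy (in nats) over the share of samples in each cluster."""
    clusters = cluster_by_meaning(answers, equivalent)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Hypothetical sampled answers, invented for illustration.
settled = ["Paris", "paris.", "PARIS", "Paris"]
scattered = ["1912", "1915", "1921", "1908"]

print("settled  :", semantic_entropy(settled))
print("scattered:", semantic_entropy(scattered))
```

Answers that agree in meaning collapse into one cluster and score zero entropy, while answers that all differ score the maximum for that sample size. Where to set a flagging threshold is a tuning decision this sketch does not address.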

Worth knowing: this isn't a bug a better model will sand away. Formal results prove that any computable LLM must hallucinate on infinitely many inputs, and that internal self-correction can't eliminate it. The problem is structural, which is why external safeguards (retrieval triggers, entropy checks) are necessary rather than optional (see "Can any computable LLM truly avoid hallucinating?"). The reframe the corpus offers: stop treating cross-sample inconsistency as a failure to be suppressed, and start treating it as the most honest confidence signal the model gives you. A consistent answer might still be wrong, but a *scattered* one is the model telling you, structurally, that it's guessing.
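As one illustration of an external safeguard, a retrieval trigger could check whether the entities in a question ever co-occurred in the training data and fall back to retrieval when they did not. The sketch below is loosely inspired by the co-occurrence idea in the sources and is not the implementation of any named system. The count table, the `min_count` threshold, and the function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical co-occurrence counts from a pretraining corpus (invented numbers).
cooccurrence_counts = {
    frozenset({"Marie Curie", "Nobel Prize"}): 1200,
    frozenset({"Marie Curie", "Warsaw"}): 340,
}

def unseen_pairs(entities, counts, min_count=1):
    """Return entity pairs seen together fewer than min_count times."""
    pairs = combinations(sorted(set(entities)), 2)
    return [pair for pair in pairs if counts.get(frozenset(pair), 0) < min_count]

def should_retrieve(entities, counts, min_count=1):
    """Trigger retrieval when any entity pair looks unfamiliar."""
    return bool(unseen_pairs(entities, counts, min_count))

familiar = ["Marie Curie", "Nobel Prize"]
unfamiliar = ["Marie Curie", "Lisbon"]

print("familiar  ->", should_retrieve(familiar, cooccurrence_counts))
print("unfamiliar ->", should_retrieve(unfamiliar, cooccurrence_counts))
```

The check runs on the data side, so it can fire even when the model itself sounds confident. Extracting entities from a question and obtaining real corpus counts are both left out of this sketch.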

Sources (7 notes)

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can we detect when language models confabulate?

Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-testing claims about why LLMs confabulate inconsistently across samples. The question remains open: *what structural properties of LLMs produce variable hallucinations, and can training or inference methods materially reduce this variation without collapsing model expressiveness?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable until re-grounded:
• Confabulation inconsistency reflects sampling from a diffuse probability distribution, not noise on a fixed answer; temperature=0 produces consistency without reliability (2024).
• Persona prompts show cross-run instability exceeding cross-persona spread, indicating raw model uncertainty rather than stable knowledge (2024).
• Semantic entropy detects confabulations by clustering multi-sample meanings; high scatter flags fabrication without task-specific training (2024).
• Reasoning breakdown is driven by instance-level unfamiliarity (pattern-matching failure), not task complexity; rare entity combinations in pretraining trigger improvisation and variable sampling (2026).
• Hallucination is formally inevitable for any computable LLM; internal self-correction cannot eliminate it (2024).

Anchor papers (verify; mind their dates):
• arXiv:2401.11817 (2024) — formal proof of inevitable hallucination.
• arXiv:2511.00222 (2025) — multi-turn RL applied to persona consistency.
• arXiv:2602.06176 (2026) — reasoning failures via unfamiliarity.
• arXiv:2510.27062 (2025) — consistency training reducing variation.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the diffuse-distribution model, check whether post-training methods (consistency training, DPO variants, synthetic data on rare combinations) have since narrowed the distribution or merely shifted its peak. Separately: has retrieval-augmentation or in-context calibration moved the needle on *structural* versus *remediable* inconsistency? State plainly what still holds and what's been relaxed.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Specifically: does the 2025–2026 literature on RL post-training (Echo Chamber, Consistency Training, Multi-Turn RL papers) materially reduce cross-sample variance, or does it merely enforce surface consistency while leaving underlying uncertainty intact?
(3) Propose 2 research questions that assume the regime has moved: (a) *Can fine-grained uncertainty quantification over pretraining data statistics (what combinations were rare) be used to predict confabulation variance before inference?* (b) *Does consistency training that explicitly preserves semantic diversity (e.g., RL rewards for stable meaning + varied surface form) outperform naive consistency on both reliability and utility?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
