Why does a demand for analytical depth trigger fabrication rather than transparent uncertainty?
This explores why pushing an LLM to be more analytical or go deeper tends to produce confident invented content rather than an honest 'I don't know' — and the corpus suggests the cause is mechanical, not a tuning failure.
This reads the question as: when you demand more analytical depth, why does the model fabricate instead of flagging uncertainty? The sharpest answer in the corpus is that the model has no separate machinery for the two. "Should we call LLM errors hallucinations or fabrications?" argues that accurate and inaccurate outputs are produced by the *identical* statistical token process: text is assembled from learned token relationships, not grounded in any shared truth. There is no internal flag that distinguishes 'I'm extrapolating' from 'I know this.' So a demand for depth doesn't open a door to honest hedging; it just runs the same generative process further, and more text means more opportunity to invent. The paper's deeper point is that calling this 'hallucination' misdirects the fix toward perception or memory (the wrong layers), when the issue is that fabrication is the baseline mode, with correctness as a happy coincidence.
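The claim that more text means more opportunity to invent can be made concrete with a back-of-envelope calculation. This is an illustrative sketch, not something the cited note derives: it assumes, hypothetically, that a response contains $n$ separate claims and that each one is ungrounded independently with the same probability $p$. Real claims are neither independent nor equally risky, so treat the numbers as directional only.

```latex
P(\text{at least one fabricated claim}) = 1 - (1 - p)^{n}

% Illustrative values with an assumed p = 0.02:
%   n = 10:  1 - 0.98^{10} \approx 0.18
%   n = 50:  1 - 0.98^{50} \approx 0.64
```

Under these assumptions, a fivefold longer answer moves the chance of at least one invented claim from roughly one in five to roughly two in three, without any change in the per-claim error rate.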
This compounds with how outputs actually get sampled. "Does setting temperature to zero actually make LLM outputs reliable?" shows that even at temperature zero, every answer is still a single draw from a probability distribution: consistency is not reliability. A confident, fluent analysis is one plausible draw, not a calibrated verdict. When you ask for depth, you're asking the model to commit to a longer, more specific draw, and specificity reads as confidence whether or not it's earned. Transparent uncertainty would require the model to represent the *spread* of that distribution, but a single forward generation surfaces only the path it sampled.
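The gap between consistency and reliability can be shown with a toy model. The sketch below is only an illustration: `ANSWER_PROBS` is a made-up distribution over three candidate answers, and `greedy_answer`, `sample_answers`, and `normalized_entropy` are hypothetical helper names, not part of any library or of the cited note. Greedy (temperature-zero) selection returns the same answer every time, while repeated sampling exposes how evenly the probability mass is actually spread.

```python
import math
import random
from collections import Counter

# Hypothetical distribution over candidate answers for one prompt.
# The numbers are invented purely for illustration.
ANSWER_PROBS = {"A": 0.40, "B": 0.35, "C": 0.25}


def greedy_answer(probs):
    """Temperature-zero style selection: always pick the most likely answer."""
    return max(probs, key=probs.get)


def sample_answers(probs, n, seed=0):
    """Draw n answers from the distribution to expose its spread."""
    rng = random.Random(seed)
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=n)


def normalized_entropy(counts):
    """Shannon entropy of observed answer frequencies, scaled to [0, 1].

    0 means every draw agreed; 1 means the observed answers were equally common.
    """
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


if __name__ == "__main__":
    repeated_greedy = Counter(greedy_answer(ANSWER_PROBS) for _ in range(100))
    sampled = Counter(sample_answers(ANSWER_PROBS, 1000))
    print("greedy runs:", dict(repeated_greedy))
    print("greedy disagreement:", normalized_entropy(repeated_greedy))
    print("sampled runs:", dict(sampled))
    print("sampled disagreement:", normalized_entropy(sampled))
```

The greedy runs agree perfectly even though the chosen answer holds only 40% of the assumed probability mass. Measuring disagreement across samples is one way to surface the spread that a single generation hides.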
The demand itself also shapes the output more than users expect. "How much does the user shape what a model generates?" frames generation as divergence minimization against the user's priors: outputs become co-productions of what the user already expects. Asking for 'deeper analysis' injects an expectation that an analysis exists to be given, and the model steers toward producing one. "Does iterative prompt engineering undermine scientific validity?" sharpens this into a warning: iterative pushing creates self-fulfilling feedback loops, where evaluation criteria quietly bend to match what the model can generate rather than what the task requires. The depth you demanded gets manufactured to satisfy the demand.
What's genuinely useful, and something you might not have known to ask for, is that the corpus also points at how to *catch* this rather than just lament it. "Can we measure how deeply a model actually reasons?" measures real reasoning effort by tracking the proportion of tokens whose predictions get significantly revised across the model's layers, separating genuine deliberation from fluent surface text. "Does step-level confidence outperform global averaging for trace filtering?" shows that step-level confidence catches reasoning breakdowns that a single global confidence score masks: fabrication often hides in a few bad steps inside an otherwise-confident trace, and you can stop early when a step goes wrong. The lesson across both: you don't get transparent uncertainty by asking nicely for it; you get it by instrumenting the generation, because the model won't volunteer the doubt its architecture doesn't represent.
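Both instrumentation ideas can be sketched on toy data. The code below is a simplified stand-in, not the method from either note: `global_confidence`, `first_weak_step`, and `revision_ratio` are hypothetical names, the window size and thresholds are arbitrary, the per-step confidences are invented, and `revision_ratio` counts top-1 prediction changes between consecutive layers as a rough proxy for "significant revision" (the notes do not specify the exact criterion here). It assumes you already have per-step confidence scores and per-layer top-1 predictions from some other tooling.

```python
from statistics import mean


def global_confidence(step_conf):
    """Single trace-level score: the mean of all per-step confidences."""
    return mean(step_conf)


def first_weak_step(step_conf, window=3, threshold=0.6):
    """Index of the step that closes the first low-confidence window.

    Slides a fixed-size window over the trace and returns the index of the
    window's last step as soon as the window mean drops below `threshold`.
    Returns None if every window clears the threshold.
    """
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1  # generation could be stopped here
    return None


def revision_ratio(layer_preds, min_revisions=2):
    """Fraction of tokens whose top-1 prediction changes between consecutive
    layers at least `min_revisions` times (a crude proxy, see lead-in)."""
    if not layer_preds:
        return 0.0

    def revisions(preds):
        return sum(a != b for a, b in zip(preds, preds[1:]))

    deep = sum(revisions(preds) >= min_revisions for preds in layer_preds)
    return deep / len(layer_preds)


# Invented per-step confidences: mostly high, with a short collapse mid-trace.
trace = [0.95, 0.93, 0.96, 0.30, 0.25, 0.35, 0.94, 0.95, 0.96, 0.97]

# Invented per-layer top-1 token ids, one inner list per generated token.
layer_preds = [[7, 7, 7, 7], [3, 5, 5, 9], [1, 2, 1, 2]]

if __name__ == "__main__":
    print("global mean confidence:", global_confidence(trace))
    print("first weak step index:", first_weak_step(trace))
    print("revision ratio:", revision_ratio(layer_preds))
```

On this toy trace the global mean stays comfortably high while the sliding window flags the collapse a couple of steps in, which is the masking effect the step-level note describes. Smaller windows react faster but are noisier, so the window size is a tuning choice rather than a fixed rule.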
Sources (6 notes)
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Foundation Priors research shows prompt engineering as divergence minimization between synthetic output and user priors. The refinement process systematically steers generation toward what users already expect, making outputs co-productions of model and user subjectivity.
Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.