Why do models trust their own generated answers?
Can language models reliably detect their own errors through self-evaluation? This note explores whether the same process that generates an answer can objectively assess its correctness.
Self-detection — the use of a model's own capabilities to evaluate the trustworthiness of its outputs — is a widely used approach to hallucination mitigation and output quality assessment. The "Think Twice Before Trusting" paper identifies a fundamental structural problem with it: LLMs have an inherent bias toward trusting their own generated answers.
Two paradigms of self-detection both fail in the same direction:
- Confidence calibration: Sampling multiple answers and checking agreement. Fails when errors are consistent — the model generates the same wrong answer repeatedly with high self-agreement.
- Self-evaluation: Directly asking the model whether its answer is correct. Fails because the model is biased toward validating what it generated.
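The first failure mode can be illustrated with a minimal sketch. The function name `agreement_confidence` and the sampled answers are hypothetical, invented for illustration; they are not taken from the paper or from real model output.

```python
from collections import Counter


def agreement_confidence(samples):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


# Hypothetical samples for a question whose true answer is "Canberra".
# Systematic error: the model repeats the same wrong answer.
systematic = ["Sydney"] * 9 + ["Canberra"]
# Idiosyncratic error: the wrong answers scatter, so agreement stays low.
idiosyncratic = ["Sydney", "Melbourne", "Canberra", "Perth", "Canberra"]

print(agreement_confidence(systematic))     # ('Sydney', 0.9)
print(agreement_confidence(idiosyncratic))  # ('Canberra', 0.4)
```

In the systematic case the agreement score is high precisely because the error is consistent, so the score cannot separate a confident wrong answer from a confident right one.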
The failure is structural, not random. The same model, shaped by the same training, both produced the incorrect answer and judges whether that answer is correct. A distributional bias toward self-agreement is baked in: responses the model generated are, by definition, high-probability outputs, and high-probability outputs look more "correct" to the same model when it evaluates them. This is the dynamic described in "Why do language models avoid correcting false user claims?" applied at the output-evaluation level: the model accommodates its own prior outputs rather than critically assessing them.
The proposed fix — evaluating trustworthiness by comparing the generated answer against a broader answer space — breaks the self-agreement loop. When the model must justify multiple candidate answers (not just its own), the strong justifications available for correct alternatives counterbalance the bias toward the generated answer.
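One way to picture the comparison is a rough sketch, not the paper's actual procedure. It assumes each candidate answer has already been given a non-negative justification score; the function `trust_in_generated`, the score values, and the simple share-of-total normalization are all illustrative assumptions.

```python
def trust_in_generated(generated, justification_scores):
    """Share of total justification strength held by the generated answer.

    justification_scores maps each candidate answer to a non-negative score
    for how well it could be justified (hypothetical values in this sketch).
    """
    if generated not in justification_scores:
        raise KeyError("generated answer must be among the candidates")
    total = sum(justification_scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return justification_scores[generated] / total


# Self-evaluation in isolation: only the generated answer is considered.
alone = trust_in_generated("Sydney", {"Sydney": 2.0})
# Broader answer space: a well-justified alternative competes for trust.
compared = trust_in_generated(
    "Sydney", {"Sydney": 2.0, "Canberra": 3.0, "Melbourne": 0.0}
)
print(alone, compared)  # 1.0 0.4
```

With only its own answer in view, the generated answer holds all the available trust. Once alternatives are scored on the same footing, a strongly justified competitor pulls that trust down.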
This connects to "Does revising your own reasoning actually help or hurt?": both findings identify the same asymmetry, in which an external perspective breaks the self-referential loop and an internal perspective perpetuates it. The difference is that the self-detection failure concerns the act of evaluation, while the self-revision failure concerns the act of correction.
For deployment: systems that use LLM self-evaluation as a reliability signal (e.g., uncertainty estimation, output filtering) are implicitly assuming models can detect their own errors. This assumption is false when errors are systematic. The signal is reliable only for idiosyncratic errors the model would not generate with high confidence — the cases where self-detection is needed least.
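The deployment consequence can be sketched as an output filter that flags answers whose self-agreement falls below a threshold. The records, the threshold of 0.7, and the function `flagged_error_rate` are hypothetical and chosen only to illustrate the argument.

```python
def flagged_error_rate(records, threshold):
    """Fraction of wrong answers that a self-agreement filter would flag.

    Each record is (self_agreement, is_correct). An answer is flagged
    when its self-agreement falls below the threshold.
    """
    errors = [agreement for agreement, is_correct in records if not is_correct]
    if not errors:
        return None  # no errors to detect
    flagged = sum(1 for agreement in errors if agreement < threshold)
    return flagged / len(errors)


# Hypothetical records, invented for illustration only.
systematic_errors = [(0.9, False), (0.95, False), (1.0, False), (0.85, False)]
idiosyncratic_errors = [(0.3, False), (0.4, False), (0.75, False), (0.2, False)]

print(flagged_error_rate(systematic_errors, threshold=0.7))     # 0.0
print(flagged_error_rate(idiosyncratic_errors, threshold=0.7))  # 0.75
```

Under these assumed numbers the filter catches most scattered errors and none of the systematic ones, which is the pattern the note describes.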
Related concepts in this collection 4
- Does revising your own reasoning actually help or hurt?
  Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.
  same asymmetry, adjacent mechanism: external perspective helps, internal self-reference degrades
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  face-saving at self-evaluation level: the model validates its own output as a form of face-maintenance
- Does a model improve by arguing with itself?
  When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
  Degeneration-of-Thought is the multi-turn version of self-trust failure; both document increasing confidence in wrong answers through self-reference
- Can models abandon correct beliefs under conversational pressure?
  Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
  both involve belief formation errors in the presence of one information source; external pressure vs. self-generated pressure
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection
- Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
- Can Large Language Models Reason and Plan?
- Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
- Measuring Faithfulness in Chain-of-Thought Reasoning
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
- Deep Research: A Systematic Survey
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
Original note title
llm self-detection fails because models have inherent bias toward trusting their own generated answers