SYNTHESIS NOTE

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This note explores whether the same process that generates an answer can objectively assess that answer's correctness.

Synthesis note · 2026-02-22 · sourced from Reasoning Methods CoT ToT

Self-detection — the use of a model's own capabilities to evaluate the trustworthiness of its outputs — is a widely used approach to hallucination mitigation and output quality assessment. The "Think Twice Before Trusting" paper identifies a fundamental structural problem with it: LLMs have an inherent bias toward trusting their own generated answers.

Two paradigms of self-detection both fail in the same direction:

The mechanism is structural, not random. The same model, shaped by the same training process, both produced the incorrect answer and is asked to judge whether that answer is correct. A distributional bias toward self-agreement is baked in: responses the model generated are, by definition, high-probability outputs under its own distribution, and high-probability outputs look more "correct" to the same model when it evaluates them. This is the behaviour described in "Why do language models avoid correcting false user claims?", applied at the output-evaluation level: the model accommodates its own prior outputs rather than critically assessing them.
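The self-agreement argument above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the answer distribution, its numbers, and the idea that self-evaluation simply reads back the generator's own probability. It is not how any particular system or the cited paper implements self-detection; it only shows why an evaluator that shares the generator's distribution cannot flag a systematic error.

```python
from typing import Dict


def generate(dist: Dict[str, float]) -> str:
    """Greedy decoding: return the answer with the highest model probability."""
    return max(dist, key=dist.get)


def self_evaluate(dist: Dict[str, float], answer: str) -> float:
    """Toy self-evaluation: confidence is the generator's own probability."""
    return dist[answer]


# Hypothetical distribution containing a systematic error: the true answer is
# "Canberra", but the model puts most of its mass on "Sydney".
toy_dist = {"Sydney": 0.7, "Canberra": 0.2, "Melbourne": 0.1}
truth = "Canberra"

answer = generate(toy_dist)
confidence = self_evaluate(toy_dist, answer)

# The generated answer always receives the highest confidence available,
# whether or not it matches the truth.
print(answer, confidence, answer == truth)  # Sydney 0.7 False
```

Because `generate` and `self_evaluate` read the same distribution, the generated answer is scored at the maximum confidence regardless of ground truth.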

The proposed fix — evaluating trustworthiness by comparing the generated answer against a broader answer space — breaks the self-agreement loop. When the model must justify multiple candidate answers (not just its own), the strong justifications available for correct alternatives counterbalance the bias toward the generated answer.
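A minimal sketch of scoring the generated answer against a broader answer space, under stated assumptions. The `support` callable is hypothetical: it stands in for whatever procedure asks the model to justify a candidate answer and returns a scalar strength for that justification. The softmax normalisation is an illustrative choice and is not taken from the paper.

```python
import math
from typing import Callable, Dict, Sequence


def answer_space_trust(
    question: str,
    generated: str,
    candidates: Sequence[str],
    support: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return normalised support for each answer in the pool.

    `support(question, answer)` is a hypothetical scorer for how strong a
    justification the model can produce for `answer`.
    """
    # Pool the generated answer with the alternatives, dropping duplicates
    # while keeping order.
    pool = list(dict.fromkeys([generated, *candidates]))
    raw = [support(question, a) for a in pool]

    # Softmax over justification strengths, so the generated answer must
    # compete with the alternatives instead of being judged in isolation.
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return {a: e / total for a, e in zip(pool, exps)}


# Usage with a stub scorer (the scores are made up for illustration only).
stub_scores = {"Sydney": 1.0, "Canberra": 3.0, "Melbourne": 0.0}
trust = answer_space_trust(
    "What is the capital of Australia?",
    generated="Sydney",
    candidates=["Canberra", "Melbourne", "Sydney"],
    support=lambda q, a: stub_scores[a],
)
```

In this stub the generated answer "Sydney" ends up with less normalised support than "Canberra", which is the counterbalancing effect the paragraph above describes: a strong justification for an alternative lowers trust in the model's own output.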

This connects to "Does revising your own reasoning actually help or hurt?": both findings identify the same asymmetry. An external perspective breaks the self-referential loop; an internal perspective perpetuates it. The difference is that self-detection failure concerns the act of evaluation, while revision-source failure concerns the act of correction.

For deployment: systems that use LLM self-evaluation as a reliability signal (e.g., uncertainty estimation, output filtering) are implicitly assuming models can detect their own errors. This assumption is false when errors are systematic. The signal is reliable only for idiosyncratic errors the model would not generate with high confidence — the cases where self-detection is needed least.


Original note title

llm self-detection fails because models have inherent bias toward trusting their own generated answers