Why do reasoning models struggle with self-evaluation and revision?

This explores why reasoning models are bad at checking and fixing their own work — and the corpus points to a single culprit: a model judging itself isn't a neutral judge.

This explores why reasoning models struggle to evaluate and revise themselves, and the corpus converges on a striking answer: the problem isn't that revision is hard, it's that a model grading its own homework is structurally biased toward agreeing with itself. Models systematically over-trust the answers they generated, because a high-probability output simply *feels* more correct during evaluation — so self-assessment becomes a self-agreement loop rather than a real check Why do models trust their own generated answers?. When a model then revises based on its own uncertain reasoning, it doesn't correct errors; it amplifies confidence in them Does a model improve by arguing with itself?.

The sharpest finding is that the *source* of the critique, not the act of revising, decides the outcome. Revision guided by an external critic improves accuracy; revision guided by the model's own internal judgment degrades it Does revising your own reasoning actually help or hurt?. This shows up directly in o1-style models: across QwQ, R1, and LIMO, most revisions keep the wrong answer, smaller models flip correct answers to incorrect, and longer chains with more revision steps correlate with *lower* accuracy Does self-revision actually improve reasoning in language models?. The escape hatch is diversity — multi-agent debate with genuinely different models breaks the self-agreement loop and improves both accuracy and calibration, precisely because the critic is no longer the same mind that produced the answer Does a model improve by arguing with itself?.

This reframes a lot of what looks like "reflection." Studies across eight models find that reasoning traces are mostly confirmatory theater: reflections rarely change the initial answer and largely serve as post-hoc justification Can we actually trust reasoning model outputs? Is reflection in reasoning models actually fixing mistakes?. Training on longer reflection chains improves the *first answer's* quality — but not the model's ability to self-correct. So the visible "thinking out loud" you see isn't the model catching itself; it's the model rationalizing where it already landed. Worse, calibration degrades under binary reward training and the monitoring signals are easily gamed.

There's a second layer worth knowing: even when self-evaluation works, the underlying reasoning may be too disorganized to revise toward anything better. Reasoning models "wander" — they explore invalid paths and abandon promising ones prematurely (underthinking) — so success probability collapses exponentially as problems get deeper Why do reasoning models abandon promising solution paths? Why do reasoning LLMs fail at deeper problem solving?. Frontier models hit just 20-23% on constraint-satisfaction problems that demand genuine backtracking, revealing that fluent-sounding reflection doesn't translate into the actual competence revision requires Can reasoning models actually sustain long-chain reflection?. And some of these breakdowns aren't even reasoning failures — they're execution failures, where a model knows the algorithm but can't carry out the multi-step procedure in text alone Are reasoning model collapses really failures of reasoning?.

The thing you didn't know you wanted to know: the cure for bad self-evaluation isn't "make the model reflect harder." More reflection often makes it more confidently wrong. The cure is *otherness* — an external critic, a differently-trained debate partner, or comparing the answer against a broad set of alternatives instead of against itself Why do models trust their own generated answers? Does revising your own reasoning actually help or hurt?. Self-correction, it turns out, may be the one thing a single model can't reliably do alone.

Sources 10 notes

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Why do reasoning models struggle with self-evaluation and revision?

Sources 10 notes

Next inquiring lines