How does self-revision in reasoning chains amplify confidence in wrong answers?
This explores why a model rethinking its own answer tends to dig in on mistakes rather than catch them — and what the corpus says actually breaks that loop.
This explores the mechanism behind a counterintuitive finding: when a reasoning model revisits its own chain of thought, it often becomes *more* sure of a wrong answer instead of correcting it. The corpus traces this to a structural bias rather than a quirk of any one model. Models systematically over-trust outputs they generated themselves, because a high-probability answer simply *feels* more correct when the same model evaluates it Why do models trust their own generated answers?. So a revision pass isn't a fresh look — it's the same biased judge re-reading its own work, and the loop tends to ratify the first answer rather than challenge it.
The sharpest framing is that the revision *source*, not the revision *act*, decides the outcome. When an external critic guides the rewrite, accuracy improves; when a model revises its own uncertain output, it usually amplifies confidence in the error Does revising your own reasoning actually help or hurt?. This shows up empirically across o1-like models — analysis of QwQ, R1, and LIMO finds most revisions keep the wrong answer, smaller models often flip *correct* answers to incorrect, and longer chains with more revision steps correlate with *lower* accuracy Does self-revision actually improve reasoning in language models?. Reflection, in other words, is largely theater: across eight reasoning models, reflective passages rarely change the answer and mostly serve as post-hoc confirmation of what the model already said Is reflection in reasoning models actually fixing mistakes?.
The corpus even names this as a distinct failure mode — "degeneration of thought" — where a single model reconsidering its own prior reasoning entrenches errors. The fix isn't more self-reflection but genuine disagreement: multi-agent debate between *different* models reverses the pattern, improving both accuracy and calibration Does a model improve by arguing with itself?. The common thread with self-detection research is the same: the cure is comparison against real alternatives that break the self-agreement loop Why do models trust their own generated answers?.
There's a deeper lesson lurking here that you might not have gone looking for: fluent reflection is not the same as competence. Frontier reasoning models that *sound* like they're backtracking and re-checking still hit only 20-23% on constraint-satisfaction problems that require genuine backtracking Can reasoning models actually sustain long-chain reflection?. And longer isn't better — chain-of-thought accuracy follows an inverted-U, so piling on more revision steps past the peak actively hurts Why does chain of thought accuracy eventually decline with length?. The revision machinery produces the *texture* of careful thought without its substance.
Where it gets interesting is that confidence — the very thing self-revision inflates — can be turned into a useful signal when measured well rather than trusted blindly. Using a model's answer-span confidence as a *reward* during training can actually restore calibration that RLHF degraded Can model confidence work as a reward signal for reasoning?, and confidence variance can be read as a diagnostic to detect over- and under-thinking and steer the model back toward balance Can confidence patterns reveal overthinking versus underthinking?. The distinction worth carrying away: confidence consulted in a self-reinforcing revision loop misleads, but confidence treated as an external measurement to calibrate against can help.
Sources 9 notes
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.