How do self-revisions degrade reasoning accuracy in extended traces?
This explores why letting a reasoning model second-guess and rewrite its own work in long chains often makes its answers worse rather than better — and what's actually going wrong inside those revisions.
The most direct evidence is blunt: across QwQ, R1, and LIMO, most revisions keep a wrong answer wrong, and smaller models frequently flip a correct answer to an incorrect one mid-revision, so longer chains with more revisions correlate with *lower* accuracy (see: Does self-revision actually improve reasoning in language models?). Revision isn't neutral overhead; it's an active source of degradation.
The sharpest clue to *why* comes from separating who is doing the critiquing. When an external model guides the revision, accuracy improves; when a model revises its own uncertain output, it typically amplifies confidence in the wrong answer instead of correcting it. The decisive factor is the revision *source*, not the act of revising (see: Does revising your own reasoning actually help or hurt?). This dovetails with work showing that a model's self-reflection is mostly confirmatory theater: reflections rarely change the initial answer, and calibration degrades under binary-reward training, so the model becomes more sure precisely as it becomes less reliable (see: Can we actually trust reasoning model outputs?). A self-revising model is essentially asking a poorly calibrated judge to overrule itself, and that judge mostly votes to stay the course or wander off.
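One way to read the source-over-act point is as a gate: the model may propose a revision, but an outside scorer decides whether it replaces the current answer. The sketch below is a minimal illustration under that assumption. `propose_revision` and `external_score` are hypothetical callables introduced here for illustration, not functions from any cited work or library.

```python
from typing import Callable


def gated_revision(
    question: str,
    answer: str,
    propose_revision: Callable[[str, str], str],  # hypothetical: the model's own rewrite
    external_score: Callable[[str, str], float],  # hypothetical: an independent judge
) -> str:
    """Keep the current answer unless an external judge prefers the revision."""
    candidate = propose_revision(question, answer)
    # The model's own confidence is deliberately not consulted here:
    # only the external score can promote a revision.
    if external_score(question, candidate) > external_score(question, answer):
        return candidate
    return answer
```

With this shape, a revision that flips a correct answer to an incorrect one is dropped whenever the external judge ranks it lower, which is the failure the self-revision setting cannot catch on its own.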
There's also a structural failure layer underneath. Reasoning models go wrong not from too little compute but from disorganization: 'wandering' into invalid exploration and 'underthinking' by abandoning promising paths too early. Notably, decoding-level penalties that discourage premature thought-switching improve accuracy without any fine-tuning, which suggests the good paths were there and got revised away (see: Why do reasoning models abandon promising solution paths?). The pivots that matter most, planning and backtracking sentences, are sparse and disproportionately influential, so a bad revision at one of these 'thought anchors' can steer the rest of the trace off a cliff (see: Which sentences actually steer a reasoning trace?).
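A decoding-level penalty of this kind can be sketched as a logit adjustment. The version below assumes the penalty is a fixed amount subtracted from the scores of a small set of "switch" marker tokens while the current thought is still young. The token set, the `penalty` and `window` values, and the dict-based logits are all illustrative assumptions, not the cited method's actual configuration.

```python
def apply_switch_penalty(logits, switch_tokens, steps_in_thought,
                         penalty=3.0, window=50):
    """Return a copy of `logits` with switch-token scores lowered.

    logits:           dict mapping token string -> raw score (toy stand-in
                      for a vocabulary-sized logit vector)
    switch_tokens:    tokens assumed to signal a change of approach
    steps_in_thought: tokens generated since the current thought began
    """
    adjusted = dict(logits)
    # Only penalize while the current thought is younger than `window` tokens,
    # so the model can still switch once it has given the path a fair try.
    if steps_in_thought < window:
        for tok in switch_tokens:
            if tok in adjusted:
                adjusted[tok] -= penalty
    return adjusted


def greedy_pick(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy usage: early in a thought the switch marker is suppressed,
# later it is allowed through unchanged.
toy_logits = {"so": 1.0, "Alternatively": 2.0}
early = greedy_pick(apply_switch_penalty(toy_logits, {"Alternatively"}, 10))
late = greedy_pick(apply_switch_penalty(toy_logits, {"Alternatively"}, 80))
```

Because the adjustment happens only at sampling time, nothing about the model's weights changes, which matches the "without any fine-tuning" framing above.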
This reframes the whole 'longer is smarter' intuition. Accuracy as a function of chain length follows an inverted U: it peaks at an intermediate length and declines past it, and more capable models prefer *shorter* chains, with RL training naturally drifting toward brevity as models improve (see: Why does chain of thought accuracy eventually decline with length?). Extended revision pushes traces past the peak and down the far slope. And the ceiling is real: frontier models such as DeepSeek-R1 and o1-preview reach only 20–23.6% exact match on constraint-satisfaction problems that demand genuine backtracking, so the very capacity a productive self-revision would need is the thing these models most lack (see: Can reasoning models actually sustain long-chain reflection?).
The practical takeaway is that the fix isn't simply to revise less but to *verify locally*. Step-level confidence filtering catches breakdowns that whole-trace averaging hides, and can stop a trace early before a bad revision metastasizes, matching majority-vote accuracy with far fewer generated traces (see: Does step-level confidence outperform global averaging for trace filtering?). More broadly, checking intermediate states during generation rather than scoring the final answer lifted task success from 32% to 87%, because most failures are process violations, not wrong answers (see: Where do reasoning agents actually fail during long traces?). Self-revision degrades because it's an unverified internal edit; the cure is external or step-local verification that catches the bad edit before the trace keeps building on it.
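The difference between whole-trace averaging and step-local confidence can be shown with a small sketch. It assumes each reasoning step already carries a confidence value in [0, 1]; how that value is derived is left open, and the window size and threshold are illustrative rather than taken from the cited work.

```python
from statistics import mean


def global_confidence(step_conf):
    """Whole-trace average: one number for the entire trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Worst sliding-window average: exposes a local collapse."""
    if len(step_conf) <= window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def first_breakdown(step_conf, window=3, threshold=0.5):
    """Index of the step where the window average first falls below
    `threshold`, or None. A generator could stop the trace here."""
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1
    return None


# Made-up confidences: a trace that is mostly confident with a short collapse.
trace = [0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9]
print(global_confidence(trace))         # stays well above 0.5
print(lowest_window_confidence(trace))  # drops to the collapsed region's level
print(first_breakdown(trace))           # step index at which to stop early
```

In this toy trace the global average looks healthy even though three consecutive steps collapsed, while the windowed score flags the collapse and gives a concrete step at which to stop before later steps build on it.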
Sources (9 notes)
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.