How do self-revisions degrade reasoning accuracy in extended traces?

This explores why letting a reasoning model second-guess and rewrite its own work in long chains often makes its answers worse rather than better — and what's actually going wrong inside those revisions.

The most direct evidence is blunt: across QwQ, R1, and LIMO, most revisions keep a wrong answer wrong, and smaller models frequently flip a correct answer to an incorrect one mid-revision, so longer chains with more revisions actually correlate with *lower* accuracy (see "Does self-revision actually improve reasoning in language models?"). The revision isn't neutral overhead; it's an active source of degradation.
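One way to make the "wrong stays wrong, correct flips to wrong" pattern measurable is to tally what each revision does to the intermediate answer. The sketch below is illustrative only: the function name, the trace format (an ordered list of intermediate answers per problem), and the toy data are assumptions, not taken from the cited evaluations.

```python
from collections import Counter


def tally_revisions(traces, gold):
    """Count answer transitions across consecutive revisions.

    traces: dict of problem id -> ordered list of intermediate answers
            (hypothetical format; real traces need answer extraction first)
    gold:   dict of problem id -> reference answer
    """
    counts = Counter()
    for pid, answers in traces.items():
        # Each adjacent pair of answers is treated as one revision.
        for before, after in zip(answers, answers[1:]):
            src = "correct" if before == gold[pid] else "wrong"
            dst = "correct" if after == gold[pid] else "wrong"
            counts[f"{src}->{dst}"] += 1
    return counts


# Toy data invented for illustration; not real model output.
traces = {
    "p1": ["12", "15", "15"],  # starts correct, is revised away
    "p2": ["7", "9"],          # starts correct, is revised away
    "p3": ["3", "4", "5"],     # never correct
}
gold = {"p1": "12", "p2": "7", "p3": "8"}

counts = tally_revisions(traces, gold)
for transition, n in sorted(counts.items()):
    print(transition, n)
```

A high share of `wrong->wrong` and `correct->wrong` transitions relative to `wrong->correct` is the signature described above.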

The sharpest clue to *why* comes from separating who is doing the critiquing. When an external model guides the revision, accuracy improves; when a model revises its own uncertain output, it typically amplifies confidence in the wrong answer instead of correcting it. The decisive factor is the revision *source*, not the act of revising (see "Does revising your own reasoning actually help or hurt?"). This dovetails with work showing that a model's self-reflection is mostly confirmatory theater: reflections rarely change the initial answer, and calibration actually degrades under binary-reward training, so the model becomes more sure precisely as it becomes less reliable (see "Can we actually trust reasoning model outputs?"). A self-revising model is essentially asking a poorly calibrated judge to overrule itself, and that judge mostly votes to stay the course or wander off.

There's also a structural failure layer underneath. Reasoning models go wrong not from too little compute but from disorganization: 'wandering' into invalid exploration and 'underthinking' by abandoning promising paths too early. Notably, decoding-level penalties that discourage premature thought-switching recover accuracy without any fine-tuning, which suggests the good paths were there and got revised away (see "Why do reasoning models abandon promising solution paths?"). The pivots that matter most, planning and backtracking sentences, are sparse and disproportionately influential, so a bad revision at one of these 'thought anchors' can steer the whole rest of the trace off a cliff (see "Which sentences actually steer a reasoning trace?").
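To make the idea of a decoding-level thought-switching penalty concrete, here is a minimal sketch. It assumes a set of token ids that tend to open a new line of thought (for example, the first token of "Alternatively") and lowers their logits while the current thought is still short. The function name, the penalty size, and the length threshold are hypothetical choices for illustration, not the configuration of the cited work.

```python
def penalize_switch_tokens(logits, switch_token_ids, tokens_in_current_thought,
                           penalty=3.0, min_thought_len=50):
    """Return a copy of `logits` with switch tokens made less likely
    while the current thought is shorter than `min_thought_len` tokens.

    All parameter values here are assumptions chosen for the example.
    """
    adjusted = list(logits)
    if tokens_in_current_thought < min_thought_len:
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy vocabulary of three tokens; id 1 stands in for a thought-switch token.
logits = [1.0, 2.0, 0.5]
switch_ids = {1}

early = penalize_switch_tokens(logits, switch_ids, tokens_in_current_thought=10)
late = penalize_switch_tokens(logits, switch_ids, tokens_in_current_thought=80)

print(argmax(logits), argmax(early), argmax(late))
```

Early in a thought the switch token loses its lead, so greedy decoding continues the current path; once the thought is long enough, the logits are left untouched and switching is allowed again. No weights change, which is why this kind of intervention needs no fine-tuning.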

This reframes the whole 'longer is smarter' intuition. Accuracy as a function of chain length follows an inverted U: it peaks at an intermediate length and declines past it. The optimal length grows with task difficulty but shrinks with model capability, so more capable models prefer *shorter* chains, and RL training naturally drifts toward brevity as models improve (see "Why does chain of thought accuracy eventually decline with length?"). Extended revision pushes traces past the peak and down the far slope. And the ceiling is real: DeepSeek-R1 and o1-preview reach only 20 to 23.6% exact match on constraint-satisfaction problems that demand genuine backtracking, so the very capacity a productive self-revision would need is the thing these models most lack (see "Can reasoning models actually sustain long-chain reflection?").

The practical takeaway is that the fix isn't simply to revise less but to *verify locally*. Step-level confidence filtering catches breakdowns that whole-trace averaging hides, and can stop a trace early before a bad revision metastasizes, achieving accuracy gains comparable to majority voting with far fewer generated traces (see "Does step-level confidence outperform global averaging for trace filtering?"). More broadly, checking intermediate states during generation rather than scoring the final answer lifted task success from 32% to 87%, because most failures are process violations, not wrong answers (see "Where do reasoning agents actually fail during long traces?"). Self-revision degrades because it's an unverified internal edit; the cure is external or step-local verification that catches the bad edit before the trace keeps building on it.
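The contrast between whole-trace averaging and step-level filtering can be shown with a small sketch. It assumes each reasoning step already has a confidence score in [0, 1] (for instance, derived from token probabilities); the function names, the window size, the threshold, and the scores are all hypothetical and chosen only to illustrate the masking effect.

```python
def global_confidence(step_conf):
    """Mean confidence over the whole trace."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Lowest mean confidence over any run of `window` consecutive steps."""
    w = min(window, len(step_conf))
    return min(sum(step_conf[i:i + w]) / w
               for i in range(len(step_conf) - w + 1))


def first_breakdown(step_conf, threshold=0.5, window=3):
    """Index of the step where the sliding-window mean first drops below
    `threshold`, or None. A generator could stop the trace at this step."""
    w = min(window, len(step_conf))
    for end in range(w, len(step_conf) + 1):
        if sum(step_conf[end - w:end]) / w < threshold:
            return end - 1
    return None


# Invented scores: one trace is steady, the other collapses mid-way.
steady = [0.8] * 8
collapsing = [0.95] * 5 + [0.3, 0.3, 0.3] + [0.95, 0.95]

for name, trace in (("steady", steady), ("collapsing", collapsing)):
    print(name,
          round(global_confidence(trace), 3),
          round(lowest_window_confidence(trace), 3),
          first_breakdown(trace))
```

In this toy case the two traces have similar global averages, but only the sliding window exposes the local collapse, and `first_breakdown` marks the step at which generation could be cut short.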

Sources (9 notes)

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Which sentences actually steer a reasoning trace?

Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems analyst. The question remains open: **Why do self-revisions in extended reasoning traces often degrade rather than improve accuracy?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library of recent work reports:
- Across QwQ, R1, LIMO: most self-revisions keep wrong answers wrong; smaller models flip correct answers to incorrect ones mid-trace; longer chains with more revisions correlate with *lower* accuracy (~2025).
- The critical variable is revision *source*: external critique improves accuracy; self-revision amplifies confidence in wrong answers instead (~2025).
- Self-reflection under binary-reward training degrades calibration; models become more certain as reliability drops (~2024).
- Reasoning fails from disorganization (wandering, underthinking), not insufficient compute; decoding-level penalties on thought-switching recover accuracy without fine-tuning (~2025).
- Planning and backtracking steps ('thought anchors') are sparse but disproportionately influential; bad revisions here cascade (~2025).
- Accuracy vs. chain length follows an inverted U; frontier models peak at intermediate length, then decline; more capable models prefer *shorter* traces (~2025).
- Frontier models achieve only 20–23% on constraint-satisfaction problems requiring genuine backtracking (~2025).
- Step-level confidence filtering outperforms global averaging; process verification (vs. final-answer scoring) lifts success from 32% to 87% (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2404.09129 (2024) — reflective thinking limits
- arXiv:2505.20296 (2025) — wandering as exploration failure
- arXiv:2506.19143 (2025) — thought anchor influence
- arXiv:2508.15260 (2025) — confidence-aware filtering

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (post-June 2026), training methods (process reward models, iterative verification), orchestration (cached intermediate states, multi-agent debate), or evaluation harnesses have since relaxed or overturned it. Separate the durable question (does self-revision still degrade?) from perishable limitations (e.g., poor calibration, weak backtracking). Cite what resolved each and say plainly where constraints still hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months that claims self-revision *does* improve reasoning or that the inverted-U is flattened by new architectures or training.
(3) **Propose 2 research questions** that assume the regime may have shifted: e.g., "Do externally-guided revision orchestrations (e.g., collaborative multi-turn verification) now *sustain* accuracy past the inverted-U peak?" or "Has process reward model training rebalanced the self-reflection failure, making internal critique reliable?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
