How does self-revision in reasoning chains amplify confidence in wrong answers?

This explores why a model rethinking its own answer tends to dig in on mistakes rather than catch them — and what the corpus says actually breaks that loop.

The finding is counterintuitive: when a reasoning model revisits its own chain of thought, it often becomes *more* sure of a wrong answer instead of correcting it. The corpus traces this to a structural bias rather than a quirk of any one model. Models systematically over-trust outputs they generated themselves, because a high-probability answer simply *feels* more correct when the same model evaluates it (see "Why do models trust their own generated answers?"). So a revision pass isn't a fresh look; it's the same biased judge re-reading its own work, and the loop tends to ratify the first answer rather than challenge it.

The sharpest framing is that the revision *source*, not the revision *act*, decides the outcome. When an external critic guides the rewrite, accuracy improves; when a model revises its own uncertain output, it usually amplifies confidence in the error (see "Does revising your own reasoning actually help or hurt?"). This shows up empirically across o1-like models: analysis of QwQ, R1, and LIMO finds that most revisions keep the wrong answer, that smaller models often flip *correct* answers to incorrect, and that longer chains with more revision steps correlate with *lower* accuracy (see "Does self-revision actually improve reasoning in language models?"). Reflection, in other words, is largely theater: across eight reasoning models, reflective passages rarely change the answer and mostly serve as post-hoc confirmation of what the model already said (see "Is reflection in reasoning models actually fixing mistakes?").
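One way to check claims like these on your own traces is to tally what each revision step does to correctness. The sketch below is a minimal illustration, not the analysis used in the cited work: `tally_revisions`, the example answers, and the gold label are all hypothetical, and it assumes you have already extracted the answer the chain commits to at each step.

```python
from collections import Counter


def tally_revisions(answers, gold):
    """Count correctness transitions between consecutive answers in one chain.

    answers: answers the chain committed to, in order (hypothetical input format).
    gold: the reference answer.
    """
    def label(answer):
        return "correct" if answer == gold else "wrong"

    counts = Counter()
    for before, after in zip(answers, answers[1:]):
        counts[f"{label(before)}->{label(after)}"] += 1
    return counts


# Hypothetical chain: starts wrong, re-checks and stays wrong,
# finds the right answer, then revises it away again.
counts = tally_revisions(["12", "12", "15", "12"], gold="15")
```

A chain dominated by `wrong->wrong` matches the "revisions keep the wrong answer" pattern, while `correct->wrong` counts capture the harmful flips reported for smaller models.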

The corpus even names this as a distinct failure mode, "degeneration of thought," in which a single model reconsidering its own prior reasoning entrenches errors. The fix isn't more self-reflection but genuine disagreement: multi-agent debate between *different* models reverses the pattern, improving both accuracy and calibration (see "Does a model improve by arguing with itself?"). The same thread runs through self-detection research: the cure is comparison against real alternatives that break the self-agreement loop (see "Why do models trust their own generated answers?").
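The structural difference between the two loops can be sketched in a few lines. This is a toy under loud assumptions: the "agents" are hypothetical stubs that adopt an answer only when a strict majority of what they are shown agrees on it, and `self_revise` and `debate` are illustrative names, not an API from any of the cited papers. The point is only what each loop lets the model see.

```python
from collections import Counter


def make_agent(first_guess):
    """Hypothetical stub agent: keeps its first guess unless a strict
    majority of the answers it is shown agree on one answer."""
    def agent(question, shown_answers):
        if not shown_answers:
            return first_guess
        top, votes = Counter(shown_answers).most_common(1)[0]
        return top if votes > len(shown_answers) / 2 else first_guess
    return agent


def self_revise(agent, question, rounds=2):
    """Self-revision: the only evidence shown is the agent's own last answer."""
    answer = agent(question, [])
    for _ in range(rounds):
        answer = agent(question, [answer])
    return answer


def debate(agents, question, rounds=2):
    """Debate: each agent is shown the other agents' answers, never its own."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]


wrong_agent = make_agent("7")  # assume the true answer is "9"
alone = self_revise(wrong_agent, "toy question")
together = debate([wrong_agent, make_agent("9"), make_agent("9")], "toy question")
```

In this toy the lone agent can only ever agree with itself, while the debate exposes it to alternatives. Nothing here guarantees the majority is right; the sketch illustrates the loop structure, not the reported accuracy gains.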

There's a deeper lesson lurking here that you might not have gone looking for: fluent reflection is not the same as competence. Frontier reasoning models that *sound* like they're backtracking and re-checking still reach only 20-23.6% exact match on constraint-satisfaction problems that require genuine backtracking (see "Can reasoning models actually sustain long-chain reflection?"). And longer isn't better: chain-of-thought accuracy follows an inverted-U, so piling on more revision steps past the peak actively hurts (see "Why does chain of thought accuracy eventually decline with length?"). The revision machinery produces the *texture* of careful thought without its substance.
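To make the inverted-U concrete, here is a toy model, which is an assumption for illustration and not the formula from the cited work. It supposes that more steps raise the chance the chain has done enough work, while every extra step carries a small chance of derailing it. `toy_accuracy`, `steps_needed`, and `step_error` are hypothetical names and values.

```python
import math


def toy_accuracy(length, steps_needed=5.0, step_error=0.03):
    """Toy accuracy for a chain of `length` steps (illustrative assumption only)."""
    coverage = 1.0 - math.exp(-length / steps_needed)  # has the chain done enough work?
    survival = (1.0 - step_error) ** length            # has no step derailed it so far?
    return coverage * survival


def best_length(max_length=40, **params):
    """Chain length in 1..max_length with the highest toy accuracy."""
    return max(range(1, max_length + 1), key=lambda n: toy_accuracy(n, **params))


peak = best_length()
harder_peak = best_length(steps_needed=10.0)
```

Under these assumed parameters the best length lands strictly inside the range, so both very short and very long chains score worse than the peak. Raising `steps_needed` moves the peak to the right, which is at least consistent with the source's note that optimal length grows with task difficulty.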

Where it gets interesting is that confidence, the very thing self-revision inflates, can be turned into a useful signal when it is measured well rather than trusted blindly. Using a model's answer-span confidence as a *reward* during training can actually restore calibration that RLHF degraded (see "Can model confidence work as a reward signal for reasoning?"), and confidence variance can be read as a diagnostic to detect over- and under-thinking and steer the model back toward balance (see "Can confidence patterns reveal overthinking versus underthinking?"). The distinction worth carrying away: confidence consulted in a self-reinforcing revision loop misleads, but confidence treated as an external measurement to calibrate against can help.
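The "confidence as reward" idea can be sketched as ranking sampled traces by how confident the model was on the answer span and turning the ranking into preference pairs. This is a minimal sketch of the general idea, not the RLSF implementation: `span_confidence`, `preference_pairs`, the `min_gap` threshold, and the log-probabilities below are all hypothetical, and it assumes you can read per-token log-probabilities for the answer tokens.

```python
import math
from itertools import combinations


def span_confidence(answer_logprobs):
    """Geometric-mean token probability over the answer span."""
    return math.exp(sum(answer_logprobs) / len(answer_logprobs))


def preference_pairs(traces, min_gap=0.1):
    """Build (chosen, rejected) pairs from traces ranked by answer-span confidence.

    traces: list of (trace_text, answer_logprobs) tuples (hypothetical format).
    min_gap: assumed minimum confidence difference for a pair to count.
    """
    scored = sorted(
        ((span_confidence(logprobs), text) for text, logprobs in traces),
        reverse=True,
    )
    return [
        (high_text, low_text)
        for (high, high_text), (low, low_text) in combinations(scored, 2)
        if high - low >= min_gap
    ]


# Made-up log-probabilities for three sampled traces of the same question.
traces = [
    ("trace A", [-0.05, -0.10]),
    ("trace B", [-0.90, -1.20]),
    ("trace C", [-0.08, -0.12]),
]
pairs = preference_pairs(traces)
```

Note the design choice: confidence is measured once, from outside the reasoning loop, and used to compare alternatives. That is the opposite of letting the model consult its own confidence while it revises.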

Sources (9 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM research analyst, evaluate whether self-revision in reasoning chains still amplifies confidence in wrong answers, or whether newer models, training methods, or evaluation harnesses have since relaxed this constraint.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; key claims:
• Single-model self-revision entrenches errors rather than correcting them; external critique (multi-agent debate) reverses the pattern (2024–2025).
• Reflective passages rarely change answers across eight reasoning models; ~20–23% accuracy on constraint-satisfaction tasks requiring genuine backtracking (2024–2025).
• Chain-of-thought accuracy follows inverted-U; longer chains past optimal peak degrade performance (2025-02).
• Model confidence consulted in self-reinforcing loops misleads; treated as an external measurement to calibrate against, it can restore calibration that RLHF degraded (2025–2026).
• o1-like models show minimal test-time scaling gains from additional reasoning steps; most revisions preserve wrong answers (2025-02).

Anchor papers (verify; mind their dates):
• arXiv:2404.09129 (2024-04): "When Hindsight is Not 20/20" — tests reflective limits.
• arXiv:2502.07266 (2025-02): "When More is Less" — inverted-U in chain length.
• arXiv:2502.12215 (2025-02): "Revisiting Test-Time Scaling" — o1-like model revision limits.
• arXiv:2603.12372 (2026-03): "Efficient Reasoning with Balanced Thinking" — confidence as calibration signal.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether newer models (e.g., o1-pro, o3, Claude-4), training innovations (reinforcement learning from self-feedback, post-completion learning), or orchestration (hierarchical debate, verifier-guided revision, memory-aware caching) have since relaxed or overturned it. Plainly separate the durable question (does intrinsic self-agreement still bias revision?) from the perishable limitation (can new tooling break the loop?). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers showing self-revision *does* improve calibration or accuracy under specific conditions.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do verifier-guided revisions with confidence-thresholded rollback escape the self-agreement trap?" or "Does multi-turn external debate between checkpoint versions outperform single-pass external critique?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
