INQUIRING LINE

Does self-revision actually improve reasoning in large language models?

This explores whether an LLM checking and rewriting its own reasoning actually makes its answers more correct — and the corpus suggests it usually doesn't, unless the correction signal comes from outside the model.


The short answer from the corpus is counterintuitive: self-revision, left to the model itself, tends to make things worse, not better. Direct measurements on o1-style reasoning models (QwQ, R1, LIMO) show that most revisions keep a wrong answer wrong, and smaller models frequently flip a correct answer to an incorrect one mid-revision. Longer chains with more second-guessing actually correlate with *lower* accuracy (Does self-revision actually improve reasoning in language models?). So more deliberation is not automatically more truth.
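
One way to make this kind of measurement concrete is to tally what each revision does to correctness. The sketch below is illustrative only: the `(correct_before, correct_after)` record format, the function name, and the sample data are assumptions, not the evaluation protocol of the cited work.

```python
from collections import Counter

# Hypothetical labels for the four possible effects of one revision.
TRANSITIONS = {
    (False, False): "wrong_to_wrong",
    (False, True): "wrong_to_right",
    (True, False): "right_to_wrong",
    (True, True): "right_to_right",
}

def tally_revisions(records):
    """Count revision outcomes.

    `records` is an assumed list of (correct_before, correct_after)
    booleans, one pair per revision step.
    """
    return Counter(TRANSITIONS[(before, after)] for before, after in records)

# Made-up data for illustration, not measurements from any paper.
example = [(False, False), (False, False), (True, False), (False, True), (True, True)]
counts = tally_revisions(example)

# Revision only helps overall if it fixes more answers than it breaks.
net_gain = counts["wrong_to_right"] - counts["right_to_wrong"]
```

Under this framing, the finding above corresponds to `wrong_to_wrong` dominating the tally and `right_to_wrong` being non-trivial for smaller models.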

The pivotal variable turns out to be the *source* of the critique, not the act of revising. When an external model guides the revision, accuracy improves; when a model audits its own uncertain output, it typically amplifies its confidence in the wrong answer rather than catching the mistake (Does revising your own reasoning actually help or hurt?). There's a clean mechanistic reason for this self-sabotage: models carry a structural bias toward trusting answers they generated themselves, because their own high-probability tokens simply *feel* more correct during evaluation. The fix is comparison against outside alternatives, which breaks the self-agreement loop (Why do models trust their own generated answers?). This connects to a deeper formal limit: self-improvement is bounded by the generation-verification gap, meaning a model can't reliably validate a fix without something external to check it against (What stops large language models from improving themselves?).

But 'external' doesn't have to mean a human or a bigger model standing over its shoulder. The more interesting thread in the collection is that self-correction *can* be trained in; it just can't be improvised at inference time. Supervised fine-tuning on pre-recorded correction traces fails, because the errors in training don't match the errors at test time and models collapse into a single canned 'correction' move. What works is multi-turn online reinforcement learning on the model's *own* live mistakes, so it practices fixing the errors it actually makes (Why does self-correction training on offline data fail?). Related work shows models can even learn to compute their own reward in the unused sequence space after their answer, internalizing self-evaluation during training at zero inference cost (Can models learn to evaluate their own work during training?), and that proposer-solver self-play can manufacture an external-feeling verification signal without human labels (Can language models improve themselves without any external training data?).
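
The SQLM note in the sources describes the solver learning via majority-vote verification. A minimal sketch of that idea follows, assuming several sampled answers per problem. The function names and the `min_agreement` threshold are hypothetical choices made for illustration, not details taken from SQLM.

```python
from collections import Counter

def majority_vote(answers, min_agreement=0.5):
    """Return (winner, agreement) for a list of sampled answers.

    `winner` is None when no answer exceeds `min_agreement`, an
    assumed threshold that is not taken from the SQLM work.
    """
    if not answers:
        return None, 0.0
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return (top if agreement > min_agreement else None), agreement

def vote_rewards(answers, min_agreement=0.5):
    """Give reward 1.0 to samples matching the majority answer, else 0.0."""
    winner, _ = majority_vote(answers, min_agreement)
    if winner is None:
        # No consensus: treat the problem as uninformative.
        return [0.0] * len(answers)
    return [1.0 if a == winner else 0.0 for a in answers]

samples = ["4", "4", "5"]           # made-up solver outputs
winner, agreement = majority_vote(samples)
rewards = vote_rewards(samples)
```

The vote is a consensus signal rather than ground truth: if the solver is consistently wrong, the majority answer is wrong too, which is why the note pairs it with a proposer that generates calibrated problems.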

It's also worth zooming out on what 'reasoning' is doing in the first place, because it reframes why revision underdelivers. Frontier reasoning models that look fluent at reflection score only about 20-23.6% exact match on constraint-satisfaction problems demanding genuine backtracking: the appearance of careful reflection doesn't translate to competence on unfamiliar structures (Can reasoning models actually sustain long-chain reflection?). And failures track instance *novelty*, not problem complexity: models lean on pattern-matched instances rather than general algorithms, so revising a chain doesn't help when the underlying approach was a memorized template that doesn't fit (Do language models fail at reasoning due to complexity or novelty?). If the model never had the right method, re-reading its own work won't conjure one.

The thing you might not have known you wanted to know: the corpus quietly dissolves the romantic picture of an AI 'thinking harder' and getting wiser. Real gains in reasoning seem to come from a small set of high-entropy 'forking' decisions where the model commits to a direction (Do high-entropy tokens drive reasoning model improvements?), and from baking verification into training, far more than from after-the-fact self-revision. Reflection that isn't anchored to an external check is, at best, theater; at worst, it's a confidence machine for wrong answers.
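
To illustrate what a high-entropy 'forking' token means, the sketch below computes Shannon entropy for each position's next-token distribution and flags the top fraction. The 20% default mirrors the "~20% of tokens" figure in the sources; the function names, the toy distributions, and the top-k selection rule are assumptions, not the cited paper's procedure.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(token_dists, fraction=0.2):
    """Flag the highest-entropy positions in a sequence.

    `token_dists` is an assumed list of per-position probability
    lists. Ties at the cutoff can flag slightly more than `fraction`.
    """
    ents = [entropy(d) for d in token_dists]
    k = max(1, round(fraction * len(ents)))
    cutoff = sorted(ents, reverse=True)[k - 1]
    return [e >= cutoff for e in ents]

# Toy two-way distributions for five positions (made-up numbers).
dists = [
    [1.0, 0.0],    # fully confident
    [0.5, 0.5],    # maximally uncertain: a 'fork'
    [0.9, 0.1],
    [0.99, 0.01],
    [0.8, 0.2],
]
mask = forking_mask(dists)
```

In a training setup along the lines the note describes, a mask like this would decide which token positions contribute to the gradient; that wiring is omitted here.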


Sources (10 notes)

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability researcher. The question remains open: under what conditions does self-revision actually *improve* reasoning in LLMs, and when does it degrade performance?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Feb 2026. Key constraints reported:
• Unaided self-revision on o1-style models (QwQ, R1, LIMO) *worsens* accuracy; longer chains correlate with lower scores (~2024–2025).
• Models systematically trust their own prior outputs over external alternatives, amplifying confidence in wrong answers rather than catching errors (2024-03, 2024-12).
• Supervised fine-tuning on correction traces fails due to train–test distribution mismatch; only online RL on live mistakes works (~2024-09).
• Frontier reasoning models achieve only ~20–23% on constraint-satisfaction tasks demanding genuine backtracking, despite fluent reflection (2024-04).
• Reasoning gains cluster around high-entropy 'forking' decisions and training-time verification, not post-hoc reflection (~2025-06).

Anchor papers (verify; mind their dates):
• arXiv:2403.09972 (2024-03): Think Twice Before Trusting — self-detection bias in LLMs.
• arXiv:2409.12917 (2024-09): Training Language Models to Self-Correct via RL — online learning, not SFT.
• arXiv:2506.01939 (2025-06): High-Entropy Minority Tokens — forking points, not longer chains, drive gains.
• arXiv:2502.12215 (2025-02): Test-Time Scaling of o1-like Models — re-evaluates scaling assumptions.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer inference methods (multi-agent orchestration, external verifier systems, dynamic routing), training advances (constitutional RL, process reward models), or evals have *relaxed* the self-agreement trap or overturned the distribution-mismatch barrier. Where does self-revision still fail and why? What has been solved?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that claims self-revision *does* improve reasoning under specific setups (e.g., with scaffolding, domain-specific training, or external memory). Highlight the gap between claims and measurements.
(3) Propose 2 research questions that assume the regime *has* shifted: (a) Can externality be *simulated* at inference time without a second model? (b) Does reasoning quality depend on *which* revision path the model explores, independent of self vs. external critique?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
