Why does model self-revision increase confidence while degrading accuracy?

This explores the paradox where a model that reconsiders its own answer ends up surer of itself but more often wrong — and what the corpus says is driving that split between confidence and correctness.

This explores the paradox where a model that reconsiders its own answer ends up surer of itself but more often wrong. The corpus is unusually direct about the mechanism: a model has a built-in bias toward trusting answers it generated itself. Because a high-probability output simply *feels* more correct when the same model evaluates it, self-review becomes a self-agreement loop rather than a real check Why do models trust their own generated answers?. Each pass over its own reasoning adds another vote of confidence without adding any independent evidence — so confidence climbs while accuracy doesn't.

The key insight is that the revision *act* isn't the problem — the revision *source* is. When an external model critiques the output, accuracy improves; when a model revises against its own uncertain reasoning, it typically amplifies confidence in the wrong answer instead of correcting it Does revising your own reasoning actually help or hurt?. This has a name: 'degeneration of thought,' a distinct failure mode where single-model self-revision entrenches errors. Notably, the fix isn't more self-reflection but *difference* — multi-agent debate between genuinely different models reverses the pattern and improves both accuracy and calibration Does a model improve by arguing with itself?. The corrective ingredient is an outside perspective the model can't generate from inside its own distribution.

The accuracy degradation is well-measured in practice. Across o1-like reasoning models (QwQ, R1, LIMO), most revisions retain the wrong answer, and smaller models frequently flip *correct* answers to incorrect during revision — with longer chains and more revision steps correlating with lower accuracy, not higher Does self-revision actually improve reasoning in language models?. There's a compounding effect too: once an error sits in the context, it biases everything downstream, causing sharp non-linear degradation as the model conditions on its own mistakes Do models fail worse when their own errors fill the context?. So a revision pass isn't neutral — it feeds the model more of its own potentially-wrong reasoning to anchor on.

Step back and this is one instance of a structural ceiling the corpus calls the self-improvement mirage: pure self-improvement stalls because of the generation-verification gap — a model can't verify better than it can generate. Reliable improvement methods all quietly smuggle in an external anchor: a past model version, a third-party judge, a user correction, or tool feedback Can models reliably improve themselves without external feedback?. The same logic explains why training-time fixes work when they introduce real error signal: self-correction has to be trained with online RL on the model's *own actual mistakes*, because offline correction traces don't match the errors the model makes at test time Why does self-correction training on offline data fail?.

The twist worth taking away: confidence isn't a useless signal — it's just being used in the wrong loop. When model confidence is harnessed as a *reward* during training rather than as a self-grading shortcut at inference, it can actually restore calibration and strengthen reasoning Can model confidence work as a reward signal for reasoning?. The failure of self-revision isn't that the model is overconfident in the abstract — it's that asking a model to check itself measures agreement, while real correction requires a source of difference the model doesn't already contain.

Sources 8 notes

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question—still open—is: Why does a model that revises its own answer end up more confident but less accurate?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot:
• Self-revision without external critique creates a self-agreement loop: the same model's confidence in its own output increases with each pass, but accuracy does not (2024–2025).
• Single-model self-revision entrenches errors; multi-agent debate between genuinely different models reverses this, improving both accuracy and calibration (2023–2024).
• Across o1-like models (QwQ, R1, LIMO), longer revision chains correlate with *lower* accuracy; smaller models flip correct answers to incorrect during revision (2025).
• The "self-improvement mirage": pure self-improvement stalls because generation outpaces verification; reliable fixes smuggle in external anchors—past versions, third-party judges, or tool feedback (2024–2025).
• Model confidence as *intrinsic reward* during training (not as self-grading at test time) can restore calibration and strengthen reasoning (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2309.13007 (2023): ReConcile — diversity and consensus among models
• arXiv:2403.09972 (2024): Think Twice Before Trusting — self-detection bias
• arXiv:2409.12917 (2024): Training via RL from self-correction
• arXiv:2502.12215 (2025): o1-like test-time scaling limits

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—especially the claim that revision degrades accuracy and that multi-agent debate reverses it—judge whether newer models, scaled RL from self-feedback, or better verifier tooling (e.g., symbolic oracles, learned reward models) have since relaxed or overturned these limits. Which are still observed in Feb 2025+ models? Where has external critique been built into training rather than inference?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper shown that self-revision *does* improve accuracy under specific architectural or training conditions? Name it and why it may reconcile the tension.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Do models trained with confidence-weighted RL surpass the self-verification ceiling?" or "Can ensembles of differently-seeded models avoid the self-agreement loop?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Why does model self-revision increase confidence while degrading accuracy?

Sources 8 notes

Next inquiring lines