Why does model self-revision increase confidence while degrading accuracy?
This explores the paradox where a model that reconsiders its own answer ends up surer of itself but more often wrong — and what the corpus says is driving that split between confidence and correctness.
The corpus is unusually direct about the mechanism: a model has a built-in bias toward trusting answers it generated itself. Because a high-probability output simply *feels* more correct when the same model evaluates it, self-review becomes a self-agreement loop rather than a real check (see "Why do models trust their own generated answers?"). Each pass over its own reasoning adds another vote of confidence without adding any independent evidence, so confidence climbs while accuracy doesn't.
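A toy odds update shows how confidence can climb with no gain in accuracy. The sketch below is an illustration only, not a method from the corpus: `naive_confidence`, `prior`, and `reviewer_accuracy` are hypothetical names, and the model assumes every agreeing review pass is (wrongly) counted as independent evidence.

```python
def naive_confidence(prior: float, reviewer_accuracy: float, passes: int) -> float:
    """Confidence after `passes` agreeing self-reviews, each treated as independent.

    Assumption (illustrative): every agreeing pass multiplies the odds by the
    likelihood ratio an independent reviewer of this accuracy would justify.
    A self-review shares the original answer's errors, so the true probability
    of being right stays at `prior`; only the reported confidence moves.
    """
    if not (0.0 < prior < 1.0 and 0.0 < reviewer_accuracy < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    likelihood_ratio = reviewer_accuracy / (1.0 - reviewer_accuracy)
    odds *= likelihood_ratio ** passes
    return odds / (1.0 + odds)
```

With a prior of 0.6 and an assumed reviewer accuracy of 0.7, three agreeing passes push the reported confidence above 0.9, while the chance that the answer is right is still 0.6 because the passes add no independent evidence.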
The key insight is that the revision *act* isn't the problem; the revision *source* is. When an external model critiques the output, accuracy improves; when a model revises against its own uncertain reasoning, it typically amplifies confidence in the wrong answer instead of correcting it (see "Does revising your own reasoning actually help or hurt?"). This has a name: 'degeneration of thought,' a distinct failure mode where single-model self-revision entrenches errors. Notably, the fix isn't more self-reflection but *difference*: multi-agent debate between genuinely different models reverses the pattern and improves both accuracy and calibration (see "Does a model improve by arguing with itself?"). The corrective ingredient is an outside perspective the model can't generate from inside its own distribution.
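A simple majority-vote calculation gives a rough intuition for why difference matters. This is a deliberately simplified stand-in for debate, not the protocol described in the corpus: `majority_accuracy` is a hypothetical helper, and it assumes each reviewer is right with the same probability `p` and that their errors are fully independent.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of `n` independent reviewers is correct.

    Assumptions (illustrative): n is odd, each reviewer is right with
    probability p, and errors are independent across reviewers.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(needed, n + 1)
    )
```

Under these assumptions, three independent reviewers at `p = 0.6` reach a majority accuracy of 0.648, and adding more raises it further. Three copies of the same model share one set of errors, so they behave like a single voter and stay at 0.6.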
The accuracy degradation is well measured in practice. Across o1-like reasoning models (QwQ, R1, LIMO), most revisions retain the wrong answer, and smaller models frequently flip *correct* answers to incorrect during revision. Longer chains and more revision steps correlate with lower accuracy, not higher (see "Does self-revision actually improve reasoning in language models?"). There's a compounding effect too: once an error sits in the context, it biases everything downstream, causing sharp non-linear degradation as the model conditions on its own mistakes (see "Do models fail worse when their own errors fill the context?"). So a revision pass isn't neutral: it feeds the model more of its own potentially wrong reasoning to anchor on.
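The compounding effect can be made concrete with a small exact calculation. The sketch below is a toy model, not a result from the cited notes: `error_count_distribution`, `base_error`, and `contamination` are hypothetical names, and the assumption is that each error already in the context raises the per-step error rate by a fixed fraction of the base rate.

```python
def error_count_distribution(steps: int, base_error: float, contamination: float) -> list[float]:
    """Return dist, where dist[k] is the probability of exactly k errors after `steps` steps.

    Assumption (illustrative): with k errors already in context, the next
    step fails with probability min(1, base_error * (1 + contamination * k)).
    Setting contamination to 0 gives independent, non-compounding errors.
    """
    dist = [1.0]  # before any step, the context holds zero errors
    for _ in range(steps):
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            err = min(1.0, base_error * (1.0 + contamination * k))
            nxt[k] += prob * (1.0 - err)   # step succeeds, error count unchanged
            nxt[k + 1] += prob * err       # step fails, one more error in context
        dist = nxt
    return dist


def expected_errors(dist: list[float]) -> float:
    """Mean number of errors under the distribution returned above."""
    return sum(k * p for k, p in enumerate(dist))
```

Comparing `contamination=0.0` with `contamination=1.0` over 20 steps at `base_error=0.05` leaves the chance of an error-free run unchanged, but the expected error count grows faster than linearly in the number of steps once errors feed on each other.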
Step back, and this is one instance of a structural ceiling the corpus calls the self-improvement mirage: pure self-improvement stalls because of the generation-verification gap, meaning a model can't verify better than it can generate. Reliable improvement methods all quietly smuggle in an external anchor: a past model version, a third-party judge, a user correction, or tool feedback (see "Can models reliably improve themselves without external feedback?"). The same logic explains why training-time fixes work when they introduce real error signal: self-correction has to be trained with online RL on the model's *own actual mistakes*, because offline correction traces don't match the errors the model makes at test time (see "Why does self-correction training on offline data fail?").
The twist worth taking away: confidence isn't a useless signal; it's just being used in the wrong loop. When model confidence is harnessed as a *reward* during training rather than as a self-grading shortcut at inference, it can actually restore calibration and strengthen reasoning (see "Can model confidence work as a reward signal for reasoning?"). The failure of self-revision isn't that the model is overconfident in the abstract. It's that asking a model to check itself measures agreement, while real correction requires a source of difference the model doesn't already contain.
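The idea of ranking reasoning traces by answer-span confidence to form synthetic preferences can be sketched as follows. This is not the RLSF implementation: `Trace`, `answer_confidence`, `preference_pairs`, and the `min_gap` threshold are hypothetical, and the confidence measure used here (geometric-mean token probability over the answer span) is an assumption chosen for simplicity.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Trace:
    text: str                     # the full reasoning trace
    answer_logprobs: list[float]  # token log-probabilities over the final answer span


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean token probability of the answer span (assumed measure)."""
    if not trace.answer_logprobs:
        raise ValueError("answer span must contain at least one token")
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: list[Trace], min_gap: float = 0.1) -> list[tuple[str, str]]:
    """Build (chosen, rejected) text pairs from traces ranked by confidence.

    Pairs whose confidence gap is below `min_gap` are skipped so that
    near-ties do not become noisy training preferences.
    """
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    for chosen, rejected in combinations(ranked, 2):  # chosen always ranks higher
        if answer_confidence(chosen) - answer_confidence(rejected) >= min_gap:
            pairs.append((chosen.text, rejected.text))
    return pairs
```

The resulting pairs could feed a preference-based training step. The point of the sketch is only that confidence acts as a ranking signal across many sampled traces during training, rather than as a verdict a model passes on a single answer of its own at inference.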
Sources (8 notes)
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.