Why does external critique improve revision accuracy more than self-assessment?
This explores why a model revising its work does better when an outside critic guides the revision than when it grades itself — and what breaks when a model is left to second-guess alone.
The question is not whether revision helps in general, but why the *source* of the feedback changes the outcome. The corpus points to a clean answer: the act of revising is neutral; what determines accuracy is who's holding the red pen. When a model revises against an external critic, accuracy goes up; when it revises against its own judgment, it tends to talk itself further into the wrong answer [Does revising your own reasoning actually help or hurt?]. Self-revision in o1-style reasoning models (QwQ, R1, LIMO) has been shown to mostly *retain* wrong answers, and smaller models often flip correct answers to incorrect; longer chains with more self-revisions correlate with lower accuracy [Does self-revision actually improve reasoning in language models?].
The mechanism behind the failure has a name: degeneration of thought. A single model reconsidering its own prior reasoning doesn't get a fresh signal; it gets an echo. It compounds its earlier confidence, so errors get *more* certain rather than corrected. Introduce genuinely different models in debate and the pattern reverses, improving both accuracy and calibration [Does a model improve by arguing with itself?]. This connects to a deeper structural claim: pure self-improvement is circular. It stalls on three things: a generation-verification gap (a model can't reliably verify what it couldn't reliably generate), diversity collapse, and reward hacking. Every method that *does* reliably improve smuggles in an external anchor: a past model version, a third-party judge, a user correction, or a tool's output [Can models reliably improve themselves without external feedback?].
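The echo effect and the role of an external anchor can be illustrated with a deliberately tiny toy. This is a sketch, not a model of any real system: `carryless_add` is a hypothetical stand-in for a model with a systematic flaw, and `self_critic`, `tool_critic`, and `revise` are names invented here for illustration. The point is only structural: a critic that shares the generator's procedure reproduces its error, while an independent tool does not.

```python
from typing import Callable

# A critic looks at the problem and returns the answer it believes is right.
Critic = Callable[[int, int], int]


def carryless_add(a: int, b: int) -> int:
    """Hypothetical flawed 'model': adds digit by digit but drops every carry."""
    result, place = 0, 1
    while a > 0 or b > 0:
        result += ((a % 10 + b % 10) % 10) * place
        a, b, place = a // 10, b // 10, place * 10
    return result


def self_critic(a: int, b: int) -> int:
    """Self-assessment: re-derives the answer with the same flawed procedure."""
    return carryless_add(a, b)


def tool_critic(a: int, b: int) -> int:
    """External anchor: an independent tool (here, Python's integer addition)."""
    return a + b


def revise(a: int, b: int, critic: Critic) -> int:
    """Draft an answer, then adopt the critic's answer if it disagrees."""
    draft = carryless_add(a, b)
    verdict = critic(a, b)
    if verdict != draft:
        return verdict  # the critic supplied a signal the draft lacked
    return draft        # the critic agreed, so nothing changes


if __name__ == "__main__":
    print(revise(17, 25, self_critic))  # 32: the echo confirms the wrong draft
    print(revise(17, 25, tool_critic))  # 42: the independent check corrects it
```

In this toy the self-critic can never disagree with the draft, which is the generation-verification gap in its most extreme form; real models sit somewhere between the two critics.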
There's a confidence angle that makes this concrete. A model's confidence predicts how much it resists changing its output: when confidence is high, the output barely moves even when the prompt is rephrased [Does model confidence predict robustness to prompt changes?]. That's exactly the wrong property for self-correction: the cases where you most need a revision are the confidently-wrong ones, and those are the cases self-assessment is least equipped to overturn. An external critic isn't subject to the same confidence inertia, so it can supply the disagreement the model can't generate internally.
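A small sketch makes the blind spot visible. Everything here is assumed for illustration: the `Draft` records, the confidence values, the `inertia` threshold, and the rule that a model only reopens answers it is unsure about are all hypothetical, not measurements from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Draft:
    question: str
    answer: str
    confidence: float  # model's self-reported confidence in [0, 1]
    reference: str     # ground truth, visible only to the external checker


def self_review_reopens(draft: Draft, inertia: float = 0.8) -> bool:
    """Toy rule (an assumption): the model only reconsiders unsure answers."""
    return draft.confidence < inertia


def external_check_flags(draft: Draft) -> bool:
    """External critic: compares against the reference, ignoring confidence."""
    return draft.answer != draft.reference


# Made-up drafts covering the four confidence/correctness combinations.
DRAFTS = [
    Draft("q1", "A", 0.95, "A"),  # confident and right
    Draft("q2", "B", 0.92, "C"),  # confident and wrong
    Draft("q3", "D", 0.40, "D"),  # unsure but right
    Draft("q4", "E", 0.35, "F"),  # unsure and wrong
]

# Wrong answers that self-review never reopens.
missed_by_self = [
    d.question for d in DRAFTS
    if external_check_flags(d) and not self_review_reopens(d)
]
# Correct answers that self-review reopens anyway, risking a bad flip.
needlessly_reopened = [
    d.question for d in DRAFTS
    if self_review_reopens(d) and not external_check_flags(d)
]
# Everything the external check catches, regardless of confidence.
flagged_externally = [d.question for d in DRAFTS if external_check_flags(d)]

if __name__ == "__main__":
    print(missed_by_self)       # ['q2']
    print(needlessly_reopened)  # ['q3']
    print(flagged_externally)   # ['q2', 'q4']
```

Under these assumptions the confidently-wrong draft (`q2`) is exactly the one self-review skips, while the external check flags it because it never consults the model's confidence.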
Critique's value runs deeper than the test-time accuracy bump. Training models to *critique* noisy responses builds deeper understanding than training them to imitate correct answers, because engaging with failure modes forces structural reasoning rather than surface-pattern copying [Does critiquing errors teach deeper understanding than imitating correct answers?]. And in the training loop itself, step-level critique preserves solution diversity and counteracts the premature convergence that self-training otherwise causes, a more fundamental benefit than the accuracy gain [Do critique models improve diversity during training itself?]. This frames the imitation trap from the other side: copying a stronger model's *style* fools evaluators but closes no real capability gap [Can imitating ChatGPT fool evaluators into thinking models improved?].
The surprise worth taking away: self-assessment isn't merely weaker than external critique; it can be actively harmful, hardening confidence in a wrong answer or flipping a correct one to incorrect. The fix isn't a smarter self-grader; it's *difference*. Anything that introduces an independent vantage point (a different model, a debate partner, a human correction, a verifiable tool signal) breaks the echo chamber. There's even early work (Post-Completion Learning) on letting a model internalize self-evaluation during training using the unused space after its output, which sidesteps inference cost but still leans on a learned reward signal rather than naked self-second-guessing [Can models learn to evaluate their own work during training?].
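Why *difference* matters can be stated as a short, idealized calculation. This is an illustration added here, not a result from the cited notes. Assume an author and a reviewer each err with the same marginal rate $e$, and let $\rho$ be the correlation between their error indicators $X$ and $Y$; all three symbols are assumptions introduced for this sketch.

```latex
% X, Y \in \{0, 1\}: error indicators for author and reviewer (assumed)
% P(X = 1) = P(Y = 1) = e, \quad \operatorname{Corr}(X, Y) = \rho
\begin{align}
  \operatorname{Cov}(X, Y) &= \rho \, e (1 - e) \\
  P(X = 1,\, Y = 1) &= \mathbb{E}[XY] = e^{2} + \rho \, e (1 - e)
\end{align}
% Independent reviewer (\rho = 0):  P(\text{both wrong}) = e^{2}
% Identical reviewer   (\rho = 1):  P(\text{both wrong}) = e
```

With an illustrative $e = 0.2$, an independent reviewer shares the author's error 4% of the time, while a perfectly correlated one (the model grading itself, in the limit) shares it 20% of the time, so the review removes nothing. Real critics fall between these extremes; the calculation only shows why lowering $\rho$, rather than making the same grader smarter, is the lever.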
Sources (9 notes)
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.