How should training incorporate external critique versus encouraging self-correction?
This note explores a real tension in how models learn to improve: should training lean on outside critics (other models, judges, human corrections, tools), or push models to catch and fix their own mistakes? The corpus turns out to have a fairly clear answer about when each works.
The corpus lands on a sharper position than "it depends": pure self-correction has a structural ceiling, while critique (external or carefully internalized) is what actually moves capability. The cleanest statement of the ceiling is the generation-verification gap: a model can only bootstrap itself when it judges answers better than it produces them, and that gap grows with model size but vanishes for factual tasks (What limits how much models can improve themselves?). When training pushes past that ceiling with the model alone, things go wrong in characteristic ways: diversity collapse, reward hacking, and a model that revises its own uncertain answers into *more* confident wrong ones rather than corrections (Can models reliably improve themselves without external feedback?; Does revising your own reasoning actually help or hurt?).
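A minimal sketch of how the generation-verification gap could be estimated, assuming each prompt has several sampled answers, a ground-truth correctness label for each, and a score the model assigned to its own answer. The function name, the `(is_correct, self_score)` layout, and the best-of-n definition of "verified accuracy" are illustrative assumptions, not the measurement used in the underlying note.

```python
from statistics import mean


def generation_verification_gap(prompts):
    """Return (generation accuracy, self-verified accuracy, gap).

    `prompts` holds one non-empty list per prompt; each list contains
    (is_correct, self_score) pairs, one per sampled answer. Both the layout
    and the best-of-n selection rule are assumptions made for this sketch.
    """
    # Generation accuracy: average correctness of a randomly drawn sample.
    gen_acc = mean(
        mean(1.0 if ok else 0.0 for ok, _ in cands) for cands in prompts
    )
    # Verified accuracy: correctness of the sample the model itself scores highest.
    ver_acc = mean(
        1.0 if max(cands, key=lambda c: c[1])[0] else 0.0 for cands in prompts
    )
    return gen_acc, ver_acc, ver_acc - gen_acc


# Toy data: the model's self-scores happen to rank correct answers first.
toy = [
    [(True, 0.9), (False, 0.1)],
    [(True, 0.7), (False, 0.6)],
]
gen_acc, ver_acc, gap = generation_verification_gap(toy)
print(gen_acc, ver_acc, gap)
```

Under these assumptions a positive gap means self-filtering would pick better answers than the model produces on average, so there is something to bootstrap from; a gap at or below zero means self-selection adds nothing.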
The most useful reframing in the collection is that methods which look like self-improvement usually smuggle external signal back in. "Reliable" self-improvement works because it leans on past model versions, third-party judges, user corrections, or tool feedback; the anchor is external even when the loop looks closed (Can models reliably improve themselves without external feedback?). A companion note pushes this further: metacognition may need to be *externalized* rather than learned, because a model grading itself eventually runs into self-valuation problems (What actually constrains large language models from self-improvement?). The failure mode has a name, degeneration of thought: a single model rethinking its own reasoning amplifies its errors, and the pattern only reverses when genuinely different models are brought in to debate it (Does a model improve by arguing with itself?).
So what makes critique worth building into training, rather than simply imitating good answers? Critique turns out to be a *better teacher* than imitation, and that's the surprising part. Training a model to critique noisy, wrong responses produces deeper understanding than training it on correct answers, because critique forces engagement with failure modes instead of surface patterns; even imperfect critique supervision beats correct-answer imitation (Does critiquing errors teach deeper understanding than imitating correct answers?). This is the flip side of why copying a stronger model ("imitate ChatGPT") captures style but closes no real capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?). Critique inside the training loop also does something accuracy numbers hide: step-level critique counteracts "tail narrowing" and keeps solution diversity alive across self-training rounds, which matters more than the test-time bump (Do critique models improve diversity during training itself?).
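To make the contrast between imitation and critique supervision concrete, here is a small sketch of how the two kinds of training example might be laid out. The field names (`prompt`, `target`), the prompt template, and both helper functions are hypothetical; the source notes do not specify a data format.

```python
def imitation_example(question, reference_answer):
    """Imitation: the model is trained to reproduce a correct answer."""
    return {"prompt": question, "target": reference_answer}


def critique_example(question, noisy_response, critique):
    """Critique: the model sees a possibly wrong response and is trained to
    produce the critique of it. The template below is an assumed layout."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate response:\n{noisy_response}\n\n"
        "Critique the response:"
    )
    return {"prompt": prompt, "target": critique}


question = "What is 17 * 6?"
imit = imitation_example(question, "102")
crit = critique_example(
    question,
    noisy_response="17 * 6 = 96",
    critique="Incorrect: 17 * 6 = 102; the response dropped a carry.",
)
```

The design point is only that the critique target is conditioned on a flawed attempt, so the supervised signal is about the failure rather than the finished answer.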
The practical synthesis, then, is not external-or-self but *external critique used to train a genuine self-correction skill*. The clearest evidence: SFT on offline correction traces fails (the training errors don't match the model's real test errors, and it collapses into one correction mode), but multi-turn online RL on the model's *own* mistakes succeeds, because the model practices fixing the errors it actually makes (Why does self-correction training on offline data fail?). Two notes show how to internalize the critic so you don't pay for it forever. Post-Completion Learning uses the unused sequence space after the model's output to train self-evaluation during training at zero inference cost (Can models learn to evaluate their own work during training?). Self-examining RL (SERL) has the model alternate between answering and judging, deriving reward from ranking consistency; it lifts AlpacaEval win rate from 52.37% to 59.90% without external signals once that judging muscle exists (Can models learn to judge themselves without external rewards?).
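The idea of deriving reward from ranking consistency can be illustrated with a toy sketch. This is not the SERL algorithm: the function, the `judge(first, second)` callback, and the rule of counting only order-invariant verdicts are assumptions chosen to show the general shape of a consistency-based self-reward.

```python
from itertools import combinations


def consistency_and_win_rates(n, judge):
    """Score n responses using only the model's own pairwise judgments.

    `judge(first, second)` is a hypothetical callback returning the index of
    the preferred response when `first` is shown before `second`. A verdict
    counts only if it is the same under both presentation orders.
    """
    wins = [0] * n
    pairs = list(combinations(range(n), 2))
    consistent = 0
    for i, j in pairs:
        verdict_ij = judge(i, j)
        verdict_ji = judge(j, i)
        if verdict_ij == verdict_ji:  # judgment survives swapping the order
            consistent += 1
            wins[verdict_ij] += 1
    consistency = consistent / len(pairs) if pairs else 0.0
    win_rates = [w / (n - 1) if n > 1 else 0.0 for w in wins]
    return consistency, win_rates


# Toy judges: one follows a hidden quality score, one always prefers
# whichever response it is shown first (pure position bias).
quality = [0.9, 0.5, 0.1]
stable_judge = lambda a, b: a if quality[a] >= quality[b] else b
biased_judge = lambda a, b: a

stable = consistency_and_win_rates(3, stable_judge)
biased = consistency_and_win_rates(3, biased_judge)
```

In this toy setup the position-biased judge earns zero consistency and yields no usable reward, which is the property that makes consistency a plausible internal check on the judge itself.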
One caution the corpus adds, easy to miss: how you apply the external signal can quietly damage the model. RLHF-style preference optimization, by rewarding confident single-turn answers, erodes the grounding acts (clarifying questions, understanding checks) that dialogue depends on; the source note puts them at 77.5% below human levels. The result is an "alignment tax" where the model looks more helpful and fails more silently (Does preference optimization harm conversational understanding?). The takeaway for a training recipe: use external critique as the source of truth and the teacher, but invest that signal in building an internal correction loop, and watch that your preference signal isn't optimizing away the very behaviors that make correction possible.
Sources (12 notes)
Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.