How should training incorporate external critique versus encouraging self-correction?

This explores a real tension in how models learn to get better: should training lean on outside critics (other models, judges, human corrections, tools) or push models to catch and fix their own mistakes — and the corpus turns out to have a fairly clear answer about when each works.

The corpus lands on a sharper position than "it depends": pure self-correction has a structural ceiling, while critique (external or carefully internalized) is what actually moves capability. The cleanest statement of the ceiling is the generation–verification gap: a model can only bootstrap itself when it judges answers better than it produces them, and that gap grows with model scale but vanishes for factual tasks (see "What limits how much models can improve themselves?"). When you try to improve with the model alone anyway, things go wrong in characteristic ways: diversity collapse, reward hacking, and a model that revises its own uncertain answers into *more* confident wrong ones rather than corrections (see "Can models reliably improve themselves without external feedback?" and "Does revising your own reasoning actually help or hurt?").
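The gap described above can be made concrete with a small sketch. This is one simple way to operationalize it, not the definition used in the cited notes: it assumes each prompt has several sampled answers, each with a self-assigned score and a ground-truth correctness label, and compares the accuracy of the answers the model's own scores would select against the accuracy of everything it generates. The function names and the toy data are hypothetical.

```python
def generation_accuracy(samples):
    """Fraction of all sampled answers that are correct (what the model produces)."""
    flat = [correct for prompt in samples for (_, correct) in prompt]
    return sum(flat) / len(flat)


def verified_accuracy(samples):
    """Fraction of prompts where the answer the model scores highest is correct."""
    picks = [max(prompt, key=lambda s: s[0])[1] for prompt in samples]
    return sum(picks) / len(picks)


def gv_gap(samples):
    """Positive gap: the model judges better than it generates, so self-filtering can help."""
    return verified_accuracy(samples) - generation_accuracy(samples)


# Toy data (assumption): one list per prompt of (self_score, is_correct) pairs.
toy = [
    [(0.9, 1), (0.2, 0), (0.4, 0)],
    [(0.3, 0), (0.8, 1), (0.1, 0)],
    [(0.7, 0), (0.6, 1), (0.5, 0)],  # here the self-score picks a wrong answer
]

print(f"generation accuracy: {generation_accuracy(toy):.3f}")
print(f"verified accuracy:   {verified_accuracy(toy):.3f}")
print(f"gap:                 {gv_gap(toy):.3f}")
```

When the gap is near zero, filtering a model's outputs by its own scores gives it nothing new to learn from, which is the ceiling the paragraph describes.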

The most useful reframing in the collection is that methods which look like self-improvement usually smuggle external signal back in. "Reliable" self-improvement works because it leans on past model versions, third-party judges, user corrections, or tool feedback; the anchor is external even when the loop looks closed (see "Can models reliably improve themselves without external feedback?"). A companion note pushes this further: metacognition may need to be *externalized* rather than learned, because a model grading itself eventually runs into self-valuation problems (see "What actually constrains large language models from self-improvement?"). The failure mode also has a name, degeneration of thought: a single model rethinking its own reasoning amplifies its errors, a pattern that reverses only when genuinely different models are brought in to debate it (see "Does a model improve by arguing with itself?").

External critique is more than an error check: it turns out to be a *better teacher* than imitation, and that's the surprising part. Training a model to critique noisy, wrong responses produces deeper understanding than training it on correct answers, because critique forces engagement with failure modes instead of surface patterns; even imperfect critique supervision beats correct-answer imitation (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). This is the flip side of why copying a stronger model ("imitate ChatGPT") captures style but closes no real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Critique inside the training loop also does something accuracy numbers hide: step-level critique counteracts "tail narrowing" and keeps solution diversity alive across self-training rounds, which matters more than the test-time bump (see "Do critique models improve diversity during training itself?").
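The difference between imitation and critique training is easiest to see in the shape of a single training example. The sketch below is illustrative only: the field names, the prompt wording, and both helper functions are assumptions made for this example, not the format used by any cited method.

```python
def imitation_example(question, correct_answer):
    """Imitation-style SFT: the target is a correct answer to copy."""
    return {"input": question, "target": correct_answer}


def critique_example(question, noisy_response, critique):
    """Critique-style training: the target is a critique of a possibly wrong response."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate response:\n{noisy_response}\n\n"
        "Critique the candidate response."
    )
    return {"input": prompt, "target": critique}


# Toy data (assumption): a wrong candidate answer and a critique that explains the error.
question = "What is 17 * 6?"
imit = imitation_example(question, "102")
crit = critique_example(
    question,
    noisy_response="17 * 6 = 96",
    critique="Incorrect: 17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102, not 96.",
)

print(imit["target"])
print(crit["target"])
```

In the critique example the wrong response sits in the input, so the model has to engage with how the answer failed rather than reproduce a surface pattern.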

The practical synthesis, then, is not external-or-self but *external critique used to train a genuine self-correction skill*. The clearest evidence: SFT on offline correction traces fails (the training errors don't match the model's real test errors, and it collapses into one correction mode), but multi-turn online RL on the model's *own* mistakes succeeds, because the model practices fixing the errors it actually makes (see "Why does self-correction training on offline data fail?"). Two notes show how to internalize the critic so you don't pay for it forever. Post-Completion Learning uses the unused sequence space after the model's output to train self-evaluation during training at zero inference cost (see "Can models learn to evaluate their own work during training?"). Self-examining RL has the model alternate between answering and judging, deriving reward from ranking consistency, and reaches real gains without external signals once that judging muscle exists (see "Can models learn to judge themselves without external rewards?").
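To give a feel for what "deriving reward from ranking consistency" could mean, here is a minimal sketch. It is not the self-examining RL algorithm from the cited note: it assumes a judge that returns the preferred response of a pair, queries each pair in both orders, counts a win only when the two verdicts agree, and reports the agreement rate as a consistency signal. `pairwise_rewards` and `toy_judge` are hypothetical names, and the toy judge is a stand-in for a model call.

```python
from itertools import combinations


def pairwise_rewards(responses, judge):
    """Win rate per response, plus the fraction of pairs judged the same in both orders.

    Assumes the responses are distinct strings, since they are used as dict keys.
    """
    wins = {r: 0 for r in responses}
    pairs = list(combinations(responses, 2))
    consistent = 0
    for a, b in pairs:
        first = judge(a, b)   # verdict with a shown first
        second = judge(b, a)  # same pair, order swapped
        if first == second:
            consistent += 1
            wins[first] += 1  # only order-stable verdicts earn a win
    opponents = len(responses) - 1
    rewards = {r: wins[r] / opponents for r in responses}
    return rewards, consistent / len(pairs)


def toy_judge(a, b):
    """Stand-in judge (assumption): prefers the longer answer, first-shown on ties."""
    return a if len(a) >= len(b) else b


responses = ["x", "yy", "zz", "wwww"]
rewards, consistency = pairwise_rewards(responses, toy_judge)
print(rewards)
print(f"judge consistency: {consistency:.3f}")
```

The tie between "yy" and "zz" flips with presentation order, so that pair earns no reward and lowers the consistency score. That is the kind of signal a self-judging loop can use without any external label.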

One caution the corpus adds, easy to miss: how you apply the external signal can quietly damage the model. RLHF-style preference optimization, by rewarding confident single-turn answers, erodes the grounding acts (clarifying questions, understanding checks) that dialogue depends on. The result is an "alignment tax" where the model looks more helpful and fails more silently (see "Does preference optimization harm conversational understanding?"). The takeaway for a training recipe: use external critique as the source of truth and the teacher, invest that signal in building an internal correction loop, and watch that your preference signal isn't optimizing away the very behaviors that make correction possible.

Sources (12 notes)

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about external critique versus self-correction in LLM training. The question remains open: what is the optimal balance, and has recent work shifted the constraint boundaries?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A library curated from arXiv synthesis reports:
• Pure self-correction hits a structural ceiling: the generation–verification gap (the margin between what a model can judge and what it can produce) grows with model scale but vanishes for factual tasks, so pure self-improvement stalls wherever the gap is absent (~2024–2025).
• Self-improvement that appears closed-loop smuggles external signal back in: past versions, third-party judges, user corrections, or tool feedback anchor every "reliable" self-improvement method (~2024).
• Training on external critique produces deeper understanding than training on correct answers alone; critique forces engagement with failure modes rather than surface patterns (~2025).
• Online RL on a model's own mistakes succeeds where SFT on offline correction traces fails, due to distribution mismatch between training errors and real test errors (~2024–2025).
• Internalizing critique via Post-Completion Learning (unused post-EOS space) and self-examining RL (alternating actor–judge) reaches gains without external signals at test time (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2412.02674 (Mind the Gap, 2024-12) — self-improvement capability limits
• arXiv:2501.17703 (Critique Fine-Tuning, 2025-01) — critique vs. imitation
• arXiv:2507.20252 (Post-Completion Learning, 2025-07) — internalizing evaluation
• arXiv:2603.19461 (Hyperagents, 2026-03) — multi-agent and externalized critique

Your task:
(1) RE-TEST EACH CONSTRAINT. For the generation–verification gap and the failure modes of pure self-correction, check whether scaling, new RL methods (e.g., outcome reward, process reward hybrids), tool-use orchestration, or multi-agent setups have since relaxed or overturned these limits. Separate the durable question (what makes self-correction structurally hard?) from the perishable limitation (whether a particular method solves it). Cite what relaxed it; flag where the constraint still holds.
(2) Surface the strongest DISAGREEMENT or SUPERSEDING work from the last 6 months. Look for papers arguing pure self-improvement is viable, or that external critique is not necessary at scale, or that the alignment tax is negligible.
(3) Propose 2 research questions that assume the training regime may have moved: e.g., "Can multi-agent debate fully replace external human critique in the training loop?" or "Does internalized critique learned via RL generalize to out-of-distribution errors?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
