Why does self-correction training on offline data fail?
Can language models learn to correct their own mistakes through supervised training on correction examples? This note explores whether distribution mismatch and behavior collapse prevent self-correction from emerging.
SCoRe (Self-Correction via Reinforcement Learning) starts from a stark baseline: "there is no major work showing successful intrinsic self-correction via prompting alone." Naively prompting LLMs for self-correction can degrade performance. The question is whether self-correction is an impossible capability or just one that requires the right training approach.
SFT on offline correction traces fails through two mechanisms:
Distribution mismatch: the errors made by the data-collection policy (used to generate correction examples) don't match the errors the trained model will make at test time. The model learns corrections for someone else's mistakes, not its own. At test time, it encounters novel error patterns that the correction training never addressed.
Behavior collapse: SFT implicitly gravitates toward a single dominant correction mode — whichever pattern maximizes likelihood across training examples. This mode may work for some error types but fails to generalize. The model learns one way to correct rather than learning when and how to adapt correction strategy to the specific error encountered.
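The two failure mechanisms above can be made concrete with a pair of rough diagnostics. This is an illustrative sketch only: the error-type and strategy labels, the helper names `error_coverage` and `dominant_mode_share`, and the toy data are all assumptions invented for this example, not measurements or methods from SCoRe.

```python
from collections import Counter

def error_coverage(offline_errors, own_errors):
    """Fraction of the model's own errors whose type also appears in the offline data."""
    if not own_errors:
        return 1.0  # nothing to cover
    seen = set(offline_errors)
    covered = sum(1 for err in own_errors if err in seen)
    return covered / len(own_errors)

def dominant_mode_share(corrections):
    """Share of corrections using the single most common strategy (1.0 = fully collapsed)."""
    counts = Counter(corrections)
    return counts.most_common(1)[0][1] / len(corrections)

# Hypothetical error types made by the data-collection policy (offline traces).
offline_errors = ["sign_error", "sign_error", "off_by_one", "unit_mismatch"]
# Hypothetical error types the trained model makes on its own first attempts.
own_errors = ["dropped_case", "sign_error", "dropped_case", "misread_question"]

# Only one of the model's four errors was ever seen in the correction data.
coverage = error_coverage(offline_errors, own_errors)

# Hypothetical correction strategies chosen by an SFT-trained model.
corrections = ["recheck_arithmetic"] * 9 + ["reread_question"]
share = dominant_mode_share(corrections)

print(f"coverage of own errors: {coverage:.2f}")
print(f"dominant correction mode share: {share:.2f}")
```

Low coverage corresponds to learning corrections for someone else's mistakes; a dominant-mode share near 1.0 corresponds to a single correction pattern being applied regardless of the error.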
SCoRe addresses both by training under the model's own distribution of self-generated correction traces using multi-turn online RL. The model generates a first attempt, then generates a correction attempt, and the RL reward is based on whether the correction improved the outcome. Appropriate regularization steers learning toward genuinely effective correction behaviors rather than fitting high-reward responses for given prompts.
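The two-attempt loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `policy` and `is_correct` callables, the binary correctness reward, and the `alpha` weight on the improvement term are all hypothetical choices made for this example, and the regularization the note mentions is not modeled here.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: a policy maps (problem, previous attempt or None) to an answer,
# and a checker maps (problem, answer) to a correctness flag.
Policy = Callable[[str, Optional[str]], str]
Checker = Callable[[str, str], bool]

def shaped_reward(first_correct: bool, second_correct: bool, alpha: float = 0.5) -> float:
    """Final-attempt correctness plus a bonus (or penalty) for the change between attempts."""
    r1, r2 = float(first_correct), float(second_correct)
    return r2 + alpha * (r2 - r1)

def two_attempt_rollout(policy: Policy, is_correct: Checker, problem: str,
                        alpha: float = 0.5) -> Tuple[str, str, float]:
    """Sample a first attempt, then a correction conditioned on it, and score the pair."""
    first = policy(problem, None)
    second = policy(problem, first)
    reward = shaped_reward(is_correct(problem, first), is_correct(problem, second), alpha)
    return first, second, reward

# Toy stand-ins so the sketch runs end to end (assumptions, not a real model).
def toy_policy(problem: str, previous: Optional[str]) -> str:
    return "5" if previous is None else "4"

def toy_checker(problem: str, answer: str) -> bool:
    return answer == "4"

first, second, reward = two_attempt_rollout(toy_policy, toy_checker, "2 + 2 = ?")
print(first, second, reward)
```

With `alpha = 0.5`, a wrong-to-right correction scores 1.5, right-to-right scores 1.0, wrong-to-wrong scores 0.0, and right-to-wrong scores -0.5. Because both attempts come from the same policy being trained, the errors it learns to correct are its own.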
This connects to a broader pattern. The note "Does supervised fine-tuning actually improve reasoning quality?" documents the same SFT-vs-RL dynamic in domain specialization: SFT copies surface patterns, while RL trains under the model's actual distribution. For self-correction specifically, this means the model must practice correcting its own mistakes, not someone else's, which is the same principle that makes deliberate practice effective for humans.
The implication for the self-revision literature is specific. The notes "Does self-revision actually improve reasoning in language models?" and "Does reflection in reasoning models actually correct errors?" show that current models can't self-correct. SCoRe suggests this is a training problem, not a capability limit, but fixing it requires abandoning SFT in favor of online RL.
Inquiring lines that use this note as a source (47)
This note is a source for these synthesized inquiries.
- Can AI self-correct its way out of epistemic circularity?
- Does self-conditioning improve belief-behavior alignment better than external priors?
- Why does self-critiquing actually reduce plan quality in language models?
- Why does online RL succeed where supervised training fails for self-correction?
- How does distribution mismatch between training and deployment break self-correction?
- What makes deliberate practice on your own errors more effective than copying others?
- Why does self-generated training data outperform externally sourced data?
- Does self-revision actually improve reasoning in large language models?
- Can self-distillation reduce catastrophic forgetting in continual learning?
- What are the three root causes models fail at self-correction?
- Can models learn better from critiquing errors than imitating correct responses?
- Why does self-generated training data outperform externally curated domain examples?
- Can AI-generated explanations of errors teach as effectively as self-resolution?
- Can self-consistency checks fully prevent error avalanching in self-training loops?
- Why does external verification stop error amplification but internal self-assessment enable it?
- How does self-distillation differ from standard fine-tuning approaches?
- Can models learn to generate their own training examples effectively?
- Why does self-correction during generation produce reliable labels without exemplars?
- Why does self-revision degrade reasoning accuracy in o1-like models?
- Why does decoupling retriever and generator training create misalignment?
- How does self-revision on wrong answers increase model confidence further?
- Do external perspectives fix the self-evaluation bias in language models?
- Why does single-model self-revision amplify confidence in incorrect answers?
- Why does self-reflection during training fail to improve model self-correction?
- What are the computational trade-offs between training-time vs inference-time consistency correction?
- Does reflection training actually teach models to self-correct their mistakes?
- How should training incorporate external critique versus encouraging self-correction?
- Does self-reflection help models notice their own constraint violations?
- How do instruction backtranslation and MAGPIE demonstrate self-generation principles?
- Why does filtering for correct examples prevent error compounding in self-training?
- How does error avalanching compound failures in self-training iterations?
- Can self-training drift be prevented by applying student compatibility filtering?
- Why does monological training prevent models from overriding statistical priors?
- Why does model self-revision increase confidence while degrading accuracy?
- Why does self-consistency fail as a proxy reward for correctness?
- Can grammar alone repair misunderstanding without ritual correction work?
- Why do models trained on critique fail at self-critique despite strong other-model evaluation?
- How does domain shift expose failures in fixed self-improvement mechanisms?
- What external anchors prevent self-editing from collapsing into circularity?
- Why does uncontrolled self-revision drift toward instance-specific overfitting?
- Why does evaluating errors teach more than imitating correct responses?
- Do models spontaneously develop self-reflection from minimal training signals?
- How does metacognitive self-correction enable models to revise failed strategies?
- What makes policy self-distillation more effective than external teacher distillation?
- Does deliberate self-revision introduce different errors than passive context contamination?
- Why do structure-targeted training negatives fail to fix the underlying problem?
- Why does self-critique fail without external verification signals?
Related concepts in this collection (7)
- Does self-revision actually improve reasoning in language models?
  When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
  SCoRe explains why: current models weren't trained to self-correct under their own error distribution.
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  The confirmatory nature of reflection may be an SFT artifact; online RL could produce genuinely corrective reflection.
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  Same SFT failure mode: surface pattern copying without distributional grounding.
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  SCoRe is a specific case: RL teaches when and how to correct, not just when to reason.
- Does revising your own reasoning actually help or hurt?
  Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.
  SCoRe resolves the internal-vs-external revision dilemma: online RL under the model's own error distribution makes internal revision viable by training the model on its actual mistakes rather than someone else's. This converts internal revision from a harmful default into a trained capability.
- Does a model improve by arguing with itself?
  When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
  SCoRe is designed to prevent degeneration-of-thought: by training under the model's own error distribution with RL rewards for genuine correction, it builds the self-correction capacity that untrained self-revision lacks. This addresses the confidence-amplification failure at its training-time root.
- How quickly do errors compound during model self-training?
  When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
  SCoRe's distribution mismatch finding explains a root cause of error avalanching: self-training loops fail because corrections learned from one distribution don't apply to the model's own evolving errors. Online RL under the model's own error distribution is the principled fix for both single-generation self-correction (SCoRe) and multi-iteration self-training (avalanche prevention).
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Training Language Models to Self-Correct via Reinforcement Learning
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies
- Can Large Reasoning Models Self-Train?
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- SPICE: Self-Play In Corpus Environments Improves Reasoning
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Original note title
SFT on model-generated correction traces fails due to distribution mismatch — multi-turn online RL under the model's own error distribution is required for self-correction