Why does SFT-then-RL training follow a predictable three-phase pattern?
When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
The standard SFT-then-RL pipeline doesn't consistently outperform pure RL. CHORD's investigation reveals why: the learning curve follows a "shift-readapt-overfit" progression through three distinct phases. First, initial disruption — the sudden policy shift from expert data degrades existing capabilities. Second, readaptation — the model adapts to expert patterns and recovers performance. Third, overfitting — the model eventually overfits to the expert data, losing generalization.
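One rough way to locate the three phases in a logged evaluation curve is to find the lowest checkpoint (end of disruption) and the best checkpoint after it (end of readaptation). The sketch below assumes a single held-out score per SFT checkpoint. The function name `label_phases` and the trough/peak heuristic are illustrative assumptions, not something CHORD specifies.

```python
def label_phases(scores):
    """Split an eval curve into shift / readapt / overfit index ranges.

    Heuristic (an assumption for illustration): the global minimum marks
    the end of disruption, and the best score at or after that minimum
    marks the end of readaptation. Everything later counts as overfitting.
    """
    if len(scores) < 3:
        raise ValueError("need at least three checkpoints")
    # First index of the lowest score: where disruption bottoms out.
    trough = min(range(len(scores)), key=lambda i: scores[i])
    # First index of the highest score at or after the trough.
    peak = max(range(trough, len(scores)), key=lambda i: scores[i])
    return {
        "disruption": (0, trough),
        "readaptation": (trough, peak),
        "overfitting": (peak, len(scores) - 1),
    }


# Synthetic scores, made up purely to exercise the function.
example_curve = [0.60, 0.45, 0.40, 0.55, 0.68, 0.66, 0.61]
phases = label_phases(example_curve)
```

A real curve is noisy, so smoothing the scores before taking the minimum and maximum would probably be needed. A curve that never dips yields an empty disruption range, which fits the note's claim that the pattern appears only when expert data diverges strongly from the model's own patterns.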
This three-phase pattern appears specifically when expert data significantly diverges from the model's own established patterns. Expert data brings new capabilities but disrupts established ones, creating a fundamental tension in the SFT-then-RL approach.
CHORD's solution reframes SFT not as a separate tuning stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Two control mechanisms manage the expert data influence: a global coefficient that guides the transition from off-policy imitation to on-policy exploration over training, and a per-token weighting function that down-weights highly divergent tokens from off-policy data that could disrupt on-policy training.
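The two control mechanisms can be sketched as a single mixed loss. This is a minimal illustration under stated assumptions: the linear decay schedule, the `p * (1 - p)` token weight, and all function names and default values are hypothetical choices made here, not details confirmed by this note.

```python
import math


def global_coefficient(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Hypothetical linear decay of the SFT weight over training.

    A high value early favours off-policy imitation; a low value late
    favours on-policy exploration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return mu_start + (mu_end - mu_start) * frac


def token_weight(p):
    """Hypothetical per-token weight for an expert token with policy prob p.

    p * (1 - p) is small when p is near 0 (a highly divergent token) and
    also when p is near 1 (a token the policy already predicts).
    """
    return p * (1.0 - p)


def weighted_sft_loss(expert_token_probs):
    """Mean weighted negative log-likelihood over expert tokens.

    In a real autograd framework the weight would be treated as a constant
    (stop-gradient); plain floats here have no gradient to stop.
    """
    losses = [-token_weight(p) * math.log(p) for p in expert_token_probs]
    return sum(losses) / len(losses)


def mixed_loss(rl_loss, expert_token_probs, step, total_steps):
    """Combine the on-policy RL loss with the weighted SFT auxiliary loss."""
    mu = global_coefficient(step, total_steps)
    return (1.0 - mu) * rl_loss + mu * weighted_sft_loss(expert_token_probs)
```

For example, `mixed_loss(rl_loss, probs, step=0, total_steps=1000)` is dominated by the imitation term, while the same call at `step=1000` is dominated by the RL term. There is no hard boundary between the two regimes.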
The insight connects to the broader SFT-RL dynamic. As discussed in "Does supervised fine-tuning actually improve reasoning quality?", the disruption phase in CHORD's three-phase pattern may correspond to the reasoning-quality loss that SFT introduces. As discussed in "How quickly do errors compound during model self-training?", the overfitting phase may be a slower-timescale version of the same cumulative failure dynamic.
The practical implication: rather than treating SFT and RL as sequential stages with a hard boundary, integrating them as a continuous spectrum (from imitation-heavy to exploration-heavy) over training produces more stable and higher-performing results.
Inquiring lines that use this note as a source 10
- How does non-reasoning SFT prevent overfitting before RL training begins?
- Do grokking phases correspond to transitions between nesting levels?
- Why do models follow a two-phase pattern of procedural then strategic learning?
- Can continuous spectrum training outperform sequential SFT-then-RL stages?
- How does Supervised RL bridge the gap between SFT and RLVR?
- Does weight decay directly cause contractive behavior near training examples?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- Does the productive difficulty band ever stabilize during training?
- What features does a sample reinforce when it moves bands?
- Why does SFT fail when expert demonstrations are too long for small models?
Related concepts in this collection 5
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  connects: the disruption phase may correspond to SFT's reasoning quality degradation
- How quickly do errors compound during model self-training?
  When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
  extends: overfit phase is slow-timescale error compounding
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  supports: the RL phase works by pruning the overfitting artifacts of the SFT phase
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  extends: CHORD's disruption phase is the SFT accuracy trap in temporal progression. SFT raises accuracy while degrading reasoning quality, and CHORD shows this degradation is the first phase of a three-phase dynamic that RL can recover from if properly integrated
- Does training order reshape how models handle different task types?
  Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
  Omni-Thinker's complementary entropy dynamics extend CHORD's temporal framework: CHORD shows SFT→RL follows shift-readapt-overfit within a single domain, while multi-task RL reveals that different domains pull entropy in opposite directions, making training order across domains a mechanistic variable that interacts with CHORD's within-domain phase progression
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- rStar2-Agent: Agentic Reasoning Technical Report
- LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Original note title
sft-then-rl training exhibits a shift-readapt-overfit progression when expert data diverges from model patterns