SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

The standard SFT-then-RL pipeline doesn't consistently outperform pure RL. CHORD's investigation reveals why: the learning curve follows a "shift-readapt-overfit" progression through three distinct phases. First, disruption: the sudden policy shift induced by expert data degrades existing capabilities. Second, readaptation: the model adapts to the expert patterns and recovers performance. Third, overfitting: with continued training the model overfits to the expert data and loses generalization.

This three-phase pattern appears specifically when expert data significantly diverges from the model's own established patterns. Expert data brings new capabilities but disrupts established ones, creating a fundamental tension in the SFT-then-RL approach.

CHORD's solution reframes SFT not as a separate tuning stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Two control mechanisms manage the expert data influence: a global coefficient that guides the transition from off-policy imitation to on-policy exploration over training, and a per-token weighting function that down-weights highly divergent tokens from off-policy data that could disrupt on-policy training.
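
The two controls described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CHORD's actual implementation: the linear decay schedule, the p * (1 - p) token weight, and all function and parameter names (global_coefficient, token_weight, mu_start, mu_end) are hypothetical choices made for this example. The note itself does not specify the exact functional forms.

```python
import math

def global_coefficient(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Assumed linear decay of the SFT weight mu over training.

    Early steps are imitation-heavy (mu near mu_start); late steps are
    exploration-heavy (mu near mu_end). The schedule shape is hypothetical.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return mu_start + (mu_end - mu_start) * frac

def token_weight(p):
    """Assumed per-token weight p * (1 - p).

    p is the current policy's probability of the expert token. The weight
    shrinks toward zero for tokens the policy finds very unlikely (highly
    divergent from its own patterns) and for tokens it already predicts
    confidently. In a real training loop this weight would be treated as a
    constant (no gradient flows through it).
    """
    return p * (1.0 - p)

def weighted_sft_loss(expert_token_probs):
    """Mean token-weighted negative log-likelihood over expert tokens."""
    losses = []
    for p in expert_token_probs:
        p = max(p, 1e-12)  # guard against log(0)
        losses.append(token_weight(p) * -math.log(p))
    return sum(losses) / len(losses)

def mixed_loss(rl_loss, expert_token_probs, step, total_steps):
    """Blend the on-policy RL loss with the weighted SFT auxiliary loss."""
    mu = global_coefficient(step, total_steps)
    return (1.0 - mu) * rl_loss + mu * weighted_sft_loss(expert_token_probs)
```

Under these assumptions, a call such as mixed_loss(rl_loss, probs, step, total_steps) is dominated by the imitation term at step 0 and by the RL term near the end of training, with no hard boundary between the two stages.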

The insight connects to the broader SFT-RL dynamic. In light of the related note "Does supervised fine-tuning actually improve reasoning quality?", the degradation phase in CHORD's three-phase pattern may correspond to the reasoning quality loss that SFT introduces. In light of the related note "How quickly do errors compound during model self-training?", the overfit phase represents a slower-timescale version of the same cumulative failure dynamic.

The practical implication: rather than treating SFT and RL as sequential stages with a hard boundary, integrating them as a continuous spectrum (from imitation-heavy to exploration-heavy) over training produces more stable and higher-performing results.

Original note title

sft-then-rl training exhibits a shift-readapt-overfit progression when expert data diverges from model patterns