INQUIRING LINE

What breaks when you apply reinforcement learning after supervised fine-tuning?

This explores the failure modes and limits that show up when you train a model with reinforcement learning on top of an already supervised-fine-tuned base — what RL can and can't fix, and what it quietly damages along the way.


This reads the question as: once you've done supervised fine-tuning, what actually breaks (or fails to improve) when you layer reinforcement learning on top? The corpus tells a surprisingly consistent story — RL is less a teacher than a re-shaper, and several things crack under that pressure.

The biggest surprise is how *little* RL actually changes. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, and those updates are sparse, nearly full-rank, and nearly identical across random seeds [Does reinforcement learning update only a small fraction of parameters?]. That structural narrowness has a flip side: RL with verifiable rewards mostly *surfaces* strategies the model already learned in pretraining rather than installing new ones [How does RL training reshape reasoning and what gets lost?]. So the first thing that 'breaks' is the expectation that RL expands capability: it largely re-weights what's already there, bounded by the pretrained prior.
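
To make the "5–30% of parameters" figure concrete, here is a minimal sketch of how such a fraction could be measured by comparing a checkpoint before and after RL. The function name `update_sparsity`, the `tol` threshold, and the toy weight dictionaries are all hypothetical illustrations, not the measurement procedure used in the cited work.

```python
from typing import Dict, List


def update_sparsity(before: Dict[str, List[float]],
                    after: Dict[str, List[float]],
                    tol: float = 0.0) -> float:
    """Return the fraction of scalar parameters that changed by more than tol."""
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        if len(old_values) != len(new_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(new - old) > tol:
                changed += 1
    if total == 0:
        raise ValueError("no parameters to compare")
    return changed / total


# Toy stand-ins for flattened weights of an SFT checkpoint and its RL-tuned
# successor (hypothetical values; a real model has billions of entries).
sft_weights = {
    "attn.w": [0.10, -0.20, 0.30, 0.40, 0.50, -0.60],
    "mlp.w": [0.05, 0.15, -0.25, 0.35],
}
rl_weights = {
    "attn.w": [0.10, -0.20, 0.31, 0.40, 0.50, -0.60],
    "mlp.w": [0.05, 0.15, -0.25, 0.33],
}

fraction = update_sparsity(sft_weights, rl_weights)
print(f"{fraction:.0%} of parameters changed")  # prints "20% of parameters changed"
```

With real checkpoints the same comparison would run over each tensor in the two state dictionaries. Whether to count exact inequality or a small tolerance is a design choice that this sketch leaves to `tol`.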

Then there's what RL actively degrades. Binary correctness rewards wreck calibration: a reward that only checks right-or-wrong never penalizes a confident wrong answer, so the model has every incentive to guess confidently, and a Brier-style scoring term is needed to claw calibration back [Does binary reward training hurt model calibration?]. Reward signals that are too sparse or too hard make things worse still: training on near-impossible problems teaches degenerate shortcuts (answer repetition, skipping computation) that then *contaminate* capabilities the model already had [Do overly hard RLVR samples actually harm model capabilities?]. And RL-tuned models often look like they've learned to reason when they've really sharpened memorization: GRPO-trained models drop sharply on out-of-distribution variants of problems they ace in-distribution, revealing template-matching rather than a genuine problem-solving procedure [Do fine-tuned language models actually learn optimization procedures?]. It's worth pairing this with a quieter fine-tuning failure: tuning can disconnect a model's reasoning chain from its answer, so the chain-of-thought becomes performative rather than functional [Does fine-tuning disconnect reasoning steps from final answers?].
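
The calibration point can be illustrated with a small sketch. It assumes the model emits a confidence in [0, 1] alongside its answer and that the reward is correctness minus a Brier penalty. The function names and the exact way the two terms are combined are assumptions made for illustration, and may differ from the cited work.

```python
def binary_reward(correct: bool) -> float:
    """Outcome-only reward: stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    best = max(range(steps + 1), key=lambda i: expected_reward(p_correct, i / steps))
    return best / steps


# Under the binary reward a wrong answer scores 0.0 regardless of confidence.
# Under the calibrated reward, a confident wrong answer scores lower than a
# hesitant one.
confident_miss = calibrated_reward(False, 0.9)
hesitant_miss = calibrated_reward(False, 0.1)
print(confident_miss < hesitant_miss)  # prints True

# The expected reward peaks when stated confidence matches the true hit rate.
print(best_confidence(0.3))  # prints 0.3
```

Because the Brier term is a proper scoring rule, reporting the true probability of being right maximizes expected reward, which is why overconfident guessing stops paying off in this sketch.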

Laterally, the corpus also explains *why* the naive SFT→RL handoff stalls, and what fixes it. SFT teaches rigid token-by-token imitation; outcome-only RLVR gives a sparse signal that goes silent when every rollout fails. The interesting middle ground is dense, step-wise rewards: measuring similarity to expert actions at each step lets even small models learn hard reasoning, and works best as a *curriculum* before outcome-based refinement [Can step-wise expert rewards help small models learn hard reasoning?]. Numerical rewards also hit plateaus because a scalar can't say *why* something failed; natural-language critiques punch through those plateaus where more reward scaling can't [Can natural language feedback overcome numerical reward plateaus?]. There's even a predictable two-phase rhythm to RL training: execution correctness gets mastered first, then strategic planning becomes the bottleneck, so RL applied uniformly can spend itself on the wrong phase [Does RL training follow a predictable two-phase learning sequence?].
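
The contrast between outcome-only and step-wise rewards can be sketched as follows. The similarity metric here is Python's `difflib.SequenceMatcher` ratio, chosen as an assumption for illustration; the text above does not say which metric the cited method uses. The helper names and the toy algebra steps are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def outcome_reward(model_steps: List[str], expert_steps: List[str]) -> float:
    """Sparse signal: 1.0 only if the final step matches the expert's exactly."""
    if not model_steps or not expert_steps:
        return 0.0
    return 1.0 if model_steps[-1] == expert_steps[-1] else 0.0


def stepwise_rewards(model_steps: List[str], expert_steps: List[str]) -> List[float]:
    """Dense signal: one similarity score in [0, 1] per expert step."""
    rewards = []
    for i, expert_action in enumerate(expert_steps):
        # A step the model never produced is scored against the empty string.
        model_action = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, model_action, expert_action).ratio())
    return rewards


expert = ["factor x^2 - 5x + 6 as (x - 2)(x - 3)", "roots are x = 2 and x = 3"]
rollout = ["factor x^2 - 5x + 6 as (x - 2)(x - 3)", "roots are x = 1 and x = 6"]

print(outcome_reward(rollout, expert))    # prints 0.0: the final answer is wrong
print(stepwise_rewards(rollout, expert))  # first step scores 1.0, second scores below 1.0
```

Even though this rollout fails on the outcome, the per-step scores still say which part was right, which is the kind of signal that outcome-only RLVR loses when every rollout fails.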

The takeaway you might not have gone looking for: the things that 'break' in SFT→RL are less RL's fault than the reward's. Reward design determines whether RL deepens knowledge [Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?] or just amplifies shortcuts: dense vs. sparse, binary vs. proper-scored, numerical vs. natural-language, and whether successful and failed episodes are handled asymmetrically [Should successful and failed episodes be processed differently?]. And no amount of clever RL escapes a deeper ceiling: agents trained on static expert data can't learn from their own failures, so competence is capped by what the curators imagined [Can agents learn beyond what their training data shows?].


Sources (12 notes)

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about RL post-training after supervised fine-tuning. The question: what actually breaks in the SFT→RL pipeline, and does that still hold?

What a curated library found — and when (dated claims, not current truth):
Findings span Sept 2024–May 2026. The corpus reports:
• RL updates only 5–30% of parameters in sparse, nearly full-rank subnetworks; RL largely re-weights pretrained knowledge rather than installing new capability (2025-05).
• Binary correctness rewards degrade calibration; proper-scoring terms are required to recover it (2024-09).
• Sparse or over-hard reward signals induce degenerate shortcuts (answer repetition, skipped computation) that contaminate existing capability (2025-10).
• GRPO-trained models collapse on out-of-distribution variants, revealing template-matching instead of robust reasoning (2025-10).
• Step-wise expert-similarity rewards and natural-language feedback outperform outcome-only or numerical-only signals; RL training follows a two-phase rhythm (execution first, then planning) (2025-06, 2025-10).

Anchor papers (verify; mind their dates):
• arXiv:2505.11711 (2025-05) — RL finetunes small subnetworks
• arXiv:2506.03106 (2025-06) — Natural language + numerical feedback
• arXiv:2510.25992 (2025-10) — Supervised RL with step-wise rewards
• arXiv:2504.07912 (2025-04) — RL amplifies pretraining behaviors

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer models (o1, o3, frontier agents), distributed training methods, on-policy vs. offline RL scheduling, or new evals have since relaxed or overturned it. Separate the durable question (RL's true limits on capability expansion) from the perishable limitation (e.g., binary rewards, sparse signals — may be solved by better reward design or richer feedback loops). Cite what solved it; flag where constraints still hold.
(2) Surface the strongest contradicting or superseding work from the last ~6 months that challenges the "RL amplifies but doesn't expand" narrative or the "reward design dominates outcome" claim.
(3) Propose 2 research questions that assume the regime has moved: e.g., does curriculum RL with adaptive difficulty or multi-objective rewards unlock genuine capability gain beyond pretraining? Can agents learn from self-generated failures at scale?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
