Does reinforcement learning preserve reasoning quality better than supervised fine-tuning?
This explores whether RL training keeps a model's actual reasoning process intact better than supervised fine-tuning (SFT) does — not just whether either lifts benchmark scores, but whether the reasoning underneath the right answers stays genuine.
The corpus leans clearly toward yes, but with a sharp twist about what "better" actually means. The most pointed evidence is what one might call the SFT accuracy trap: supervised fine-tuning can raise benchmark accuracy while cutting Information Gain, the source note's measure of reasoning-step quality, by 38.9 percent, because the model learns to produce correct-looking answers through after-the-fact rationalization rather than real inference (Does supervised fine-tuning improve reasoning or just answers?). Standard metrics hide this because they only check whether the final answer is right. So the headline comparison isn't RL vs. SFT on scores; it's that SFT can quietly hollow out reasoning while looking like an improvement.
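To make the trap concrete, here is a minimal sketch of how a step-level metric can separate two models that a final-answer metric scores identically. The definition used here (average change in the log-probability of the correct answer after each reasoning step) is an assumption for illustration and is not necessarily how the cited note computes Information Gain; the function name and all probabilities are made up.

```python
import math

def mean_step_gain(answer_probs):
    """Average per-step change in log-probability of the correct answer.

    answer_probs[i] is a hypothetical probability assigned to the correct
    answer after the first i reasoning steps (index 0 = no reasoning yet).
    This definition is an illustrative assumption, not the cited metric.
    """
    if len(answer_probs) < 2:
        raise ValueError("need a starting probability and at least one step")
    gains = [
        math.log(after) - math.log(before)
        for before, after in zip(answer_probs, answer_probs[1:])
    ]
    return sum(gains) / len(gains)

# Made-up trajectories that end at the same confidence, so a final-answer
# metric cannot tell them apart.
genuine = [0.10, 0.30, 0.60, 0.90]       # each step moves toward the answer
rationalized = [0.85, 0.86, 0.88, 0.90]  # answer was settled before any step

print(f"genuine:      {mean_step_gain(genuine):.3f} nats per step")
print(f"rationalized: {mean_step_gain(rationalized):.3f} nats per step")
```

Both trajectories finish at 0.90, yet the second one's steps contribute almost nothing, which is the signature of reasoning written after the answer was already chosen.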
RL, by contrast, tends to reward the reasoning, not just the endpoint. Reinforcement learning from augmented generation (RLAG) rewards both answer accuracy and the rationality of the explanation, internalizing coherent knowledge structures and outperforming SFT precisely because it prioritizes reasoning quality over token-level correctness (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). In agentic settings, process rewards for metacognition train agents to reason more efficiently, cutting repetitive actions by 31 percent relative to outcome-only methods while generalizing better than supervised fine-tuning alone (Can RL agents learn to reason better, not just succeed?). And RL can even improve domain reasoning by *pruning*: suppressing wrong facts along reasoning paths rather than cramming in new ones, the opposite of SFT's imitation-of-examples approach (Does RL improve domain reasoning by adding knowledge or removing it?).
But here's the twist that should reframe the whole question: a large strand of the corpus argues RL doesn't *create* reasoning at all. Base models already carry latent reasoning ability that minimal training merely unlocks (Do base models already contain hidden reasoning ability?). RL post-training seems to teach a model *when* to reason, not *how*: hybrid models recover 91 percent of the gains by routing tokens alone (Does RL post-training create reasoning or just deploy it?). And RLVR (RL with verifiable rewards) improves how efficiently a model samples from strategies it already had, without expanding its actual capability boundary; at high sampling counts, base models can even outperform their RLVR-trained versions (What does reward learning actually do to model reasoning?; Does RLVR actually expand what models can reason about?). Notably, that same body of work flags distillation, an SFT-style transfer, as the thing that genuinely moves new reasoning patterns across models.
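The pass@k crossover is easier to see with numbers. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), on two hypothetical problems. The per-problem correct counts are invented for illustration and are not taken from the cited work: the "RLVR" model is sharper on the problem it can solve but never solves the second one, while the "base" model solves it rarely.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct counts out of n samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100                # samples drawn per problem (hypothetical)
base_correct = [30, 2]  # base model: unreliable, but covers both problems
rlvr_correct = [90, 0]  # RLVR model: reliable on one, never solves the other

for k in (1, 64):
    base = mean_pass_at_k(base_correct, N, k)
    rlvr = mean_pass_at_k(rlvr_correct, N, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

At k=1 the RLVR counts win on sampling efficiency; at k=64 the base counts win because a problem with zero correct samples contributes nothing at any k. That is the shape of the argument: narrowing the distribution helps low-k accuracy without enlarging the set of solvable problems.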
So the cleaner answer is this: RL preserves the *integrity* of reasoning better, selecting and sharpening genuine inference instead of rewarding post-hoc rationalization, but it largely works by surfacing capability the base model already has, not by installing new reasoning. SFT's risk is degrading the process while flattering the score; RL's limit is that it's an elicitation engine, not a capability engine. The frontier work tries to get past RL's own ceilings: natural-language critiques break through plateaus that more numerical reward can't (Can natural language feedback overcome numerical reward plateaus?), and checklist-style decomposed rewards make RL workable on subjective, hard-to-verify tasks (Can breaking down instructions into checklists improve AI reward signals?).
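As a rough illustration of the checklist idea, the sketch below scores a response as the weighted fraction of sub-criteria it satisfies. This is not the RLCF or RaR implementation; `ChecklistItem`, `checklist_reward`, the criteria, and the weights are all hypothetical, and the string checks stand in for whatever verifier or judge model a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChecklistItem:
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for a verifier or judge call

def checklist_reward(response: str, items: List[ChecklistItem]) -> float:
    """Weighted fraction of checklist items the response satisfies (0 to 1)."""
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    earned = sum(item.weight for item in items if item.check(response))
    return earned / total

# Hypothetical instruction: "Summarize in prose, name the author, keep it short."
items = [
    ChecklistItem("names the author", 2.0, lambda r: "Lovelace" in r),
    ChecklistItem("uses no bullet points", 1.0,
                  lambda r: not any(line.lstrip().startswith("-")
                                    for line in r.splitlines())),
    ChecklistItem("stays within 40 words", 1.0, lambda r: len(r.split()) <= 40),
]

good = ("Ada Lovelace wrote the first published algorithm. "
        "Her notes anticipated general-purpose computing.")
weak = "- She wrote an algorithm."

print(checklist_reward(good, items))
print(checklist_reward(weak, items))
```

The point of the decomposition is that each criterion can be checked on its own, so the reward says which parts of the instruction were met instead of collapsing everything into one holistic score.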
The thing you didn't know you wanted to know: the most interesting result here may be that simple accuracy rewards can make sophisticated domain reasoning *emerge* on their own, with no chain-of-thought distillation from a teacher required (Can simple rewards alone teach complex domain reasoning?), and that reasoning may not even belong in post-training at all, since treating chain-of-thought as a rewarded action *during pretraining* lifts reasoning benchmark results by 19 percent (Can chain-of-thought reasoning be learned during pretraining itself?). The RL-vs-SFT contest may be a fight over which method best reveals reasoning that was planted much earlier.
Sources (12 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.
RL enhances medical reasoning by suppressing incorrect domain knowledge during reasoning—not by expanding what models know. Evidence shows RL achieves +12.4 point knowledge improvement by removing low-reward reasoning trajectories that invoke wrong facts.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.