How does pretraining determine what RL can later teach a model?
This explores how a model's pretraining sets the boundaries of what reinforcement learning can do afterward — whether RL builds new abilities or just surfaces what's already latent.
The corpus converges on a striking answer: for most reasoning, RL doesn't teach, it *activates*. Verifiable rewards act as catalysts that surface capabilities already laid down in pretraining rather than building new ones, and the updates themselves are structurally sparse, touching only a fraction of parameters and bounded by the pretrained prior (How does RL training reshape reasoning and what gets lost?). One sharp framing puts it as a division of labor: pretraining decides *how* to reason, RL decides *when* to deploy that reasoning. Hybrid models recover 91% of the gains by routing tokens alone, and the activation vectors for reasoning strategies already exist before any RL touches the model (Does RL post-training create reasoning or just deploy it?).
But the picture isn't purely "RL only reveals." The boundary is conditional. RL produces genuine capability gains, not just better sampling, when two things hold: pretraining left *headroom* (the primitives are present but underused) and the RL data targets the *edge of competence* rather than what the model already nails (When does RL actually extend reasoning beyond pretraining?). Where those conditions fail, RL just sharpens the distribution it inherited. So pretraining doesn't just supply the raw material; it also determines whether there's any slack left for RL to exploit.
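One way to picture "edge of competence" data selection is as a filter on the base model's empirical solve rate per problem. The sketch below is an illustration under assumptions, not the procedure used in the cited note: the function name `select_edge_of_competence`, the `low` and `high` thresholds, and the example solve rates are all hypothetical.

```python
from typing import Dict, List


def select_edge_of_competence(
    solve_rates: Dict[str, float],
    low: float = 0.05,   # hypothetical threshold: below this, treated as out of reach
    high: float = 0.8,   # hypothetical threshold: above this, treated as already mastered
) -> List[str]:
    """Return problem ids whose base-model solve rate lies in [low, high]."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    return sorted(pid for pid, rate in solve_rates.items() if low <= rate <= high)


# Hypothetical solve rates, e.g. the fraction of sampled attempts that were correct.
rates = {"easy-1": 0.97, "edge-1": 0.40, "edge-2": 0.10, "hard-1": 0.0}
selected = select_edge_of_competence(rates)
print(selected)
```

Problems the model already nails and problems it never solves both fall outside the band, which is one simple reading of "primitives present but underused."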
The main exception is complex, multi-step planning. For standard reasoning, RL activates latent ability, but for tasks requiring deep coordination, RL can generate genuinely novel strategies that base models can't reach even with massive sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). Prolonged RL on diverse, non-mathematical tasks, with KL control and policy resetting, beats base models at *every* pass@k level. If RL were only sharpening the base distribution, the base model would be expected to catch up at large k, so a gap at every k means the capability boundary genuinely moved, especially in domains where pretraining never established strong patterns (Can reinforcement learning discover reasoning strategies base models cannot?). The lesson flips elegantly: RL extends furthest precisely where pretraining left the most gaps.
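Since the argument leans on pass@k, a small sketch of how that metric is commonly estimated may help. The code uses the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), for n samples of which c are correct. Whether the cited notes use exactly this estimator is an assumption, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts of correct samples out of n = 100 attempts on one problem.
n = 100
base_correct, rl_correct = 2, 40
for k in (1, 10, 100):
    base = pass_at_k(n, base_correct, k)
    rl = pass_at_k(n, rl_correct, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

On any single problem where the base model has at least one correct sample, its pass@k climbs toward 1 as k grows. A gap that persists at large k therefore has to come from problems where the base model's correct count is zero, which is the sense in which the boundary "moved."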
There's also a quieter way pretraining constrains RL: through *format*. Within the first epoch, RL converges hard on a single dominant format distribution inherited from pretraining and suppresses the alternatives. The format that wins depends on model scale, not necessarily on which one performs best, and the collapse is largely invisible when you start from a proprietary base model and can't see what was lost (Does RL training collapse format diversity in pretrained models?). Underneath, RL's mechanism is mostly *suppression*: it sparsely updates 5–30% of parameters, damping wrong trajectories rather than amplifying right ones, in a predictable two-phase arc of procedural consolidation followed by strategic exploration (What actually changes inside a model during RL training?; Does RL training follow a predictable two-phase learning sequence?).
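To make "sparsely updating a fraction of parameters" concrete, here is a minimal sketch of how one might measure that fraction by comparing a base checkpoint with its RL-trained counterpart. How the cited notes define or measure sparsity is not stated here, so the function `update_fraction`, its `tol` parameter, and the toy checkpoints are all hypothetical.

```python
from typing import Dict, List


def update_fraction(
    base: Dict[str, List[float]],
    tuned: Dict[str, List[float]],
    tol: float = 0.0,  # hypothetical tolerance; 0.0 counts any change at all
) -> float:
    """Fraction of scalar parameters whose value moved by more than tol."""
    changed = 0
    total = 0
    for name, before in base.items():
        after = tuned[name]
        if len(after) != len(before):
            raise ValueError(f"shape mismatch for {name}")
        total += len(before)
        changed += sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    if total == 0:
        raise ValueError("no parameters to compare")
    return changed / total


# Toy checkpoints (hypothetical values): 2 of 8 scalars differ after RL.
base_ckpt = {"layer0.w": [0.1, -0.2, 0.3, 0.4], "layer1.w": [1.0, 1.5, -0.5, 0.0]}
rl_ckpt = {"layer0.w": [0.1, -0.2, 0.35, 0.4], "layer1.w": [1.0, 1.5, -0.5, 0.02]}
print(update_fraction(base_ckpt, rl_ckpt))
```

Plain lists stand in for weight tensors to keep the example dependency-free; with real checkpoints the same comparison would run per tensor.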
The deeper, less obvious takeaway: pretraining and post-training scale on *different axes* of behavior. Scaling pretraining improves factual knowledge, while scaling fine-tuning improves helpfulness. The decoupling has architectural roots, since pretraining enriches lower-layer knowledge storage while later training reshapes upper-layer expression (scaling-fine-tuning-improves-improves-helpfulness-while-scaling-pretraining-improves-fact). So if you want RL to teach something new, the move is often upstream: PretrainZero shows you can even run RL *during* pretraining, gaining ground by actively selecting not-yet-mastered content. The gain comes from *which* content gets reinforced, not from new data (Can reinforcement learning improve models during general pretraining?). And whatever recipe you choose, the ceiling is set early: RL scales along sigmoid curves whose asymptote is fixed by recipe choices, with implementation details only affecting how fast you get there (Does RL training follow predictable scaling curves?).
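The "ceiling versus speed" distinction in the sigmoid claim can be sketched with a saturating curve in compute. The parameterization below (a floor, an asymptote, a midpoint, and a slope) is one plausible form chosen for illustration; the text above only says the curves are sigmoidal, so the function `sigmoid_scaling` and every number in the example are assumptions.

```python
def sigmoid_scaling(compute: float, asymptote: float, midpoint: float,
                    slope: float, floor: float = 0.0) -> float:
    """Saturating curve: floor + (asymptote - floor) / (1 + (midpoint / compute) ** slope)."""
    if compute <= 0 or midpoint <= 0 or slope <= 0:
        raise ValueError("compute, midpoint and slope must be positive")
    return floor + (asymptote - floor) / (1.0 + (midpoint / compute) ** slope)


# Two hypothetical runs of the same recipe: same ceiling, different efficiency.
efficient = dict(asymptote=0.6, midpoint=1e3, slope=1.0, floor=0.2)
slow = dict(asymptote=0.6, midpoint=1e4, slope=1.0, floor=0.2)
for c in (1e2, 1e3, 1e4, 1e6, 1e9):
    print(f"compute={c:.0e}  efficient={sigmoid_scaling(c, **efficient):.3f}  "
          f"slow={sigmoid_scaling(c, **slow):.3f}")
```

In this picture, recipe choices would move `asymptote`, while implementation details would move `midpoint` or `slope`: the slower run trails at every finite budget but approaches the same ceiling.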
Sources 11 notes
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
PretrainZero shows that RL during pretraining on Wikipedia, combined with active selection of not-yet-mastered content, outperforms standard pretraining and random reinforcement. The gain comes from *which* content is reinforced, not new data.
A large-scale study (400K GPU-hours, 200+ models) shows that RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.