What pretraining formats encode latent reasoning strategies that RLVR can surface?
This explores what's actually sitting in pretraining data that RLVR can later 'switch on' — and the corpus reframes the question, since it suggests RLVR surfaces *formats and strategies* already present rather than encoding new ones.
The honest answer the corpus keeps circling back to: RLVR doesn't teach reasoning, it *selects* it. The reasoning strategies are already latent in the base model's activations, and post-training's job is elicitation, not acquisition (Do base models already contain hidden reasoning ability?). So the real question becomes: which pretraining formats are sitting there waiting to be amplified, and what determines which one wins?
The sharpest finding here is that RLVR converges on a *single dominant format* from pretraining within the first epoch, while actively suppressing the alternatives (Does RL training collapse format diversity in pretrained models?). Pretraining seeds multiple competing reasoning-format distributions; RL picks one and collapses the rest. Strikingly, the winner depends on model scale, not necessarily on which format performs best. The format RLVR surfaces therefore isn't guaranteed to be the strongest one, just the one favored at that scale. And because this dynamic is largely hidden when you start from proprietary pretrained models, most people never see which format their pretraining actually encoded.
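One way to make "collapse" concrete is to track the entropy of the format distribution across checkpoints. The sketch below is only illustrative: it assumes each sampled rollout has already been tagged with a format label, and the label names and counts are hypothetical, not taken from the cited experiments.

```python
from collections import Counter
from math import log2

def format_entropy(labels):
    """Shannon entropy (bits) of the empirical format distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    # sum p * log2(1/p); equals 0.0 when a single format holds all the mass
    return sum((n / total) * log2(total / n) for n in counts.values())

def dominant_format(labels):
    """Most frequent format label and its share of the samples."""
    label, n = Counter(labels).most_common(1)[0]
    return label, n / len(labels)

# Hypothetical format tags for rollouts sampled before and after RL.
before_rl = ["prose"] * 40 + ["code"] * 35 + ["latex"] * 25
after_epoch_1 = ["prose"] * 2 + ["code"] * 97 + ["latex"] * 1

h_before = format_entropy(before_rl)
h_after = format_entropy(after_epoch_1)
print(f"entropy before: {h_before:.3f} bits, after: {h_after:.3f} bits")
print("dominant after RL:", dominant_format(after_epoch_1))
```

A sharp entropy drop between the base checkpoint and the end of the first epoch is the signature the paragraph describes. Which format ends up dominant is an empirical question this sketch cannot answer.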
What makes a format *surfaceable* in the first place? It has to already produce viable trajectories. RLVR improves sampling efficiency within existing capability boundaries: it narrows sampling toward solutions the base model could already reach, but base models still beat RLVR models at high pass@k, which is the tell that no new territory got added (Does RLVR actually expand what models can reason about?). A single training example, or even spurious rewards, can trigger activation, precisely because the reward isn't supplying the strategy; pretraining already did (What does reward learning actually do to model reasoning?). The cleaner framing: RL post-training teaches a model *when* to reason, not *how*; reasoning activation vectors pre-exist before any RL touches the weights (Does RL post-training create reasoning or just deploy it?).
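The pass@k argument is easier to see with numbers. The sketch below uses the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The source does not say which estimator the cited analysis used, so treat that choice as an assumption. The per-problem counts are made up to illustrate the crossover, not measured.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

N = 64  # samples drawn per problem (hypothetical)

# Hypothetical correct-sample counts on four problems.
base_correct = [8, 4, 2, 1]    # base model: rarely right, but never at zero
rlvr_correct = [48, 40, 0, 0]  # RL-tuned: sharp on two problems, blank on two

def mean_pass_at_k(correct_counts, k):
    """Average pass@k over problems."""
    return sum(pass_at_k(N, c, k) for c in correct_counts) / len(correct_counts)

for k in (1, 8, 64):
    base = mean_pass_at_k(base_correct, k)
    rlvr = mean_pass_at_k(rlvr_correct, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these toy counts the RL-tuned model wins at k=1 and loses at k=64. That is the shape of the finding: sharper sampling, no wider coverage.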
A less obvious point: 'surfacing latent reasoning' and 'scoring better on benchmarks' are separable phenomena. RLVR can genuinely activate pretrained reasoning patterns *and* the benchmark gains can simultaneously be memorization from contaminated data. Both are true at once, measured at different levels (Can genuine reasoning activation coexist with contaminated benchmarks?). On contaminated math sets the 'reasoning' improvement is mostly retrieval: Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts yet scores 0.0% on the post-release LiveMathBench. On clean benchmarks only correct rewards help, and random or inverted rewards fail or degrade performance (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So a format can look like it encodes reasoning when it really encodes the answer key.
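A partial-prompt contamination probe can be sketched in a few lines. The source does not give the exact protocol, so everything here is an assumption: the hypothetical `reconstruction_rate` helper, the 60% prefix cut, the exact-match criterion, and the toy stand-in for a model.

```python
from typing import Callable, Sequence

def reconstruction_rate(
    problems: Sequence[str],
    complete: Callable[[str], str],
    keep_frac: float = 0.6,  # assumed prefix fraction, not from the source
) -> float:
    """Share of problems whose withheld tail the model reproduces verbatim."""
    hits, scored = 0, 0
    for text in problems:
        words = text.split()
        cut = int(len(words) * keep_frac)
        if cut == 0 or cut == len(words):
            continue  # too short to split into prefix and remainder
        prefix = " ".join(words[:cut])
        remainder = " ".join(words[cut:])
        scored += 1
        if complete(prefix).strip().startswith(remainder):
            hits += 1
    return hits / scored if scored else 0.0

# Toy stand-in for a model: it "memorized" one benchmark item verbatim.
memorized = ["Find the sum of all positive integers n such that n squared is less than 50"]
fresh = ["Compute the number of lattice points inside a circle of radius seven"]

def toy_complete(prefix: str) -> str:
    for text in memorized:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""  # nothing memorized, so nothing to regurgitate

print("old benchmark:", reconstruction_rate(memorized, toy_complete))
print("post-release :", reconstruction_rate(fresh, toy_complete))
```

A high rate on an older benchmark alongside a near-zero rate on a post-release one is the pattern the paragraph reads as contamination rather than reasoning.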
The practical corollary: if pretraining didn't seed a usable format, RLVR has nothing to surface, and forcing it backfires. Training on problems that are too hard makes the model invent degenerate shortcuts (answer repetition, computation-skipping) that contaminate the genuine capabilities it did have (Do overly hard RLVR samples actually harm model capabilities?). That's why a curriculum that *creates* the format first works better: run imitation-style supervised RL to lay down reasonable rollouts, then let RLVR sharpen them. The combination outperforms either method alone, because the imitation phase makes the verifiable reward informative in the first place (Does sequencing imitation then exploration training improve reasoning?). And if you'd rather not train at all, modular cognitive tools can elicit the same latent reasoning through structured isolation, lifting GPT-4.1 on AIME2024 from 26.7% to 43.3% with no RL. That is more evidence the capability was always there, just waiting for the right surfacing mechanism (Can modular cognitive tools unlock reasoning without training?).
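The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that effect follows. It assumes the common mean-and-standard-deviation normalization within a rollout group; the group size, the reward values, and the use of population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group (mean/std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical binary rewards for 16 rollouts on one problem.
too_hard = [1.0] + [0.0] * 15     # one lucky success, possibly a shortcut
moderate = [1.0] * 8 + [0.0] * 8  # half the rollouts succeed

adv_hard = group_relative_advantages(too_hard)
adv_moderate = group_relative_advantages(moderate)

print(f"advantage of lone success on too-hard problem: {adv_hard[0]:.2f}")
print(f"advantage of a success on moderate problem:    {adv_moderate[0]:.2f}")
```

Under these assumptions the lone success on the too-hard problem gets an advantage several times larger than a success on the moderate one. If that success came from answer repetition or skipped computation, the shortcut is exactly what gets reinforced.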
Sources (10 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.