What pretraining formats encode latent reasoning strategies that RLVR can surface?

This explores what's actually sitting in pretraining data that RLVR can later 'switch on' — and the corpus reframes the question, since it suggests RLVR surfaces *formats and strategies* already present rather than encoding new ones.

The honest answer the corpus keeps circling back to: RLVR doesn't teach reasoning, it *selects* it. The reasoning strategies are already latent in the base model's activations, and post-training's job is elicitation, not acquisition (see "Do base models already contain hidden reasoning ability?"). So the real question becomes: which pretraining formats are sitting there waiting to be amplified, and what determines which one wins?

The sharpest finding here is that RLVR converges on a *single dominant format* from pretraining within the first epoch, while actively suppressing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). Pretraining seeds multiple competing reasoning-format distributions; RL picks one and collapses the rest. Strikingly, the winner depends on model scale rather than on which format performs best, so the format RLVR surfaces isn't necessarily the strongest one, just the one that scale favors. And because this dynamic is largely hidden when you start from proprietary pretrained models, most people never see which format their pretraining actually encoded.

What makes a format *surfaceable* in the first place? It has to already produce viable trajectories. RLVR improves sampling efficiency within existing capability boundaries: it narrows sampling toward solutions the base model could already reach, but base models still beat RLVR models at high pass@k, which is the tell that no new territory got added (see "Does RLVR actually expand what models can reason about?"). For models with appropriate pretraining, a single training example, or even spurious rewards, can trigger activation, precisely because the reward isn't supplying the strategy; pretraining already did (see "What does reward learning actually do to model reasoning?"). The cleaner framing: RL post-training teaches a model *when* to reason, not *how*; activation vectors for reasoning strategies exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?").
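The pass@k comparison above can be made concrete with the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sketch below is illustrative only: the per-problem counts are invented to show the shape of the crossover (a sharpened model winning at k=1, a broader base model winning at large k) and are not results from any of the cited notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# HYPOTHETICAL counts of correct samples out of N per problem (not measured data).
N = 256
base_counts = [2, 1, 3, 2, 1, 2, 1, 2]            # rarely right, but on every problem
rlvr_counts = [200, 180, 220, 190, 0, 0, 210, 0]  # usually right, but only on some problems

for k in (1, 16, 256):
    base = mean_pass_at_k(base_counts, N, k)
    rlvr = mean_pass_at_k(rlvr_counts, N, k)
    print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts the sharpened model leads at k=1 and the base model leads at k=256, which is the pattern the note reads as "narrowed sampling, no new territory."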

Here's the thing you didn't know you wanted to know: 'surfacing latent reasoning' and 'scoring better on benchmarks' are separable phenomena. RLVR can genuinely activate pretrained reasoning patterns *and* the benchmark gains can simultaneously be memorization from contaminated data; both are true at once, measured at different levels (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). On contaminated math sets, Qwen2.5-Math-7B reconstructs roughly half of MATH-500 (54.6%) from partial prompts, and the 'reasoning' improvement is mostly retrieval; on clean benchmarks only correct rewards help, while random or inverted rewards fail to help or actively degrade performance (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). So a format can look like it encodes reasoning when it really encodes the answer key.
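A partial-prompt reconstruction check of the kind described above can be sketched as follows. Everything here is an assumption made for illustration: the `generate` callable stands in for whatever model API is being probed, the 60% prefix split is arbitrary, and verbatim matching is a simplification (the notes do not specify the matching metric the cited work used).

```python
from typing import Callable, Iterable

def reconstruction_rate(
    problems: Iterable[str],
    generate: Callable[[str], str],   # hypothetical model call: prefix -> completion
    prefix_fraction: float = 0.6,     # assumed split point, not taken from the source
) -> float:
    """Fraction of problems whose held-back tail the model reproduces verbatim."""
    hits = 0
    total = 0
    for text in problems:
        words = text.split()
        cut = max(1, int(len(words) * prefix_fraction))
        prefix = " ".join(words[:cut])
        tail = " ".join(words[cut:])
        if not tail:
            continue  # too short to hold anything back
        total += 1
        completion = " ".join(generate(prefix).split())  # normalize whitespace
        if completion.startswith(tail):
            hits += 1
    return hits / total if total else 0.0

# Toy stand-in for a contaminated model: it has memorized the "seen" problems only.
seen = [
    "Compute the remainder when 7 to the power 100 is divided by 5",
    "Find all real x such that x squared minus 5 x plus 6 equals zero",
]
unseen = [
    "How many lattice points lie strictly inside the triangle with vertices at the origin",
    "Determine the least prime that divides the sum of the first ten cubes",
]

def memorizing_model(prefix: str) -> str:
    for text in seen:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""

print(reconstruction_rate(seen, memorizing_model))
print(reconstruction_rate(unseen, memorizing_model))
```

A high rate on an older benchmark alongside a near-zero rate on a post-release one is the signature the note describes; a real check would likely use a softer similarity measure than exact prefix match.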

The practical corollary: if pretraining didn't seed a usable format, RLVR has nothing to surface, and forcing it backfires. Training on problems that are too hard makes the model invent degenerate shortcuts (answer repetition, computation-skipping) that contaminate the genuine capabilities it did have (see "Do overly hard RLVR samples actually harm model capabilities?"). That's why a curriculum that *creates* the format first works better: run imitation-style supervised RL to lay down reasonable rollouts, then let RLVR sharpen them. The sequence outperforms either method alone, because the imitation phase makes the verifiable reward informative in the first place (see "Does sequencing imitation then exploration training improve reasoning?"). And if you'd rather not train at all, modular cognitive tools can elicit the same latent reasoning through structured isolation, lifting GPT-4.1 on AIME2024 from 26.7% to 43.3% with no RL, which is more evidence that the capability was always there, just waiting for the right surfacing mechanism (see "Can modular cognitive tools unlock reasoning without training?").
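The note on overly hard samples (listed under Sources) attributes the shortcut learning to group-relative normalization. A minimal sketch of why, assuming GRPO-style standardization of binary rewards within one group of rollouts; the exact normalizer used in the cited work, the `eps` value, and the group sizes are assumptions chosen for illustration:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# HYPOTHETICAL groups of 16 rollouts with 0/1 verifiable rewards.
hard_group = [1.0] + [0.0] * 15        # one accidental success on a near-impossible problem
medium_group = [1.0] * 8 + [0.0] * 8   # half the rollouts succeed

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

print(f"advantage of the lone success:    {hard_adv[0]:.2f}")
print(f"advantage of a success at 50/50:  {medium_adv[0]:.2f}")
```

For binary rewards with success rate p, the standardized advantage of a success is sqrt((1 - p) / p), so the rarer the success, the harder that single trajectory gets pushed. If the lone success came from answer repetition or skipped computation, that shortcut is what gets reinforced.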

Sources (10 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about which pretraining formats encode latent reasoning strategies that RLVR can surface. The question remains open: what pretraining structure *enables* RLVR to elicit reasoning, and what determines which format wins?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable snapshots:

• RLVR converges on a *single dominant pretraining format* within the first epoch, suppressing alternatives; the winner is chosen by model scale, not performance (2025-04, 2504.07912).
• RLVR improves sampling efficiency *within existing capability boundaries* — base models still beat RLVR at high pass@k; no new reasoning territory is added (2025-04, 2504.13837).
• RL post-training teaches *when* to reason, not *how*; reasoning activation vectors pre-exist before RL touches weights (2025-07, 2507.14843).
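• On contaminated benchmarks, models reconstruct roughly half the test set from partial prompts (54.6% of MATH-500 for Qwen2.5-Math-7B), so much of the apparent "reasoning improvement" is memorization/retrieval; only clean benchmarks isolate genuine format surfacing (2025-07, 2507.10532).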
• Curriculum learning (imitation-then-RLVR) outperforms RLVR alone; cognitive tools without RL lift AIME from 26.7% to 43.3%, evidence the capability pre-existed (2025-06, 2506.12115; 2026-05, 2605.28388).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (Echo Chamber, 2025-04)
• arXiv:2504.13837 (Does Reinforcement Learning Really Incentivize, 2025-04)
• arXiv:2507.10532 (Reasoning or Memorization, 2025-07)
• arXiv:2506.12115 (Cognitive Tools, 2025-06)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, judge whether newer inference methods (speculative decoding, multi-token sampling, tool-assisted generation), training regimes (mixing RLVR with imitation at finer granularity, dynamic format switching), or mechanistic measurement tools have since *relaxed* or *overturned* the single-format-convergence ceiling or the pass@k boundary. Separate durable claim (pretraining seeds formats; RLVR selects them) from perishable limitation (single format wins; no new capability added). Cite what resolved it; flag where it still holds.

(2) **Surface contradicting or superseding work** from the last ~6 months that argues RLVR *does* discover new reasoning schemas or that format diversity can be preserved and co-exploited.

(3) **Propose 2 research questions assuming the regime has moved:** e.g., can multi-format RLVR orchestration (mixing formats per-token or per-trajectory) unlock reasoning outside pretraining's envelope? Can mechanistic probes show whether format selection is truly model-scale-driven or driven by reward alignment?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
