Does RLVR teach new reasoning or activate existing pretraining capabilities?
This explores whether RLVR (reinforcement learning with verifiable rewards) actually creates new reasoning skills a base model didn't have, or just surfaces and sharpens capabilities already baked in during pretraining.
The corpus leans hard toward activation, with sharp caveats. The cleanest version of the claim is that RLVR improves *sampling efficiency*, not *capability*: at high sampling budgets (large k in pass@k), base models match or beat their RLVR-trained versions, meaning RLVR isn't unlocking new solvable problems but narrowing the model's output toward solutions already in the base distribution (see: Does RLVR actually expand what models can reason about?; What does reward learning actually do to model reasoning?). A striking corollary: a single training example can trigger the effect, and *spurious* rewards (random or even incorrect ones) improve reasoning nearly as well as correct rewards, but only for models whose pretraining already contains the latent behavior to surface (see: Why do random rewards improve reasoning for some models but not others?; What does reward learning actually do to model reasoning?). If you could teach reasoning with random rewards, you weren't teaching it; you were switching it on.
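To make the pass@k argument concrete, here is a minimal Python sketch. The estimator is the standard unbiased one, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The per-problem counts are invented for illustration and are not results from any of the notes; the names `pass_at_k` and `mean_pass_at_k` are likewise hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# ASSUMPTION: made-up correct counts out of n = 256 samples per problem.
n = 256
base_correct = [2, 1, 3, 1]      # base model: rarely right, but never at zero
rlvr_correct = [200, 180, 0, 0]  # tuned model: reliable on two, never right on two


def mean_pass_at_k(correct_counts, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


for k in (1, 16, 256):
    print(k, mean_pass_at_k(base_correct, k), mean_pass_at_k(rlvr_correct, k))
```

With these made-up counts the tuned model wins at k = 1 and the base model wins at k = 256, which is the crossover shape the activation reading predicts: a sharper peak over a smaller set of solvable problems.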
The most precise framing shifts the whole question from *how* to *when*: base models already hold reasoning in latent form, and RL optimizes the *deployment timing* of those strategies rather than the strategies themselves. Hybrid models recover 91% of the performance gains by routing tokens alone, and the activation vectors for reasoning strategies exist before any RL touches the model (see: Does RL post-training create reasoning or just deploy it?). A related mechanism explains *why* spurious rewards work at all: RL converges on one dominant pretraining format within the first epoch while suppressing the alternatives. It's picking a winner from a menu pretraining already wrote, not authoring a new dish (see: Does RL training collapse format diversity in pretrained models?).
But 'activation' isn't the whole story, and the interesting tension is where the corpus disagrees with itself. One line argues capability creation is *domain-conditional*: for standard reasoning, RL activates latent ability, but for complex multi-step planning, RL generates genuinely novel strategies the base model can't reach even with heavy sampling (see: Does reinforcement learning create new reasoning abilities or activate existing ones?). Another separates two phenomena that get conflated: RLVR can activate *genuine* reasoning patterns even while the measured benchmark gains reflect something else entirely, namely memorization of contaminated test sets (see: Can genuine reasoning activation coexist with contaminated benchmarks?). That memorization concern is concrete: Qwen2.5-Math reconstructs over half of MATH-500 from partial prompts yet scores zero on a post-release benchmark, and on clean benchmarks only *correct* rewards help (see: Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So part of what looks like 'new reasoning' is neither activation nor learning; it's the test leaking into training.
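The partial-prompt check can be sketched roughly as follows: show the model the front of a benchmark problem and count how often it reproduces the held-back tail. This is an illustration under stated assumptions, not the cited study's protocol. The function name `reconstruction_rate`, the `complete` callable, the 60% prefix fraction, and the exact-match criterion are all hypothetical choices, and the "model" is a toy stand-in that has memorized one problem.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    prefix_frac: float = 0.6,  # ASSUMPTION: illustrative prefix fraction
) -> float:
    """Fraction of problems whose held-back tail the model reproduces exactly."""
    hits = total = 0
    for text in problems:
        cut = int(len(text) * prefix_frac)
        prefix, tail = text[:cut], text[cut:]
        if not tail.strip():
            continue  # nothing held back, so nothing to reconstruct
        total += 1
        if complete(prefix).strip().startswith(tail.strip()):
            hits += 1
    return hits / total if total else 0.0


# Toy stand-in for a model: it has "memorized" exactly one problem.
MEMORIZED = ["Find x if 2x + 3 = 11."]


def toy_complete(prefix: str) -> str:
    """Return the memorized continuation if the prefix matches, else nothing."""
    for text in MEMORIZED:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""


benchmark = ["Find x if 2x + 3 = 11.", "Compute the area of a circle of radius 3."]
print(reconstruction_rate(benchmark, toy_complete))
```

A high rate on an older benchmark paired with a near-zero score on a post-release one is the signature the notes describe; a real check would swap `toy_complete` for an actual model call.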
There's also a cost to the narrowing. Because RLVR exploits the base distribution rather than exploring beyond it, it can collapse the model's problem-solving scope ('capability boundary collapse'), trading breadth for a sharper peak, which exploration-based advantage functions and external data can partially counteract (see: Why does RLVR training narrow a model's problem solving ability?). Push the difficulty too far with nearly impossible problems and it gets worse: models learn degenerate shortcuts (answer repetition, skipping computation) that contaminate the real capabilities they started with (see: Do overly hard RLVR samples actually harm model capabilities?). And even when RLVR helps, what it improves may be cosmetic: it makes reasoning traces locally *coherent* (fewer errors between adjacent steps) without making them globally *valid* (see: Does RLVR actually improve mathematical reasoning or just coherence?).
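The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that arithmetic, assuming binary rewards and a population standard deviation with a small epsilon (both are illustrative choices, and `group_relative_advantages` is a hypothetical name, not any library's API):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One accidental success among 16 rollouts on a nearly impossible problem.
lucky = [1.0] + [0.0] * 15
# An evenly split group on a problem of moderate difficulty.
mixed = [1.0] * 8 + [0.0] * 8

print(group_relative_advantages(lucky)[0])
print(group_relative_advantages(mixed)[0])
```

With one success in sixteen, that lone trajectory gets an advantage of roughly 3.9, versus about 1.0 for each success in the evenly split group. If the lucky rollout got there by repeating an answer or skipping computation, that behavior is what receives the outsized push.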
The doorway worth walking through: if RLVR mostly activates, then the way to actually *teach* new reasoning is to front-load it. Run supervised imitation first to build reasoning foundations, then apply RLVR to sharpen against verifiable rewards. That curriculum beats either method alone, because imitation creates the reasonable rollouts that make outcome rewards informative in the first place (see: Does sequencing imitation then exploration training improve reasoning?). In other words, RLVR is a fantastic amplifier and a poor teacher, which tells you exactly where to put the teaching.
Sources (12 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Qwen2.5-Math improves 16-25% on MATH-500 from random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.