INQUIRING LINE

Does RLVR teach new reasoning or activate existing pretraining capabilities?

This explores whether RLVR (reinforcement learning with verifiable rewards) actually creates new reasoning skills a base model didn't have, or just surfaces and sharpens capabilities already baked in during pretraining.


On whether RLVR teaches genuinely new reasoning or mostly activates what pretraining already laid down, the corpus leans hard toward activation, with sharp caveats. The cleanest version of the claim is that RLVR improves *sampling efficiency*, not *capability*: at high sampling budgets (pass@k with large k), base models match or beat their RLVR-trained versions, meaning RLVR isn't unlocking new solvable problems but narrowing the model's output toward solutions already in the base distribution (see "Does RLVR actually expand what models can reason about?" and "What does reward learning actually do to model reasoning?"). A striking corollary: a single training example can trigger the effect, and *spurious* rewards, random or even incorrect ones, improve reasoning nearly as well as correct rewards, but only for models whose pretraining already contains the latent behavior to surface (see "Why do random rewards improve reasoning for some models but not others?" and "What does reward learning actually do to model reasoning?"). If you could teach reasoning with random rewards, you weren't teaching it; you were switching it on.
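The pass@k argument above is easier to see with numbers. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are hypothetical and chosen only to illustrate the crossover; they are not results from any of the cited notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# HYPOTHETICAL correct-sample counts per problem, out of N samples each.
N = 256
base_counts = [40, 8, 2, 1]     # broad coverage, low per-sample accuracy
rlvr_counts = [200, 120, 0, 0]  # sharp on two problems, never solves the others


def mean_pass_at_k(counts, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(N, c, k) for c in counts) / len(counts)


for k in (1, 16, 256):
    print(k, mean_pass_at_k(base_counts, k), mean_pass_at_k(rlvr_counts, k))
```

With these made-up counts the RLVR-style model wins at k = 1 and the base-style model wins at k = 256, which is the shape of the result the notes describe: higher sampling efficiency without a wider set of solvable problems.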

The most precise framing shifts the whole question from *how* to *when*: base models already hold reasoning in latent form, and RL optimizes the *deployment timing* of those strategies rather than the strategies themselves. Hybrid models recover 91% of the gains by routing tokens alone, and the activation vectors for reasoning strategies exist before any RL touches the model (see "Does RL post-training create reasoning or just deploy it?"). A related mechanism explains *why* spurious rewards work at all: RL converges on one dominant pretraining format within the first epoch while suppressing the alternatives. It is picking a winner from a menu pretraining already wrote, not authoring a new dish (see "Does RL training collapse format diversity in pretrained models?").

But 'activation' isn't the whole story, and the interesting tension is where the corpus disagrees with itself. One line argues capability creation is *domain-conditional*: for standard reasoning, RL activates latent ability, but for complex multi-step planning, RL generates genuinely novel strategies the base model can't reach even with heavy sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). Another separates two phenomena that get conflated: RLVR can activate *genuine* reasoning patterns even while measured benchmark gains are something else entirely, namely memorization on contaminated test sets (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). That memorization concern is concrete: Qwen2.5-Math reconstructs over half of MATH-500 from partial prompts yet scores zero on a post-release benchmark, and on clean benchmarks only *correct* rewards help (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). So part of what looks like 'new reasoning' is neither activation nor learning; it's the test leaking into training.
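A partial-prompt reconstruction check of the kind described above can be sketched as follows. This is a minimal illustration, not the cited study's protocol: the 60% truncation point, the exact-prefix matching rule, and the `generate` callable (a stand-in for whatever model interface is available) are all assumptions.

```python
from typing import Callable, Sequence


def partial_prompt_match_rate(
    problems: Sequence[str],
    generate: Callable[[str], str],
    prefix_fraction: float = 0.6,  # ASSUMED truncation point
) -> float:
    """Fraction of problems whose held-back tail the model reproduces verbatim."""
    if not problems:
        raise ValueError("need at least one problem")
    hits = 0
    for text in problems:
        cut = int(len(text) * prefix_fraction)
        prefix, tail = text[:cut], text[cut:]
        completion = generate(prefix)
        # Count a hit only if there is a non-empty tail and the completion starts with it.
        if tail.strip() and completion.strip().startswith(tail.strip()):
            hits += 1
    return hits / len(problems)


# HYPOTHETICAL stand-in for a model that has memorized one benchmark item.
memorized = ["What is the sum of the first ten positive integers?"]


def toy_generate(prefix: str) -> str:
    for item in memorized:
        if item.startswith(prefix):
            return item[len(prefix):]
    return ""


problems = [memorized[0], "How many primes lie between 20 and 40?"]
rate = partial_prompt_match_rate(problems, toy_generate)
print(rate)
```

A high match rate on an older benchmark combined with a near-zero score on a benchmark released after the model's training cutoff is the signature the note treats as evidence of contamination.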

There's also a cost to the narrowing. Because RLVR exploits the base distribution rather than exploring beyond it, it can collapse the model's problem-solving scope ('capability boundary collapse'), trading breadth for a sharper peak, which exploration-based advantage functions and external data can partially counteract (see "Why does RLVR training narrow a model's problem solving ability?"). Push the rewards too hard with impossible problems and it gets worse: models learn degenerate shortcuts (answer repetition, skipping computation) that contaminate the real capabilities they started with (see "Do overly hard RLVR samples actually harm model capabilities?"). And even when RLVR helps, what it improves may be cosmetic: it makes reasoning traces locally *coherent* (fewer errors between adjacent steps) without making them globally *valid* (see "Does RLVR actually improve mathematical reasoning or just coherence?").
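The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. The sketch below shows the arithmetic, assuming a GRPO-style advantage of (reward - group mean) / (group standard deviation + eps) with binary rewards; the group size of 8 and the eps value are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# HYPOTHETICAL group: 8 rollouts on a near-impossible problem, one lucky success.
rewards = [0.0] * 7 + [1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group with no successes yields no learning signal at all.
all_failed = group_relative_advantages([0.0] * 8)
print(all_failed)
```

The lone success receives an advantage several times larger in magnitude than any failure, regardless of whether the rollout reached the answer by sound reasoning or by a shortcut such as repeating a guess. That is the mechanism by which rare lucky trajectories can dominate the update on very hard problems.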

The doorway worth walking through: if RLVR mostly activates, then the way to actually *teach* new reasoning is to front-load it. Run supervised imitation first to build reasoning foundations, then apply RLVR to sharpen against verifiable rewards. That curriculum beats either method alone, because imitation creates the reasonable rollouts that make outcome rewards informative in the first place (see "Does sequencing imitation then exploration training improve reasoning?"). In other words, RLVR is a fantastic amplifier and a poor teacher, which tells you exactly where to put the teaching.


Sources (12 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves by 16-25% on MATH-500 when trained with random or incorrect rewards, because those rewards activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question remains live: Does RLVR teach genuinely new reasoning, or does it primarily activate latent pretraining capabilities?

What a curated library found — and when (dated claims, not current truth):
Findings span early 2024 to mid-2026. A library of 12 papers documents:
• At high sampling budgets (pass@k), base models match or exceed RLVR-trained versions; RLVR narrows the output distribution rather than expanding capability boundaries (2025-04, arXiv:2504.13837).
• Spurious and random rewards improve reasoning nearly as well as correct ones, but only when pretraining already contains the latent behavior — if random rewards work, you're activating, not teaching (2025-04, arXiv:2504.20571).
• Qwen2.5-Math reconstructs >50% of MATH-500 from partial prompts; on clean benchmarks, only correct rewards help — measured gains conflate genuine activation with test-set memorization (2025-07, arXiv:2507.10532).
• RLVR improves local trace *coherence* (fewer step-to-step errors) but not global *validity*; reasoning may look smoother without becoming sound (2025-10, arXiv:2510.18176).
• Supervised imitation followed by RLVR outperforms either alone; teaching must precede sharpening (2025-12, arXiv:2512.07783).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (Apr 2025) — Does RL Really Incentivize Reasoning Beyond Base?
• arXiv:2504.20571 (Apr 2025) — One Training Example & Spurious Rewards.
• arXiv:2507.10532 (Jul 2025) — Memorization vs. Reasoning: Data Contamination.
• arXiv:2512.07783 (Dec 2025) — Pre-Training, Mid-Training, RL Interplay.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, determine whether model scale (7B→405B), new training methods (DPO, PPO variants, online RL), better evals (synthetic contamination checks, mechanistic probes), or orchestration (chain-of-thought freezing, multi-stage curricula) have since relaxed or overturned the activation-vs.-teaching boundary. Which constraints still hold? Which have migrated to new regimes?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers claiming RLVR does teach new capability, or showing the activation/teaching split was a false dichotomy.
(3) Propose 2 research questions that assume the regime may have shifted: one probing whether larger models (post-2026) show teaching signals that earlier RLVR work missed; one exploring whether multi-task or cross-domain RLVR breaks the single-format convergence pattern documented in arXiv:2504.07912.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
