INQUIRING LINE

Do spurious rewards activate reasoning without teaching new skills?

This explores a surprising finding — that reward signals which don't actually point at correct answers can still improve model reasoning — and asks whether that means reinforcement learning is surfacing skills the model already had rather than teaching new ones.


The corpus answers clearly: yes, but only as activation, not instruction. The cleanest case study is Qwen2.5-Math, which gains 16-25% on MATH-500 from random or even incorrect rewards, while Llama and OLMo get nothing from the same treatment (see "Why do random rewards improve reasoning for some models but not others?"). The difference isn't the reward; it's the model. Qwen's pretraining had stocked it with latent code-reasoning behavior, and almost any optimization pressure, meaningful or noise, surfaces it. So the reward isn't teaching; it's pulling a lever that was already installed.
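To make "spurious" concrete, here is a minimal Python sketch of three reward functions: one that checks the answer, one that rewards wrong answers, and one that flips a coin. The function names, the toy samples, and the `reward_gap` helper are illustrative assumptions, not code from the cited work. The only point is that a random reward carries no information about correctness.

```python
import random


def correct_reward(answer: str, gold: str) -> float:
    """Verifiable reward: 1.0 only when the answer matches the reference."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def incorrect_reward(answer: str, gold: str) -> float:
    """Spurious reward that pays out only for wrong answers."""
    return 1.0 - correct_reward(answer, gold)


def random_reward(answer: str, gold: str, rng: random.Random, p: float = 0.5) -> float:
    """Spurious reward: a coin flip that never looks at the answer."""
    return 1.0 if rng.random() < p else 0.0


def reward_gap(reward_fn, samples) -> float:
    """Mean reward on correct answers minus mean reward on wrong answers."""
    right = [reward_fn(a, g) for a, g in samples if a == g]
    wrong = [reward_fn(a, g) for a, g in samples if a != g]
    return sum(right) / len(right) - sum(wrong) / len(wrong)


# Hypothetical toy data: half the sampled answers are right, half are wrong.
rng = random.Random(0)
samples = [("4", "4"), ("5", "4")] * 5000

gap_correct = reward_gap(correct_reward, samples)
gap_incorrect = reward_gap(incorrect_reward, samples)
gap_random = reward_gap(lambda a, g: random_reward(a, g, rng), samples)

print(gap_correct, gap_incorrect, round(gap_random, 3))
```

A gap near zero for the random reward means the signal cannot distinguish right from wrong answers, so any gain it produces in training has to come from somewhere other than the reward's content.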

That reframes RLVR (reinforcement learning with verifiable rewards) as a sampling story rather than a learning story. RLVR sharpens a model toward solutions already living in its base distribution, improving how efficiently it finds them. A single training example can suffice to trigger this, and spurious rewards work nearly as well as correct ones for appropriately pretrained models (see "What does reward learning actually do to model reasoning?"). The boundary evidence is striking: under pass@k analysis, which asks whether at least one of k sampled attempts is correct, base models actually beat their RLVR-tuned versions at high k. The tuned model solves no genuinely new problems; it just concentrates probability on what the base could already reach (see "Does RLVR actually expand what models can reason about?"). Distillation, by contrast, does transfer new reasoning patterns. The line between activating and teaching turns out to be real and measurable.
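The pass@k crossover can be illustrated with a short Python sketch. It uses the commonly used unbiased estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The per-problem counts are invented purely for illustration and are not results from any paper: the hypothetical "base" model is rarely right but occasionally solves every problem, while the hypothetical "tuned" model is reliable on two problems and never solves the other two.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n attempts containing c correct ones, is correct."""
    if n - c < k:
        return 1.0  # too few wrong attempts to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of N attempts on four problems.
N = 100
base_counts = [10, 5, 2, 1]    # low accuracy, but nonzero on every problem
tuned_counts = [90, 80, 0, 0]  # sharp on two problems, zero on the rest

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k}: base={base:.3f} tuned={tuned:.3f}")
```

In this toy setup the tuned model wins clearly at k=1 and loses at k=64, which is the shape of the argument above: sharpening raises single-sample accuracy without enlarging the set of problems the model can reach.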

Widen the lens and this is one instance of a larger pattern: post-training selects reasoning rather than creating it. Five independent mechanisms (RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, and RLVR) all elicit reasoning that was already latent in base-model activations, suggesting the real bottleneck is elicitation, not capability acquisition (see "Do base models already contain hidden reasoning ability?"). Even the qualitative effect of RL fits: training doesn't add a thinking faculty, it redirects an existing one, converting a model's counterproductive self-doubt during extended thinking into productive gap analysis (see "Does extended thinking help or hurt model reasoning?"). The mechanism is there before the reward; the reward governs how it's used.

There's a genuine tension worth sitting with, though. Other work argues that complex domain reasoning can emerge from RL with only simple accuracy signals, with medical systems and o3-style models developing sophisticated reasoning without chain-of-thought distillation from a teacher (see "Can simple rewards alone teach complex domain reasoning?"). Whether that's truly new capability or just elaborate elicitation from a very rich base is exactly the unsettled question the spurious-reward result sharpens. And the field's response has been to make rewards carry more real information rather than less: generative judges that reason about each step instead of scoring it (see "Can judges that reason about reasoning outperform classifier rewards?"), feedback decomposed into evaluative and directive parts that a scalar can't hold (see "Can scalar rewards capture all the information in agent feedback?"), and metacognition rewards that train *how* an agent reasons, not just whether it succeeded (see "Can RL agents learn to reason better, not just succeed?").

The thing you didn't know you wanted to know: the fact that noise works as a reward isn't evidence rewards are magic — it's evidence the magic already happened during pretraining, and the reward is just a flashlight. Which is also a warning. A method that looks like it's teaching, when tested only on a Qwen-style model, may be doing nothing but switching on what pretraining built — and will silently fail the moment you swap in a base model that was never stocked.


Sources (9 notes)

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability researcher re-testing claims about reward learning in LLMs. The core question: do spurious rewards activate latent reasoning, or do they teach new skills? A curated library of arXiv papers (Apr 2025–Oct 2025) found:

**What a curated library found — and when (dated claims, not current truth):**
- Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards; Llama and OLMo gain nothing from the same treatment, suggesting the effect is model-dependent activation, not universal learning (arXiv:2506.10947, Jun 2025).
- Base models outperform RLVR-tuned versions at high pass@k, indicating no new problems are solved, only probability concentration on pre-existing solutions (arXiv:2504.13837, Apr 2025).
- Five independent mechanisms (RL steering, critique fine-tuning, decoding, sparse-autoencoder steering, RLVR) all elicit latent base-model reasoning, suggesting elicitation, not acquisition, is the bottleneck (arXiv:2510.07364, Oct 2025).
- Generative stepwise judges and decomposed evaluative/directive feedback outperform scalar rewards, indicating the field is moving toward richer reward signals (arXiv:2508.19229, Aug 2025; arXiv:2507.22844, Jul 2025).
- Tension: domain-specific RL (medical, o3-style) produces complex reasoning without chain-of-thought distillation, raising whether this is true capability emergence or elaborate elicitation (Jul–Aug 2025 cluster).

**Anchor papers (verify; mind their dates):**
- arXiv:2504.13837 (Apr 2025): Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- arXiv:2508.19229 (Aug 2025): StepWiser — Stepwise Generative Judges
- arXiv:2510.07364 (Oct 2025): Base Models Know How to Reason (verification and synthesis)
- arXiv:2507.22844 (Jul 2025): RLVMR — Meta-Reasoning Rewards

**Your task:**
(1) RE-TEST EACH CONSTRAINT. For Qwen2.5-Math's 16–25% gain from spurious rewards: has this been replicated on newer base models (Llama 3.3+, Phi-4, or post-Oct 2025 releases)? Has training harness evolution (e.g., DPO variants, better caching, or dynamic batch optimization) changed the apparent activation vs. instruction boundary? Separately, cite one concrete piece of evidence that the pass@k boundary still holds or has been overturned.
(2) Surface the strongest CONTRADICTING work from the last 6 months: which paper, if any, claims that scalar or simple rewards DO teach new skills in base models, and how does it dodge the Qwen pass@k test?
(3) Propose 2 research questions that assume the regime may have shifted—e.g., do frontier models (o1-sized, reasoning-native) still exhibit latent-reasoning activation, or do they require genuine instruction? Can reward decomposition (evaluative + directive) escape the activation ceiling?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
