Do spurious rewards activate reasoning without teaching new skills?
This explores a surprising finding — that reward signals which don't actually point at correct answers can still improve model reasoning — and asks whether that means reinforcement learning is surfacing skills the model already had rather than teaching new ones.
The corpus answers clearly: yes, but only as activation, not instruction. The cleanest case study is Qwen2.5-Math, which gains 16-25% on MATH-500 from random or even incorrect rewards, while Llama and OLMo get nothing from the same treatment (see "Why do random rewards improve reasoning for some models but not others?"). The difference isn't the reward; it's the model. Qwen's pretraining had stocked it with latent code-reasoning behavior, and almost any optimization pressure, meaningful or noise, surfaces it. So the reward isn't teaching; it's pulling a lever that was already installed.
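To make "random or even incorrect rewards" concrete, here is a minimal sketch of what such reward signals could look like next to a ground-truth reward. The function names, the exact-match check, and the coin-flip probability `p` are all illustrative assumptions, not the setup used in the cited experiments.

```python
import random

def correct_reward(answer: str, reference: str) -> float:
    """Ground-truth reward: 1.0 only when the answer matches the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def incorrect_reward(answer: str, reference: str) -> float:
    """Hypothetical 'incorrect' reward: pays out only for wrong answers."""
    return 1.0 - correct_reward(answer, reference)

def random_reward(answer: str, reference: str, rng: random.Random, p: float = 0.5) -> float:
    """Hypothetical 'random' reward: a coin flip that ignores the answer entirely."""
    return 1.0 if rng.random() < p else 0.0

def mean_reward_by_correctness(reward_fn, samples):
    """Average reward handed to correct vs. incorrect answers."""
    buckets = {True: [], False: []}
    for answer, reference in samples:
        is_correct = answer.strip() == reference.strip()
        buckets[is_correct].append(reward_fn(answer, reference))
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}

# Made-up data: half the answers are right, half are wrong.
samples = [("4", "4"), ("5", "4")] * 5000
rng = random.Random(0)

print("correct:  ", mean_reward_by_correctness(correct_reward, samples))
print("incorrect:", mean_reward_by_correctness(incorrect_reward, samples))
print("random:   ", mean_reward_by_correctness(lambda a, r: random_reward(a, r, rng), samples))
```

The random reward pays correct and incorrect answers at roughly the same rate, so it carries no information about correctness. Any gain a model shows under it has to come from somewhere other than the signal itself, which is the point of the Qwen result.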
That reframes RLVR (reinforcement learning with verifiable rewards) as a sampling story rather than a learning story. RLVR sharpens a model toward solutions already living in its base distribution, improving how efficiently it finds them. A single training example can suffice to trigger this, and spurious rewards work nearly as well as correct ones for appropriately pretrained models (see "What does reward learning actually do to model reasoning?"). The boundary evidence is striking: under pass@k analysis, which measures the chance that at least one of k sampled attempts is correct, base models actually beat their RLVR-tuned versions at high k, meaning the tuned model solves no genuinely new problems; it just concentrates probability on what the base could already reach (see "Does RLVR actually expand what models can reason about?"). Distillation, by contrast, does transfer new reasoning patterns. The line between activating and teaching turns out to be real and measurable.
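The pass@k argument can be illustrated with a small sketch. It uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts for the "base" and "tuned" models are invented purely to show the shape of the crossover; they are not results from any cited study.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few incorrect samples to fill all k draws: a hit is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical numbers: correct generations out of n = 256 on four problems.
N = 256
base_counts = [64, 16, 2, 1]     # base model: rarely right, but never at zero
tuned_counts = [240, 200, 0, 0]  # tuned model: sharp on two, lost the other two

for k in (1, 8, 64, 256):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}")
```

In this toy setup the tuned model wins at k=1 because it has concentrated probability on problems it can solve, while the base model overtakes it at large k because it still reaches every problem occasionally. That is the pattern the pass@k evidence describes.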
Widen the lens and this is one instance of a larger pattern: post-training selects reasoning rather than creating it. Five independent mechanisms (RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, and RLVR) all elicit reasoning that was already latent in base-model activations, suggesting the real bottleneck is elicitation, not capability acquisition (see "Do base models already contain hidden reasoning ability?"). Even the qualitative effect of RL fits: training doesn't add a thinking faculty, it redirects an existing one, converting a model's counterproductive self-doubt during extended thinking into productive gap analysis (see "Does extended thinking help or hurt model reasoning?"). The mechanism is there before the reward; the reward governs how it's used.
There's a genuine tension worth sitting with, though. Other work argues that complex domain reasoning can emerge from RL with only simple accuracy signals: medical systems and o3-style models developing sophisticated reasoning without chain-of-thought distillation from a teacher (see "Can simple rewards alone teach complex domain reasoning?"). Whether that's truly new capability or just elaborate elicitation from a very rich base is exactly the unsettled question the spurious-reward result sharpens. And the field's response has been to make rewards carry more real information rather than less: generative judges that reason about each step instead of scoring it (see "Can judges that reason about reasoning outperform classifier rewards?"), decomposing feedback into evaluative and directive parts a scalar can't hold (see "Can scalar rewards capture all the information in agent feedback?"), and metacognition rewards that train *how* an agent reasons, not just whether it succeeded (see "Can RL agents learn to reason better, not just succeed?").
The thing you didn't know you wanted to know: the fact that noise works as a reward isn't evidence rewards are magic — it's evidence the magic already happened during pretraining, and the reward is just a flashlight. Which is also a warning. A method that looks like it's teaching, when tested only on a Qwen-style model, may be doing nothing but switching on what pretraining built — and will silently fail the moment you swap in a base model that was never stocked.
Sources (9 notes)
Qwen2.5-Math improves 16-25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.