SYNTHESIS NOTE

Why do random rewards improve reasoning for some models but not others?

When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?

Synthesis note · 2026-02-22 · sourced from RLVR

For Qwen2.5-Math-7B, RLVR improves MATH-500 performance by 21.4% with random rewards, 16.4% with format-only rewards, 24.6% with incorrect labels, and 24.4% with 1-shot RL, nearly matching the 28.8% gained with ground truth rewards. For this model, the reward signal appears almost irrelevant to the outcome.
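The reward variants compared above can be sketched as simple scoring functions. This is a minimal illustration, not the study's implementation: the `\boxed{...}` answer convention, the function names, and the 0.5 success probability for the random reward are all assumptions made for the example.

```python
import random
import re

# Assumption: final answers are written as \boxed{...} with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response, answer):
    """1.0 if the boxed answer equals the reference answer, else 0.0."""
    return 1.0 if extract_boxed(response) == answer.strip() else 0.0


def incorrect_label_reward(response, wrong_answer):
    """Same check as ground truth, but scored against a deliberately wrong label."""
    return ground_truth_reward(response, wrong_answer)


def format_reward(response):
    """1.0 whenever a boxed answer is present, regardless of its correctness."""
    return 1.0 if extract_boxed(response) is not None else 0.0


def random_reward(rng, p=0.5):
    """1.0 with probability p, independent of the response (p is an assumption)."""
    return 1.0 if rng.random() < p else 0.0
```

Only `ground_truth_reward` depends on whether the response is actually right; the other three carry no information about correctness, which is what makes their reported gains surprising.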

But these spurious rewards fail entirely for Llama3 and OLMo2 model families. The critical variable is not the reward but the pretraining strategy. Qwen2.5-Math develops a distinctive "code reasoning" behavior — thinking in code without execution — that rises from 66.7% to over 90% frequency after RLVR, even with spurious rewards. Other model families lack this particular latent strategy.
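A frequency like the 66.7% to 90%+ figure above has to come from some classifier over sampled responses. The sketch below shows one hypothetical way to compute it; the marker list and function names are assumptions for illustration, and the source does not say how the study detected code reasoning.

```python
# Assumption: a response counts as "code reasoning" if it contains any of
# these Python-like markers. This is a crude heuristic, not the study's method.
CODE_MARKERS = ("def ", "import ", "print(")


def uses_code_reasoning(response):
    """True if the response contains at least one code-like marker."""
    return any(marker in response for marker in CODE_MARKERS)


def code_reasoning_frequency(responses):
    """Fraction of responses flagged as code reasoning (0.0 for an empty list)."""
    if not responses:
        return 0.0
    flagged = sum(uses_code_reasoning(r) for r in responses)
    return flagged / len(responses)
```

Running the same detector on samples drawn before and after RLVR gives the two frequencies being compared; any real measurement would need a more careful detector than substring matching.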

This is perhaps the strongest evidence bearing on "Does RL teach reasoning or just when to use it?". If random rewards work as well as correct rewards for specific models, then RLVR's function is not to provide direction but to provide pressure. The optimization signal, whatever its content, activates preexisting reasoning strategies encoded during pretraining. The reward is a catalyst, not a teacher.

Following "Does training data format shape reasoning strategy more than domain?", the Qwen code-reasoning strategy is a pretraining format artifact. RLVR surfaces it; the specific reward signal is incidental to the surfacing. Models without that pretraining format cannot benefit from the same activation pressure.

The practical implication is sobering: RLVR effectiveness may be almost entirely determined before RLVR training begins. The investment in careful reward engineering may be less important than the investment in pretraining data composition.

Critical challenge: data contamination. The RandomCalculation paper directly challenges the "any reward works" interpretation. Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 problems from partial prompts (first 60%); on post-release LiveMathBench this drops to 0.0%. On a fully clean benchmark of synthetic arithmetic (guaranteed to post-date model release), random rewards produce unstable training with no reliable improvement, while correct rewards deliver consistent gains surpassing the model's ceiling. This means the benchmark gains that motivated the "reward doesn't matter" narrative may be substantially inflated by memorization. The code-reasoning behavior change (66.7% → 90%+) is real and not explained by contamination alone, but the headline finding requires significant qualification. See "Does RLVR success on math benchmarks reflect genuine reasoning improvement?" for the full contamination argument and ops/tensions/rlvr-spurious-rewards-work-vs-rlvr-gains-are-data-contamination-artifacts.md for the tension analysis.
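The partial-prompt probe described above can be sketched as follows. This is an illustration under stated assumptions: `generate` is a hypothetical callable wrapping whatever model is being tested, the split is by characters rather than tokens, and a hit is defined as the completion starting with the held-out remainder after whitespace normalization. The paper's actual split and matching metric may differ.

```python
def split_prompt(problem, fraction=0.6):
    """Split a problem into a visible prefix and a held-out suffix.

    Assumption: the split is by character count, not by tokens.
    """
    cut = int(len(problem) * fraction)
    return problem[:cut], problem[cut:]


def _normalize(text):
    """Collapse all whitespace runs so formatting differences do not matter."""
    return " ".join(text.split())


def reconstruction_rate(problems, generate, fraction=0.6):
    """Fraction of problems whose held-out suffix the model reproduces.

    `generate` is a hypothetical callable: prefix string -> completion string.
    """
    problems = list(problems)
    if not problems:
        return 0.0
    hits = 0
    for problem in problems:
        prefix, suffix = split_prompt(problem, fraction)
        target = _normalize(suffix)
        completion = _normalize(generate(prefix))
        # Skip degenerate cases where nothing is held out.
        if target and completion.startswith(target):
            hits += 1
    return hits / len(problems)
```

A high rate on a pre-release benchmark alongside a near-zero rate on a post-release one is the pattern the contamination argument rests on: the model is completing text it has seen rather than solving a problem.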

Reweave 2026-05-18: the catalyst framing applies when reward is misaligned, but structured rewards can still teach. The original "reward is a catalyst, not a teacher" framing remains correct for the specific case the spurious-rewards study examines: when the pretrained prior already contains the target capability (Qwen's code-reasoning), almost any optimization pressure surfaces it. But late-2025 evidence sharpens the scope of this claim. "Can reward models learn by comparing policies instead of judging them?" shows that structured rewards, such as POLAR's similarity-to-target-policy, provide a genuinely directional signal that carries information beyond catalysis. The distinction is:

Reward as catalyst (this note's framing): applies when the reward signal is misaligned or random and the prior provides the structure. The reward provides pressure without direction; the prior provides direction. Spurious rewards work in this regime.
Reward as relational signal (POLAR's framing): applies when the reward is structurally aligned with what should be learned. Similarity-to-target IS direction. The signal carries information.

These coexist because they describe different regimes. The "any reward works" finding tells you what happens when the prior dominates; POLAR tells you what happens when the reward form is itself structured to carry the lesson. The general framing: rewards that lack structure rely on the prior; rewards with structure carry independent information.

This connects to the broader convergence described in "Can language models replace reward models with internal signals?": the five verifier-free patterns each provide structured signal (not random), and their substitutability is consistent with the prior dominating within the structured-signal regime. The spurious-rewards finding is a different observation about what happens when signal is absent.

Related concepts in this collection 4

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
spurious rewards are the strongest confirmation that RL teaches timing not capability
Does training data format shape reasoning strategy more than domain? What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
code reasoning as pretraining format artifact explains model-specificity
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
any reward pressure unlocks latent strategies
Do reasoning traces need to be semantically correct? Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
parallel: corrupted inputs can still yield gains