SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Why do random rewards improve reasoning for some models but not others?

When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?

Synthesis note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? Do reasoning traces show how models actually think? What does reward learning actually do to model reasoning?

RLVR improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points with random rewards, 16.4 with format-only rewards, 24.6 with incorrect labels, and 24.4 with 1-shot RL, nearly matching the 28.8 points gained with ground truth rewards. The reward signal appears almost irrelevant to the outcome.
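To make the reward conditions above concrete, here is a minimal sketch of what each one could look like as a scoring function. The function names, the `\boxed{}` answer convention, and the 0.5 probability for the random reward are illustrative assumptions, not details taken from the cited work.

```python
import random
import re
from typing import Optional

# Assumption: final answers are written as \boxed{...} with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} expression, if any."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response: str, answer: str) -> float:
    """1.0 only when the boxed answer equals the reference answer."""
    return 1.0 if extract_boxed(response) == answer.strip() else 0.0


def incorrect_label_reward(response: str, wrong_answer: str) -> float:
    """Same check, but against a deliberately wrong label."""
    return 1.0 if extract_boxed(response) == wrong_answer.strip() else 0.0


def format_only_reward(response: str) -> float:
    """1.0 whenever a non-empty boxed answer is present, right or wrong."""
    return 1.0 if extract_boxed(response) is not None else 0.0


def random_reward(rng: random.Random, p: float = 0.5) -> float:
    """Coin flip that never looks at the response (p is an assumed value)."""
    return 1.0 if rng.random() < p else 0.0
```

Only the first function depends on the correct answer. The other three carry little or no information about correctness, which is what makes the reported gains under them surprising.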

But these spurious rewards fail entirely for Llama3 and OLMo2 model families. The critical variable is not the reward but the pretraining strategy. Qwen2.5-Math develops a distinctive "code reasoning" behavior — thinking in code without execution — that rises from 66.7% to over 90% frequency after RLVR, even with spurious rewards. Other model families lack this particular latent strategy.
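The code-reasoning frequency quoted above is a fraction of sampled responses that exhibit the behavior. A rough sketch of how such a fraction could be estimated follows; the marker regex is a crude heuristic assumed for illustration, and the classifier used in the cited work may differ.

```python
import re
from typing import Iterable

# Assumption: a response "reasons in code" if it contains Python-like
# statements. These markers are illustrative, not the original criterion.
CODE_MARKERS = re.compile(
    r"^\s*(def \w+\(|import \w+|from \w+ import )|\bprint\(",
    re.MULTILINE,
)


def uses_code_reasoning(response: str) -> bool:
    """True when the response contains at least one code marker."""
    return CODE_MARKERS.search(response) is not None


def code_reasoning_frequency(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as code reasoning (0.0 if empty)."""
    responses = list(responses)
    if not responses:
        return 0.0
    flagged = sum(uses_code_reasoning(r) for r in responses)
    return flagged / len(responses)
```

Comparing this fraction on samples drawn before and after RLVR, per model family, is the kind of measurement behind the 66.7% and 90% figures.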

This is perhaps the strongest evidence bearing on "Does RL teach reasoning or just when to use it?". If random rewards work nearly as well as correct rewards for specific models, then RLVR's function is not to provide direction but to provide pressure. The optimization signal, whatever its content, activates preexisting reasoning strategies encoded during pretraining. The reward is a catalyst, not a teacher.

Per "Does training data format shape reasoning strategy more than domain?", the Qwen code-reasoning strategy is a pretraining format artifact. RLVR surfaces it; the specific reward signal is incidental to the surfacing. Models without that pretraining format cannot benefit from the same activation pressure.

The practical implication is sobering: RLVR effectiveness may be almost entirely determined before RLVR training begins. The investment in careful reward engineering may be less important than the investment in pretraining data composition.

Critical challenge: data contamination. The RandomCalculation paper directly challenges the "any reward works" interpretation. Given only the first 60% of each prompt, Qwen2.5-Math-7B reconstructs the remainder for 54.6% of MATH-500 problems; on the post-release LiveMathBench this drops to 0.0%. On a fully clean benchmark of synthetic arithmetic (guaranteed to post-date model release), random rewards produce unstable training with no reliable improvement, while correct rewards deliver consistent gains surpassing the model's ceiling. This means the benchmark gains that motivated the "reward doesn't matter" narrative may be substantially inflated by memorization. The code-reasoning behavior change (66.7% → 90%+) is real and not explained by contamination alone, but the headline finding requires significant qualification. See "Does RLVR success on math benchmarks reflect genuine reasoning improvement?" for the full contamination argument and ops/tensions/rlvr-spurious-rewards-work-vs-rlvr-gains-are-data-contamination-artifacts.md for the tension analysis.
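The partial-prompt probe described above can be sketched in a few lines. This is an illustration under stated assumptions: the split is by characters, a hit is a whitespace-normalized prefix match, and `complete_fn` is a hypothetical callable standing in for a model's greedy completion. The cited paper's exact splitting and scoring may differ.

```python
from typing import Callable, Iterable, Tuple


def split_prompt(problem: str, frac: float = 0.6) -> Tuple[str, str]:
    """Split a problem into a visible prefix and a held-out remainder."""
    cut = int(len(problem) * frac)
    return problem[:cut], problem[cut:]


def _normalize(text: str) -> str:
    """Collapse whitespace so trivial spacing differences do not matter."""
    return " ".join(text.split())


def reconstruction_rate(
    problems: Iterable[str],
    complete_fn: Callable[[str], str],  # hypothetical model wrapper
    frac: float = 0.6,
) -> float:
    """Fraction of problems whose held-out remainder the model reproduces."""
    problems = list(problems)
    if not problems:
        return 0.0
    hits = 0
    for problem in problems:
        prefix, remainder = split_prompt(problem, frac)
        completion = _normalize(complete_fn(prefix))
        if completion.startswith(_normalize(remainder)):
            hits += 1
    return hits / len(problems)
```

A high rate on an older benchmark alongside a near-zero rate on a post-release benchmark is the pattern the contamination argument rests on.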

Reweave 2026-05-18: the catalyst framing applies when reward is misaligned, but structured rewards can still teach. The original "reward is a catalyst, not a teacher" framing remains correct for the specific case the spurious-rewards work studies: when the pretrained prior already contains the target capability (Qwen's code-reasoning), almost any optimization pressure surfaces it. But late-2025 evidence sharpens the scope of this claim. "Can reward models learn by comparing policies instead of judging them?" shows that structured rewards, such as POLAR's similarity-to-target-policy, provide a genuinely directional signal that carries information beyond catalysis.

These coexist because they describe different regimes. The "any reward works" finding tells you what happens when the prior dominates; POLAR tells you what happens when the reward form is itself structured to carry the lesson. The general framing: rewards that lack structure rely on the prior; rewards with structure carry independent information.

This connects to the broader convergence described in "Can language models replace reward models with internal signals?": the five verifier-free patterns each provide structured signal (not random), and their substitutability is consistent with the prior dominating within the structured-signal regime. The spurious-rewards finding is a different observation about what happens when signal is absent.

Original note title

spurious rewards with no correlation to correct answers still improve rlvr reasoning — but only for models with specific pretraining strategies