Why do spurious rewards work nearly as well as correct ones?

This explores why reinforcement learning with verifiable rewards (RLVR) gets reasoning gains even when the reward signal is random or wrong — and what that reveals about where the reasoning actually lives.

Spurious rewards, whether random or deliberately incorrect, improve reasoning nearly as much as correct ones. The short version from the corpus: the reward isn't teaching the model to reason. It's switching on reasoning the model already had. The note "Why does RLVR work with completely random rewards?" frames this as a phase transition in the model's output distribution: the reward acts as a catalyst that shifts the model into a reasoning-heavy mode, and the quality of the signal barely matters compared to the quality of pretraining. You're not installing a skill; you're flipping a switch on a skill that was latent.

The catch is that this only works for some models, which is the most revealing part. The note "Why do random rewards improve reasoning for some models but not others?" shows Qwen2.5-Math jumping 16–25% on MATH-500 from random or incorrect rewards, because its pretraining baked in a latent code-reasoning behavior that optimization pressure can surface. Llama and OLMo, lacking that pretraining format, get nothing. So 'spurious rewards work' isn't a universal law of RL; it's evidence that the reasoning was sitting in the pretrained weights all along, waiting for any optimization pressure to elicit it. The reward picks the lock; pretraining decided whether there was anything behind the door.
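To make 'random or incorrect rewards' concrete, here is a minimal sketch of three reward functions side by side. The function names, the exact-match check, and the coin-flip probability of 0.5 are illustrative assumptions, not the setup used in the work the notes summarize.

```python
import random

def correct_reward(answer: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 only when the answer matches the reference."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def incorrect_reward(answer: str, ground_truth: str) -> float:
    """Deliberately wrong reward: pays out only for non-matching answers."""
    return 1.0 - correct_reward(answer, ground_truth)

def random_reward(rng: random.Random, p: float = 0.5) -> float:
    """Coin-flip reward (assumed p = 0.5) that never looks at the answer."""
    return 1.0 if rng.random() < p else 0.0

rng = random.Random(0)  # seeded so the coin flips are reproducible
truth = "4"
for answer in ["4", "5", "4", "7"]:
    print(answer,
          correct_reward(answer, truth),
          incorrect_reward(answer, truth),
          random_reward(rng))
```

Only the first function carries any information about correctness. The other two still hand the optimizer a nonzero signal to push on, and the corpus's claim is that this is enough when the reasoning is already latent.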

This reframes what the reward signal contributes. If almost any signal flips the switch, then the interesting question becomes what a *good* signal adds beyond the flip. The corpus suggests the answer is precision and safety, not activation. The note "Does negative reinforcement alone outperform full reinforcement learning?" finds that training on only negative samples, which just suppresses wrong trajectories, often matches full PPO/GRPO while preserving answer diversity, hinting that much of RL's value is in pruning rather than rewarding. And "Can scalar rewards capture all the information in agent feedback?" points out that a scalar reward carries 'how well did this do' but throws away 'how should it change', so a cruder signal loses directional richness, not the basic catalytic push.
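The pruning idea can be sketched as a REINFORCE-style surrogate loss in which only wrong samples carry weight. This is a simplified illustration under stated assumptions: the mode names, the +1/-1 weights, and the plain mean over samples are hypothetical choices, and real PPO/GRPO objectives add clipping, baselines, and normalization that are omitted here.

```python
from typing import List

def sample_weights(correct: List[bool], mode: str) -> List[float]:
    """Per-sample reward weight under three hypothetical training modes."""
    if mode == "full":            # reward correct samples, penalize wrong ones
        return [1.0 if c else -1.0 for c in correct]
    if mode == "negative_only":   # ignore correct samples, push down wrong ones
        return [0.0 if c else -1.0 for c in correct]
    if mode == "positive_only":   # push up correct samples, ignore wrong ones
        return [1.0 if c else 0.0 for c in correct]
    raise ValueError(f"unknown mode: {mode}")

def surrogate_loss(logprobs: List[float], correct: List[bool], mode: str) -> float:
    """Minimizing this raises the log-probability of positively weighted
    samples and lowers that of negatively weighted ones."""
    weights = sample_weights(correct, mode)
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / len(logprobs)

# Sequence log-probabilities of four sampled answers (made-up numbers).
logprobs = [-1.0, -2.0, -0.5, -3.0]
correct = [True, False, False, True]
for mode in ("full", "negative_only", "positive_only"):
    print(mode, surrogate_loss(logprobs, correct, mode))
```

In this sketch the full loss is exactly the sum of the positive-only and negative-only losses, which makes it easy to see that the negative-only variant keeps the suppress-wrong-trajectories half of the signal and drops the half that concentrates probability mass on already-correct answers.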

The danger lurking under 'spurious rewards are fine' is that proxy signals which correlate with correctness *at first* can quietly stop doing so. The note "Does self-consistency reliably reward correct answers during training?" shows self-consistency rewards bootstrapping nicely and then teaching the model to produce confidently wrong but reproducible answers: improvement that's actually decay. "Does binary reward training hurt model calibration?" makes the related point that even *correct* binary rewards degrade calibration by rewarding confident guessing. So 'the reward barely matters' is true for triggering reasoning and false for shaping its trustworthiness, which is exactly where richer designs like ternary truth/abstention rewards ("Can three-way rewards fix the accuracy versus abstention problem?") and reasoning-before-scoring judges ("Can reward models benefit from reasoning before scoring?") earn their keep.
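The reward designs named above differ in only a few lines, which a rough sketch makes visible. Everything here is illustrative: the function names are hypothetical, the Brier term is the standard squared error between stated confidence and the 0/1 outcome with an assumed weight of 1, and the abstention value of 0.0 is an assumption, since the notes only describe it as intermediate.

```python
from collections import Counter
from typing import List, Optional

def binary_reward(correct: bool) -> float:
    """Plain correctness reward: confidence plays no role."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (confidence - outcome)^2."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2

def ternary_reward(answer: Optional[str], truth: str,
                   abstain_value: float = 0.0) -> float:
    """+1 for correct, -1 for wrong, an assumed middle value for abstaining
    (abstention is represented here as answer=None)."""
    if answer is None:
        return abstain_value
    return 1.0 if answer == truth else -1.0

def self_consistency_rewards(answers: List[str]) -> List[float]:
    """Label-free proxy: reward samples that agree with the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# A confident wrong answer costs nothing extra under the binary reward,
# but is penalized by the Brier term.
print(binary_reward(False), brier_augmented_reward(False, 0.9))

# If the true answer is "4", the majority vote still rewards the
# reproducible wrong answer "7".
print(self_consistency_rewards(["7", "7", "4"]))
```

The last call shows the failure mode in miniature: the self-consistency reward never sees the reference answer, so agreement is all it can pay for.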

The thing you didn't know you wanted to know: the surprising headline 'random rewards work' is really a backhanded measurement of pretraining. RLVR is less a teacher than a developer fluid — it makes visible what the base model already contains. Which means if spurious rewards *don't* help your model, that's not a tuning failure; it's the model telling you the reasoning was never latent there to begin with.

Sources (8 notes)

Why does RLVR work with completely random rewards?

RLVR works nearly as well with spurious rewards as with correct ones because it catalyzes a phase transition in the model's output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves 16–25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward's correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about spurious rewards in LLM reasoning RL. The question: why do random or incorrect reward signals improve reasoning performance nearly as much as correct ones, and what does that reveal about what RL actually does?

What a curated library found — and when (dated claims, not current truth):
Findings span Sept 2024–Nov 2025. Key claims:
• Spurious rewards trigger a phase transition in output distribution; the reward acts as a catalyst, not a teacher—pretraining determines whether latent reasoning exists to unlock (2025-01, 2025-06).
• Qwen2.5-Math gains 16–25% on MATH-500 from random/incorrect rewards because its pretraining encoded latent code-reasoning; Llama and OLMo do not, suggesting spurious-reward effectiveness is a *pretraining signature*, not a universal law (2025-01).
• Negative-only RL (suppressing wrong trajectories) matches full PPO/GRPO on reasoning while preserving diversity; most RL value is pruning, not rewarding (2025-06).
• Self-consistency as proxy reward bootstraps well, then teaches confident reproducible *incorrectness*—a false improvement signal (2025-05).
• Binary rewards degrade calibration; ternary (correct/wrong/abstain) and reasoning-before-scoring judges preserve trustworthiness (2025-05, 2025-08).

Anchor papers (verify; mind their dates):
• arXiv:2501.09620 (Jan 2025): Beyond Reward Hacking: Causal Rewards
• arXiv:2506.01347 (June 2025): Negative Reinforcement in LLM Reasoning
• arXiv:2507.14843 (July 2025): The Invisible Leash: Why RLVR May Not Escape Its Origin
• arXiv:2511.18397 (Nov 2025): Natural Emergent Misalignment From Reward Hacking

Your task:
(1) RE-TEST THE CONSTRAINT. For each finding above—especially the claim that spurious rewards work because pretraining harbors latent reasoning—has newer work (last 6 months) shown that *training objective* or *model scale* now *enables* reasoning even in models lacking such pretraining? Separately: has anyone shown that *reward signal quality* now *does* matter for reasoning activation, not just calibration? Flag which constraints still hold and which have relaxed.
(2) Surface the strongest *contradiction*: does recent work challenge the pretraining-signature hypothesis? Do any papers show spurious rewards failing to activate reasoning in large modern models, or vice versa? List arXiv IDs.
(3) Propose 2 research questions that assume the regime may have shifted: (a) what if multi-agent or ensemble-based reward aggregation *relaxes* the pretraining bottleneck? (b) what if scaling test-time reasoning compute (reasoning-based judges, iterative refinement) *replaces* reward quality as the lever?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
