Why do spurious rewards work for some models but not others?

This explores why feeding a model random or even incorrect rewards still sharpens its reasoning in some cases — and why the same trick does nothing for other models. The short version from the corpus: the reward isn't teaching the model anything new. It's pulling a lever that already exists. Whether the lever exists depends entirely on how the model was pretrained.

The sharpest evidence comes from a study in which Qwen2.5-Math improved by 16-25% on MATH-500 after training on random or even incorrect rewards, while Llama and OLMo gained nothing from the same treatment (see "Why do random rewards improve reasoning for some models but not others?"). The explanation is that Qwen's pretraining left it with a latent habit, reasoning through code-like steps, that was sitting unused. Almost any optimization pressure, even noise, nudges the model toward surfacing that habit. Llama and OLMo simply don't have the habit to surface, so there's nothing for the noise to activate. The reward is a wake-up call, not a lesson.
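To make "random or even incorrect rewards" concrete, here is a minimal sketch of the three kinds of reward signal being compared. The function names, the exact-match check, and the coin-flip probability of 0.5 are all assumptions made for illustration; the note does not say how the study defined its rewards.

```python
import random


def correct_reward(predicted: str, gold: str) -> float:
    """Ground-truth reward: 1.0 only when the answer matches the reference."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def incorrect_reward(predicted: str, gold: str) -> float:
    """Deliberately wrong reward: pays out only when the answer misses the reference."""
    return 1.0 - correct_reward(predicted, gold)


def random_reward(predicted: str, gold: str, rng: random.Random, p: float = 0.5) -> float:
    """Pure noise: a coin flip that ignores both the answer and the reference.

    p = 0.5 is an assumed value, not one taken from the study.
    """
    return 1.0 if rng.random() < p else 0.0


rng = random.Random(0)
# The random reward carries no information about correctness: a right answer
# and a wrong answer are paid at the same rate on average.
paid_when_right = sum(random_reward("42", "42", rng) for _ in range(10_000)) / 10_000
paid_when_wrong = sum(random_reward("41", "42", rng) for _ in range(10_000)) / 10_000
```

Only `correct_reward` says anything about whether the answer is right. If training on the other two still helps, the improvement has to come from somewhere other than the signal's content.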

This fits a broader finding about what reinforcement learning actually does to reasoning. One line of work argues that RLVR (reinforcement learning with verifiable rewards) improves how efficiently a model samples from abilities it already has, rather than expanding what it can do. A single training example can be enough to trigger the shift, and spurious rewards work nearly as well as correct ones for models with the right pretraining (see "What does reward learning actually do to model reasoning?"). So the question "why do spurious rewards work?" is really the question "what was already latent in this model?", and the answer was written during pretraining, long before any reward showed up.
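One common way to separate "samples more efficiently" from "can do more" is pass@k: the probability that at least one of k sampled attempts is correct. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n generations of which c are correct. The per-problem counts are invented for illustration and are not results from any of the notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n generations of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few wrong generations to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100
# Hypothetical correct-generation counts per problem (made up for illustration).
base_counts = [10, 10, 10, 10]   # rarely right, but never at zero
tuned_counts = [90, 90, 90, 0]   # usually right, except on one problem it never solves

base_at_1, tuned_at_1 = mean_pass_at_k(base_counts, N, 1), mean_pass_at_k(tuned_counts, N, 1)
base_at_64, tuned_at_64 = mean_pass_at_k(base_counts, N, 64), mean_pass_at_k(tuned_counts, N, 64)
```

With these made-up counts the tuned model wins at k = 1 while the base model wins at k = 64. That crossover is the kind of pattern the "efficiency, not capability" reading predicts: training moved probability mass toward answers the model could already reach.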

There's a useful contrast lurking here. If a reward signal can be pure noise and still help, that tells you standard reward training is often optimizing against something other than genuine quality. Other notes in the corpus show reward models latching onto response-level surface features while barely noticing what question was even asked (see "Why do reward models ignore what question was asked?"), and learning spurious correlations like length or sycophancy that have to be deliberately stripped out with causal methods (see "Can counterfactual invariance eliminate reward hacking biases?"). Spurious rewards "working" and reward models being fooled by spurious features are two sides of the same coin: in both, the actual content of the signal matters far less than we'd assume.

The thing you might not have known you wanted to know: this means the dramatic gains you see from clever reward schemes may be partly an illusion of attribution. The credit belongs to pretraining. If you want to see what's genuinely being added versus merely activated, the more revealing experiments isolate the reward's role. One example shows that negative-only reinforcement (suppressing wrong answers) can match full RL while preserving the diversity that positive reinforcement collapses (see "Does negative reinforcement alone outperform full reinforcement learning?"). The lesson across all of it: before asking whether a reward works, ask what the model already knew how to do.
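The intuition behind the negative-only result can be seen in a single update step on a toy softmax policy. This is only a sketch under strong assumptions: three candidate answers, invented starting probabilities, a learning rate of 1.0, and one REINFORCE-style step per case. It is not the training setup from the note, and over many samples the two schemes can behave far more alike than one step suggests.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, sampled, reward, lr=1.0):
    """One REINFORCE-style step for a single sampled answer.

    The gradient of log pi(sampled) w.r.t. logit j is 1[j == sampled] - pi_j,
    so each logit moves by lr * reward * that gradient.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if j == sampled else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical toy problem: answers 0 and 1 are both correct, answer 2 is wrong.
start = [math.log(0.4), math.log(0.2), math.log(0.4)]

# Positive-only: reward a sampled correct answer (index 0).
after_positive = softmax(reinforce_step(start, sampled=0, reward=+1.0))
# Negative-only: penalize a sampled wrong answer (index 2).
after_negative = softmax(reinforce_step(start, sampled=2, reward=-1.0))
```

In this toy case the positive step raises answer 0 but takes probability away from answer 1, the other correct path. The negative step lowers the wrong answer and leaves both correct answers more likely than before.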

Sources (5 notes)

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% on MATH-500 from random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why do reward models ignore what question was asked?

When prompts are swapped while keeping responses identical, reward model scores barely change. This reveals that standard RLHF optimizes against phantom quality signals rather than prompt-response alignment, enabling four distinct biases.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
