INQUIRING LINE

Can structured rewards still teach models when spurious rewards also work?

This explores a puzzle in reinforcement learning for reasoning: if random or 'spurious' rewards train models nearly as well as carefully-built correct ones, does the design of the reward signal still matter — and where does it actually earn its keep?


If random or 'spurious' rewards train models nearly as well as carefully built correct ones, does reward design still matter? The corpus says yes, but it reframes *what* rewards are for. Spurious rewards work at all because RLVR (reinforcement learning from verifiable rewards) doesn't teach models new reasoning. It surfaces strategies the base model already learned during pretraining: a single training example can be enough to trigger activation, and the reward mostly sharpens which of the model's existing pathways get sampled (What does reward learning actually do to model reasoning?). Pass@k analysis makes the ceiling visible: base models actually beat RLVR-trained models at high k, meaning the training narrows sampling toward solutions already in the distribution rather than expanding what's solvable (Does RLVR actually expand what models can reason about?). If the reward is just an activation trigger, almost any consistent signal will pull the lever in a model with the right pretraining.
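The pass@k argument can be made concrete with a small sketch. It uses the standard unbiased combinatorial estimator of pass@k (an assumption here, since the notes do not say which estimator the cited analysis used), and the per-problem counts are hypothetical numbers chosen only to show how a sharpened model can win at low k and lose at high k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) for one problem,
    given that c of n drawn samples were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when there are fewer than k incorrect samples,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL data: correct samples out of N for two problems.
# The base model solves both rarely; the RL-tuned model solves the first
# almost always and has lost the second entirely.
N = 100
BASE_COUNTS = [10, 2]
RLVR_COUNTS = [90, 0]

if __name__ == "__main__":
    for k in (1, 8, 64):
        base = mean_pass_at_k(BASE_COUNTS, N, k)
        rlvr = mean_pass_at_k(RLVR_COUNTS, N, k)
        print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts the tuned model leads at k=1 and the base model leads at k=64, which is the crossover pattern the paragraph describes: sampling has been concentrated, not extended.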

But 'works' is hiding a lot here. The moment you ask the reward to teach something *specific* (not just activate, but discriminate), spurious signals fall apart and structure starts paying off. Binary correctness rewards quietly wreck calibration because they penalize a confident wrong answer no more than a hesitant one; adding a Brier-score term as a second reward provably optimizes accuracy and calibration together, with no trade-off (Does binary reward training hurt model calibration?). Make the reward three-way instead of pass/fail (correct, hallucinated, abstained) and a model learns to say 'I don't know,' cutting hallucinations by 28.9% across four benchmarks relative to binary-reward RL (Can three-way rewards fix the accuracy versus abstention problem?). A spurious reward could never produce that behavior, because the behavior lives entirely in the reward's *shape*, not its mere presence.
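The two reward shapes in this paragraph can be sketched side by side. The exact functional forms are assumptions: the notes say only that a Brier score is added as a second term and that abstention gets an "intermediate" reward, so the subtraction form and the default abstention value of 0.0 below are illustrative choices, not the cited papers' definitions.

```python
def binary_reward(correct: bool) -> float:
    """Pass/fail reward: blind to how confident the model was."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    ASSUMED form: the notes only say the Brier score is a second reward term.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """Three-way reward: +1 correct, -1 hallucinated, intermediate for abstaining.

    abstain_reward=0.0 is a HYPOTHETICAL default; the notes give no value.
    """
    if not -1.0 < abstain_reward < 1.0:
        raise ValueError("abstain_reward must lie strictly between -1 and +1")
    table = {"correct": 1.0, "hallucinated": -1.0, "abstained": abstain_reward}
    if outcome not in table:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return table[outcome]


if __name__ == "__main__":
    # Binary reward cannot separate a confident miss from a hesitant one.
    print(binary_reward(False), binary_reward(False))
    # The Brier term can: the confident miss is penalized more.
    print(brier_augmented_reward(False, 0.9), brier_augmented_reward(False, 0.1))
    # Abstaining now beats guessing wrong but loses to answering correctly.
    print([ternary_reward(o) for o in ("correct", "abstained", "hallucinated")])
```

The design point is that the ordering of outcomes is built into the function itself: a random or constant signal has no term that could prefer a hesitant miss over a confident one, or an abstention over a hallucination.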

The deeper lateral point: scalar rewards, spurious or not, throw away information that richer structure can keep. Real feedback splits into two orthogonal channels, *evaluative* (how good was that?) and *directive* (how should it change?), and a single number can only carry the first (Can scalar rewards capture all the information in agent feedback?). Standard reward models also can't tell causal quality from coincidence, which is exactly why they get hacked into rewarding length or sycophancy; forcing counterfactual invariance, so that the predicted reward stays the same when irrelevant variables change, strips those spurious correlations out (Can counterfactual invariance eliminate reward hacking biases?). There's a clean design lesson in DRO: use rubrics as *gates* that accept or reject whole rollout groups rather than melting them into dense scores, which preserves their categorical sharpness without inviting reward hacking (Can rubrics and dense rewards work together without hacking?).
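The gate-versus-blend distinction can be illustrated with a toy sketch. This is not DRO's actual procedure, which these notes do not specify: the `Rollout` fields, the `min_pass_fraction` threshold, and the mean-centered advantage are all hypothetical stand-ins used to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    rubric_passed: bool   # categorical rubric verdict (hypothetical field)
    dense_reward: float   # e.g. a summed token-level reward (hypothetical field)


def gate_groups(groups: list[list[Rollout]],
                min_pass_fraction: float = 1.0) -> list[list[Rollout]]:
    """Keep a rollout group only if enough of it passes the rubric.

    The rubric acts as an accept/reject filter and never enters the reward.
    The threshold rule is an ASSUMPTION made for illustration.
    """
    kept = []
    for group in groups:
        if not group:
            continue
        pass_fraction = sum(r.rubric_passed for r in group) / len(group)
        if pass_fraction >= min_pass_fraction:
            kept.append(group)
    return kept


def group_advantages(group: list[Rollout]) -> list[float]:
    """Mean-centered dense rewards within one accepted group."""
    mean = sum(r.dense_reward for r in group) / len(group)
    return [r.dense_reward - mean for r in group]


if __name__ == "__main__":
    groups = [
        [Rollout("a1", True, 0.4), Rollout("a2", True, 0.6)],
        # High dense reward but fails the rubric: the whole group is rejected,
        # so the inflated score never produces a gradient signal.
        [Rollout("b1", False, 5.0), Rollout("b2", True, 0.5)],
    ]
    for group in gate_groups(groups):
        print([r.text for r in group], group_advantages(group))
```

Because the rubric verdict only decides which groups are trained on, a rollout cannot trade a rubric failure for a larger dense score, which is the hacking route that blending the two into one number leaves open.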

Another angle on the same question: maybe the external reward doesn't need to be elaborate if the model's own internal signals are informative. An agent's shifting belief toward a solution can serve as dense per-turn credit with no critic or process-reward model at all (Can an agent's own beliefs guide credit assignment without critics?), and tree-search rollout structure can manufacture step-level supervision from nothing but trajectory outcomes by comparing sibling branches (Tree-search rollouts in agent RL convert outcome rewards into process supervision). There's even evidence that *negative* reinforcement alone, just suppressing wrong trajectories, often matches full PPO and GRPO while preserving the answer diversity that positive-only training collapses (Does negative reinforcement alone outperform full reinforcement learning?).
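A minimal sketch of belief-based credit, assuming the per-turn signal is the log-ratio of consecutive probability estimates for the true solution. The function name and the example belief trajectory are hypothetical, and how those probabilities are obtained from the agent is outside this sketch.

```python
import math


def belief_delta_credit(beliefs: list[float]) -> list[float]:
    """Per-turn credit as log(p_t / p_{t-1}).

    beliefs[0] is the agent's probability for the true solution before any
    turn; beliefs[t] is the estimate after turn t. Positive credit means the
    turn moved the agent toward the solution.
    """
    if len(beliefs) < 2:
        raise ValueError("need a prior belief and at least one turn")
    if any(not 0.0 < b <= 1.0 for b in beliefs):
        raise ValueError("beliefs must be probabilities in (0, 1]")
    return [math.log(curr / prev) for prev, curr in zip(beliefs, beliefs[1:])]


if __name__ == "__main__":
    # HYPOTHETICAL trajectory: a useful turn, a misleading one, a decisive one.
    trajectory = [0.1, 0.2, 0.1, 0.8]
    credits = belief_delta_credit(trajectory)
    print(credits)
    # The credits telescope: their sum equals log(final / initial belief).
    print(sum(credits), math.log(trajectory[-1] / trajectory[0]))
```

The telescoping sum is what makes this a redistribution of the episode-level outcome across turns rather than an extra learned signal: no critic or process-reward model is needed to produce it.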

So the resolution is a reframe worth taking home: spurious rewards 'work' only on the narrow task of *activating* latent ability in a well-pretrained model. The instant you want the reward to *teach* (calibration, abstention, causal quality, step-by-step judgment), structure becomes the whole game, and the frontier moves toward reward models that reason before they score (Can reward models benefit from reasoning before scoring?) and generative judges that critique reasoning steps instead of classifying them (Can judges that reason about reasoning outperform classifier rewards?). The spurious-reward result isn't a verdict that reward design is pointless; it's a measurement of how much the base model already knew.


Sources 12 notes

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
