How does negative reinforcement redistribute probability without guiding toward correct answers?
This explores a counterintuitive finding in RL training: that punishing wrong answers can improve a model without ever explicitly teaching it what's right — and what that says about how reward actually reshapes a model's probability distribution.
The mechanism is redistribution, not instruction. When training pushes probability mass *away* from incorrect trajectories, that mass has to go somewhere: it spreads across the remaining options the model already considered plausible. Because pretraining already makes correct answers reasonably likely, suppressing the wrong paths leaves the right ones relatively more likely, even though nothing in the signal said 'this is the answer.' One striking result is that negative reinforcement alone improves Pass@k across the range of k and often matches full PPO and GRPO, precisely because it preserves the diversity of remaining candidates, whereas positive-only reinforcement concentrates mass too aggressively and hurts performance at higher k (see: Does negative reinforcement alone outperform full reinforcement learning?).
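The redistribution described above can be sketched on a toy softmax policy. This is a minimal illustration, not the training setup from the cited work: the four candidate answers, their logits, the learning rate, and the helper names `softmax` and `suppress` are all assumptions made for the example. It performs plain gradient descent on the log-probability of one sampled wrong answer and prints where the freed mass lands.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def suppress(logits, wrong, lr=0.5, steps=1):
    """Gradient descent on log p(wrong) with respect to the logits.

    No target answer is supplied anywhere: the only signal is
    'make this one trajectory less likely'.
    """
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        for j in range(len(z)):
            # d log p[wrong] / d z[j] = 1[j == wrong] - p[j]
            grad = (1.0 if j == wrong else 0.0) - p[j]
            z[j] -= lr * grad
    return z

# Hypothetical prior over four candidate answers (assumed values).
# Index 0 is the sampled answer that turned out to be wrong.
prior_logits = [2.0, 1.0, 0.5, -1.0]

before = softmax(prior_logits)
after = softmax(suppress(prior_logits, wrong=0, steps=3))

for i, (b, a) in enumerate(zip(before, after)):
    print(f"candidate {i}: {b:.3f} -> {a:.3f}")
```

In this sketch each non-penalized logit rises by `lr * p[j]`, so the candidates the prior already favored receive the largest boosts and their relative ordering is unchanged. The update never names a correct answer; it only removes mass from the wrong one.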
Why does suppression-only work at all? Because the corpus suggests RL here isn't teaching new reasoning; it is reweighting strategies the model already has. RLVR improves *sampling efficiency* within existing capability boundaries rather than expanding them; a single training example can trigger the shift, and even spurious rewards work nearly as well as correct ones for a well-pretrained model (see: What does reward learning actually do to model reasoning?). That reframes negative reinforcement entirely: if the right answers already live in the distribution, you don't need a directive 'go here' signal. Carving away the wrong answers is enough to surface them.
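To make the difference between sampling efficiency and capability boundary concrete, here is a small sketch using the standard unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two "policies" and their per-problem correct counts are invented for illustration and do not come from any of the cited results.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average Pass@k over problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 16  # samples drawn per problem (assumed)

# Hypothetical correct-sample counts on two problems (assumed values).
concentrated = [16, 0]  # all mass on one strategy: solves problem 1 always, problem 2 never
diverse = [8, 2]        # mass spread out: solves both problems some of the time

for k in (1, 8):
    print(
        f"k={k}: concentrated={mean_pass_at_k(concentrated, N, k):.3f}, "
        f"diverse={mean_pass_at_k(diverse, N, k):.3f}"
    )
```

With these made-up counts the concentrated policy wins at k=1 but the diverse one wins at k=8, which is the pattern the text attributes to reweighting without expanding what the model can solve.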
The lateral payoff is seeing what this gains and what it loses. A pure suppression signal is *evaluative*: it scores how badly a path did but carries no *directive* content about how to fix it. Those are orthogonal kinds of information, and scalar rewards capture only the first (see: Can scalar rewards capture all the information in agent feedback?). This is also why models plateau under numerical reward and then break through when given chain-of-thought critiques that explain *why* something failed (see: Can natural language feedback overcome numerical reward plateaus?). Negative reinforcement redistributes; it does not explain.
The redistribution framing also exposes a danger. Binary correctness rewards incentivize confident guessing because they never penalize confident-but-wrong answers, which degrades calibration: the probability mass moves, but toward overconfidence (see: Does binary reward training hurt model calibration?). Designs that add structure to the negative signal counteract this: a ternary reward that distinguishes hallucination from abstention makes 'I don't know' a learnable destination for redistributed mass, cutting hallucinations while improving truthfulness (see: Can three-way rewards fix the accuracy versus abstention problem?).
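The incentive difference between binary and ternary rewards can be worked through with expected values. This is a hedged sketch: the outcome labels, the function names, and the abstention reward of 0.0 are assumptions (the source note only says abstention sits between +1 for a correct answer and -1 for a hallucination), and the binary scheme is assumed to pay 1 for correct and 0 otherwise.

```python
def binary_reward(outcome):
    """Assumed binary scheme: 1 for a correct answer, 0 for anything else."""
    return 1.0 if outcome == "correct" else 0.0

def ternary_reward(outcome, abstain_value=0.0):
    """Three-way scheme: correct +1, hallucination -1, abstention in between.

    abstain_value=0.0 is an assumed placeholder for the intermediate reward.
    """
    if outcome == "correct":
        return 1.0
    if outcome == "hallucination":
        return -1.0
    return abstain_value  # outcome == "abstain"

def expected_guess_reward(q, reward_fn):
    """Expected reward of answering when the answer is right with probability q."""
    return q * reward_fn("correct") + (1.0 - q) * reward_fn("hallucination")

for q in (0.1, 0.3, 0.7):
    for name, fn in (("binary", binary_reward), ("ternary", ternary_reward)):
        guess = expected_guess_reward(q, fn)
        abstain = fn("abstain")
        choice = "guess" if guess > abstain else "abstain"
        print(f"q={q:.1f} {name:7s} guess={guess:+.2f} abstain={abstain:+.2f} -> {choice}")
```

Under the assumed binary scheme guessing has expected reward q, which beats abstaining for any q above zero, so there is no reason to ever say 'I don't know'. Under the ternary scheme guessing pays 2q - 1, so abstention becomes the better choice whenever the model is less than 50% likely to be right (with the assumed abstention value of 0.0).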
The thing worth walking away with: 'improving' a model and 'teaching' a model are not the same operation. Much of what reward-based training accomplishes is rearranging probability the model already holds — which is why suppression can succeed, why spurious rewards sometimes suffice, and why the hard, unsolved part is supplying the directional information that pure redistribution can never contain.
Sources (6 notes)
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.