The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples, without reinforcing correct responses, can be highly effective. It consistently improves performance over the base model across the entire Pass@k spectrum (k up to 256), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@1 but degrades performance at higher k because it reduces output diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized.
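Because the results above are stated in terms of Pass@k, a short sketch of how such a metric can be computed may help. The text does not say which estimator is used; the code below assumes the commonly used unbiased estimator, which draws n samples per problem, counts the c correct ones, and estimates the probability that at least one of k samples is correct. The function name `pass_at_k` and its parameters are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for a single problem.

    n: number of responses sampled for the problem
    c: number of those responses judged correct by the verifier
    k: budget of attempts being evaluated (1 <= k <= n)

    Assumption: this is the standard combinatorial estimator
    1 - C(n - c, k) / C(n, k); the text does not confirm that it is
    the one used in the experiments.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

Under this estimator, a problem with zero correct samples contributes 0 at every k. A model that concentrates its probability mass on a few answers can therefore gain at Pass@1 while losing at large k, which is the kind of diversity effect the text describes.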
Introduction. Language models (LMs) have recently demonstrated remarkable capabilities in various complex reasoning tasks, including mathematics [7, 16], coding [19, 58], and scientific reasoning [33, 36]. A key technique in achieving such success is reinforcement learning with verifiable rewards (RLVR) [14, 18, 21, 43], which is particularly effective in domains where the correctness of an outcome can be automatically verified via tools or functions. RLVR typically employs a binary reward (+1 or −1) based on the objective correctness of model responses. This simple yet effective mechanism not only mitigates reward hacking [29, 41] but also eliminates the need for extensive human annotations and complex reward model training [24, 69]. RLVR’s appeal is multifaceted: it offers a conceptually simple formulation [21], exhibits notable sample efficiency [11, 25, 48], and enables inference-time scaling behaviors [12, 31, 52, 60, 66]. However, the precise mechanisms driving its effectiveness remain underexplored, particularly how it utilizes correct and incorrect samples.
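To make the question concrete, the binary-reward objective can be split into a term over correct responses and a term over incorrect ones. The notation below ($\pi_\theta$ for the policy, $\mathcal{D}$ for the prompt distribution, $r(x, y) \in \{+1, -1\}$ for the verifier reward, and the labels $\mathcal{L}_{\mathrm{PSR}}$ and $\mathcal{L}_{\mathrm{NSR}}$) is assumed here for illustration; it is a sketch that follows directly from the +1/−1 reward, not necessarily the exact formulation used later in the paper.

```latex
\begin{align}
\mathcal{L}_{\mathrm{RLVR}}(\theta)
  &= -\,\mathbb{E}_{x \sim \mathcal{D}}\,
       \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\bigl[\, r(x, y) \,\bigr] \\
  &= \underbrace{-\,\mathbb{E}_{x \sim \mathcal{D}}
       \Bigl[\, \sum_{y:\, r(x, y) = +1} \pi_\theta(y \mid x) \Bigr]}_{\mathcal{L}_{\mathrm{PSR}}(\theta)}
   \;+\;
     \underbrace{\mathbb{E}_{x \sim \mathcal{D}}
       \Bigl[\, \sum_{y:\, r(x, y) = -1} \pi_\theta(y \mid x) \Bigr]}_{\mathcal{L}_{\mathrm{NSR}}(\theta)}
\end{align}
```

Minimizing the first term raises the likelihood of correct responses, and minimizing the second lowers the likelihood of incorrect ones.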
Discussion / Conclusion. In this work, we investigate the mechanism underlying RLVR for LM reasoning. By decomposing RLVR into positive and negative sample reinforcement, we reveal a surprising finding: solely penalizing incorrect samples can effectively enhance LM reasoning capabilities while preserving generation diversity. Experimental results show that NSR consistently improves performance across a wide Pass@k spectrum and in many cases matches or outperforms strong RL algorithms such as PPO and GRPO. Our gradient analysis demonstrates that NSR works by suppressing incorrect responses and redistributing probability mass toward plausible alternatives based on the model prior. Building on these findings, we propose Weighted-REINFORCE, a simple variant of REINFORCE that upweights negative sample reinforcement. Empirical results show that it achieves a good balance between PSR and NSR and yields consistent Pass@k improvements across multiple reasoning benchmarks. We discuss limitations and future work directions in Appendix D.
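The redistribution effect and the reweighting idea summarized above can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a categorical softmax policy over a handful of candidate responses, a per-sample loss of +π(y) for incorrect samples and −`psr_weight`·π(y) for correct ones, and a hypothetical parameter `psr_weight` whose default of 0.1 is an arbitrary placeholder. Setting `psr_weight` below 1 is one way to upweight NSR relative to PSR.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_reinforce_step(
    logits: list[float],
    sampled: int,
    reward: int,
    psr_weight: float = 0.1,  # hypothetical name and value, for illustration
    lr: float = 1.0,
) -> list[float]:
    """One gradient-descent step on a toy categorical policy.

    Per-sample loss (an assumption of this sketch):
        reward = +1:  -psr_weight * pi(sampled)   (down-weighted PSR term)
        reward = -1:  +pi(sampled)                (full-weight NSR term)
    """
    probs = softmax(logits)
    p_y = probs[sampled]
    coeff = -psr_weight if reward > 0 else 1.0
    # Softmax derivative: d pi(y) / d z_j = pi(y) * (1[j == y] - pi(j))
    grads = [
        coeff * p_y * ((1.0 if j == sampled else 0.0) - probs[j])
        for j in range(len(logits))
    ]
    return [z - lr * g for z, g in zip(logits, grads)]


if __name__ == "__main__":
    # Four candidate responses; the most likely one (index 0) is incorrect.
    prior_logits = [2.0, 1.0, 0.0, -1.0]
    updated = weighted_reinforce_step(prior_logits, sampled=0, reward=-1)
    print("prior probabilities:  ", softmax(prior_logits))
    print("updated probabilities:", softmax(updated))
```

In this toy setting, penalizing the sampled response lowers its logit by lr·π(y)(1 − π(y)) and raises every other logit j by lr·π(y)π(j). The alternatives gain in proportion to their prior probability, which mirrors the redistribution behavior described in the text.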