Does RLVR success on math benchmarks reflect genuine reasoning improvement?
Explores whether RLVR's apparent effectiveness with spurious rewards on contaminated benchmarks like MATH-500 reflects actual reasoning gains or merely the retrieval of memorized data.
The apparent success of RLVR with random, incorrect, or spurious reward signals on Qwen models may be an artifact of data contamination rather than evidence of genuine reasoning improvement.
The contamination evidence: prompting Qwen2.5-Math-7B with the first 60% of each MATH-500 question yields 54.6% exact-match reconstruction of the remaining 40% and 53.6% correct answers to these incomplete problems. On LiveMathBench, a benchmark released after Qwen2.5, the completion rate drops to 0.0%, in line with Llama3.1-8B (3.8%/0.0% respectively). Reconstructing held-out question text this often is hard to explain unless the benchmark appeared in the pretraining data: the model has memorized MATH-500.
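The partial-prompt probe described above can be sketched in a few lines. This is a minimal illustration, not the original evaluation code: the function names (`split_question`, `completion_exact_match_rate`), the character-level 60/40 split, and the whitespace normalization are all assumptions made for the example, and `complete` stands in for whatever call produces a model continuation.

```python
from typing import Callable, Iterable


def split_question(question: str, prefix_ratio: float = 0.6) -> tuple[str, str]:
    """Split a question into a visible prefix and a held-out suffix.

    Assumption: the split is done on characters; a token-level split
    would follow the same pattern.
    """
    cut = int(len(question) * prefix_ratio)
    return question[:cut], question[cut:]


def _normalize(text: str) -> str:
    # Collapse whitespace so trivial spacing differences do not count as misses.
    return " ".join(text.split())


def completion_exact_match_rate(
    questions: Iterable[str],
    complete: Callable[[str], str],
    prefix_ratio: float = 0.6,
) -> float:
    """Fraction of questions whose held-out suffix is reproduced exactly.

    `complete` is a hypothetical stand-in for a model call: it receives the
    prefix and returns the model's continuation.
    """
    questions = list(questions)
    if not questions:
        return 0.0
    hits = 0
    for question in questions:
        prefix, suffix = split_question(question, prefix_ratio)
        if _normalize(complete(prefix)) == _normalize(suffix):
            hits += 1
    return hits / len(questions)
```

A high rate on an older benchmark combined with a near-zero rate on a benchmark released after the model is the contamination signature the note describes. In practice the continuation would likely need to be truncated to the length of the held-out suffix before comparison, since a model may keep generating past the end of the question.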
On a fully clean benchmark (RandomCalculation, a set of synthetic arithmetic expressions generated after Qwen's release), the three reward types separate cleanly: correct rewards deliver consistent gains surpassing the model's performance ceiling; random rewards make training highly unstable, with no reliable improvement; inverse rewards rapidly erode mathematical reasoning ability.
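To make the three reward conditions concrete, here is a minimal sketch of a contamination-free arithmetic task with correct, random, and inverse rewards. It is an illustration under stated assumptions, not the RandomCalculation construction itself: the generator (`make_problem`), the integer-only operators, the binary reward values, and the 0.5 probability for the random reward are all hypothetical choices made for this example.

```python
import random
from typing import Callable

# Integer-only operators keep the ground-truth answer exactly checkable.
OPS: dict[str, Callable[[int, int], int]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_problem(
    rng: random.Random, n_ops: int = 3, max_operand: int = 100
) -> tuple[str, int]:
    """Generate a fully parenthesized random expression and its value.

    Because the expression is sampled fresh, it cannot have appeared in any
    pretraining corpus verbatim except by chance.
    """
    value = rng.randint(1, max_operand)
    expr = str(value)
    for _ in range(n_ops):
        op = rng.choice(sorted(OPS))
        operand = rng.randint(1, max_operand)
        expr = f"({expr} {op} {operand})"
        value = OPS[op](value, operand)
    return expr, value


def correct_reward(prediction: int, answer: int) -> float:
    """Reward 1.0 only when the prediction matches the ground truth."""
    return 1.0 if prediction == answer else 0.0


def random_reward(
    prediction: int, answer: int, rng: random.Random, p: float = 0.5
) -> float:
    """Reward that ignores correctness entirely (assumed Bernoulli with rate p)."""
    return 1.0 if rng.random() < p else 0.0


def inverse_reward(prediction: int, answer: int) -> float:
    """Reward 1.0 only when the prediction is wrong."""
    return 1.0 - correct_reward(prediction, answer)
```

Only `correct_reward` carries information about the answer; `random_reward` is pure noise and `inverse_reward` actively pushes against the right answer. With no memorized answers available for freshly generated problems, there is nothing for a spurious signal to surface.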
This directly challenges the note "Why do random rewards improve reasoning for some models but not others?" The prior interpretation, that any optimization pressure activates pretraining strategies, may confound two effects: genuine strategy activation (possible) and recall of memorized answers triggered by format-similar optimization (likely for contaminated benchmarks). On clean data, the "any reward works" finding evaporates for random and inverse signals.
The practical implication: RLVR research conclusions drawn from MATH-500 and similar benchmarks for Qwen models should be interpreted with caution. Reward engineering may matter more than the spurious-reward literature suggests, because what those experiments measured was memorization recovery, not reasoning improvement.
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- How much RLVR improvement comes from benchmark data memorization?
- Can clean benchmarks reveal true RLVR reasoning gains?
- How much does ROUGE metric choice inflate hallucination detection claims?
- Does the Heuristic Override Benchmark measure enumeration or world knowledge?
- What makes the Brier score mathematically better than log-likelihood here?
- Why do benchmark designers treat content effects as confounds?
- How do surface correlations between narratives and answers mislead benchmark validity?
- Can solution traces substitute for process-level reward signals in math reasoning?
- Can high test performance mask a complete absence of understanding?
- What information do numerical rewards fail to provide for reasoning tasks?
- How can we measure whether process rewards actually align with reasoning quality?
- Do current math benchmarks measure outcomes or rhetorical plausibility?
- How do partial credit grading systems accidentally reward reasoning theater?
- Can RL format selection explain performance gains attributed to algorithmic improvements?
- Does RLVR reward structure create pressure toward traces that look right?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- What role do high-entropy minority tokens play in RLVR?
- What limits RLVR effectiveness beyond mathematical and coding domains?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- Why do benchmark scores rise while reasoning quality declines?
- How does tool access change what we measure in reasoning tests?
- Does RLVR expand model capability or reorganize existing capability?
- How do satisfaction scores differ from genuine cognitive improvement?
- What makes mathematically confident but incorrect answers resemble valid solution shapes?
- Can one training example activate mathematical reasoning in RL-trained models?
- How do out-of-distribution tests reveal that optimization learning is memorization?
- What is the gap between benchmark performance and real workplace task completion?
- How does 93% reward reliability compare to other RL noise sources?
- What reporting standards would make interactive evaluation scores comparable across benchmarks?
- Can test-time scaling work through retrieval rather than reasoning?
- What makes a trajectory score interpretable across different interactive benchmarks?
- Can mathematical reasoning improvements transfer across problem subdomains?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- How does RPT compare to learning when versus how to deploy reasoning?
- Why do six different RLVR algorithms converge on similar performance levels?
- How much of MATH-500 improvement comes from data contamination versus real reasoning gains?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- What pretraining formats encode latent reasoning strategies that RLVR can surface?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- Can combining SRL with RLVR outperform either method used alone?
- What capability dimension does a closed-ended exam actually fail to measure?
- How do open-world evaluations correct distortions that automated benchmarks introduce?
- Can contamination-free evaluation distinguish between memorization and genuine prediction ability?
Related concepts in this collection
- Why do random rewards improve reasoning for some models but not others?
  When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?
  directly challenged: the "code reasoning" activation story may be contamination-assisted memorization recall
- Why does RLVR work with completely random rewards?
  RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
  writing angle that needs qualification: the reward may matter after all, on clean benchmarks
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
  consistent: RLVR narrows rather than expands, and contamination inflates apparent gains
- How much of LLM few-shot ability comes from training data?
  Do large language models genuinely learn from a few examples, or are they mostly recognizing patterns from their training data? This matters for understanding what LLMs can actually do.
  broader contamination phenomenon: RLVR contamination is benchmark-specific memorization; task contamination challenges the entire few-shot evaluation paradigm
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
- Spurious Rewards: Rethinking Training Signals in RLVR
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- LLMs can implicitly learn from mistakes in-context
- Escaping the Verifier: Learning to Reason via Demonstrations
Original note title
RLVR effectiveness on contaminated benchmarks is primarily data memorization — clean benchmarks eliminate spurious reward gains