INQUIRING LINE

How much of MATH-500 improvement comes from data contamination versus real reasoning gains?

This explores whether higher MATH-500 scores reflect models actually getting better at reasoning, or just having seen the test — and the corpus suggests the honest answer is 'partly both, and they're hard to tell apart.'


The corpus is unusually direct about the trap. The sharpest evidence: Qwen2.5-Math-7B can reconstruct 54.6% of MATH-500 from partial prompts alone, yet scores 0.0% on LiveMathBench, a benchmark released after the model was trained (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). That gap is the whole story in miniature: a benchmark the model may have ingested looks like reasoning; a clean one it couldn't have seen exposes how little transferred. On contaminated benchmarks, the gains are mostly recall.
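A partial-prompt probe of the kind behind the 54.6% figure can be sketched as follows. This is a minimal illustration, not the protocol the note's source used: the `complete` callable stands in for whatever model call you have, and the 0.6 prefix fraction and the exact-match criterion are assumptions chosen for the example.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    prefix_fraction: float = 0.6,  # assumed split point, not taken from the note
) -> float:
    """Fraction of problems whose held-back tail the model reproduces verbatim."""
    hits = 0
    total = 0
    for text in problems:
        words = text.split()
        cut = int(len(words) * prefix_fraction)
        if cut == 0 or cut == len(words):
            continue  # too short to split into a prefix and a tail
        prefix = " ".join(words[:cut])
        tail = " ".join(words[cut:])
        # Normalize whitespace so formatting differences are not counted as misses.
        output = " ".join(complete(prefix).split())
        total += 1
        if output.startswith(tail):
            hits += 1
    return hits / total if total else 0.0
```

A high rate on a benchmark that predates the model, next to a near-zero rate on one released afterwards, is the pattern the paragraph above describes. Exact match is a strict criterion; a softer overlap score would flag near-verbatim recall as well.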

But 'mostly recall' isn't the same as 'nothing real.' One note argues the two effects are genuinely separable: RLVR can activate authentic reasoning behaviors while the headline benchmark number is simultaneously inflated by memorization. The two operate at different measurement levels and can coexist without contradiction (Can genuine reasoning activation coexist with contaminated benchmarks?). So the question 'how much is contamination vs. reasoning' carries a hidden assumption: that it's one pie split two ways. It may instead be two different things measured by one number.

What's striking is how thin 'real reasoning gains' turn out to be even when contamination isn't the issue. RLVR makes reasoning traces more locally coherent, with fewer logical errors between adjacent steps, without making the overall proof valid (Does RLVR actually improve mathematical reasoning or just coherence?). A single training example can lift math accuracy from 36% to 73.6%, which sounds like learning but looks more like flipping a switch on latent capability the model already had (Can a single training example unlock mathematical reasoning?). And supervised fine-tuning raises final-answer accuracy while reducing the informativeness of the reasoning by 38.9% on average: the model reaches right answers through pattern-matching shortcuts, not inference (Does supervised fine-tuning actually improve reasoning quality?).

The most unsettling thread: maybe accuracy on these benchmarks was never measuring reasoning to begin with. Models trained on deliberately corrupted, irrelevant reasoning traces perform comparably to those trained on correct ones (Do reasoning traces need to be semantically correct?), and logically invalid chain-of-thought exemplars match valid ones on hard benchmarks (Does logical validity actually drive chain-of-thought gains?). If the form of reasoning drives the score regardless of its validity, then even an uncontaminated MATH-500 number can't cleanly separate 'reasoning' from 'pattern fluency' either.

So the practical takeaway is to stop trusting a single MATH-500 delta. The corpus's recurring move is to triangulate: test on post-release benchmarks the model couldn't have memorized (Does RLVR success on math benchmarks reflect genuine reasoning improvement?), measure reasoning informativeness rather than just accuracy (Does supervised fine-tuning actually improve reasoning quality?), and separate behavioral activation from benchmark movement (Can genuine reasoning activation coexist with contaminated benchmarks?). The thing you didn't know you wanted to know: even after you subtract contamination, the 'reasoning gain' that's left may be coherence and form rather than genuine inference.
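One way to make the first leg of that triangulation concrete is to compare the gain on a possibly contaminated benchmark with the gain on a post-release one. The sketch below is a hypothetical illustration: `transfer_ratio` and its inputs are names invented here, not a metric defined in the notes, and a single ratio cannot by itself separate contamination from the form-versus-validity effects discussed above.

```python
from typing import Optional


def transfer_ratio(
    seen_before: float,
    seen_after: float,
    clean_before: float,
    clean_after: float,
) -> Optional[float]:
    """Share of the gain on a possibly contaminated benchmark that also
    appears on a post-release (clean) benchmark.

    All four inputs are accuracies in [0, 1]. Returns None when there is
    no positive gain on the possibly contaminated benchmark to compare against.
    """
    seen_gain = seen_after - seen_before
    clean_gain = clean_after - clean_before
    if seen_gain <= 0:
        return None
    return clean_gain / seen_gain
```

A ratio near zero means the headline improvement did not carry over to problems the model could not have seen; a ratio near one means it largely did. Both benchmarks should be scored with the same prompt format and decoding settings so the comparison reflects the model and not the harness.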


Sources (7 notes)

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As a research analyst, evaluate this claim: MATH-500 gains in recent LLMs are primarily data contamination, not reasoning. A curated library of papers (2023–2026) found:

**What the library found — and when (dated claims, not current truth):**
- Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on LiveMathBench (released after the model was trained); contamination explains most benchmark lift (~2025).
- RLVR improves local trace coherence without guaranteeing overall proof validity (~2025).
- Single training examples activate latent capability (36%→73.6%), resembling switch-flipping rather than learning (~2025).
- Supervised fine-tuning raises final-answer accuracy while reducing reasoning informativeness by 38.9% on average; models use pattern-matching shortcuts (~2025).
- Deliberately corrupted reasoning traces and logically invalid chain-of-thought prompts perform comparably to valid ones on hard benchmarks (~2023–2025).

**Anchor papers (verify; mind their dates):**
- 2307.10573: Invalid Logic, Equivalent Gains (reasoning form ≠ logical validity)
- 2507.10532: Reasoning or Memorization? (contamination audit)
- 2510.18176: Local Coherence or Global Validity? (RLVR trace analysis)
- 2603.24472: Why Self-Distillation Degrades Reasoning (capability erosion)

**Your task:**
(1) **Re-test each constraint.** For every finding above (contamination dominance, local coherence without global validity, latent-activation framing, pattern-matching shortcuts), judge whether newer evaluation harnesses (e.g., real-time contamination detection, held-out math olympiad splits, causal intervention on training data), model scale shifts, or orchestration (multi-agent verification, external symbolic solvers) have relaxed or overturned it. Separate the durable claim (contamination inflates benchmarks) from the perishable one (reasoning gains are zero). Where does the constraint still hold?

(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Has any recent paper argued that post-LiveMathBench models actually *do* show robust reasoning transfer, or that reasoning and memorization are inseparable in a way that invalidates the binary framing?

(3) **Propose 2 research questions that assume the regime may have moved:**
   - Can we measure *genuine* reasoning transfer without leaning on any benchmark that risks post-hoc contamination?
   - If local coherence and pattern fluency are decoupled from validity, what training objective recovers *global* correctness?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
