Can contamination-free evaluation distinguish between memorization and genuine prediction ability?

This note explores whether contamination-free evaluation, meaning testing on clean data the model could not have seen during training, can tell apart "the model memorized the answer" from "the model can genuinely reason its way to a new answer". The corpus suggests it can expose the difference dramatically, but it also shows that memorization and reasoning are not a clean binary in the first place, which complicates the whole project.

The sharpest evidence that clean evaluation matters comes from math benchmarks. One striking result: Qwen2.5-Math-7B could reconstruct 54.6% of MATH-500 from partial prompts alone, strong evidence that it had absorbed the test set during training, yet it scored 0.0% on LiveMathBench, a benchmark released *after* its training cutoff (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). The post-release benchmark is the contamination-free probe, and it immediately separated a model that looked brilliant on paper from one that could not actually do the work. The same note reports that on clean data only correct reward signals improved performance, while random and inverse rewards failed to help or degraded it, which is what you would expect if the earlier gains were recall rather than reasoning.
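The partial-prompt probe described above can be sketched roughly as follows. This is a minimal illustration and not the cited study's protocol: the helper `reconstruction_rate`, the `complete` callable, the 60% prefix split, the exact-match scoring rule, and the toy problems are all assumptions introduced here.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    prefix_percent: int = 60,
) -> float:
    """Fraction of problems whose hidden suffix the model reproduces exactly.

    Hypothetical helper: `complete` stands in for any model call that maps a
    prompt string to a continuation string.
    """
    hits = 0
    total = 0
    for text in problems:
        words = text.split()
        cut = len(words) * prefix_percent // 100
        if cut == 0 or cut == len(words):
            continue  # too short to split into a prefix and a suffix
        prefix = " ".join(words[:cut])
        suffix = " ".join(words[cut:])
        # Normalize whitespace in the model output before comparing.
        continuation = " ".join(complete(prefix).split())
        # Exact match on the held-out suffix is an assumed scoring rule.
        hits += continuation.startswith(suffix)
        total += 1
    return hits / total if total else 0.0


# Toy stand-in for a contaminated model: it has "seen" one of two problems.
SEEN = "Find all real x such that x squared minus five x plus six equals zero"
UNSEEN = "A train leaves the station at noon travelling north at sixty miles per hour"


def toy_model(prefix: str) -> str:
    if SEEN.startswith(prefix):
        return SEEN[len(prefix):]  # verbatim recall of the remainder
    return "I am not sure"


rate = reconstruction_rate([SEEN, UNSEEN], toy_model)
```

A high rate on a public benchmark paired with a near-zero score on a post-cutoff benchmark is the pattern the note describes. Exact matching is deliberately strict, and a softer overlap measure would likely flag more cases.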

But here is what the reader might not expect: memorization and genuine prediction are not two separate buckets a model falls into. They happen *simultaneously, inside the same answer.* A shift-cipher study that decomposed chain-of-thought reasoning into three independent factors found that output probability alone could swing accuracy from 26% to 70%, that memorization tracked how often patterns appeared in pre-training, and that real step-by-step reasoning existed too but accumulated error at every step (see "What three separate factors drive chain-of-thought performance?"). So a single "correct" answer can be part recall, part favorable token statistics, and part actual inference. A clean benchmark removes the recall shortcut, but it does not tell you which of the remaining factors carried the load.
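To make "accumulated error at every step" concrete, here is a back-of-the-envelope model. It assumes each reasoning step is correct independently with the same probability p, which is a simplification introduced here and not a result from the cited study; the values of p and n are likewise illustrative.

```latex
P(\text{all } n \text{ steps correct}) = p^{n},
\qquad \text{e.g. } p = 0.95,\ n = 10 \;\Rightarrow\; 0.95^{10} \approx 0.60
```

Under that assumption, even a 95%-reliable step yields only about a 60% chance of a fully correct ten-step chain, which is one way a model can reason genuinely and still fail often on longer problems.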

That is why some of the most interesting work goes *inside* the model rather than just swapping the test set. In GPT-Neo, memorized paragraphs leave a distinctive fingerprint: larger gradients in lower layers and a specific low-layer attention head attending to rare tokens (see "Where does a model store memorized paragraphs?"). Reasoning errors also trace back to identifiable memorization sources, with "local" memorization, driven by the immediately preceding tokens, accounting for up to 67% of failures (see "Where do memorization errors arise in chain-of-thought reasoning?"). These approaches diagnose memorization mechanistically, sidestepping the question of whether your test data is truly uncontaminated. Relatedly, whether a keyword gets primed by new training text is strongly predicted by its probability before learning happens (see "Can we predict keyword priming before learning happens?"), which suggests contamination effects could in principle be anticipated, not just caught after the fact.

The deeper warning the corpus offers: evaluation can be fooled at the *surface* in ways clean data alone will not fix. Models trained to imitate ChatGPT learned its confident, fluent style and fooled human judges while closing no real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). And a deterministic, zero-temperature setup that produces the same answer every time gives you consistency, which is not the same thing as reliability (see "Does setting temperature to zero actually make LLM outputs reliable?"). So contamination-free evaluation is necessary, because it strips away the most blatant form of cheating, but it is not sufficient on its own. The honest answer is that clean test sets catch memorization-as-leakage, mechanistic probes catch memorization-as-mechanism, and you likely need both to be confident you are measuring genuine prediction rather than a convincing echo.
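The gap between consistency and reliability can be illustrated with a toy example. The sketch below is an assumption-laden illustration: the answer distribution is invented, `greedy_answer` is a hypothetical stand-in for zero-temperature decoding, and it does not implement the McDonald's omega analysis mentioned in the source note.

```python
from collections import Counter

# Hypothetical answer distribution for one prompt; the numbers are invented.
answer_probs = {"42": 0.40, "41": 0.35, "43": 0.25}


def greedy_answer(probs: dict[str, float]) -> str:
    """Zero-temperature decoding: always return the most probable answer."""
    return max(probs, key=probs.get)


# Repeat the deterministic call, as a naive consistency check would.
runs = [greedy_answer(answer_probs) for _ in range(100)]
top_answer, count = Counter(runs).most_common(1)[0]

agreement = count / len(runs)          # share of runs giving the same answer
confidence = answer_probs[top_answer]  # probability mass behind that answer
```

Here every run agrees, yet the repeated answer carries well under half of the probability mass in this invented distribution. Perfect repeatability says nothing about whether that single draw is the right one.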

Sources (7 notes)

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Where does a model store memorized paragraphs?

Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether contamination-free evaluation can cleanly separate memorization from genuine prediction ability in LLMs. This question remains live, but the findings below are dated claims—treat them as perishable constraints to re-test, not current truth.

What a curated library found — and when (findings span 2023–2025):
• A model reconstructed 54.6% of a math test from partial prompts yet scored 0% on a post-release benchmark, showing contamination-free data instantly exposes memorization-only performance (2024–2025).
• Chain-of-thought reasoning decomposes into three independent factors: output probability (swinging accuracy 26%→70%), memorization (tracking training frequency), and step-level reasoning (accumulating error)—so a single correct answer conflates all three (2024-07).
• Memorized passages leave mechanical fingerprints: elevated gradients in lower layers, rare-token attention fixation, and local token-memorization accounting for up to 67% of CoT failures (2024-03, 2025-08).
• Style imitation (confident, fluent output) fools human judges while closing no real capability gap, and deterministic inference produces consistency, not reliability (2023-05).

Anchor papers (verify; mind their dates):
• arXiv:2407.01687 (2024-07) — three disentangled factors in CoT performance
• arXiv:2403.19851 (2024-03) — paragraph memorization localization
• arXiv:2305.15717 (2023-05) — imitation as style capture
• arXiv:2508.02037 (2025-08) — token-level memorization diagnostics

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, assess whether newer models, scaling (context length, parameter count), training methods (DPO variants, synthetic data), mechanistic probing tooling, or multi-token evaluation protocols have relaxed or overturned the 54.6% reconstruction result, the three-factor decomposition, or the sufficiency limits of clean benchmarks alone. Separate the durable question (memorization vs. reasoning remains hard to disentangle) from perishable limitations (e.g., does post-training contamination injection still work as a diagnostic?).
(2) Surface the strongest contradicting or superseding work from the last ~6 months that claims contamination-free evaluation *is* sufficient, or that the three-factor model is incomplete.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., can mechanistic steering of rare-token attention fully decouple memorization leakage without sacrificing performance? Do emergent in-context learning abilities render pre-training contamination irrelevant?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
