How do out-of-distribution tests reveal that optimization learning is memorization?
This explores how giving a model problems it hasn't seen before (out-of-distribution tests) exposes that what looked like learning a procedure was actually memorizing patterns from training.
Out-of-distribution (OOD) tests work as a diagnostic: when you hold the underlying optimization task constant but change its surface form, a model that truly learned a method keeps working, while a model that memorized templates falls apart. The corpus gives a surprisingly consistent answer: much of what we call 'learning to optimize' is template-matching that survives only inside the training distribution.
The sharpest demonstration is the N-1 test, where models trained with RL (including GRPO) score well on in-distribution problems but drop steeply on variants built to be the same task in different clothing (Do fine-tuned language models actually learn optimization procedures?). The same crack shows up when you watch models that are supposed to *execute* an iterative numerical method: they don't actually run the iterations in latent space. Instead they recognize a problem as template-similar and emit plausible-but-wrong numbers, a failure that doesn't go away with scale (Do large language models actually perform iterative optimization?). OOD is what makes this visible, because in-distribution a memorized answer and a computed answer look identical.
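The comparison behind this kind of test can be sketched in a few lines. This is a toy illustration, not the N-1 protocol itself: `accuracy`, `ood_gap`, the exact-match scoring, and the `template_matcher` stand-in for a model are all hypothetical names and assumptions introduced here.

```python
from typing import Callable, Dict, Sequence, Tuple

Problem = Tuple[str, str]  # (prompt, reference answer)


def accuracy(solve: Callable[[str], str], problems: Sequence[Problem]) -> float:
    """Fraction of problems answered with an exact match (assumed scoring rule)."""
    if not problems:
        raise ValueError("need at least one problem")
    correct = sum(solve(prompt).strip() == answer.strip() for prompt, answer in problems)
    return correct / len(problems)


def ood_gap(solve: Callable[[str], str],
            in_dist: Sequence[Problem],
            ood: Sequence[Problem]) -> Dict[str, float]:
    """Compare accuracy on familiar problems with accuracy on reworded variants."""
    acc_id = accuracy(solve, in_dist)
    acc_ood = accuracy(solve, ood)
    return {"in_dist": acc_id, "ood": acc_ood, "gap": acc_id - acc_ood}


# Hypothetical stand-in for a model that memorized training templates:
# it answers correctly only when the prompt matches a stored string.
MEMORIZED = {"minimize x^2 - 4x": "2", "minimize x^2 + 6x": "-3"}


def template_matcher(prompt: str) -> str:
    return MEMORIZED.get(prompt, "0")


in_dist = [("minimize x^2 - 4x", "2"), ("minimize x^2 + 6x", "-3")]
# Same two tasks with the same answers, different surface form.
ood = [("find the x that makes x*x - 4*x smallest", "2"),
       ("minimize (x + 3)^2 - 9", "-3")]

report = ood_gap(template_matcher, in_dist, ood)
print(report)  # {'in_dist': 1.0, 'ood': 0.0, 'gap': 1.0}
```

On the in-distribution set alone, the lookup table is indistinguishable from a solver; only the reworded variants separate the two.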
Benchmark contamination is the same phenomenon viewed through a different lens. Qwen2.5-Math-7B can reconstruct 54.6% of MATH-500 from partial prompts, meaning it has *seen* the test, yet scores 0.0% on LiveMathBench, a benchmark released after the model (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). The post-release benchmark is just an OOD test by another name: it's the one set of problems memorization can't have reached. Tellingly, on clean problems only genuinely correct rewards help, while random or inverted rewards fail to help or actively degrade reasoning, which is what you'd expect if the 'gains' on dirty benchmarks were recall, not reasoning.
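A partial-prompt probe of this kind might look like the sketch below. It is an assumption-heavy illustration, not the procedure used in the cited note: the 60% prefix length, the exact-match criterion, and the names `reconstruction_rate` and `make_lookup_model` are all hypothetical.

```python
from typing import Callable, Sequence


def reconstruction_rate(complete: Callable[[str], str],
                        problems: Sequence[str],
                        prefix_fraction: float = 0.6) -> float:
    """Share of problems whose held-out tail the model reproduces verbatim.

    The prefix fraction and exact-match rule are assumptions for this sketch.
    """
    if not problems:
        raise ValueError("need at least one problem")
    hits = 0
    for text in problems:
        cut = int(len(text) * prefix_fraction)
        prefix, held_out = text[:cut], text[cut:]
        if complete(prefix).strip() == held_out.strip():
            hits += 1
    return hits / len(problems)


def make_lookup_model(seen: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical 'contaminated' model: completes any text it has stored."""
    def complete(prefix: str) -> str:
        for text in seen:
            if text.startswith(prefix):
                return text[len(prefix):]
        return ""  # nothing memorized for this prefix
    return complete


seen_problems = ["Find the minimum of f(x) = x^2 - 4x + 7.",
                 "Solve for x: 3x + 5 = 20."]
fresh_problems = ["Find the maximum of g(t) = 10t - t^2 on [0, 10]."]

model = make_lookup_model(seen_problems)
seen_rate = reconstruction_rate(model, seen_problems)
unseen_rate = reconstruction_rate(model, fresh_problems)
print(seen_rate, unseen_rate)  # 1.0 0.0
```

A high rate on an older benchmark paired with a near-zero rate on problems written after release is the signature described above: the model is recalling text, not deriving it.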
What's quietly interesting is *where* the memorization lives. A token-level analysis of chain-of-thought finds that local memorization (predicting the next token from the immediately preceding ones) accounts for up to 67% of reasoning errors, and it gets worse as complexity and distributional shift increase (Where do memorization errors arise in chain-of-thought reasoning?). So the OOD drop isn't mysterious; it's the model leaning on short-range pattern completion that only holds when the surface stays familiar. The same shape appears in instruction tuning, where models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones. What transfers is knowledge of the output format, not task understanding (Does instruction tuning teach task understanding or output format?).
The thing you didn't know you wanted to know: this isn't a verdict that RL fine-tuning is fake. Other notes in the collection show RL does something real and structured: it edits a sparse but nearly full-rank subnetwork that is almost identical across random seeds (Does reinforcement learning update only a small fraction of parameters?), and it follows a reliable two-phase arc from execution mastery to strategic planning (Does RL training follow a predictable two-phase learning sequence?). The honest synthesis is that optimization training reliably *sharpens* what the base model can already pattern-match; it just doesn't install new procedures the model can carry into unfamiliar territory. OOD tests are the wedge that separates those two claims, which otherwise look the same on a leaderboard.
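"Sparse but nearly full-rank" can sound contradictory, so here is a small numerical sketch of how the two properties coexist. It uses synthetic matrices, not real model weights; the 20% update density, the 64x64 shape, and the helper name `update_stats` are assumptions chosen for illustration.

```python
import numpy as np


def update_stats(base: np.ndarray, tuned: np.ndarray, tol: float = 0.0) -> dict:
    """Measure how many entries changed and the rank of the change."""
    delta = tuned - base
    changed = np.abs(delta) > tol
    return {
        "fraction_updated": float(changed.mean()),
        "rank": int(np.linalg.matrix_rank(delta)),
        "max_rank": min(delta.shape),
    }


rng = np.random.default_rng(0)
base = rng.normal(size=(64, 64))            # stand-in for a pretrained weight matrix
mask = rng.random((64, 64)) < 0.2           # assumed: roughly 20% of entries are touched
tuned = base + mask * rng.normal(scale=0.01, size=(64, 64))

stats = update_stats(base, tuned)
print(stats)
```

Because the touched entries are scattered across rows and columns rather than confined to a few directions, the update matrix stays close to full rank even though most entries are unchanged. That is the sense in which a sparse update differs from a low-rank one.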
Sources 7 notes
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.