How do out-of-distribution tests reveal that optimization learning is memorization?

This explores how giving a model problems it hasn't seen before (out-of-distribution tests) exposes that what looked like learning a procedure was actually memorizing patterns from training.

Out-of-distribution (OOD) tests work as a diagnostic: when you hold the underlying task constant but change its surface form, a model that truly learned a method keeps working, while a model that memorized templates falls apart. The corpus has a surprisingly consistent answer: much of what we call 'learning to optimize' is template-matching that survives only inside the training distribution.

The sharpest demonstration is the N-1 test, where models trained with RL (including GRPO) score well on in-distribution problems but drop steeply on variants built to be the same task in different clothing (see: Do fine-tuned language models actually learn optimization procedures?). The same crack shows up when you watch models that are supposed to *execute* an iterative numerical method: they don't actually run the iterations in latent space, they recognize a problem as template-similar and emit plausible-but-wrong numbers, a failure that doesn't go away with scale (see: Do large language models actually perform iterative optimization?). OOD is what makes this visible, because in-distribution a memorized answer and a computed answer look identical.
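The in-distribution versus OOD comparison described above can be sketched as a simple accuracy gap. This is a minimal illustration, not the N-1 protocol from the cited note: the `accuracy` and `ood_gap` helpers, the toy `memorizer`, and the example problems are all hypothetical names invented here, and exact string match stands in for whatever grading a real harness would use.

```python
from typing import Callable, Iterable, Tuple

Problem = Tuple[str, str]  # (prompt, expected answer)


def accuracy(solve: Callable[[str], str], problems: Iterable[Problem]) -> float:
    """Fraction of problems answered correctly by exact string match."""
    items = list(problems)
    if not items:
        raise ValueError("need at least one problem")
    correct = sum(solve(prompt).strip() == answer.strip() for prompt, answer in items)
    return correct / len(items)


def ood_gap(solve: Callable[[str], str],
            in_dist: Iterable[Problem],
            ood: Iterable[Problem]) -> float:
    """In-distribution accuracy minus OOD accuracy; a large gap suggests template-matching."""
    return accuracy(solve, in_dist) - accuracy(solve, ood)


# Hypothetical "model" that has only memorized prompt -> answer pairs.
MEMORIZED = {
    "Minimize x^2 - 4x": "2",
    "Minimize x^2 - 6x": "3",
}


def memorizer(prompt: str) -> str:
    # Returns the stored answer for a known surface form, a default otherwise.
    return MEMORIZED.get(prompt, "0")


in_dist = list(MEMORIZED.items())
# Same tasks, different surface form (assumed rewording for illustration).
ood = [
    ("Find the t that makes t^2 - 4t smallest", "2"),
    ("Find the t that makes t^2 - 6t smallest", "3"),
]

gap = ood_gap(memorizer, in_dist, ood)
print(f"in-dist={accuracy(memorizer, in_dist):.2f} "
      f"ood={accuracy(memorizer, ood):.2f} gap={gap:.2f}")
```

A solver that actually computed the minimizer would score the same on both sets, so its gap would be near zero; the memorizer only looks competent while the wording matches.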

Benchmark contamination is the same phenomenon viewed through a different lens. Qwen2.5-Math-7B can reconstruct more than half of MATH-500 (54.6%) from partial prompts, meaning it has *seen* the test, yet scores 0.0% on a benchmark released after its training cutoff (see: Does RLVR success on math benchmarks reflect genuine reasoning improvement?). The post-cutoff benchmark is just an OOD test by another name: it's the one set of problems memorization can't have reached. Tellingly, on clean problems only genuinely correct rewards help, while random or inverted rewards fail to help or actively degrade performance, which is what you'd expect if the 'gains' on dirty benchmarks were recall, not reasoning.
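A partial-prompt reconstruction check of the kind mentioned above might look roughly like the following. This is a sketch under stated assumptions: the 0.6 prefix fraction, the word-level split, the prefix-match criterion, and the `leaky_complete` stand-in for a model are all hypothetical choices made here, not details taken from the cited note.

```python
from typing import Callable, Sequence


def reconstruction_rate(complete: Callable[[str], str],
                        problems: Sequence[str],
                        prefix_fraction: float = 0.6) -> float:
    """Share of problems whose held-out tail the model reproduces verbatim.

    prefix_fraction is an assumed setting: the model sees that share of the
    words and must regenerate the rest.
    """
    if not problems:
        raise ValueError("need at least one problem")
    hits = 0
    for text in problems:
        words = text.split()
        cut = max(1, int(len(words) * prefix_fraction))
        prefix = " ".join(words[:cut])
        held_out = " ".join(words[cut:])
        if not held_out:
            continue  # nothing to reconstruct; counted as a miss
        # Normalize whitespace before comparing the continuation to the tail.
        continuation = " ".join(complete(prefix).split())
        if continuation.startswith(held_out):
            hits += 1
    return hits / len(problems)


# Hypothetical model stand-in: it has memorized the SEEN problems verbatim.
SEEN = ["Find the minimum of x squared minus four x plus seven over the reals"]
FRESH = ["Find the smallest value of y squared minus ten y plus one for real y"]


def leaky_complete(prefix: str) -> str:
    for text in SEEN:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""  # no memorized continuation available


print("seen benchmark :", reconstruction_rate(leaky_complete, SEEN))
print("fresh benchmark:", reconstruction_rate(leaky_complete, FRESH))
```

A high rate on an older benchmark next to a near-zero rate on a post-cutoff one is the pattern the paragraph describes; the rate by itself says nothing about whether the model can solve the problems.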

What's quietly interesting is *where* the memorization lives. A token-level analysis of chain-of-thought finds that local memorization (predicting the next token from the immediately preceding ones) accounts for up to two-thirds of reasoning errors, and it gets worse as distributional shift increases (see: Where do memorization errors arise in chain-of-thought reasoning?). So the OOD drop isn't mysterious; it's the model leaning on short-range pattern completion that only holds when the surface stays familiar. The same shape appears in instruction tuning, where models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones: what transfers is knowledge of the output format, not task understanding (see: Does instruction tuning teach task understanding or output format?).

The thing you didn't know you wanted to know: this isn't a verdict that RL fine-tuning is fake. Other notes in the collection show RL does something real and structured. It edits a sparse but nearly full-rank subnetwork that's nearly identical across random seeds (see: Does reinforcement learning update only a small fraction of parameters?), and it follows a reliable two-phase arc from execution mastery to strategic planning (see: Does RL training follow a predictable two-phase learning sequence?). The honest synthesis is that optimization training reliably *sharpens* what the base model can already pattern-match; it just doesn't install new procedures the model can carry into unfamiliar territory. OOD tests are the wedge that separates those two claims, which otherwise look the same on a leaderboard.

Sources 7 notes

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full correct instructions (43% versus a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating claims about whether optimization fine-tuning in LLMs teaches genuine reasoning or shallow memorization. The question: *How reliably do out-of-distribution tests distinguish learned procedures from template-matching?* Treat this as unsettled.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable:

• N-1 and OOD variant tests show RL-fine-tuned models (GRPO included) drop steeply off-distribution while staying strong in-distribution, suggesting template-matching rather than method learning (~2025, arXiv:2504.07912).
• Benchmark contamination (e.g., Qwen2.5-Math reconstructing MATH-500 from partial prompts) and post-cutoff benchmarks are functionally identical OOD tests; random/inverted rewards fail on clean data, suggesting memorized gains on dirty benchmarks (~2025, arXiv:2507.10532).
• Token-level analysis finds local memorization (next-token prediction from immediate context) explains up to two-thirds of reasoning errors and worsens under distributional shift (~2025, arXiv:2508.02037).
• RL updates only sparse, nearly full-rank subnetworks (~5–30% of parameters, consistent across seeds) and follows a two-phase arc: execution mastery → strategic planning (~2025, arXiv:2505.11711).
• Instruction tuning with semantically empty or inverted instructions performs similarly to correct ones, suggesting output-format knowledge, not task understanding (~2023, arXiv:2305.11383).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (May 2023) — instruction tuning format-learning
• arXiv:2507.10532 (July 2025) — contamination and OOD diagnostics
• arXiv:2508.02037 (August 2025) — token-level memorization in CoT
• arXiv:2505.11711 (May 2025) — sparse subnetwork updates

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, have newer model scale, more sophisticated OOD harnesses (multi-hop, adversarial paraphrasing, compositional generalization scaffolds), or improved reward signals (verifiable reasoning, meta-reasoning, rubric anchoring) since dissolved or relocated the memorization wedge? Separate the durable question (can OOD tests tell reasoning from recall?) from the perishable limitation (do current OOD tests actually catch all memorization?). Cite what resolved it or plainly state where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any claiming OOD robustness or compositional transfer under realistic post-training.
(3) Propose 2 research questions that ASSUME the regime may have shifted: one assuming RL *has* since learned to generalize procedurally; one assuming OOD tests have grown less reliable as memorization became smarter.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
