Why do large language models outperform fine-tuned models once repeated items are removed?

This explores a memorization-vs-generalization story: the idea that fine-tuning often teaches a model to recognize examples it has effectively already seen, so removing repeated or near-duplicate items exposes how little real reasoning the fine-tuned model installed, and a less-specialized large model that generalizes wins on the genuinely novel items. The corpus has a surprisingly direct answer to this, and it's less about model size than about what fine-tuning actually does.

The sharpest evidence comes from out-of-distribution stress tests. When researchers build "N-1" variants of problems (same structure, swapped specifics, so the answer can't be retrieved from memory), RL fine-tuned models drop sharply relative to their performance on in-distribution problems (see "Do fine-tuned language models actually learn optimization procedures?"). The interpretation there is blunt: methods like GRPO sharpen template-matching rather than installing a reasoning procedure. A related finding shows models don't actually execute iterative numerical methods at all; they recognize a problem as template-similar to something memorized and emit plausible-but-wrong values, a failure that persists across scale and training approach (see "Do large language models actually perform iterative optimization?"). So when repeated items are present, the fine-tuned model looks strong because it's matching; remove them and the crutch disappears.
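To make "remove the repeats and re-score" concrete, here is a minimal sketch of how an evaluation set could be split into items that overlap the training data and items that do not. It is not the method used in any of the cited notes: the function names, the word-trigram Jaccard measure, the 0.5 threshold, and the toy data are all assumptions chosen for illustration. Note also that a surface-overlap filter like this would count an "N-1" style variant as a repeat even though its answer differs, so it is only a first-pass check.

```python
import re


def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        # Short texts: fall back to one tuple holding all the words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_by_overlap(train_items, test_items, threshold=0.5, n=3):
    """Return (repeated, novel): indices of test items that do / do not
    have a near-duplicate in the training items (hypothetical criterion)."""
    train_grams = [ngrams(t, n) for t in train_items]
    repeated, novel = [], []
    for idx, item in enumerate(test_items):
        grams = ngrams(item, n)
        seen = any(jaccard(grams, tg) >= threshold for tg in train_grams)
        (repeated if seen else novel).append(idx)
    return repeated, novel


def accuracy(correct_flags, indices):
    """Fraction of correct answers over `indices`; None if the split is empty."""
    if not indices:
        return None
    return sum(correct_flags[i] for i in indices) / len(indices)


# Toy, made-up data purely for illustration.
train = [
    "Minimize x squared plus 3 x with step size 0.1",
    "Find the shortest path from A to B in the graph",
]
test = [
    "Minimize x squared plus 3 x with step size 0.1",   # exact repeat
    "Minimize x squared plus 3 x with step size 0.2",   # near-duplicate
    "Schedule five jobs on two machines to minimize lateness",  # novel
]
correct = [True, True, False]  # hypothetical per-item grading of one model

repeated, novel = split_by_overlap(train, test)
print("repeated:", repeated, "accuracy:", accuracy(correct, repeated))
print("novel:", novel, "accuracy:", accuracy(correct, novel))
```

Reporting the two accuracies side by side, for both the fine-tuned model and the larger general model, is what lets the gap described above show up: a model that is mostly matching should score well on the repeated split and poorly on the novel one.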

The twist the corpus adds is that fine-tuning isn't uniformly a memorization trap; it depends on what signal you give it. Small models fine-tuned with DPO on both correct and incorrect examples outperform plain supervised fine-tuning, precisely because the explicit negative examples target rigid failure modes instead of just reinforcing seen outputs (see "Can small models match large models on function calling?"). That's the flip side of the same coin: supervised fine-tuning on repeated correct examples teaches the model to reproduce, while training that contrasts right against wrong pushes toward something more transferable.
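For readers who want to see where the "contrast" enters, the standard DPO objective can be written as below. This is the commonly published form of the loss, not something quoted from the notes. The symbols are the usual ones and are assumptions relative to this page: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $y_w$ and $y_l$ the preferred and rejected responses to prompt $x$, $\beta$ a temperature-like coefficient, and $\sigma$ the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Plain supervised fine-tuning, by comparison, only maximizes $\log \pi_\theta(y_w \mid x)$ on the correct examples. The second term inside $\sigma$ is what explicitly pushes probability away from the rejected output, which is the negative signal the paragraph above credits for the difference.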

There's also a deeper ceiling lurking underneath. Self-improvement in LLMs is formally bounded by a generation-verification gap: a model can't reliably fix itself without an external signal to validate the fix (see "What stops large language models from improving themselves?"). Fine-tuning on repeated items is, in a sense, a weak external signal that mostly confirms what the model already does. Once the repeats are gone, there's nothing left to lean on, and the broader, less-overfit base model's general capability re-asserts itself.

What's worth carrying away: the gap you see when duplicates are removed isn't really "big model beats small model." It's "generalization beats memorization," and fine-tuning sits on either side of that line depending on whether it teaches a procedure or just rehearses answers. If you want to feel where this bites in practice, the linguistic blind-spot work shows the same surface-pattern-vs-deep-rule divide from another angle: models capture statistical surface patterns but miss the underlying grammatical structure (see "Why do large language models fail at complex linguistic tasks?").

Sources (5 notes)

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst examining whether fine-tuned LLMs truly underperform larger base models when repeated training items are removed, or whether the gap reflects a memorization-vs-generalization tradeoff that may have shifted with newer methods, model architectures, or evaluation practices.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable:

• Fine-tuned models rely on template-matching and memorization rather than installed reasoning; removing in-distribution repeats exposes sharp drops in performance versus base models (~2024–2025).
• RL fine-tuning methods (GRPO, similar) sharpen pattern-recognition instead of teaching iterative procedures; models emit plausible-but-wrong outputs by recognizing structural similarity to memorized examples (~2024–2025).
• DPO fine-tuning on explicit negative examples outperforms supervised fine-tuning on repeated positives alone, suggesting signal quality (contrast) matters more than repetition (~2024).
• Self-improvement in LLMs is formally bounded by a generation-verification gap; fine-tuned models cannot reliably validate their own fixes without external signal (~2025).
• RL post-training amplifies pretraining behaviors rather than teaching novel reasoning; models don't execute latent iterative methods (~2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2410.18890 (Oct 2024): DPO on small models, function-calling reasoning.
- arXiv:2412.02674 (Dec 2024): Self-improvement and the generation-verification gap.
- arXiv:2504.07912 (Apr 2025): Echo Chamber — RL amplifies pretraining, not novel learning.
- arXiv:2603.03415 (Mar 2026): OOD mechanisms in LLMs under distribution shift.

Your task:

(1) **RE-TEST EACH CONSTRAINT.** For each finding above, investigate whether (a) newer model families (post-o1, reasoning models), (b) hybrid fine-tuning methods (rejection sampling, process reward models, constitutional AI), (c) retrieval-augmented generation or in-context learning strategies, or (d) better OOD evaluation suites have since relaxed or overturned it. Separate the durable question — *do fine-tuned models rely on memorization more than base models?* — from perishable limitations (e.g., *GRPO cannot teach reasoning*). Be explicit: does the constraint still hold?

(2) **Surface contradicting or superseding work from the last ~6 months.** Look for papers claiming fine-tuning or RL *does* transfer procedurally, or that base models *do* underperform fine-tuned ones even OOD. Name the tension and cite arXiv IDs.

(3) **Propose two research questions that assume the regime may have moved.** Assume newer methods *can* teach transferable reasoning. Then ask: (a) what OOD test suite would definitively separate memorization from generalization in post-2025 fine-tuned models? (b) does the memorization-vs-generalization story change if the base model is smaller than the fine-tuned variant?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
