What causes overfitting when forcing new facts into model weights?

This explores why writing new facts directly into a model's parameters tends to damage what it already knew — and what the corpus says about where that damage comes from.

The cleanest result in the corpus is that in-weight memorization is fundamentally capacity-limited: a model can only store so many facts in its parameters before storing more starts overwriting what's already there. "Can models store unlimited facts without growing larger?" proves this formally: factual recall in weights is bounded by model size, and in-weight finetuning degrades general capability by overwriting prior knowledge. So the 'overfitting' you see isn't only the new fact being memorized too rigidly; it's the new fact crowding out neighbors in a finite store.
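To make the finite-store picture concrete, here is a toy sketch, not anything taken from the cited note: a two-parameter linear model that already stores two facts exactly, then has a third forced in by gradient descent on that fact alone. The model, the facts, and the names `predict`, `inject_fact`, and `old_error` are all illustrative assumptions.

```python
def predict(w, x):
    """Linear 'recall': dot product of weights and a fact's key vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def inject_fact(w, x, y, lr=0.1, steps=200):
    """Gradient descent on squared error for ONE (key, value) pair only."""
    w = list(w)
    for _ in range(steps):
        err = predict(w, x) - y
        w = [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]
    return w


def old_error(w, facts):
    """Summed squared error on previously stored facts."""
    return sum((predict(w, x) - y) ** 2 for x, y in facts)


# Two parameters, two old facts: the base weights store both exactly.
old_facts = [((1.0, 0.0), 1.0), ((0.0, 1.0), 2.0)]
w_base = [1.0, 2.0]

# A third fact whose key overlaps both old keys (hypothetical values).
new_key, new_value = (1.0, 1.0), 5.0
w_edit = inject_fact(w_base, new_key, new_value)

print("new fact recalled:", round(predict(w_edit, new_key), 6))
print("old-fact error before:", old_error(w_base, old_facts))
print("old-fact error after: ", round(old_error(w_edit, old_facts), 6))
```

With three facts and only two parameters, something has to give: the new fact is recalled, but the error on the two old facts rises from zero, because the only way to fit the new key is to move weights the old facts depend on.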

Where the damage lands is specific. "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" finds that direct fine-tuning corrupts knowledge storage in the lower layers of the model, exactly where factual associations live, while approaches that leave the base weights untouched (shifting outputs at decoding time instead) preserve that knowledge and only nudge reasoning and style. That points to a mechanism: fact-injection updates collide with the dense, shared parameters that hold everything else, so a gradient step that pins down one new fact also perturbs many old ones.
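A minimal sketch of what "shifting outputs at decoding time" can look like. The combination rule used here (base logits plus the difference between a tuned and an untuned small model) is my understanding of how proxy-tuning is usually formulated, not something this page spells out, and every logit below is a made-up number. `proxy_shift` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_shift(base_logits, tuned_small, untuned_small):
    """Steer the base model's logits by the small models' tuned-minus-untuned gap.

    The base model's weights are never modified; only its output logits are.
    """
    return [b + (t - u) for b, t, u in zip(base_logits, tuned_small, untuned_small)]


vocab = ["Paris", "Sure!", "idk"]       # toy vocabulary (assumption)
base = [3.0, 0.5, 0.2]                  # large base model: knows the fact
untuned_small = [1.0, 0.0, 0.5]         # small model before tuning
tuned_small = [1.0, 1.0, -0.5]          # small model after tuning: style shift only

steered = proxy_shift(base, tuned_small, untuned_small)
p_base, p_steered = softmax(base), softmax(steered)

top_base = vocab[p_base.index(max(p_base))]
top_steered = vocab[p_steered.index(max(p_steered))]
print(top_base, top_steered)
```

In this toy setup the small models agree on the factual token, so their difference leaves it alone and only moves probability among the stylistic tokens, while the base model's parameters stay exactly as they were.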

This reframes overfitting as a distance-from-base problem, not just a too-many-epochs problem. "Does staying close to the base model preserve learning ability?" shows that the further training drags a model from its original distribution, the more it loses: staying close to the base preserves plasticity and the ability to keep learning, while parameter-heavy updates stall when the domain shifts. The same tension shows up from the opposite direction in "Why do language models ignore information in their context?": strong baked-in associations are so dominant that the model ignores contradicting information in its own prompt. Weights resist new facts (you have to push hard to overwrite a prior), and that hard push is exactly what spills over into forgetting.
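Distance from base can be put on a number line. The sketch below measures drift as KL divergence between a tuned model's next-token distribution and the base model's, since the source note talks about "KL drift"; the direction of the KL and all three distributions are assumptions chosen for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Next-token distributions over three candidate answers (made-up numbers).
base = [0.70, 0.20, 0.10]    # the prior baked into the base model
gentle = [0.60, 0.30, 0.10]  # a light update that stays near the base
forced = [0.02, 0.97, 0.01]  # a hard push that overwrites the prior

drift_gentle = kl_divergence(gentle, base)
drift_forced = kl_divergence(forced, base)

print(f"gentle update drift: {drift_gentle:.4f} nats")
print(f"forced update drift: {drift_forced:.4f} nats")
```

The forced update lands far from the base distribution, which is the regime the note associates with lost plasticity; the gentle update barely moves.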

The practical lever the corpus keeps returning to is: don't touch the weights. "Can editing hidden representations beat weight updates for finetuning?" describes ReFT, which learns interventions on frozen representations and beats weight-updating methods like LoRA by 10-50x on parameter efficiency, because it adapts behavior without disturbing stored knowledge at all. Combined with the tool-use result from the first note (tool use enables unbounded factual recall, while in-weight recall is bounded by model size), a pattern emerges: facts that change or accumulate belong outside the parameters (retrieval, tools, decoding-time steering), and weights are best reserved for skills and style. The thing you didn't know you wanted to know: overfitting-on-facts is less a regularization failure and more a storage-architecture mismatch. You're writing mutable data into a fixed-size, densely-shared medium, and overwriting is the price.
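For a feel of what an intervention on a frozen representation looks like, here is a small sketch of a low-rank edit to a hidden vector. The form h + Rᵀ(Wh + b − Rh) is my recollection of the LoReFT formulation and should be checked against the paper (arXiv:2404.03592); the vectors, the rank, and the helper names are illustrative assumptions, and nothing here is trained.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def loreft(h, R, W, b):
    """Edit h only inside the subspace spanned by the rows of R.

    R: r x d projection (rows assumed orthonormal), W: r x d, b: length r.
    The model's own weights are not touched; only h is rewritten.
    """
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]  # where the subspace should be
    current = matvec(R, h)                                 # where it is now
    delta = [t - c for t, c in zip(target, current)]
    return [hj + sum(R[i][j] * delta[i] for i in range(len(R)))
            for j, hj in enumerate(h)]


def intervention_params(d, r):
    """Trainable values in one such intervention: R and W (r*d each) plus b (r)."""
    return 2 * r * d + r


h = [1.0, 2.0, 3.0]        # frozen hidden state (toy, d = 3)
R = [[1.0, 0.0, 0.0]]      # rank-1 subspace: the first coordinate
W = [[0.0, 0.0, 1.0]]
b = [0.5]

edited = loreft(h, R, W, b)
print(edited)
print(intervention_params(4096, 4), "vs", 4096 * 4096, "for a dense d x d update")
```

Only the coordinate inside the chosen subspace changes; everything orthogonal to it passes through untouched, which is one way to read "adapts behavior without disturbing stored knowledge".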

Sources 5 notes

Can models store unlimited facts without growing larger?

A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher auditing whether model weight-injection overfitting constraints have been relaxed. The question remains: what causes overfitting when forcing new facts into model weights, and has that mechanism shifted?

What a curated library found — and when (findings span 2023–2026; dated claims, not current truth):
• Factual recall capacity in weights is formally bounded by model size; in-weight finetuning degrades general capability by overwriting prior knowledge (2025).
• Direct fine-tuning corrupts knowledge storage in lower layers where factual associations live, while decoding-time interventions preserve base knowledge (2024–2025).
• Staying close to base-model distribution preserves plasticity and continual learning; parameter-heavy updates stall when domain shifts (2026).
• Strong pre-training associations dominate so heavily that models ignore contradicting in-context information; weights resist updates, and collateral damage is the price of forcing new facts (2026).
• Frozen-representation interventions (ReFT-style) beat weight-updating methods by 10–50× parameter efficiency by adapting behavior without disturbing stored knowledge (2024).

Anchor papers (verify; mind their dates):
• arXiv:2404.03592 (ReFT, 2024); arXiv:2605.12484 (Continual Learning, 2026); arXiv:2603.29025 (Surface Heuristics & Constraints, 2026); arXiv:2508.20755 (Tool Learning, 2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether post-2026 models, scaling laws, MoE / adapter-stack tooling, or retrieval-augmented orchestration have RELAXED the storage bottleneck or collision risk. Does the bounded-capacity story still hold at 100B+ scale? Can in-weight finetuning now preserve base knowledge via regularization breakthroughs? Cite what resolved it, or confirm the constraint persists.
(2) Surface the strongest work from the last ~6 months that CONTRADICTS the weight-as-mutable-storage framing, especially papers arguing facts *should* live in parameters, or showing parameter updates don't collide as predicted.
(3) Propose 2 research questions that assume the regime may have moved: (a) if capacity bottlenecks have eased, does collateral damage scale sublinearly now? (b) if frozen-representation methods are dominant, what new overfitting modes emerge at decoding time?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
