What causes overfitting when forcing new facts into model weights?
This explores why writing new facts directly into a model's parameters tends to damage what it already knew — and what the corpus says about where that damage comes from.
The cleanest result in the corpus is that in-weight memorization is fundamentally capacity-limited: a model can store only so many facts in its parameters before storing more starts overwriting what is already there. The note "Can models store unlimited facts without growing larger?" proves this formally: factual recall in weights is bounded by model size, and in-weight finetuning degrades general capability by overwriting prior knowledge. So the 'overfitting' you see isn't only the new fact being memorized too rigidly; it's the new fact crowding out neighbors in a finite store.
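The crowding-out effect can be illustrated with a toy linear associative memory. This is a minimal sketch under stated assumptions, not the construction used in the note's proof: the helper names (`outer_add`, `recall`, `error`), the dimension `d = 3`, and all keys and values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def outer_add(W, value, key):
    """Write one key -> value association into W in place (W += value key^T)."""
    for i, v in enumerate(value):
        for j, k in enumerate(key):
            W[i][j] += v * k

def recall(W, key):
    """Read out the value stored under key (the product W @ key)."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

def error(a, b):
    """Largest absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

d = 3  # assumed toy dimension; the store is a d x d matrix
W = [[0.0] * d for _ in range(d)]

# d facts with orthonormal keys fit without interfering with each other.
keys = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
values = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
for k, v in zip(keys, values):
    outer_add(W, v, k)
err_before = [error(recall(W, k), v) for k, v in zip(keys, values)]

# A (d+1)-th key cannot be orthogonal to all d existing keys,
# so writing it necessarily disturbs at least one old fact.
s = 2 ** -0.5
new_key = [s, s, 0.0]          # overlaps with the keys of facts 0 and 1
new_value = [0.0, 0.0, 5.0]
outer_add(W, new_value, new_key)
err_after = [error(recall(W, k), v) for k, v in zip(keys, values)]

print("recall error before:", err_before)
print("recall error after: ", err_after)
```

In this toy, the first three facts are recalled exactly until the fourth is written; afterwards the two facts whose keys overlap with the new key are corrupted, while the non-overlapping one is untouched. Real models are nonlinear and far larger, so this only conveys the intuition of a finite, shared store.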
Where the damage lands is specific. The note "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" finds that direct fine-tuning corrupts knowledge storage in the lower layers of the model, which is where factual associations are stored, while approaches that leave the base weights untouched (shifting outputs at decoding time instead) preserve that knowledge and mainly nudge reasoning and style. That suggests a mechanism: fact-injection updates collide with the dense, shared parameters that hold everything else, so a gradient step that pins down one new fact also perturbs many old ones.
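A decoding-time shift of this kind can be sketched as logit arithmetic. The sketch below assumes the formulation commonly described for proxy-tuning, where the base model's next-token logits are offset by the difference between a small tuned "expert" and its untuned "anti-expert"; the function names and all logit values are hypothetical, and no real model is loaded.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift base logits by (expert - anti-expert); base weights are never modified."""
    shifted = [b + (e - a)
               for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)]
    return softmax(shifted)

# Made-up next-token logits over a three-token vocabulary.
base = [2.0, 1.0, 0.0]     # large frozen model
expert = [0.5, 1.5, 0.0]   # small tuned model
anti = [0.5, 0.5, 0.0]     # same small model before tuning

probs = proxy_tuned_probs(base, expert, anti)
unchanged = proxy_tuned_probs(base, anti, anti)  # no tuning signal, no shift
print(probs)
print(unchanged)
```

When the expert and anti-expert agree, the offset is zero and the base distribution comes back unchanged, which is the sense in which the base model's stored knowledge is left alone.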
This reframes overfitting as a distance-from-base problem, not just a too-many-epochs problem. The note "Does staying close to the base model preserve learning ability?" shows that the further training drags a model from its original distribution, the more it loses: staying close to the base preserves plasticity and the ability to keep learning, while parameter-only updates stall when the task domain shifts. The same tension shows up from the opposite direction in the note "Why do language models ignore information in their context?", where strong baked-in associations are so dominant that the model ignores contradicting information in its own prompt. Weights resist new facts (you have to push hard to overwrite a prior), and that hard push is exactly what spills over into forgetting.
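Distance from the base can be made concrete as a KL divergence between output distributions. The sketch below is illustrative only: the direction KL(tuned || base) is an assumption about how drift is measured, the function name is hypothetical, and the three distributions are made-up numbers, not results from the note.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same support.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up next-token distributions over a three-token vocabulary.
base = [0.5, 0.3, 0.2]
light_update = [0.45, 0.35, 0.2]   # small nudge away from the base
heavy_update = [0.05, 0.05, 0.9]   # a hard push toward one answer

drift_light = kl_divergence(light_update, base)
drift_heavy = kl_divergence(heavy_update, base)
print(f"light drift: {drift_light:.4f} nats")
print(f"heavy drift: {drift_heavy:.4f} nats")
```

In practice such a measure would be averaged over many prompts and positions; a single distribution is used here only to show that a hard push toward one answer registers as much larger drift than a gentle adjustment.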
The practical lever the corpus keeps returning to is: don't touch the weights. The note "Can editing hidden representations beat weight updates for finetuning?" learns interventions on frozen representations and beats weight-updating methods like LoRA by 10-50x on parameter efficiency, adapting behavior while leaving the stored weights untouched. Combined with the tool-use result from the capacity note (tool use enables unbounded factual recall where in-weight memorization is bounded by model size), a pattern emerges: facts that change or accumulate belong outside the parameters (retrieval, tools, decoding-time steering), and weights are best reserved for skills and style. The thing you didn't know you wanted to know: overfitting-on-facts is less a regularization failure and more a storage-architecture mismatch. You're writing mutable data into a fixed-size, densely-shared medium, and overwriting is the price.
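A representation-level intervention of this kind can be sketched in a few lines. The sketch assumes the low-rank form usually given for LoReFT, h + Rᵀ(Wh + b − Rh), where R projects the hidden state onto a small subspace; the function names, the dimensions, and every number are hypothetical, and nothing is trained here.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def loreft_intervention(h, R, W, b):
    """Edit hidden state h only inside the subspace spanned by the rows of R.

    Assumed form: h + R^T (W h + b - R h). The model's own weights stay frozen;
    only R, W and b would be learned.
    """
    Rh = matvec(R, h)                      # current coordinates in the subspace
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]  # desired coordinates
    delta = [t - c for t, c in zip(target, Rh)]
    # Map the r-dimensional correction back to the full space (R^T @ delta).
    back = [sum(R[i][j] * delta[i] for i in range(len(R))) for j in range(len(h))]
    return [hj + bj for hj, bj in zip(h, back)]

# Toy setup: hidden size 3, rank-1 subspace aligned with the first coordinate.
h = [1.0, 4.0, -3.0]
R = [[1.0, 0.0, 0.0]]   # orthonormal rows (trivially, with one row)
W = [[0.0, 0.0, 0.0]]
b = [2.0]

edited = loreft_intervention(h, R, W, b)
identity = loreft_intervention(h, R, R, [0.0])  # W = R, b = 0 leaves h unchanged
print(edited)
print(identity)
```

Only the coordinate inside the chosen subspace moves (from 1.0 to 2.0 in this toy), and the other two components of the hidden state pass through unchanged, which is the sense in which the edit adapts behavior without rewriting what the weights store.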
Sources (5 notes)
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.