Why does fine-tuning fail to remove temporal contamination from pretraining?
This explores why fine-tuning can't scrub out facts, associations, or 'stale' temporal knowledge baked in during pretraining — and the corpus suggests the reason is architectural: fine-tuning and pretraining touch different parts of the model.
Temporal contamination is one case of a broader pattern: fine-tuning struggles to reach knowledge stored during pretraining. The corpus points to a structural explanation: pretraining and fine-tuning act mainly on different layers of the model, so fine-tuning is working in the wrong place to delete a pretrained fact.
The sharpest evidence is the architectural split. Scaling experiments show that pretraining enriches factual knowledge in the model's lower layers, while fine-tuning mostly modifies behavior expression in the upper layers (see "Do pretraining and fine-tuning scale independently in language models?"). Proxy-tuning makes the same point from the opposite direction: tuning at decoding time preserves pretrained knowledge because it leaves the base model's weights untouched, whereas direct fine-tuning *corrupts* lower-layer knowledge storage; the distributional nudges mainly affect reasoning and style (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So fine-tuning changes how the model talks far more than what it knows, and a stale temporal association lives in the part that fine-tuning can disturb but does not cleanly rewrite.
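The decoding-time idea can be sketched in a few lines. This is a minimal illustration, assuming the commonly described form of proxy-tuning, in which the large base model's logits are shifted by the difference between a small tuned model and its untuned counterpart. The helper name `proxy_tuned_probs` and all numbers are hypothetical; nothing here is taken from the cited note beyond the idea of a distributional shift applied without touching base weights.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Hypothetical helper: shift base logits by (tuned - untuned) small-model logits.

    The base model's weights are never modified; only its output
    distribution is nudged at decoding time.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


# Toy three-token vocabulary; all numbers are made up for illustration.
base = [2.0, 1.0, 0.0]
tuned_small = [0.5, 1.5, 0.0]
untuned_small = [0.5, 0.5, 0.0]

# The small tuned model prefers token 1 more than its untuned twin does,
# so token 1 gains probability under the shifted distribution.
probs = proxy_tuned_probs(base, tuned_small, untuned_small)

# With no difference between tuned and untuned, the base distribution is unchanged.
unchanged = proxy_tuned_probs(base, untuned_small, untuned_small)
```

When the tuned and untuned small models agree, the shift is zero and the base model's distribution comes through untouched, which is the sense in which this kind of intervention leaves stored knowledge alone.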
This is why pretrained priors keep winning at inference. Models routinely ignore information placed in their context when a strong training-time association points the other way; textual prompting alone can't override the prior, and only causal intervention in the representations does (see "Why do language models ignore information in their context?"). The same stubbornness shows up under adversarial conditions: poisoned data injected at pretraining survives standard safety alignment for most attack types (see "How much poisoned training data survives safety alignment?"). If alignment can't remove deliberately planted content, it's no surprise it can't remove incidentally absorbed temporal facts.
There's also a subtler reason fine-tuning leaves pretraining intact: it tends to *amplify* what's already there rather than overwrite it. RL post-training converges on a single dominant format already present in the pretraining distribution and suppresses the others (see "Does RL training collapse format diversity in pretrained models?"), and RL fine-tuning sharpens existing memorization rather than installing new procedures, so models still collapse on out-of-distribution variants (see "Do fine-tuned language models actually learn optimization procedures?"). Fine-tuning re-weights and surfaces pretrained content; it doesn't perform deletion. Priming work reinforces this: whether a fact activates after a gradient update is predictable from its pre-learning probability, meaning the pretrained substrate sets the terms (see "Can we predict keyword priming before learning happens?").
The thing worth taking away: 'removing' knowledge isn't what fine-tuning does at all. It's a behavior-shaping operation layered on top of a knowledge store it can dent but not erase. If you actually need to evict temporal contamination, the corpus hints the leverage is elsewhere: decoding-time interventions (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"), parameter-isolation methods that target specific weight regions (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"), or direct causal edits to representations, rather than more gradient steps over the same base.
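To make "target specific weight regions" concrete, here is a minimal sketch of the basic mechanism behind parameter isolation: an update step that skips positions marked as frozen. It assumes plain SGD and a precomputed boolean mask; the function `masked_sgd_step` and the toy values are hypothetical, and the sketch does not reproduce the clustering or merging steps described in the cited note.

```python
def masked_sgd_step(weights, grads, frozen, lr=0.1):
    """Apply one SGD step, skipping every position marked as frozen."""
    return [
        w if is_frozen else w - lr * g
        for w, g, is_frozen in zip(weights, grads, frozen)
    ]


# Toy parameter vector; the frozen mask is assumed to come from some
# separate (not shown) procedure that identifies core parameter regions.
weights = [0.5, -1.0, 2.0, 0.0]
grads = [1.0, 1.0, -2.0, 4.0]
frozen = [True, True, False, False]

# Only the last two positions move; the frozen region is returned unchanged.
updated = masked_sgd_step(weights, grads, frozen)
```

The design point is that the mask, not the loss, decides which parameters a gradient step is allowed to touch, so an edit can be confined to a chosen region instead of spreading across the whole model.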
Sources (8 notes)
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, a result that contradicts sleeper-agent persistence hypotheses.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.