How does in-weights adaptation create spurious forgetting in models?

This explores why baking new learning directly into a model's weights (via fine-tuning or RL) tends to erase capabilities it already had — and what the corpus suggests is actually going wrong.

This reads the question as: when you adapt a model by changing its parameters rather than its context, why does old knowledge degrade, and is that loss inevitable? The corpus's most striking suggestion is that it isn't. Forgetting looks less like a hard physical limit and more like a misallocation problem: you're writing a task-specific lesson into the wrong storage medium. The note "Can splitting adaptation into two channels reduce forgetting?" makes this explicit: route the lesson into an optimized prompt (fast, textual) and keep parameter edits minimal (slow), and you reach equivalent performance 1.4–3x faster with substantially less forgetting. The forgetting wasn't the cost of learning; it was the cost of learning in the wrong place.

Why does the weight channel do the damage? The note "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" offers a concrete mechanism: direct fine-tuning corrupts knowledge stored in the model's lower layers, whereas shifting only the output distribution at decoding time leaves that stored knowledge intact and mostly touches reasoning and style. So the spurious part of 'spurious forgetting' is that you overwrite factual storage while only meaning to change behavior. On this account, the deeper you reach into the weights, the more collateral damage you risk.
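To make the decoding-time alternative concrete, here is a minimal sketch of the logit arithmetic usually described for proxy-tuning: the large base model's logits are shifted by the difference between a small tuned model and its untuned counterpart, and no base weights are edited. The function names, the four-token vocabulary, and every number below are illustrative assumptions, not values from the source notes.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by (small tuned - small untuned).

    The base model's weights are never modified; only its next-token
    distribution is steered at decoding time.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Hypothetical 4-token vocabulary; all logits are made up for illustration.
base = [2.0, 1.0, 0.5, 0.1]          # large untuned model
expert = [0.2, 1.5, 0.3, 0.0]        # small tuned model
antiexpert = [0.2, 0.5, 0.3, 0.0]    # small untuned model

untuned_probs = softmax(base)
steered_probs = proxy_tuned_probs(base, expert, antiexpert)
```

In this toy case the small models disagree only on the second token, so only that token's logit moves. Whatever the base model encodes about the other tokens reaches the output unchanged, which is the intuition behind leaving stored knowledge intact.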

There's a quieter, second flavor of forgetting that isn't about facts at all but about flexibility. The note "Does RL training collapse format diversity in pretrained models?" shows RL amplifying one pretraining format within the first epoch while collapsing the alternatives the base model could have produced, and the surviving format depends on model scale, not necessarily on performance. The model 'forgets' that it knew other ways to answer. The note "Does staying close to the base model preserve learning ability?" connects this to future learning: parameter-only RL drifts far from the base distribution and then stalls when the task domain changes, while models trained with the fast-slow split stay up to 70% closer to base and keep their ability to learn. Forgetting and loss of plasticity look like the same wound seen from two angles.
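Both symptoms in this paragraph can be expressed as simple distributional diagnostics. The sketch below computes the entropy of output formats (for diversity collapse) and the KL divergence from the base distribution (for drift). The format labels, the probabilities, and the choice of KL(tuned || base) as the drift measure are assumptions made for illustration. A real measurement would need a format classifier and sampled model outputs.

```python
import math
from collections import Counter

def format_entropy(format_labels):
    """Shannon entropy in bits of the empirical distribution over output formats."""
    total = len(format_labels)
    counts = Counter(format_labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical format labels for 10 sampled answers (made-up data).
base_samples = ["prose"] * 4 + ["code"] * 3 + ["latex"] * 3
rl_samples = ["code"] * 10

entropy_base = format_entropy(base_samples)  # several formats still in play
entropy_rl = format_entropy(rl_samples)      # a single format survives

# Hypothetical distributions over the same three formats (prose, code, latex).
base_dist = [0.40, 0.30, 0.30]
weights_only = [0.02, 0.96, 0.02]  # parameter-only RL, far from base
two_channel = [0.30, 0.45, 0.25]   # prompt + minimal weight edits, nearer to base

drift_weights_only = kl_divergence(weights_only, base_dist)
drift_two_channel = kl_divergence(two_channel, base_dist)
```

With these made-up numbers, the collapsed policy has zero format entropy and the larger drift. That pairing is what the paragraph describes, though the sketch illustrates the measurements rather than reproducing any reported result.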

This reframes a counterintuitive finding: the note "Does reinforcement learning update only a small fraction of parameters?" shows that RL touches only a small (5–30%), structured slice of parameters that is nearly identical across random seeds. You might expect such surgical updates to be safe. The corpus's reading is that the damage isn't about how many weights move but about which representations they sit on top of. A small, nearly full-rank edit to a load-bearing subnetwork can still collapse format diversity or corrupt lower-layer storage.
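The "how many weights move" half of this claim is easy to measure. The sketch below compares a base and a tuned weight vector and reports both the fraction and the positions of changed entries. The flat ten-element vectors and the helper name are hypothetical; real checkpoints would be compared tensor by tensor.

```python
def update_report(base_weights, tuned_weights, tol=0.0):
    """Return (fraction changed, indices changed) between two weight vectors."""
    if len(base_weights) != len(tuned_weights):
        raise ValueError("weight vectors must have the same length")
    changed = [
        i
        for i, (b, t) in enumerate(zip(base_weights, tuned_weights))
        if abs(t - b) > tol
    ]
    return len(changed) / len(base_weights), changed

# Hypothetical flattened weights (made-up values).
base = [0.10, -0.20, 0.05, 0.30, -0.15, 0.00, 0.25, -0.05, 0.12, 0.40]
tuned = list(base)
tuned[3] += 0.02   # pretend RL nudged these two parameters
tuned[7] -= 0.01

fraction, indices = update_report(base, tuned)
```

The fraction alone says the edit was sparse. The index list is the part that matters for the argument above: whether those positions sit under representations the rest of the model depends on is something a count cannot tell you.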

The constructive throughline is that the field has an escape hatch: stop adapting in-weights at all. The notes "Can agents learn from failure without updating their weights?", "Can agents learn continuously from experience without updating weights?", and "Can agents learn new skills without forgetting old ones?" each show agents improving, sometimes dramatically (87.88% on GAIA validation, lifelong skill compounding in VOYAGER), by writing lessons into external memory or skill libraries instead of weights. If you never edit the storage, you can't corrupt it. The thing you didn't know you wanted to know: 'catastrophic forgetting' may be better understood as a question about where a model keeps its lessons than about how much it can hold.
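A toy version of the external-memory idea shows why it sidesteps forgetting: skills are appended to a library and retrieved by similarity, so learning something new never overwrites something old. This is a sketch under stated assumptions, not VOYAGER's implementation. The `SkillLibrary` class is hypothetical, word-count cosine similarity stands in for a learned embedding, and the skill bodies are placeholder strings.

```python
import math
from collections import Counter

class SkillLibrary:
    """Toy external memory: skills live outside the model and are only ever added."""

    def __init__(self):
        self._skills = []  # list of (description, code) pairs

    def add(self, description, code):
        self._skills.append((description, code))

    @staticmethod
    def _similarity(a, b):
        """Cosine similarity over word counts (stand-in for an embedding model)."""
        ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(ca[w] * cb[w] for w in ca)
        norm_a = math.sqrt(sum(v * v for v in ca.values()))
        norm_b = math.sqrt(sum(v * v for v in cb.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def retrieve(self, query, k=1):
        """Return the k stored skills whose descriptions best match the query."""
        ranked = sorted(
            self._skills,
            key=lambda skill: self._similarity(query, skill[0]),
            reverse=True,
        )
        return ranked[:k]

library = SkillLibrary()
library.add("craft a wooden pickaxe", "def craft_wooden_pickaxe(): ...")
library.add("mine iron ore", "def mine_iron_ore(): ...")
before = library.retrieve("craft a pickaxe")[0]

# Learning a new skill appends to the library; nothing stored is rewritten.
library.add("smelt iron ingot in furnace", "def smelt_iron_ingot(): ...")
after = library.retrieve("craft a pickaxe")[0]
```

The old skill is retrieved identically before and after the new one is learned, which is the property weight updates cannot guarantee. The trade-off, not modeled here, is that retrieval quality becomes the bottleneck as the library grows.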

Sources 8 notes

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Does in-weights adaptation necessarily cause spurious forgetting, or is forgetting a symptom of adapting in the wrong medium?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as time-bound:
- Direct parameter fine-tuning corrupts lower-layer factual storage while only intending to shift behavior; proxy-tuning at decoding preserves knowledge better (2024).
- Splitting adaptation into minimal weight edits + optimized prompts (fast context) achieves 1.4–3× faster learning with far less forgetting (2025).
- RL post-training amplifies one pretraining format within the first epoch while collapsing the alternatives; separately, models trained with fast-slow adaptation stay up to 70% closer to the base distribution than parameter-only RL, and that lower KL drift preserves the ability to keep learning (2025–2026).
- RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across seeds; the library's reading is that damage depends on which representations are edited, not on how many weights move (2025).
- External memory and skill libraries (episodic storage, compositional libraries) enable learning without weight updates, reaching 87.88% on GAIA validation and lifelong skill compounding (2024–2026).

Anchor papers (verify; mind their dates):
- arXiv:2501.17161 (2025) — SFT vs. RL generalization contrast
- arXiv:2504.07912 (2025) — RL amplification of pretraining formats
- arXiv:2505.11711 (2025) — sparse subnetwork updates
- arXiv:2605.12484 (2026) — fast-and-slow adaptation framework

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, assess whether newer methods (e.g., mixture-of-experts routing, attention-only adaptation, SAFe/LoRA variants) or scaled training pipelines have since relaxed or overturned it. Separate durable insight (forgetting *is* a medium-choice problem) from perishable limitation (specific architectures still suffer format collapse). Ground contradictions in real work.
(2) **Surface strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months that argues forgetting is *not* avoidable, or that weight-based adaptation outperforms memory-based approaches on relevant benchmarks.
(3) **Propose 2 research questions** assuming the regime has moved: e.g., Can unified weight + memory adaptation (hybrid) outperform pure-external-memory on continual, long-horizon tasks? Does format diversity preservation require explicit regularization, or do scaled models naturally maintain it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
