How does in-weights adaptation create spurious forgetting in models?
This explores why baking new learning directly into a model's weights (via fine-tuning or RL) tends to erase capabilities it already had — and what the corpus suggests is actually going wrong.
This reads the question as: when you adapt a model by changing its parameters rather than its context, why does old knowledge degrade, and is that loss inevitable? The corpus's most striking suggestion is that it isn't. Forgetting looks less like a hard physical limit and more like a misallocation problem: you're writing a task-specific lesson into the wrong storage medium. The note "Can splitting adaptation into two channels reduce forgetting?" makes this explicit: route the lesson into an optimized prompt (the fast, textual channel) and keep parameter edits minimal (the slow channel), and you reach equivalent performance 1.4–3x faster with substantially less forgetting. The forgetting wasn't the cost of learning; it was the cost of learning in the wrong place.
Why does the weight channel do the damage? The note "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" offers a concrete mechanism: direct fine-tuning corrupts knowledge stored in the model's lower layers, whereas shifting only the output distribution at decoding time leaves that stored knowledge intact and mostly touches reasoning and style. So the spurious part of 'spurious forgetting' is that you overwrite factual storage while only meaning to change behavior. The deeper you reach into the weights, the more collateral damage.
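The decoding-time idea can be sketched in a few lines. This is a minimal illustration, assuming the usual proxy-tuning recipe in which a large untuned model's next-token logits are shifted by the difference between a small tuned model (the "expert") and the same small model untuned (the "anti-expert"). The function names, the three-token vocabulary, and the logit values are all hypothetical; the source notes only state that the base weights stay untouched.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_logits(base, expert, anti_expert):
    """Shift the base model's logits by (expert - anti_expert), token by token.

    Nothing here writes to any model's parameters: the adjustment lives
    entirely in the output distribution at decoding time.
    """
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]

# Hypothetical next-token logits over a 3-token vocabulary.
base_logits = [2.0, 1.0, 0.5]         # large untuned model
expert_logits = [0.5, 1.5, 0.2]       # small tuned model
anti_expert_logits = [0.5, 0.5, 0.2]  # same small model, untuned

shifted = proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
probs = softmax(shifted)
```

In this toy case the tuned small model prefers token 1 more than its untuned counterpart does, so token 1 gains exactly that margin in the large model's distribution, while the large model's own logits (and whatever knowledge produced them) are left as they were.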
There's a quieter, second flavor of forgetting that isn't about facts at all but about flexibility. The note "Does RL training collapse format diversity in pretrained models?" shows RL amplifying one pretraining format within the first epoch while collapsing the alternatives the base model could have produced, and the surviving format tracks model scale, not necessarily performance. The model 'forgets' that it knew other ways to answer. The note "Does staying close to the base model preserve learning ability?" connects this to future learning: parameter-only RL drifts far from the base distribution and then stalls when the task domain changes, while a model that stays up to 70% closer to base keeps its ability to learn subsequent tasks. Forgetting and loss of plasticity turn out to be the same wound seen from two angles.
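Drift from the base distribution and collapse of diversity can both be put in numbers. The sketch below is illustrative only: it assumes drift is measured as KL(policy || base) over a small set of output formats and diversity as Shannon entropy, and the three distributions are made-up values, not results from the source notes.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats; lower means fewer formats remain in play."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical probabilities over four answer formats.
base = [0.4, 0.3, 0.2, 0.1]             # pretrained model
low_drift = [0.5, 0.3, 0.15, 0.05]      # adapted, but still close to base
collapsed = [0.97, 0.01, 0.01, 0.01]    # one format amplified, rest collapsed

drift_low = kl_divergence(low_drift, base)
drift_high = kl_divergence(collapsed, base)
```

The collapsed policy sits far from base and has much lower entropy, which is one way to read the claim that forgetting other formats and losing room to adapt are the same effect.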
This reframes a counterintuitive finding: the note "Does reinforcement learning update only a small fraction of parameters?" shows that RL touches only a small, structured subset of parameters that is nearly identical across seeds. You might expect such surgical updates to be safe, but the damage isn't about how many weights move; it's about which representations they sit on top of. A small, nearly full-rank edit to a load-bearing subnetwork can still collapse format diversity or corrupt lower-layer storage.
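Sparsity and rank are independent properties, which is why "few weights moved" does not imply "small effect". As a toy illustration (hypothetical matrices, not data from the cited work), an update that changes only 10% of a weight matrix's entries can be full-rank, while an update that changes every entry can be rank one.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry here.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def fraction_changed(update):
    """Share of entries in the update matrix that are nonzero."""
    total = sum(len(row) for row in update)
    nonzero = sum(1 for row in update for x in row if x != 0)
    return nonzero / total

n = 10
# Hypothetical sparse update: only the diagonal entries move.
sparse_update = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
# Hypothetical dense update: every entry moves, but it is an outer product.
dense_low_rank = [[(i + 1) * (j + 1) for j in range(n)] for i in range(n)]
```

Here `sparse_update` changes one entry in ten yet spans all ten directions, whereas `dense_low_rank` changes everything along a single direction. Counting moved weights alone says little about how much of the representation space an edit can reach.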
The constructive throughline is that the whole field has an escape hatch: stop adapting in-weights at all. The notes "Can agents learn from failure without updating their weights?", "Can agents learn continuously from experience without updating weights?", and "Can agents learn new skills without forgetting old ones?" each show agents improving, sometimes dramatically (87.88% on GAIA validation, lifelong skill compounding in VOYAGER), by writing lessons into external memory or skill libraries instead of weights. If you never edit the storage, you can't corrupt it. The thing you didn't know you wanted to know: 'catastrophic forgetting' may be better understood as a question about where a model keeps its lessons than about how much it can hold.
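The shape of that escape hatch can be sketched as a toy. Everything below is hypothetical: the `LessonMemory` class, the keyword-overlap retrieval (a crude stand-in for embedding-based lookup), and the example lessons are illustrative assumptions, not the design of Reflexion, AgentFly, or VOYAGER. The point is only that learning happens by appending to a store and reading from it, with no model parameters in the loop.

```python
class LessonMemory:
    """Append-only store of textual lessons, keyed by the task's words."""

    def __init__(self):
        self.entries = []  # list of (set of task words, lesson text)

    def write(self, task, lesson):
        """Record a lesson learned on a task; earlier entries are never edited."""
        self.entries.append((set(task.lower().split()), lesson))

    def retrieve(self, task, k=2):
        """Return up to k lessons whose source task shares words with this one."""
        words = set(task.lower().split())
        scored = [(len(words & kw), lesson) for kw, lesson in self.entries]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [lesson for _, lesson in scored[:k]]

def build_prompt(task, memory):
    """Prepend retrieved lessons to the task; a frozen model would read this."""
    lessons = memory.retrieve(task)
    header = "".join(f"Lesson: {lesson}\n" for lesson in lessons)
    return f"{header}Task: {task}"

memory = LessonMemory()
memory.write("craft iron pickaxe", "Smelt iron ore before crafting.")
memory.write("fight zombie at night", "Equip a shield first.")
prompt = build_prompt("craft iron sword", memory)
```

Adding the second lesson cannot degrade the first, because writing is an append rather than an overwrite. That is the structural difference from a weight update, where every new lesson shares storage with every old one.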
Sources (8 notes)
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.