Can splitting adaptation into two channels reduce forgetting?

When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?

Synthesis note · 2026-05-28 · sourced from Training Fine Tuning

Treating parameter updates as the sole mechanism of adaptation creates a bottleneck: every improvement — a reusable reasoning skill, a task heuristic, even a transient lesson from recent rollouts — has to be written into the same persistent weights. Because the whole policy lives in those weights, any update that raises in-domain reward simultaneously drags the model away from its base behavior, reducing entropy, hurting out-of-distribution generalization, and eroding the model's ability to adapt to future tasks (plasticity loss).
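The drift described above can be made measurable. The sketch below is illustrative only: it assumes a toy three-outcome policy, uses KL divergence from the base distribution as the drift measure (the measure named in the related note on staying close to the base model), and picks the direction KL(tuned || base) as an assumption. The distributions and names (`base`, `lightly_tuned`, `heavily_tuned`) are invented for the example and are not results from the source.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-action distributions over the same three outcomes.
base = [0.50, 0.30, 0.20]           # base model
lightly_tuned = [0.55, 0.27, 0.18]  # small update, stays near the base
heavily_tuned = [0.90, 0.07, 0.03]  # large update, mass collapses on one outcome

for name, policy in [("light", lightly_tuned), ("heavy", heavily_tuned)]:
    print(name,
          "KL from base = %.3f" % kl_divergence(policy, base),
          "entropy = %.3f" % entropy(policy))
```

In this toy setting the heavily tuned policy is both further from the base and lower in entropy, which is the combination the paragraph above associates with worse out-of-distribution behavior and lost plasticity.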

Fast-Slow Training resolves this by refusing to make weights carry everything. It splits adaptation into a slow parametric component (model weights: expensive to update, holding long-lived behavior) and a fast textual component (prompts, instructions, and task context, optimized via reflective prompt evolution with GEPA). The fast channel absorbs task-specific and rapidly changing information from textual feedback; the slow channel consolidates only persistent behavior and stays closer to the base model. Interleaving the two (RL updates plus context optimization) reaches matched performance with 1.4–3x fewer optimizer steps and attains a higher asymptote, while leaving the model far closer to its origin.
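The interleaving idea can be illustrated with a deliberately tiny numeric analogy. This is not GEPA and not an RL implementation: it assumes a scalar "weight" and a scalar "context" whose sum is the model's output, with the fast channel taking large cheap steps and the slow channel taking small ones. The class `TwoChannelLearner`, the function `adapt`, and the learning rates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TwoChannelLearner:
    weights: float = 0.0   # slow channel: stand-in for model parameters
    context: float = 0.0   # fast channel: stand-in for an optimized prompt
    slow_lr: float = 0.05  # small steps: weights are costly to move
    fast_lr: float = 0.5   # large steps: context is cheap to rewrite

    def output(self) -> float:
        return self.weights + self.context

    def fast_step(self, target: float) -> None:
        self.context += self.fast_lr * (target - self.output())

    def slow_step(self, target: float) -> None:
        self.weights += self.slow_lr * (target - self.output())


def adapt(target: float, steps: int, use_fast_channel: bool):
    """Return (weight drift from base, remaining task error) after adaptation."""
    learner = TwoChannelLearner()
    base_weights = learner.weights
    for _ in range(steps):
        if use_fast_channel:
            learner.fast_step(target)  # context absorbs most of the error
        learner.slow_step(target)      # weights consolidate what is left
    drift = abs(learner.weights - base_weights)
    error = abs(target - learner.output())
    return drift, error


slow_only = adapt(target=1.0, steps=20, use_fast_channel=False)
fast_slow = adapt(target=1.0, steps=20, use_fast_channel=True)
print("slow only: drift=%.3f error=%.3f" % slow_only)
print("fast+slow: drift=%.3f error=%.3f" % fast_slow)
```

With the same number of steps, the interleaved run ends with lower task error and weights that have moved much less, because the fast channel has already absorbed most of the task-specific signal before each weight update. The toy only shows the qualitative mechanism; it says nothing about the 1.4–3x figure reported above.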

Why it matters: it reframes catastrophic forgetting as a misallocation problem rather than an inherent cost of learning. Forgetting happens because we force weights to store things that do not belong in weights. Route the transient and task-specific information into context and the weights stay general, so there is less to forget. This is a division-of-labor argument: the two channels operate at different timescales (an echo of System 1 vs System 2), and each does what it is suited for. The counterpoint is that the fast channel's capacity is bounded by context length and by the quality of prompt optimization, so genuinely large bodies of new knowledge still have to land in weights eventually.
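The allocation argument and its counterpoint can be sketched as a routing rule. The sketch assumes each lesson is already labeled transient or persistent and that a whitespace word count is an acceptable stand-in for token cost; both are simplifications, and `route_lessons` and the sample lessons are hypothetical.

```python
def route_lessons(lessons, context_budget):
    """Split lessons between the fast channel (context) and the slow channel (weights).

    lessons: list of (text, is_transient) pairs.
    context_budget: maximum total word count the context may hold.
    """
    in_context, to_weights = [], []
    used = 0
    for text, is_transient in lessons:
        cost = len(text.split())  # crude proxy for token count
        if is_transient and used + cost <= context_budget:
            in_context.append(text)   # cheap to rewrite, leaves weights alone
            used += cost
        else:
            to_weights.append(text)   # persistent, or too large for context
    return in_context, to_weights


lessons = [
    ("state units in every numeric answer", False),
    ("this grader rejects trailing whitespace", True),
    ("recent rollouts failed when the answer omitted the final unit", True),
]
in_context, to_weights = route_lessons(lessons, context_budget=6)
print("context:", in_context)
print("weights:", to_weights)
```

Persistent lessons go to weights by design, and a transient lesson that does not fit in the remaining budget overflows to weights as well. That overflow is the counterpoint in miniature: once the context is full, new material has nowhere else to go.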

Related concepts in this collection 6

Can prompt optimization teach models knowledge they lack? Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
the boundary condition on the fast channel: context optimization activates and steers but cannot store genuinely new knowledge, which is why slow weights remain necessary
Does prompt optimization without inference strategy fail? Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?
reinforces the interleaving design: the fast (prompt) channel and the slow (weight) channel must be co-optimized, not optimized separately
Can agents adapt without pausing service to users? Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
same fast/slow dual-timescale architecture in the agent setting; convergent design from a different angle
Can continuous reasoning avoid forgetting in instruction-tuned models? Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
alternative forgetting-avoidance strategy: offload to an auxiliary module rather than to textual context, but the same principle of keeping the base weights untouched
Can agents learn new skills without forgetting old ones? Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
another non-weight store for accumulating skills, supporting the general claim that adaptation should not all flow through parameters
Does staying close to the base model preserve learning ability? Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.
grounds: the mechanism behind the slow channel's payoff — keeping weights near the base (low KL drift) is precisely what preserves plasticity and reduces forgetting
