SYNTHESIS NOTE

Can splitting adaptation into two channels reduce forgetting?

When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?

Synthesis note · 2026-05-28 · sourced from Training Fine Tuning

Treating parameter updates as the sole mechanism of adaptation creates a bottleneck: every improvement — a reusable reasoning skill, a task heuristic, even a transient lesson from recent rollouts — has to be written into the same persistent weights. Because the whole policy lives in those weights, any update that raises in-domain reward simultaneously drags the model away from its base behavior, reducing entropy, hurting out-of-distribution generalization, and eroding the model's ability to adapt to future tasks (plasticity loss).

Fast-Slow Training resolves this by refusing to make weights carry everything. It splits adaptation into a slow parametric component (model weights, expensive to update, holding long-lived behavior) and a fast textual component (prompts, instructions, and task context, optimized via reflective prompt evolution with GEPA). The fast channel absorbs task-specific and rapidly changing information from textual feedback; the slow channel consolidates only persistent behavior and stays closer to the base model. Interleaving the two (RL updates plus context optimization) reaches matched performance with 1.4–3x fewer optimizer steps and a higher asymptote, while leaving the weights far closer to the base model.
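The interleaving can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and none of it comes from the source: the "model" is a single scalar weight `w`, the "context" is a scalar offset `c`, and the fast channel is a plain mutate-and-select search standing in for reflective prompt evolution (it is not GEPA). The helper names `slow_step`, `fast_step`, and `train` are hypothetical. The only point is the mechanism: when the fast channel absorbs part of the task-specific residual, the slow channel has less to move.

```python
import random

def loss(w, c, target):
    # Toy prediction: slow weight plus fast context offset.
    return (w + c - target) ** 2

def slow_step(w, c, target, lr=0.05):
    # One gradient step on the slow channel only (context held fixed).
    grad = 2.0 * (w + c - target)
    return w - lr * grad

def fast_step(w, c, target, rng, n_candidates=8, scale=0.5):
    # Stand-in for prompt evolution: propose mutated contexts and keep
    # the best one, with the current context included as a candidate.
    candidates = [c] + [c + rng.gauss(0.0, scale) for _ in range(n_candidates)]
    return min(candidates, key=lambda cand: loss(w, cand, target))

def train(target, steps, use_fast, seed=0):
    rng = random.Random(seed)
    w_base = 0.0          # hypothetical base-model weight
    w, c = w_base, 0.0
    for _ in range(steps):
        if use_fast:
            c = fast_step(w, c, target, rng)  # fast channel first
        w = slow_step(w, c, target)           # then one slow update
    return {"loss": loss(w, c, target), "drift": abs(w - w_base)}

slow_only = train(target=3.0, steps=50, use_fast=False)
fast_slow = train(target=3.0, steps=50, use_fast=True)
print("slow only:", slow_only)
print("fast+slow:", fast_slow)
```

In this toy setting the interleaved run ends with a smaller `drift` (distance of `w` from its starting value) at no worse loss, because each accepted context mutation shrinks the residual that the gradient step would otherwise push into the weight. It says nothing about the magnitude of the effect in real language models.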

Why it matters: it reframes catastrophic forgetting as a misallocation problem rather than an inherent cost of learning. Forgetting happens because we force weights to store things that do not belong in weights. Route the transient and task-specific into context, and the weights stay general, so there is less to forget. This is a division-of-labor argument: the two channels operate at different timescales (an echo of System 1 vs System 2) and each does what it is suited for. The counterpoint is that the fast channel's capacity is bounded by context length and prompt-optimization quality, so genuinely large bodies of new knowledge still have to land in weights eventually.

Original note title

splitting adaptation into slow weights and fast textual context avoids catastrophic forgetting and plasticity loss