Does staying close to the base model preserve learning ability?
Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.
One variable quietly connects forgetting, generalization, and the ability to keep learning: how far training pushes the policy from its base distribution, measured as KL divergence. The Fast-Slow result makes the relationship explicit. FST-trained models stay up to 70% closer to the base LLM in KL than models trained with parameter-only RL, and that reduced drift is not just a forgetting story. It preserves plasticity: after training on one task, FST models adapt more effectively to a subsequent task, while parameter-only RL stalls when task domains change on the fly.
The pattern is that drift and plasticity trade off. Each parameter update that improves in-domain reward also moves the model toward a sharper, lower-entropy policy specialized to that task. That specialization is exactly what makes the model less able to absorb the next task: the weights have committed. By keeping most task-specific adaptation in the fast textual channel and letting the slow weights move only a little, FST holds the policy near its flexible base, where it retains the entropy and breadth needed to learn again. Low KL drift is the leading indicator; preserved plasticity and reduced forgetting are downstream consequences.
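A minimal sketch of the two quantities in play, KL drift from the base distribution and policy entropy, may make the trade-off concrete. The logits below are invented for illustration over a hypothetical 4-token vocabulary; they are not taken from the Fast-Slow experiments. The direction KL(trained || base) is an assumption that follows common RLHF practice.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical next-token logits (illustrative values only).
base = softmax([1.0, 0.8, 0.6, 0.4])    # broad base policy
light = softmax([1.4, 0.8, 0.5, 0.3])   # weights moved only a little
heavy = softmax([5.0, 0.5, 0.0, -0.5])  # weights specialized hard to one task

kl_light = kl_divergence(light, base)
kl_heavy = kl_divergence(heavy, base)

print(f"light drift: KL={kl_light:.4f}  entropy={entropy(light):.4f}")
print(f"heavy drift: KL={kl_heavy:.4f}  entropy={entropy(heavy):.4f}")
print(f"base entropy: {entropy(base):.4f}")
```

In this toy setting the heavily specialized distribution shows both a larger KL from base and a much lower entropy, which is the coupling the paragraph above describes. A real measurement would average the per-token KL over sampled prompts and continuations rather than use a single distribution.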
Why it matters: it gives continual learning a measurable target. Rather than treating "don't forget" and "stay adaptable" as separate desiderata to engineer, you can watch a single quantity, distance from base, and recognize that overshooting it is what produces both forgetting and plasticity loss. It also reframes KL regularization (already standard in RLHF as a leash): it is not merely a stability or alignment-preservation device but the mechanism that keeps the model trainable in the future. The counterpoint: staying near base also caps how much any single task can specialize the weights, so for a one-shot deployment with no future tasks, aggressive drift may be the better trade.
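As a rough illustration of the "leash" and of watching a single quantity, the sketch below shapes a task reward with a KL penalty and checks the drift against a budget. It uses the common single-sample estimate of sequence-level KL, the sum of per-token log-probability differences on sampled tokens. The names `beta` and `budget`, the helper functions, and all numbers are hypothetical, and nothing here is confirmed as the method used in the Fast-Slow work.

```python
def sequence_kl_estimate(logp_policy, logp_base):
    """Single-sample KL estimate: sum of per-token log-prob differences
    on tokens sampled from the trained policy."""
    return sum(lp - lb for lp, lb in zip(logp_policy, logp_base))

def kl_penalized_reward(task_reward, logp_policy, logp_base, beta=0.1):
    """Task reward minus beta times the estimated drift from base.
    beta is a hypothetical penalty coefficient."""
    return task_reward - beta * sequence_kl_estimate(logp_policy, logp_base)

def within_drift_budget(kl_estimate, budget):
    """True while estimated drift stays under a chosen (hypothetical) budget."""
    return kl_estimate <= budget

# Illustrative per-token log-probabilities of one sampled response.
logp_policy = [-0.2, -0.1, -0.3]  # under the trained policy
logp_base = [-1.0, -0.9, -1.1]    # under the frozen base model

drift = sequence_kl_estimate(logp_policy, logp_base)
shaped = kl_penalized_reward(1.0, logp_policy, logp_base, beta=0.1)

print(f"estimated drift: {drift:.3f}")
print(f"shaped reward:   {shaped:.3f}")
print(f"within budget of 2.0 nats: {within_drift_budget(drift, 2.0)}")
```

A larger `beta` keeps the policy nearer to base at the cost of in-domain specialization, which is the counterpoint raised above: with no future tasks in view, a loose leash may be the right call.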
Inquiring lines that use this note as a source (53)
This note is a source for these synthesized inquiries.
- Why do proprietary models improve with training while open-source models decline?
- Why does fine-tuning for continuous space cause catastrophic forgetting?
- How do training objectives shape what a world model actually learns?
- How do early layers preserve unbiased information while late layers conform?
- Does the model learn depth-wise drift as an explicit strategy?
- Can self-distillation reduce catastrophic forgetting in continual learning?
- How do different training objectives shift whether models over-predict or under-predict?
- How does distributional distance from pre-training relate to model difficulty?
- How much does memorization capacity limit a model's ability to learn new information?
- When should model isolation be preferred over weight-averaging approaches?
- How can weak-to-strong progressive training target planning without interfering with grounding?
- Can gradient approximation at equilibrium replace backpropagation through time in practice?
- How does training data distribution determine what models can learn?
- What makes certain bond distributions more learnable than others?
- What causes irreversible model collapse when training on model-generated content?
- How do residual connections and layer norm stabilize training in deep RL?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- Does foundational model training or user priors more strongly shape final outputs?
- Why does consistency training make models resistant to prompt perturbations?
- Why do student models learn better from internal pruning versus external compression?
- Why do weaker models generate better training data than stronger models?
- What role does a model's representational structure play in learning?
- How does representational convergence differ from policy entropy collapse in iterative training?
- Can continuous spectrum training outperform sequential SFT-then-RL stages?
- Can gradient-based influence estimation make test-time training more efficient?
- Does sparse parameter updating improve test-time training's computational cost?
- Why does monological training prevent models from overriding statistical priors?
- How do RL training and base models differ in creating MI peaks?
- Why do rare cases in medicine and science require models that preserve tail distributions?
- What happens to model capability as weight sparsity increases during training?
- Why does prolonged RL discover strategies absent from any base model sample?
- Does weight decay directly cause contractive behavior near training examples?
- How tight should a textual learning rate be before it prevents skill escape?
- Why do queries with low cross-rollout variance produce degenerate gradients?
- What happens to model grounding when preference optimization increases effective diversity?
- Why does the order of training examples matter for what models learn?
- Can we predict out-of-distribution generalization without access to downstream tasks?
- How much can externalized skills improve models before hitting diminishing returns?
- How does KL regularization prevent both forgetting and adaptation loss?
- Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?
- Does importance sampling actually recover capabilities lost to hard sample training?
- Can population-level distributions shift usefully even when individual prediction fails?
- How do complementary learning systems explain the need for fast and slow consolidation?
- What makes a learned consolidation rule lossy and where does contamination enter?
- How does the optimal difficulty band shift as the model's capabilities improve during training?
- What mechanisms cause overly hard samples to degrade prior model performance?
- How does in-weights adaptation create spurious forgetting in models?
- How can a forgetting policy preserve rare knowledge while preventing over-generalization?
- What causes overfitting when forcing new facts into model weights?
- Can training order and structure shape what networks retain and learn?
- Why do optimal learning dynamics improve scaling law coefficients specifically?
- How do newly learned facts become accessible after gradient updates?
- Why does adaptation concentrate in low-dimensional subspaces of weights or representations?
Related concepts in this collection (4)
- Can splitting adaptation into two channels reduce forgetting?
  When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?
  The architecture that achieves the low KL drift; this note isolates KL drift as the mechanism linking that architecture to preserved plasticity.
- Can agents adapt without pausing service to users?
  Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
  A continual-learning design that likewise minimizes disruptive weight movement by routing fast adaptation elsewhere.
- Can agents learn continuously from experience without updating weights?
  This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.
  The limiting case: zero weight drift via external memory, trading parametric plasticity preservation for a retrieval-based store.
- Can frozen language models continually improve through memory structure alone?
  If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
  Frozen-weight continual improvement (KL drift exactly zero), the extreme end of the drift-versus-plasticity spectrum this note describes.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Self-distillation Enables Continual Learning
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- How new data permeates LLM knowledge and how to dilute it
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
- Spurious Forgetting in Continual Learning of Language Models
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Original note title
lower kl drift from the base model preserves plasticity enabling stronger continual learning on later tasks