SYNTHESIS NOTE

Is LLM forgetting really knowledge loss or alignment loss?

When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.

Synthesis note · 2026-02-23 · sourced from Flaws
How do LLMs fail to know what they seem to understand?

The conventional story of catastrophic forgetting says LLMs lose old knowledge when learning new tasks. But controlled experiments suggest something different: a drop in performance does not by itself indicate knowledge loss. It can instead reflect task alignment loss, where the model's ability to apply existing knowledge to a specific task degrades while the underlying knowledge remains intact.

The evidence is striking: safety alignment established through 100,000+ training instances can appear to be undone by as few as 10 harmful examples. But the "lost" safety performance can be recovered by training on just 10 safety instances or even irrelevant tasks that never appeared in the original training. If the knowledge were truly forgotten, irrelevant retraining could not recover it.

The decomposition is simple: Task Performance = Task Alignment + Underlying Knowledge. What changes during continual learning is primarily the alignment component — the model's disposition to activate the right knowledge for the right task. The knowledge itself persists.
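The decomposition can be made concrete with a toy model. The sketch below is an illustration only, not the experimental setup behind the findings above: `ToyModel`, `GROUND_TRUTH`, the single scalar `gate`, and the step counts (1,000 steps standing in for large-scale safety training) are all hypothetical. It treats the two components as separable rather than literally additive. Knowledge is a frozen lookup table, alignment is one gate logit, and fine-tuning only ever moves the gate.

```python
import math
from dataclasses import dataclass, field


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical ground truth: which prompts are harmful.
GROUND_TRUTH = {"build a bomb": True, "bake a cake": False, "steal a car": True}


@dataclass
class ToyModel:
    # "Underlying knowledge": a frozen copy of the facts, never updated below.
    knowledge: dict = field(default_factory=lambda: dict(GROUND_TRUTH))
    # "Task alignment": logit of the probability that the model consults
    # its knowledge when deciding whether to refuse.
    gate: float = 0.0

    def probe_accuracy(self) -> float:
        """Read the knowledge directly, bypassing the gate."""
        hits = sum(self.knowledge[p] == GROUND_TRUTH[p] for p in GROUND_TRUTH)
        return hits / len(GROUND_TRUTH)

    def safety_performance(self) -> float:
        """Expected refusal rate on harmful prompts: gate times knowledge."""
        return sigmoid(self.gate) * self.probe_accuracy()

    def finetune(self, refuse: bool, steps: int, lr: float = 1.0) -> None:
        """Gradient ascent on the log-likelihood of the target behaviour.

        Only the gate moves; the knowledge table is untouched.
        """
        for _ in range(steps):
            p = sigmoid(self.gate)
            self.gate += lr * ((1.0 - p) if refuse else -p)


model = ToyModel()
model.finetune(refuse=True, steps=1000)   # scaled-down safety alignment
aligned = model.safety_performance()

model.finetune(refuse=False, steps=10)    # a handful of harmful examples
attacked = model.safety_performance()

model.finetune(refuse=True, steps=10)     # a handful of safety examples
recovered = model.safety_performance()

print(f"aligned={aligned:.3f} attacked={attacked:.3f} recovered={recovered:.3f}")
print(f"knowledge probe accuracy={model.probe_accuracy():.1f}")
```

In this toy, measured safety performance collapses and then largely rebounds while the direct probe of the knowledge table never changes. That is the pattern the decomposition describes. A real model has no such clean separation between gate and table, so this shows the shape of the claim rather than evidence for it.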

This reframes several alignment concerns. The vulnerability of safety training to "jailbreaking through fine-tuning" is not about erasing safety knowledge — it's about misaligning the activation pathway. The knowledge of what's safe and unsafe remains; the model simply stops applying it. This is recoverable, which is both reassuring (knowledge persists) and concerning (alignment is fragile).

The connection to the note "Does RL teach reasoning or just when to use it?" is direct: if RL teaches timing rather than capability, then "forgetting" after new training is timing disruption rather than capability loss. The mechanisms are parallel: activation alignment is what training modifies, and it is also what continual learning disrupts.

The in-weights adaptation bottleneck as a forgetting cause. Fast-Slow Training (FST) names the structural reason alignment is so fragile: treating parameter updates as the sole adaptation mechanism forces every improvement (a reusable skill, a task heuristic, even a transient lesson from recent rollouts) to be written into the same persistent weights. Because the whole policy lives in those weights, any update that raises in-domain reward simultaneously drags the model away from base behavior, reducing entropy and disrupting the activation pathways that, on this note's account, actually carry "alignment." That reframes spurious forgetting as a misallocation: task-specific and transient lessons are routed into weights that should hold only persistent behavior, so the alignment component (activation disposition) is exactly what gets perturbed. FST's remedy is to keep transient adaptation in an optimized textual context and let the slow weights drift up to 70% less in KL. It predicts less spurious forgetting precisely because it stops overwriting the activation alignment that access to the persisting knowledge depends on. The recoverability finding here and FST's prevention strategy are two views of one mechanism: knowledge survives in the weights; what breaks (and what FST protects) is the model's disposition to activate it.
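To make "drift in KL" concrete, here is a minimal sketch of how drift from base behavior and loss of entropy could be measured for a single next-token distribution. The three distributions are invented for illustration and are not FST results. The direction KL(tuned || base) is an assumption, since the note does not say which direction is used. The names `weights_only` and `fast_slow` are hypothetical labels.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# Invented next-token distributions over a four-token vocabulary.
base = [0.25, 0.25, 0.25, 0.25]          # base model
weights_only = [0.70, 0.10, 0.10, 0.10]  # all adaptation written into weights
fast_slow = [0.40, 0.20, 0.20, 0.20]     # weights moved less, context does the rest

drift_weights_only = kl_divergence(weights_only, base)
drift_fast_slow = kl_divergence(fast_slow, base)

# Relative reduction in drift, the kind of quantity a "less in KL" claim reports.
reduction = 1.0 - drift_fast_slow / drift_weights_only

print(f"drift (weights only): {drift_weights_only:.3f} nats")
print(f"drift (fast-slow):    {drift_fast_slow:.3f} nats")
print(f"relative reduction:   {reduction:.0%}")
print(f"entropy base / weights only: {entropy(base):.3f} / {entropy(weights_only):.3f}")
```

In practice such a measurement would be averaged over many prompts and token positions. The single-distribution version only shows what the two quantities in the paragraph above refer to.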

Original note title

spurious forgetting in LLMs is task alignment loss not knowledge loss — recoverable with minimal retraining