Why does fine-tuning models for continuous reasoning cause catastrophic forgetting?
This explores why training a model's own weights to reason in continuous (soft, non-token) space tends to erode its existing capabilities — and what the corpus suggests is actually going on.
The question turns on a specific tension: when you fine-tune an LLM to think in continuous space rather than in discrete tokens, you update the same weights that hold everything else the model knows, and the corpus suggests that is exactly where the damage comes from. The clearest statement of the problem comes from SoftCoT, which treats catastrophic forgetting as an architectural inevitability of weight updates, not a tuning bug. Its fix is telling: freeze the main model entirely and hand continuous-thought generation to a small auxiliary model attached alongside it (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). The fact that *not touching the weights* is the solution implies the cause: continuous reasoning objectives pull the shared parameters away from the configuration that encoded the model's pre-trained knowledge.
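The contrast between overwriting shared weights and training a side module can be seen in a toy setting. The sketch below is an illustration only, assuming a two-parameter linear model and two made-up regression tasks (`task_a`, `task_b`); it is not SoftCoT's method and says nothing about LLM-scale behaviour. It only shows that the old-task loss rises when the shared parameters are retrained, and cannot change when they are left frozen.

```python
# Toy illustration only: a 2-parameter linear model, not SoftCoT or an LLM.
# All names and tasks here are hypothetical.

def predict(w, x):
    return w[0] * x + w[1]

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def gradient_descent(w, data, steps=2000, lr=0.05):
    """Plain full-batch gradient descent on mean squared error."""
    w = list(w)  # copy, so the caller's weights are never mutated
    n = len(data)
    for _ in range(steps):
        g0 = sum(2 * (predict(w, x) - y) * x for x, y in data) / n
        g1 = sum(2 * (predict(w, x) - y) for x, y in data) / n
        w[0] -= lr * g0
        w[1] -= lr * g1
    return w

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
task_a = [(x, 2.0 * x + 1.0) for x in xs]   # stands in for pre-trained knowledge
task_b = [(x, -3.0 * x + 0.5) for x in xs]  # stands in for the new objective

base = gradient_descent([0.0, 0.0], task_a)

# Option 1: fine-tune the shared weights on task B.
overwritten = gradient_descent(base, task_b)

# Option 2: freeze `base` and fit a separate side module to the residual,
# so task-B predictions are base(x) + side(x).
residual_b = [(x, y - predict(base, x)) for x, y in task_b]
side = gradient_descent([0.0, 0.0], residual_b)

def frozen_plus_side(x):
    return predict(base, x) + predict(side, x)

loss_a_before = mse(base, task_a)
loss_a_after_overwrite = mse(overwritten, task_a)
loss_b_with_side = sum((frozen_plus_side(x) - y) ** 2 for x, y in task_b) / len(task_b)

print("task A loss, original weights:   ", loss_a_before)
print("task A loss, after fine-tuning B:", loss_a_after_overwrite)
print("task B loss, frozen base + side: ", loss_b_with_side)
```

In this toy the frozen variant keeps task A intact trivially, because nothing that task A depends on is ever updated; the cost is extra parameters that sit beside the original ones.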
Why are those weights so fragile? A few notes hint that the reasoning a model learns is more brittle and more entangled than it looks. One line of work finds that models don't learn general reasoning algorithms at all: they fit patterns tied to specific training instances, succeeding on anything that resembles what they've seen and breaking at novelty boundaries (see "Do language models fail at reasoning due to complexity or novelty?"). If reasoning competence is really a dense web of instance-specific patterns rather than a clean, separable skill, then retraining toward a new continuous objective overwrites that web indiscriminately. Relatedly, reasoning traces may function as computational scaffolding rather than meaningful content (see "Do reasoning traces need to be semantically correct?"), so optimizing the weights for a new kind of scaffold can quietly disrupt the old one.
There's also evidence that fine-tuning damages reasoning in a subtler way than outright forgetting. Even when accuracy holds, fine-tuning loosens the causal link between a model's reasoning steps and its final answer, so the chain becomes performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). That's a clue that weight updates for reasoning don't cleanly add a capability; they reshape internal computation in ways that can sever what was working before, which is forgetting by another name.
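One way to make "performative rather than functional" concrete is to measure how often the answer stays the same when the reasoning chain is cut short. The sketch below is a minimal illustration of that idea, assuming a hypothetical `answer_fn(question, steps)` interface and two stub models; it is not the protocol used in the cited note, which also covers paraphrasing and filler substitution.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, List[str]], str]

def truncation_invariance(answer_fn: AnswerFn, question: str, steps: List[str]) -> float:
    """Fraction of early-termination points where the answer matches the
    answer given with the full chain. Higher values mean the chain has
    less influence on the output."""
    if not steps:
        return 1.0
    full_answer = answer_fn(question, steps)
    unchanged = sum(
        answer_fn(question, steps[:k]) == full_answer
        for k in range(len(steps))  # keep 0, 1, ..., len(steps) - 1 steps
    )
    return unchanged / len(steps)

# Stub "model" whose answer is read off the last reasoning step it is given.
def functional_model(question: str, steps: List[str]) -> str:
    return steps[-1].split()[-1] if steps else "unknown"

# Stub "model" that ignores its reasoning chain entirely.
def performative_model(question: str, steps: List[str]) -> str:
    return "44"

question = "What is (17 + 5) * 2?"
chain = ["17 + 5 = 22", "22 * 2 = 44"]

functional_score = truncation_invariance(functional_model, question, chain)
performative_score = truncation_invariance(performative_model, question, chain)
print(functional_score, performative_score)
```

Both stubs give the same answer on the full chain, so accuracy alone cannot tell them apart; only the truncation score separates them.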
The most interesting lateral move in the corpus is that the field's escape route is consistently *don't use weight updates at all*. VOYAGER stores executable skills in an external, embedding-indexed library and composes new ones from old, explicitly to dodge the forgetting that weight-update methods cause (see "Can agents learn new skills without forgetting old ones?"). AgentFly pushes this further, doing continual learning and credit assignment entirely through episodic memory while the LLM's parameters stay frozen (see "Can agents learn continuously from experience without updating weights?"). Both treat the weights as a fixed substrate and put all the plasticity somewhere safer.
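The idea of an external, embedding-indexed skill library can be sketched in a few lines. This is a toy illustration, not VOYAGER's actual implementation: the `SkillLibrary` class, the bag-of-words `embed` function (standing in for a real embedding model), and the example skills are all assumptions made for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SkillLibrary:
    """Hypothetical append-only store: adding a skill never modifies old ones."""

    def __init__(self):
        self._skills = []  # (description, embedding, callable)

    def add(self, description, fn):
        self._skills.append((description, embed(description), fn))

    def retrieve(self, query):
        # Return the skill whose description is most similar to the query.
        q = embed(query)
        return max(self._skills, key=lambda skill: cosine(q, skill[1]))[2]

library = SkillLibrary()
library.add("chop wood from a tree", lambda inv: inv + ["wood"])
library.add("craft planks from wood",
            lambda inv: inv + ["planks"] if "wood" in inv else inv)

# A new skill composed from two retrieved skills.
def build_table(inv):
    inv = library.retrieve("get wood from tree")(inv)
    inv = library.retrieve("make planks out of wood")(inv)
    return inv + ["table"] if "planks" in inv else inv

library.add("build a crafting table", build_table)

print(library.retrieve("build a crafting table")([]))
print(library.retrieve("chop wood from a tree")([]))  # old skill still intact
```

Because learning here means appending an entry rather than changing parameters, earlier skills behave exactly as they did before the new one was added.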
So the synthesis the reader might not have expected: catastrophic forgetting here isn't really a property of *continuous reasoning* — it's a property of *learning by overwriting shared weights*, and continuous-reasoning fine-tuning just happens to be an aggressive instance of it. The recurring answer across very different research lines is the same architectural insight: keep the knowledge frozen and put the new learning beside it — a frozen backbone, an external skill library, or an episodic memory — rather than inside it.
Sources (6 notes)
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.