Why does fine-tuning models for continuous reasoning cause catastrophic forgetting?

This note explores why training a model's own weights to reason in continuous (soft, non-token) space tends to erode its existing capabilities, and what the corpus suggests is actually going on.

The question points to a specific tension: when you fine-tune an LLM to think in continuous space rather than discrete tokens, you are updating the same weights that hold everything else the model knows, and the corpus suggests that is exactly where the damage comes from. The clearest statement of the problem comes from SoftCoT, which treats catastrophic forgetting as an architectural inevitability of weight updates, not a tuning bug. Its fix is telling: freeze the main model entirely and hand continuous-thought generation to a small auxiliary model attached alongside it (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). That *not touching the weights* is the solution implies the cause: continuous reasoning objectives pull the shared parameters away from the configuration that encoded the model's pre-trained knowledge.
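The mechanism can be made concrete with a deliberately tiny sketch. Everything here is an assumption for illustration: the "model" is a single scalar weight, `task_a` stands in for pre-trained knowledge, `task_b` stands in for the new objective, and the side parameter `v` is a hypothetical stand-in for an auxiliary module. It is not SoftCoT's code or architecture, only the overwrite-versus-freeze contrast in miniature.

```python
# Toy illustration only: scalar linear "models", not an LLM and not SoftCoT.

def sgd_fit(w, data, lr=0.05, epochs=200):
    """Fit y ~= w * x by plain SGD on squared error; return the updated weight."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

def fit_side(w_frozen, data, lr=0.05, epochs=200):
    """Train only a side parameter v; w_frozen is read but never updated."""
    v = 0.0
    for _ in range(epochs):
        for x, y in data:
            v -= lr * 2 * ((w_frozen + v) * x - y) * x
    return v

def mse(predict, data):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]    # stands in for pre-trained knowledge
task_b = [(x, -3.0 * x) for x in xs]   # stands in for the new objective

# "Pre-training": learn task A.
w_pre = sgd_fit(0.0, task_a)

# Option 1: fine-tune the shared weight on task B, then re-check task A.
w_shared = sgd_fit(w_pre, task_b)
loss_a_after_overwrite = mse(lambda x: w_shared * x, task_a)

# Option 2: freeze w_pre and learn task B in a side parameter.
v_side = fit_side(w_pre, task_b)
loss_a_frozen = mse(lambda x: w_pre * x, task_a)            # old path untouched
loss_b_side = mse(lambda x: (w_pre + v_side) * x, task_b)   # new path uses v

print(loss_a_after_overwrite, loss_a_frozen, loss_b_side)
```

In this toy, the shared-weight run ends with a large task-A error because the only parameter that encoded task A was moved to fit task B, while the frozen run keeps the task-A error near zero and still fits task B through the side parameter. A real model spreads knowledge across many parameters, so the effect is far less clean than this.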

Why are those weights so fragile? A few notes hint that the reasoning a model learns is more brittle and more entangled than it looks. One line of work finds that models don't learn general reasoning algorithms at all: they fit patterns tied to specific training instances, succeeding on anything that resembles what they've seen and breaking at novelty boundaries (see "Do language models fail at reasoning due to complexity or novelty?"). If reasoning competence is really a dense web of instance-specific patterns rather than a clean, separable skill, then retraining toward a new continuous objective overwrites that web indiscriminately. Relatedly, reasoning traces may function as computational scaffolding rather than meaningful content (see "Do reasoning traces need to be semantically correct?"), so optimizing the weights for a new kind of scaffold can quietly disrupt the old one.

There's also evidence that fine-tuning damages reasoning in a subtler way than outright forgetting. Even when accuracy holds, fine-tuning loosens the causal link between a model's reasoning steps and its final answer, so the chain becomes performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). That's a clue that weight updates for reasoning don't cleanly add a capability; they reshape internal computation in ways that can sever what was working before, which is forgetting by another name.
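One way to picture the "performative versus functional" distinction is an early-termination check: cut the reasoning chain short and see whether the final answer moves. The sketch below is a hypothetical harness, not the cited work's protocol. `answer_fn(question, steps)` is an assumed interface for a model call, and the two toy "models" are stand-ins written only to show the two extremes.

```python
# Hypothetical harness: answer_fn(question, steps) stands in for a model call
# that returns a final answer given a (possibly truncated) reasoning chain.

def early_termination_invariance(answer_fn, question, steps):
    """Fraction of truncation points where the answer matches the full-chain answer."""
    full_answer = answer_fn(question, steps)
    truncations = [steps[:k] for k in range(len(steps))]  # keep 0 .. len-1 steps
    if not truncations:
        return 1.0
    unchanged = sum(answer_fn(question, t) == full_answer for t in truncations)
    return unchanged / len(truncations)

# Two toy stand-ins for models (assumptions, for illustration only).
def functional_model(question, steps):
    # The answer is computed from the reasoning steps: the sum of their numbers.
    return sum(int(s) for s in steps)

def performative_model(question, steps):
    # The answer ignores the reasoning chain entirely.
    return 6

steps = ["1", "2", "3"]
functional_score = early_termination_invariance(functional_model, "1+2+3?", steps)
performative_score = early_termination_invariance(performative_model, "1+2+3?", steps)

print(functional_score, performative_score)
```

A higher invariance score means the answer depends less on the chain. Comparing the score before and after fine-tuning on the same questions is the kind of measurement the note describes, though real evaluations would use many questions and a real model rather than these stubs.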

The most interesting lateral move in the corpus is that the field's escape route is consistently *don't use weight updates at all*. VOYAGER stores executable skills in an external, embedding-indexed library and composes new ones from old, explicitly to dodge the forgetting that weight-update methods cause (see "Can agents learn new skills without forgetting old ones?"). AgentFly pushes this further, doing continual learning and credit assignment entirely through episodic memory while the LLM's parameters stay frozen (see "Can agents learn continuously from experience without updating weights?"). Both treat the weights as a fixed substrate and put all the plasticity somewhere safer.
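The idea of an external, embedding-indexed skill store can be sketched in a few lines. This is a minimal illustration under stated assumptions, not VOYAGER's implementation: `SkillLibrary`, `embed`, and the example skills are hypothetical names, and the bag-of-words `embed` function is a placeholder for a learned embedding model.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SkillLibrary:
    """Hypothetical external store: skills are appended, never overwritten by training."""

    def __init__(self):
        self._skills = []  # list of (description, embedding, callable)

    def add(self, description, fn):
        self._skills.append((description, embed(description), fn))

    def retrieve(self, query):
        """Return (description, fn) for the stored skill most similar to the query."""
        q = embed(query)
        description, _, fn = max(self._skills, key=lambda s: cosine(q, s[1]))
        return description, fn

library = SkillLibrary()
library.add("chop wood from a tree", lambda inv: inv + ["wood"])
library.add("craft planks from wood", lambda inv: inv + ["planks"] if "wood" in inv else inv)

# Compose a new skill from two retrieved ones and store it alongside them.
_, chop = library.retrieve("chop tree")
_, craft = library.retrieve("craft planks")
library.add("gather planks starting from nothing", lambda inv: craft(chop(inv)))

name, skill = library.retrieve("gather planks")
result = skill([])
print(name, result)
```

Adding the composed skill leaves the earlier entries untouched, so a query for the older skill still retrieves it. That append-only property is the point of the design: new competence lives next to old competence rather than replacing it.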

So the synthesis the reader might not have expected: catastrophic forgetting here isn't really a property of *continuous reasoning* — it's a property of *learning by overwriting shared weights*, and continuous-reasoning fine-tuning just happens to be an aggressive instance of it. The recurring answer across very different research lines is the same architectural insight: keep the knowledge frozen and put the new learning beside it — a frozen backbone, an external skill library, or an episodic memory — rather than inside it.

Sources (6 notes)

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed when the model was trained on similar instances.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about catastrophic forgetting in LLM fine-tuning for continuous reasoning. The question remains open: *Why does fine-tuning on continuous reasoning damage pre-trained knowledge?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A curated library identified these constraints:
• Fine-tuning on continuous-reasoning objectives causes weight-based forgetting because shared parameters encode pre-trained knowledge densely; retraining overwrites instance-specific reasoning patterns indiscriminately (2024–2025).
• Models learn reasoning as brittle, instance-level pattern-matching, not separable algorithmic skills; novel cases break at generalization boundaries (2025).
• Fine-tuning loosens causal fidelity in chain-of-thought: reasoning steps become performative rather than functionally coupled to answers, even when accuracy holds (2024).
• The field's consensus escape: avoid weight updates entirely. Freeze the backbone (SoftCoT, 2025), use external skill libraries (VOYAGER, compositional approaches), or episodic memory (AgentFly) to add plasticity without overwriting (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2502.12134 (SoftCoT, Feb 2025): auxiliary module bolted to frozen LLM.
• arXiv:2411.15382 (Fine-Tuning Impact, Nov 2024): degradation of chain-of-thought faithfulness post-tuning.
• arXiv:2605.12978 (Useful Memories, May 2026): continuous updating corrupts stored reasoning.
• arXiv:2604.08377 (SkillClaw, Apr 2026): collective skill evolution via episodic composition.

Your task:
(1) **RE-TEST THE WEIGHT-FRAGILITY THESIS.** Since these papers, have new optimizers, LoRA variants, or adapter strategies actually RELAXED the forgetting constraint while keeping weights plastic? Has scaling (larger models, longer training) or architectural changes (mixture-of-experts, modular reasoning heads) shifted the regime? Separate the durable question—*why does shared-weight retraining for new objectives disrupt old ones?*—from the perishable claim—*it's inevitable without external memory*.
(2) **Surface the strongest DISAGREEMENT or SUPERSEDING work.** Have papers in the last 6 months claimed that fine-tuning *can* preserve reasoning fidelity, or that the instance-level vs. algorithmic debate has been resolved in favor of one side? Flag any contradictory empirical result.
(3) **Propose 2 research questions assuming the regime may have moved:** e.g., (a) Does differential plasticity—tuning only certain weight subsets or layers—actually preserve reasoning while adding continuous-thought capability? (b) Can weight updates to continuous-reasoning objectives be made orthogonal to pre-trained knowledge in representation space?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
