Can continuous reasoning avoid forgetting in instruction-tuned models?
Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
Continuous-space reasoning methods like Coconut and Compressed CoT have shown promising results by replacing discrete token sequences with latent representations. However, these methods require full-model fine-tuning, and when applied to already-capable instruction-tuned models like LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct, performance falls below the zero-shot chain-of-thought (CoT) baseline. The degradation is attributed to catastrophic forgetting: these models already have strong reasoning capability, and fine-tuning them for continuous-space operation destroys it.
This is an important practical finding because it reveals a gap between proof-of-concept (Coconut works on GPT-2) and deployment reality (Coconut's approach fails on the models people actually use). The capability that makes instruction-tuned models valuable is exactly what full fine-tuning compromises.
SoftCoT resolves this by architectural separation: freeze the backbone LLM entirely and delegate continuous thought generation to a small auxiliary assistant model. The assistant generates a sequence of "soft thought tokens" — last-layer hidden states conditioned on the task instruction and specific instance. These soft thoughts are mapped into the LLM's representation space via a trainable projection module, then prepended as instance-specific prompts.
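The data flow described above can be sketched with toy numbers. This is a minimal illustration, not the SoftCoT implementation: the random matrices stand in for the assistant's last-layer hidden states and the backbone's prompt embeddings, the projection is assumed to be a single linear layer, and all names and sizes (`D_ASSIST`, `D_LLM`, `NUM_THOUGHTS`, `PROMPT_LEN`) are hypothetical.

```python
import random


def make_matrix(rows, cols, rng):
    """Toy stand-in for a tensor: a rows x cols list of small random floats."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(hidden_states, weight, bias):
    """Linear map from the assistant's width to the backbone's width."""
    return [
        [sum(h[i] * weight[i][j] for i in range(len(h))) + bias[j]
         for j in range(len(bias))]
        for h in hidden_states
    ]


def build_backbone_input(soft_thoughts, prompt_embeddings):
    """Prepend the projected soft thoughts to the backbone's prompt embeddings."""
    return soft_thoughts + prompt_embeddings


rng = random.Random(0)
D_ASSIST, D_LLM = 4, 6            # assumed toy hidden widths
NUM_THOUGHTS, PROMPT_LEN = 3, 5   # assumed toy sequence lengths

# Stand-ins: assistant last-layer hidden states, and the backbone's own
# embeddings of the prompt (the backbone is never updated).
assistant_hidden = make_matrix(NUM_THOUGHTS, D_ASSIST, rng)
prompt_embeddings = make_matrix(PROMPT_LEN, D_LLM, rng)

# The only parameters treated as trainable in this sketch: the projection.
proj_weight = make_matrix(D_ASSIST, D_LLM, rng)
proj_bias = [0.0] * D_LLM

soft_thoughts = project(assistant_hidden, proj_weight, proj_bias)
backbone_input = build_backbone_input(soft_thoughts, prompt_embeddings)
trainable_params = D_ASSIST * D_LLM + D_LLM

print(len(backbone_input), len(backbone_input[0]), trainable_params)
```

The backbone sees a sequence that is `NUM_THOUGHTS` positions longer than the prompt, and the original prompt embeddings pass through unchanged. The trainable parameter count depends only on the two hidden widths, not on the size of the backbone.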
The design draws on two established ideas. From prompt tuning: the soft thoughts function as learned instance-adaptive prompts that tailor the LLM's behavior per problem. From speculative decoding: a small model generates proposals that a large model consumes. The projection module bridges the representational gap between assistant and backbone, and training this module for each task is equivalent to soft prompt tuning.
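One way to write down the soft-prompt-tuning reading of this setup, as a sketch rather than the paper's own notation. The symbols are assumptions introduced here: $f_\theta$ is the projection module, $h_a(x)$ the assistant's hidden states for input $x$, $E_\phi(x)$ the backbone's embeddings of the prompt, and $\phi$ the frozen backbone weights.

```latex
\min_{\theta}\; -\sum_{(x,\,y)\in\mathcal{D}}
  \log p_{\phi}\!\left(y \,\middle|\, \big[\, f_{\theta}(h_a(x));\; E_{\phi}(x) \,\big]\right),
\qquad \phi \ \text{fixed}
```

Only $\theta$ receives gradients, as in prompt tuning. The difference is that the prefix $f_\theta(h_a(x))$ varies with each input $x$ instead of being one learned sequence shared across the task.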
By staying in the latent space (using hidden states rather than decoded tokens from the assistant), SoftCoT avoids the information loss inherent in autoregressive decoding while preserving the backbone's pre-trained knowledge completely.
The contrast with Soft Thinking (see "Can we explore multiple reasoning paths without committing to one token?") is instructive. Soft Thinking is training-free and operates within a single model by modifying inference. SoftCoT requires training the projection module that connects assistant and backbone, but in exchange achieves cross-model continuous reasoning: the assistant can be small and cheap while the backbone remains frozen and capable. They address different deployment scenarios: Soft Thinking for enhancement with no training cost, SoftCoT for task-specific optimization without backbone risk.
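For comparison, the single-model idea of keeping the full probability distribution across token embeddings, rather than committing to one token, can be sketched as follows. This is a toy illustration under stated assumptions, not the Soft Thinking implementation: the three-token vocabulary, the two-dimensional `embedding_table`, and the `logits` are all made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_next_input(probs, embedding_table):
    """Standard decoding: commit to the most likely token's embedding."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return embedding_table[best]


def soft_next_input(probs, embedding_table):
    """Keep the whole distribution: probability-weighted mix of embeddings."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 1.9, -3.0]  # tokens 0 and 1 are nearly tied

probs = softmax(logits)
hard = hard_next_input(probs, embedding_table)
soft = soft_next_input(probs, embedding_table)

print(hard)
print(soft)
```

With two nearly tied candidates, the hard choice discards the runner-up entirely, while the soft input carries substantial weight from both. No parameters are trained in either branch, which is the sense in which this style of method changes inference only.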
The forgetting finding also validates the architectural choice in "Can models reason without generating visible thinking tokens?": Coconut's continuous thought approach works when training from scratch but fails as a retrofit to existing capable models. This suggests the field needs both training-time latent reasoning architectures (for new models) and inference-time or frozen-backbone approaches (for enhancing existing models).
Inquiring lines that use this note as a source (40)
This note is a source for these synthesized inquiries.
- Do integrated and decoupled architectures trade off intervention accuracy for efficiency differently?
- What architectural changes would enable better common-ground tracking?
- Why does fine-tuning for continuous space cause catastrophic forgetting?
- Can continuum memory systems prevent catastrophic forgetting in neural networks?
- What makes self-modifying architectures learn their own update rules?
- What architectural variables make entropy-based patching work at 8B scale?
- Can self-distillation reduce catastrophic forgetting in continual learning?
- Why does full multi-task fine-tuning perform worse than sequential training?
- How can safety-aligned parameters be protected during user-specific fine-tuning?
- Why do monolithic systems resist autonomous optimization attempts?
- Why does fine-tuning degrade reasoning quality even as accuracy improves?
- Why does mixed instruction data sometimes hurt specific model capabilities?
- How do trait adapters interact with different base model architectures?
- Why does fine-tuning improve some capabilities while degrading others?
- What makes memory trajectories topologically stable under persistent reuse?
- Can prompt optimization or fine-tuning inject knowledge models do not already contain?
- How do retention gates regularize forgetting across different sequence model architectures?
- Why does KTO skip supervised fine-tuning while DPO cannot?
- Does fine-tuning actually change model capabilities or only output distribution?
- Why does instruction tuning hurt knowledge-intensive tasks more than reasoning tasks?
- Does scaling reasoning capability create tradeoffs with instruction following?
- How does scaling reasoning capability actually reduce instruction-following ability?
- Can finetuning sparse subnetworks alone match full parameter finetuning results?
- Why does fine-tuning models for continuous reasoning cause catastrophic forgetting?
- Which architectural choices matter most when a model must fit one billion parameters?
- Can reasoning fine-tuning improve both capability and instruction compliance together?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Does parameter isolation per task enable online updates without retraining?
- What happens to base model capabilities when you apply finetuning?
- Does fine-tuning push models toward reasoning shortcuts that bypass the chain entirely?
- How do pre-training and distillation enable minimal routing signals to work?
- Why do hybrid memory systems outperform single-tier AI architectures?
- What mechanism transfers explicit memories into parametric model weights?
- Can zero-weight drift through external memory replace parameter plasticity entirely?
- How do KV cache pruning and subproblem contraction both free reasoning capacity?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- Can auxiliary modules preserve reasoning without catastrophic forgetting?
- What architectural variables most improve inference efficiency today?
- Why does parameter-efficient tuning scaling fail to improve finetuning performance?
- Why does architecture matter more than training compute for inference efficiency?
Related concepts in this collection (6)
- Can we explore multiple reasoning paths without committing to one token?
  Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
  complementary approach: training-free single-model vs trained cross-model; SoftCoT's forgetting finding validates Soft Thinking's design
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  Coconut works from-scratch but fails as retrofit; SoftCoT provides the retrofit-safe alternative
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  SoftCoT's design premise: the LLM already has reasoning; the assistant provides continuous-space activation without disturbing it
- Is LLM forgetting really knowledge loss or alignment loss?
  When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.
  SoftCoT's catastrophic forgetting finding is the genuine version: full fine-tuning for continuous reasoning destroys capability that cannot be trivially recovered, unlike spurious task-alignment loss
- Can latent thought vectors scale language models beyond parameters?
  Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.
  LTMs train from scratch with latent vectors; SoftCoT retrofits latent reasoning onto existing models via frozen backbone + assistant; different solutions to the same goal
- Can splitting adaptation into two channels reduce forgetting?
  When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?
  synthesizes: a sibling forgetting-avoidance strategy: keep base weights untouched, route adaptation elsewhere
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
- Soft Tokens, Hard Truths
- Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
- Navigating the Latent Space Dynamics of Neural Models
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
- On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
- Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
Original note title
SoftCoT preserves frozen LLM reasoning by delegating continuous thought generation to a lightweight assistant model — avoiding catastrophic forgetting from full continuous-space training