Can continuous reasoning avoid forgetting in instruction-tuned models?

Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?

Synthesis note · 2026-04-20 · sourced from Cognitive Models Latent

Continuous-space reasoning methods like Coconut and Compressed CoT have shown promising results by replacing discrete token sequences with latent representations. However, these methods require full-model fine-tuning, and when they are applied to already-capable instruction-tuned models like LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct, performance falls below the zero-shot CoT baseline. The degradation is attributed to catastrophic forgetting: these models already have strong reasoning capability, and fine-tuning them for continuous-space operation destroys it.

This is an important practical finding because it reveals a gap between proof-of-concept (Coconut works on GPT-2) and deployment reality (Coconut's approach fails on the models people actually use). The capability that makes instruction-tuned models valuable is exactly what full fine-tuning compromises.

SoftCoT resolves this by architectural separation: freeze the backbone LLM entirely and delegate continuous thought generation to a small auxiliary assistant model. The assistant generates a sequence of "soft thought tokens" — last-layer hidden states conditioned on the task instruction and specific instance. These soft thoughts are mapped into the LLM's representation space via a trainable projection module, then prepended as instance-specific prompts.
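The data flow described above can be sketched in a few lines. This is a toy illustration and not SoftCoT's actual implementation: the note only says the projection module is trainable, so the linear map here is an assumption, and the names `project`, `prepend_soft_thoughts`, and all sizes are hypothetical. Random vectors stand in for the assistant's last-layer hidden states and the LLM's input embeddings.

```python
import random

def project(soft_thoughts, weight, bias):
    """Map assistant hidden states (n x d_assistant) into the LLM space (n x d_llm).

    Assumption: a single linear layer; the source only says "trainable projection".
    """
    return [
        [sum(h[i] * weight[i][j] for i in range(len(h))) + bias[j]
         for j in range(len(bias))]
        for h in soft_thoughts
    ]

def prepend_soft_thoughts(projected, input_embeddings):
    """Place the projected soft thoughts in front of the frozen LLM's input embeddings."""
    return projected + input_embeddings

rng = random.Random(0)
D_ASSISTANT, D_LLM = 4, 6       # hypothetical hidden sizes
NUM_THOUGHTS, SEQ_LEN = 3, 5    # hypothetical sequence lengths

# Stand-in for the assistant's last-layer hidden states (one row per soft thought).
soft_thoughts = [[rng.gauss(0, 1) for _ in range(D_ASSISTANT)]
                 for _ in range(NUM_THOUGHTS)]
# Trainable projection parameters (the only part that would be updated).
weight = [[rng.gauss(0, 0.1) for _ in range(D_LLM)] for _ in range(D_ASSISTANT)]
bias = [0.0] * D_LLM
# Stand-in for the frozen LLM's token embeddings of the actual prompt.
input_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(SEQ_LEN)]

llm_inputs = prepend_soft_thoughts(project(soft_thoughts, weight, bias),
                                   input_embeddings)
print(len(llm_inputs), len(llm_inputs[0]))  # 8 6
```

The prompt's own embeddings pass through untouched; only the rows in front of them depend on the assistant and the projection.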

The design draws on two established ideas. From prompt tuning: the soft thoughts function as learned instance-adaptive prompts that tailor the LLM's behavior per problem. From speculative decoding: a small model generates proposals that a large model consumes. The projection module bridges the representational gap between assistant and backbone, and training this module for each task is equivalent to soft prompt tuning.
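To make the soft-prompt-tuning analogy concrete, the following toy sketch updates only a projection matrix while a stand-in "backbone" stays fixed. Everything here is an assumption for illustration: the backbone is reduced to a fixed linear readout, the loss is squared error against a made-up target, and the gradient is written out by hand. It is not the training objective used by SoftCoT.

```python
def score(hidden, weight, backbone):
    """Project one assistant hidden state, then let the frozen backbone read it."""
    projected = [sum(hidden[i] * weight[i][j] for i in range(len(hidden)))
                 for j in range(len(backbone))]
    return sum(p * b for p, b in zip(projected, backbone))

def train_projection(hidden, weight, backbone, target, lr=0.05, steps=50):
    """Gradient descent on the projection weights only; the backbone is never written to."""
    losses = []
    for _ in range(steps):
        err = score(hidden, weight, backbone) - target
        losses.append(err * err)
        for i in range(len(hidden)):
            for j in range(len(backbone)):
                # d(err^2)/d(weight[i][j]) = 2 * err * hidden[i] * backbone[j]
                weight[i][j] -= lr * 2 * err * hidden[i] * backbone[j]
    return losses

HIDDEN = [1.0, -0.5, 0.25]          # hypothetical assistant hidden state
BACKBONE = (0.5, 1.0, -1.0, 0.2)    # hypothetical frozen readout (a tuple, so immutable)
TARGET = 1.0                        # hypothetical training signal

weight = [[0.0] * len(BACKBONE) for _ in range(len(HIDDEN))]
losses = train_projection(HIDDEN, weight, BACKBONE, TARGET)
final_score = score(HIDDEN, weight, BACKBONE)
print(losses[0], losses[-1] < losses[0])
```

Because the gradient flows through the frozen readout but is applied only to `weight`, the backbone's behavior on any other input is exactly what it was before training, which is the property the note relies on.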

By staying in the latent space (using hidden states rather than decoded tokens from the assistant), SoftCoT avoids the information loss inherent in autoregressive decoding while preserving the backbone's pre-trained knowledge completely.

The contrast with Soft Thinking (see "Can we explore multiple reasoning paths without committing to one token?") is instructive. Soft Thinking is training-free and operates within a single model by modifying inference. SoftCoT requires training the assistant and projection module but achieves cross-model continuous reasoning: the assistant can be small and cheap while the backbone remains frozen and capable. The two address different deployment scenarios: Soft Thinking for zero-cost enhancement, SoftCoT for task-specific optimization without backbone risk.
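For comparison, the Soft Thinking side of this contrast can be sketched as follows. The sketch assumes that "preserving the full probability distribution across token embeddings" means feeding the model a probability-weighted average of token embeddings instead of the single sampled token's embedding. The function names and the tiny three-token vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_token(logits, embedding_table):
    """Probability-weighted mixture of all token embeddings (no sampling, no training)."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[k] for p, row in zip(probs, embedding_table))
            for k in range(dim)]

def hard_token(logits, embedding_table):
    """Standard decoding: commit to the single most likely token's embedding."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return embedding_table[best]

EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-token vocabulary
logits = [2.0, 2.0, -50.0]                          # two equally likely candidates

soft = soft_token(logits, EMBEDDINGS)
hard = hard_token(logits, EMBEDDINGS)
print(soft, hard)
```

With two equally likely candidates, the hard choice discards one of them, while the mixture sits between both embeddings. No parameters are trained in either function, which is what makes this an inference-only change.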

The forgetting finding also validates the architectural choice discussed in "Can models reason without generating visible thinking tokens?": Coconut's continuous thought approach works when training from scratch but fails as a retrofit to existing capable models. This suggests the field needs both training-time latent reasoning architectures (for new models) and inference-time or frozen-backbone approaches (for enhancing existing models).

Related concepts in this collection 6

Can we explore multiple reasoning paths without committing to one token? Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
complementary approach: training-free single-model vs trained cross-model; SoftCoT's forgetting finding validates Soft Thinking's design
Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
Coconut works from-scratch but fails as retrofit; SoftCoT provides the retrofit-safe alternative
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
SoftCoT's design premise: the LLM already has reasoning; the assistant provides continuous-space activation without disturbing it
Is LLM forgetting really knowledge loss or alignment loss? When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.
SoftCoT's catastrophic forgetting finding is the genuine version: full fine-tuning for continuous reasoning destroys capability that cannot be trivially recovered, unlike spurious task-alignment loss
Can latent thought vectors scale language models beyond parameters? Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.
LTMs train from scratch with latent vectors; SoftCoT retrofits latent reasoning onto existing models via frozen backbone + assistant; different solutions to the same goal
Can splitting adaptation into two channels reduce forgetting? When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?
synthesizes: a sibling forgetting-avoidance strategy — keep base weights untouched, route adaptation elsewhere
