SYNTHESIS NOTE

Can continuous reasoning avoid forgetting in instruction-tuned models?

Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?

Synthesis note · 2026-04-20 · sourced from Cognitive Models Latent

Continuous-space reasoning methods like Coconut and Compressed CoT have shown promising results by replacing discrete token sequences with latent representations. However, these methods require full-model fine-tuning — and when applied to already-capable instruction-tuned models like LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct, performance degrades below zero-shot CoT. The degradation is attributable to catastrophic forgetting: the models already have strong reasoning capability that fine-tuning for continuous-space operations destroys.

This finding matters in practice because it exposes a gap between proof of concept (Coconut works on GPT-2) and deployment reality (the same approach underperforms zero-shot CoT on the instruction-tuned models people actually use). The capability that makes instruction-tuned models valuable is exactly what full fine-tuning compromises.

SoftCoT resolves this by architectural separation: freeze the backbone LLM entirely and delegate continuous thought generation to a small auxiliary assistant model. The assistant generates a sequence of "soft thought tokens" — last-layer hidden states conditioned on the task instruction and specific instance. These soft thoughts are mapped into the LLM's representation space via a trainable projection module, then prepended as instance-specific prompts.
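The data flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not SoftCoT's implementation: the dimensions, the function names, the stand-in assistant and embedding functions, and the choice of a single linear layer for the projection are all hypothetical.

```python
import random

random.seed(0)

ASSISTANT_DIM = 4      # hypothetical hidden size of the small assistant
BACKBONE_DIM = 6       # hypothetical hidden size of the frozen backbone
NUM_SOFT_THOUGHTS = 3  # hypothetical number of soft thought tokens


def assistant_hidden_states(instruction, instance):
    """Stand-in for the assistant's last-layer hidden states,
    conditioned on the instruction and the specific instance."""
    rng = random.Random(instruction + "\n" + instance)
    return [[rng.uniform(-1, 1) for _ in range(ASSISTANT_DIM)]
            for _ in range(NUM_SOFT_THOUGHTS)]


def make_projection(in_dim, out_dim):
    """Trainable map from assistant space to backbone space
    (assumed here to be one linear layer: weights W, bias b)."""
    W = [[random.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b


def project(vec, W, b):
    return [sum(w * x for w, x in zip(row, vec)) + bias
            for row, bias in zip(W, b)]


def backbone_embed(tokens):
    """Stand-in for the frozen backbone's input embedding lookup."""
    return [[(ord(tok[0]) % 7) / 7.0] * BACKBONE_DIM for tok in tokens]


def build_backbone_inputs(instruction, instance, W, b):
    soft = [project(h, W, b)
            for h in assistant_hidden_states(instruction, instance)]
    hard = backbone_embed((instruction + " " + instance).split())
    return soft + hard  # soft thoughts are prepended to the token embeddings


W, b = make_projection(ASSISTANT_DIM, BACKBONE_DIM)
inputs = build_backbone_inputs("Solve step by step.", "What is 12 * 7?", W, b)
print(len(inputs), len(inputs[0]))
```

The point of the sketch is the shape of the interface: the backbone only ever sees vectors of its own hidden size, so nothing inside it has to change to accept the assistant's output.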

The design draws on two established ideas. From prompt tuning: the soft thoughts function as learned instance-adaptive prompts that tailor the LLM's behavior per problem. From speculative decoding: a small model generates proposals that a large model consumes. The projection module bridges the representational gap between assistant and backbone, and training this module for each task is equivalent to soft prompt tuning.
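The soft-prompt-tuning analogy can be made concrete with a toy training loop in which the projection is the only thing that changes. This is a minimal sketch under strong simplifying assumptions: the frozen backbone is collapsed into a fixed scoring vector `v`, the assistant output `h` is treated as a fixed input, and every number is made up.

```python
# Toy illustration: only the projection W is updated. The assistant output h
# and the backbone stand-in v are held fixed. All values are hypothetical.

h = [1.0, 0.5, -0.5]   # assistant hidden state (fixed input)
v = [0.5, -1.0]        # frozen "backbone": maps a projected vector to a score
W = [[0.0, 0.0, 0.0],  # trainable projection, 2 x 3
     [0.0, 0.0, 0.0]]
target = 1.0
lr = 0.1


def forward(W):
    projected = [sum(w * x for w, x in zip(row, h)) for row in W]
    return sum(vi * pi for vi, pi in zip(v, projected))


def loss(W):
    return (forward(W) - target) ** 2


v_before = list(v)
initial_loss = loss(W)

for _ in range(20):
    err = forward(W) - target
    # d(loss)/dW[i][j] = 2 * err * v[i] * h[j]; v and h receive no update.
    for i in range(len(W)):
        for j in range(len(h)):
            W[i][j] -= lr * 2 * err * v[i] * h[j]

final_loss = loss(W)
print(initial_loss, final_loss)
```

The loss falls while `v` is untouched, which is the property that matters here: task adaptation happens entirely in the vectors fed to the backbone, not in the backbone's weights.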

By staying in the latent space (using hidden states rather than decoded tokens from the assistant), SoftCoT avoids the information loss inherent in autoregressive decoding while preserving the backbone's pre-trained knowledge completely.
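A small example shows what is lost when hidden states are decoded into tokens. It is a toy sketch assuming greedy (argmax) decoding over a hypothetical two-token vocabulary; the vectors are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical output embeddings for a 2-token vocabulary.
output_embeddings = [[1.0, 0.0], [0.0, 1.0]]


def decode(hidden):
    """Greedy decoding: keep only the index of the best-scoring token."""
    scores = [dot(hidden, e) for e in output_embeddings]
    return scores.index(max(scores))


confident = [0.9, 0.1]  # strongly prefers token 0
hesitant = [0.6, 0.5]   # barely prefers token 0

token_a = decode(confident)
token_b = decode(hesitant)
print(token_a, token_b)
```

Both hidden states decode to the same token, so a consumer that receives only tokens cannot tell the two apart. A consumer that receives the hidden states can.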

The contrast with the note "Can we explore multiple reasoning paths without committing to one token?" is instructive. Soft Thinking is training-free and operates within a single model by modifying inference. SoftCoT does require training, of the projection module that connects the assistant to the backbone, but it achieves cross-model continuous reasoning: the assistant can be small and cheap while the backbone remains frozen and capable. The two address different deployment scenarios: Soft Thinking for zero-cost enhancement, SoftCoT for task-specific optimization without backbone risk.
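One way a training-free, single-model method can stay in continuous space is to feed back the probability-weighted average of token embeddings instead of the embedding of one chosen token. The sketch below illustrates that idea only. Whether it matches Soft Thinking's exact procedure is an assumption, and the vocabulary, embeddings, and logits are hypothetical.

```python
import math

# Hypothetical 3-token vocabulary with 2-dimensional input embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 1.0, 0.0]  # made-up next-token logits from the model


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# Discrete decoding: commit to one token and feed back its embedding.
hard_next = embeddings[probs.index(max(probs))]

# Continuous alternative: feed back the expected embedding under probs.
soft_next = [sum(p * e[d] for p, e in zip(probs, embeddings))
             for d in range(2)]

print(hard_next, soft_next)
```

No parameter is trained in this sketch, which is what separates this family of methods from the projection-training setup: the only change is what gets fed into the next inference step.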

The forgetting finding also supports the architectural choice discussed in the note "Can models reason without generating visible thinking tokens?": Coconut's continuous thought approach works when training from scratch but fails as a retrofit to existing capable models. This suggests the field needs both training-time latent reasoning architectures (for new models) and inference-time or frozen-backbone approaches (for enhancing existing models).

Original note title

SoftCoT preserves frozen LLM reasoning by delegating continuous thought generation to a lightweight assistant model — avoiding catastrophic forgetting from full continuous-space training