Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

The mainstream approach to test-time scaling requires the model to verbalize intermediate reasoning steps — producing tokens that represent thoughts before producing an answer. Two architectures challenge this assumption from different angles and converge on the same implication: verbalization is a historical artifact of training constraints, not a necessity for reasoning.

Latent depth-recurrent reasoning: A recurrent block is added to a transformer and iterated at inference time for an arbitrary number of steps. The model "thinks" by updating its hidden state repeatedly before producing any output token. Advantages: (1) no specialized training data is required, since the model trains with a variable compute budget on standard data; (2) it needs less memory than CoT models, which rely on long context windows; (3) compute is adaptive per token, so difficult tokens get more recurrent iterations; (4) as model parameter count decreases, FLOPs per parameter increase, which enables high compute utilization on smaller models. The architecture also supports early stopping: iteration can halt once the KL divergence between successive steps falls below a threshold.
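The iterate-until-converged loop can be illustrated with a toy model. This is a minimal sketch, not the paper's implementation: the tanh cell stands in for the recurrent transformer block, and `make_toy_weights`, `think`, the KL direction, and the threshold value are all assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for strictly positive discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def make_toy_weights(dim=4, vocab=5, seed=0):
    # Hypothetical weights. The small recurrent scale keeps the update a
    # contraction, so the hidden state settles after a few iterations.
    rng = random.Random(seed)
    w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    w_in = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    w_out = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
    return w_rec, w_in, w_out

def think(x, w_rec, w_in, w_out, max_steps=64, kl_threshold=1e-6):
    """Iterate the recurrent block until the output distribution stops moving."""
    h = [0.0] * len(w_rec)
    inp = matvec(w_in, x)  # the input is re-injected at every iteration
    prev = softmax(matvec(w_out, h))
    for step in range(1, max_steps + 1):
        rec = matvec(w_rec, h)
        h = [math.tanh(a + b) for a, b in zip(rec, inp)]
        cur = softmax(matvec(w_out, h))
        # Early exit: successive output distributions are nearly identical.
        if kl_divergence(cur, prev) < kl_threshold:
            return cur, step
        prev = cur
    return prev, max_steps

weights = make_toy_weights()
dist, steps_used = think([0.3, -0.7, 0.1, 0.9], *weights)
print(f"converged after {steps_used} recurrent steps")
```

Because the stopping test runs per input, inputs that settle quickly use fewer iterations than ones that do not, which is the per-token adaptive compute described above.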

Heima (Hidden LLaMA): Each intermediate CoT step is compressed into a compact higher-level hidden representation using a single "thinking token." An adaptive decoder reconstructs variable-length textual sequences from the thinking tokens, enabling interpretability without verbosity. The model encodes each CoT step but doesn't need to generate all the intermediate tokens at inference time.

The synthesis point: both architectures suggest that the constraint that "expensive internal reasoning must always be projected down to a single verbalized next token" appears wasteful (Latent Depth paper). Continuous latent space can explore multiple reasoning directions simultaneously, without the linear sequential structure that token generation imposes.

This challenges the note Does more thinking time actually improve LLM reasoning? from an unexpected direction. The myth examined there assumes that verbalized tokens are the unit of thinking; latent reasoning questions whether tokens should be the unit at all.

The connection to human cognition is philosophically interesting: "a substantial amount of thought happens through complex, recurrent firing patterns in the brain, before the first word of an answer is uttered." Latent reasoning may capture facets of human reasoning (spatial thinking, physical intuition) that resist verbalization, which current verbalized CoT approaches cannot access by design.

Coconut (Chain of Continuous Thought): A fourth approach feeds the last hidden state back as the next input embedding directly in continuous space, bypassing the language model head and embedding layer entirely. Continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform breadth-first search (BFS) naturally — rather than committing to a single deterministic path like CoT. Coconut outperforms CoT on logical reasoning tasks requiring substantial backtracking. The neuroscience grounding is direct: neuroimaging studies consistently show that the language network remains largely inactive during reasoning tasks, and language appears optimized for communication rather than reasoning. This suggests verbalized CoT forces reasoning through a communication channel it was never designed for. The CoT unfaithfulness literature reinforces this: even when models generate explicit reasoning chains, they may use a different latent reasoning process internally.
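The difference between latent mode and language mode comes down to what is fed back as the next input. The following is a minimal sketch under loud assumptions: `ToyModel` is a hypothetical recurrent stand-in for a decoder (Coconut itself operates on a transformer), and `generate`, `n_latent`, and `n_tokens` are illustrative names, not the paper's API.

```python
import math
import random

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def make_matrix(rows, cols, rng, scale):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class ToyModel:
    """Hypothetical stand-in for a decoder: one recurrent cell, an embedding table, an LM head."""

    def __init__(self, vocab_size=5, dim=4, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.embedding = make_matrix(vocab_size, dim, rng, 1.0)  # token id -> input vector
        self.w_state = make_matrix(dim, dim, rng, 0.5)
        self.w_input = make_matrix(dim, dim, rng, 0.5)
        self.lm_head = make_matrix(vocab_size, dim, rng, 1.0)    # hidden state -> logits

    def hidden(self, state, input_vec):
        mixed = zip(matvec(self.w_state, state), matvec(self.w_input, input_vec))
        return [math.tanh(a + b) for a, b in mixed]

def generate(model, prompt_ids, n_latent, n_tokens):
    state = [0.0] * model.dim
    for tok in prompt_ids:
        state = model.hidden(state, model.embedding[tok])

    # Latent mode: the last hidden state is fed back directly as the next
    # input. No LM head, no token choice, no embedding lookup.
    for _ in range(n_latent):
        continuous_thought = state
        state = model.hidden(state, continuous_thought)

    # Language mode: project to logits, commit to one token, embed it, feed it back.
    emitted = []
    for _ in range(n_tokens):
        logits = matvec(model.lm_head, state)
        tok = max(range(len(logits)), key=logits.__getitem__)
        emitted.append(tok)
        state = model.hidden(state, model.embedding[tok])
    return emitted

model = ToyModel()
print(generate(model, prompt_ids=[0, 1], n_latent=3, n_tokens=2))
```

The latent loop never collapses the state to a single token, which is what leaves room for a continuous thought to carry several candidate next steps at once.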

Hierarchical Reasoning Model (HRM): A third distinct latent reasoning architecture adds brain-inspired multi-timescale processing. HRM couples a slow high-level module (abstract planning) with a fast low-level module (detailed computation) in hierarchical recurrence. The fast module reaches equilibrium, then the slow module advances; this "hierarchical convergence" avoids the premature convergence of standard recurrence. With only 27M parameters and 1000 training samples (no pretraining, no CoT), HRM achieves near-perfect accuracy on Sudoku-Extreme and 30×30 maze pathfinding, tasks where CoT methods fail completely (0% accuracy). Training uses an O(1)-memory gradient approximation at equilibrium, avoiding backpropagation through time (BPTT) entirely. See Can recurrent hierarchies achieve reasoning that transformers cannot?
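The two-timescale schedule can be sketched as a nested loop. This is an illustrative toy, not HRM itself: the states are scalars, the update coefficients are made up, and a fixed number of fast steps (`low_steps`) is assumed as a proxy for the fast module reaching equilibrium.

```python
import math

def low_level_step(z_low, z_high, x):
    # Fast module: detailed computation, conditioned on the current plan z_high.
    return math.tanh(0.5 * z_low + 0.8 * z_high + x)

def high_level_step(z_high, z_low):
    # Slow module: updates the plan from the settled low-level state.
    return math.tanh(0.6 * z_high + 0.9 * z_low)

def hierarchical_forward(x, n_cycles=4, low_steps=8):
    z_high, z_low = 0.0, 0.0
    trace = []
    for cycle in range(n_cycles):
        # The fast module iterates while the slow state is held fixed.
        for _ in range(low_steps):
            z_low = low_level_step(z_low, z_high, x)
        # The slow module advances once per cycle, which gives the fast
        # module a new context to converge toward in the next cycle.
        z_high = high_level_step(z_high, z_low)
        trace.append((cycle, z_low, z_high))
    return z_high, trace

final_plan, trace = hierarchical_forward(x=0.2)
for cycle, z_low, z_high in trace:
    print(f"cycle {cycle}: z_low={z_low:.4f} z_high={z_high:.4f}")
```

Each slow update shifts the fixed point the fast module is heading for, so the computation keeps making progress across cycles instead of stalling at the first equilibrium.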

Theoretical consolidation: These converging architectures now have a formal theoretical framework. Under the framework in Where does LLM reasoning actually happen during generation?, the depth-recurrent, Heima, Coconut, HRM, and energy-based approaches all constitute evidence for H1 (latent-state trajectories as the primary reasoning medium). The framework also clarifies why these approaches work: if reasoning is fundamentally a latent-state process, then architectures that operate directly in latent space work with the native medium rather than forcing it through the bottleneck of discrete verbalization. Furthermore, Can we trigger reasoning without explicit chain-of-thought prompts? indicates that the latent reasoning capability exists even in standard transformer architectures, so specialized latent architectures may be optimizing the medium rather than creating a new capability.

Practical constraint on retrofitting: Can continuous reasoning avoid forgetting in instruction-tuned models? shows that fine-tuning already-capable instruction-tuned models for continuous reasoning via Coconut/CCoT methods causes catastrophic forgetting. This is a critical caveat for deployment: it limits the Coconut approach to training-from-scratch scenarios and motivates frozen-backbone alternatives for enhancing existing models.

Related concepts in this collection 12

How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
latent recurrence is neither: it scales depth per token rather than breadth or chain length
Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
latent reasoning suggests the token-is-thinking assumption embedded in all TTS benchmarks may be wrong
Can minimal reasoning chains match full explanations? Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
CoD uses fewer tokens; latent reasoning uses zero tokens for intermediate steps; same direction of travel
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
latent recurrence with early stopping implements adaptive compute at the token level, not the prompt level
Can recurrent hierarchies achieve reasoning that transformers cannot? Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
third latent reasoning architecture: hierarchical multi-timescale recurrence
Can parallel architectures solve inherently sequential problems? Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
complexity-theoretic foundation: latent recurrence is necessary for inherently serial problems
Can we explore multiple reasoning paths without committing to one token? Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
training-free approach to continuous-space reasoning via probability-weighted token mixture
Can energy minimization unlock reasoning without domain-specific training? Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?
fifth latent reasoning approach: energy minimization as iterative gradient descent at inference time, distinct from depth-recurrent, Heima, Coconut, and HRM; 35% higher scaling rate than Transformer++, modality-agnostic without domain-specific training
Where does LLM reasoning actually happen during generation? Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
provides the theoretical framework (H1/H2/H0) that organizes all these architectures as evidence for H1
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
mechanistic evidence: latent reasoning is not just architecturally achievable but causally controllable via a single feature
Can continuous reasoning avoid forgetting in instruction-tuned models? Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
validates a practical concern: Coconut-style fine-tuning causes catastrophic forgetting on capable models; SoftCoT provides the retrofit-safe alternative
Can stochastic latent reasoning help models explore multiple solutions? This explores whether making recursive reasoning paths probabilistic rather than deterministic lets models maintain uncertainty and consider alternative hypotheses when problems admit multiple valid solutions.
extends: GRAM makes the deterministic latent recurrence stochastic to represent multiple solutions
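
The probability-weighted token mixture mentioned in the entry on exploring multiple reasoning paths can be sketched as follows. This is a minimal illustration assuming a tiny hand-written embedding table; `soft_input` is a hypothetical helper name, and the linked work may differ in details such as temperature or top-k truncation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_input(probs, embedding_table):
    """Next input = expectation of the token embeddings under probs."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[j] for p, row in zip(probs, embedding_table))
        for j in range(dim)
    ]

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# A one-hot distribution recovers ordinary discrete decoding.
print(soft_input([0.0, 1.0, 0.0], table))

# A spread-out distribution keeps every candidate token in play.
print(soft_input(softmax([0.0, 0.0, 0.0]), table))
```

No weights are updated here, which is why this style of continuous-space reasoning can be applied without training.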
