Can models reason without generating visible thinking tokens?
Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
The mainstream approach to test-time scaling requires the model to verbalize intermediate reasoning steps — producing tokens that represent thoughts before producing an answer. Two architectures challenge this assumption from different angles and converge on the same implication: verbalization is a historical artifact of training constraints, not a necessity for reasoning.
Latent depth-recurrent reasoning: A recurrent block is added to a transformer and iterated at inference time for an arbitrary number of steps. The model "thinks" by updating its hidden state repeatedly before producing any output token. Advantages: (1) no specialized training data is required, because the model trains with a variable compute budget on standard data; (2) it needs less memory than CoT models, which require long context windows; (3) compute is adaptive per token, so difficult tokens get more recurrent iterations; (4) as model parameter count decreases, FLOPs per parameter increase, which enables high compute utilization on smaller models. The architecture naturally supports early stopping via KL-divergence convergence detection.
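A minimal sketch of the early-stopping idea, not the paper's implementation. It assumes convergence is detected by comparing the output distributions of successive recurrent steps. The names `recurrent_block`, `readout`, and `think`, the toy update rule, and the threshold value are all hypothetical stand-ins chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def recurrent_block(state, x):
    # Hypothetical toy block: pulls the hidden state toward tanh(x).
    # A real model would apply shared transformer layers here.
    return [0.5 * s + 0.5 * math.tanh(xi) for s, xi in zip(state, x)]

def readout(state):
    # Hypothetical output head: hidden state -> distribution over 3 "tokens".
    return softmax([4.0 * s for s in state])

def think(x, max_steps=64, kl_threshold=1e-6):
    """Iterate the block until successive output distributions stop changing."""
    state = [0.0] * len(x)
    prev = readout(state)
    for step in range(1, max_steps + 1):
        state = recurrent_block(state, x)
        cur = readout(state)
        if kl_divergence(cur, prev) < kl_threshold:
            return cur, step  # converged: stop spending compute on this token
        prev = cur
    return prev, max_steps

probs_hard, steps_hard = think([1.0, -0.5, 0.2])
probs_easy, steps_easy = think([0.0, 0.0, 0.0])
print(steps_hard, steps_easy)
```

In this toy setup the all-zero input is already at its fixed point and exits after one iteration, while the other input needs several. That is the per-token adaptive compute behaviour described above, in miniature.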
Heima (Hidden LLaMA): Each intermediate CoT step is compressed into a compact higher-level hidden representation using a single "thinking token." An adaptive decoder reconstructs variable-length textual sequences from the thinking tokens, enabling interpretability without verbosity. The model encodes each CoT step but doesn't need to generate all the intermediate tokens at inference time.
The synthesis point: both architectures suggest that routing every reasoning step through a token is an avoidable bottleneck. In the Latent Depth paper's words, a constraint under which "expensive internal reasoning must always be projected down to a single verbalized next token appears wasteful". Continuous latent space can explore multiple reasoning directions simultaneously, without the linear sequential structure that token generation imposes.
This challenges "Does more thinking time actually improve LLM reasoning?" from an unexpected direction: the myth examined there assumes verbalized tokens are the unit of thinking, whereas latent reasoning questions whether tokens should be the unit at all.
The connection to human cognition is philosophically interesting: "a substantial amount of thought happens through complex, recurrent firing patterns in the brain, before the first word of an answer is uttered." Latent reasoning may capture facets of human reasoning (spatial thinking, physical intuition) that resist verbalization, which current verbalized CoT approaches cannot access by design.
Coconut (Chain of Continuous Thought): A fourth approach feeds the last hidden state back as the next input embedding directly in continuous space, bypassing the language model head and embedding layer entirely. Continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform breadth-first search (BFS) naturally — rather than committing to a single deterministic path like CoT. Coconut outperforms CoT on logical reasoning tasks requiring substantial backtracking. The neuroscience grounding is direct: neuroimaging studies consistently show that the language network remains largely inactive during reasoning tasks, and language appears optimized for communication rather than reasoning. This suggests verbalized CoT forces reasoning through a communication channel it was never designed for. The CoT unfaithfulness literature reinforces this: even when models generate explicit reasoning chains, they may use a different latent reasoning process internally.
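The difference between the two feedback loops can be sketched with a toy model. This is an illustration under strong assumptions, not Coconut's implementation: `EMBED`, `W`, `step`, and `lm_head` are hypothetical stand-ins for an embedding table, a transformer forward pass, and an output head, and the hidden size is chosen to equal the embedding size so the hidden state can be fed back directly.

```python
import math

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
EMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Hypothetical stand-in for the transformer: one small linear map plus tanh.
W = [[0.6, -0.4], [0.4, 0.6]]

def step(x):
    # One "forward pass": input embedding -> last hidden state.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def lm_head(h):
    # Score each vocabulary item against the hidden state.
    return [sum(e * hi for e, hi in zip(emb, h)) for emb in EMBED]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def discrete_cot(x, n_steps):
    """Token-space loop: hidden state -> head -> one token -> embedding."""
    tokens = []
    for _ in range(n_steps):
        h = step(x)
        tok = argmax(lm_head(h))  # collapses h to a single discrete choice
        tokens.append(tok)
        x = EMBED[tok]            # only the chosen token's embedding survives
    return x, tokens

def continuous_thought(x, n_steps):
    """Latent loop: the hidden state itself becomes the next input."""
    for _ in range(n_steps):
        x = step(x)               # no head, no argmax, no embedding lookup
    return x

final_embedding, tokens = discrete_cot(EMBED[0], 3)
latent = continuous_thought(EMBED[0], 3)
print(tokens, latent)
```

In the discrete loop the next input is always one row of `EMBED`, so any information about runner-up tokens is discarded at each step. In the latent loop the next input is an arbitrary vector, which is what lets a continuous thought carry several candidate next steps at once.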
Hierarchical Reasoning Model (HRM): A third distinct latent reasoning architecture adds brain-inspired multi-timescale processing. HRM couples a slow high-level module (abstract planning) with a fast low-level module (detailed computation) in hierarchical recurrence. The fast module reaches equilibrium, then the slow module advances; this "hierarchical convergence" avoids the premature convergence of standard recurrence. With only 27M parameters and 1000 samples (no pretraining, no CoT), HRM achieves near-perfect accuracy on Sudoku-Extreme and 30×30 maze pathfinding, tasks where CoT methods completely fail (0% accuracy). Training uses an O(1)-memory gradient approximation at equilibrium, avoiding backpropagation through time (BPTT) entirely. See "Can recurrent hierarchies achieve reasoning that transformers cannot?"
Theoretical consolidation: These converging architectures now have a formal theoretical framework. Under the framework set out in "Where does LLM reasoning actually happen during generation?", the depth-recurrent, Heima, Coconut, HRM, and energy-based approaches all constitute evidence for H1 (latent-state trajectories as the primary reasoning medium). The framework also clarifies why these approaches work: if reasoning is fundamentally a latent-state process, then architectures that operate directly in latent space are working with the native medium rather than forcing it through the bottleneck of discrete verbalization. Furthermore, as "Can we trigger reasoning without explicit chain-of-thought prompts?" shows, the latent reasoning capability exists even in standard transformer architectures, so specialized latent architectures may be optimizing the medium rather than creating a new capability.
Practical constraint on retrofitting: A critical caveat for deployment is that, as "Can continuous reasoning avoid forgetting in instruction-tuned models?" shows, fine-tuning already-capable instruction-tuned models for continuous reasoning via Coconut/CCoT methods causes catastrophic forgetting. This limits the Coconut approach to training-from-scratch scenarios and motivates frozen-backbone alternatives for enhancing existing models.
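The two-timescale control flow can be sketched as a nested loop. This is a scalar toy under stated assumptions, not the HRM implementation: `low_level_update`, `high_level_update`, the update rules, and the tolerance are hypothetical, and the sketch covers only the forward pass, not the O(1)-memory gradient approximation.

```python
import math

def low_level_update(z_low, z_high, x):
    # Hypothetical fast module: relaxes toward a target set by the slow state.
    return 0.5 * z_low + 0.5 * math.tanh(z_high + x)

def high_level_update(z_high, z_low):
    # Hypothetical slow module: absorbs the fast module's settled result.
    return 0.8 * z_high + 0.2 * z_low

def hierarchical_forward(x, n_cycles=4, max_low_steps=50, tol=1e-6):
    """Run the fast module to (near) equilibrium, then advance the slow module."""
    z_high, z_low = 0.0, 0.0
    low_steps_per_cycle = []
    for _ in range(n_cycles):
        for t in range(1, max_low_steps + 1):
            new_low = low_level_update(z_low, z_high, x)
            converged = abs(new_low - z_low) < tol
            z_low = new_low
            if converged:
                break
        low_steps_per_cycle.append(t)
        # The slow update moves the fast module's target, so the next cycle
        # has fresh work to do instead of sitting at the old fixed point.
        z_high = high_level_update(z_high, z_low)
    return z_high, low_steps_per_cycle

z_high, steps = hierarchical_forward(1.0)
print(z_high, steps)
```

Each slow update shifts the equilibrium the fast module is converging to, so every cycle needs more than one fast step. A single flat recurrence would instead settle once and stop making progress, which is the premature convergence the hierarchy is meant to avoid.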
Inquiring lines that use this note as a source (108)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does the silent token approach compare to modeling intrinsic motivation for speaking?
- What is the relationship between reasoning depth and verbalization requirements?
- Can this principle apply to other intermediate text generation tasks?
- How do verbose and concise reasoning occupy different regions in activation space?
- How do soft thought tokens differ from decoded assistant outputs?
- Can AI output be tokenized without decoupling from the thought processes behind it?
- Can step-level deliberation flags guide other reasoning systems?
- How do thinking tokens exhibit diminishing returns beyond a critical threshold?
- How does token-by-token generation constrain a model's ability to plan ahead?
- Why do language models produce verbose reasoning when asked to think step by step?
- Does text-only evaluation hide reasoning collapse that tool use could repair?
- Can language models reason without relying on learned semantic patterns?
- Does the langue-parole distinction apply to human reasoning too?
- What behavioral markers signal when reasoning chains are performative?
- Why do language models imitate reasoning form without abstract inference capability?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- Why do more capable models prefer shorter chains of thought?
- Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?
- What computational role do intermediate tokens actually play in transformers?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- What determines the optimal thinking token threshold for a given task?
- Can chain of thought be deployed selectively to save inference tokens?
- Why does reasoning accuracy degrade beyond a critical thinking token threshold?
- How do thinking tokens function as mutual information peaks in reasoning?
- Do self-revision tokens measurably degrade reasoning accuracy in scaled models?
- Can latent reasoning in continuous space scale beyond supervised reasoning tasks?
- Does selective suppression of linguistic relations enable human meaning-making?
- How do covert thoughts differ from chain-of-thought reasoning in language models?
- Do reasoning models trade instruction following for deliberative capability?
- Why might latent reasoning capture types of thinking that verbalized CoT cannot?
- What hidden computations happen inside transformer layers during reasoning?
- Why do models learn reasoning form instead of actual abstract inference?
- How much does input format shape what reasoning strategy a model develops?
- Does distillation from reasoning models spread overthinking to smaller models?
- How can prompting help models gather information before attempting reasoning?
- Can token efficiency come from stopping before reflection?
- Can models hide their reasoning in continuous space rather than natural language?
- Why do reasoning models verbalize reasoning shortcuts less than necessary?
- Why does parallel thinking outperform sequential thinking with equal tokens?
- How much of a model's reasoning tokens are unnecessary for reaching the final answer?
- Why do language models generate reasoning tokens after internally deciding the answer?
- Why do explicit linguistic markers override semantic computation in models?
- Do reflection tokens and symbolic tokens serve different roles in reasoning?
- What happens to reasoning accuracy when models use more thinking tokens?
- Why do reasoning models reduce effort despite having token budget remaining?
- Why does parallel thinking outperform sequential thinking under token limits?
- Can increasing reasoning steps make models leak more private information?
- Why do models verbalize sensitive data they are instructed to hide?
- Can models learn when to think versus answer directly?
- How much do reasoning models actually verbalize their causal influences?
- Why do larger reasoning models show cyclicity only in later layers?
- Can latent space represent reasoning dimensions that text cannot?
- How much does test-time compute improve reasoning without more tokens?
- Why does cross-text analogical reasoning fail when semantics decouple from symbols?
- How do recursive language models rethink where to store reasoning?
- Why does latent reasoning override no-think instructions in models?
- How early in token generation does the reasoning mode activate?
- Can latent reasoning achieve the same substitution without tokens?
- Can language models generate plausible latent thoughts without human annotation?
- Can capability boundary collapse be addressed by operating at representational rather than token level?
- What happens to safety guardrails when we scale reasoning without instruction control?
- Why do models skip steps that would make reasoning clearer?
- Can language models perform genuine symbolic reasoning without semantic grounding?
- Does the answer stage perform substantial reasoning beyond the thinking draft?
- Why does representation recycling of MI-peak tokens improve reasoning accuracy?
- Can thinking token density explain reasoning performance beyond total length?
- What makes thought identifiability provable without auxiliary training data?
- What distinguishes memorized tokens from causally necessary reasoning steps?
- What semantic information is lost if analysis skips the token embedding layer?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- Does task difficulty alone determine how many thinking tokens a model should use?
- What happens to model reasoning accuracy as thinking token requirements exceed critical thresholds?
- Do base models contain latent reasoning that minimal training can unlock?
- Why do concise reasoning chains match verbose chain-of-thought token efficiency?
- Do base models truly possess latent reasoning capability?
- Can cognitive scaffolding replace tool-based reasoning augmentation in language models?
- Does latent reasoning capability exist in base models before any training?
- How do reasoning-invariant tokens dilute learning signals in uniform averaging?
- What makes thinking tokens carry more information than other tokens?
- Can models internally identify which tokens matter most for reasoning?
- How do thought anchors differ from individual forking tokens mechanistically?
- What limits external scaling when a model lacks reasoning foundation?
- Can models reason at inference without specialized internal training?
- Does reasoning happen in hidden space or in generated tokens?
- Does next-token prediction actually explain how human thought works?
- How do soft token mixtures enable parallel reasoning exploration without explicit training?
- Do models cache intentions about response topics before generating the first token?
- Why does self-distillation suppress epistemic verbalization in student models?
- Do models verbalize their implicit knowledge when that knowledge influences their output?
- How do meta-tokens help models learn when to generate reasoning versus commit predictions?
- What makes token-level reasoning during pretraining different from test-time chain-of-thought?
- Can reasoning happen in latent space without chain of thought?
- What evidence shows that reasoning chains encode token-level functional structure?
- How do continuous concept tokens compare to latent trajectory sampling?
- Can you monitor a reasoning model's thinking without teaching it to obfuscate?
- Does the base model already contain latent reasoning capability?
- Why do thinking models execute longer tasks than standard language models?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- Does the token prediction framing actually capture what human reasoning does?
- What mechanisms activate latent reasoning capabilities already present in base models?
- How does token-level interaction like ColBERT overcome commutativity constraints?
- How do internal model mechanisms escape token-level reinforcement signals?
- How do semantic and symbolic reasoning capabilities differ in language models?
- Why does reflection in reasoning models often become theater rather than genuine thought?
- Do language models need words to think or just latent structure?
- Why does latent-level prediction beat token-level prediction for reasoning?
- How do early-prefix tokens control the generation of entire continuations?
- Why do language models use remaining tokens to rationalize instead of reconsider?
Related concepts in this collection (12)
- How should we balance parallel versus sequential compute at test time?
  Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
  latent recurrence is neither: it scales depth per token rather than breadth or chain length
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  latent reasoning suggests the token-is-thinking assumption embedded in all TTS benchmarks may be wrong
- Can minimal reasoning chains match full explanations?
  Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
  CoD uses fewer tokens; latent reasoning uses zero tokens for intermediate steps; same direction of travel
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  latent recurrence with early stopping implements adaptive compute at the token level, not the prompt level
- Can recurrent hierarchies achieve reasoning that transformers cannot?
  Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
  third latent reasoning architecture: hierarchical multi-timescale recurrence
- Can parallel architectures solve inherently sequential problems?
  Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
  complexity-theoretic foundation: latent recurrence is necessary for inherently serial problems
- Can we explore multiple reasoning paths without committing to one token?
  Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
  training-free approach to continuous-space reasoning via probability-weighted token mixture
- Can energy minimization unlock reasoning without domain-specific training?
  Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?
  fifth latent reasoning approach: energy minimization as iterative gradient descent at inference time, distinct from depth-recurrent, Heima, Coconut, and HRM; 35% higher scaling rate than Transformer++, modality-agnostic without domain-specific training
- Where does LLM reasoning actually happen during generation?
  Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
  provides the theoretical framework (H1/H2/H0) that organizes all these architectures as evidence for H1
- Can we trigger reasoning without explicit chain-of-thought prompts?
  This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
  mechanistic evidence: latent reasoning is not just architecturally achievable but causally controllable via a single feature
- Can continuous reasoning avoid forgetting in instruction-tuned models?
  Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
  validates a practical concern: Coconut-style fine-tuning causes catastrophic forgetting on capable models; SoftCoT provides the retrofit-safe alternative
- Can stochastic latent reasoning help models explore multiple solutions?
  This explores whether making recursive reasoning paths probabilistic rather than deterministic lets models maintain uncertainty and consider alternative hypotheses when problems admit multiple valid solutions.
  extends: GRAM makes the deterministic latent recurrence stochastic to represent multiple solutions
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Implicit Chain of Thought Reasoning via Knowledge Distillation
- LLM Reasoning Is Latent, Not the Chain of Thought
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Hierarchical Reasoning Model
- Training Large Language Models to Reason in a Continuous Latent Space
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Original note title
latent reasoning in continuous space scales test-time compute without verbalized tokens or specialized training data