Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? The question challenges how we measure and understand reasoning.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

Post angle — Medium

The current test-time scaling paradigm assumes reasoning = generating tokens. Thinking more means producing more intermediate reasoning tokens. This assumption is embedded in every benchmark that measures reasoning quality by counting or reading the chain.

Two architectures challenge this from different angles:

Depth-recurrent models iterate a recurrent block in latent space. More recurrence = more thinking, but zero additional output tokens. The model updates its hidden state as many times as it needs, then produces an answer. Performance scales with recurrence depth. No specialized training data required.
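The depth-recurrent idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the architecture from the source: the weights, the two-dimensional state, and the names `recurrent_block` and `answer` are all hypothetical. It shows only the structural point that recurrence depth changes the hidden state while the number of emitted tokens stays fixed.

```python
import math

def recurrent_block(h, x, W, U):
    """One latent update, h <- tanh(W h + U x). No token is emitted here."""
    return [
        math.tanh(
            sum(W[i][j] * h[j] for j in range(len(h)))
            + sum(U[i][j] * x[j] for j in range(len(x)))
        )
        for i in range(len(h))
    ]

def answer(x, depth, W, U, readout):
    """Iterate the block `depth` times, then emit exactly one answer token."""
    h = [0.0] * len(W)
    for _ in range(depth):
        h = recurrent_block(h, x, W, U)  # "thinking" happens only in h
    score = sum(r * v for r, v in zip(readout, h))
    return ["yes" if score > 0 else "no"], h

# Hypothetical toy weights and input, chosen only for illustration.
W = [[0.5, -0.3], [0.2, 0.4]]
U = [[1.0, 0.0], [0.0, 1.0]]
readout = [1.0, -1.0]
x = [0.3, -0.1]

shallow_tokens, h_shallow = answer(x, 1, W, U, readout)
deep_tokens, h_deep = answer(x, 16, W, U, readout)

# Same output length at both depths; only the latent state differs.
print(len(shallow_tokens), len(deep_tokens))
```

Here `depth` is the only knob that adds compute, and it never lengthens the output sequence.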

Heima compresses entire CoT steps into single "thinking tokens" — compact high-dimensional representations that are decoded back to text only when needed. The thinking happens in the compressed latent space; verbalization is a display choice, not a computation requirement.

Both converge on the same uncomfortable implication: verbalized reasoning may be a historical artifact of training on human text and evaluation protocols that require readable chains — not a necessary property of machine reasoning.

This matters for at least three reasons:

Efficiency: If reasoning doesn't require tokens, the quadratic cost scaling of attention over long CoT chains is avoidable.
Capability: Latent space can represent multiple directions simultaneously, without the linear sequential constraint of token generation, potentially reaching reasoning facets (spatial reasoning, physical intuition) that tokenized text cannot represent.
Evaluation: Every reasoning benchmark that reads chains to evaluate quality is measuring a proxy. If the reasoning is latent, the chain is a summary, not a record.
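The efficiency point can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes standard causal attention, where each generated token attends to every earlier position and itself. The prompt length, step count, and tokens per step are hypothetical numbers, and `attention_pairs` is an illustrative helper. It counts attention interactions only; it ignores the extra per-token compute a depth-recurrent model spends on its recurrences.

```python
def attention_pairs(prompt_len, generated):
    """Token-pair attention interactions incurred while generating
    `generated` tokens after a prompt of `prompt_len` tokens."""
    total = prompt_len + generated
    # Sum of positions prompt_len+1 .. total (each token attends to itself
    # and everything before it).
    return total * (total + 1) // 2 - prompt_len * (prompt_len + 1) // 2

# Hypothetical workload: 100-token prompt, 20 reasoning steps, 10-token answer.
PROMPT, STEPS, TOKENS_PER_STEP, ANSWER = 100, 20, 50, 10

verbalized = attention_pairs(PROMPT, STEPS * TOKENS_PER_STEP + ANSWER)
compressed = attention_pairs(PROMPT, STEPS * 1 + ANSWER)  # one thinking token per step
latent_only = attention_pairs(PROMPT, ANSWER)             # no thinking tokens at all

print(verbalized, compressed, latent_only)  # 611555 3465 1055
```

Under these assumed numbers, the verbalized chain costs roughly two orders of magnitude more attention interactions than either latent variant, which is the sense in which the quadratic term is avoidable.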

The deepest version: we may be evaluating "the ability to write good-looking reasoning chains" rather than "the ability to reason."

The strongest empirical evidence comes from HRM (Hierarchical Reasoning Model): with only 27M parameters and 1,000 training samples, and no pretraining or CoT data, it achieves near-perfect accuracy on Sudoku-Extreme and optimal 30×30 maze pathfinding, tasks where state-of-the-art CoT methods score 0%. This is not a marginal improvement but a categorical capability gap: on these tasks, latent reasoning solves problems that verbalized reasoning does not.

Connections: Can models reason without generating visible thinking tokens?, Does more thinking time actually improve LLM reasoning?, Do chain-of-thought traces actually help users understand model reasoning?, Can recurrent hierarchies achieve reasoning that transformers cannot?

Related concepts in this collection 8

Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
the core empirical result underpinning this angle
Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
latent reasoning architecturally falsifies the link between thinking time and token count: more compute without more tokens
Do chain-of-thought traces actually help users understand model reasoning? Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
if CoT traces serve model performance not interpretability, latent reasoning strips away the pretense
Can recurrent hierarchies achieve reasoning that transformers cannot? Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
the specific architecture that demonstrates this capability
Do iterative refinement methods suffer from overthinking? Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
latent recurrent models bypass the sequential-extension failure mode entirely: by operating in compressed latent space rather than generating revision tokens, they avoid the variance inflation and anchoring bias that plague iterative refinement at every timescale
Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
latent recurrence with bounded depth offers an architectural escape from the rumination cycle
Where does LLM reasoning actually happen during generation? Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
provides the theoretical framework: H1 (latent trajectories) vs H2 (surface CoT) vs H0 (serial compute); this note's architectural evidence directly supports H1
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
mechanistic evidence: a single latent feature causally activates reasoning without verbalization, extending the architectural argument with interventional proof. Verbalized reasoning models cannot stop generating tokens when premises are missing, whereas bounded latent iteration would naturally cap the unproductive cycles
