SYNTHESIS NOTE

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? The question challenges how we measure and understand reasoning.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

Post angle — Medium

The current test-time scaling paradigm assumes that reasoning means generating tokens: thinking more means producing more intermediate reasoning tokens. The same assumption is built into any benchmark that measures reasoning quality by counting or reading the chain.

Two architectures challenge this from different angles:

Depth-recurrent models iterate a recurrent block in latent space. More recurrence = more thinking, but zero additional output tokens. The model updates its hidden state as many times as it needs, then produces an answer. Performance scales with recurrence depth. No specialized training data required.
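The loop described above can be sketched with a toy analogy. This is not the architecture of any depth-recurrent model: the hand-written `recurrent_block` stands in for a learned block, the "latent state" is a grid of distance estimates, and `MAZE`, `answer_after`, and the start/goal cells are hypothetical names chosen for illustration. The point is only the shape of the computation: the state is refined `depth` times, and the readout is a single value regardless of depth.

```python
# Toy analogy only: a hand-written update stands in for a learned recurrent block.
INF = float("inf")

def recurrent_block(state, walls):
    """One latent update: every open cell takes the best estimate offered by its neighbours."""
    rows, cols = len(state), len(state[0])
    new_state = [row[:] for row in state]
    for r in range(rows):
        for c in range(cols):
            if walls[r][c]:
                continue  # walls never hold a distance
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not walls[nr][nc]:
                    new_state[r][c] = min(new_state[r][c], state[nr][nc] + 1)
    return new_state

def answer_after(depth, walls, start, goal):
    """Iterate the block `depth` times in 'latent space', then emit one answer."""
    state = [[INF] * len(walls[0]) for _ in walls]
    state[start[0]][start[1]] = 0
    for _ in range(depth):
        state = recurrent_block(state, walls)
    return state[goal[0]][goal[1]]  # single readout, independent of depth

# 1 marks a wall; the only route from (0, 0) to (2, 0) goes around the right side.
MAZE = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    for depth in (4, 8, 12):
        print(depth, answer_after(depth, MAZE, (0, 0), (2, 0)))
```

With too few iterations the readout is still `inf` (the answer has not yet propagated to the goal); from depth 8 onward it settles at the true path length of 8. Extra iterations change the quality of the answer, not the length of the output.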

Heima compresses entire CoT steps into single "thinking tokens" — compact high-dimensional representations that are decoded back to text only when needed. The thinking happens in the compressed latent space; verbalization is a display choice, not a computation requirement.

Both converge on the same uncomfortable implication: verbalized reasoning may be a historical artifact of training on human text and evaluation protocols that require readable chains — not a necessary property of machine reasoning.

This matters for at least three reasons:

  1. Efficiency: If reasoning doesn't require tokens, the attention cost of long CoT chains, which grows quadratically with chain length, is avoidable.
  2. Capability: Latent space can represent multiple directions simultaneously without the linear sequential constraint of token generation, potentially accessing reasoning facets (spatial reasoning, physical intuition) that tokenized text cannot represent.
  3. Evaluation: Every reasoning benchmark that reads chains to evaluate quality is measuring a proxy. If the reasoning is latent, the chain is a summary, not a record.
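The efficiency point can be made concrete with a rough count. The sketch below assumes standard causal self-attention in which the t-th generated token attends to the prompt plus the t tokens generated so far (itself included). It counts only query-key pairs and ignores the feed-forward cost, which is linear in chain length. The function names are hypothetical.

```python
def cot_attention_pairs(prompt_len, chain_len):
    """Query-key pairs touched while generating `chain_len` tokens one at a time."""
    return sum(prompt_len + t for t in range(1, chain_len + 1))

def cot_attention_pairs_closed_form(prompt_len, chain_len):
    """Same count in closed form: P*T + T*(T+1)/2, so the chain term grows as T^2."""
    return prompt_len * chain_len + chain_len * (chain_len + 1) // 2

if __name__ == "__main__":
    for chain_len in (1000, 2000, 4000):
        print(chain_len, cot_attention_pairs(100, chain_len))
```

Under these assumptions, doubling the chain length roughly quadruples the chain-dependent term, which is the cost a latent approach would sidestep if it does not lengthen the token sequence.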

The deepest version: we may be evaluating "the ability to write good-looking reasoning chains" rather than "the ability to reason."

The strongest empirical evidence comes from HRM (Hierarchical Reasoning Model): with only 27M parameters and 1,000 training samples, and no pretraining or CoT data, it achieves near-perfect accuracy on Sudoku-Extreme and on optimal pathfinding in 30×30 mazes, tasks where state-of-the-art CoT methods score 0%. This is not a marginal improvement but a categorical capability gap: on these tasks, latent reasoning solves problems that verbalized reasoning does not.

Connections: Can models reason without generating visible thinking tokens?, Does more thinking time actually improve LLM reasoning?, Do chain-of-thought traces actually help users understand model reasoning?, Can recurrent hierarchies achieve reasoning that transformers cannot?

Original note title

reasoning without words — latent recurrent models challenge whether verbalized thinking is necessary