How do recursive language models rethink where to store reasoning?

This note explores how a new wave of architectures relocates reasoning out of the visible token stream (into hidden layers, continuous latent loops, masked positions, or sentence-level embeddings) and what that move buys you. The premise behind the question is that chain-of-thought reasoning, the visible step-by-step text that models produce, may be the wrong place to store thinking. Several notes in the corpus converge on that suspicion from different angles.

The sharpest version comes from work showing that depth-recurrent and latent-loop models scale test-time compute by iterating on hidden states rather than emitting words. Architectures like Coconut and Heima suggest that verbalization is a training artifact, not a reasoning requirement (see "Can models reason without generating visible thinking tokens?"). That reframes 'where to store reasoning' as a real design choice rather than a given. Meta's Large Concept Model pushes the storage level up instead of in: it reasons over whole-sentence embeddings in a language-agnostic space and decodes to words only at the end (see "Can reasoning happen at the sentence level instead of tokens?"). Diffusion LLMs go sideways: instead of reasoning before answering, they embed reasoning directly into masked positions and refine it in place, alongside the answer, so the two co-evolve (see "Can reasoning and answers be generated separately in language models?").
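To make the latent-loop idea concrete, here is a minimal toy sketch of spending more test-time compute by iterating on a hidden state and decoding only once at the end. It is not the Coconut, Heima, or any published depth-recurrent architecture: the update rule, the weights `w_h` and `w_x`, and the helper names `recurrent_step`, `latent_reason`, and `decode` are all assumptions chosen for illustration.

```python
import math

def recurrent_step(h, x, w_h=0.5, w_x=0.5):
    """One hypothetical latent update: mix the hidden state with the input embedding."""
    return [math.tanh(w_h * hi + w_x * xi) for hi, xi in zip(h, x)]

def latent_reason(x, num_iters):
    """Refine the hidden state num_iters times without emitting any tokens."""
    h = [0.0] * len(x)
    deltas = []  # how far the state moved at each iteration
    for _ in range(num_iters):
        h_next = recurrent_step(h, x)
        deltas.append(max(abs(a - b) for a, b in zip(h_next, h)))
        h = h_next
    return h, deltas

def decode(h):
    """Stand-in for a single decode at the end: index of the largest component."""
    return max(range(len(h)), key=lambda i: h[i])

x = [1.0, -0.5, 0.25]                    # made-up input embedding
h_short, _ = latent_reason(x, 2)         # small compute budget
h_long, deltas = latent_reason(x, 16)    # larger budget, same parameters
answer = decode(h_long)
```

The only knob that changes between the two calls is `num_iters`, which is the sense in which compute scales in depth rather than in emitted text. In this toy the state settles toward a fixed point; whether real latent-loop models behave that way is not something this sketch can show.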

What makes this rethink feel urgent, not just clever, is a parallel set of findings that the visible reasoning trace is an unreliable place to keep anything load-bearing. Models trained with hidden chain-of-thought tokens can compute the correct answer in their first few layers and then actively suppress it in later layers to produce format-compliant filler text (see "Do transformers hide reasoning before producing filler tokens?"). And the traces themselves turn out to be persuasive appearances rather than records of computation: corrupted or logically invalid steps perform nearly as well as valid ones (see "Do reasoning traces show how models actually think?"). If the words aren't where the thinking happens, storing reasoning as words is storing a performance.
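A toy logit-lens pass shows what "computed early, then overwritten" can look like. This is a sketch under stated assumptions, not the analysis from the cited note: the three-token vocabulary, the `unembed` matrix, and the per-layer hidden states are invented numbers, and `logit_lens` is a hypothetical helper.

```python
def logit_lens(hidden_by_layer, unembed, vocab):
    """Project each layer's hidden state onto the vocabulary and rank the tokens."""
    rankings = []
    for h in hidden_by_layer:
        logits = [sum(w * v for w, v in zip(row, h)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        rankings.append([vocab[i] for i in order])
    return rankings

vocab = ["7", "...", "the"]          # "7" plays the answer, "..." the filler token
unembed = [[1.0, 0.0],               # one row of made-up weights per vocabulary entry
           [0.0, 1.0],
           [0.2, 0.2]]
hidden_by_layer = [[0.9, 0.1],       # early layer: answer direction dominates
                   [0.8, 0.4],       # middle layer
                   [0.3, 1.0]]       # final layer: filler direction dominates

rankings = logit_lens(hidden_by_layer, unembed, vocab)
for layer, ranked in enumerate(rankings):
    print(layer, ranked)
```

With these invented numbers the answer token is ranked first in the early layers and drops to second in the final layer, behind the filler token. That mirrors the claim that the suppressed answer remains recoverable from lower-ranked predictions, but it is an illustration, not evidence.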

The lateral payoff is that 'where' interacts with 'how well.' Models reason through semantic association, not symbolic logic, so their reasoning is tethered to training-distribution meaning regardless of the rules in front of them (see "Do large language models reason symbolically or semantically?"). Separately, up to 67% of chain-of-thought errors come from local token-to-token memorization, where the model leans on the preceding tokens rather than the problem (see "Where do memorization errors arise in chain-of-thought reasoning?"). Both are failure modes of token-by-token generation specifically. Moving reasoning into latent or embedding space is partly an attempt to escape exactly that local, surface-bound dependency.

The quiet counterpoint worth carrying away: relocating reasoning doesn't relocate the bottleneck. Other work argues that collapses on hard problems are execution failures, in which the model knows the algorithm but can't run enough steps in text, and that tool use, not more hidden compute, breaks the ceiling (see "Are reasoning model collapses really failures of reasoning?"). So the 'where' debate sits next to an open 'whether': changing the substrate of reasoning may not help if the real limit is procedural bandwidth.

Sources (8 notes)

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
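The early-exit idea in the ICE note above (answer confidence converges before the reasoning finishes refining) can be sketched as a simple stopping rule. This is an assumed rule for illustration, not the mechanism from the paper: the `threshold` and `patience` parameters, the function name `early_exit_step`, and the confidence trace are all invented.

```python
def early_exit_step(confidences, threshold=0.9, patience=2):
    """Return the refinement step at which to stop.

    Stops once answer confidence has stayed at or above `threshold`
    for `patience` consecutive steps; otherwise runs to the last step.
    """
    streak = 0
    for step, conf in enumerate(confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step
    return len(confidences) - 1

# Made-up per-step answer confidence over ten refinement steps.
trace = [0.35, 0.62, 0.91, 0.94, 0.95, 0.95, 0.96, 0.96, 0.96, 0.97]
stop = early_exit_step(trace)
steps_used = stop + 1
fraction_saved = 1 - steps_used / len(trace)
```

The saving here depends entirely on the invented trace and the chosen threshold, so it says nothing about the 50% figure reported in the note. The point is only that a stable answer confidence gives a natural signal for cutting refinement short.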

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether the substrate of reasoning—token stream vs. hidden states vs. embeddings vs. masked positions—remains a live design question or has been functionally resolved. The question: *where should recursive language models store reasoning?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, mostly concentrated 2024–present:
• Depth-recurrent and latent-loop architectures scale test-time compute by iterating on hidden states rather than emitting words; verbalization may be a training artifact, not reasoning requirement (~2025, arXiv:2502.05171).
• Transformers compute correct answers in early layers, then actively overwrite them to produce format-compliant text; corrupted reasoning traces perform nearly as well as valid ones (~2024–2025, arXiv:2412.04537, arXiv:2506.09250).
• Token-by-token generation causes local memorization failures; models lean on preceding tokens rather than problem structure; reasoning is semantic association, not symbolic logic (~2023–2025, arXiv:2305.14825, arXiv:2508.02037).
• Diffusion LLMs embed reasoning directly into masked positions, co-evolving reasoning and answer in place, bypassing the prefix-only constraint (~2025, arXiv:2508.10736).
• Meta's Large Concept Models reason over sentence-level embeddings in language-agnostic space, decoding to words only at the end (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2502.05171 (2025) — latent reasoning via recurrent depth.
• arXiv:2412.04537 (2024) — hidden computations and layer-wise reasoning collapse.
• arXiv:2508.02037 (2025) — token-level memorization sources in CoT.
• arXiv:2508.10736 (2025) — in-place prompting in diffusion LLMs.

Your task:
(1) RE-TEST EACH CONSTRAINT. For latent-loop and embedding-based reasoning: has adoption in deployed models, training stability, or scaling behavior changed the viability calculus since mid-2025? Have newer evaluations (e.g., on long-horizon reasoning, multi-step retrieval, tool-use orchestration) confirmed or contradicted the claim that hidden reasoning outperforms token-stream reasoning? Separately, has the "execution failure, not reasoning failure" hypothesis (arXiv:2602.06176, 2026) displaced the substrate question, or do they address different bottlenecks? State plainly which constraints still hold.
(2) Surface the strongest work from the last ~6 months that either contradicts latent-reasoning-as-solution or proposes an entirely different locus for reasoning (e.g., attention patterns, memory modules, tool-use graphs).
(3) Propose two research questions that assume the substrate may be partially solved: (a) What is the *interaction* between substrate choice and inference-time scaling laws—does moving reasoning out of tokens unlock new scaling regimes, or just shift the ceiling? (b) Does the "correct location to store reasoning" depend on the problem class (symbolic vs. semantic), and if so, should we expect hybrid architectures?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
