What computational structures can actually scale serial reasoning depth?
This explores which computational architectures genuinely add serial reasoning depth, meaning real layered computation, rather than just spending more tokens to look like they're thinking harder. The corpus draws a sharp line between the two, and the most interesting finding is that the obvious lever (longer chains of thought) is the weakest one. The clearest structural answer comes from recurrence: the Hierarchical Reasoning Model (HRM) couples a slow abstract-planning loop with a fast detailed-computation loop running on two timescales, and that dual recurrence lets it solve Sudoku puzzles and mazes on which chain-of-thought methods fail completely, with only 27M parameters (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). This matters because of a hard ceiling: a fixed-depth transformer is confined to a low complexity class (AC0/TC0), so no amount of prompting adds serial steps to a forward pass that structurally lacks them. Recurrence adds those steps by looping computation, not by emitting more words.
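To make "looping computation" concrete, here is a minimal sketch of a two-timescale recurrence. It is not HRM's implementation: the scalar states, the update rules `fast_step` and `slow_step`, and all coefficients are hypothetical stand-ins chosen only to show the control flow. The point it illustrates is that the same two fixed functions yield more serial steps simply by running more cycles.

```python
import math


def fast_step(z_fast, z_slow, x):
    # Hypothetical fast update: refines detail under the current plan.
    return math.tanh(0.5 * z_fast + 0.3 * z_slow + 0.2 * x)


def slow_step(z_slow, z_fast):
    # Hypothetical slow update: revises the plan from the fast loop's result.
    return math.tanh(0.7 * z_slow + 0.3 * z_fast)


def two_timescale_recurrence(x, slow_cycles, fast_steps):
    """Run the nested loops and count how many updates happen in sequence."""
    z_slow, z_fast = 0.0, 0.0
    serial_steps = 0
    for _ in range(slow_cycles):
        # The fast state iterates several times per slow update.
        for _ in range(fast_steps):
            z_fast = fast_step(z_fast, z_slow, x)
            serial_steps += 1
        # The slow state updates once per cycle.
        z_slow = slow_step(z_slow, z_fast)
        serial_steps += 1
    return z_slow, serial_steps
```

In this sketch, serial depth is `slow_cycles * (fast_steps + 1)`, so doubling the number of cycles doubles the depth while the number of update functions, standing in here for the parameter count, stays fixed. No tokens are emitted between steps.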
Why not just write longer reasoning traces? Because token-level depth hits walls that the corpus documents from several angles. Chain-of-thought accuracy follows an inverted U: it peaks at an intermediate length and then declines, and stronger models actually prefer shorter chains (see "Why does chain of thought accuracy eventually decline with length?"). Worse, reasoning quality degrades sharply with input length alone: on FLenQA, accuracy drops from 92% to 68% with only 3,000 tokens of padding, far below the context limit, and the drop persists even with CoT prompting (see "Does reasoning ability actually degrade with longer inputs?"). So 'more serial tokens' is not 'more serial reasoning'; it can actively corrode it.
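One way to see why an inverted U is plausible is a toy model, offered purely as an illustration and not taken from the cited notes. It assumes that splitting a task into more steps makes each step easier, but that every step also carries a fixed chance of error that compounds along the chain. The function names, the load formula, and the constants (`step_noise`, `capability`, `task_size`) are all made up for this sketch.

```python
def chain_accuracy(n_steps, task_size, capability, step_noise=0.05):
    """Toy probability that an n-step chain solves the whole task."""
    # Assumption: each step carries an equal share of the task, and a step
    # fails outright if its share meets or exceeds the model's capability.
    per_step_load = task_size / (n_steps * capability)
    if per_step_load >= 1.0:
        return 0.0
    # Assumption: every step also has a fixed, independent error rate.
    per_step_success = (1.0 - step_noise) * (1.0 - per_step_load)
    # Errors compound: all steps must succeed.
    return per_step_success ** n_steps


def best_chain_length(task_size, capability, max_steps=30):
    """Chain length that maximizes the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: chain_accuracy(n, task_size, capability),
    )
```

Under these assumptions, accuracy is low for very short chains (each step is overloaded) and low for very long ones (fixed errors compound), with a peak in between. The peak moves toward longer chains as `task_size` grows and toward shorter chains as `capability` grows, which is the qualitative pattern the source notes report.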
The deeper reason longer chains don't scale depth is that token-level CoT may not be genuine computation at all. Several notes converge on the claim that chain-of-thought reproduces the *form* of reasoning by pattern-matching learned schemata rather than applying valid underlying logic, which is why structurally invalid prompts still 'work' and why performance collapses predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", "What makes chain-of-thought reasoning actually work?", and "Does chain-of-thought reasoning actually generalize beyond training data?"). If the serial chain is imitation rather than computation, stretching it longer just imitates more. This reframes the whole question: failures show up at instance-novelty boundaries, not complexity thresholds, because models fit instance patterns rather than running a depth-scalable algorithm (see "Do language models fail at reasoning due to complexity or novelty?").
So the structures that genuinely scale depth share a trait: they move computation off the flat token stream into a representational space where steps can stack. Recurrent hierarchies loop in latent space (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"); Large Concept Models reason over sentence embeddings with paragraph-level planning before decoding, replacing flat token generation with hierarchical abstraction (see "Can reasoning happen at the sentence level instead of tokens?"). Less obviously, depth isn't the only axis. GRAM argues that reasoning scales better in *width*: sampling parallel latent trajectories sidesteps the serial latency cost of going deeper, getting the benefit of exploration without paying for length (see "Can reasoning systems scale wider instead of only deeper?"). Finally, what makes any of these productive is the training regime, not the inference budget: reasoning models beat non-reasoning ones at *any* compute budget because training installs a protocol that makes extra computation count (see "Can non-reasoning models catch up with more compute?"). The structure has to be trained to use its own depth, or the depth is wasted.
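The width-versus-depth trade-off can be sketched with a toy sampler. This is not GRAM's method: the scalar latent state, the noisy transition rule, and the distance-to-target scorer are hypothetical choices made only to show the cost accounting. The trajectories are independent of one another, so although this sketch samples them one after another, they could run in parallel, and the serial latency is the depth of one trajectory rather than the total number of steps.

```python
import random


def sample_trajectory(rng, x, depth, noise=0.5):
    """Roll one stochastic latent trajectory forward for `depth` steps."""
    z = x
    for _ in range(depth):
        # Hypothetical stochastic transition: decay plus Gaussian noise.
        z = 0.9 * z + rng.gauss(0.0, noise)
    return z


def widen(x, target, width, depth, seed=0):
    """Sample `width` independent trajectories and keep the closest one."""
    rng = random.Random(seed)
    finals = [sample_trajectory(rng, x, depth) for _ in range(width)]
    best = min(finals, key=lambda z: abs(z - target))
    return {
        "best_error": abs(best - target),
        "serial_latency": depth,        # steps that must happen in sequence
        "total_steps": width * depth,   # total work across all trajectories
    }
```

Raising `width` increases `total_steps` but leaves `serial_latency` unchanged, whereas raising `depth` increases both. With a fixed seed, a wider run contains the narrower run's trajectories as a prefix, so its best error can only stay the same or improve.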
Sources (10 notes)
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.