Can bounded-depth transformers solve inherently sequential problems?

This explores whether transformers — which compute in a fixed number of parallel layers — hit a hard ceiling on problems that require step-by-step serial reasoning, and what the corpus says about getting around that ceiling.

This explores whether transformers, which compute in a fixed number of parallel layers, hit a hard wall on problems that genuinely require working through steps in order. The short version from the corpus: yes, there's a real ceiling, and it's a complexity-theory result, not an engineering inconvenience. The serial scaling hypothesis Can parallel architectures solve inherently sequential problems? argues that some problems need polynomial-depth reasoning that parallel architectures provably cannot perform — no amount of making the model wider or training it longer fixes this, because the limitation is in the shape of the computation, not its scale. A bounded-depth transformer lives inside a complexity class (the AC0/TC0 family) that simply doesn't contain these inherently sequential problems.

What makes this concrete is the Hierarchical Reasoning Model Can recurrent hierarchies achieve reasoning that transformers cannot?, which beats the ceiling not by scaling up but by adding recurrence — looping abstract planning over fast detailed computation so that effective depth grows with the problem. On Sudoku and mazes, where chain-of-thought transformers fail outright, a tiny 27M-parameter recurrent model gets near-perfect results. The lesson is that serial depth has to come from *somewhere*: either recurrence in the architecture, or — and this is the trick most deployed models use — chain-of-thought tokens, which let a fixed-depth model unroll a long computation across many forward passes. The catch is that the model has to actually *use* those tokens for computation. One striking finding shows transformers computing the real answer in their first few layers and then overwriting it with format-compliant filler Do transformers hide reasoning before producing filler tokens?, a reminder that visible "reasoning" tokens aren't always where the work happens.

There's a fascinating tension in the corpus, though. A single finite transformer is, in principle, Turing complete — given the right prompt it can compute any computable function Can a single transformer become universally programmable through prompts?. So the ceiling isn't about what transformers *could* represent; it's about what gradient-trained models actually learn to do, and about how much serial depth you let them spend at inference. Relatedly, non-reasoning models can't be made to match reasoning models just by handing them more inference compute Can non-reasoning models catch up with more compute? — extra tokens only help if training instilled a protocol for spending them productively. Depth, it turns out, matters more than width even at tiny scale, because composing concepts through layers is how these models build up multi-step structure Does depth matter more than width for tiny language models?.

The corpus also explains *why* the failures look the way they do. When transformers tackle compositional, multi-step tasks, they tend to reduce them to memorized pattern-matching over linearized subgraphs Do transformers actually learn systematic compositional reasoning? rather than executing a general algorithm — which is exactly what you'd expect from a bounded-depth machine faking sequential work, and it's why errors compound on novel compositions. Architectural fixes that inject the right inductive bias help: giving attention an explicit stack to track recursive structure yields large gains on syntax Can explicit stack tracking improve how transformers learn recursive syntax?, essentially handing the model the serial scratchpad it otherwise lacks.

The genuinely surprising doorway here is that you don't always need new architecture to escape the trap — sometimes you can bootstrap out of it. Self-improving transformers learn to generalize from 10-digit to 100-digit addition by repeatedly generating their own correct solutions, filtering, and retraining, gaining *exponential* length generalization across rounds Can transformers improve exponentially by learning from their own correct solutions?. So the practical answer to the question is layered: bounded-depth transformers cannot solve inherently sequential problems within a single fixed-depth pass, but recurrence, chain-of-thought unrolling, the right architectural scratchpad, or iterative self-training can each buy back the serial depth the problem demands.

Sources 9 notes

Can parallel architectures solve inherently sequential problems?

Complexity theory proves that problems requiring polynomial-depth reasoning cannot be solved by parallel architectures like Transformers, even with infinite scaling. Progress requires recurrent structures that increase serial computation depth.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can explicit stack tracking improve how transformers learn recursive syntax?

Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Can bounded-depth transformers solve inherently sequential problems?

Sources 9 notes

Next inquiring lines