Can parallel architectures solve inherently sequential problems?
Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
The distinction between problems that are "wide" (parallelizable) and "deep" (inherently serial) is fundamental to computational complexity theory but underappreciated in machine learning. The Serial Scaling Hypothesis formalizes this: for many important ML problems — especially those involving complex reasoning, planning, or evolution of interacting systems — increasing parallel computation alone is insufficient. Progress requires scaling serial computation.
The complexity-theoretic argument is precise. Transformers and state-space models, despite processing "sequences," are parallelizable computations. Unless used with autoregressive inference (one token at a time), a Transformer layer ingests all N tokens concurrently. This places them in computational complexity classes like AC0 or TC0, preventing them from solving problems requiring polynomial time.
The most striking finding: diffusion models cannot solve inherently serial problems. Despite appearing non-parallelizable due to iterative denoising, diffusion models with a fixed-depth TC0 backbone remain in TC0 even with infinite diffusion steps. The iterative nature of diffusion is a red herring — the computational depth per step is what matters, and that is bounded.
Consider Sudoku as a parable. Easy puzzles can be solved by filling blanks independently — parallel. Hard puzzles require long chains of dependent reasoning where each blank depends on others — serial. No algorithm can shortcut this dependency chain. The same distinction applies to mathematical reasoning, physical simulation, and sequential decision-making.
The implications cut three ways:
- Model design: Future architectures need recurrent structures to increase serial computation, not just wider parallel ones. But adding recurrence introduces trainability concerns — gradient variance increases and Lipschitz constants grow, making training harder.
- Hardware: The field may need faster, lower-latency processors, not just more parallel ones.
- Evaluation: Serial compute should be measured and reported separately from total compute.
This provides the complexity-theoretic foundation for why Can models reason without generating visible thinking tokens? matters — recurrent depth-scaling is not just an efficiency trick but a necessity for inherently serial problems that parallel architectures provably cannot solve.
Inquiring lines that use this note as a source 9
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can sequential computation through depth solve problems that parallel width cannot?
- Why do standard transformers fail on problems requiring serial algorithmic reasoning?
- What interference occurs when planning and synthesis happen in the same component?
- Can any architecture fundamentally solve problems that require inherently sequential computation?
- What makes parallel thinking more efficient than sequential chains?
- What makes a problem fundamentally sequential versus parallelizable?
- Can bounded-depth transformers solve inherently sequential problems?
- Are some problems fundamentally unsolvable by parallel inference methods?
- How does directional diversity compare to other forms of parallel planning?
Related concepts in this collection 5
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
the serial hypothesis provides the theoretical grounding for when sequential is not just preferred but required
-
Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
qualified: parallel wins only on parallelizable problems; for inherently serial problems, parallel scaling is provably insufficient
-
Can models reason without generating visible thinking tokens?
Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
recurrent latent reasoning is the architecture class that can scale serial computation
-
Can models reason without generating visible thinking steps?
Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
latent recurrence may be the path to serial depth without the quadratic cost of verbalized CoT
-
Can reasoning systems scale wider instead of only deeper?
Explores whether sampling multiple parallel latent trajectories offers a faster scaling path than recursive refinement alone. Matters because it could unlock latency-efficient reasoning at test time.
qualifies: width cannot replace depth on inherently serial problems — the axes are complements
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Serial Scaling Hypothesis
- Ask, and it shall be given: Turing completeness of prompting
- Compositional Reasoning with Transformers, RNNs, and Chain of Thought
- Hierarchical Reasoning Model
- Implicit Chain of Thought Reasoning via Knowledge Distillation
- Faith and Fate: Limits of Transformers on Compositionality
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Original note title
the serial scaling hypothesis — some problems are fundamentally sequential and parallel architectures cannot solve them