INQUIRING LINE

Can width-scaling replace depth-scaling on inherently sequential problems?

This explores whether you can get away with making a model wider (more parallel paths, more sampling) instead of deeper (more serial layers/steps) when a problem's steps genuinely depend on each other in order — and the corpus says width buys you a lot, but it can't substitute for depth where each step needs the previous one's output.


The short version the corpus suggests: width and depth solve different problems. Width-scaling (parallel sampling, more breadth) buys exploration, while depth-scaling (serial layers, sequential computation) buys composition. On *inherently sequential* tasks, where step N truly depends on step N-1, depth does work that width structurally cannot.

Start with what depth actually buys you. For small models, deep-and-thin architectures beat wide-and-balanced ones because layers *compose*: each layer builds an abstraction on top of the one below, and that composition cannot be spread across width (see: Does depth matter more than width for tiny language models?). The hard ceiling shows up in complexity terms: a fixed-depth transformer is confined to the shallow circuit classes AC0/TC0, which the corpus links to chain-of-thought collapsing on Sudoku and mazes, tasks that are nothing but long chains of dependent steps. Coupling slow planning with fast computation across two timescales (effective recurrence) escapes that ceiling with a tiny 27M-parameter model (see: Can recurrent hierarchies achieve reasoning that transformers cannot?). The lesson: when the problem is sequential, you need *more effective depth*, and adding width doesn't grant it.
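
To make 'effective depth' concrete, here is a minimal Python sketch of a dependent chain. It is a toy, not a model of a transformer: the step function `step`, the modulus `P`, and the helper names are all hypothetical choices made for illustration. Each step consumes the previous step's output, so a worker limited to `depth_budget` steps returns the same truncated value however many copies run side by side, while reusing the same step in a loop reaches the full answer.

```python
# Toy illustration only: `step`, `P`, and the helper names are assumptions,
# not taken from any of the cited notes or papers.
P = 1_000_003  # arbitrary modulus for the toy step function


def step(x: int) -> int:
    """One dependent step: the output is the next step's only input."""
    return (x * x + 1) % P


def run_chain(x0: int, n_steps: int) -> int:
    """Depth: apply `step` serially, feeding each output into the next call."""
    x = x0
    for _ in range(n_steps):
        x = step(x)
    return x


def run_wide(x0: int, depth_budget: int, n_workers: int) -> list[int]:
    """Width: many workers, each limited to `depth_budget` serial steps.

    The workers are identical and deterministic here, so adding more of them
    repeats the same truncated computation instead of extending the chain.
    """
    return [run_chain(x0, depth_budget) for _ in range(n_workers)]


if __name__ == "__main__":
    target = run_chain(2, 4)  # full chain: 2 -> 5 -> 26 -> 677 -> 458330
    shallow = run_wide(2, depth_budget=3, n_workers=8)
    print(target, set(shallow))  # every shallow worker stops at 677
```

The step was chosen so that no shortcut is obvious. Chains with more structure can sometimes be reorganized to run partly in parallel, so this sketch only illustrates the dependent case the paragraph describes.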

The sharpest negative evidence is that LLMs do not appear to run iterative procedures in latent space: faced with an optimization that requires looping toward a solution, they pattern-match a template and emit plausible-but-wrong numbers, and the failure persists across every scale tested (see: Do large language models actually perform iterative optimization?). That's the signature of a missing serial mechanism, not a missing-parameters problem. It rhymes with the plateau on constrained optimization, where constraint satisfaction stalls at ~55–60% regardless of parameter count or architecture, which points to a structural ceiling rather than a scaling gap (see: Do larger language models solve constrained optimization better?).
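
For readers who want to see what 'looping toward a solution' means mechanically, the following is a minimal sketch of gradient descent on a one-dimensional toy objective. The objective, learning rate, and function names are assumptions chosen for illustration; they are not drawn from the cited work. The point is only that each iterate is computed from the previous one.

```python
# Toy illustration only: the objective, learning rate, and names are assumptions.
def grad(x: float) -> float:
    """Gradient of the toy objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def gradient_descent(x0: float, lr: float, n_steps: int) -> float:
    """Run n_steps updates; each iterate is computed from the previous one."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * grad(x)
    return x


if __name__ == "__main__":
    for n in (0, 1, 10, 50):
        x = gradient_descent(x0=0.0, lr=0.1, n_steps=n)
        print(n, abs(x - 3.0))  # the error shrinks by a factor of 0.8 per step
```

With `lr=0.1` the distance to the minimum at 3 shrinks by a constant factor each step, so accuracy is set by how many serial updates are run, not by how many copies of a single update are run in parallel.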

So where *does* width win? Precisely where the work is parallelizable. Reasoning systems scale efficiently by sampling parallel latent trajectories, dodging the serial latency cost of depth-only scaling when independent paths can explore the solution space at once (see: Can reasoning systems scale wider instead of only deeper?). And allocating test-time compute to *diverse abstractions* enforces a breadth-first search that beats deep-but-narrow chains, which otherwise 'underthink' by committing early to one line (see: Can abstractions guide exploration better than depth alone?). Notice the pattern: width helps with *search and exploration* (which candidate strategy?), an embarrassingly parallel question, but the moment you've picked a strategy and have to *execute* its dependent steps, you're back in serial territory.
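
The parallel half of that pattern can be sketched as best-of-N sampling. This is a generic illustration, not the method of any cited system: `propose`, `score`, `best_of_n`, and `TARGET` are hypothetical names, and the 'verifier' is a stand-in that simply measures distance to a known answer. Because each candidate depends only on its own seed, the candidates can be generated concurrently.

```python
import random
from concurrent.futures import ThreadPoolExecutor

TARGET = 42  # hypothetical "right answer" that the toy verifier checks against


def propose(seed: int) -> int:
    """One independent candidate; it depends only on its own seed."""
    return random.Random(seed).randint(0, 100)


def score(candidate: int) -> int:
    """Toy verifier: higher is better, 0 means an exact hit."""
    return -abs(candidate - TARGET)


def best_of_n(n: int) -> int:
    """Sample n candidates in parallel and keep the best-scoring one."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        candidates = list(pool.map(propose, range(n)))
    return max(candidates, key=score)


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        best = best_of_n(n)
        print(n, best, score(best))
```

Since a larger `n` reuses the smaller run's seeds and adds more, the best score can only stay the same or improve as width grows. That monotonic gain is what makes search a good fit for width, and it says nothing about executing the chosen strategy's dependent steps.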

The less obvious point: depth isn't really about layer count, it's about *composability under dependency*. Recursive subtask trees that internalize their own structure can sustain reasoning past context limits by pruning the cache between dependent stages (see: Can recursive subtask trees overcome context window limits?), and the long-context bottleneck turns out to be the *compute to consolidate* prior context into state, not raw memory (see: Is long-context bottleneck really about memory or compute?). Both reframe 'sequential' as 'needs serial transformation steps.' Width gives you more guesses; depth gives you the ability to build on your last answer. On inherently sequential problems, more guesses don't add up to a chain, so width complements depth but doesn't replace it.
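
The pruning idea in the paragraph above can be pictured with a small recursive sketch. The task encoding, the `trace` list standing in for a cache, and the pruning rule are all assumptions made for illustration; they are not the Thread Inference Model's actual data structures or rules. Each subtask keeps working notes while it runs, then hands back only its result.

```python
# Toy illustration only: the task encoding and names are assumptions,
# not the Thread Inference Model's actual data structures or pruning rules.
def solve(task, trace: list, prune: bool = True) -> int:
    """Solve a nested task; `task` is an int leaf or a tuple (op, left, right)."""
    if isinstance(task, int):
        trace.append(f"leaf {task}")
        return task
    op, left, right = task
    mark = len(trace)               # where this subtask's working notes begin
    a = solve(left, trace, prune)
    b = solve(right, trace, prune)  # runs after the left subtask has finished
    result = a + b if op == "add" else a * b
    if prune:
        del trace[mark:]            # drop the subtask's notes, keep only its result
    trace.append(f"{op} -> {result}")
    return result


if __name__ == "__main__":
    task = ("add", ("mul", 2, 3), ("add", 4, ("mul", 5, 6)))
    for prune in (True, False):
        trace = []
        value = solve(task, trace, prune)
        print(prune, value, len(trace))  # same value; the pruned trace is shorter
```

Both runs return the same value, but the pruned run ends with a single retained entry while the unpruned run keeps one entry per node. The serial order of the subtasks is unchanged; only the retained state shrinks.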


Sources (8 notes)

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether width-scaling (parallel sampling, breadth) can replace depth-scaling (serial composition, sequential compute) on inherently sequential problems. The question remains open; treat the findings below as dated claims to verify against current capability.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026:
• Depth composes abstractions layer-by-layer; fixed-depth transformers are stuck in AC0/TC0 complexity, collapsing on Sudoku/mazes — tasks requiring long dependent chains (~2024–2025).
• LLMs cannot execute iterative numerical methods in latent space at any scale; they pattern-match templates instead of looping — a signature of missing serial mechanism, not parameter deficit (~2025).
• Constraint satisfaction plateaus at 55–60% regardless of parameter count or architecture — a structural ceiling (~2026).
• Width wins on *parallelizable* work: reasoning systems scale efficiently by sampling parallel latent trajectories, and breadth-first search over diverse abstractions beats deep-but-narrow chains (~2025).
• Recursive subtask trees with KV-cache pruning sustain reasoning past context limits; the bottleneck is compute to transform evicted context into state, not memory (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2502.05171 (Feb 2025) — Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
• arXiv:2603.23004 (Mar 2026) — Can Large Language Models Reason and Optimize Under Constraints?
• arXiv:2512.24601 (Dec 2025) — Recursive Language Models
• arXiv:2505.20296 (May 2025) — Reasoning LLMs are Wandering Solution Explorers

Your task:
(1) RE-TEST EACH CONSTRAINT. For the AC0/TC0 ceiling, iterative-procedure barrier, and 55–60% plateau: have newer models, training regimes (e.g., process-supervised reasoning), inference harnesses (agentic recursion, tool-use chains), or evaluation methods *relaxed* these limits? Separate the durable question (can width substitute for depth?) from the perishable limitation (can current inference bypass it?). Cite what resolved it.
(2) Surface the strongest work from the last 6 months that contradicts or supersedes the claim that width cannot replace depth on sequential problems — especially any showing parallelizable solutions to Sudoku, mazes, or iterative optimization.
(3) Propose 2 research questions assuming the regime has moved: (a) If hybrid width–depth at inference time (e.g., parallel rollouts + serial refinement per trajectory) now tractably solves constrained optimization, what is the *sample complexity* cost vs. pure depth? (b) Can learned *routing* (which steps require serialization, which parallelize?) outperform static depth-width allocation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
