Why does depth outperform width for sub-billion parameter models?

This explores why, when you only have a few hundred million parameters to spend, stacking more layers (depth) beats making each layer wider — and what 'depth' is actually buying you.

The clearest answer in the corpus comes from MobileLLM, which found that deep-and-thin architectures deliver 2.7–4.3% accuracy gains over balanced designs at the 125M–350M scale. That result contradicts the older intuition, drawn from the Kaplan et al. scaling laws, that depth and width are roughly interchangeable at a fixed parameter count (see: Does depth matter more than width for tiny language models?). The mechanism is the interesting part: depth lets a model *compose* abstract concepts through successive layers, where each layer transforms the output of the one below. Width, by contrast, mostly gives you more parallel features at the same level of abstraction. When your parameter budget is tiny, spending it on more rungs of the abstraction ladder pays off more than spending it on wider rungs.
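To make the budget trade-off concrete, here is a minimal sketch that counts how many layers a fixed parameter budget buys at different hidden sizes. It assumes the common approximation of about 12·d_model² parameters per transformer block (4·d² for the attention projections plus 8·d² for an MLP with 4x expansion) and ignores embeddings, norms, and biases. The function names, the 100M budget, and the hidden sizes are illustrative assumptions, not MobileLLM's actual configurations.

```python
def approx_block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Rough parameter count of one transformer block (assumed approximation).

    Attention: Q, K, V, and output projections -> 4 * d_model^2.
    MLP: up and down projections -> 2 * mlp_ratio * d_model^2.
    Embeddings, layer norms, and biases are ignored.
    """
    attention = 4 * d_model * d_model
    mlp = 2 * mlp_ratio * d_model * d_model
    return attention + mlp


def layers_for_budget(param_budget: int, d_model: int) -> int:
    """How many whole blocks of width d_model fit inside param_budget."""
    return param_budget // approx_block_params(d_model)


if __name__ == "__main__":
    budget = 100_000_000  # hypothetical non-embedding budget
    for d_model in (512, 768, 1024):  # hypothetical hidden sizes
        n_layers = layers_for_budget(budget, d_model)
        print(f"d_model={d_model:5d} -> {n_layers:3d} layers")
```

Because block size grows with the square of the hidden size, halving the width roughly quadruples the number of layers the same budget can afford. Under these assumptions, a 100M budget buys 7 layers at width 1024 but 31 layers at width 512.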

Why would composition matter so much? A striking parallel comes from self-supervised reinforcement learning, where scaling networks toward 1000 layers produced *qualitative* behavioral jumps at specific depth thresholds rather than smooth, gradual gains: depth 16 unlocked walking, and depth 256 unlocked wall-climbing (see: Does network depth unlock qualitatively new behaviors in RL?). That hints depth isn't just adding capacity; it's adding *kinds* of computation that a shallow network may not reach no matter how wide it is. Related work on reasoning tells a similar layer-by-layer story: a model's genuine 'thinking' can be measured by the proportion of tokens whose predictions are significantly revised as they pass up through the layers, and that revision-across-depth correlates robustly with accuracy (see: Can we measure how deeply a model actually reasons?). Depth, in other words, is where iterative refinement lives.
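The idea of measuring revision across depth can be illustrated with a toy proxy. The sketch below is not the deep-thinking ratio as defined in the cited work; it is a simplified stand-in that assumes you already have each layer's top-1 token prediction for every position (for example from a logit-lens style readout). The names `settle_layer`, `late_revision_ratio`, and the `late_fraction` threshold are hypothetical.

```python
from typing import List


def settle_layer(preds: List[int]) -> int:
    """Index of the earliest layer from which the prediction stays equal
    to the final layer's prediction."""
    final = preds[-1]
    i = len(preds) - 1
    # Walk downward while the layer below already agrees with the final answer.
    while i > 0 and preds[i - 1] == final:
        i -= 1
    return i


def late_revision_ratio(layer_preds: List[List[int]], late_fraction: float = 0.5) -> float:
    """Fraction of tokens whose prediction only settles in the upper part
    of the network (toy proxy, assumed definition)."""
    if not layer_preds:
        return 0.0
    late = 0
    for preds in layer_preds:
        threshold = late_fraction * len(preds)
        if settle_layer(preds) >= threshold:
            late += 1
    return late / len(layer_preds)


if __name__ == "__main__":
    # Four tokens, four layers each; values are made-up token ids.
    toy = [
        [5, 5, 5, 5],  # settled from the first layer
        [1, 2, 3, 3],  # settles at layer index 2
        [1, 3, 2, 3],  # revised again, settles at the last layer
        [7, 7, 7, 7],  # settled from the first layer
    ]
    print(late_revision_ratio(toy))
```

A higher ratio in this toy means more tokens are still being revised in the upper layers, which is the kind of signal the paragraph above associates with iterative refinement.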

But depth isn't a free lunch, and the corpus pushes back usefully. Deep-and-thin models pay a serial-latency cost, because every layer must finish before the next begins. One line of work argues reasoning systems can instead scale in *width* by sampling parallel latent trajectories, sidestepping that serial bottleneck while still exploring the solution space (see: Can reasoning systems scale wider instead of only deeper?). So 'depth vs. width' has no universal winner: depth wins for squeezing capability out of a fixed tiny parameter budget, while width wins when you care about latency and can spend compute at inference time instead.
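The latency argument can be sketched with a deliberately crude cost model. It assumes every layer takes the same fixed time, that layers run strictly one after another, and that independent trajectories can run fully in parallel up to some hardware limit. The names `t_layer` and `n_parallel` are hypothetical parameters, and the model ignores batching overhead, memory bandwidth, and communication costs.

```python
import math


def depth_latency(n_layers: int, t_layer: float) -> float:
    """Wall-clock time of one forward pass when layers run serially."""
    return n_layers * t_layer


def width_latency(n_layers: int, t_layer: float, n_samples: int, n_parallel: int) -> float:
    """Wall-clock time to run n_samples independent trajectories when at most
    n_parallel of them can run at once (assumed idealized parallelism)."""
    waves = math.ceil(n_samples / n_parallel)
    return waves * depth_latency(n_layers, t_layer)


if __name__ == "__main__":
    t_layer = 1.0  # arbitrary time unit per layer
    print("48 layers, 1 sample:  ", depth_latency(48, t_layer))
    print("24 layers, 8 samples: ", width_latency(24, t_layer, 8, 8))
    print("24 layers, 16 samples:", width_latency(24, t_layer, 16, 8))
```

In this toy model, doubling depth always doubles latency, whereas adding parallel samples is free until the parallel capacity is exhausted. Real systems will deviate from both assumptions.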

That second framing, trading inference compute for parameters, is its own escape hatch. Smaller models given more test-time compute can match much larger ones on hard prompts, which means pretraining size and inference budget are not independent resources (see: Can inference compute replace scaling up model size?). And once you start treating architecture as a tunable variable rather than a fixed shape, you can fold things like hidden size, the MLP-to-attention ratio, and the GQA configuration directly into scaling laws. One such effort reported up to 42% higher throughput *and* up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget (see: Can architecture choices improve inference efficiency without sacrificing accuracy?). The depth-beats-width finding is really one instance of a larger shift: at small scale, the *shape* of the network, not just its parameter count, is the lever.

There's one quieter reason small models are worth this attention at all. For synthetic data generation, models around the 500M mark produce *more* unique, diverse outputs per sample than bigger ones, because larger models concentrate probability mass on their preferred answers (see: Why aren't bigger models better for generating diverse outputs?). So the sub-billion regime isn't just a constrained version of the big-model game. It has its own physics, where the advantage of deep-and-thin architectures and the surprising output diversity point to the same lesson: when parameters are scarce, how you arrange them matters more than how many you have.

Sources (7 notes)

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.
