Does depth matter more than width for tiny language models?

Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.

Synthesis note · 2026-05-03 · sourced from Mobile

Kaplan et al.'s scaling laws are usually read as saying that loss depends mainly on total parameter count and only weakly on model shape, which makes the depth-to-width ratio a secondary choice at a fixed budget. MobileLLM demonstrates that this guidance breaks at the sub-billion-parameter scale relevant for on-device deployment. There, a deep-and-thin structure outperforms balanced or wide-and-shallow alternatives of the same size, and MobileLLM, which pairs that structure with embedding sharing and grouped-query attention, reports accuracy boosts of 2.7 percent and 4.3 percent over the preceding 125M and 350M state-of-the-art models respectively. The reason offered is that depth captures abstract concepts by composing simpler features into hierarchical representations through more layers. At small scale the model has fewer raw parameters to spend, so making each one work harder through compositional depth pays back more than spreading them across wider layers.
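To make the fixed-budget trade-off concrete, here is a minimal sketch that counts approximate non-embedding parameters for a standard transformer block and shows how much width has to shrink as depth grows. It assumes a plain block with four attention projections and a feed-forward layer of 4x the hidden size, ignoring biases, norms, and embeddings. The function names, the 85M budget, and the layer counts are hypothetical illustrations, not configurations taken from MobileLLM.

```python
def non_embedding_params(n_layers: int, d_model: int, ffn_mult: int = 4) -> int:
    """Approximate block parameters (assumed: no biases, norms, or embeddings)."""
    attention = 4 * d_model * d_model                # Q, K, V, and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model  # up and down projections
    return n_layers * (attention + feed_forward)


def width_for_budget(budget: int, n_layers: int, ffn_mult: int = 4,
                     multiple_of: int = 64) -> int:
    """Largest hidden size (rounded down to a multiple) that fits the budget."""
    per_layer_coeff = 4 + 2 * ffn_mult               # parameters per layer / d_model^2
    d_model = (budget / (n_layers * per_layer_coeff)) ** 0.5
    return max(multiple_of, int(d_model // multiple_of) * multiple_of)


BUDGET = 85_000_000  # hypothetical non-embedding budget

for n_layers in (6, 12, 30):  # wide-and-shallow, balanced, deep-and-thin
    d_model = width_for_budget(BUDGET, n_layers)
    used = non_embedding_params(n_layers, d_model)
    print(f"layers={n_layers:>2}  d_model={d_model:>4}  params={used:,}")
```

Under this approximation parameters grow linearly with depth but quadratically with width, so multiplying depth by 2.5 costs only about a third of the hidden size. That is the exchange the deep-and-thin recipe is making.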

This matters because it shows that scaling laws are regime-dependent rather than universal. The Kaplan results were derived from larger models where width and depth are both abundant; at the small scale where mobile deployment lives, the trade-offs reverse. The implication is that the architectural recipe for on-device LLMs is genuinely different from the recipe for cloud-scale LLMs: not just smaller, but structurally different. The related note "Can architecture choices improve inference efficiency without sacrificing accuracy?" makes the same point at the inference-economics layer: vanilla scaling laws say nothing about deployment regimes.

The deeper lesson is methodological: scaling laws should always be qualified by the regime in which they were derived, and recommendations for sub-billion-parameter design should not be extrapolated downward from billion-plus-parameter studies. The right architecture for a 350M parameter model is not a scaled-down version of a 70B parameter model; it is a deep-and-thin model derived from the constraints of the small-scale regime. The related note "Can parallel architectures solve inherently sequential problems?" gives a complementary reason to favor depth: some computations require sequential composition that width cannot supply at any scale.

Related concepts in this collection 4

What actually limits language models on mobile phones? Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
extends: same MobileLLM source; this note answers WHY sub-billion is the regime, depth-vs-width answers HOW to design within it
Does recomputing weights cost less than moving them on mobile? Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
extends: same MobileLLM paper; depth wins partly because depth-with-shared-weights can be deeper than depth-with-distinct-weights at fixed parameter count; the two design moves compound
Can architecture choices improve inference efficiency without sacrificing accuracy? Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
extends: both reject regime-blind scaling laws; this note shows depth-width trade-offs flip in the small regime; conditional scaling laws formalize how architecture variables modulate the law
Can parallel architectures solve inherently sequential problems? Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
extends: gives a theoretical reason to prefer depth (serial composition) over width (parallel breadth) for capability-bounded models
