Why do scaling laws fail to predict optimal architectures at small parameter counts?

This explores why classic scaling laws — which predict performance from raw parameter count — break down at small model sizes, where *how* you arrange those parameters starts to matter more than how many you have.

The standard scaling-law story, in which loss falls predictably as you add parameters and data, stops being a good guide once models get small, and the corpus points at a single root cause: classic laws treat parameters as fungible, but at small scale their *arrangement* dominates. The clearest direct evidence is "Does depth matter more than width for tiny language models?", where deep-and-thin networks beat balanced ones by 2.7–4.3% in accuracy at the 125M–350M range. Kaplan-style laws treat model shape as a weak, second-order effect at a fixed parameter count, yet here shape is decisive, because deep stacks let the model *compose* abstract concepts through layers rather than smear capacity across width. When you only have a few hundred million parameters, that compositional structure is the whole game; at billions, the difference washes out, which is exactly why the laws look 'true' at large scale and fail at small.
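To make the fungibility assumption concrete, here is a minimal sketch. It uses the common rule of thumb that a standard transformer has roughly 12 · n_layer · d_model² non-embedding parameters (attention projections plus an MLP four times as wide as the hidden size). The two shapes are hypothetical and chosen only so the counts match; they are not the configurations from the note above.

```python
# Rule-of-thumb count of non-embedding parameters in a standard transformer:
# 4*d^2 for the attention projections plus 8*d^2 for an MLP with d_ff = 4*d.
def non_embedding_params(n_layer: int, d_model: int) -> int:
    return 12 * n_layer * d_model ** 2

# Hypothetical shapes, picked so the parameter counts are identical.
deep_thin = {"n_layer": 32, "d_model": 512}
shallow_wide = {"n_layer": 8, "d_model": 1024}

n_deep = non_embedding_params(**deep_thin)
n_wide = non_embedding_params(**shallow_wide)

# Both come to about 100.7M parameters.
print(f"deep-and-thin:    {n_deep:,}")
print(f"shallow-and-wide: {n_wide:,}")
print("same N:", n_deep == n_wide)
```

A law whose only model-side input is N receives the same number for both shapes, so it has to predict the same loss for both. Any measured gap between them is invisible to it.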

The deeper issue is that standard scaling laws bake in no architectural variables. "Can architecture choices improve inference efficiency without sacrificing accuracy?" makes this concrete from the other direction: once you *add* hidden size, MLP-to-attention ratio, and grouped-query attention (GQA) configuration to the law, you can predict and optimize architecture, gaining up to 42% in throughput and up to 2.1% in accuracy over LLaMA-3.2 under the same training budget. The implication is that the original laws weren't wrong so much as blind; they marginalized away the very knobs that decide which architecture is optimal at a given size.
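As a rough illustration of what those knobs control, the sketch below counts per-block weights for a given hidden size, GQA setting, and MLP width. The class name `BlockShape`, the two-matrix MLP, and the definition of the ratio as MLP weights over attention weights are all assumptions made for illustration. The cited work may define its variables differently, and LLaMA-style models use a three-matrix gated MLP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockShape:
    """Hypothetical description of one transformer block (biases ignored)."""
    d_model: int     # hidden size
    n_heads: int     # number of query heads
    n_kv_heads: int  # number of key/value heads; GQA when < n_heads
    d_ff: int        # inner width of the MLP

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    def attention_params(self) -> int:
        q_and_o = 2 * self.d_model * self.d_model
        k_and_v = 2 * self.d_model * self.n_kv_heads * self.head_dim
        return q_and_o + k_and_v

    def mlp_params(self) -> int:
        # Assumes a plain up-projection plus down-projection MLP.
        return 2 * self.d_model * self.d_ff

    def mlp_to_attention_ratio(self) -> float:
        return self.mlp_params() / self.attention_params()


mha = BlockShape(d_model=1024, n_heads=16, n_kv_heads=16, d_ff=4096)
gqa = BlockShape(d_model=1024, n_heads=16, n_kv_heads=4, d_ff=4096)

print(mha.attention_params(), mha.mlp_to_attention_ratio())  # 4194304 2.0
print(gqa.attention_params(), gqa.mlp_to_attention_ratio())  # 2621440 3.2
```

Switching from 16 to 4 key/value heads changes both the total weight count and how it is split between attention and MLP. A law that only sees total parameters cannot express that split.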

Small models also escape the assumptions the laws are built on. "Can recurrent hierarchies achieve reasoning that transformers cannot?" shows a 27M-parameter model solving Sudoku and mazes that defeat much larger chain-of-thought systems, by using recurrence to break past the fixed-depth complexity ceiling that constrains ordinary transformers. A scaling law fit to fixed-depth transformers simply has no term for 'effective computational depth from recurrence', so it can't see why a tiny recurrent design outperforms a bigger conventional one. The same lesson shows up in "Can reasoning systems scale wider instead of only deeper?" and in the broader claim of "Has memory architecture replaced parameter count as the scaling frontier?", where returns increasingly come from restructuring memory and computation rather than counting parameters.
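The gap between parameter count and effective depth can be shown with a toy weight-tied layer. This is only a sketch of the general idea of recurrence, not the Hierarchical Reasoning Model itself; the function names, the tanh layer, and the sizes are assumptions chosen for brevity.

```python
import math
import random


def apply_layer(h, weight):
    """One tanh layer: out[j] = tanh(sum_i h[i] * weight[i][j])."""
    d = len(h)
    return [math.tanh(sum(h[i] * weight[i][j] for i in range(d)))
            for j in range(d)]


def run_weight_tied(x, weight, n_steps):
    """Reuse the same weights for n_steps sequential applications."""
    h = list(x)
    for _ in range(n_steps):
        h = apply_layer(h, weight)
    return h


random.seed(0)
d = 16
weight = [[random.gauss(0.0, d ** -0.5) for _ in range(d)] for _ in range(d)]
x = [random.gauss(0.0, 1.0) for _ in range(d)]

param_count = d * d                               # fixed by the weight matrix
shallow = run_weight_tied(x, weight, n_steps=1)   # 1 sequential step
deep = run_weight_tied(x, weight, n_steps=8)      # 8 sequential steps

print("parameters:", param_count)
print("outputs differ:", shallow != deep)
```

Both runs use the same 256 weights, but the second performs eight sequential transformations instead of one. A law indexed only by parameter count assigns them the same capacity.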

There's a subtler boundary worth knowing: scaling laws sometimes *do* work, namely when the task space is well covered. "Can neural networks learn compositional skills without symbolic mechanisms?" finds plain MLPs generalize compositionally through scale alone, no architectural tricks needed, *as long as training covers the combinations*. That's the flip side of the small-parameter failure: at large scale with enough data, architecture stops mattering and the law holds; at small scale with sparse coverage, architecture is the only lever you have, and the law goes silent. Two more notes sharpen the picture. "Can inference compute replace scaling up model size?" shows parameters and inference compute aren't independent axes (so a one-dimensional parameter law was always incomplete), and "Do larger language models solve constrained optimization better?" shows some ceilings don't move with scale *or* architecture at all.
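The coupling between parameters and inference compute can be made tangible with back-of-the-envelope arithmetic. The sketch assumes the usual approximation of about 2N FLOPs per generated token for a forward pass (ignoring the context-length-dependent attention term) and equal output lengths. The model sizes are hypothetical.

```python
def forward_flops_per_token(n_params: int) -> int:
    # Common approximation: one multiply and one add per weight per token.
    return 2 * n_params


def samples_at_equal_compute(n_small: int, n_large: int) -> int:
    """How many full generations from the small model cost the same
    as one generation of equal length from the large model."""
    return forward_flops_per_token(n_large) // forward_flops_per_token(n_small)


small, large = 350_000_000, 7_000_000_000  # hypothetical sizes
budget = samples_at_equal_compute(small, large)
print(f"{budget} small-model samples per large-model pass")  # 20
```

This only equalizes cost. Whether 20 samples from the small model match the large model's accuracy depends on the task and on how the samples are selected, which is the empirical question the note addresses.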

The thing you didn't know you wanted to know: scaling laws don't 'fail' at small parameter counts so much as reveal what they quietly assumed — that architecture is noise. At small scale, architecture is signal, and the laws were never measuring it.

Sources (8 notes)

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Has memory architecture replaced parameter count as the scaling frontier?

Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing claims about small-parameter model design. This question remains open: Why do standard scaling laws fail to predict optimal architectures below ~1B parameters?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026 and include:
• Depth beats width by 2.7–4.3% in the 125M–350M range, contradicting parameter-fungibility assumptions (2024–2025).
• Adding architectural variables (hidden size, MLP-to-attention ratio, grouped-query config) into scaling laws yields up to 42% higher throughput and up to 2.1% higher accuracy under the same training budget; standard laws ignore these knobs (2024–2025).
• Recurrence and effective computational depth can solve hard reasoning tasks with 27M parameters, defeating much larger fixed-depth transformers that scaling laws were fit to (2025–2026).
• Memory and computation restructuring, not parameter counting alone, drives returns at small scale (2025–2026).
• At large scale with dense data coverage, architecture becomes noise and laws hold; at small scale, architecture dominates (2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 (Feb 2024): MobileLLM sub-billion optimization
• arXiv:2510.18245 (Oct 2025): Scaling Laws Meet Model Architecture
• arXiv:2502.05171 (Feb 2025): Test-Time Compute and Latent Reasoning Depth
• arXiv:2507.07207 (Jul 2025): Compositional Generalization via Scaling

Your task:
(1) RE-TEST EACH CONSTRAINT. For depth-vs.-width, architectural variables, recurrence, and data-coverage coupling: has post-Sep-2026 work on inference-time scaling, mixture-of-experts, or adaptive architecture search RELAXED or OVERTURNED these findings? Does the boundary between 'large scale (laws hold)' and 'small scale (architecture matters)' still hold, or has it moved? Cite what moved it, and flag where constraints appear durable.
(2) Surface the strongest work from the last ~6 months that CONTRADICTS the claim that small-scale failures stem from ignoring architecture. Does any recent paper show small models fail for wholly different reasons—e.g., training regime, tokenization, task mismatch—that *don't* require architectural redesign?
(3) Propose 2 research questions that assume the regime has shifted: (a) If test-time compute now substitutes for architecture at small scale, does the 'architecture is signal' thesis weaken? (b) Do autoregressive vs. non-autoregressive or hybrid decoding modes flip the depth-width tradeoff?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
