Which architectural choices matter most when a model must fit one billion parameters?

This note explores which design decisions actually move the needle when you're building a small model (roughly a billion parameters or under) rather than scaling up freely, which is exactly the regime forced by phones and other tight hardware budgets.

The first thing the corpus does is reframe the question: at this size you're usually not choosing to be small, you're forced to be. Smartphone DRAM and battery budgets make sub-billion-parameter models the only sustainable option: a 7B model drains a 50 kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same phone (What actually limits language models on mobile phones?). So the real question becomes: given a fixed, small parameter budget, where do you spend it?
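The battery figures above can be reproduced with a back-of-the-envelope calculation. This is a minimal sketch: the energy cost of roughly 0.1 J per token per billion parameters and the decoding rate of 10 tokens/s are assumptions that do not appear in this note, and `hours_of_decoding` is a hypothetical helper name.

```python
# Rough battery-life estimate for continuous on-device decoding.
# ASSUMPTIONS (not stated in the note): ~0.1 J per token per billion
# parameters, and a steady decoding rate of 10 tokens per second.
JOULES_PER_TOKEN_PER_B_PARAMS = 0.1
TOKENS_PER_SECOND = 10.0


def hours_of_decoding(params_billions: float, battery_joules: float = 50_000.0) -> float:
    """Hours of continuous decoding before the battery is empty."""
    joules_per_token = JOULES_PER_TOKEN_PER_B_PARAMS * params_billions
    watts = joules_per_token * TOKENS_PER_SECOND  # J/token * tokens/s = J/s
    return battery_joules / watts / 3600.0


for name, size in [("7B", 7.0), ("350M", 0.35)]:
    print(f"{name}: {hours_of_decoding(size):.1f} h")
```

Under these assumptions the 7B model lands just under two hours and the 350M model at well over a day. Energy scales linearly with parameter count in this sketch, so the 20x smaller model lasts 20x longer.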

The sharpest single answer is shape over size. MobileLLM found that at the 125M–350M scale, deep-and-thin networks beat balanced width-vs-depth designs by 2.7–4.3% in accuracy, because stacking more layers lets the model compose abstract concepts through depth rather than spreading the same parameters thinly across width (Does depth matter more than width for tiny language models?). Notably, this contradicts the classic Kaplan scaling laws, which treated depth and width as roughly interchangeable. The lesson: scaling-law intuitions calibrated on huge models don't transfer down, and below a billion parameters how you arrange the parameters matters as much as how many you have.
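To make "same budget, different shape" concrete, here is a rough parameter count for a decoder-only transformer. It is a sketch under stated assumptions: standard multi-head attention, a 4x feed-forward expansion, tied embeddings, a 32,000-token vocabulary, and no biases or norm parameters. Both configurations are hypothetical illustrations, not the ones MobileLLM evaluated.

```python
def transformer_params(
    n_layers: int, d_model: int, vocab_size: int = 32_000, ffn_mult: int = 4
) -> int:
    """Approximate parameter count (biases and layer norms ignored)."""
    attention = 4 * d_model * d_model                  # Q, K, V and output projections
    feed_forward = 2 * d_model * (ffn_mult * d_model)  # up and down projections
    embeddings = vocab_size * d_model                  # tied input/output embedding
    return n_layers * (attention + feed_forward) + embeddings


# Two hypothetical shapes that spend nearly the same budget.
balanced = transformer_params(n_layers=12, d_model=768)
deep_thin = transformer_params(n_layers=30, d_model=512)

print(f"balanced  (12 x 768): {balanced / 1e6:.1f}M parameters")
print(f"deep-thin (30 x 512): {deep_thin / 1e6:.1f}M parameters")
```

The two shapes differ by about 1% in total parameters, yet the second has 2.5 times as many layers. Per-layer cost grows with the square of the width, so narrowing the model frees budget for depth quickly.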

The more surprising move is to stop treating the parameter count as the whole budget at all. Inference-time compute trades off against parameters: a smaller model given more thinking time at inference can match a larger one, especially on hard prompts (Can inference compute replace scaling up model size?). Architecturally, that means a billion-parameter model is a better bet if you design it to lean on test-time compute rather than raw capacity. Two related patterns push the same way. Freezing a pretrained backbone and bolting on a small auxiliary module preserves the backbone's pretrained knowledge while adding new reasoning ability without retraining the whole thing (Can continuous reasoning avoid forgetting in instruction-tuned models?). And splitting a monolith into a separate planner and solver outperforms one undifferentiated model, with the decomposition skill even transferring across domains (Does separating planning from execution improve reasoning accuracy?).
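One generic way to spend inference compute is best-of-N sampling: draw several answers and keep the one a scorer prefers. The sketch below is an idealized illustration, not necessarily the method used in the cited work. It assumes independent samples and a perfect verifier, and the accuracy numbers are hypothetical.

```python
def best_of_n_success(p_single: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples is correct.

    ASSUMPTION: a perfect verifier always picks a correct sample when
    one exists, so this is an upper bound on real best-of-N accuracy.
    """
    return 1.0 - (1.0 - p_single) ** n_samples


def samples_needed(p_small: float, p_target: float) -> int:
    """Smallest n at which the small model reaches the target accuracy."""
    if not 0.0 < p_small <= 1.0 or not 0.0 <= p_target < 1.0:
        raise ValueError("need 0 < p_small <= 1 and 0 <= p_target < 1")
    n = 1
    while best_of_n_success(p_small, n) < p_target:
        n += 1
    return n


# Hypothetical numbers: a small model right 20% of the time per sample,
# and a larger model right 60% of the time in a single shot.
n = samples_needed(p_small=0.2, p_target=0.6)
print(f"samples needed: {n}")
print(f"best-of-{n} accuracy: {best_of_n_success(0.2, n):.3f}")
```

The trade is only worthwhile when those extra samples from the small model cost less than one pass of the larger model, and when a usable verifier exists. Imperfect scoring pulls real accuracy below this bound.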

There's also a counterintuitive case for staying small on purpose. For generating diverse outputs (say, synthetic training data), models around 500M parameters produce more unique samples per draw than larger ones, because bigger models concentrate probability mass on their favorite answers and collapse variety (Why aren't bigger models better for generating diverse outputs?). And at the system level, routing queries across several small specialized models can beat a single frontier model: ten 7B models with smart routing surpassed GPT-4.1, suggesting selection is a stronger lever than scale (Can routing beat building one better model?).
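The diversity argument follows from simple probability. If a model emits output i with probability p_i, the expected number of distinct outputs in n independent draws is the sum over i of 1 - (1 - p_i)^n. The sketch below compares a flat and a peaked distribution over ten possible outputs. Both distributions are hypothetical toys, not measurements from any model.

```python
import math


def expected_unique(probs: list[float], n_draws: int) -> float:
    """Expected number of distinct outputs seen in n independent draws."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Output i is seen at least once with probability 1 - (1 - p_i)^n.
    return sum(1.0 - (1.0 - p) ** n_draws for p in probs)


# Hypothetical output distributions over ten candidate answers.
flat = [0.1] * 10              # mass spread evenly
peaked = [0.91] + [0.01] * 9   # mass concentrated on one favorite

for name, dist in [("flat", flat), ("peaked", peaked)]:
    print(f"{name}: {expected_unique(dist, n_draws=10):.2f} unique outputs in 10 draws")
```

With the same sampling budget, the peaked distribution keeps returning its favorite answer and yields far fewer distinct outputs, which is the collapse the paragraph above attributes to larger models.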

The thread tying these together: once parameters are scarce, the highest-leverage choices stop being "add capacity" and become structural. That means depth over width, frozen-plus-auxiliary over end-to-end retraining, planner-solver separation over monoliths, and inference compute or routing over a single bigger network. The less obvious takeaway: for a model this size, the parameter count is one of the *least* informative numbers about how well it will perform.

Sources 7 notes

What actually limits language models on mobile phones?

Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing claims about sub-billion-parameter model design. The question remains open: which architectural choices matter most when a model must fit one billion parameters?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Mar 2026. Key constraints reported:
• Deep-and-thin networks beat balanced designs by 2.7–4.3% at 125M–350M scale, contradicting Kaplan scaling laws (~2024-02).
• Test-time compute can substitute for parameter scaling on hard prompts; smaller models + inference budget match larger ones (~2024).
• Frozen backbone + auxiliary module preserves knowledge and adds reasoning without full retraining (~2025-02).
• Models ~500M produce more unique outputs than larger ones; diversity peaks at smaller scales (~2024).
• Routed ensembles of ten 7B models surpassed GPT-4.1, suggesting selection > scale (~2025-08).

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 (MobileLLM, Feb 2024)
• arXiv:2502.12134 (SoftCoT, Feb 2025)
• arXiv:2508.12631 (Beyond GPT-5 routing, Aug 2025)
• arXiv:2603.23004 (LLM reasoning under constraints, Mar 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, GPT-4o, Claude-4), distillation/quantization methods (QLoRA, bitwise tricks), inference stacks (vLLM, SGLang), multi-agent orchestration, or evals have since RELAXED or OVERTURNED it. Separate the durable question ("which structural choices matter at small scale?") from perishable limitations ("depth beats width", "routing beats monoliths"). Cite what resolved each, and flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially anything claiming width-over-depth, monolith-over-routing, or that parameter count IS the bottleneck.
(3) Propose 2 research questions that ASSUME the regime may have shifted (e.g., "Does test-time compute obsolete the depth-vs-width choice?" or "Do foundation-model quantization breakthroughs change the small-model calculus?").

Cite arXiv IDs; flag anything you cannot ground in a real paper.
