What mobile hardware constraints force the sub-billion parameter regime?
This explores the actual physical limits — memory and power — that make sub-billion-parameter models the only practical option on a phone, rather than a quality compromise developers settle for.
The corpus is blunt about it: the constraints are DRAM budget and battery capacity, not a preference for smaller, weaker models. A 7-billion-parameter model drains a typical 50 kJ phone battery in under two hours, while a 350M model can run conversational AI for a full day on the same hardware (see: What actually limits language models on mobile phones?). Once you frame it as energy-per-token rather than accuracy-per-parameter, the sub-billion ceiling stops looking arbitrary and starts looking like physics.
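The battery arithmetic can be sketched in a few lines. Only the 50 kJ battery and the two model sizes come from the notes; the energy cost of 0.1 J per token per billion parameters and the decoding rate of 10 tokens per second are assumptions chosen for illustration, and the model ignores every other power draw on the phone.

```python
# Assumed figures (not stated in the notes): energy per token per billion
# parameters, and a steady decoding rate.
JOULES_PER_TOKEN_PER_BILLION_PARAMS = 0.1  # assumption
TOKENS_PER_SECOND = 10                     # assumption
BATTERY_JOULES = 50_000                    # the 50 kJ battery from the notes


def hours_of_decoding(params_billions: float) -> float:
    """Hours of continuous generation before the battery is empty."""
    joules_per_token = JOULES_PER_TOKEN_PER_BILLION_PARAMS * params_billions
    watts = joules_per_token * TOKENS_PER_SECOND
    return BATTERY_JOULES / watts / 3600


for name, size in [("7B", 7.0), ("350M", 0.35)]:
    print(f"{name}: {hours_of_decoding(size):.1f} h of continuous decoding")
```

Under these assumptions the 7B model lasts just under two hours of continuous decoding and the 350M model lasts roughly 40 hours. Real conversational use is intermittent, so the "full day" figure is a statement about practical use rather than nonstop generation.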
What's interesting is that the same memory wall reshapes how the model should be built, not just how big it is. On a phone the bottleneck is moving weights through memory, not computing with them, so MobileLLM found that running one transformer block twice costs less latency than fetching a second block's weights from memory, buying accuracy with zero extra parameters (see: Does recomputing weights cost less than moving them on mobile?). The hardware constraint flips an intuition: compute is the cheap resource on-device, memory movement is the expensive one.
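A toy latency model makes the trade-off concrete. The per-block fetch and compute costs below are invented numbers for a memory-bound device, not measurements from MobileLLM, and the function name is hypothetical. The only structural assumption is that a block's weights are fetched once and then reused from fast memory for every pass through that block.

```python
def forward_latency_us(unique_blocks: int, passes_per_block: int,
                       fetch_us: int, compute_us: int) -> int:
    """Toy model: each block's weights are fetched once, then reused from
    fast memory for every pass through that block."""
    fetch_total = unique_blocks * fetch_us
    compute_total = unique_blocks * passes_per_block * compute_us
    return fetch_total + compute_total


# Hypothetical per-block costs in microseconds (not measured).
FETCH_US, COMPUTE_US = 1000, 200

baseline = forward_latency_us(30, 1, FETCH_US, COMPUTE_US)  # 30 blocks, one pass each
shared = forward_latency_us(30, 2, FETCH_US, COMPUTE_US)    # 30 blocks, each run twice
unshared = forward_latency_us(60, 1, FETCH_US, COMPUTE_US)  # 60 distinct blocks

print(baseline, shared, unshared)
```

With these made-up costs, doubling the effective depth by reusing blocks adds far less latency than doubling it with distinct blocks, and it adds no parameters. The size of the gap depends entirely on the assumed ratio of fetch cost to compute cost.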
That same pressure overturns a piece of scaling orthodoxy. At the 125M–350M scale, going deep-and-thin beats spreading the same parameter budget across width, yielding 2.7–4.3% accuracy gains by composing abstract concepts layer by layer (see: Does depth matter more than width for tiny language models?). The Kaplan scaling laws that hold for datacenter models, which treat model shape as secondary to total parameter count, don't govern the phone regime: when parameters are capped by DRAM, you spend them differently.
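To see what "the same parameter budget" means in practice, the sketch below compares a wide-shallow and a deep-thin shape using the common approximation of 12 · d_model² non-embedding parameters per transformer layer. The two configurations are illustrative assumptions, not MobileLLM's actual layer counts or dimensions, and the approximation ignores embeddings and architectural variants such as grouped-query attention.

```python
def block_params(d_model: int, n_layers: int) -> int:
    """Approximate non-embedding parameter count: 12 * d_model^2 per layer
    (4*d^2 for attention projections, 8*d^2 for a 4x-wide feed-forward)."""
    return 12 * d_model * d_model * n_layers


# Two hypothetical shapes with roughly the same budget.
wide_shallow = block_params(d_model=768, n_layers=12)
deep_thin = block_params(d_model=480, n_layers=30)

print(f"wide-shallow: {wide_shallow / 1e6:.1f}M parameters")
print(f"deep-thin:    {deep_thin / 1e6:.1f}M parameters")
```

Both shapes land within a few percent of each other in parameter count, so any accuracy difference between them comes from how the budget is arranged rather than how large it is.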
The deeper lesson the corpus offers is that 'just make the model smaller' isn't the only escape hatch. You can move intelligence off the parameter axis entirely: spend more compute at inference time instead of more weights, which lets small models match larger ones on hard prompts (see: Can inference compute replace scaling up model size?). Or keep the small model on-device and route only the genuinely hard queries to a bigger model elsewhere, cutting cost 40–50% by predicting query difficulty before generation (see: Can routers select the right model before generation happens?). The mobile constraint, read this way, isn't just a size limit. It is a forcing function pushing capability out of raw parameter count and into architecture, inference-time compute, and where computation physically happens.
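The routing idea can be sketched as a decision made before any token is generated. Everything here is hypothetical: the function names, the model labels, the threshold, and especially `toy_difficulty`, a word-count heuristic standing in for the learned difficulty predictor that a real router would use.

```python
from typing import Callable


def route(query: str, difficulty: Callable[[str], float],
          threshold: float = 0.5) -> str:
    """Pick a model before generation, based only on predicted difficulty."""
    if difficulty(query) >= threshold:
        return "remote_large"
    return "on_device_small"


def toy_difficulty(query: str) -> float:
    """Hypothetical stand-in for a learned predictor: longer queries score
    higher, capped at 1.0."""
    return min(len(query.split()) / 20, 1.0)


print(route("what time is it", toy_difficulty))
print(route(" ".join(["step"] * 25), toy_difficulty))
```

Because the choice depends only on the query, exactly one model runs per request. The threshold sets the trade-off: raising it keeps more traffic on-device at the risk of weaker answers on hard queries.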
Sources (5 notes)
Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50 kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.
MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by computing one block twice incurs less latency than fetching separate weights, gaining accuracy with no parameter increase.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
RouteLLM and Hybrid-LLM both achieve 40–50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.