Why does recomputing weights cost less than moving them on phones?
This explores why, on a phone, it's cheaper to run a neural network block twice than to load a fresh set of weights for the second block — and what that reveals about where mobile AI actually spends its budget.
This explores why recomputing weights can beat moving them on a phone — a result that only makes sense once you see that mobile hardware is bottlenecked by memory traffic, not by arithmetic. MobileLLM's block-wise weight sharing is the direct answer: by reusing one transformer block's weights and simply running that block twice, the model skips the costly step of fetching a separate set of weights from memory. The math of the extra pass is nearly free; the data movement it avoids is what was expensive. The result is a small accuracy gain with zero increase in parameter count Does recomputing weights cost less than moving them on mobile?.
The reason this trade works is that phones are memory- and battery-bound in ways that desktops and servers are not. DRAM budgets and battery capacity, not a preference for smaller or worse models, are what force mobile models below a billion parameters — a 7B model can drain a 50kJ battery in under two hours, while a 350M model runs conversational AI all day What actually limits language models on mobile phones?. When energy and memory bandwidth are the scarce resource, every weight you fetch costs more than the compute you'd spend regenerating its effect. That inverts the usual server intuition where compute is the thing you ration.
What's quietly interesting here is that this is one instance of a broader pattern: on constrained hardware, the architecture itself — not just the parameter count — is a lever you can tune for efficiency. Folding architectural variables like hidden size, MLP-to-attention ratio, and attention grouping into scaling laws lets you optimize specifically for inference, yielding large throughput gains while improving accuracy Can architecture choices improve inference efficiency without sacrificing accuracy?. Block sharing is a hand-designed version of the same move: spend the cheap resource (compute) to conserve the expensive one (memory traffic).
The deeper takeaway is that "cost" in machine learning is never absolute — it's set by whichever resource is scarcest on your hardware. The same logic shows up where sparsity lets you train bigger models within a fixed compute budget rather than paying quality for speed Does sparse attention trade off quality for speed?, and where persistent agents shift the meaningful cost denominator from tokens to finished artifacts once context can be cached and reused Do persistent agents really cost less per token?. Recomputing-over-moving on a phone is the same insight read through the lens of memory bandwidth: optimize for the bottleneck you actually have, not the one the textbook assumes.
Sources 5 notes
MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.
Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.
A 115-day case study found 82.9% of tokens were cache reads. When context persists and reuses, the meaningful cost denominator becomes completed artifacts, not individual tokens.