Does recomputing weights cost less than moving them on mobile?
Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
On mobile hardware, the latency bottleneck for transformer inference is often memory movement rather than arithmetic: fetching weights from DRAM into the compute units takes longer than the computation itself. MobileLLM exploits this asymmetry with immediate block-wise weight sharing. Rather than storing two adjacent transformer blocks with separate weights, it stores one block's weights and applies that block twice in sequence. The model gains the depth of two blocks while the weight footprint stays that of one, and because the weights are still resident from the first pass, the second pass reuses them and avoids a second fetch.
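The mechanism can be sketched with a toy model. This is a minimal illustration and not MobileLLM's implementation: `ToyBlock`, `apply_block`, and the fetch counter are hypothetical names, and the "block" is a simple residual update standing in for attention plus feed-forward. The only point is to count how often weights are loaded for the same number of layer applications.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a transformer block: weights plus a fetch counter."""
    weights: list
    fetches: int = 0

    def load(self):
        # Models the expensive DRAM-to-compute transfer of this block's weights.
        self.fetches += 1
        return self.weights


def apply_block(weights, x):
    # Toy residual update standing in for attention + feed-forward.
    return [xi + math.tanh(wi * xi) for wi, xi in zip(weights, x)]


def forward_unshared(blocks, x):
    # Baseline: every layer has its own weights, so every layer pays one fetch.
    for block in blocks:
        x = apply_block(block.load(), x)
    return x


def forward_shared(blocks, x, repeats=2):
    # Immediate block-wise sharing: fetch once, then apply the block
    # `repeats` times back to back while the weights are still resident.
    for block in blocks:
        weights = block.load()
        for _ in range(repeats):
            x = apply_block(weights, x)
    return x


x0 = [0.5, -1.0, 2.0]
unshared = [ToyBlock([0.1 * (i + 1)] * 3) for i in range(4)]  # 4 layers, 4 weight sets
shared = [ToyBlock([0.1 * (i + 1)] * 3) for i in range(2)]    # 4 layers, 2 weight sets

y_unshared = forward_unshared(unshared, x0)
y_shared = forward_shared(shared, x0)

unshared_fetches = sum(b.fetches for b in unshared)
shared_fetches = sum(b.fetches for b in shared)
print(unshared_fetches, shared_fetches)  # prints: 4 2
```

Both variants run four layer applications, but the shared variant stores and fetches only two weight sets. The second application of a block also sees a different input than the first, so its output differs even though the parameters are identical.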
The latency overhead is minimal because the second pass adds only compute, which memory-bound hardware has to spare, and it needs no additional weight fetch. Crucially, this approach produces accuracy gains with no increase in model size. The shared block contributes representational capacity comparable to two distinct blocks because the second application operates on the output of the first, so the two passes compute functionally different transformations even with shared parameters. This differs from sharing schemes that tie weights across non-adjacent layers and lose more capacity.
The general principle is hardware-shaped architecture design. On compute-bound systems the optimization target is FLOP efficiency; on memory-bound systems it is memory-movement efficiency, and these can favor opposite architectural choices. Block-wise weight sharing makes sense on phones precisely because it spends compute to save memory bandwidth, and compute is exactly the resource that is abundant relative to memory bandwidth on mobile silicon. The same model on a different hardware target might benefit from the opposite trade. The related note "Can architecture choices improve inference efficiency without sacrificing accuracy?" formalizes this regime dependence: inference-cost-aware scaling laws make architectural choices like weight sharing first-class variables.
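A back-of-envelope estimate makes the regime dependence concrete. The sketch below assumes a roofline-style model in which compute and weight transfer overlap perfectly, so a block's latency is the slower of the two, and it assumes the shared block's weights stay resident between passes. Every number in it (FLOPs per pass, bytes per block, device peak throughput and bandwidth) is hypothetical and chosen only to separate the two regimes; none comes from MobileLLM or from a real device.

```python
def latency(flops, weight_bytes, peak_flops, bandwidth):
    """Roofline-style estimate in seconds: the slower of compute and weight transfer."""
    return max(flops / peak_flops, weight_bytes / bandwidth)


def compare(peak_flops, bandwidth, block_flops=2e8, block_bytes=1e8):
    # All default sizes are hypothetical, chosen only to make the regimes visible.
    one_pass = latency(block_flops, block_bytes, peak_flops, bandwidth)
    # Shared block applied twice: double the compute, one weight fetch.
    shared_twice = latency(2 * block_flops, block_bytes, peak_flops, bandwidth)
    # Two distinct blocks: double the compute, two weight fetches.
    two_blocks = latency(2 * block_flops, 2 * block_bytes, peak_flops, bandwidth)
    return one_pass, shared_twice, two_blocks


# Hypothetical memory-bound device: plenty of compute, slow DRAM.
memory_bound = compare(peak_flops=1e12, bandwidth=1e10)
# Hypothetical compute-bound device: fast memory, limited compute.
compute_bound = compare(peak_flops=1e10, bandwidth=1e12)

for name, (one, shared, two) in [("memory-bound", memory_bound),
                                 ("compute-bound", compute_bound)]:
    print(f"{name}: one pass {one:.4f}s, shared x2 {shared:.4f}s, two blocks {two:.4f}s")
```

Under these made-up numbers, the second pass through a shared block adds no latency on the memory-bound device and costs half as much as running two distinct blocks. On the compute-bound device the second pass doubles latency and sharing offers no speed advantage over two distinct blocks, which is the sense in which the same architectural choice can be right on one target and wrong on another.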
Inquiring lines that use this note as a source 19
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can offline context optimization reduce test-time latency like sleep-time compute?
- What constraints force mobile deployments to operate in the sub-billion parameter regime?
- What decomposition level minimizes both error rate and computational cost in practice?
- What mobile hardware constraints force the sub-billion parameter regime?
- How do conditional scaling laws incorporate hardware into architecture choices?
- How does adjacent layer sharing differ from non-adjacent weight reuse?
- Why would compute-replacement cost determine wages instead of productivity?
- Can layer-wise KV caches enable truly lossless information transfer?
- How does layer removal affect transformers compared to ResNets?
- Why does recomputing weights cost less than moving them on phones?
- Could deploying GPT-4 for everyone require 100 million specialized chips?
- What computational costs does closed-loop memory refinement introduce?
- What makes two timescales better than one for minimizing weight movement?
- How does spending offline compute affect wake-time prediction latency?
- Can KV cache pruning serve as an alternative to consolidation?
- Why does looping computation outperform adding more transformer layers?
- Why does reapplying the same transformer block work better than computing new layers?
- Does attention linearity alone explain the efficiency gains over standard transformers?
- Do scaling laws change when weight precision becomes a design variable?
Related concepts in this collection 3
- Does depth matter more than width for tiny language models?
  Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
  extends: same MobileLLM paper; depth-favoring architecture and weight sharing are complementary moves: sharing lets the deep-and-thin model be even deeper at the same parameter budget
- What actually limits language models on mobile phones?
  Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
  supports: weight sharing addresses precisely the DRAM bandwidth bottleneck that motivates the sub-billion regime; the constraint named there is the constraint exploited here
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  extends: the regime-dependence of architecture choice is exactly what conditional scaling laws formalize; weight sharing is the kind of architectural variable they incorporate
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
- Titans: Learning to Memorize at Test Time
- MatFormer: Nested Transformer for Elastic Inference
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- Repeat After Me: Transformers are Better than State Space Models at Copying
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
- Byte Latent Transformer: Patches Scale Better Than Tokens
Original note title
immediate block-wise weight sharing exploits memory-movement bottlenecks on device — recomputing a block twice costs less than moving its weights twice