SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Model Architecture and Internals

Does recomputing weights cost less than moving them on mobile?

Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.

Synthesis note · 2026-05-03 · sourced from Mobile

On mobile hardware, the latency bottleneck for transformer inference is often memory movement rather than arithmetic: fetching weights from DRAM into the compute units takes longer than the computation that uses them. MobileLLM exploits this asymmetry with immediate block-wise weight sharing. Rather than storing two adjacent transformer blocks with separate weights, it stores one block's weights and computes the block twice in sequence. The total weight footprint stays the same as in the model without the repeated blocks, but each block's weights now serve two consecutive block computations, so the second computation needs no second weight fetch.
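
A minimal sketch of the idea, using only the Python standard library. The toy "block" here is a single weight matrix with a residual tanh update, standing in for a real transformer block; `make_block`, `apply_block`, and `forward` are hypothetical names, and the fetch counter assumes a block's weights stay resident while the block is being repeated. It is an illustration of the schedule, not MobileLLM's implementation.

```python
import math
import random

def make_block(dim, rng):
    """Create one toy block: a dim x dim weight matrix (stand-in for attention + FFN)."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]

def apply_block(weights, x):
    """Residual update x + tanh(W x), a placeholder for a real transformer block."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]

def forward(blocks, x, repeats=1):
    """Apply each stored block `repeats` times in a row, then move to the next block.

    Returns the output and the number of weight fetches, under the assumption
    that a block is fetched once and stays resident while it is repeated.
    """
    fetches = 0
    for weights in blocks:
        fetches += 1                      # one fetch per stored block
        for _ in range(repeats):
            x = apply_block(weights, x)   # repeated use of resident weights
    return x, fetches

rng = random.Random(0)
dim, n_blocks = 8, 4
blocks = [make_block(dim, rng) for _ in range(n_blocks)]
x0 = [rng.uniform(-1, 1) for _ in range(dim)]

y_base, fetches_base = forward(blocks, x0, repeats=1)      # 4 layers deep
y_shared, fetches_shared = forward(blocks, x0, repeats=2)  # 8 layers deep
param_count = n_blocks * dim * dim                         # identical for both runs
```

Both runs use the same stored parameters and, under the stated assumption, the same number of weight fetches; only the effective depth and the amount of arithmetic differ.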

The latency overhead is minimal because the extra block application is arithmetic on weights that are already resident, and arithmetic is the cheap resource here, while the avoided weight fetch is a concrete saving in memory traffic. Crucially, this approach produces accuracy gains with no increase in model size: the shared block contributes representational capacity comparable to two distinct blocks because the second application operates on the output of the first, producing functionally different transformations even with shared parameters. This differs from across-layer sharing schemes that tie weights between non-adjacent layers, which lose more capacity and, unless the shared weights happen to stay cached between uses, also have to fetch them again.
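
The contrast between adjacent and non-adjacent sharing can be sketched by counting weight fetches for different layer schedules. This is a toy model with loud assumptions: fast memory is treated as an LRU cache holding a fixed number of whole blocks (one by default), and the function name `count_weight_fetches` and the schedule names are made up for illustration.

```python
def count_weight_fetches(layer_to_block, cache_slots=1):
    """Count weight fetches for a layer schedule under a small LRU weight cache.

    `layer_to_block[i]` is the id of the stored block used by layer i.
    Assumption: the cache holds `cache_slots` whole blocks and nothing else.
    """
    cache = []        # block ids, most recently used last
    fetches = 0
    for block_id in layer_to_block:
        if block_id in cache:
            cache.remove(block_id)        # hit: refresh recency below
        else:
            fetches += 1                  # miss: weights come from DRAM
            if len(cache) >= cache_slots:
                cache.pop(0)              # evict least recently used block
        cache.append(block_id)
    return fetches

n = 4
immediate = [b for b in range(n) for _ in range(2)]  # 0,0,1,1,2,2,3,3
non_adjacent = list(range(n)) * 2                    # 0,1,2,3,0,1,2,3
unshared = list(range(2 * n))                        # 8 distinct blocks

print(count_weight_fetches(immediate))     # 4
print(count_weight_fetches(non_adjacent))  # 8
print(count_weight_fetches(unshared))      # 8
```

With room for only one block in fast memory, the non-adjacent schedule fetches as often as a model with no sharing at all, even though it stores half the weights. If the cache could hold all four blocks (`cache_slots=4`), both sharing schedules would need only 4 fetches, so the advantage of the immediate schedule depends on fast memory being small relative to the model.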

The general principle is hardware-shaped architecture design. On compute-bound systems the optimization target is FLOP efficiency; on memory-bound systems it is memory-movement efficiency, and these can favor opposite architectural choices. Block-wise weight sharing makes sense on phones precisely because it spends extra compute to save memory bandwidth, and compute is exactly the resource that is abundant relative to memory bandwidth on mobile silicon. The same model on a different hardware target might benefit from the opposite trade. The related note "Can architecture choices improve inference efficiency without sacrificing accuracy?" formalizes this regime dependence: inference-cost-aware scaling laws make architectural choices like weight sharing first-class variables.
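
The regime dependence can be made concrete with an idealized roofline estimate, in which compute and memory traffic overlap perfectly and the slower of the two sets the time. Every number below is hypothetical and chosen only for illustration, activation traffic is ignored, and `roofline_time_s` is an assumed helper name, not part of any library.

```python
def roofline_time_s(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Idealized roofline: the slower of compute time and memory time wins."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / bandwidth_bytes_per_s
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound

# Hypothetical device and block sizes (assumptions, not measurements).
BLOCK_BYTES = 10e6       # weights of one block, e.g. 10M one-byte parameters
BLOCK_FLOPS = 2 * 10e6   # roughly 2 FLOPs per parameter per token
PEAK_FLOPS = 1e12        # FLOP/s
BANDWIDTH = 20e9         # bytes/s

# Single-token decoding: two distinct blocks vs. one block applied twice.
two_distinct, bound_a = roofline_time_s(
    2 * BLOCK_FLOPS, 2 * BLOCK_BYTES, PEAK_FLOPS, BANDWIDTH)
shared_twice, bound_b = roofline_time_s(
    2 * BLOCK_FLOPS, 1 * BLOCK_BYTES, PEAK_FLOPS, BANDWIDTH)

# Same comparison with 1000 tokens processed per weight fetch.
BATCH = 1000
batched_distinct, bound_c = roofline_time_s(
    2 * BLOCK_FLOPS * BATCH, 2 * BLOCK_BYTES, PEAK_FLOPS, BANDWIDTH)
batched_shared, bound_d = roofline_time_s(
    2 * BLOCK_FLOPS * BATCH, 1 * BLOCK_BYTES, PEAK_FLOPS, BANDWIDTH)
```

Under these assumed numbers the single-token case is memory-bound, so halving the bytes moved halves the estimated time. The batched case is compute-bound, so sharing changes nothing in the estimate: the same architectural choice pays off in one regime and is neutral in the other.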

Original note title

immediate block-wise weight sharing exploits memory-movement bottlenecks on device — recomputing a block twice costs less than moving its weights twice