TOPIC

Mobile and On-Device LLMs

4 synthesis notes · 2 source papers

Does depth matter more than width for tiny language models?

Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.

Does recomputing weights cost less than moving them on mobile?

Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
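The trade-off above can be illustrated with a toy count of weight fetches. This is a minimal sketch, not the paper's method: it assumes (hypothetically) that fast on-chip memory holds exactly one block's weights at a time, and the function name and schedules are invented for illustration.

```python
def count_weight_fetches(layer_schedule):
    """Count loads of block weights from slow memory into fast memory.

    Assumption (hypothetical): fast memory holds one block at a time, so a
    block is fetched whenever it differs from the block that ran just before.
    """
    fetches = 0
    resident = None  # block currently held in fast memory
    for block_id in layer_schedule:
        if block_id != resident:
            fetches += 1
            resident = block_id
    return fetches


# Three toy four-layer models. Each runs four blocks' worth of compute.
no_sharing = ["A", "B", "C", "D"]          # four unique blocks
interleaved_sharing = ["A", "B", "A", "B"]  # two unique blocks, reused later
adjacent_sharing = ["A", "A", "B", "B"]     # each block run twice in a row

fetches_no_sharing = count_weight_fetches(no_sharing)
fetches_interleaved = count_weight_fetches(interleaved_sharing)
fetches_adjacent = count_weight_fetches(adjacent_sharing)
```

Under this assumption, only the schedule that repeats a block immediately avoids the second fetch; the compute is the same in all three cases. Whether the reuse costs accuracy is the question the note examines, and this sketch says nothing about it.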

What actually limits language models on mobile phones?

Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.

Can ternary weights match full-precision model performance?

Can models trained natively with only three weight values (−1, 0, 1) achieve the same perplexity and task performance as standard full-precision models? This matters because ternary weights could dramatically reduce computational and energy costs.
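To make the three-value idea concrete, here is a minimal sketch that maps a list of weights to {−1, 0, 1}. It assumes an absmean-style scheme (scale by the mean absolute weight, round, then clip). It is applied here to fixed example weights only for illustration, which is not the same as training natively with ternary weights. The function name and sample values are hypothetical.

```python
import math


def ternarize(weights, eps=1e-8):
    """Map each weight to -1, 0, or 1 (assumed absmean-style scheme).

    Returns the ternary weights and the scale used, so that
    scale * ternary[i] roughly approximates weights[i].
    """
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


# Hypothetical example weights, chosen only for illustration.
example_weights = [0.9, -0.05, 0.4, -1.3, 0.02]
ternary_weights, scale = ternarize(example_weights)

# Three possible values carry log2(3) bits of information per weight,
# which is where the "1.58 bits" figure comes from.
bits_per_weight = math.log2(3)
```

Small weights collapse to 0 and large ones saturate at ±1, so the matrix multiply reduces to additions and subtractions of activations, scaled once by `scale`.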

Source papers (2)

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with f…
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Recent research, such as BitNet [WMD+23], is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every si…