Can attention linearity achieve similar efficiency gains as weight quantization?

This explores whether changing attention's computational complexity (linear or sparse attention) buys the same kind of efficiency as shrinking the weights themselves (quantization). The corpus suggests they are different kinds of efficiency that should not be measured on the same axis.

The question is read here as a head-to-head between two efficiency strategies: rework attention so its cost stops growing quadratically with sequence length (linearity, sparsity), or compress the model's parameters into fewer bits (quantization). Worth flagging up front: the corpus has rich material on the first family but no paper on weight quantization specifically. The closest thing to 'quantization' here is product quantization of *embeddings* in recommenders (Can discretizing text embeddings improve recommendation transfer?, Can discrete codes transfer better than text embeddings?), which discretizes representations to break text-similarity bias, not to save inference cost. So the honest synthesis is about what kind of efficiency attention restructuring delivers, and why it isn't the same trick as quantization.

The most direct evidence is the Sparse Frontier result: at equal compute, larger sparse-attention models *beat* smaller dense ones on long-context tasks (Does sparse attention trade off quality for speed?). The framing matters: in that result sparsity is not a quality-for-speed trade but Pareto-improving, expanding the whole cost-performance frontier. Quantization, by contrast, is lossy compression (a general characterization, since the corpus has no quantization paper): you accept some degradation to fit the model in less memory. That's the conceptual gap. Linear and sparse attention change the *shape* of the cost curve by removing or shrinking the term that grows quadratically with sequence length; quantization moves you along a fixed accuracy-vs-size curve, cutting weight memory by a roughly constant factor at any sequence length.
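To make the "shape versus constant factor" distinction concrete, here is a rough back-of-envelope sketch. It is an illustration under stated assumptions, not a measurement from any of the cited notes: the cost formulas count only the two large matrix products per attention head, and `HEAD_DIM` and `NUM_PARAMS` are hypothetical values chosen for the example.

```python
def softmax_attention_madds(n: int, d: int) -> int:
    """Approximate multiply-adds for one head of full softmax attention.

    Q @ K^T costs n*n*d and weights @ V costs n*n*d (softmax itself ignored).
    """
    return 2 * n * n * d


def linear_attention_madds(n: int, d: int) -> int:
    """Approximate multiply-adds for one head of kernelized linear attention.

    Building K^T @ V costs n*d*d and applying Q to it costs n*d*d.
    """
    return 2 * n * d * d


def weight_bytes(num_params: int, bits: int) -> int:
    """Memory needed to store the weights at a given bit width."""
    return num_params * bits // 8


HEAD_DIM = 128              # hypothetical head dimension
NUM_PARAMS = 7_000_000_000  # hypothetical model size

for n in (1_024, 16_384, 262_144):
    cost_ratio = softmax_attention_madds(n, HEAD_DIM) / linear_attention_madds(n, HEAD_DIM)
    memory_ratio = weight_bytes(NUM_PARAMS, 16) / weight_bytes(NUM_PARAMS, 4)
    print(f"n={n:>7}: attention cost ratio {cost_ratio:8.1f}x, "
          f"weight memory ratio {memory_ratio:.1f}x")
```

Under these simplified formulas the attention cost ratio is n / d, so it keeps growing with sequence length, while the 16-bit to 4-bit memory ratio stays fixed no matter how long the context is. Real systems add overheads (KV cache, kernels, dequantization) that this sketch deliberately ignores.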

The corpus pushes further: the real long-context wins may not come from making attention itself linear, but from *offloading* the long-range work entirely. Titans separates short-term quadratic attention from a compressed neural-memory module that adaptively stores surprising tokens, reaching 2M+ context without the quadratic penalty (Can neural memory modules scale language models beyond attention limits?). This is a hint that 'attention linearity' is one point in a larger design space — you can also hybridize, keeping attention quadratic where it's cheap and routing the expensive part elsewhere.
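The idea of a fixed-size memory that writes harder when a token is surprising can be sketched in a few lines. This is a toy loosely inspired by the description above, not the Titans implementation: it assumes a linear matrix memory, defines surprise as the prediction error on a key/value pair, and uses hypothetical values for `lr`, `momentum`, and `decay`.

```python
import numpy as np


class SurpriseMemory:
    """Fixed-size linear associative memory M such that M @ key ~ value.

    Assumption for illustration: surprise is the prediction error, and the
    write is a momentum gradient step on 0.5 * ||M @ key - value||^2.
    """

    def __init__(self, d_k: int, d_v: int, lr: float = 0.1,
                 momentum: float = 0.9, decay: float = 0.01):
        self.M = np.zeros((d_v, d_k))
        self.S = np.zeros_like(self.M)  # running "surprise" update direction
        self.lr, self.momentum, self.decay = lr, momentum, decay

    def write(self, key: np.ndarray, value: np.ndarray) -> float:
        err = self.M @ key - value      # how badly the memory predicted value
        grad = np.outer(err, key)       # gradient of the squared error w.r.t. M
        self.S = self.momentum * self.S - self.lr * grad
        self.M = (1.0 - self.decay) * self.M + self.S  # mild forgetting
        return float(np.linalg.norm(err))  # surprise measured before the write

    def read(self, key: np.ndarray) -> np.ndarray:
        return self.M @ key


rng = np.random.default_rng(0)
D_K, D_V = 16, 8
memory = SurpriseMemory(D_K, D_V)

key = rng.standard_normal(D_K)
key /= np.linalg.norm(key)
value = rng.standard_normal(D_V)

# Seeing the same association repeatedly makes it progressively unsurprising.
surprises = [memory.write(key, value) for _ in range(200)]
first_surprise, last_surprise = surprises[0], surprises[-1]

# A new association is surprising again and triggers a large update.
novel_key = rng.standard_normal(D_K)
novel_key /= np.linalg.norm(novel_key)
novel_value = rng.standard_normal(D_V)
novel_surprise = memory.write(novel_key, novel_value)

print(f"first={first_surprise:.3f} last={last_surprise:.3f} novel={novel_surprise:.3f}")
```

The point of the sketch is the cost profile: the memory stays `D_V x D_K` however many tokens are written, so the per-token cost does not depend on context length, which is what lets a hybrid keep quadratic attention confined to a short window.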

There's a deeper warning, though. Attention's quadratic structure isn't pure overhead; it does load-bearing work. A handful of massive activations function as implicit attention bias and steer probability onto specific tokens (Do hidden massive activations act as attention bias terms?), and soft attention's tendency to over-weight repeated content is baked into the architecture (Does transformer attention architecture inherently favor repeated content?). Strip or linearize attention and you may lose mechanisms the model quietly depends on. Weight quantization does not carry this structural risk in the same way, since it preserves the computational graph and only coarsens the numbers flowing through it.
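The contrast between "same graph, coarser numbers" and "a different function" can be shown on random data. The sketch below is illustrative only: it assumes symmetric per-tensor int8 quantization and a non-causal linear attention with an `elu(x) + 1` feature map, neither of which is taken from the cited notes, and it says nothing about massive activations specifically.

```python
import numpy as np

rng = np.random.default_rng(0)


# Quantization: the matmul is unchanged, only the weights are rounded.
def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization, so that w ~ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float32) * scale


w = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal((16, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_weight_err = float(np.abs(w - w_hat).max())  # bounded by about scale / 2
rel_quant_gap = float(np.linalg.norm(x @ w - x @ w_hat) / np.linalg.norm(x @ w))


# Linearization: softmax is replaced by a different similarity function.
def softmax_attention(qs, ks, vs):
    scores = qs @ ks.T / np.sqrt(qs.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ vs


def feature_map(t):
    # elu(t) + 1: a positive feature map used in some linear-attention work
    return np.where(t > 0, t + 1.0, np.exp(t))


def linear_attention(qs, ks, vs):
    fq, fk = feature_map(qs), feature_map(ks)
    kv = fk.T @ vs                 # (d, d_v) summary, no n x n matrix formed
    norm = fq @ fk.sum(axis=0)     # per-query normalizer, shape (n,)
    return (fq @ kv) / norm[:, None]


qs = rng.standard_normal((16, 64))
ks = rng.standard_normal((16, 64))
vs = rng.standard_normal((16, 64))
out_soft = softmax_attention(qs, ks, vs)
out_lin = linear_attention(qs, ks, vs)
rel_attn_gap = float(np.linalg.norm(out_soft - out_lin) / np.linalg.norm(out_soft))

print(f"relative output change, int8 weights:     {rel_quant_gap:.4f}")
print(f"relative output change, linear attention: {rel_attn_gap:.4f}")
```

On untrained random inputs the comparison is only suggestive: a model trained with linear attention from the start would learn different weights, so the gap here measures how different the two functions are, not how much quality is lost.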

The unexpected takeaway is that the corpus keeps showing efficiency coming from *structure*, not compression. A single-layer linear autoencoder with a zero-diagonal constraint outperforms most deep collaborative-filtering models (Can a linear model beat deep collaborative filtering?); deep-and-thin designs beat balanced ones at the 125M–350M scale (Does depth matter more than width for tiny language models?); and folding architectural variables like MLP-to-attention ratio into scaling laws yields up to 42% more throughput *with* up to 2.1% higher accuracy (Can architecture choices improve inference efficiency without sacrificing accuracy?). That's the real answer: attention linearity and quantization aren't rivals for the same prize. Quantization is a roughly constant-factor saving you apply after the fact; structural choices like linearity decide how steep the curve was to begin with. The corpus's bet is that the structural wins are the bigger, and riskier, lever.

Sources 9 notes

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Do hidden massive activations act as attention bias terms?

A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Can a linear model beat deep collaborative filtering?

EASE, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential: structural bias matters more than model capacity.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
