Do hidden massive activations act as attention bias terms?
Explores whether a tiny handful of unusually large activations in LLMs function as structural bias terms that shape attention patterns, regardless of input content.
Most LLM research focuses on external behavior; this work looks inside and finds a surprising internal phenomenon, massive activations: a very small number of activations with values up to ~100,000× larger than the rest. They are widespread across model sizes and families, and they have three load-bearing properties. First, their values stay largely constant regardless of input. Second, because of that constancy, they function as indispensable implicit bias terms rather than as carriers of input-specific information. Third, they concentrate attention probability onto their corresponding tokens, producing an implicit bias in the self-attention output. The same phenomenon appears in Vision Transformers.
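A minimal sketch of how such outliers could be flagged in a single hidden state. The function name `find_massive_activations`, the toy data, and both thresholds (an absolute floor and a multiple of the median magnitude) are assumptions chosen for illustration, not values taken from this note.

```python
from statistics import median

def find_massive_activations(hidden, min_magnitude=100.0, ratio_to_median=1000.0):
    """Return (token, channel, value) triples for outlier activations.

    hidden: list of per-token lists of floats (tokens x channels).
    Both thresholds are illustrative assumptions, not fixed constants.
    """
    # Median magnitude over the whole hidden state is the baseline.
    med = median(abs(x) for row in hidden for x in row)
    hits = []
    for t, row in enumerate(hidden):
        for c, x in enumerate(row):
            # Flag values that are large in absolute terms AND
            # far larger than a typical activation.
            if abs(x) > min_magnitude and abs(x) >= ratio_to_median * med:
                hits.append((t, c, x))
    return hits

# Toy hidden state: 4 tokens x 6 channels of small values, with one
# huge value planted at token 0, channel 3 (hypothetical data).
hidden = [[0.1 * ((t + c) % 5 - 2) for c in range(6)] for t in range(4)]
hidden[0][3] = 2500.0

massive = find_massive_activations(hidden)
print(massive)
```

On a real model the same check would be run on each layer's hidden states across many different inputs; under the note's claim, the flagged positions and values should barely change from input to input.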
The keeper is mechanistic: a tiny number of constant, input-agnostic activations do structural work, implementing a bias the architecture needs, and they are the substrate of the "attention sink" behavior, where attention piles onto a few tokens. Naive pruning or quantization can destroy them and break the model, which is why they matter for compression and interpretability.
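The sink-as-bias mechanism can be illustrated with a toy single-query attention step. This is a sketch under stated assumptions: the logits, the value vectors, and the premise that the sink token's value vector is identical across inputs are all made up for illustration, and real models compute logits from queries and keys rather than taking them as given.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend(logits, values):
    """Single-query attention: probability-weighted sum of value vectors."""
    probs = softmax(logits)
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return probs, out

# Token 0 plays the sink: its logit is far above the rest, and its value
# vector is the same for every input (both are assumptions of this toy).
sink_value = [1.0, -2.0]
logits = [12.0, 0.3, -0.1, 0.5]

# Two "inputs" that differ in every non-sink value vector.
probs_a, out_a = attend(logits, [sink_value, [0.4, 0.9], [-0.7, 0.2], [0.1, -0.3]])
probs_b, out_b = attend(logits, [sink_value, [-0.9, 0.5], [0.8, 0.8], [-0.2, 0.6]])

print(probs_a[0])    # nearly all attention mass lands on the sink token
print(out_a, out_b)  # both outputs sit very close to sink_value
```

Because almost all probability mass lands on the sink token, the attention output is nearly the same fixed vector for both inputs. That is the sense in which the sink behaves like an additive bias term rather than a carrier of content.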
This connects to the vault's attention-mechanism thread. It is the activation-level companion to Does transformer attention architecture inherently favor repeated content?: both locate structural attention biases beneath the level of training. It also explains a failure mode for aggressive quantization like Can ternary weights match full precision model performance?, where preserving these rare massive values is essential.
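A small sketch of why a single massive value is awkward for naive quantization. It uses generic symmetric absmax quantization, which is a stand-in chosen for illustration and not a method described in this note; the helper `absmax_quantize`, the `clip` parameter, and the numbers are hypothetical.

```python
def absmax_quantize(xs, bits=8, clip=None):
    """Symmetric per-tensor quantize-then-dequantize with one shared scale.

    If clip is given, values are clamped to [-clip, clip] first.
    """
    if clip is not None:
        xs = [max(-clip, min(clip, x)) for x in xs]
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = max(abs(x) for x in xs) / qmax
    return [round(x / scale) * scale for x in xs]

ordinary = [0.8, -1.2, 0.5, 1.5, -0.3]
with_outlier = ordinary + [2500.0]      # one planted massive value

deq_plain = absmax_quantize(ordinary)              # fine-grained scale
deq_outlier = absmax_quantize(with_outlier)        # scale set by the outlier
deq_clipped = absmax_quantize(with_outlier, clip=2.0)  # outlier clamped away

print(deq_plain)
print(deq_outlier)
print(deq_clipped)
```

With the outlier left in, the shared scale is so coarse that every ordinary value rounds to zero. With the outlier clipped, the ordinary values survive but the massive value itself is lost. Schemes that keep the rare massive values at higher precision avoid both outcomes, which is the point the note makes about preserving them.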
Inquiring lines that use this note as a source 12
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does attention sink behavior relate to internal model architecture?
- How do LLM activations sparsify differently under out-of-distribution inputs?
- What structural biases does transformer attention have before training?
- What makes looped latent computation more efficient than scaling attention capacity?
- How does disentangled attention separate text from spatial reasoning?
- Why does attention concentrate on the first 25% of long input sequences?
- Can adaptive memory modules combine long-term filtering with short-term attention benefits?
- What task profiles favor recurrent filtering over scaled attention mechanisms?
- Does attention linearity alone explain the efficiency gains over standard transformers?
- How does reducing activation precision further extend context length?
- Can attention linearity achieve similar efficiency gains as weight quantization?
- Why do hybrid attention architectures outperform pure linear attention models?
Related concepts in this collection 3
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  activation-level companion to that attention-bias finding
- Can ternary weights match full precision model performance?
  Can models trained natively with only three weight values (−1, 0, 1) achieve the same perplexity and task performance as standard full-precision models? This matters because ternary weights could dramatically reduce computational and energy costs.
  rare massive values are exactly what aggressive quantization must preserve
- Do language models sparsify their activations under difficult tasks?
  When LLMs encounter unfamiliar or difficult inputs, do their internal representations become sparser rather than denser? Understanding this adaptive response could reveal how models stabilize reasoning under uncertainty.
  both probe the structure of LLM internal activations rather than outputs
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Massive Activations in Large Language Models
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
- Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data
- Mechanisms of Introspective Awareness
- System 2 Attention (is something you might need too)
- Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Original note title
a handful of input-agnostic massive activations function as implicit attention-bias terms in LLMs