Massive Activations in Large Language Models
We observe an empirical phenomenon in Large Language Models (LLMs): very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs, spanning different model sizes and families, and characterize their locations. Second, we find that their values largely stay constant regardless of the input, and that they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities on their corresponding tokens and, further, to implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers.
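To make the notion of "significantly larger than others" concrete, the following is a minimal sketch of how such activations could be flagged in a single hidden state. The function name `find_massive_activations`, the two thresholds (an absolute magnitude above 100 and a ratio of 1,000 to the median magnitude), and the synthetic data are illustrative assumptions, not values stated in this section. A real analysis would read hidden states from an actual model rather than from random numbers.

```python
import numpy as np

def find_massive_activations(hidden, abs_threshold=100.0, ratio_threshold=1000.0):
    """Return (token_index, feature_index) pairs for outlier activations.

    An entry is flagged when its magnitude exceeds `abs_threshold` AND is more
    than `ratio_threshold` times the median magnitude of the hidden state.
    Both thresholds are illustrative assumptions.
    """
    mags = np.abs(hidden)
    median = np.median(mags)
    mask = (mags > abs_threshold) & (mags > ratio_threshold * median)
    # np.argwhere lists matching indices in row-major order.
    return [(int(t), int(d)) for t, d in np.argwhere(mask)]

# Synthetic hidden state: 8 tokens x 64 features of unit-scale noise,
# with two hand-planted outliers on the first token.
rng = np.random.default_rng(0)
hidden = rng.normal(0.0, 1.0, size=(8, 64))
hidden[0, 5] = 2500.0
hidden[0, 17] = -1500.0

massive = find_massive_activations(hidden)
print(massive)
```

Comparing against the median rather than the mean keeps the reference scale from being inflated by the outliers themselves.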
Introduction. Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023) have demonstrated remarkable capabilities. Most existing studies of these models focus on their external behaviors, e.g., evaluating their performance on various tasks (Katz et al., 2023; Bubeck et al., 2023) or developing prompts to elicit accurate responses (Wei et al., 2022; Yang et al., 2023). While these studies are encouraging and highlight the potential of these models, it is also important to gain insights into their internal mechanisms, especially as they are increasingly integrated into real-world applications. However, research on the internal workings of these models remains relatively limited. In this work, we discover and study a surprising phenomenon in the internal representations of LLMs.
Discussion / Conclusion. Autoregressive training of large Transformers has brought significant advances in natural language processing. This study reveals the widespread existence of massive activations in these Large Language Models (LLMs). The values of these activations are input-agnostic yet crucial for model performance, even though the activations are extremely few in number. We establish a close connection between massive activations and the self-attention mechanism: LLMs use them to implement an implicit form of bias in the attention computation. Our findings also generalize well to Vision Transformers (ViTs). We hope the new results presented in this work contribute to a deeper understanding of today’s large-scale foundation models.
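The implicit-bias reading of attention can be written out as a short decomposition. This is a sketch under assumed notation, none of which is defined in this section: $\mathcal{C}$ denotes the set of token positions that carry massive activations, $p^{k}_{i}$ the attention probability from query token $k$ to token $i$, and $v_i$ the value vector of token $i$. Splitting the causal attention output for token $k$ by membership in $\mathcal{C}$ gives:

```latex
\mathrm{Attn}_k
  \;=\; \sum_{i \le k} p^{k}_{i}\, v_i
  \;=\; \underbrace{\sum_{i \in \mathcal{C},\; i \le k} p^{k}_{i}\, v_i}_{\text{tokens with massive activations}}
  \;+\; \sum_{i \notin \mathcal{C},\; i \le k} p^{k}_{i}\, v_i
```

If attention probabilities concentrate on the tokens in $\mathcal{C}$ and the first sum varies little across inputs, that term plays the role of an additive bias on the attention output, which is the sense in which the biases are implicit.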