Why does attention concentrate on the first 25% of long input sequences?

This explores why transformer attention piles up on the earliest tokens of a long input — the 'attention sink' phenomenon — and what mechanisms in the architecture produce that front-loading.

The corpus points less at a single cause than at a stack of structural pressures built into how attention works. The cleanest mechanical answer comes from the discovery that a tiny handful of input-agnostic 'massive activations', with values up to 100,000× larger than their neighbors, act as implicit attention bias terms, concentrating attention probability on a few fixed positions regardless of content (see: Do hidden massive activations act as attention bias terms?). Softmax weights must sum to one, so the probability has to land somewhere, and these tokens (often the very first ones) serve as a default reservoir. Attention therefore sinks toward the front as a side effect of the mechanism, not because the early tokens are actually the most relevant.
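A minimal sketch of the "softmax has to put its weight somewhere" point, assuming a single attention row and a scalar `sink_bias` added to position 0. Both `attention_with_sink` and `sink_bias` are hypothetical names for illustration; real models do not expose the effect of a massive activation as one scalar, and the scores below are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_sink(content_logits, sink_bias):
    """Attention weights when position 0 carries a content-independent bias.

    sink_bias is an assumed stand-in for the effect of a massive
    activation; it is not a parameter of any real model.
    """
    logits = [sink_bias] + list(content_logits)
    return softmax(logits)

# Eight content tokens with weak, nearly identical relevance scores.
content = [0.1, 0.0, 0.2, -0.1, 0.0, 0.1, -0.2, 0.0]

no_sink = attention_with_sink(content, sink_bias=0.0)
with_sink = attention_with_sink(content, sink_bias=5.0)

print(f"weight on position 0 without bias: {no_sink[0]:.3f}")
print(f"weight on position 0 with bias:    {with_sink[0]:.3f}")
```

When no content token stands out, a fixed bias on one position is enough to capture most of the probability mass, whatever the content scores are.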

That front-loading is reinforced by a second bias: soft attention systematically over-weights tokens that are context-prominent and repeated, creating a positive feedback loop that amplifies whatever appeared early or often (see: Does transformer attention architecture inherently favor repeated content?). Early tokens are visible to every later position, so their prominence compounds across the sequence: the longer the input, the more the opening establishes framing that later content struggles to dislodge. The same note ties this mechanism to sycophancy, since it amplifies opinions and framing before RLHF acts, and reports that regenerating the context to strip irrelevant material ('System 2 Attention') can interrupt it.
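One way to see the repetition effect in isolation is the toy sketch below. It assumes every occurrence of a token receives its own attention weight and that the per-position scores are equal for the repeated token and the single-occurrence one. The token names, scores, and the helper `attention_mass_by_token` are illustrative assumptions. It covers only the aggregation step, not the feedback loop across generation steps.

```python
import math
from collections import defaultdict

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mass_by_token(tokens, scores):
    """Sum the attention weight each distinct token receives over all its positions."""
    weights = softmax(scores)
    mass = defaultdict(float)
    for tok, w in zip(tokens, weights):
        mass[tok] += w
    return dict(mass)

# "opinion" appears three times, "fact" once; each position scores the same.
tokens = ["opinion", "fact", "opinion", "filler", "opinion"]
scores = [1.0, 1.0, 1.0, 0.0, 1.0]  # hypothetical per-position scores

mass = attention_mass_by_token(tokens, scores)
for tok, m in sorted(mass.items(), key=lambda kv: -kv[1]):
    print(f"{tok:8s} {m:.3f}")
```

With equal per-position scores, the repeated token collects exactly three times the mass of the single-occurrence token, which is the sense in which repetition alone buys attention regardless of relevance.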

The consequence is that the back 75% of a long input is effectively under-served, which fits the observation that reasoning accuracy degrades sharply well before the context window is full: in FLenQA it drops from 92% to 68% with just 3,000 tokens of padding, in a way that is task-agnostic and survives chain-of-thought prompting (see: Does reasoning ability actually degrade with longer inputs?). If attention cannot distribute itself evenly across positions, more tokens means more dilution of the parts that are not sitting in the sink. The conversational version of the same failure is the 'wrong turn' problem: models lock onto early guesses and cannot course-correct when information arrives gradually (see: Why do AI assistants get worse at longer conversations?). This is premature commitment to the front of the sequence again.
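The dilution argument can be sketched with a toy model: one relevant token with a fixed score competes against `n_padding` irrelevant tokens that all share a lower score. The function `relevant_weight` and the scores are assumptions for illustration. This is not the FLenQA setup, and the numbers it prints do not correspond to the 92% and 68% accuracy figures.

```python
import math

def relevant_weight(relevant_score, n_padding, padding_score=0.0):
    """Softmax weight on one relevant token among n_padding equal-score distractors."""
    rel = math.exp(relevant_score)
    pad = n_padding * math.exp(padding_score)
    return rel / (rel + pad)

# Hypothetical scores: the relevant token is 4 logits above every padding token.
padding_sizes = (10, 100, 1000, 3000)
weights = {n: relevant_weight(4.0, n) for n in padding_sizes}

for n in padding_sizes:
    print(f"{n:5d} padding tokens -> weight on relevant token {weights[n]:.4f}")
```

Even with a large score advantage, the relevant token's share shrinks roughly in proportion to the amount of padding, because every extra token adds a term to the softmax denominator.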

What's worth knowing is that the work collected here treats this as an architectural limit to route around rather than a bug to patch. Titans separates short-term quadratic attention from a long-term neural memory that prioritizes surprising tokens for storage, scaling past 2M tokens because it stops asking attention alone to carry the whole sequence (see: Can neural memory modules scale language models beyond attention limits?). TransformerFAM instead adds a feedback loop that lets a transformer attend to its own latent representations, fostering an emergent working memory for indefinitely long inputs without additional weights (see: Can models learn working memory by attending to their own latents?). Both are tacit admissions that vanilla attention's pull toward the opening tokens is the thing you have to engineer against, and that the 'first 25%' is a property of the mechanism rather than of the data.

Sources 6 notes

Do hidden massive activations act as attention bias terms?

A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can models learn working memory by attending to their own latents?

TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing whether attention-sink constraints in transformers still hold. The question: why does attention concentrate on the first 25% of long input sequences, and is this a fixed architectural property or a solvable problem?

What a curated library found — and when (dated claims, not current truth):
Findings span Nov 2023–Oct 2025. A curated library identified:
- Massive activations (values up to 100,000× larger than neighbors) function as implicit attention bias, concentrating probability on a few fixed tokens, often the first ones, input-agnostically (Feb 2024, arXiv:2402.17762).
- Reasoning accuracy drops from 92% to 68% with 3K tokens of padding, task-agnostic, surviving chain-of-thought (Feb 2024, arXiv:2402.14848).
- Soft attention systematically over-weights context-prominent repeated tokens, creating positive feedback that amplifies early framing (Nov 2023, arXiv:2311.11829).
- Proposed workarounds: Titans (Dec 2024, arXiv:2501.00663) separates short-term attention from neural memory prioritizing surprising tokens; TransformerFAM (Apr 2024, arXiv:2404.09173) adds feedback loops to latent representations for indefinite context.
- Sycophancy, wrong-turn problems, and CoT-length illusions all tie to premature commitment to early sequence positions (Jan–Oct 2025 follow-ups).

Anchor papers (verify; mind their dates):
- arXiv:2402.17762 (Massive Activations, Feb 2024)
- arXiv:2402.14848 (Input Length & Reasoning, Feb 2024)
- arXiv:2501.00663 (Titans, Dec 2024)
- arXiv:2510.27062 (Consistency Training & Sycophancy, Oct 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For massive activations, soft-attention bias, and reasoning collapse with padding: have newer architectures (e.g., MoE variants, retrieval-augmented, or post-Titans designs), training methods (consistency, contrastive, or position-aware loss), or inference tooling (caching strategies, attention masks, or adaptive computation) since relaxed or overturned these failures? Separate the durable question (does vanilla attention favor early tokens?) from perishable claims (does this kill reasoning beyond 3K tokens?).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — anything showing attention sinks are NOT architectural, or that the 25% rule reverses under specific conditions.
(3) Propose 2 research questions that assume the regime has moved: e.g., (a) do modern training recipes now suppress massive activations, making early-token bias learnable rather than structural? (b) do hybrid memory systems (Titans-like) now shift the constraint boundary, and if so, where?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
