Why do transformers weight early tokens more heavily than later ones?

This reads the question as being about why some tokens dominate a transformer's attention more than others — and gently corrects the premise: the corpus suggests the bias is less about a token's position (early vs. late) than about its prominence, repetition, and functional role.

This explores why a transformer seems to lean on certain tokens more than others. The most direct answer in the corpus reframes the question: the strongest known weighting bias isn't toward *early* tokens but toward *repeated and context-prominent* ones. Transformer soft attention systematically over-weights material that recurs or sits in a salient position regardless of whether it's relevant. That creates a positive feedback loop that amplifies whatever framing or opinion is already present, and it turns out to be one mechanical root of sycophancy, operating even before RLHF shapes the model (see "Does transformer attention architecture inherently favor repeated content?"). So if early tokens feel over-weighted, it's often because they set a frame that later tokens echo, and attention keeps re-summing that echo.
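The re-summing effect can be seen in a toy calculation. The sketch below is a minimal illustration, not the cited note's experiment: the scores and labels are made-up assumptions, and `attention_mass` is a hypothetical helper. It shows that because softmax weights are normalized across all positions, repeating a lower-scoring token several times lets its copies jointly claim more attention than a single higher-scoring token.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mass(scores, labels):
    """Sum the softmax weight received by each label (hypothetical helper)."""
    mass = {}
    for weight, label in zip(softmax(scores), labels):
        mass[label] = mass.get(label, 0.0) + weight
    return mass

# Assumed scores: the relevant token scores higher (2.0) than the framing token (1.0).
once = attention_mass([2.0, 1.0], ["relevant", "framing"])

# Same scores, but the framing token now appears four times in the context.
repeated = attention_mass(
    [2.0, 1.0, 1.0, 1.0, 1.0],
    ["relevant", "framing", "framing", "framing", "framing"],
)

print(once)      # the relevant token holds most of the mass
print(repeated)  # the four framing copies together outweigh it
```

No single framing copy scores higher than the relevant token here; the shift comes entirely from normalization summing over the repeats.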

There's a deeper structural reason underneath. Attention integrates every token through weighted parallel aggregation: it adds information up rather than selectively suppressing what's irrelevant. That's why models read 'additively' instead of 'resonantly,' and why they miss jokes, wordplay, and frame-dependent meaning: they lack the selective frame-activation a human reader uses to let one word silence the others (see "Why do AI systems miss jokes and wordplay so consistently?"). In that view, the model isn't really choosing early over late so much as it's unable to cleanly *down*-weight, so prominent and repeated signals accumulate disproportionate pull.
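The 'additive' reading can be made concrete with a single attention step. This is a minimal sketch under assumed numbers: `attend` is a hypothetical helper, and the scores and two-dimensional value vectors are invented for illustration. The output is a weighted sum of every value vector, and since softmax weights are strictly positive in exact arithmetic, even a strongly disfavored token still contributes something; it can be shrunk but not switched off.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Return the attention weights and the weighted sum of value vectors."""
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Assumed setup: the third token is scored far below the first,
# but it carries a large value in the second dimension.
scores = [3.0, 0.0, -3.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 5.0]]

weights, output = attend(scores, values)
print(weights)  # all three weights are greater than zero
print(output)   # the second dimension is nonzero because low-weight tokens still add in
```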

But not all tokens are equal, and the corpus shows the weighting that matters is functional, not positional. Only about 20% of tokens are high-entropy 'forking points' where the model actually makes a reasoning decision, and reinforcement learning (RLVR in the cited note) primarily adjusts these: training on just that minority matches or exceeds full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). Specific words like 'Wait' and 'Therefore' spike in mutual information with the correct answer, and suppressing them wrecks reasoning while suppressing the same number of random tokens doesn't (see "Do reflection tokens carry more information about correct answers?"). Models even internally rank tokens by functional category, preserving symbolic-computation tokens while pruning grammar and filler first (see "Which tokens in reasoning chains actually matter most?").
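To make 'high-entropy forking point' concrete, the sketch below computes the Shannon entropy of a next-token distribution at each position and keeps the top fraction. It is an illustration under assumptions: `forking_positions` is a hypothetical helper, the probability tables are invented, and selecting the top 20% by rank is one plausible reading of the note rather than the cited paper's exact procedure.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_positions(distributions, fraction=0.2):
    """Return indices of the highest-entropy positions, plus all entropies."""
    entropies = [entropy(d) for d in distributions]
    keep = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:keep]), entropies

# Invented distributions over a 4-token vocabulary at five positions.
# Position 2 is uniform, so the model is maximally undecided there.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]

positions, entropies = forking_positions(distributions, fraction=0.2)
print(positions)  # the single position selected as a forking point
```

In a training setup, a mask built from `positions` could restrict which tokens receive gradient updates; how the cited work applies that mask is not specified in this note.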

The surprise worth taking away: the heavy weighting isn't a clean positional rule you can trust; it's a bias toward salience and repetition that the architecture can't easily switch off. There's also a fix in the same literature. 'System 2 Attention,' which regenerates the context to strip out irrelevant material before answering, can interrupt the feedback loop, suggesting the over-weighting is a property of how attention reads its input rather than something baked irreversibly into the weights (see "Does transformer attention architecture inherently favor repeated content?").
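The two-pass idea behind System 2 Attention can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the instruction wording, the `system2_answer` helper, and the `fake_llm` stand-in (which simply drops lines containing "I think") are all hypothetical. In practice `llm` would be any function that maps a prompt string to a completion.

```python
from typing import Callable

# Assumed wording; the actual System 2 Attention prompt is not reproduced here.
REWRITE_INSTRUCTION = (
    "Rewrite the following text, keeping only material that is relevant "
    "and free of opinion or leading framing:\n\n"
)

def system2_answer(context: str, question: str, llm: Callable[[str], str]) -> str:
    """Pass 1 regenerates the context; pass 2 answers from the cleaned version only."""
    cleaned = llm(REWRITE_INSTRUCTION + context)
    return llm(f"Context:\n{cleaned}\n\nQuestion: {question}")

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model so the sketch runs without an API."""
    if prompt.startswith(REWRITE_INSTRUCTION):
        body = prompt[len(REWRITE_INSTRUCTION):]
        kept = [line for line in body.splitlines() if "I think" not in line]
        return "\n".join(kept)
    return "ANSWER based on: " + prompt

context = "The Eiffel Tower is in Paris.\nI think the answer is Rome."
result = system2_answer(context, "Where is the Eiffel Tower?", fake_llm)
print(result)  # the second pass never sees the opinionated line
```

The design point is that the answering pass attends only to the regenerated context, so the repeated or opinionated material is absent from the input rather than merely down-weighted.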

Sources (5 notes)

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Why do AI systems miss jokes and wordplay so consistently?

Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether transformer token-weighting biases have shifted since mid-2023. The precise question: do transformers weight early tokens more heavily than later ones, or is the real bias toward repetition and salience regardless of position?

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026 and suggest the weighting bias is NOT cleanly positional but FUNCTIONAL and SALIENCE-DRIVEN:

• Transformers over-weight repeated and context-prominent tokens, creating a feedback loop that amplifies framing — this operates before RLHF (~2023–2024).
• ~80% of tokens are 'low-entropy filler'; only ~20% are high-entropy 'forking points' where reasoning actually happens, and RL concentrates training on that minority (~2025–2026).
• Specific tokens like 'Wait' and 'Therefore' spike in mutual information with correct answers; suppressing them wrecks reasoning while random suppression doesn't (~2025–2026).
• Transformers read 'additively' (summing all signals) rather than 'resonantly' (selectively suppressing irrelevant frames), which prevents them from silencing early tokens even when later context should override (~2023–2024).
• System 2 Attention can interrupt this bias by regenerating context to strip irrelevant material before answering (~2023).

Anchor papers (verify; mind their dates):
• arXiv:2311.11829 — System 2 Attention (2023)
• arXiv:2506.01939 — Beyond the 80/20 Rule: High-Entropy Minority Tokens (2025–2026)
• arXiv:2506.02867 — Thinking Tokens are Information Peaks (2025–2026)
• arXiv:2601.03066 — Do LLMs Encode Functional Importance of Reasoning Tokens? (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For models released in the last 6 months (o1, o3, Claude 3.5 Sonnet, etc.), judge whether architectural changes (e.g., multi-head routing, sparse attention, in-context routing), training shifts (e.g., process reward models, outcome reward weighting), or new evaluation harnesses have RELAXED the additive-read problem or the salience-over-position bias. Separate the durable question ("Do transformers struggle to selectively suppress frames?") from the perishable limitation ("This can only be patched with System 2 Attention"). Cite what relaxed or held it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If recent papers show that attention *can* learn selective suppression, or that modern RLHF + process rewards naturally de-bias toward high-entropy tokens without explicit System 2 intervention, flag that directly.
(3) Propose 2 research questions that ASSUME the regime has moved: e.g., "If sparse routing now lets transformers suppress low-entropy tokens, does sycophancy persist as a salience bias in routing decisions rather than attention?" or "Do frontier models learn hierarchical token importance through process rewards, or do they still read additively at the base attention layer?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
