Why do transformers weight early tokens more heavily than later ones?
This reads the question as being about why some tokens dominate a transformer's attention more than others — and gently corrects the premise: the corpus suggests the bias is less about a token's position (early vs. late) than about its prominence, repetition, and functional role.
The most direct answer in the corpus reframes the question: the strongest known weighting bias is not toward *early* tokens but toward *repeated and context-prominent* ones. Transformer soft attention systematically over-weights material that recurs or sits in a salient position, whether or not it is relevant. That creates a positive feedback loop that amplifies whatever framing or opinion is already present, and it turns out to be one mechanical root of sycophancy, operating even before RLHF shapes the model (see "Does transformer attention architecture inherently favor repeated content?"). So if early tokens feel over-weighted, it is often because they set a frame that later tokens echo, and attention keeps re-summing that echo.
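The repetition effect can be seen in a toy calculation. This is a minimal sketch, not the cited note's experiment: it assumes a single attention head, hand-picked query-key scores, and that every copy of a repeated token receives the same score. Softmax normalizes over positions, not over distinct content, so the weights of the copies add up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mass(tokens, score_of):
    """Total attention weight per distinct token string.

    `score_of` maps a token to its (hypothetical) query-key score;
    every copy of a token gets the same score.
    """
    weights = softmax([score_of[t] for t in tokens])
    mass = {}
    for tok, w in zip(tokens, weights):
        mass[tok] = mass.get(tok, 0.0) + w
    return mass

# Made-up scores: the "fact" is slightly more relevant than the "opinion".
scores = {"opinion": 1.0, "fact": 1.5, "filler": 0.0}

once = attention_mass(["opinion", "fact", "filler"], scores)
echoed = attention_mass(["opinion"] * 4 + ["fact", "filler"], scores)
```

With these made-up numbers, the opinion token draws less total weight than the fact when it appears once, and more than the fact when it appears four times, even though its per-copy score never changed.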
There's a deeper structural reason underneath. Attention integrates every token through weighted parallel aggregation: it adds information up rather than selectively suppressing what's irrelevant. That's why models read 'additively' instead of 'resonantly,' and why they miss jokes, wordplay, and frame-dependent meaning: they lack the selective frame-activation a human reader uses to let one word silence the others (see "Why do AI systems miss jokes and wordplay so consistently?"). In that view, the model isn't choosing early over late so much as failing to cleanly *down*-weight, so prominent and repeated signals accumulate disproportionate pull.
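The "adds up rather than suppresses" point follows from the softmax itself: for finite scores every weight is strictly positive, so the output is a blend of all value vectors. A minimal sketch, with made-up scores and two-dimensional value vectors chosen only for illustration:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Weighted sum of value vectors under softmax attention."""
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, out

# Hypothetical setup: one relevant token (first axis) and three irrelevant
# tokens that score much lower but all point along the second axis.
scores = [3.0, 0.0, 0.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

weights, output = attend(scores, values)
```

A hard selector would return `[1.0, 0.0]` here. Soft attention returns a vector whose second component is small but nonzero, because the three low-scoring tokens still contribute and their contributions sum.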
But not all tokens are equal, and the corpus shows the weighting that matters is functional, not positional. Only about 20% of tokens are high-entropy 'forking points' where the model actually makes a reasoning decision, and reinforcement learning (RLVR in the cited note) concentrates its adjustments on these; training on just that minority matches or exceeds full updates (see "Do high-entropy tokens drive reasoning model improvements?"). Specific words like 'Wait' and 'Therefore' spike in mutual information with the correct answer, and suppressing them wrecks reasoning while suppressing an equal number of random tokens doesn't (see "Do reflection tokens carry more information about correct answers?"). Pruning experiments also show tokens ranked by functional category: symbolic-computation tokens are preserved while grammar and meta-discourse are pruned first (see "Which tokens in reasoning chains actually matter most?").
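To make "high-entropy forking points" concrete, the sketch below computes the Shannon entropy of the next-token distribution at each position and keeps the top fraction. The distributions are invented, and the selection rule (a plain top-20% ranking via the hypothetical `forking_positions` helper) is an assumption; the cited note does not spell out its exact criterion.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_positions(distributions, fraction=0.2):
    """Indices of the highest-entropy positions, keeping `fraction` of them."""
    k = max(1, round(len(distributions) * fraction))
    ranked = sorted(
        range(len(distributions)),
        key=lambda i: entropy(distributions[i]),
        reverse=True,
    )
    return sorted(ranked[:k])

# Hypothetical next-token distributions at five positions in a reasoning chain.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],  # genuine fork: four equally likely options
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.01, 0.00, 0.00],
    [0.85, 0.10, 0.03, 0.02],
]

forks = forking_positions(dists)  # 20% of 5 positions keeps one index
```

Here the only position kept is the uniform one, which is where the model is least committed to a single continuation. A training loop in this spirit would restrict its gradient updates to those positions.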
The surprise worth taking away: the heavy weighting isn't a clean positional rule you can trust. It's a bias toward salience and repetition that the architecture can't easily switch off, and there's a fix in the same literature. 'System 2 Attention,' which regenerates the context to strip out irrelevant material before answering, can interrupt the feedback loop. That suggests the over-weighting is a property of how attention reads its input rather than something baked irreversibly into the weights (see "Does transformer attention architecture inherently favor repeated content?").
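The regenerate-then-answer idea can be sketched as a two-call pipeline. Everything concrete here is an assumption: `llm` stands for any text-in, text-out callable, `system2_answer` is a hypothetical name, and the prompt wording is illustrative rather than taken from the cited note.

```python
from typing import Callable

# Illustrative prompt templates (assumed wording, not from the cited note).
REWRITE_PROMPT = (
    "Rewrite the text below, keeping only the parts that are relevant "
    "and unbiased for answering the question. Drop opinions and "
    "repeated framing.\n\nText:\n{context}\n\nQuestion:\n{question}"
)
ANSWER_PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"

def system2_answer(llm: Callable[[str], str], context: str, question: str) -> str:
    """Two-pass answer: regenerate the context first, then answer from it."""
    cleaned = llm(REWRITE_PROMPT.format(context=context, question=question))
    # The second call sees only the rewritten context, never the original,
    # so repeated or opinionated material cannot be attended to again.
    return llm(ANSWER_PROMPT.format(context=cleaned, question=question))
```

The design choice that matters is the hard boundary between the two calls: the filtering happens in text, outside the attention computation, which is why it can remove material that soft attention alone would only down-weight.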
Sources (5 notes)
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.