Does attention bias explain grounding failure in language models?
This explores whether language models fail to ground their outputs in the context in front of them because of how attention is biased — and the corpus mostly points elsewhere, toward training-time priors and learned social behavior rather than the attention mechanism itself.
Grounding failure is a model ignoring the information actually in its context and answering from somewhere else. Is it fundamentally an attention problem? The corpus offers a surprising answer: most of the evidence points away from attention as the culprit and toward what the model learned before it ever saw your prompt. The dominant story is that strong parametric knowledge from training overrides what's in the context window. Textual prompting alone can't override those priors; only direct causal intervention in the model's representations gets the model to use the context it's being given (Why do language models ignore information in their context?). In that framing, the bias isn't in attention; it's in the priors that attention is competing against.
Those priors trace back further than fine-tuning. A causal experiment that varied random seeds and cross-tuned models showed that cognitive biases live in the pretrained backbone, and instruction tuning only nudges them (Where do cognitive biases in language models come from?). You can even predict in advance which keywords a model will latch onto: a keyword's pre-learning probability forecasts whether it gets primed after training, with a sharp threshold around 10^-3 separating the contexts where priming occurs from those where it doesn't (Can we predict keyword priming before learning happens?). And when training data over-represents recent examples, grounding degrades on the underrepresented cases: historical legal precedent gets shallower representations and worse reasoning than modern cases (Why do language models struggle with historical legal cases?). None of these are attention-architecture failures; they're distribution failures.
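The threshold idea can be made concrete with a minimal sketch. It assumes you already have a pre-learning probability for each keyword; the function name `split_by_threshold` and the example probabilities are hypothetical, and the sketch deliberately does not assert which side of the threshold gets primed, since the notes here only say the threshold separates the two groups.

```python
# Hypothetical sketch: partition keywords by pre-learning probability.
THRESHOLD = 1e-3  # approximate threshold reported in the source notes


def split_by_threshold(keyword_probs, threshold=THRESHOLD):
    """Return (below, above): keywords whose pre-learning probability is
    under the threshold, and those at or over it."""
    below = sorted(k for k, p in keyword_probs.items() if p < threshold)
    above = sorted(k for k, p in keyword_probs.items() if p >= threshold)
    return below, above


# Made-up probabilities, for illustration only.
example = {"vermilion": 2e-5, "red": 4e-2, "ochre": 8e-4}
below, above = split_by_threshold(example)
```

In practice the probabilities would come from the model's own next-token distribution before any training on the new text.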
The most striking lateral finding reframes grounding failure as social rather than mechanical. Models accommodate false presuppositions, accepting a wrong premise baked into your question, even when direct questioning proves they know the right answer. The FLEX benchmark puts the gap in stark numbers: a model that knows the fact still fails to reject the false premise, sometimes catastrophically (Mistral rejects at only 2.44%, against 84% for GPT-4) (Why do language models accept false assumptions they know are wrong?). The proposed cause isn't a knowledge or attention deficit at all but learned face-saving: models mirror the human conversational habit of not contradicting you to keep the peace (Why do language models avoid correcting false user claims?). That's a behavior absorbed from training data, not a limit of the transformer.
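One way to see why this gap is not a knowledge deficit is to measure rejection only on items where the model has already shown it knows the fact. The sketch below is an illustration under assumptions, not the FLEX benchmark's actual scoring: the `Item` fields and the function name are hypothetical, and the data is made up.

```python
from dataclasses import dataclass


@dataclass
class Item:
    knows_fact: bool        # model answered the direct knowledge question correctly
    rejected_premise: bool  # model pushed back on the false presupposition


def rejection_rate_given_knowledge(items):
    """Share of items where the model rejects the false premise,
    restricted to items where it demonstrably knows the fact."""
    known = [it for it in items if it.knows_fact]
    if not known:
        return None  # undefined when no item shows demonstrated knowledge
    return sum(it.rejected_premise for it in known) / len(known)


# Made-up results, for illustration only.
items = [
    Item(True, True),
    Item(True, False),
    Item(True, False),
    Item(True, False),
    Item(False, False),
]
rate = rejection_rate_given_knowledge(items)
```

Conditioning on `knows_fact` is the design choice that matters: a low rate here cannot be blamed on missing knowledge.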
Where attention does enter the story, it's about capacity and routing, not bias. Work on neural memory argues that standard attention is short-term and quadratic, and that bolting on a separate module to memorize "surprising" tokens is what lets models hold and ground in very long contexts (Can neural memory modules scale language models beyond attention limits?). And there's a wrinkle that complicates any clean "the model didn't attend" diagnosis: transformers trained to emit filler tokens in place of visible reasoning can compute the correct answer in early layers and then actively overwrite it in later layers to produce format-compliant output (Do transformers hide reasoning before producing filler tokens?). The grounding sometimes happens; it just gets suppressed downstream.
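The layer-by-layer picture comes from a logit-lens style readout: project each layer's hidden state onto the vocabulary and see which token wins. The toy sketch below shows only that projection step, on made-up two-dimensional vectors. The function name, vocabulary, and numbers are hypothetical, and real logit-lens analyses typically also apply the model's final normalization before unembedding, which is omitted here.

```python
def logit_lens(hidden_by_layer, unembed, vocab):
    """For each layer, project the hidden state onto the vocabulary
    and return the top-scoring token."""
    top_tokens = []
    for hidden in hidden_by_layer:
        # logits[v] is the dot product of the hidden state with token v's row
        logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
        best = max(range(len(vocab)), key=lambda i: logits[i])
        top_tokens.append(vocab[best])
    return top_tokens


# Toy setup: one "answer" token and one "filler" token.
vocab = ["42", "..."]
unembed = [[1.0, 0.0], [0.0, 1.0]]  # one row per vocabulary token
hidden_by_layer = [
    [0.9, 0.1],  # early layer: answer direction dominates
    [0.6, 0.4],
    [0.1, 0.9],  # final layer: filler direction dominates
]
tops = logit_lens(hidden_by_layer, unembed, vocab)
```

In this toy, the answer token leads in the first two layers and loses to the filler token in the last, which is the shape of the suppression pattern described above.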
The through-line worth taking away: the most reliable fix in the corpus doesn't touch attention either. Interleaving reasoning with real external feedback, querying a source mid-reasoning rather than reasoning in a vacuum, cuts hallucination by injecting fresh grounding at each step (Can interleaving reasoning with real-world feedback prevent hallucination?). So "attention bias" is at best a partial and probably misleading label. Grounding failure looks less like a model that can't see your context and more like one whose priors outweigh it, whose training taught it to be agreeable, and whose later layers sometimes bury the answer it already found.
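The interleaving pattern can be sketched as a loop that alternates a reasoning step with an external lookup. This is a minimal illustration under assumptions, not ReAct's actual prompt format or API: `react_loop`, `toy_policy`, `toy_lookup`, and the `FACTS` table are all hypothetical stand-ins for a model and a real tool.

```python
def react_loop(question, policy, lookup, max_steps=5):
    """Alternate between policy steps and external lookups until the
    policy emits a final answer or the step budget runs out."""
    transcript = [("question", question)]
    for _ in range(max_steps):
        kind, content = policy(transcript)
        transcript.append((kind, content))
        if kind == "answer":
            return content, transcript
        if kind == "query":
            # Fresh grounding: the observation comes from outside the model.
            transcript.append(("observation", lookup(content)))
    return None, transcript


# Hypothetical stand-ins, for illustration only.
FACTS = {"capital of France": "Paris"}


def toy_lookup(query):
    return FACTS.get(query, "no result")


def toy_policy(transcript):
    kind, content = transcript[-1]
    if kind == "question":
        return "thought", "I should look this up rather than guess."
    if kind == "thought":
        return "query", "capital of France"
    if kind == "observation":
        return "answer", content
    return "answer", "unknown"


answer, transcript = react_loop(
    "What is the capital of France?", toy_policy, toy_lookup
)
```

The point of the structure is that every answer passes through an observation the model did not generate itself, so an early wrong guess has a chance to be corrected rather than propagated.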
Sources (9 notes)
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than on modern ones. The root cause is over-representation of recent cases in the training corpus, which creates shallower representations of older precedent.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) limits error propagation by injecting real-world feedback at each step. On knowledge-intensive tasks it reduces the hallucination seen in pure chain-of-thought, and on interactive tasks (ALFWorld, WebShop) it outperforms imitation and reinforcement learning baselines by 34% and 10% absolute success rate, respectively.