Why are receiver attention heads narrower in reasoning models than base models?

This explores why the attention heads that 'receive' or aggregate information from across the context appear to concentrate on fewer, more selective signals in models trained for reasoning compared to their base counterparts — and the corpus doesn't name 'receiver heads' directly, so I'm reading it through what we know about attention-head specialization, sparsification under load, and how reasoning training reshapes (rather than rebuilds) base-model machinery.

No note in the collection uses the exact phrase 'receiver attention heads,' so the honest move is to triangulate from three nearby findings rather than pretend there's a paper that answers it head-on. What the corpus does say, repeatedly, is that the heads doing the heavy lifting are already a tiny minority, and that reasoning training narrows attention further as a feature, not a defect.

Start with the base-model picture. Fewer than 5% of attention heads function as 'retrieval heads': sparse, universal mechanisms that pull specific facts out of long context, and pruning them causes hallucination even when the answer is sitting right there (see 'What mechanism enables models to retrieve from long context?'). So the aggregation work is already concentrated in a narrow set of heads before any reasoning training happens. That matters because of a second finding: post-training doesn't create reasoning, it selects it. Five independent methods all elicit reasoning that already lives in base-model activations, which reframes reasoning training as elicitation: sharpening and routing existing circuitry rather than growing new heads (see 'Do base models already contain hidden reasoning ability?'). A 'narrower' receiver head in a reasoning model is consistent with this: training is pruning the diffuse, exploratory attention of the base model down to the channels that actually carry the answer.
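One way to make 'narrower' concrete is to score each head by the entropy of its attention distribution and convert that to an effective number of attended tokens. This is an illustrative sketch, not the metric used in any of the notes cited here; the function names (`attention_entropy`, `effective_width`, `mean_head_width`) and the toy attention rows are assumptions made up for the example.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row, after normalizing it."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Zero-probability positions contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_width(weights):
    """exp(entropy): roughly how many tokens the row spreads its attention over."""
    return math.exp(attention_entropy(weights))


def mean_head_width(attn_rows):
    """Average effective width over all query positions of a single head."""
    if not attn_rows:
        raise ValueError("need at least one attention row")
    return sum(effective_width(row) for row in attn_rows) / len(attn_rows)


# Hypothetical toy heads: each inner list is one query's attention over 4 tokens.
broad_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
narrow_head = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]

print("broad head width: ", mean_head_width(broad_head))
print("narrow head width:", mean_head_width(narrow_head))
```

A uniform row over four tokens has an effective width of four, and a one-hot row has a width of one, so a lower average indicates a head that listens to fewer positions. Comparing this number for the same head index in a base model and its reasoning-trained counterpart would be one way to test the narrowing claim, assuming the attention rows can be extracted from both.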

The most suggestive parallel is the discovery that hidden states sparsify under difficulty. As tasks get harder or more out-of-distribution, an LLM's activations become substantially sparser in a localized, systematic way, and the authors read this as an adaptive selective filter that stabilizes performance, not a breakdown (see 'Do language models sparsify their activations under difficult tasks?'). Narrower attention is plausibly the same story told at the head level: when the model commits to a reasoning trajectory, it narrows what it listens to. Reasoning models, which are trained to stay on a difficult chain rather than hedge, may simply spend more time in this sparsified, narrow-attention regime by default.
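To put a number on 'sparser', a hidden-state vector can be scored with a standard sparsity index. The sketch below uses the Hoyer measure, which is 0 for a vector whose entries all have equal magnitude and 1 for a vector with a single nonzero entry. This is a swapped-in, generic measure chosen for illustration; the sparsification note does not say which statistic its authors used, and the example vectors are hypothetical.

```python
import math


def hoyer_sparsity(vec):
    """Hoyer sparsity: (sqrt(n) - L1/L2) / (sqrt(n) - 1), ranging from 0 to 1."""
    n = len(vec)
    if n < 2:
        raise ValueError("need at least two entries")
    l1 = sum(abs(x) for x in vec)
    l2 = math.sqrt(sum(x * x for x in vec))
    if l2 == 0:
        raise ValueError("sparsity is undefined for the all-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


# Hypothetical hidden states: one for an easy input, one for a hard input.
easy_state = [0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9, 1.1]
hard_state = [0.0, 0.1, 3.5, 0.0, 0.0, 0.2, 0.0, 0.0]

print("easy input sparsity:", hoyer_sparsity(easy_state))
print("hard input sparsity:", hoyer_sparsity(hard_state))
```

If the sparsification finding holds for a given model, this index should rise as inputs move further out of distribution. The same function could be applied to an attention row, which is what would tie the hidden-state result to the head-level narrowing discussed here.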

There's a cross-domain confirmation worth knowing. In multimodal models, verbose chain-of-thought actually degrades fine-grained perception, because the real bottleneck is visual attention allocation, not how much the model verbalizes (see 'Does verbose chain-of-thought actually help multimodal perception tasks?'). The lesson generalizes: reasoning gains come from where attention is pointed, not from spreading it wider. And mechanistic work shows that transformers trained with hidden chain-of-thought tokens compute the correct answer in layers 1-3 and then suppress it in the final layers for format compliance, meaning the genuinely load-bearing reading of context happens in a narrow band and gets actively suppressed downstream (see 'Do transformers hide reasoning before producing filler tokens?').

So the synthesis: narrower receiver heads in reasoning models likely aren't a side effect to be fixed; they're what 'commitment to a reasoning path' looks like in the attention pattern. The base model keeps its options open across many heads; the reasoning model has been selected, sharpened, and pushed into a sparse-and-focused regime, the same way activations sparsify under hard tasks. The thing you didn't know you wanted to know: the notes collected here treat narrowing (fewer active heads, sparser states, less verbalization) as the signature of competent reasoning rather than lost capacity, which inverts the intuition that more attention is better.

Sources (5 notes)

What mechanism enables models to retrieve from long context?

Less than 5% of attention heads across all model families function as retrieval heads, are intrinsic to short-context models, dynamically activate by context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher examining whether 'narrower receiver attention heads' in reasoning models versus base models is a genuine phenomenon, a measurement artifact, or a regime-dependent observation now superseded. The question remains open.

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026:
• Fewer than 5% of attention heads function as retrieval heads in base models; they are sparse, universal, and pruning them causes hallucination (2404.15574, 2024).
• Reasoning capability already exists latent in base-model activations; post-training elicits and sharpens existing circuitry rather than growing new heads (2412.04537, 2024).
• Hidden states sparsify under out-of-distribution or high-difficulty tasks as an adaptive selective filter that stabilizes performance (2603.03415, 2026).
• Verbose chain-of-thought degrades fine-grained perception in multimodal models because the bottleneck is visual/attentional allocation, not verbalization length (2502.07266, 2025).
• Transformers compute correct answers in early layers (1–3) then overwrite them downstream for format compliance; load-bearing context reading is narrow-band and actively suppressed (2412.04537, 2024).

Anchor papers (verify; mind their dates):
• arXiv:2404.15574 (Apr 2024) — Retrieval Head Mechanistically Explains Long-Context Factuality
• arXiv:2412.04537 (Dec 2024) — Understanding Hidden Computations in Chain-of-Thought Reasoning
• arXiv:2603.03415 (Mar 2026) — Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
• arXiv:2502.07266 (Feb 2025) — When More is Less: Understanding Chain-of-Thought Length in LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, determine whether newer model scaling, training methods (RLHF variants, process reward models, inference-scaling), architectural changes, or mechanistic tools have since relaxed or overturned the sparsity/narrowness claim. Separate the durable question ("Do reasoning models use narrower, sparser attention than base models?") from the perishable mechanism ("Is this due to post-training selection?"). Plainly state where the constraint still holds or has been broken.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper showing that newer reasoning models (o1, o1-pro analogs, or scaled test-time compute) DO NOT exhibit narrower heads, or that narrowness actively harms performance.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., (a) Do inverse-scaling laws apply to attention width under test-time compute? (b) Can you recover base-model broad attention in reasoning models via targeted pruning, and does it improve or degrade accuracy?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
