How do model priors enable targeted context queries without full attention?

This explores how a model's pre-trained knowledge (its priors) lets it find the right needle in a long context using only a sparse slice of its machinery — rather than every attention head scanning everything — and the corpus suggests the story is one of sparse, intrinsic mechanisms in productive tension with the priors themselves.

The corpus has a surprisingly concrete answer, hiding under different vocabulary. The cleanest piece is the discovery that retrieval isn't spread across the whole network: fewer than 5% of attention heads do the actual fact-fetching, and these 'retrieval heads' are universal across model families, present even in short-context models, and switch on dynamically depending on what the context asks for (see: What mechanism enables models to retrieve from long context?). They're causally necessary: prune them and the model hallucinates even though the answer is sitting right there in the prompt. That's the mechanism your question is circling: targeted querying is already sparse and prior-shaped, not full attention.
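To make 'retrieval head' concrete, here is a minimal sketch of one plausible way to score heads. It assumes you already have per-head attention weights for each decoding step and know which context token (if any) each emitted token was copied from. The function names, the toy attention values, and the 0.5 cutoff are all hypothetical illustrations, not taken from the source.

```python
def retrieval_scores(attn, copied_from):
    """Score each head by how often it looks at the token being copied.

    attn[h][t] is a list of attention weights over context positions for
    head h at decoding step t. copied_from[t] is the context index the
    emitted token was copied from, or -1 if the step was not a copy.
    Returns one score per head: the fraction of copy steps on which the
    head's most-attended position is exactly the copied position.
    """
    copy_steps = [t for t, src in enumerate(copied_from) if src >= 0]
    scores = []
    for head in attn:
        if not copy_steps:
            scores.append(0.0)
            continue
        hits = 0
        for t in copy_steps:
            weights = head[t]
            top = max(range(len(weights)), key=weights.__getitem__)
            if top == copied_from[t]:
                hits += 1
        scores.append(hits / len(copy_steps))
    return scores


def peaked(position, context_len, peak=0.9):
    """Toy attention row: most of the mass on one position."""
    rest = (1.0 - peak) / (context_len - 1)
    return [peak if i == position else rest for i in range(context_len)]


CONTEXT_LEN = 6
copied_from = [2, 3, 4, -1]  # steps 0-2 copy needle tokens, step 3 does not

attn = [
    [peaked(p, CONTEXT_LEN) for p in (2, 3, 4, 0)],  # head 0 tracks the needle
    [peaked(p, CONTEXT_LEN) for p in (0, 0, 0, 0)],  # head 1 stares at position 0
    [peaked(p, CONTEXT_LEN) for p in (2, 5, 5, 1)],  # head 2 hits once in three
]

scores = retrieval_scores(attn, copied_from)
CUTOFF = 0.5  # hypothetical cutoff, chosen only for this toy example
retrieval_heads = [h for h, s in enumerate(scores) if s >= CUTOFF]
print(scores, retrieval_heads)
```

On this toy input only head 0 clears the cutoff. The point is that 'retrieval head' can be treated as a measurable property of a head's behavior during copying, which is what makes a sparse subset identifiable at all.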

But 'enabled by priors' cuts both ways, and the interesting tension is that the same priors that let a model know what to look for can also stop it from looking. When a model's parametric knowledge is strong, it overrides what's actually in the context, and no amount of clever prompting fixes this, because text alone can't beat a confident prior; you need to intervene in the representations themselves (see: Why do language models ignore information in their context?). So priors are both the targeting system and the failure mode: they tell the retrieval heads where to aim, but if they're too loud the model answers from memory instead of from the page.

There's a deeper claim worth knowing here: prompting and context queries only ever reorganize what the model already knows; they can't inject anything new. Prompt optimization works entirely inside the training distribution, creating a hard ceiling no query strategy can cross (see: Can prompt optimization teach models knowledge they lack?). This reframes 'targeted context query' as activation rather than retrieval: you're not pulling information in so much as triggering knowledge that's already latent. The priming research makes this almost quantitative: whether a context cue successfully activates a piece of knowledge is predictable in advance from the keyword's pre-existing probability, with a sharp threshold around 10^-3 separating cues that fire from those that don't (see: Can we predict keyword priming before learning happens?). Priors don't just enable targeted queries; they decide which queries can land at all.
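The threshold check described above can be sketched in a few lines. This assumes you can read the model's next-token logits at the cue position; the vocabulary, logits, and function names are made up for illustration. The text here does not say which side of the threshold fires, so the sketch only reports which side a keyword falls on.

```python
import math

THRESHOLD = 1e-3  # the "around 10^-3" figure quoted in the text


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keyword_prior(logits, vocab, keyword):
    """Probability the model assigns to `keyword` before any new learning."""
    return softmax(logits)[vocab.index(keyword)]


def bucket(prob, threshold=THRESHOLD):
    """Report which side of the threshold a prior probability falls on."""
    return "below" if prob < threshold else "at_or_above"


# Hypothetical three-word vocabulary and logits at the cue position.
vocab = ["the", "red", "vermilion"]
logits = [5.0, 2.0, -4.0]

for word in ("red", "vermilion"):
    p = keyword_prior(logits, vocab, word)
    print(word, f"{p:.2e}", bucket(p))
```

Because the check uses only the pre-existing probability, it can be run before any fine-tuning or prompting experiment, which is what 'predictable in advance' amounts to in practice.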

If you want the architectural alternative, meaning what happens when you stop relying on attention to do the fetching, the Titans line splits the problem in two: keep attention for short-range work and offload long-range recall to a separate neural memory that selectively stores 'surprising' tokens, scaling past 2M tokens without attention's quadratic cost (see: Can neural memory modules scale language models beyond attention limits?). And a complementary reframing argues the real long-context bottleneck was never memory capacity but the compute needed to fold evicted context into the model's fast weights, essentially turning context into prior (see: Is long-context bottleneck really about memory or compute?). Read together, these say the field is actively trying to replace 'full attention over everything' with sparse, prior-mediated targeting, which is exactly the move the retrieval-heads finding shows the model already half-discovered on its own.
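The 'store only what is surprising' idea can be caricatured with a small gated buffer. This is a discrete toy, not the Titans architecture: the memory described above is a learned neural module, whereas this sketch simply keeps tokens whose surprise (negative log probability under the prior) exceeds a threshold. The class name, threshold, and capacity are hypothetical.

```python
import math


class SurpriseMemory:
    """Toy long-term store that keeps only high-surprise tokens."""

    def __init__(self, threshold_nats=3.0, capacity=2):
        self.threshold_nats = threshold_nats
        self.capacity = capacity
        self.items = []  # list of (surprise, token) pairs

    def observe(self, token, prob):
        """Offer a token with its prior probability; return True if written."""
        surprise = -math.log(prob)
        if surprise < self.threshold_nats:
            return False  # expected token: leave it to short-range attention
        self.items.append((surprise, token))
        if len(self.items) > self.capacity:
            # Evict the least surprising entry to stay within capacity.
            self.items.remove(min(self.items))
        return True

    def tokens(self):
        return sorted(token for _, token in self.items)


memory = SurpriseMemory()
stream = [("the", 0.5), ("zyx", 0.001), ("cat", 0.2), ("q9", 0.01), ("k7", 0.0001)]
written = [memory.observe(token, prob) for token, prob in stream]
print(written, memory.tokens())
```

Common tokens never reach the store, and when the buffer fills, the least surprising entry is dropped first. The prior decides what counts as worth remembering, which is the same gating role it plays for retrieval heads.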

The thing you didn't know you wanted to know: the boundary between 'context' and 'prior' is far blurrier than the framing implies. Targeted querying isn't a model reaching out to grab external facts — it's a sparse set of inherited circuits deciding which of its own latent associations to wake up, gated by probabilities set long before your prompt arrived.

Sources (6 notes)

What mechanism enables models to retrieve from long context?

Less than 5% of attention heads across all model families function as retrieval heads, are intrinsic to short-context models, dynamically activate by context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing whether a curated arXiv library's claims about model priors and sparse retrieval remain current or have been superseded.

The precise question (still open): How do model priors enable targeted context queries without full attention—and does the boundary between 'context' and 'prior' hold under newer models and methods?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025; treat as perishable checkpoints.
• Fewer than 5% of attention heads perform fact-fetching ('retrieval heads'); they are universal across model families and switch on dynamically depending on context (2024-04, arXiv:2404.15574).
• Retrieval heads are causally necessary: pruning them causes hallucination even when the answer is in the prompt (2024-04).
• Model priors override context when parametric knowledge is strong; text-level prompting cannot defeat this; only representation-level intervention works (implied 2024).
• Prompting and context queries can only activate latent knowledge, never inject new knowledge; a sharp threshold at ~10^-3 keyword probability separates queries that fire from those that don't (2024).
• Long-context architectures are shifting toward sparse, prior-mediated targeting: Titans separates short-range attention from neural memory for 'surprising' tokens, scaling to 2M+ tokens (2024-12, arXiv:2501.00663).
• The true long-context bottleneck is compute needed to fold evicted context into fast weights, not raw memory capacity (2024-12).

Anchor papers (verify; mind their dates):
• arXiv:2404.15574 (2024-04) – Retrieval Head Mechanistically Explains Long-Context Factuality
• arXiv:2501.00663 (2024-12) – Titans: Learning to Memorize at Test Time
• arXiv:2502.10708 (2025-02) – Injecting Domain-Specific Knowledge into Large Language Models
• arXiv:2512.24601 (2025-12) – Recursive Language Models

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 4, Gemini 2), training methods (DPO, PPO variants, synthetic data), tooling (RAG systems, external memory APIs), orchestration (multi-agent routing, hierarchical retrieval), or evals have since RELAXED or OVERTURNED it. Distinguish the durable question ('Can models query context without global attention?') from perishable limitations ('Only 5% of heads do retrieval'). Cite what resolved each constraint; plainly state where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (Sept 2025–Jan 2026). Look for papers that challenge the 'prior dominates context' thesis, show prompting CAN inject knowledge, or demonstrate retrieval heads aren't universal.
(3) Propose 2 research questions that ASSUME the regime may have shifted—e.g., do reasoning-time scaling models (o1, o3 chain-of-thought) sidestep the prior/context boundary? Do retrieval-augmented generation (RAG) systems with external knowledge bases + consistency training (arXiv:2510.27062) decouple prior strength from context integration?

Closing guardrail: Cite arXiv IDs; flag anything you cannot ground in a real paper.
