How does separating local and global context dependencies affect long-context performance?
This explores what happens when an architecture stops treating all context the same way and instead splits short-range, local dependencies (handled by attention) from long-range, global ones (handled by a separate memory or storage mechanism).
The corpus suggests this separation is more than an optimization: it tracks a real failure mode. Reasoning accuracy drops sharply long before the input reaches the context window's stated limit. Does reasoning ability actually degrade with longer inputs? shows accuracy falling from 92% to 68% with just 3,000 tokens of padding, and the degradation is task-agnostic and survives chain-of-thought prompting. So the problem isn't capacity. The likelier culprit is that uniform attention over a long span dilutes what matters.
The cleanest answer to the question is Titans in Can neural memory modules scale language models beyond attention limits?, which makes the split explicit: attention does the short-term, quadratic, local work, while a separate neural memory module compresses long-term information and prioritizes 'surprising' tokens for storage. Splitting the two lets the model scale past 2M tokens while outperforming standard Transformers and linear RNNs, the idea being that the global channel stops competing with the local one for the same compute. A complementary view comes from Is long-context bottleneck really about memory or compute?, which argues the real bottleneck was never memory capacity but the compute needed to consolidate evicted context into fast weights; performance improves with more consolidation passes. In other words, global context is expensive precisely because it has to be transformed into a different kind of representation than local context.
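The local/global split described above can be sketched in a few lines. This is a toy illustration, not the Titans implementation: `SurpriseMemory`, `run_stream`, the squared-error surprise score, and the hard write threshold are all assumptions made for the example. The source only says that attention handles the short-term work and that the memory prioritizes surprising tokens.

```python
from collections import deque


class SurpriseMemory:
    """Toy long-term memory: a linear map M trained online so that M @ key ~ value.

    Hypothetical simplification: a pair is written only when the memory
    predicts it badly (high "surprise"); well-predicted pairs are skipped.
    """

    def __init__(self, dim, lr=1.0, threshold=0.1):
        self.M = [[0.0] * dim for _ in range(dim)]
        self.lr = lr
        self.threshold = threshold
        self.writes = 0

    def read(self, key):
        # Matrix-vector product M @ key.
        return [sum(m * k for m, k in zip(row, key)) for row in self.M]

    def surprise(self, key, value):
        # Squared prediction error of the current memory on this pair.
        pred = self.read(key)
        return sum((p - v) ** 2 for p, v in zip(pred, value))

    def maybe_write(self, key, value):
        if self.surprise(key, value) <= self.threshold:
            return False  # already predictable, not worth storing
        pred = self.read(key)
        # One gradient step on 0.5 * ||M @ key - value||^2.
        for i, row in enumerate(self.M):
            err = pred[i] - value[i]
            for j in range(len(row)):
                row[j] -= self.lr * err * key[j]
        self.writes += 1
        return True


def run_stream(pairs, window_size, memory):
    """Local channel: a bounded window. Global channel: the surprise-gated memory."""
    window = deque(maxlen=window_size)  # oldest items fall out automatically
    for key, value in pairs:
        window.append((key, value))
        memory.maybe_write(key, value)
    return list(window), memory.writes


# Three repeats of one association, then a new one.
e0, e1 = [1.0, 0.0], [0.0, 1.0]
stream = [(e0, [0.0, 1.0])] * 3 + [(e1, [1.0, 0.0])]
memory = SurpriseMemory(dim=2)
window, writes = run_stream(stream, window_size=2, memory=memory)
```

In this run the window only ever holds the last two items, while the memory is written twice: once for the first occurrence of the repeated pair and once for the new pair. The repeats cost nothing on the global channel, which is the intuition behind storing only what is surprising.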
A second family achieves the same separation by removing global dependence almost entirely. Can reasoning systems forget history without losing coherence? decomposes a problem into a DAG and contracts it iteratively, so each reasoning state depends only on the current sub-problem, not on accumulated history; discarding the global channel keeps reasoning coherent instead of bloating it. Can recursive subtask trees overcome context window limits? prunes the KV cache by rule and still reasons accurately, even when manipulating 90% of the cache. Can algorithms control LLM reasoning better than LLMs alone? hands each step only its locally relevant context and hides the rest. The shared insight: much of what looks like 'long context' is global baggage that actively hurts the local computation.
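A minimal sketch of the "each step sees only its own context" idea, assuming the problem has already been decomposed into a dependency graph. The names `run_with_local_context` and `toy_solve` are hypothetical, and `toy_solve` is a stand-in for a model call; this is not the Atom of Thoughts contraction procedure or the LLM Programs framework, only an illustration of the information hiding they share.

```python
from graphlib import TopologicalSorter


def run_with_local_context(deps, solve):
    """Solve sub-problems in dependency order.

    deps maps each node to the set of nodes it directly depends on.
    solve(node, local_ctx) receives ONLY the results of those direct
    dependencies, never the full history of earlier steps.
    """
    results = {}
    context_sizes = {}
    for node in TopologicalSorter(deps).static_order():
        local_ctx = {parent: results[parent] for parent in deps.get(node, ())}
        context_sizes[node] = len(local_ctx)
        results[node] = solve(node, local_ctx)
    return results, context_sizes


def toy_solve(node, ctx):
    # Stand-in for an LLM call: leaves are known facts, inner nodes combine parents.
    leaves = {"a": 2, "b": 3}
    if node in leaves:
        return leaves[node]
    if node == "sum":
        return ctx["a"] + ctx["b"]
    if node == "double":
        return 2 * ctx["sum"]
    raise ValueError(f"unknown sub-problem: {node}")


deps = {"sum": {"a", "b"}, "double": {"sum"}}
results, context_sizes = run_with_local_context(deps, toy_solve)
```

Here the final step `double` is solved from a context of size 1 (just `sum`), even though three earlier steps have run. The context handed to each step is bounded by its number of direct dependencies rather than by the length of the history.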
There's a counter-current worth knowing. Not everything benefits from minimizing the global channel; sometimes you want to lean into it. Can models treat long prompts as external code environments? keeps the whole prompt as an external Python environment and queries it on demand, handling inputs 100x beyond the window. That is separation by externalization rather than compression. And Can long-context models resolve retriever-reader imbalance? shows the boundary shifting the other way: as readers got better at deep local reading over 4K-token chunks, the global retrieval step could afford to be coarser. The mechanism that actually does the global work inside attention turns out to be tiny and specific. What mechanism enables models to retrieve from long context? finds that fewer than 5% of attention heads are causally responsible for pulling facts from distant context, and pruning them causes hallucination even though the information is right there.
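Separation by externalization can be illustrated with a small sketch. The source describes storing the prompt in a Python REPL and querying it through code execution; the example below is a much narrower stand-in, assuming a hypothetical `PromptEnvironment` class with a single fixed `grep` method in place of model-written code.

```python
import re


class PromptEnvironment:
    """Holds a long prompt outside the model and exposes narrow queries over it.

    Hypothetical stand-in: a real system would let the model write and run
    arbitrary code against the stored text instead of one fixed method.
    """

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def grep(self, pattern, radius=40, limit=3):
        """Return up to `limit` snippets around regex matches of `pattern`."""
        hits = []
        for match in re.finditer(pattern, self._text):
            start = max(0, match.start() - radius)
            end = min(len(self._text), match.end() + radius)
            hits.append(self._text[start:end])
            if len(hits) >= limit:
                break
        return hits


# A long, mostly irrelevant prompt with one relevant sentence buried in it.
long_prompt = "filler. " * 5000 + "The launch code is 7421. " + "filler. " * 5000
env = PromptEnvironment(long_prompt)
hits = env.grep(r"launch code is \d+", radius=0)
surfaced_chars = sum(len(h) for h in hits)
```

Only the matched snippet would need to enter the model's context, so the global prompt can grow without the local working context growing with it.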
The thing you might not have known you wanted to know: the separation has a hard ceiling. Can long-context LLMs replace retrieval-augmented generation systems? shows long-context models can absorb semantic retrieval but still can't run relational queries that need joins across structured data. Why do language models ignore information in their context? shows that even when global context is present, strong training priors can override it entirely. So separating local from global buys you scale and coherence, but it doesn't, by itself, teach the model to reason over the global half or to trust it when its own training disagrees.
Sources (11 notes)
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Recursive Language Models store long prompts in a Python REPL and query them via code execution, avoiding attention degradation. RLMs outperform base models even on shorter prompts while handling inputs two orders of magnitude beyond context windows.
LongRAG shows that 4K-token retrieval units paired with long-context readers outperform 100-word retrieval units on standard benchmarks. The optimal RAG design shifts from precise retrieval to coarse ranking plus deep reading as context windows expand.
Fewer than 5% of attention heads across the model families studied function as retrieval heads. They already exist in models pretrained on short contexts, activate dynamically depending on the context, and are causally necessary for factuality. Pruning them causes hallucination even though the information is present in context.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.