Can long-context readers handle compositional tasks or just semantic search?

This note explores whether long-context language models, which read huge documents in a single pass, can actually *reason over and combine* the pieces they read, or whether they're really just very good lookup engines that find semantically similar text.

Put more precisely: can long-context readers do compositional work (joining, chaining, and combining facts across a long input), or is their strength limited to semantic search, finding the passage that *sounds like* your query? The corpus draws a surprisingly clean line here, and the answer is: mostly the latter, with a real wall at composition.

The sharpest evidence comes from the LOFT benchmark, which finds that long-context models can quietly absorb the job of a retrieval system for *semantic* lookups, with no special training needed, but collapse the moment a task requires relational queries such as joins across structured tables (see "Can long-context LLMs replace retrieval-augmented generation systems?"). Stuffing more text into the window doesn't close that gap; the limitation isn't how much the model can see, it's what it can *do* with what it sees. That dovetails with work on compositional reasoning showing that transformers tend to succeed by memorizing computation subgraphs from training and then fail drastically on novel combinations, with errors compounding step by step (see "Do transformers actually learn systematic compositional reasoning?"). Composition isn't a capability that scales with context length; it's a different kind of skill that the architecture doesn't reliably have.
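To make the distinction concrete, here is a minimal sketch contrasting the two query types. It is an illustration only, not the LOFT setup: the passages, tables, and the helper names `semantic_lookup` and `city_of` are all hypothetical, and a bag-of-words cosine stands in for whatever similarity a real model uses. The lookup returns the single passage that sounds most like the query; the join has to combine two facts that no single passage contains.

```python
import math
from collections import Counter

# Hypothetical toy corpus: no single passage links a person to a city.
passages = [
    "alice works in the storage team",
    "bob works in the compiler team",
    "the storage team is based in oslo",
    "the compiler team is based in lima",
]

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def semantic_lookup(query):
    """Semantic search: return the one passage most similar to the query."""
    q = bow(query)
    return max(passages, key=lambda p: cosine(q, bow(p)))

# The same facts as two structured tables (also hypothetical).
employees = [{"name": "alice", "team": "storage"}, {"name": "bob", "team": "compiler"}]
teams = [{"team": "storage", "city": "oslo"}, {"team": "compiler", "city": "lima"}]

def city_of(name):
    """Relational query: join employees to teams on the 'team' key."""
    city_by_team = {t["team"]: t["city"] for t in teams}
    for e in employees:
        if e["name"] == name:
            return city_by_team[e["team"]]
    return None

print(semantic_lookup("which team does alice work in"))
print(city_of("alice"))
```

The first call can be answered by finding one passage; the second cannot, because the answer only exists once the two tables are joined on `team`.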

There's also a quieter failure that undercuts even the 'good at search' story: reasoning accuracy degrades sharply as inputs grow, well *below* the advertised context limit, dropping from 92% to 68% with just a few thousand tokens of padding, even with chain-of-thought (see "Does reasoning ability actually degrade with longer inputs?"). So the long window is partly a capacity on paper only; the effective reasoning window is much smaller. One line of research argues that the real bottleneck isn't memory but the *compute* needed to consolidate read context into the model's working state. More consolidation passes help, which suggests that reading-then-reasoning is its own expensive operation, not a free side effect of attention (see "Is long-context bottleneck really about memory or compute?").
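A padding experiment of this kind can be sketched roughly as follows. This is an assumption-laden outline rather than the FLenQA harness: `build_padded_prompt`, `accuracy_by_length`, and `model_fn` are hypothetical names, whitespace-separated words stand in for tokens, and the filler sentences are assumed to be non-empty and irrelevant to the question. The idea is to hold the reasoning task fixed and vary only the amount of surrounding text.

```python
import random

def build_padded_prompt(facts, question, filler, target_words, seed=0):
    """Scatter the key facts (in order) among filler sentences until the
    body reaches roughly target_words words, then append the question."""
    rng = random.Random(seed)
    padding = []
    words = sum(len(f.split()) for f in facts)
    while words < target_words:
        sentence = rng.choice(filler)
        padding.append(sentence)
        words += len(sentence.split())
    # Pick sorted insertion slots so the facts keep their relative order.
    slots = sorted(rng.randint(0, len(padding)) for _ in facts)
    body = list(padding)
    for shift, (slot, fact) in enumerate(zip(slots, facts)):
        body.insert(slot + shift, fact)
    return " ".join(body) + "\n\nQuestion: " + question

def accuracy_by_length(model_fn, examples, lengths, filler):
    """Score the same examples at several padding lengths.
    model_fn is a hypothetical callable: prompt string -> answer string."""
    scores = {}
    for n in lengths:
        hits = 0
        for facts, question, answer in examples:
            prompt = build_padded_prompt(facts, question, filler, n)
            hits += int(model_fn(prompt).strip().lower() == answer.lower())
        scores[n] = hits / len(examples)
    return scores
```

Plotting `scores` against length for a real model is what would expose a gap like the one reported above, a fall of 24 percentage points from 92% to 68%.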

What's interesting is that the field is leaning *into* the search strength rather than fighting the composition weakness. LongRAG shows the optimal design shifting the burden from precise retrieval onto a long-context reader: coarse ranking plus deep reading beats fine-grained retrieval (see "Can long-context models resolve retriever-reader imbalance?"). That is exactly the move you'd make if you trusted the reader to *find and absorb* but not to *combine across structured relations*. And for genuinely multi-step work, agents do better when you ration reasoning per turn so each retrieval round has room to breathe, treating composition as an iterative external loop rather than something the reader does internally in one shot (see "Does limiting reasoning per turn improve multi-turn search quality?").
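The external-loop idea can be sketched as below. This is a toy under stated assumptions, not any published agent: `run_agent`, `toy_retrieve`, `toy_reason`, and the tiny `KB` are hypothetical, the reasoner is a hard-coded stand-in for a model call, and truncating each note to a word budget stands in for capping how much reasoning the model may generate per turn. Each turn retrieves one passage, reasons briefly, and either answers or issues the next query.

```python
def truncate(text, budget):
    """Keep at most `budget` whitespace-separated words."""
    return " ".join(text.split()[:budget])

def run_agent(question, first_query, retrieve, reason, max_turns=4, per_turn_budget=40):
    """Compose across hops in an outer loop, capping reasoning kept per turn."""
    evidence, notes = [], []
    query = first_query
    for _ in range(max_turns):
        evidence.append(retrieve(query))
        step = reason(question, evidence)
        notes.append(truncate(step["note"], per_turn_budget))
        if step["answer"] is not None:
            return step["answer"], notes
        query = step["next_query"]
    return None, notes

# Hypothetical two-hop knowledge base: person -> team, team -> city.
KB = {
    "alice team": "alice works in the storage team",
    "storage city": "the storage team is based in oslo",
}

def toy_retrieve(query):
    return KB.get(query, "")

def toy_reason(question, evidence):
    """Stand-in for a model call; the first turn is deliberately verbose."""
    if len(evidence) == 1:
        team = evidence[0].split()[-2]
        note = f"alice is on the {team} team; need its city " + "hmm " * 100
        return {"note": note, "next_query": f"{team} city", "answer": None}
    return {"note": "found the city", "next_query": None, "answer": evidence[-1].split()[-1]}

answer, notes = run_agent(
    "Which city is alice based in?", "alice team", toy_retrieve, toy_reason
)
print(answer, [len(n.split()) for n in notes])
```

The join from the person to the city happens in the loop, across two retrieval rounds, and the rambling first-turn note is cut to the budget so it cannot crowd out the evidence from the second round.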

The thing you might not have known you wanted to know: a long context window is closer to a bigger search index than a bigger brain. Semantic retrieval rides along for free; compositional reasoning has to be engineered back in, through structured query tools, external loops, or architectures (such as neural memory that compresses surprising tokens) that separate 'holding a lot' from 'computing over it' (see "Can neural memory modules scale language models beyond attention limits?").

Sources (7 notes)

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can long-context models resolve retriever-reader imbalance?

LongRAG shows that 4K-token retrieval units paired with long-context readers outperform 100-word retrieval units on standard benchmarks. The optimal RAG design shifts from precise retrieval to coarse ranking plus deep reading as context windows expand.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a systems analyst re-evaluating whether long-context LLM readers can handle compositional tasks (joins, chaining, multi-step reasoning across structured data) or are fundamentally limited to semantic search. This question remains open despite recent scale increases.

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026, covering the tension between capacity and reasoning.
• LOFT benchmark (~2024) shows long-context models subsume semantic retrieval but collapse on relational queries (joins across tables); more context doesn't close the gap.
• Reasoning accuracy degrades sharply *below* advertised limits: 92% → 68% with only a few thousand tokens of padding, even with chain-of-thought (~2024).
• Compositional failures trace to memorized training subgraphs; novel combinations fail with compounding errors (~2023).
• LongRAG design (~2024) shifts burden to coarse ranking + deep reading, accepting the reader's search strength while avoiding composition internally.
• Recent work (2024–2026) treats composition as external loops or agents with bounded reasoning per turn, not internal model operations.

Anchor papers (verify; mind their dates):
• arXiv:2305.18654 (2023-05) Faith and Fate: Limits of Transformers on Compositionality
• arXiv:2402.14848 (2024-02) Impact of Input Length on Reasoning Performance
• arXiv:2406.13121 (2024-06) Can Long-Context LMs Subsume Retrieval, RAG, SQL, and More?
• arXiv:2501.00663 (2024-12) Titans: Learning to Memorize at Test Time

Your task:
(1) RE-TEST each constraint: For semantic search (the claimed strength), verify whether recent tooling (prompt caching, speculative decoding, structured query APIs) has enlarged the effective reasoning window or made composition cheaper. For compositional failure, judge whether newer training regimes (process supervision, synthetic multi-step data, or models like o1) have bridged the gap, or whether the limitation persists. Separate the durable claim (composition ≠ scaling context) from the perishable one (composition is impossible).
(2) Surface the strongest contradicting work from the last ~6 months — especially any showing compositional success at scale, or any challenging the "search ≠ reasoning" boundary.
(3) Propose 2 research questions assuming the regime has moved: (a) Do recent memory-efficient architectures (sparse attention, neural compression) decouple reading capacity from compute cost, enabling composition? (b) Can fine-tuning on chain-of-thought + retrieval traces teach models to *emit* structured queries instead of memorizing composition subgraphs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
