INQUIRING LINE

Why do long-context language models struggle with compositional reasoning tasks?

This explores why models that can *hold* a huge context still fail to *compose* — to chain pieces of reasoning together — and the corpus suggests the problem isn't memory size but how transformers do reasoning at all.


This question reads as: long context windows give models more room to work, so why does combining multiple reasoning steps still break down? The corpus's surprising answer is that the failure has little to do with running out of context space: the cracks appear long before the window fills, and they trace back to how transformers reason in the first place. One benchmark study (FLenQA) found accuracy collapsing from 92% to 68% with just 3,000 tokens of padding, far below any capacity limit; the drop was task-agnostic and survived chain-of-thought prompting (see "Does reasoning ability actually degrade with longer inputs?"). So 'long context' is partly a red herring: the bottleneck shows up early and isn't about how much you can fit.
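The padding setup behind that result can be pictured with a small sketch. This is not the benchmark's actual code: the function name `pad_prompt`, its parameters, and the use of whitespace-separated words as a rough stand-in for tokens are all assumptions made for illustration. The idea is only that the facts needed for the answer stay fixed while irrelevant filler grows around them.

```python
import random


def pad_prompt(facts, question, filler, target_words, seed=0):
    """Embed the key facts, in order, inside irrelevant filler text.

    Hypothetical helper: length is measured in whitespace-separated
    words, which only approximates a real tokenizer's token count.
    """
    rng = random.Random(seed)
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")

    # Repeat the filler until there is enough padding, then trim.
    padding = []
    while len(padding) < target_words:
        padding.extend(filler_words)
    padding = padding[:target_words]

    # Sorted insertion points keep the facts in their original order.
    cuts = sorted(rng.randint(0, len(padding)) for _ in facts)

    pieces, prev = [], 0
    for cut, fact in zip(cuts, facts):
        pieces.append(" ".join(padding[prev:cut]))
        pieces.append(fact)
        prev = cut
    pieces.append(" ".join(padding[prev:]))

    context = " ".join(piece for piece in pieces if piece)
    return f"{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    prompt = pad_prompt(
        facts=["A is left of B.", "B is left of C."],
        question="Is A left of C?",
        filler="The weather was mild that day.",
        target_words=50,
    )
    print(prompt)
```

Sweeping `target_words` while holding the facts and question constant is one way to separate the effect of input length from the difficulty of the reasoning itself.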

The deeper story is what transformers are actually doing when they appear to reason compositionally. One line of work shows they don't learn systematic rules at all: they memorize 'computation subgraphs' from training and stitch them together, which works in-distribution but fails drastically on novel combinations, with errors compounding step by step (see "Do transformers actually learn systematic compositional reasoning?"). That compounding matters enormously for compositional tasks, where each step feeds the next. A complementary finding reframes the breaking point: models don't fail at a complexity threshold but at an *unfamiliarity* threshold, so any reasoning chain succeeds if it resembles training instances, regardless of length (see "Do language models fail at reasoning due to complexity or novelty?"). Compositional tasks are exactly where you generate combinations the model has never seen, so it falls off the memorized manifold.
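A back-of-the-envelope calculation shows why compounding matters so much. The sketch below assumes every step succeeds independently with the same probability, which is a simplification introduced here and not a claim from the cited work; the function name `chain_success` and the sample accuracies are illustrative.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes independent, identically reliable steps (a simplifying
    assumption), so the result is per_step_accuracy ** steps.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        rate = chain_success(0.95, steps)
        print(f"{steps:>2} steps at 95% per step: {rate:.3f}")
```

Under that assumption, a model that gets each step right 95% of the time completes a 10-step chain only about 60% of the time, and a 20-step chain about 36% of the time.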

There's also a question of *kind* of reasoning. When researchers strip semantic content away and leave only the formal logical structure, performance collapses even with the correct rules sitting in context: models lean on token associations and parametric commonsense rather than manipulating symbols (see "Do large language models reason symbolically or semantically?"). Compositional reasoning is the symbolic kind. This connects to a broader pattern where in-context information loses to strong training priors: models generate outputs inconsistent with their own context when parametric knowledge dominates, and prompting alone can't override it (see "Why do language models ignore information in their context?"). The more reasoning steps you stack, the more chances for a prior to hijack the chain.
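One way to picture "stripping semantic content" is to swap meaningful terms for arbitrary symbols while leaving the logical form untouched. The sketch below is an illustration of that idea only, not the procedure used in the cited study; the helper name `symbolize` and the `p1`, `p2` placeholder scheme are assumptions.

```python
import re


def symbolize(text, terms):
    """Replace each listed term with an arbitrary symbol (p1, p2, ...).

    Hypothetical helper: the logical structure of the text is kept,
    but the words that carry commonsense meaning are removed.
    """
    if not terms:
        return text, {}
    mapping = {term: f"p{i}" for i, term in enumerate(terms, start=1)}
    # Match longer terms first so overlapping terms are not split.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b"
    )
    replaced = pattern.sub(lambda match: mapping[match.group(1)], text)
    return replaced, mapping


if __name__ == "__main__":
    rules = "If X is a bird then X can fly. Tweety is a bird."
    abstract, mapping = symbolize(rules, ["bird", "fly"])
    print(abstract)
    print(mapping)
```

A model that manipulates symbols should answer the original and the symbolized version equally well; a gap between the two suggests it is leaning on what the words usually mean.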

Two notes complicate the gloomy picture in useful ways. One argues that some 'reasoning collapses' are really *execution* failures, not reasoning failures: text-only models know the algorithm but can't run a long multi-step procedure reliably, and giving them tools pushes them past the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Another reframes long context as a *compute* problem rather than a memory one: the real cost is consolidating earlier context into usable internal state, and more consolidation passes improve hard reasoning (see "Is long-context bottleneck really about memory or compute?"). Both point the same way: composition demands work the architecture doesn't naturally do in a single forward pass.

If you want the hopeful counter-thread, look at evidence that networks *do* form modular compositional subroutines in isolated subnetworks, and that pretraining makes this structure more reliable (see "Do neural networks naturally learn modular compositional structure?"). The tension between that and the subgraph-matching result is the live frontier: the machinery for genuine composition may be latent, but current training rewards memorized shortcuts over using it. For a wider map of *which* tasks predictably break, the autoregression-limits framing predicts failures from low-probability targets even on logically trivial problems (see "Can we predict where language models will fail?").


Sources (9 notes)

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about long-context language models and compositional reasoning. The question remains open: why do these models struggle with compositional tasks despite having room in their context window?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot, not current state.
- Accuracy collapses from 92% to 68% with just 3,000 tokens of padding, far below context limits (~2024).
- Models memorize 'computation subgraphs' from training and fail on novel combinations; errors compound step-by-step (~2023).
- Reasoning breakdown is driven by instance-level unfamiliarity, not task complexity — the model falls off the memorized manifold (~2024).
- Models are in-context semantic reasoners, not symbolic reasoners; they lean on token associations rather than manipulating formal logic (~2023).
- Long-context bottleneck may be compute (consolidating evicted context into usable state), not memory; more consolidation passes improve hard reasoning (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2402.14848 (2024): Same Task, More Tokens — the canonical padding collapse result.
- arXiv:2305.14825 (2023): In-Context Semantic Reasoners — the semantic vs. symbolic divide.
- arXiv:2412.04537 (2024): Hidden Computations in Chain-of-Thought — execution vs. reasoning failures.
- arXiv:2512.24601 (2025): Recursive Language Models — possible modular composition structures.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the padding-collapse result, the subgraph-matching bottleneck, and the semantic-reasoning claim: have newer models (o1, extended-context variants, 2025–2026 releases), training methods (synthetic compositional data, modular pretraining), or evaluation harnesses overturned or relaxed these limits? Separate the durable question (why composition remains hard) from perishable limitations (e.g., padding collapse with GPT-4o in Jan 2025). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Focus on tensions: do modular subnetworks (arXiv:2301.10884) genuinely emerge, or is that finding an artifact of isolated tasks? Does tool-use or scaffolding (e.g., Tree of Thoughts, 2023) durably raise the composition ceiling, or do models still memorize shortcuts?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If newer models learn true compositional subroutines under scaled-up training, what training signal or objective would reward that over memorization? (b) Does the bottleneck shift from memorization to *generalization* once models are large enough, and can we detect that shift empirically?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
