Why do long-context language models struggle with compositional reasoning tasks?
This explores why models that can *hold* a huge context still fail to *compose* — to chain pieces of reasoning together — and the corpus suggests the problem isn't memory size but how transformers do reasoning at all.
This question reads as: long context windows give models more room to work, so why does combining multiple reasoning steps still break down? The corpus's surprising answer is that the failure has little to do with running out of context space. The cracks appear long before the window fills, and they trace back to how transformers reason in the first place. One benchmark study (FLenQA) found accuracy collapsing from 92% to 68% with just 3,000 tokens of padding, far below any capacity limit; the drop was task-agnostic and survived chain-of-thought prompting (Does reasoning ability actually degrade with longer inputs?). So 'long context' is partly a red herring: the bottleneck shows up early and isn't about how much you can fit.
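The padding setup described above can be pictured with a small sketch. This is an illustration only, not the benchmark's actual harness: the `build_padded_prompt` helper, the whitespace-based token count, and the example facts and filler sentence are all assumptions made for this example.

```python
def build_padded_prompt(facts, question, filler, target_tokens):
    """Embed the task-relevant facts in irrelevant filler text.

    Token counts are approximated by whitespace splitting (an assumption;
    a real harness would use the model's own tokenizer).
    """
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")

    fact_words = sum(len(f.split()) for f in facts)
    budget = max(0, target_tokens - fact_words - len(question.split()))

    # Split the padding budget evenly across the gaps before, between,
    # and after the facts, so the facts are spread through the context.
    per_gap = budget // (len(facts) + 1)

    def padding(n_words):
        # Repeat the filler text until n_words words are available.
        repeats = n_words // len(filler_words) + 1
        return " ".join((filler_words * repeats)[:n_words])

    parts = []
    for fact in facts:
        parts.append(padding(per_gap))
        parts.append(fact)
    parts.append(padding(per_gap))
    parts.append(question)
    return " ".join(p for p in parts if p)


# Hypothetical two-step task: the answer requires combining both facts.
facts = ["Ana is taller than Ben.", "Ben is taller than Cy."]
question = "Is Ana taller than Cy?"
filler = "The weather report mentioned light rain over the hills."

short_prompt = build_padded_prompt(facts, question, filler, target_tokens=0)
long_prompt = build_padded_prompt(facts, question, filler, target_tokens=3000)
```

Both prompts pose the same two-step task; only the amount of irrelevant text differs. Comparing accuracy on `short_prompt` and `long_prompt` isolates the effect of input length from the difficulty of the reasoning itself.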
The deeper story is what transformers are actually doing when they appear to reason compositionally. One line of work shows they don't learn systematic rules at all: they memorize 'computation subgraphs' from training and stitch them together, which works in-distribution but fails drastically on novel combinations, with errors compounding step by step (Do transformers actually learn systematic compositional reasoning?). That compounding matters enormously for compositional tasks, where each step feeds the next. A complementary finding reframes the breaking point: models fail not at a complexity threshold but at an *unfamiliarity* threshold, so a reasoning chain of any length succeeds if it resembles training instances (Do language models fail at reasoning due to complexity or novelty?). Compositional tasks are exactly where you generate combinations the model has never seen, so it falls off the memorized manifold.
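To get a feel for why step-by-step compounding is so costly, here is a back-of-the-envelope model. It assumes every step succeeds independently with the same probability, which is a simplification introduced for this example and not a claim from the cited work; `chain_success` is a hypothetical helper.

```python
def chain_success(per_step_accuracy, n_steps):
    """Probability that every step in a chain succeeds.

    Assumes steps fail independently with identical accuracy
    (a simplifying assumption made for illustration).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return per_step_accuracy ** n_steps


# How quickly does a 95%-reliable step degrade as the chain grows?
for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
```

Under this assumption, a step that is right 95% of the time leaves roughly a 60% chance of completing a ten-step chain and about 36% for twenty steps. Real errors are unlikely to be independent, so treat the numbers as intuition rather than prediction.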
There's also a question of the *kind* of reasoning. When researchers strip semantic content away and leave only the formal logical structure, performance collapses even with the correct rules sitting in context; models lean on token associations and parametric commonsense rather than manipulating symbols (Do large language models reason symbolically or semantically?). Compositional reasoning is the symbolic kind. This connects to a broader pattern where in-context information loses to strong training priors: models generate outputs inconsistent with their own context when parametric knowledge dominates, and prompting alone can't override it (Why do language models ignore information in their context?). The more reasoning steps you stack, the more chances for a prior to hijack the chain.
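One way to picture the "strip the semantics" manipulation is to replace every content word in the rules and facts with an arbitrary symbol while leaving the logical form untouched. The sketch below is illustrative only: the `symbolize` helper, the word list, and the `p1, p2, ...` symbol format are assumptions made here, not the cited study's procedure.

```python
import re


def symbolize(text, content_words):
    """Replace each content word with an abstract symbol (p1, p2, ...).

    The same word always maps to the same symbol, so the logical
    structure survives while the familiar meaning is removed.
    """
    if not content_words:
        return text
    mapping = {w.lower(): f"p{i}" for i, w in enumerate(content_words, start=1)}
    # Try longer words first so a short word cannot shadow a longer one.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(1).lower()], text)


# Hypothetical rule and fact, chosen only for illustration.
rules = "If X is a bird then X can fly. Tweety is a bird."
abstract_rules = symbolize(rules, ["bird", "fly", "Tweety"])
print(abstract_rules)  # If X is a p1 then X can p2. p3 is a p1.
```

The two versions license exactly the same inference. If a model answers the original correctly but fails on the symbolized one, that suggests it was leaning on what it already associates with birds and flying rather than applying the rule.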
Two notes complicate the gloomy picture in useful ways. One argues that some 'reasoning collapses' are really *execution* failures, not reasoning failures: text-only models know the algorithm but can't run a long multi-step procedure reliably, and giving them tools pushes them past the supposed cliff (Are reasoning model collapses really failures of reasoning?). Another reframes long context as a *compute* problem rather than a memory one: the real cost is consolidating earlier context into usable internal state, and more consolidation passes improve hard reasoning (Is long-context bottleneck really about memory or compute?). Both point the same way: composition demands work the architecture doesn't naturally do in a single forward pass.
If you want the hopeful counter-thread, look at evidence that networks *do* form modular compositional subroutines in isolated subnetworks, and that pretraining makes this structure more reliable (Do neural networks naturally learn modular compositional structure?). The tension between that and the subgraph-matching result is the live frontier: the machinery for genuine composition may be latent, but current training rewards memorized shortcuts over using it. For a wider map of *which* tasks predictably break, the autoregression-limits framing predicts failures from low-probability targets even on logically trivial problems (Can we predict where language models will fail?).
Sources (9 notes)
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3,000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as backwards alphabet and letter counting.