Is the long-context bottleneck really about memory or compute?

Explores whether the challenge of handling long context windows stems from storage capacity limits or from the computational cost of transforming context into internal state. Understanding this distinction reshapes how we design language models.

Synthesis note · 2026-05-28 · sourced from Novel Architectures

The standard framing of the long-context problem is capacity: standard self-attention cost grows quadratically with context length, the KV cache grows with every token, and we run out of room. "Language Models Need Sleep" reframes it as a compute-allocation problem. When the context window fills, the model enters a "sleep": it performs N offline recurrent passes over the accumulated context, updates the fast weights in its state-space-model blocks through a learned local rule, then clears the KV cache and resumes. The information that would be lost on eviction is not stored verbatim; it is transformed into internal state by spending compute.
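The wake/sleep cycle described above can be sketched as a toy loop. This is a minimal illustration, not the paper's implementation: the class `ToySleeper`, its fields `cache_limit`, `sleep_passes`, and `lr`, and the Hebbian-style outer-product update are all assumptions introduced here. The outer product stands in for the learned local rule, and a plain matrix stands in for the fast weights of the state-space-model blocks. With this linear placeholder, extra passes only scale the update, whereas the mechanism in the note is recurrent.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class ToySleeper:
    """Toy stand-in for a model with a bounded KV cache and fast weights."""

    dim: int
    cache_limit: int   # hypothetical: tokens held before a sleep is triggered
    sleep_passes: int  # N in the text: offline passes per sleep
    lr: float = 0.1    # hypothetical step size for the placeholder rule
    kv_cache: List[Vector] = field(default_factory=list)
    fast_weights: List[Vector] = field(init=False, default_factory=list)
    sleeps: int = 0

    def __post_init__(self) -> None:
        # Start with an all-zero dim x dim fast-weight matrix.
        self.fast_weights = [[0.0] * self.dim for _ in range(self.dim)]

    def _local_update(self, x: Vector) -> None:
        # Hebbian-style outer product: a placeholder (assumption) for the
        # learned local rule, not the rule used in the paper.
        for i in range(self.dim):
            for j in range(self.dim):
                self.fast_weights[i][j] += self.lr * x[i] * x[j]

    def sleep(self) -> None:
        # N offline passes over the accumulated context, then evict it.
        for _ in range(self.sleep_passes):
            for x in self.kv_cache:
                self._local_update(x)
        self.kv_cache.clear()
        self.sleeps += 1

    def observe(self, x: Vector) -> None:
        # Wake phase: append to the cache; sleep once the window is full.
        self.kv_cache.append(x)
        if len(self.kv_cache) >= self.cache_limit:
            self.sleep()
```

Feeding four tokens to `ToySleeper(dim=2, cache_limit=3, sleep_passes=2)` triggers one sleep after the third token and leaves the fourth in a fresh cache. Raising `sleep_passes` spends more offline compute without changing how much the cache holds during the wake phase.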

This relocates the bottleneck. The question is not "how much can we hold?" but "how much compute do we spend converting recent context into persistent weights, and when?" The design shifts that compute to the sleep phase, preserving wake-time prediction latency. The empirical signature supports the compute reading: increasing the sleep duration N improves performance, with the largest gains on examples that require deeper reasoning. More offline compute buys more capability on hard cases, which is the test-time-scaling pattern moved to an offline window.
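A back-of-the-envelope cost model makes the relocation visible. The sketch below assumes, purely for illustration, that each token costs one unit of work per offline pass and that a sleep fires every time the window fills; the function `sleep_budget` and its parameter names are hypothetical, not taken from the paper.

```python
def sleep_budget(total_tokens: int, window: int, n_passes: int) -> dict:
    """Count offline work and peak cache size under a unit-cost assumption."""
    if total_tokens < 0 or window <= 0 or n_passes < 0:
        raise ValueError("total_tokens and n_passes must be >= 0, window > 0")
    sleeps = total_tokens // window  # one sleep per filled window
    return {
        "sleeps": sleeps,
        # Every sleep replays the whole window n_passes times.
        "offline_token_updates": sleeps * window * n_passes,
        # The cache never holds more than one window of tokens.
        "peak_cache_tokens": min(total_tokens, window),
    }
```

Under these assumptions, doubling `n_passes` doubles `offline_token_updates` while `peak_cache_tokens` stays fixed: the knob being turned is compute, not storage.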

The reframe is significant because it dissolves the capacity ceiling rather than raising it. A capacity solution adds memory; a compute solution adds passes. This connects to the vault's emerging theme that when a model thinks is as designable as how much it thinks. As "When should AI systems do their thinking?" argues, shifting inference to idle windows is a third temporal position for compute, and the sleep-consolidation mechanism is its architectural realization inside the weights. It also relates to alternatives that attack the capacity framing differently: as "Can neural memory modules scale language models beyond attention limits?" explores, one can add a long-term memory module instead of consolidating into fast weights.

Counterpoint: spending compute on consolidation is only a win if the offline budget is genuinely free; under continuous load with no idle time, the sleep cost competes with serving.

Why it matters: it tells architects to budget consolidation compute rather than chase ever-larger context windows.
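The counterpoint can be turned into a simple budgeting check. The sketch below assumes a fixed, known cost per token per offline pass and a known idle window, both in integer microseconds; the function `max_affordable_passes` and its parameters are hypothetical names used only for this example.

```python
def max_affordable_passes(idle_us: int, window: int, us_per_token_update: int) -> int:
    """Largest N whose sleep fits inside the idle window (integer microseconds)."""
    if idle_us < 0 or window <= 0 or us_per_token_update <= 0:
        raise ValueError("idle_us must be >= 0; window and cost must be > 0")
    cost_per_pass_us = window * us_per_token_update  # one full replay of the window
    return idle_us // cost_per_pass_us
```

With 2 seconds of idle time, a 1,000-token window, and 500 microseconds per token update, this yields 4 passes. With no idle time it yields 0, which is the case where any sleep would have to be paid for out of serving capacity.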

Related concepts in this collection 4

When should AI systems do their thinking? Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
the general principle of shifting inference to idle windows; sleep-consolidation realizes it inside the weights
Can neural memory modules scale language models beyond attention limits? Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
an alternative that adds long-term memory capacity rather than consolidating into fast weights
Can recursive subtask trees overcome context window limits? Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
another non-capacity approach: prune the KV cache rather than consolidate it into weights
Can models consolidate memories during offline sleep phases? This explores whether LLMs can use dedicated offline periods to consolidate short-term learning into permanent weights, avoiding catastrophic forgetting and the need for expensive retraining.
a different paper sharing the exact title "Language Models Need Sleep" (2606.03979, Behrouz/Mirrokni); its sleep consolidates via upward distillation plus RL dreaming rather than offline recurrence over the evicted KV cache. Same metaphor, different mechanism.
