Is the long-context bottleneck really about memory or compute?
Explores whether the challenge of handling long context windows stems from storage capacity limits or from the computational cost of transforming context into internal state. Understanding this distinction reshapes how we design language models.
The standard framing of the long-context problem is capacity: self-attention cost grows quadratically with context length, the KV cache grows linearly with it, and we run out of room. "Language Models Need Sleep" reframes it as a compute-allocation problem. When the context window fills, the model enters a "sleep": it performs N offline recurrent passes over the accumulated context, updates the fast weights in its state-space-model blocks through a learned local rule, then clears the KV cache and resumes. The information that would be lost on eviction is not stored verbatim; it is transformed into internal state by spending compute.
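The wake/sleep cycle described above can be sketched as a toy in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class `SleepingModel`, its methods, and the `lr` parameter are hypothetical names, the "KV cache" is a plain list of (key, value) pairs, and a fixed delta rule stands in for the learned local update rule, which this note does not specify.

```python
from typing import List, Tuple

Vector = List[float]


def matvec(weights: List[Vector], x: Vector) -> Vector:
    """Multiply a square matrix (stored as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class SleepingModel:
    """Toy model: a bounded KV cache plus fast weights written during sleep.

    ASSUMPTION: a fixed delta rule replaces the paper's learned local rule.
    """

    def __init__(self, dim: int, window: int, sleep_passes: int, lr: float = 0.5):
        self.window = window              # max (key, value) pairs held while awake
        self.sleep_passes = sleep_passes  # N: offline passes per sleep
        self.lr = lr                      # hypothetical step size for the stand-in rule
        self.kv_cache: List[Tuple[Vector, Vector]] = []
        self.fast_weights = [[0.0] * dim for _ in range(dim)]
        self.sleeps = 0

    def observe(self, key: Vector, value: Vector) -> None:
        """Wake phase: append to the cache, sleeping first if the window is full."""
        if len(self.kv_cache) >= self.window:
            self.sleep()
        self.kv_cache.append((key, value))

    def sleep(self) -> None:
        """Sleep phase: N passes over the cache, then clear it."""
        for _ in range(self.sleep_passes):
            for key, value in self.kv_cache:
                predicted = matvec(self.fast_weights, key)
                # Delta rule: W += lr * (value - W @ key) outer key
                for i, (v, p) in enumerate(zip(value, predicted)):
                    for j, k in enumerate(key):
                        self.fast_weights[i][j] += self.lr * (v - p) * k
        self.kv_cache.clear()
        self.sleeps += 1

    def recall(self, key: Vector) -> Vector:
        """Read a value back from the fast weights alone (cache not consulted)."""
        return matvec(self.fast_weights, key)
```

With one-hot keys and `lr=0.5`, one pass recovers half of each stored value and four passes recover 93.75% of it. That monotone gain with N is a property of this toy rule only; it is not a reproduction of the paper's results.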
This relocates the bottleneck. The question is not "how much can we hold?" but "how much compute do we spend converting recent context into persistent weights, and when?" The design shifts that compute to the sleep phase, preserving wake-time prediction latency. The empirical signature supports reading it as a compute story: increasing sleep duration N improves performance, with the largest gains on examples that require deeper reasoning. More offline compute buys more capability on hard cases, which is the test-time-scaling pattern moved to an offline window.
The reframe is significant because it dissolves the capacity ceiling rather than raising it. A capacity solution adds memory; a compute solution adds passes. This connects to the vault's emerging theme that when a model thinks is as designable as how much it thinks: as "When should AI systems do their thinking?" argues, shifting inference to idle windows is a third temporal position for compute, and the sleep-consolidation mechanism is its architectural realization inside the weights. It also relates to alternatives that attack the capacity framing differently: as "Can neural memory modules scale language models beyond attention limits?" describes, one can add a long-term memory module instead of consolidating into fast weights. Counterpoint: spending compute on consolidation is only a win if the offline budget is genuinely free; under continuous load with no idle time, the sleep cost competes with serving. Why it matters: it tells architects to budget consolidation compute rather than chase ever-larger context windows.
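The counterpoint about idle time can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names, the linear cost model (passes × tokens × seconds per token per pass), and every number in the usage note are assumptions, not figures from the paper.

```python
def sleep_cost_seconds(passes: int, window_tokens: int,
                       seconds_per_token_pass: float) -> float:
    """Estimated duration of one sleep under an ASSUMED linear cost model."""
    return passes * window_tokens * seconds_per_token_pass


def contended_seconds(sleep_seconds: float, cycle_seconds: float,
                      utilization: float) -> float:
    """Seconds of sleep that do not fit in idle time and so compete with serving.

    cycle_seconds: wall-clock time between two window fills (hypothetical input).
    utilization:   fraction of that time the accelerator spends serving, in [0, 1].
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    idle_seconds = (1.0 - utilization) * cycle_seconds
    return max(0.0, sleep_seconds - idle_seconds)
```

For example, under these assumed numbers, `sleep_cost_seconds(4, 1000, 0.001)` gives a sleep of about 4 seconds. With a 60-second cycle at 50% utilization it fits entirely in idle time (`contended_seconds` returns 0), while at 100% utilization all 4 seconds compete with serving.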
Inquiring lines that use this note as a source (123)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does context collapse affect what language models can meaningfully communicate?
- Why does removing language from its context destroy what makes it work?
- Can this distillation pattern apply beyond e-commerce to other latency-constrained domains?
- Can context compression preserve what matters without introducing bias?
- Do retrieval-augmented memory systems actually solve the compartmentalization problem?
- Does transformer attention architecture fundamentally prevent topic-aware memory?
- How do the six memory components combine across explicit and implicit paths?
- Can adaptive prompt-difficulty allocation compound with architectural efficiency improvements?
- How do sub-token and architecture-level compute optimization strategies compare?
- How does prompt optimization differ from building persistent activation context?
- Can offline context optimization reduce test-time latency like sleep-time compute?
- What execution feedback signals drive context updates without supervision labels?
- How does era sensitivity in legal cases compound with context length failures?
- What constraints force mobile deployments to operate in the sub-billion parameter regime?
- Why does bidirectional attention in diffusion models prevent KV cache reuse?
- Why do language models fail at coreference across long contexts?
- What does attentional state look like in a static context window?
- Are threads or virtual instances better candidates than hardware for the interlocutor?
- Can latent recurrence and energy minimization both escape the same computational depth constraints?
- When does long-context LLM reasoning fail where structured retrieval succeeds?
- Can long-context readers handle compositional tasks or just semantic search?
- How does adjacent layer sharing differ from non-adjacent weight reuse?
- Does input length alone explain instruction density performance loss?
- Can layer-wise KV caches enable truly lossless information transfer?
- How does per-token adaptive compute improve efficiency in recurrent reasoning?
- Can fast-slow separation improve both memory and generation in language models?
- Do decoder-only models have inherent architectural limits for non-sequential information?
- Can context windows and RAG actually change what language models generate?
- How would you redesign context integration to prevent prior associations from dominating?
- How do the three grokking phases connect to memorization capacity limits?
- How do neural memory modules extend context length beyond attention limits?
- Can long-context models handle compositional reasoning requiring structured logic?
- How does context complexity affect LLM performance on temporal reasoning tasks?
- Can parallel retrieval chains avoid the context consumption problem?
- How do cortical columns implement local inference over memory cycles?
- What makes memory trajectories topologically stable under persistent reuse?
- Can precomputed inferences be stored in memory modules between model interactions?
- How do retention gates regularize forgetting across different sequence model architectures?
- Why do embedding table lookups become memory-bound bottlenecks at scale?
- Can post-thinking compute on memory reduce query-time reasoning costs?
- Why do reasoning models fail when input length increases even below context limits?
- Why is offline knowledge distillation preferred when in-session signals matter?
- How does completion-driven KV pruning differ from attention-based cache management?
- What tree depth is achievable before GPU memory becomes the bottleneck?
- Why does attention quality degrade as context length increases?
- How do model priors enable targeted context queries without full attention?
- Can recursive sub-calls decompose reasoning across multiple context chunks?
- What is the cost difference between filtering context versus attending to everything?
- Can models internalize retrieved context as static parametric knowledge?
- How should inference compute budget be allocated across different prompt difficulties?
- What persistent memory architectures best support storing precomputed inferences across sessions?
- How does precomputing context reasoning reduce latency in stateful applications?
- How should tiny language models be architected differently than large ones?
- How do parallel sampling and sequential depth compare as scaling dimensions?
- Where does inference compute stop substituting for model capacity?
- Can compressive memory track what matters most across 35 conversation sessions?
- Why do longer context windows alone fail to capture temporal dynamics in dialogue?
- What makes multi-session context tracking harder than single-turn underspecification problems?
- What computational cost does trajectory-bursty inference impose on per-query context requirements?
- Why does context work differently in AI than in conventional software?
- How do trajectory quality and memory hygiene differ as evaluation metrics?
- What makes a memory reachable in the right context?
- What computational costs does closed-loop memory refinement introduce?
- Can memory consolidation fragility be detected and reversed during execution?
- How does context budget create tradeoffs between memory and skills?
- Which memory components trigger context-length problems in agents?
- What update rules should govern dialogue-scoped versus turn-scoped memory?
- How can memory shift from a passive datastore to an actively trained component?
- How do parallel and sequential retrieval strategies compare in compute efficiency?
- Does conditional memory reduce computation alongside conditional sparsity?
- Does compressing all past memories into one representation lose irretrievable details?
- How does separating local and global context dependencies affect long-context performance?
- Can memory primitives become first-class design objects like computation sparsity?
- What makes memory consolidation fragile compared to raw trajectory storage?
- Can episodic raw memory outperform consolidated summaries in practice?
- Why does teacher forcing fail to capture long-range dependencies?
- Why do longer sequences tolerate higher sparsity than shorter ones?
- What mechanisms cause short contexts to degrade more under aggressive sparsity?
- How do prior errors in context history amplify future mistakes in long tasks?
- Why do short interaction benchmarks fail to predict long horizon performance?
- Why do hybrid memory systems outperform single-tier AI architectures?
- How does the hippocampus bind disparate elements without storing everything itself?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- Can test-time scaling compound through memory consolidation into a new scaling law?
- Why do long-context language models struggle with compositional reasoning tasks?
- Why does credit assignment through memory rewriting avoid expensive LLM parameter updates?
- When is numeric computation the real bottleneck versus reasoning depth?
- Can width-scaling replace depth-scaling on inherently sequential problems?
- Why does uniform memory consolidation sometimes degrade below the no-memory baseline?
- Can sleep-time compute reduce latency demands during model inference?
- Does sequence length affect sparsity tolerance the same way across task types?
- What limits the capacity of context-based fast adaptation channels?
- Can memory workspaces resolve contradictory evidence that stateless systems miss?
- How do external invocation latencies drive technique convergence?
- Can models consolidate context into weights during idle offline phases?
- Can KV cache pruning serve as an alternative to consolidation?
- When should architects prioritize consolidation compute over larger context windows?
- Does including full context always degrade memory retrieval quality in practice?
- What makes memory curation harder to solve than simply expanding storage?
- How do sleep-time and post-completion methods reduce inference latency?
- Why do language models ignore condensed memory even when it is the only memory?
- How does gist-first lookup compare to pure retrieval or context stuffing?
- How should memory systems split between short-term and long-term storage?
- Can task-agnostic compression of documents remain broadly useful for later queries?
- Why do LLMs degrade on long inputs before hitting context limits?
- Does recurrent memory or gist compression work better for ultra-long context?
- Can external managers optimize context better than the model itself?
- How do memory hierarchies and compression reduce context management demands?
- What structural updates prevent context collapse in evolving conversations?
- Why do weaker agents need more aggressive context compression than stronger ones?
- Can recurrent state mechanisms process longer sequences than attention-based working memory approaches?
- How do adaptive memory modules compare to feedback-based working memory for long context?
- What makes looped latent computation more efficient than scaling attention capacity?
- Why does attending to own latents work better than bolted-on external memory stores?
- How should agents compress episodic interactions into working memory without accumulation?
- Does retrieval quality depend more on access structure or write gating?
- Can externalizing bookkeeping to a stateful harness replace internalized memory control?
- How does the inference steps dial compare to test-time compute trade-offs in language models?
- How does externalized state affect the long-context bottleneck in language models?
- How does reducing activation precision further extend context length?
- How do recurrent memory systems handle ultra-long context differently than attention?
- Can fixed-size latent states losslessly store arbitrary input context?
- Can architectural changes reduce representational inequality in unified generators?
Related concepts in this collection (4)
- When should AI systems do their thinking?
  Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
  the general principle of shifting inference to idle windows; sleep-consolidation realizes it inside the weights
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  an alternative that adds long-term memory capacity rather than consolidating into fast weights
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  another non-capacity approach: prune the KV cache rather than consolidate it into weights
- Can models consolidate memories during offline sleep phases?
  This explores whether LLMs can use dedicated offline periods to consolidate short-term learning into permanent weights, avoiding catastrophic forgetting and the need for expensive retraining.
  a *different* paper sharing the exact title "Language Models Need Sleep" (2606.03979, Behrouz/Mirrokni); its sleep consolidates via upward distillation + RL dreaming rather than offline recurrence over evicted KV: same metaphor, different mechanism
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Recursive Language Models
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
- Language Models Need Sleep
- Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
- Sleep-time Compute: Beyond Inference Scaling at Test-time
- Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
- A Decomposition Perspective to Long-context Reasoning for LLMs
Original note title
the long-context bottleneck is compute to transform evicted context into internal state not memory capacity