When should AI systems do their thinking?
Most AI inference happens when users ask questions, but what if models could think during idle time instead? This note explores whether shifting inference to before queries arrive could fundamentally change system design.
The test-time scaling literature largely assumes, without stating it, that inference happens when a query arrives. Sleep-time compute challenges this temporal assumption: in stateful applications, the model can "think" between interactions, precomputing inferences about persistent context that will be useful when queries arrive.
This is a temporal reframing, not just an efficiency trick. It draws a conceptual distinction between:
- Context (stable background information — a codebase, a document, a conversation history)
- Queries (ephemeral questions about that context)
Current test-time compute bundles context processing and query answering into the same inference call, forcing all thinking to happen at query time. Sleep-time compute separates them: process context when convenient, answer queries when required.
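The separation described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular paper or library: `SleepTimeAgent`, its `sleep` and `answer` methods, the `ModelFn` signature, the prompt wording, and the token budgets are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical stand-in for a model call: (prompt, token_budget) -> text.
ModelFn = Callable[[str, int], str]


@dataclass
class SleepTimeAgent:
    model: ModelFn
    sleep_budget: int = 4096  # assumed: large budget, spent while idle
    query_budget: int = 256   # assumed: small budget, spent while the user waits
    notes: Dict[str, str] = field(default_factory=dict)

    def sleep(self, context_id: str, context: str) -> None:
        """Process stable context ahead of time and cache the result."""
        prompt = f"Draw out inferences likely to be useful later:\n{context}"
        self.notes[context_id] = self.model(prompt, self.sleep_budget)

    def answer(self, context_id: str, context: str, query: str) -> str:
        """Answer a query, reusing precomputed notes when they exist."""
        notes = self.notes.get(context_id)
        if notes is None:
            # No sleep-time pass yet: context and query are bundled into
            # one expensive call, as in ordinary test-time compute.
            prompt = f"{context}\n\nQuestion: {query}"
            return self.model(prompt, self.sleep_budget)
        prompt = f"Notes:\n{notes}\n\nQuestion: {query}"
        return self.model(prompt, self.query_budget)
```

Calling `sleep` once and `answer` several times for the same `context_id` spends the large budget once and the small budget per query; calling `answer` for a context that was never processed falls back to the bundled, query-time path.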
The implications cascade: latency drops (the expensive thinking is pre-done), cost amortizes across multiple queries sharing the same context, and the model can invest more sophisticated reasoning in context processing than would be economically feasible at query time.
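The amortization argument can be made concrete with a little arithmetic. The sketch below assumes a simple cost model that is not taken from the source: one sleep-time pass with a fixed cost, a light per-query cost when precomputed notes are available, and a heavy per-query cost when all thinking happens at query time. The function names and the token figures in the usage note are illustrative only.

```python
import math


def cost_per_query(sleep_cost: float, query_cost: float, n_queries: int) -> float:
    """Average cost per query when one sleep-time pass is shared by n queries."""
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    return sleep_cost / n_queries + query_cost


def break_even_queries(sleep_cost: float, light_query_cost: float,
                       heavy_query_cost: float) -> float:
    """Number of queries at which precomputing matches paying the heavy cost each time."""
    saving = heavy_query_cost - light_query_cost
    if saving <= 0:
        # Precomputation never pays off if it does not make queries cheaper.
        return math.inf
    return sleep_cost / saving
```

With made-up figures of 10,000 tokens for the sleep-time pass, 500 tokens per query afterwards, and 3,000 tokens per query without precomputation, `break_even_queries(10_000, 500, 3_000)` gives 4.0, and `cost_per_query(10_000, 500, 10)` gives 1,500 tokens per query over ten queries. Under this model, a context that receives only one or two queries is cheaper to handle at query time.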
The deeper reframe: "thinking" is not a response to queries. It's a process that happens on a different timescale. Designing AI systems around this distinction could change inference architecture fundamentally.
Related concepts in this collection 5
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  the implementation of this reframing
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  a complementary rethinking of *how* to allocate compute
- How do internal and external test-time scaling compare?
  Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
  sleep-time compute is neither internal nor external; it fractures the dichotomy by shifting inference to a third temporal position
- Can models treat long prompts as external code environments?
  Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
  parallel spatial reframing: sleep-time asks WHEN to compute, RLMs ask WHERE to keep data; together they define two independent axes for rethinking inference architecture beyond the "everything in the context window at query time" default
- Can storing evolved thoughts prevent inconsistent reasoning in conversations?
  When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?
  concrete instantiation in conversational systems: TiM post-thinks between turns, exactly the temporal reframing this note proposes; reasoning happens after responses (not at queries) and persists as evolved memory
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Sleep-time Compute: Beyond Inference Scaling at Test-time
- Reasoning Models Can Be Effective Without Thinking
- Language Models Need Sleep
- Large Language Models Think Too Fast To Explore Effectively
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Thinkless: LLM Learns When to Think
- On the Reasoning Capacity of AI Models and How to Quantify It
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Original note title
sleep-time compute reframes when AI thinks not how much it thinks