When should AI systems do their thinking?

Most AI inference happens when users ask questions, but what if models could think during idle time instead? This note explores whether shifting inference to the time before queries arrive could change how inference systems are designed.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

The entire test-time scaling literature implicitly assumes inference happens when a query arrives. Sleep-time compute challenges this temporal assumption: in stateful applications, the model can "think" between interactions — precomputing inferences about persistent context that will be useful when queries arrive.

This is a temporal reframing, not just an efficiency trick. It draws a conceptual distinction between:

Context (stable background information — a codebase, a document, a conversation history)
Queries (ephemeral questions about that context)

Current test-time compute bundles context processing and query answering into the same inference call, forcing all thinking to happen at query time. Sleep-time compute separates them: process context when convenient, answer queries when required.
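The separation described above can be sketched in a few lines. This is a minimal illustration, not an implementation from the source: the class name `SleepTimeAgent` and the two callables `process_context` and `answer_query` are hypothetical stand-ins for whatever model calls a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SleepTimeAgent:
    """Toy illustration: context processing and query answering are separate calls."""

    # Hypothetical stand-ins for model calls (assumptions, not a real API).
    process_context: Callable[[str], str]    # expensive reasoning over the context
    answer_query: Callable[[str, str], str]  # cheap answer from (notes, query)
    _notes: Dict[str, str] = field(default_factory=dict)

    def sleep(self, context_id: str, context: str) -> None:
        """Run during idle time: reason over the context once and store the result."""
        self._notes[context_id] = self.process_context(context)

    def ask(self, context_id: str, context: str, query: str) -> str:
        """Run at query time: reuse the stored notes if an idle-time pass happened."""
        notes = self._notes.get(context_id)
        if notes is None:
            # No idle-time pass for this context, so fall back to thinking now.
            notes = self.process_context(context)
        return self.answer_query(notes, query)
```

In this sketch, calling `sleep` once and `ask` several times against the same `context_id` invokes `process_context` only once. The fallback branch reproduces the bundled query-time behaviour when no precomputed notes exist.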

The implications cascade: latency drops because the expensive thinking is already done, cost amortizes across the queries that share a context, and the model can afford deeper reasoning during context processing than would be economical at query time.
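The amortization argument can be made concrete with a deliberately simplified accounting model. The function below is an assumption for illustration only: it treats the context-processing cost as fully reusable across queries and ignores storage and staleness, neither of which the source discusses.

```python
def cost_per_query(context_cost: float, query_cost: float,
                   n_queries: int, precompute: bool) -> float:
    """Average compute cost per query under a simplified, hypothetical model.

    context_cost: cost of reasoning over the shared context once
    query_cost:   cost of answering a single query given processed context
    n_queries:    number of queries that share the same context
    precompute:   True if the context is processed once at sleep time
    """
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    if precompute:
        # Context is processed once; its cost is spread over all queries.
        return context_cost / n_queries + query_cost
    # Bundled inference: every query pays for context processing again.
    return context_cost + query_cost
```

Under this model, the per-query cost of the precomputed path approaches `query_cost` as the number of queries sharing a context grows, which is why a larger `context_cost` can become affordable.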

The deeper reframe: "thinking" is not a response to queries. It's a process that happens on a different timescale. Designing AI systems around this distinction could change inference architecture fundamentally.

Inquiring lines that use this note as a source 1

When should an AI system actively intervene versus remain silent?

Related concepts in this collection 5

Can models precompute answers before users ask questions? Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
the implementation of this reframing
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
a complementary rethinking of *how* to allocate compute
How do internal and external test-time scaling compare? Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
sleep-time compute is neither internal nor external; it fractures the dichotomy by shifting inference to a third temporal position
Can models treat long prompts as external code environments? Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
parallel spatial reframing: sleep-time asks WHEN to compute, RLMs ask WHERE to keep data; together they define two independent axes for rethinking inference architecture beyond the "everything in the context window at query time" default
Can storing evolved thoughts prevent inconsistent reasoning in conversations? When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?
concrete instantiation in conversational systems: TiM post-thinks between turns, exactly the temporal reframing this note proposes — reasoning happens after responses (not at queries) and persists as evolved memory
