SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

When should AI systems do their thinking?

Most AI inference happens when users ask questions, but what if models could think during idle time instead? This note explores whether shifting inference to before queries arrive could fundamentally change system design.

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The entire test-time scaling literature implicitly assumes inference happens when a query arrives. Sleep-time compute challenges this temporal assumption: in stateful applications, the model can "think" between interactions — precomputing inferences about persistent context that will be useful when queries arrive.

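This is a spatial/temporal reframing, not just an efficiency trick. It makes a conceptual distinction between processing context, which can happen whenever the context is available, and answering queries, which must happen when a query arrives.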

Current test-time compute bundles context processing and query answering into the same inference call, forcing all thinking to happen at query time. Sleep-time compute separates them: process context when convenient, answer queries when required.
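
The separation can be illustrated with a minimal sketch. Everything here is hypothetical: `expensive_think` is a stand-in for costly model reasoning over context (here it only counts words), and `SleepTimeAgent`, `sleep`, and `answer` are illustrative names, not an API from the source.

```python
from dataclasses import dataclass, field


def expensive_think(context: str) -> dict[str, int]:
    """Hypothetical stand-in for costly reasoning over persistent context.

    A real system would call a model here; this toy version counts words.
    """
    words = context.lower().split()
    return {"word_count": len(words), "unique_words": len(set(words))}


@dataclass
class SleepTimeAgent:
    context: str
    precomputed: dict[str, int] = field(default_factory=dict)
    think_calls: int = 0  # how many times the expensive step has run

    def sleep(self) -> None:
        """Idle-time phase: process the context once, before any query arrives."""
        self.precomputed = expensive_think(self.context)
        self.think_calls += 1

    def answer(self, query: str) -> int:
        """Query-time phase: a cheap lookup when sleep-time work is available."""
        if not self.precomputed:
            # No idle-time work was done, so the thinking happens at query time.
            self.precomputed = expensive_think(self.context)
            self.think_calls += 1
        return self.precomputed[query]


agent = SleepTimeAgent("the cat sat on the mat")
agent.sleep()                              # process context when convenient
count = agent.answer("word_count")         # answer queries when required
unique = agent.answer("unique_words")      # second query reuses the same work
```

In this toy, both queries are served from a single `expensive_think` call made during `sleep`; skipping `sleep` would push that same call into the first `answer`.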

The implications cascade: latency drops (the expensive thinking is pre-done), cost amortizes across multiple queries sharing the same context, and the model can invest more sophisticated reasoning in context processing than would be economically feasible at query time.
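
The amortization claim can be made concrete with a simple cost model. This is a sketch under stated assumptions, not a result from the source: costs are in arbitrary units, every query against a shared context is assumed to cost the same, and all numbers are invented for illustration.

```python
def query_time_only_cost(n_queries: int, context_cost: int, answer_cost: int) -> int:
    """Every query pays for context processing and for answering (assumed model)."""
    return n_queries * (context_cost + answer_cost)


def sleep_time_cost(n_queries: int, sleep_cost: int, answer_cost: int) -> int:
    """Context is processed once during idle time; each query pays only for answering."""
    return sleep_cost + n_queries * answer_cost


# Hypothetical numbers: the sleep-time pass is 3x more thorough (and expensive)
# than the per-query context processing it replaces.
CONTEXT_COST = 10
ANSWER_COST = 1
SLEEP_COST = 30

for n in (1, 3, 5):
    baseline = query_time_only_cost(n, CONTEXT_COST, ANSWER_COST)
    amortized = sleep_time_cost(n, SLEEP_COST, ANSWER_COST)
    print(f"{n} queries: query-time only = {baseline}, sleep-time = {amortized}")
```

Under these assumed numbers the two approaches break even at three queries per context. Below that, the heavier idle-time pass costs more than it saves; above it, each additional query widens the gap in favor of sleep-time compute.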

The deeper reframe: "thinking" is not a response to queries. It's a process that happens on a different timescale. Designing AI systems around this distinction could change inference architecture fundamentally.

Original note title

sleep-time compute reframes when AI thinks not how much it thinks