SYNTHESIS NOTE

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The key finding from Snell et al. is that inference-time compute effectiveness varies dramatically based on how hard the prompt is relative to the base LLM's capabilities. A fixed compute budget applied uniformly across prompts is inefficient — easy prompts don't need much, hard ones need disproportionately more.

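This motivates "compute-optimal" scaling: choosing an adaptive, prompt-dependent strategy rather than a blanket allocation. The implication is significant: a smaller model whose inference budget is allocated adaptively can outperform a larger model given uniform compute. Snell et al. report more than a 4x efficiency gain over a best-of-N baseline and, on problems where the smaller model already attains a non-trivial success rate, results that beat a 14x larger model in a FLOPs-matched comparison. The question isn't only how much total compute to spend, but how to spend it, and the answer depends on the prompt.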

This shifts the design question from "how much inference compute?" to "which prompts should get more compute, and by how much?" — a harder question, but a more tractable one once you have a difficulty estimator.
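A minimal sketch of what prompt-dependent allocation could look like, assuming a difficulty score per prompt is already available. The function name `allocate_samples`, the proportional rule, and the example scores are illustrative assumptions, not the procedure from Snell et al.

```python
from typing import List


def allocate_samples(difficulties: List[float], total_samples: int,
                     min_samples: int = 1) -> List[int]:
    """Split a fixed sample budget across prompts in proportion to difficulty.

    `difficulties` are hypothetical non-negative scores from some estimator;
    higher means harder. Every prompt gets at least `min_samples`.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if any(d < 0 for d in difficulties):
        raise ValueError("difficulty scores must be non-negative")
    spare = total_samples - n * min_samples
    if spare < 0:
        raise ValueError("budget too small to give every prompt min_samples")

    weight_sum = sum(difficulties)
    if weight_sum > 0:
        shares = [spare * d / weight_sum for d in difficulties]
    else:
        # No difficulty signal at all: fall back to a uniform split.
        shares = [spare / n] * n

    alloc = [min_samples + int(s) for s in shares]
    # Hand out the samples lost to rounding down, largest fraction first,
    # so the allocation sums exactly to total_samples.
    leftover = total_samples - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


# Same total budget (12 samples), spent uniformly versus adaptively.
scores = [0.1, 0.2, 0.7]          # assumed estimator output, easy to hard
uniform = [12 // len(scores)] * len(scores)
adaptive = allocate_samples(scores, total_samples=12)
print(uniform, adaptive)
```

The total spend is identical in both cases; only the distribution changes. A real system would also pick a different strategy per difficulty level (for example search versus revision), which this sketch leaves out.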

Sub-token granularity via byte-level models: BLT (Byte Latent Transformer) implements adaptive compute at a finer grain than prompt-level allocation. It operates on raw bytes and groups them into variable-length patches based on next-byte entropy, so high-entropy (surprising, information-dense) byte sequences receive more computation and predictable ones receive less. This is sub-token adaptive compute realized without an explicit prompt-level difficulty estimator: the entropy of the byte stream, as estimated by a small byte-level language model, is the difficulty signal. Combined with latent recurrence approaches that enable per-token adaptive depth, compute-optimal allocation now spans three granularity levels: prompt-level (Snell et al.), token-level (latent recurrence), and sub-token-level (BLT byte entropy). See "Can byte-level models match tokenized performance with better efficiency?"
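The entropy-driven grouping can be sketched as follows. This is a simplified illustration, assuming per-byte entropies are supplied by some small byte-level model and that a single global threshold decides boundaries; the function name `entropy_patches`, the threshold, and the entropy values are made up for the example and are not taken from the BLT implementation.

```python
from typing import List


def entropy_patches(data: bytes, entropies: List[float],
                    threshold: float) -> List[bytes]:
    """Group bytes into patches, starting a new patch at high-entropy bytes.

    entropies[i] is the (assumed) entropy of a small model's prediction for
    byte i given the bytes before it. A byte that is hard to predict opens a
    new patch; predictable bytes are appended to the current one.
    """
    if len(data) != len(entropies):
        raise ValueError("need exactly one entropy value per byte")
    patches: List[bytes] = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


# Hypothetical entropies: the first byte of each word is hard to predict,
# the rest of the word is easy.
text = b"the cat"
ent = [3.0, 0.4, 0.3, 0.2, 2.5, 0.6, 0.5]
patches = entropy_patches(text, ent, threshold=2.0)
print(patches)
```

If each patch costs one step of the large model, a long predictable run is amortized into a single step, while unpredictable regions are cut into more, shorter patches and so receive more steps per byte.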

Model routing as a complementary optimization axis: RouteLLM, Hybrid-LLM, and Avengers-Pro (from Arxiv/Routers) demonstrate that which model handles a query is a separate optimization dimension from how much compute each query receives. Avengers-Pro routes via embedding-cluster scoring and surpasses GPT-5-medium by 7% or matches it at 27% lower cost. Hybrid-LLM adds a quality threshold that can be tuned at test time. The two axes, compute allocation and model selection, are composable: route to a smaller model and give it less compute on easy queries, or route to a larger model and give it more compute on hard ones. Compute-optimal allocation now spans four dimensions: prompt-level budget (Snell et al.), token-level depth (latent recurrence), sub-token granularity (BLT), and model selection (routing). See "Can routers select the right model before generation happens?" and "Can routing beat building one better model?"
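A rough sketch of how the two axes might compose, assuming two separate signals per query: a router score (the predicted chance that the small model is good enough) and a difficulty score that sets the sampling budget. All names here (`Plan`, `plan_query`, `quality_threshold`) are hypothetical and do not reproduce the RouteLLM, Hybrid-LLM, or Avengers-Pro methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    model: str    # which model handles the query
    samples: int  # how much inference compute the query receives


def plan_query(router_score: float, difficulty: float,
               quality_threshold: float = 0.5,
               min_samples: int = 1, max_samples: int = 16) -> Plan:
    """Choose a model and a sample budget from two independent signals.

    router_score: assumed probability that the small model suffices.
    difficulty:   assumed difficulty in [0, 1], used only for the budget.
    quality_threshold: tunable at test time; raising it sends more
    queries to the large model.
    """
    for name, value in (("router_score", router_score),
                        ("difficulty", difficulty)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    model = "small" if router_score >= quality_threshold else "large"
    # Linear interpolation between the minimum and maximum budget.
    samples = min_samples + int(difficulty * (max_samples - min_samples))
    return Plan(model=model, samples=samples)


easy = plan_query(router_score=0.9, difficulty=0.0)
hard = plan_query(router_score=0.2, difficulty=1.0)
print(easy, hard)
```

Because the model choice and the budget are computed from different inputs, either can be tuned without touching the other, for example by raising `quality_threshold` when quality matters more than cost.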

Original note title

compute-optimal scaling allocates inference budget adaptively per prompt difficulty