Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?

Synthesis note · 2026-02-20 · sourced from Test Time Compute

The key finding from Snell et al. is that the effectiveness of inference-time compute varies sharply with how hard the prompt is relative to the base LLM's capabilities. Applying one fixed compute budget uniformly across prompts is therefore inefficient: easy prompts need little, while hard prompts need disproportionately more.

This motivates "compute-optimal" scaling: prescribing an adaptive, prompt-dependent strategy rather than a blanket allocation. The implication is significant: a model whose inference budget is reallocated adaptively can substantially outperform a larger model that spends its compute uniformly. The question is less how much total compute to spend than how to spend it, and the answer depends on the prompt.

This shifts the design question from "how much inference compute?" to "which prompts should get more compute, and by how much?" That question has more moving parts, but it becomes tractable once you have a difficulty estimator.
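The allocation question can be made concrete with a minimal sketch. It assumes a difficulty estimate in [0, 1] already exists for each prompt and simply splits a fixed sample budget in proportion to it. The function name `allocate_samples`, the proportional rule, and the `min_samples` floor are illustrative assumptions, not the procedure from Snell et al.

```python
def allocate_samples(difficulties, total_budget, min_samples=1):
    """Split total_budget samples across prompts, weighting harder prompts more.

    Hypothetical helper: `difficulties` are assumed to come from some
    difficulty estimator; the proportional rule is an illustration only.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if total_budget < n * min_samples:
        raise ValueError("budget too small to give every prompt min_samples")

    # Every prompt gets the floor; the remainder is shared by difficulty.
    spare = total_budget - n * min_samples
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / total_difficulty for d in difficulties]

    # Largest-remainder rounding so the allocation sums exactly to the budget.
    alloc = [min_samples + int(s) for s in shares]
    leftover = total_budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


# Three prompts (easy, medium, hard) sharing a budget of 30 samples.
budget_split = allocate_samples([0.1, 0.5, 0.9], total_budget=30)
```

A uniform policy would give each of the three prompts 10 samples. Here the hard prompt receives the largest share and the easy one the smallest, while the total stays at 30.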

Sub-token granularity via byte-level models: BLT (Byte Latent Transformer) implements adaptive compute at a finer grain than prompt-level allocation. It operates on raw bytes and groups them into variable-length patches based on next-byte entropy, so high-entropy (surprising, information-dense) byte sequences receive more computation and predictable ones receive less. This is adaptive compute below the token level, realized without any explicit difficulty estimator: the entropy of the byte stream is itself the difficulty signal. Combined with latent recurrence approaches that enable per-token adaptive depth, compute-optimal allocation now spans three granularity levels: prompt-level (Snell et al.), token-level (latent recurrence), and sub-token-level (BLT byte entropy). See Can byte-level models match tokenized performance with better efficiency?
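Entropy-based patching can be sketched as follows. This is a simplified illustration, not BLT's implementation: it assumes per-byte next-byte entropies have already been computed by some small byte-level model, and it uses a single global threshold as the boundary rule. The function name `entropy_patches` and the example entropy values are hypothetical.

```python
def entropy_patches(data: bytes, entropies, threshold: float):
    """Group bytes into variable-length patches.

    A new patch starts at every byte whose next-byte entropy exceeds
    `threshold`. `entropies[i]` is assumed to be the predictive entropy
    for byte i given the preceding bytes (supplied by the caller).
    """
    if len(data) != len(entropies):
        raise ValueError("need exactly one entropy value per byte")

    patches = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])  # close the current patch
            start = i                      # the surprising byte opens a new one
    if data:
        patches.append(data[start:])
    return patches


# Made-up entropies: high at word starts, low inside words.
text = b"hello world"
fake_entropies = [3.0, 0.5, 0.4, 0.3, 0.2, 2.5, 2.8, 0.6, 0.5, 0.4, 0.3]
patches = entropy_patches(text, fake_entropies, threshold=2.0)
```

Predictable stretches collapse into long patches, while surprising bytes open new ones. If each patch costs one step of the large model, as this sketch assumes, fewer and longer patches mean less compute on predictable text.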

Model routing as a complementary optimization axis: RouteLLM, Hybrid-LLM, and Avengers-Pro (from Arxiv/Routers) demonstrate that which model handles a query is an optimization dimension independent of how much compute each query receives. Avengers-Pro routes via embedding-cluster scoring and either surpasses GPT-5-medium by 7% or matches it at 27% lower cost. Hybrid-LLM adds a quality threshold that can be tuned at test time. The two axes, compute allocation and model selection, are independent and composable: route easy queries to a smaller model and give them less compute, or route hard queries to a larger model and give them more. Compute-optimal allocation now spans four dimensions: prompt-level budget (Snell et al.), token-level depth (latent recurrence), sub-token granularity (BLT), and model selection (routing). See Can routers select the right model before generation happens? and Can routing beat building one better model?
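How the two axes compose can be shown with a small sketch. It is not the routing logic of RouteLLM, Hybrid-LLM, or Avengers-Pro. For brevity it drives both decisions from one assumed difficulty score in [0, 1], although separate signals could drive each. The names `Plan` and `plan_query`, the model labels, and the default thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    model: str    # which model handles the query (routing axis)
    samples: int  # how many samples it may draw (compute axis)


def plan_query(difficulty: float, route_threshold: float = 0.5,
               min_samples: int = 1, max_samples: int = 16) -> Plan:
    """Pick a model and a sample budget for one query.

    `route_threshold` is a test-time knob: lowering it sends more
    queries to the large model, raising it saves cost.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")

    # Axis 1: model selection.
    model = "large" if difficulty >= route_threshold else "small"
    # Axis 2: compute budget, scaled linearly with difficulty.
    samples = min_samples + round(difficulty * (max_samples - min_samples))
    return Plan(model=model, samples=samples)


easy_plan = plan_query(0.0)
hard_plan = plan_query(1.0)
```

Because the two choices are computed separately, either can be replaced on its own, for example with a learned router for `model` or a different budget rule for `samples`.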

Related concepts in this collection 11

Can inference compute replace scaling up model size? Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
the consequence: adaptive allocation enables the substitution
How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
adaptive allocation is a meta-question that sits above this trade-off
Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
extends: search budget is a second adaptive-allocation axis alongside reasoning tokens; adaptive allocation must now optimize across both dimensions
Can byte-level models match tokenized performance with better efficiency? Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?
sub-token granularity: BLT implements adaptive compute at byte level via entropy-based patching
Can routers select the right model before generation happens? Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.
model selection as fourth dimension of compute-optimal allocation
Can routing beat building one better model? Does directing queries to specialized models via semantic clustering outperform investing in a single frontier model? This challenges whether model improvement or model selection drives performance gains.
empirical evidence: routing across model pool outperforms any single model
Does the choice of reasoning framework actually matter for test-time performance? Explores whether different slow-thinking methods like BoN and MCTS produce meaningfully different outcomes, or whether total compute budget is the dominant factor determining reasoning success.
complementary claim: this note says allocate budget adaptively per difficulty; that note says within the allocated budget, framework choice (BoN vs MCTS) is irrelevant because total compute determines efficacy; together they define the optimization space
Can retrieval be extended into multi-step chains like reasoning? Standard RAG retrieves once, but multi-hop tasks need intermediate steps. Can we train models to plan retrieval sequences the way chain-of-thought trains reasoning, and scale retrieval at test time?
extends adaptive compute allocation to retrieval: chain length and count become compute dials for retrieval-intensive tasks, adding a fifth dimension alongside prompt-level budget, token-level depth, sub-token granularity, and model selection
When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
applies adaptive allocation specifically to the retrieval trigger: FLARE allocates retrieval budget on uncertainty rather than at fixed intervals, the same "allocate where needed" logic applied at the token-confidence level
Does prompt optimization without inference strategy fail? Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?
constraint: adaptive budget allocation is necessary but not sufficient; the prompt itself must be co-optimized with the inference strategy, because prompts optimized at N=1 can become "deceiving" under scaled inference
Can minimal reasoning chains match full explanations? Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
amplifies adaptive allocation: CoD compresses individual chains to 7.6% of standard CoT tokens, enabling 13x more parallel chains within the same allocated budget — the combination of adaptive budget allocation (this note) with per-chain compression (CoD) creates a multiplicative efficiency gain
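The multiplicative claim in the last item can be checked with quick arithmetic. The 7.6% ratio comes from the note. The 1,000-token chain length and the 10,000-token budget are assumed round numbers for illustration, and `chains_within_budget` is a hypothetical helper.

```python
def chains_within_budget(budget_tokens: int, tokens_per_chain: int) -> int:
    """Number of whole reasoning chains that fit in a token budget."""
    return budget_tokens // tokens_per_chain


cot_tokens = 1000                        # assumed length of a standard CoT chain
cod_tokens = round(cot_tokens * 0.076)   # CoD chain at 7.6% of that length
budget = 10 * cot_tokens                 # assumed per-prompt token budget

cot_chains = chains_within_budget(budget, cot_tokens)
cod_chains = chains_within_budget(budget, cod_tokens)
ratio = cod_chains / cot_chains
```

With these assumed numbers the same budget holds 10 standard chains or 131 compressed ones, roughly the 13x figure. Any extra budget that adaptive allocation gives to a hard prompt is multiplied by the same factor.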

Original note title

compute-optimal scaling allocates inference budget adaptively per prompt difficulty
