Can test-time compute budgets be allocated differently per query difficulty?
This explores whether systems can spend more inference-time compute on hard queries and less on easy ones: adaptive budgeting per query difficulty, rather than treating every prompt the same. The corpus answers yes, clearly, and the gains are large: dynamically adjusting how much compute a model spends per prompt beats uniform spending, because flat budgets waste resources on easy problems while starving hard ones (see "How should we allocate compute budget at inference time?"). The sharpest version of this finding is that you don't even need more total compute: reallocating the *same* budget, giving easy prompts less and hard ones more, can outperform a bigger model running under a uniform budget (see "Can we allocate inference compute based on prompt difficulty?").
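As a minimal sketch of what reallocating a fixed budget could look like (an illustration, not the method of any cited note), the following splits a total number of samples across prompts in proportion to a difficulty score. The name `allocate_budget`, the `min_samples` floor, and the premise that a per-prompt difficulty score is already available are all assumptions made for the example.

```python
def allocate_budget(difficulties, total_samples, min_samples=1):
    """Split a fixed sample budget across prompts in proportion to difficulty.

    difficulties: non-negative scores, one per prompt (hypothetical; assumed
                  to come from some external difficulty estimator).
    total_samples: total number of samples available for the whole batch.
    min_samples: floor guaranteed to every prompt, however easy.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if any(d < 0 for d in difficulties):
        raise ValueError("difficulty scores must be non-negative")
    if total_samples < n * min_samples:
        raise ValueError("budget too small to give every prompt the floor")

    spare = total_samples - n * min_samples
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        # No difficulty signal: fall back to a uniform split of the spare budget.
        shares = [spare / n] * n
    else:
        shares = [spare * d / weight_sum for d in difficulties]

    # Give each prompt its floor plus the integer part of its share.
    alloc = [min_samples + int(s) for s in shares]

    # Hand the remaining units to the largest fractional remainders.
    leftover = total_samples - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Same total budget as a uniform split, but skewed toward the hard prompt.
print(allocate_budget([1, 3, 6], total_samples=13))  # [2, 4, 7]
```

The total is unchanged by construction, which is the point of the finding above: only the distribution across prompts moves.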
What makes this interesting is that it reframes inference compute as a resource that trades against model size. On difficult prompts especially, a smaller model given more thinking time can match a much larger one, meaning pretraining compute and inference compute aren't separate budgets but partly interchangeable ones (see "Can inference compute replace scaling up model size?"). That substitution is exactly why difficulty-aware allocation matters: the payoff from extra compute is concentrated on the hard tail, so spending uniformly leaves most of the value on the table.
But "allocate more compute" isn't a single dial but several, and the corpus suggests the *shape* of the spend matters as much as the amount. You can scale in parallel (sample many independent attempts for coverage) or sequentially (reason deeper in one chain), and the right choice depends on the task: parallel wins for independent short problems, sequential for compositional chains that must accumulate intermediate results (see "How should we balance parallel versus sequential compute at test time?"). On genuinely structured problems like graph connectivity, sequential chain-of-thought beats parallel voting by an exponential margin, because the solution actually requires building up steps in order (see "When does sequential reasoning beat parallel voting?"). So a fully adaptive allocator would tune not just *how much* but *which kind* of compute per query.
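The parallel side of this tradeoff can be made concrete with the standard coverage calculation: if each independent attempt succeeds with probability p, at least one of k attempts succeeds with probability 1 - (1 - p)^k. The sketch below assumes independent samples and a perfect way to recognise a correct attempt, neither of which is claimed by the notes; the function names are hypothetical. It says nothing about the sequential case, where the needed intermediate steps cannot be recovered by sampling short chains.

```python
import math


def coverage(p, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def samples_for_coverage(p, target):
    """Smallest k with coverage(p, k) >= target, or None if unreachable.

    Assumes independent attempts with per-attempt success probability p.
    """
    if not 0.0 <= p <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 <= p <= 1 and 0 <= target < 1")
    if p == 0.0:
        return None  # no number of parallel samples helps
    if p == 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))


# An easier prompt (p = 0.5) versus a harder one (p = 0.1), same 90% target.
print(samples_for_coverage(0.5, 0.9))  # 4
print(samples_for_coverage(0.1, 0.9))  # 22
```

Under these assumptions the harder prompt needs several times the samples for the same coverage, and a prompt with p = 0 gains nothing from parallel sampling at all.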
There's also a subtler caution in the corpus: the framework you use to spend the budget matters less than people assume. An information-theoretic analysis found that best-of-N and Monte Carlo tree search converge in accuracy once you control for total compute; what actually limits you is search scope and the reliability of your reward/value function, not the specific algorithm (see "Does the choice of reasoning framework actually matter for test-time performance?"). And the whole approach has a floor: a model that wasn't trained to reason productively can't be rescued by more inference budget, because additional tokens only pay off if training instilled a protocol that makes them productive (see "Can non-reasoning models catch up with more compute?").
The idea also generalizes beyond reasoning tokens. In agentic deep-research systems, *search* budget follows the same scaling curve as reasoning tokens (monotonic with diminishing returns), which opens a second axis to allocate against: models can trade reasoning budget for search budget per query to optimize answer quality (see "Does search budget scale like reasoning tokens for answer quality?" and "How does search scale like reasoning in agent systems?"). And the difficulty-aware logic even leaks back into training: "thinking-augmented" pretraining naturally gives harder tokens longer reasoning traces, a built-in compute-allocation mechanism that mirrors test-time scaling (see "Can training data augmentation match test-time compute scaling benefits?"). The thread running through all of it: difficulty-proportional compute is one of the most reliable free lunches in current LLM inference, but only on top of a model trained to use the compute well.
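To illustrate the reasoning-versus-search trade described above, here is a small sketch that splits a per-query budget between two axes by always funding whichever has the larger marginal gain. The logarithmic gain curves are invented for the example and stand in only for "monotonic with diminishing returns"; the notes do not specify the curve shapes, and `split_budget` is a hypothetical name. Greedy allocation of this kind is optimal when every gain curve is concave.

```python
import math


def split_budget(total_units, gain_fns):
    """Greedily assign budget units to the axis with the best marginal gain.

    gain_fns: dict mapping axis name -> function(units) -> estimated quality.
              Each function is assumed to be monotonic and concave.
    """
    alloc = {name: 0 for name in gain_fns}
    for _ in range(total_units):
        def marginal(name):
            fn = gain_fns[name]
            return fn(alloc[name] + 1) - fn(alloc[name])

        best = max(gain_fns, key=marginal)
        alloc[best] += 1
    return alloc


# Hypothetical curves: reasoning pays off twice as fast as search on this query.
curves = {
    "reasoning": lambda units: 2.0 * math.log1p(units),
    "search": lambda units: math.log1p(units),
}
print(split_budget(6, curves))  # {'reasoning': 4, 'search': 2}
```

A difficulty-aware system would presumably swap in different curves per query, so a lookup-heavy question shifts units toward search while a multi-step derivation shifts them toward reasoning.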
Sources (10 notes)
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Parallel methods improve coverage; sequential methods enable depth. The optimal choice depends on task structure: parallel wins for independent short problems, sequential for compositional chains requiring intermediate accumulation.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.
Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency by 3x and reasoning benchmark performance by 10% or more for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.