SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation

Does prompt optimization without inference strategy fail?

Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?

Synthesis note · 2026-02-23 · sourced from Inference time scaling
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

Standard practice treats prompt optimization and inference scaling as independent steps: optimize the prompt first (via reward-based search, instruction tuning, etc.), then separately choose the inference strategy (best-of-N sampling, majority voting, etc.). IAPO, which optimizes the two jointly, shows that this decoupling is a methodological error with a measurable cost.

The mechanism: different prompts generate responses with different distributional properties. Some prompts produce outputs that are individually strong but do not benefit from aggregation. Their variance is low, so generating N samples and voting adds compute without improving quality. Other prompts produce more varied outputs: any single sample is weaker on average, but the correct answer is still the most frequent one, or high-quality samples appear often enough, so majority voting or best-of-N with a reward model can exploit the variance to select a high-quality response. A prompt optimized at N=1 will favor the first type. But if the deployment uses N=8 with majority voting, the second type can be the better choice.
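A minimal numeric sketch of this ranking flip, under simplifying assumptions that are not from the source: every question has only two candidate answers, samples are independent, and ties in an even-sized vote are broken by a fair coin. The two prompt profiles (prompt_a, prompt_b) and their numbers are invented for illustration.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority vote over n independent samples is correct
    when each sample is correct with probability p (two candidate answers).
    Ties, which only occur for even n, are broken by a fair coin."""
    total = 0.0
    for k in range(n + 1):
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:          # strict majority of correct samples
            total += prob_k
        elif 2 * k == n:       # exact tie: correct half of the time
            total += 0.5 * prob_k
    return total

def expected_accuracy(question_mix, n: int) -> float:
    """question_mix is a list of (share_of_questions, per_sample_accuracy)."""
    return sum(share * majority_vote_accuracy(p, n) for share, p in question_mix)

# Hypothetical prompt A: low variance. Always right on 70% of questions,
# always wrong on the other 30%, so extra samples never change the vote.
prompt_a = [(0.7, 1.0), (0.3, 0.0)]

# Hypothetical prompt B: higher variance. Right 65% of the time on every
# question, so the correct answer is the mode and voting can recover it.
prompt_b = [(1.0, 0.65)]

for n in (1, 8):
    print(n, expected_accuracy(prompt_a, n), expected_accuracy(prompt_b, n))
```

At N=1 the hypothetical prompt A wins (0.70 against 0.65). At N=8 prompt A is still at 0.70 while prompt B rises to roughly 0.80, so a prompt search run at N=1 would have picked the configuration that is worse at deployment.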

This creates "deceiving prompts" — prompts that appear optimal in single-shot evaluation but become suboptimal (or harmful) under inference scaling. The PSST algorithm addresses this by treating prompt selection and inference scale as a joint contextual best-arm identification problem, exploring prompt-inference configurations together rather than sequentially.
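To make the idea of exploring prompt-inference configurations together concrete, here is a small sketch that treats each (prompt, N) pair as one arm and runs a generic successive-halving elimination over simulated rewards. This is not the PSST algorithm: it ignores context, and the prompt profiles, the sample counts, the per-sample cost, and every function name are hypothetical choices made for illustration.

```python
import random

# Hypothetical prompts: each is a list of (share_of_questions, per_sample_accuracy).
PROMPTS = {
    "A": [(0.7, 1.0), (0.3, 0.0)],  # consistent: strong at N=1, voting adds nothing
    "B": [(1.0, 0.65)],             # noisy: weaker at N=1, voting helps
}
SAMPLE_COUNTS = [1, 3, 5, 9]        # odd values avoid ties in a two-answer vote
COST_PER_SAMPLE = 0.005             # assumed price of one sample, in utility units

def pull(prompt, n, rng):
    """Simulate one question: draw n samples, majority-vote, return utility."""
    shares, accuracies = zip(*PROMPTS[prompt])
    p = rng.choices(accuracies, weights=shares)[0]   # pick a question type
    correct = sum(rng.random() < p for _ in range(n))
    reward = 1.0 if 2 * correct > n else 0.0
    return reward - COST_PER_SAMPLE * n

def select_configuration(pulls_per_round=2000, seed=0):
    """Successive halving over joint (prompt, N) arms using cumulative means."""
    rng = random.Random(seed)
    arms = [(prompt, n) for prompt in PROMPTS for n in SAMPLE_COUNTS]
    totals = {arm: 0.0 for arm in arms}
    counts = {arm: 0 for arm in arms}
    while len(arms) > 1:
        for arm in arms:
            for _ in range(pulls_per_round):
                totals[arm] += pull(*arm, rng)
            counts[arm] += pulls_per_round
        # Keep the better half of the surviving configurations.
        arms.sort(key=lambda a: totals[a] / counts[a], reverse=True)
        arms = arms[: max(1, len(arms) // 2)]
    return arms[0]

best_prompt, best_n = select_configuration()
print(best_prompt, best_n)
```

Because the arms are joint configurations, the search can keep the noisy prompt at a larger N even though it would lose a comparison run at N=1. A sequential pipeline that first fixed the prompt at N=1 would never evaluate that configuration.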

The empirical results across six tasks: IAPO outperforms disjoint optimization by up to 25% and prompt-only optimization by up to 50%. The gains are consistent across mathematical reasoning, commonsense reasoning, and multi-objective text generation.

The practical implication for inference system design: any pipeline that optimizes prompts and inference strategies separately is leaving significant performance on the table. Building on the related note "Can we allocate inference compute based on prompt difficulty?", the IAPO finding adds a second dimension: not just how much inference compute to spend per prompt, but which prompt to use given the inference strategy. The two must be co-optimized.

Original note title

prompt optimization decoupled from inference scaling produces systematic misalignment — joint optimization outperforms disjoint by up to 50 percent