What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?
This explores the deployment cost of *when* and *how often* a model adjusts itself at inference — adapting in a single forward pass (cheap, fixed) versus looping through observation, dialogue, or expert-composition before committing (costly, but more responsive).
This reads the question as a tradeoff axis: single-pass adaptation steers the model once and ships the answer, while multi-pass adaptation lets the model inspect a task, route, reflect, or recompose itself before answering. The corpus suggests the choice is less about raw capability and more about what you're willing to pay at deployment — in latency, in extra forward passes, and in operational complexity.
On the single-pass side, the appeal is that adaptation happens inside the normal decoding path with no weight surgery and no second look. Proxy-tuning (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?") shifts the output distribution at decoding time while leaving base weights untouched, closing most of the alignment gap without the knowledge corruption that direct fine-tuning causes in lower layers. Fast-Slow Training (see "Can splitting adaptation into two channels reduce forgetting?") makes a similar bet: push task-specific lessons into the prompt (fast, textual, disposable) and keep parameter updates minimal, reaching the same performance faster and with far less forgetting. The lesson across both is that a lot of adaptation can be delivered as a one-shot distributional nudge rather than a structural change.
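The "distributional nudge" can be made concrete with a small sketch. It assumes the commonly described proxy-tuning formulation, in which the large base model's logits are shifted by the difference between a small tuned model (the "expert") and its untuned counterpart (the "anti-expert"). The function names, the three-token vocabulary, and all numbers are illustrative assumptions, not taken from the note.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits):
    """Shift the base logits by (expert - anti_expert); base weights are never modified."""
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)

# Toy next-token logits over a 3-token vocabulary (made-up values).
base = [2.0, 1.0, 0.0]    # large untuned model
expert = [0.0, 2.0, 0.0]  # small tuned model (assumed)
anti = [0.0, 0.0, 0.0]    # small untuned model (assumed)

probs = proxy_tuned_probs(base, expert, anti)
top_token = probs.index(max(probs))
```

In this toy case the base model alone prefers token 0, while the shifted distribution prefers token 1. The deployment cost is that every decoding step needs logits from three models rather than one, even though it remains a single decoding pass.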
Multi-pass adaptation buys more, but charges for it. Transformer² (see "Can models dynamically activate expert skills at inference time?") literally runs two passes, one to diagnose the task and one to compose the right expert vectors. It outperforms LoRA with fewer parameters, but at the cost of an extra inference stage. Push further and you get test-time learning systems like ARIA (see "Can LLMs learn reliably at test time without human oversight?") and Reflexion (see "Can agents learn from failure without updating their weights?"), which adapt by talking to themselves across turns or episodes, storing verbal reflections and timestamped knowledge. These improve without touching weights, but they introduce the deployment headaches the cheaper methods avoid: ARIA can't autonomously reconcile contradictory rules and needs a human in the loop, and reflection only works when feedback is unambiguous enough to prevent the model from rationalizing its failures.
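The two-pass structure can be sketched as follows. This is a rough illustration, not the Transformer² implementation: the keyword-based `diagnose_task` is a hypothetical stand-in for the first pass (which in practice would use the model itself), and the elementwise rescaling of singular values is one reading of "tuning only singular values". All names and numbers are assumptions.

```python
def diagnose_task(prompt, keyword_map):
    """Pass 1 (stand-in): weight each expert by keyword hits in the prompt."""
    text = prompt.lower()
    scores = {
        name: sum(text.count(k) for k in keywords)
        for name, keywords in keyword_map.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No signal at all: fall back to a uniform mix of experts.
        return {name: 1.0 / len(scores) for name in scores}
    return {name: s / total for name, s in scores.items()}

def compose_singular_values(base_sigma, expert_vectors, weights):
    """Pass 2: mix expert vectors by weight, then rescale base singular values."""
    mixed = [
        sum(weights[name] * vec[i] for name, vec in expert_vectors.items())
        for i in range(len(base_sigma))
    ]
    return [s * z for s, z in zip(base_sigma, mixed)]

# Hypothetical singular values of one weight matrix and two expert vectors.
base_sigma = [3.0, 2.0, 1.0]
experts = {"math": [1.2, 1.0, 0.5], "code": [0.8, 1.0, 1.5]}
keywords = {"math": ["integral", "prove"], "code": ["python", "function"]}

weights = diagnose_task("Write a Python function", keywords)
adapted_sigma = compose_singular_values(base_sigma, experts, weights)
```

The extra stage is visible in the control flow: `diagnose_task` has to finish before the adapted weights exist, so the answer-producing pass cannot start until the diagnostic pass is done.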
The sharpest tradeoff is *spending compute you can't recover*. Two notes warn that more inference passes don't substitute for the right training. Reasoning models persistently beat non-reasoning ones no matter how large the inference budget (see "Can non-reasoning models catch up with more compute?"), because training installs a protocol that makes extra tokens productive; burning multi-pass compute on a model that never learned to use it is wasted spend. Thinkless (see "Can models learn when to think versus respond quickly?") turns this into a routing decision: train the model to *choose* single-pass (quick answer) versus multi-pass (extended thinking) per query, so you only pay the multi-pass cost when the problem warrants it.
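A back-of-the-envelope sketch shows why per-query routing changes the bill. The `route` function below is a hypothetical stand-in: Thinkless learns its mode choice during training rather than thresholding a score, and the probabilities and token costs here are invented for illustration.

```python
def route(p_think, threshold=0.5):
    """Pick a mode from an assumed per-query probability of needing extended thinking."""
    return "think" if p_think >= threshold else "direct"

def total_cost(p_thinks, direct_cost, think_cost, threshold=0.5):
    """Sum per-query costs, paying the multi-pass price only when routed to 'think'."""
    return sum(
        think_cost if route(p, threshold) == "think" else direct_cost
        for p in p_thinks
    )

# Four hypothetical queries; costs are in generated tokens (made-up values).
queries = [0.1, 0.9, 0.2, 0.7]
routed = total_cost(queries, direct_cost=50, think_cost=800)
always_think = len(queries) * 800
savings = always_think - routed
```

With these made-up numbers, routing two of the four queries to the direct mode cuts the token bill from 3200 to 1700. The saving depends entirely on how often the router can safely pick the cheap mode, which is why the quality of the learned choice matters more than the threshold itself.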
The thing worth taking away: the real frontier isn't 'one pass or many' as a fixed policy. It's making the model itself decide, and pairing that with architecture chosen for the inference budget you actually have (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). Adaptive routing plus inference-aware architecture lets you reserve expensive multi-pass adaptation for the cases that repay it, instead of taxing every query.
Sources (8 notes)
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
ARIA demonstrates that LLMs can adapt during inference through three integrated components: structured self-dialogue for uncertainty assessment, timestamped knowledge bases for conflict detection, and human-mediated resolution queries. Autonomous systems fail at reconciling contradictory rules because the correct choice depends on context outside the system.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.