INQUIRING LINE

What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?

This explores the deployment cost of *when* and *how often* a model adjusts itself at inference — adapting in a single forward pass (cheap, fixed) versus looping through observation, dialogue, or expert-composition before committing (costly, but more responsive).


This reads the question as a tradeoff axis: single-pass adaptation steers the model once and ships the answer, while multi-pass adaptation lets the model inspect a task, route, reflect, or recompose itself before answering. The corpus suggests the choice is less about raw capability and more about what you're willing to pay at deployment — in latency, in extra forward passes, and in operational complexity.

On the single-pass side, the appeal is that adaptation happens inside the normal decoding path with no weight surgery and no second look. Proxy-tuning ("Can decoding-time tuning preserve knowledge better than weight fine-tuning?") shifts the output distribution at decoding time while leaving base weights untouched, closing 88-91% of the alignment gap without the knowledge corruption that direct fine-tuning causes in lower layers. Fast-Slow Training ("Can splitting adaptation into two channels reduce forgetting?") makes a similar bet: push task-specific lessons into the prompt (fast, textual, disposable) and keep parameter updates minimal, reaching equivalent performance 1.4–3x faster and with far less forgetting. The lesson across both is that a lot of adaptation can be delivered as a one-shot distributional nudge rather than a structural change.
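The decoding-time shift described above can be sketched as logit arithmetic. This is a minimal sketch, assuming the commonly described proxy-tuning formulation in which the difference between a small tuned model and its untuned counterpart is added to the large base model's next-token logits. The function names and the toy logits are illustrative assumptions, not taken from the notes.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, anti_expert):
    """Offset the base model's logits by (tuned small - untuned small).

    Assumed formulation; the base model's weights are never modified.
    """
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
base_logits = [2.0, 1.0, 0.5]         # large untuned model
expert_logits = [0.5, 2.0, 0.0]       # small tuned model
anti_expert_logits = [1.5, 0.5, 0.0]  # small untuned model

shifted = proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
probs = softmax(shifted)

# The base model alone prefers token 0; the shifted distribution prefers token 1.
print(shifted, probs.index(max(probs)))
```

Under this formulation every decoding step queries three models, so "single-pass" here means a single decoding loop with no separate diagnosis stage, not a single forward computation.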

Multi-pass adaptation buys more, but charges for it. Transformer² ("Can models dynamically activate expert skills at inference time?") literally runs two passes, one to diagnose the task and one to compose the right expert vectors, outperforming LoRA with fewer parameters at the cost of an extra inference stage. Push further and you get test-time learning systems like ARIA ("Can LLMs learn reliably at test time without human oversight?") and Reflexion ("Can agents learn from failure without updating their weights?"), which adapt by talking to themselves across turns or episodes, storing verbal reflections and timestamped knowledge. These improve without touching weights, but they introduce the deployment headaches the cheaper methods avoid: ARIA can't autonomously reconcile contradictory rules and needs a human in the loop, and reflection only works when feedback is unambiguous enough to keep the model from rationalizing its failures.
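The two-pass pattern can be sketched as a diagnose-then-compose dispatch. This is a toy sketch under stated assumptions: `diagnose_task` is a hypothetical keyword stub standing in for what would be a real model inference in the first pass, and the expert vectors and singular values are made-up numbers. The only idea taken from the notes is that expert vectors rescale singular values and can be mixed at inference.

```python
def diagnose_task(prompt):
    """Pass 1 (hypothetical stub): return mixing weights over experts."""
    text = prompt.lower()
    if "prove" in text or "solve" in text:
        return {"math": 0.8, "code": 0.2}
    if "def " in text or "function" in text:
        return {"math": 0.1, "code": 0.9}
    return {"math": 0.5, "code": 0.5}


def mix_expert_vectors(weights, experts):
    """Blend per-singular-value scaling vectors using the pass-1 weights."""
    size = len(next(iter(experts.values())))
    return [
        sum(weights[name] * experts[name][i] for name in weights)
        for i in range(size)
    ]


def adapt_singular_values(singular_values, scale):
    """Pass 2 setup: rescale each singular value by the mixed expert vector."""
    return [s * z for s, z in zip(singular_values, scale)]


# Illustrative expert vectors and base singular values (made-up numbers).
EXPERTS = {"math": [1.2, 1.0, 0.8], "code": [0.9, 1.1, 1.0]}
BASE_SINGULAR_VALUES = [3.0, 2.0, 1.0]

weights = diagnose_task("Solve for x: 2x + 3 = 7")
scale = mix_expert_vectors(weights, EXPERTS)
adapted = adapt_singular_values(BASE_SINGULAR_VALUES, scale)
print(weights, scale, adapted)
```

The deployment cost shows up in the control flow: the answer cannot start decoding until the diagnosis step has finished and the adapted weights are in place.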

The sharpest tradeoff is *spending compute you can't recover*. Two notes bear on it. The first warns that more inference passes don't substitute for the right training: reasoning models persistently beat non-reasoning ones no matter how large the inference budget ("Can non-reasoning models catch up with more compute?"), because training installs a protocol that makes extra tokens productive, so burning multi-pass compute on a model that never learned to use it is wasted spend. The second, Thinkless ("Can models learn when to think versus respond quickly?"), turns this into a routing decision: train the model to *choose* single-pass (quick answer) versus multi-pass (extended thinking) per query, so you only pay the multi-pass cost when the problem warrants it.
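The routing argument reduces to simple expected-cost arithmetic. The sketch below assumes hypothetical per-query costs and a hypothetical fraction of queries routed to the multi-pass path; none of these numbers come from the notes, and the router's own overhead is ignored.

```python
def expected_cost(p_multi, cost_single, cost_multi):
    """Average cost per query when a fraction p_multi takes the multi-pass path."""
    return (1.0 - p_multi) * cost_single + p_multi * cost_multi


# Hypothetical relative costs (arbitrary units, not measured values).
COST_SINGLE = 1.0  # quick, direct answer
COST_MULTI = 8.0   # extended thinking or multiple passes

always_multi = expected_cost(1.0, COST_SINGLE, COST_MULTI)
routed = expected_cost(0.25, COST_SINGLE, COST_MULTI)  # 25% of queries escalate
savings = 1.0 - routed / always_multi

print(always_multi, routed, savings)
```

With these made-up numbers, routing a quarter of the traffic to the expensive path costs 2.75 units per query instead of 8.0. The saving only holds if the router escalates the right queries, which is the part that has to be learned.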

The thing worth taking away: the real frontier isn't 'one pass or many' as a fixed policy. It's making the model itself decide, and pairing that with an architecture chosen for the inference budget you actually have ("Can architecture choices improve inference efficiency without sacrificing accuracy?"). Adaptive routing plus inference-aware architecture lets you reserve expensive multi-pass adaptation for the cases that repay it, instead of taxing every query.


Sources (8 notes)

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can LLMs learn reliably at test time without human oversight?

ARIA demonstrates that LLMs can adapt during inference through three integrated components: structured self-dialogue for uncertainty assessment, timestamped knowledge bases for conflict detection, and human-mediated resolution queries. Autonomous systems fail at reconciling contradictory rules because the correct choice depends on context outside the system.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a deployment systems analyst. The question remains open: which single-pass vs. multi-pass inference adaptation tradeoff is optimal for a given latency budget, task uncertainty, and feedback regime?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable constraints to re-test.
• Single-pass methods (proxy-tuning, fast-slow training) close alignment gaps without weight surgery or forgetting, shifting output distribution in one decoding pass (~2024–2025).
• Two-pass adaptation (Transformer²) outperforms LoRA with fewer parameters but costs an extra inference stage; Thinkless learns *when* to invoke multi-pass per query, reducing wasteful compute (~2025).
• Test-time learning systems (ARIA, Reflexion) improve without touching weights via self-dialogue and episodic memory, but require unambiguous feedback and cannot autonomously reconcile contradictions (~2024–2025).
• Non-reasoning models do not catch up to reasoning models regardless of inference budget, because training installs a *protocol* that makes extra tokens productive; multi-pass compute on models never trained to use it is wasted spend (~2025).
• Inference-aware architecture + adaptive routing reserves expensive multi-pass adaptation for queries that warrant it, rather than taxing every query (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2501.06252 (Transformer²: Self-adaptive LLMs, Jan 2025)
• arXiv:2505.13379 (Thinkless: LLM Learns When to Think, May 2025)
• arXiv:2507.17131 (Self-Improving Agents, Human-In-The-Loop, Jul 2025)
• arXiv:2510.18245 (Scaling Laws Meet Model Architecture, Oct 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, assess whether newer models, training methods (RL post-training, continued pretraining), inference optimizations (KV-cache, speculative decode), or orchestration (agentic loops, external memory) have since relaxed or overturned it. Separate durable questions (task routing, protocol-learning trade-offs) from perishable limits (which single-pass approximation suffices, whether feedback must be human-curated). Cite what relaxed which constraint.
(2) Surface the strongest *contradicting or reconciling* work from the last ~6 months—especially any papers showing single-pass methods matching multi-pass performance, or vice versa, under realistic deployment conditions.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does RL fine-tuning on reasoning-aware models make single-pass routing sufficient? Can adaptive routing be trained offline and deployed cold, or does it require online calibration?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
