Can dynamic instance-specific prompt selection solve the generalization problem across tasks?

This explores whether choosing a tailored prompt for each input at inference time — rather than one fixed prompt — can make a model generalize across many different tasks, and where that approach hits a wall.

The corpus says the *premise* is sound but the *ceiling* is real: picking the right prompt per input beats betting on one universal template, yet the most durable solutions tend to drift away from prompting alone toward adapting the model itself.

The strongest case for prompt selection is that there is no single best prompt to begin with. A 23-prompt benchmark across 12 models found that rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning actually *hurts* high-performance ones: task structure, not generic best practice, decides what helps (see "Do prompt techniques work the same across all LLM tiers?"). If the optimal prompt swings by model tier and task, then selecting per instance isn't a hack, it's the honest response to that variance. And in principle the headroom is enormous: a single finite transformer provably exists that can compute any computable function given the right prompt (see "Can a single transformer become universally programmable through prompts?"). Prompts really can act like programs.
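To make "selecting per instance" concrete, here is a minimal sketch of a rule-based selector. Everything in it is an assumption for illustration: the template wording, the `Instance` fields, the tier labels `"cheap"` and `"high"`, and the rules themselves. The cited benchmark reports which prompt families helped which tiers; it does not prescribe this selector.

```python
from dataclasses import dataclass

# Hypothetical prompt templates; names and wording are illustrative only.
TEMPLATES = {
    "rephrase": "Restate the question in your own words, then answer it.\n\n{question}",
    "background": "List the background facts relevant to this question, then answer it.\n\n{question}",
    "step_by_step": "Think step by step, then give the final answer.\n\n{question}",
    "direct": "{question}",
}


@dataclass(frozen=True)
class Instance:
    question: str
    model_tier: str         # assumed labels: "cheap" or "high"
    needs_background: bool  # assumed per-input feature, e.g. from a classifier


def select_template(instance: Instance) -> str:
    """Pick a template name for one input using toy, hand-written rules."""
    if instance.model_tier == "high":
        # The benchmark summary reports step-by-step prompts hurting
        # high-performance models, so this sketch falls back to a plain prompt.
        return "direct"
    if instance.needs_background:
        return "background"
    return "rephrase"


def build_prompt(instance: Instance) -> str:
    """Fill the selected template with the instance's question."""
    return TEMPLATES[select_template(instance)].format(question=instance.question)
```

In practice the hand-written rules would likely be replaced by a learned selector scored on held-out data; the point of the sketch is only that the choice is a function of the input and the model, not a constant.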

But the same note carries the catch: standard training rarely produces models that *learn* to be programmed that way, so Turing-completeness is a statement about what's possible, not what's reachable by prompt search. Two hard ceilings hem in any selection strategy. First, prompting only reorganizes what's already in the model; it cannot supply foundational knowledge missing from training, so no prompt can rescue a task the model was never equipped for (see "Can prompt optimization teach models knowledge they lack?"). Second, even when the right information is sitting in the prompt, strong parametric priors can override it: the model ignores its own context, and textual prompting alone can't force the issue without intervening in the representations (see "Why do language models ignore information in their context?"). There's also a subtler trap: what instruction tuning actually teaches is the output *format*, not task understanding. Models trained on semantically empty instructions perform about as well as those given correct ones (see "Does instruction tuning teach task understanding or output format?"). So a cleverly selected prompt may be steering the output shape more than the reasoning.

Where the corpus gets interesting is the methods that keep the *spirit* of instance-specific selection (adapt at inference, per input) but move the lever off the prompt. Transformer² composes task-specific 'expert vectors' on the fly by tuning only the singular values of weight matrices, mixing experts dynamically at inference without interference (see "Can models dynamically activate expert skills at inference time?"). Thinkless learns to *route* each query between extended reasoning and a quick answer, self-calibrating without difficulty labels (see "Can models learn when to think versus respond quickly?"). And Fast-Slow Training treats the question as one of allocation: route task-specific lessons into optimized prompts while keeping weight updates minimal, reaching the same performance faster with far less catastrophic forgetting (see "Can splitting adaptation into two channels reduce forgetting?"). That last one is the cleanest reframing: it treats the textual context as the *fast* channel and weights as the *slow* one, and dynamic prompt selection is exactly an operation on the fast channel.
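The singular-value idea can be sketched in a few lines. This is a rough illustration under stated assumptions, not the Transformer² implementation: it assumes each expert is a vector with one scale factor per singular value, that experts are combined by a weighted sum, and that the mixing weights are supplied by some upstream step. The function names are hypothetical.

```python
from typing import List, Sequence


def mix_expert_vectors(
    experts: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted sum of expert vectors: z[i] = sum_k weights[k] * experts[k][i]."""
    if not experts or len(experts) != len(weights):
        raise ValueError("need at least one expert and one weight per expert")
    dim = len(experts[0])
    if any(len(e) != dim for e in experts):
        raise ValueError("all expert vectors must have the same length")
    return [sum(w * e[i] for w, e in zip(weights, experts)) for i in range(dim)]


def scale_singular_values(sigma: Sequence[float], z: Sequence[float]) -> List[float]:
    """Rescale each singular value by its factor in z.

    Assumption: the adapted weight matrix would be rebuilt as
    U @ diag(result) @ V^T, with U and V from the original matrix unchanged.
    """
    if len(sigma) != len(z):
        raise ValueError("sigma and z must have the same length")
    return [s * zi for s, zi in zip(sigma, z)]
```

Because only the scale factors change, the per-input adaptation is a small vector rather than a new set of weights, which is what makes mixing experts at inference cheap in this framing.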

So the honest answer the corpus points to: instance-specific prompt selection is a genuine lever on cross-task generalization, and it's the right move given how much the optimal prompt varies. But it 'solves' generalization only up to the model's existing knowledge and its willingness to listen to context. The frontier work doesn't abandon the idea — it splits adaptation into a fast textual channel and a slow parametric one, and the durable generalization gains keep showing up on the slow side. What you didn't know you wanted to know: the most promising version of 'pick the right prompt' is one that also quietly decides *when* prompting isn't enough and reaches for the weights instead.
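A selector that also decides when prompting isn't enough might look like the sketch below. It is hypothetical throughout: the scoring functions, the `threshold` default, and the `"prompt"`/`"weights"` labels are assumptions for illustration, and none of the cited notes specifies such a dispatcher.

```python
from typing import Callable, Dict, Tuple


def choose_channel(
    question: str,
    prompt_scorers: Dict[str, Callable[[str], float]],
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return ("prompt", name) for the best-scoring prompt, or ("weights", "")
    when no prompt is expected to be good enough.

    Each scorer is an assumed estimate of how well a prompt will do on this
    question, e.g. a predictor fit on validation data.
    """
    if not prompt_scorers:
        return ("weights", "")
    scores = {name: scorer(question) for name, scorer in prompt_scorers.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return ("prompt", best)  # fast channel: adapt through text
    return ("weights", "")       # slow channel: escalate to parameter adaptation
```

The design choice worth noting is the explicit fallback: the fast channel is tried first because it is cheap, and the slow channel is reserved for inputs where every available prompt scores poorly.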

Sources (8 notes)

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus a 42.6% random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
