Can demo placement be tuned as a task-specific hyperparameter?
This explores whether *where* you place demonstrations in a prompt — their position and ordering — is something you can deliberately tune per task, the way you'd tune a learning rate, rather than an incidental detail.
The corpus says yes, and more strongly than you'd expect: placement isn't a minor formatting choice, it's a lever with measurable, sometimes dramatic effects. The most direct evidence is that moving an *identical* block of demonstrations from the start of a prompt to the end can swing in-context-learning accuracy by up to 20% and flip nearly half the model's predictions (see "How much does demo position alone affect in-context learning accuracy?"). The content didn't change, only the position. That's the signature of a real hyperparameter: same input, different result depending on a setting you control.
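To make the two numbers above concrete, here is a minimal sketch of how one might measure them. Everything in it is an assumption for illustration: the `build_prompt` layout (demos before the instruction versus after the query), the function names, and the toy predictions are hypothetical and not taken from the cited note. It only shows that the accuracy change and the prediction flip rate are separate quantities.

```python
from typing import Dict, Sequence


def build_prompt(instruction: str, demos: Sequence[str], query: str,
                 demo_position: str = "start") -> str:
    """Assemble a prompt with the same demo block placed first or last.

    The exact layout is an assumption; real prompt templates vary.
    """
    demo_block = "\n".join(demos)
    if demo_position == "start":
        parts = [demo_block, instruction, query]
    elif demo_position == "end":
        parts = [instruction, query, demo_block]
    else:
        raise ValueError(f"unknown demo_position: {demo_position!r}")
    return "\n\n".join(parts)


def placement_effect(preds_start: Sequence[str], preds_end: Sequence[str],
                     gold: Sequence[str]) -> Dict[str, float]:
    """Compare predictions obtained with demos at the start vs. the end."""
    n = len(gold)
    if n == 0 or len(preds_start) != n or len(preds_end) != n:
        raise ValueError("inputs must be non-empty and the same length")
    acc_start = sum(p == g for p, g in zip(preds_start, gold)) / n
    acc_end = sum(p == g for p, g in zip(preds_end, gold)) / n
    # Flip rate counts any changed prediction, whether or not it became correct.
    flip_rate = sum(a != b for a, b in zip(preds_start, preds_end)) / n
    return {
        "acc_start": acc_start,
        "acc_end": acc_end,
        "acc_delta": acc_end - acc_start,
        "flip_rate": flip_rate,
    }


# Toy predictions (made up): accuracy is 0.75 under both placements,
# yet half of the individual predictions changed.
effect = placement_effect(
    preds_start=["a", "b", "a", "b"],
    preds_end=["a", "a", "b", "b"],
    gold=["a", "a", "a", "b"],
)
```

Reporting the flip rate alongside the accuracy delta is a deliberate choice in this sketch: a placement change can reshuffle many predictions while leaving aggregate accuracy nearly flat, so accuracy alone can understate the effect.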
But placement isn't one knob, it's two. Position (where the demos sit) is distinct from *order* (the sequence within the demo block), and the corpus shows order is tunable too, without hand-labeling difficulty. Sparsity-Guided Curriculum In-Context Learning uses the model's own last-layer activation sparsity to rank demonstrations from harder to easier, then arranges them in that curriculum, yielding solid gains across diverse tasks with no external difficulty labels (see "Can representation sparsity order few-shot demonstrations effectively?"). So you can let the model's internal signal pick the ordering automatically, which is exactly what 'tune it as a hyperparameter' should mean in practice: a setting you can search over or derive, not guess.
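The ordering idea can be sketched in a few lines, under loud assumptions. The source does not say how sparsity is computed, so this sketch uses the fraction of near-zero entries in a demo's activation vector as a stand-in; the threshold `eps`, the function names, and the toy activations are all hypothetical. Extracting real last-layer activations from a model is out of scope here. The sparse-first direction follows the source note's description of ordering from sparse (harder) to dense (easier).

```python
from typing import List, Sequence


def sparsity(activations: Sequence[float], eps: float = 1e-6) -> float:
    """Fraction of entries with magnitude <= eps (an assumed sparsity proxy)."""
    if len(activations) == 0:
        raise ValueError("activation vector must be non-empty")
    return sum(abs(x) <= eps for x in activations) / len(activations)


def order_by_sparsity(demos: Sequence[str],
                      activations_per_demo: Sequence[Sequence[float]],
                      eps: float = 1e-6) -> List[str]:
    """Return demos ordered from most sparse to least sparse.

    Ties keep their original relative order.
    """
    if len(demos) != len(activations_per_demo):
        raise ValueError("need exactly one activation vector per demo")
    scored = [(sparsity(acts, eps), i)
              for i, acts in enumerate(activations_per_demo)]
    # Sort by descending sparsity, then by original index for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [demos[i] for _, i in scored]


# Toy activations (made up): sparsities are 0.25, 0.75, and 0.5.
ordered = order_by_sparsity(
    ["demo A", "demo B", "demo C"],
    [[0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]],
)
# ordered == ["demo B", "demo C", "demo A"]
```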
Here's the part that answers the 'task-specific' half of your question. The same corpus repeatedly finds that the *right* ordering depends on the task type, so a single fixed placement policy won't be optimal everywhere. Omni-Thinker shows that training structured tasks before creative ones (a sequencing choice at the data level) prevents entropy collapse and beats joint training by 6.2%, and the benefit comes precisely from matching the schedule to how each domain's entropy behaves (see "Does training order reshape how models handle different task types?"). Preference tuning tells the same story from another angle: the same intervention reduces diversity in code but *increases* it in creative writing, because each domain rewards different things (see "Does preference tuning always reduce diversity the same way?"). Neither result is about demo placement directly, but the lesson carries over: ordering effects are domain-dependent, so the optimal setting is task-specific by nature. That is the whole premise of treating it as a per-task hyperparameter rather than a universal default.
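In practice, "per-task hyperparameter" means a small search on held-out data for each task. The sketch below shows that loop under stated assumptions: `evaluate` is a hypothetical callback that returns a dev-set score for a given task, position, and ordering; the candidate names (`"start"`, `"end"`, `"as_given"`, `"by_sparsity"`) and the scores are invented for illustration and do not come from the cited notes.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def tune_placement(
    tasks: Sequence[str],
    positions: Sequence[str],
    orderings: Sequence[str],
    evaluate: Callable[[str, str, str], float],
) -> Dict[str, Tuple[str, str]]:
    """Pick the best (position, ordering) pair for each task separately.

    On ties, the first candidate in iteration order wins.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for task in tasks:
        candidates = list(product(positions, orderings))
        best[task] = max(candidates, key=lambda c: evaluate(task, c[0], c[1]))
    return best


# Made-up dev-set scores standing in for real evaluation runs.
toy_scores = {
    ("code", "start", "as_given"): 0.60,
    ("code", "start", "by_sparsity"): 0.65,
    ("code", "end", "as_given"): 0.70,
    ("code", "end", "by_sparsity"): 0.62,
    ("story", "start", "as_given"): 0.50,
    ("story", "start", "by_sparsity"): 0.55,
    ("story", "end", "as_given"): 0.40,
    ("story", "end", "by_sparsity"): 0.45,
}

best = tune_placement(
    tasks=["code", "story"],
    positions=["start", "end"],
    orderings=["as_given", "by_sparsity"],
    evaluate=lambda task, pos, order: toy_scores[(task, pos, order)],
)
# best == {"code": ("end", "as_given"), "story": ("start", "by_sparsity")}
```

The search is exhaustive because the space here is tiny (two positions times two orderings). The point of the toy scores is only that the winning configuration can differ by task, which is what would make a single global default suboptimal.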
There's a deeper, slightly unsettling reason placement matters so much: a lot of what demonstrations 'teach' may be format and output space, not task understanding. Models trained on semantically empty or even deliberately wrong instructions perform almost identically to those given correct ones; what transfers is knowledge of the output space, not the meaning (see "Does instruction tuning teach task understanding or output format?"). If demos work largely by anchoring format and steering the model toward a region of output space, then *where and in what order* you place them (what the model sees last, what primes it first) is doing real mechanical work, which would explain why position alone can flip nearly half the predictions.
The thing you might not have known you wanted to know: placement tuning rhymes with a broader pattern in the corpus of treating *structure* as the tunable thing rather than weights. Self-adaptive models compose task-specific expert vectors at inference time (see "Can models dynamically activate expert skills at inference time?"), and multi-task systems get isolated, task-specific parameter regions (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). Demo placement is the cheapest member of that family (no training, no weight surgery, just rearranging the prompt), yet it sits on the same principle: per-task configuration, applied at inference, with effects large enough to take seriously.
Sources 7 notes
Repositioning an identical demo block from prompt start to end shifts accuracy by up to 20% and flips nearly half of predictions. This spatial effect operates independently of demo content and spans multiple task types.
Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.