How much does demo position alone affect in-context learning accuracy?

Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?

Synthesis note · 2026-02-23 · sourced from Context Engineering

In-context learning performance is sensitive to which demos are selected and in what order they appear — that much was known. What the DPP bias paper reveals is a different and larger effect: where the entire demo block sits relative to other prompt components (system prompt, user message) matters more than the content of the demos themselves.

Moving an unchanged block of demonstrations from the start of a prompt to the end can swing task accuracy by up to 20% and flip almost half of the model's predictions. This is a purely spatial effect. The demos are identical — their position relative to instructions and queries is the only variable. The effect spans classification, QA, summarization, and reasoning tasks, measured via two metrics: ACCURACY-CHANGE (net accuracy shift) and PREDICTION-CHANGE (output volatility from repositioning).

The mechanistic hypothesis draws on three architectural tendencies: primacy bias (transformers disproportionately emphasize early tokens due to induction head mechanisms), sequential processing bias (earlier context steers subsequent predictions more strongly), and lost-in-the-middle (tokens in middle positions receive less attention). These are known individually, but the DPP paper provides the first role-aware stress test — examining how these biases interact with prompt roles (system vs. user).

This extends the ordering sensitivity documented in Why do chain-of-thought examples fail across different conditions? to a larger spatial scale. CoT exemplar brittleness finds 3.3% accuracy swings from shuffling exemplars among themselves. DPP bias finds 20% swings from repositioning the entire block — roughly 6x the magnitude, operating at prompt-architecture level rather than within-exemplar level. Both share a root cause: LLMs process prompts as sequential narratives, not as unordered information sets.

The connection to How much does the order of premises actually matter for reasoning? reinforces the pattern: ordering effects appear at every spatial granularity — within premises (30%), within exemplar sets (3.3%), and across prompt components (20%). The shared mechanism is that Does transformer attention architecture inherently favor repeated content? — positional prominence in the attention window is an architectural property, not a training artifact.

The practical implication for prompt engineering is that demo placement is not a formatting choice but a performance-critical decision. For production systems using ICL, the position of demonstrations relative to instructions should be treated as a hyperparameter — and one that may need task-specific tuning.

Inquiring lines that use this note as a source 12

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

15 direct connections · 161 in 2-hop network ·dense cluster Open in graph ↗

How much does demo position alone affect in-cont… Why do chain-of-thought examples fail across diffe… How much does the order of premises actually matte… Does transformer attention architecture inherently…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Why do chain-of-thought examples fail across different conditions? Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
extends order sensitivity from within-exemplar (3.3%) to prompt-architecture level (20%); shared sequential-processing mechanism at different spatial scales
How much does the order of premises actually matter for reasoning? When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
ordering effects at every granularity: premises, exemplars, prompt components
Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
primacy bias and positional prominence as architectural root cause

How much does demo position alone affect in-context learning accuracy?

Related concepts in this collection 3

Related papers in this collection 8

Search by related questions 4