How much does demo position alone affect in-context learning accuracy?
Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?
In-context learning performance is sensitive to which demos are selected and in what order they appear — that much was known. What the DPP bias paper reveals is a different and larger effect: where the entire demo block sits relative to other prompt components (system prompt, user message) matters more than the content of the demos themselves.
Moving an unchanged block of demonstrations from the start of a prompt to the end can swing task accuracy by up to 20% and flip almost half of the model's predictions. This is a purely spatial effect. The demos are identical — their position relative to instructions and queries is the only variable. The effect spans classification, QA, summarization, and reasoning tasks, measured via two metrics: ACCURACY-CHANGE (net accuracy shift) and PREDICTION-CHANGE (output volatility from repositioning).
The mechanistic hypothesis draws on three architectural tendencies: primacy bias (transformers disproportionately emphasize early tokens due to induction head mechanisms), sequential processing bias (earlier context steers subsequent predictions more strongly), and lost-in-the-middle (tokens in middle positions receive less attention). These are known individually, but the DPP paper provides the first role-aware stress test — examining how these biases interact with prompt roles (system vs. user).
This extends the ordering sensitivity documented in Why do chain-of-thought examples fail across different conditions? to a larger spatial scale. CoT exemplar brittleness finds 3.3% accuracy swings from shuffling exemplars among themselves. DPP bias finds 20% swings from repositioning the entire block — roughly 6x the magnitude, operating at prompt-architecture level rather than within-exemplar level. Both share a root cause: LLMs process prompts as sequential narratives, not as unordered information sets.
The connection to How much does the order of premises actually matter for reasoning? reinforces the pattern: ordering effects appear at every spatial granularity — within premises (30%), within exemplar sets (3.3%), and across prompt components (20%). The shared mechanism is that Does transformer attention architecture inherently favor repeated content? — positional prominence in the attention window is an architectural property, not a training artifact.
The practical implication for prompt engineering is that demo placement is not a formatting choice but a performance-critical decision. For production systems using ICL, the position of demonstrations relative to instructions should be treated as a hyperparameter — and one that may need task-specific tuning.
Inquiring lines that use this note as a source 12
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can curated demonstrations compensate for smaller or simpler training environments?
- How does the location of causal passages differ between news and lectures?
- How does demo position create spatial bias in prompts?
- Can demo placement be tuned as a task-specific hyperparameter?
- What role does attention structure play in creating position bias?
- How do ordering effects compound across different prompt component scales?
- Why does profile position in context windows affect personalization strength?
- Why does entropy-based frame sampling work better than uniform stride selection?
- What happens when prompter skill matters more than domain expertise?
- Can a single accuracy threshold work across different prompt categories?
- Can activation sparsity patterns guide the selection of in-context learning demonstrations?
- What specific qualities make some demonstrations more effective for agency training?
Related concepts in this collection 3
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Why do chain-of-thought examples fail across different conditions?
Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
extends order sensitivity from within-exemplar (3.3%) to prompt-architecture level (20%); shared sequential-processing mechanism at different spatial scales
-
How much does the order of premises actually matter for reasoning?
When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
ordering effects at every granularity: premises, exemplars, prompt components
-
Does transformer attention architecture inherently favor repeated content?
Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
primacy bias and positional prominence as architectural root cause
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
- A Survey on Prompt Tuning
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
- Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
- Dynamic Prompting: A Unified Framework for Prompt Tuning
- Experimental Design for Active Transductive Inference in Large Language Models
- Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy
Original note title
demo position in prompt creates a spatial bias that swings ICL accuracy by up to 20 percent independent of demo content