INQUIRING LINE

How does demo position create spatial bias in prompts?

This line asks why moving an identical demonstration block from the start of a prompt to the end changes what the model predicts, independent of the demo's actual content.


The *position* of a demonstration in a prompt, independent of what that demonstration says, can swing in-context learning results. The headline finding is blunt: relocate an identical demo block from the beginning of a prompt to the end and accuracy can move by up to twenty percent, with nearly half of all predictions flipping, across several task types (How much does demo position alone affect in-context learning accuracy?). Same words, same examples, different answer. That tells you the model isn't reading the prompt as a neutral container of information; it's reading it as a space where *where* something sits carries weight the author never intended.
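To make the two headline numbers concrete, here is a minimal sketch of how the accuracy shift and the flip rate could be measured. Everything in it is an assumption for illustration: `build_prompt`, `position_sensitivity`, and the `predict` callable are hypothetical names, and `toy_predict` is a deliberately position-sensitive stand-in for a real model, not the setup used in the cited work.

```python
from typing import Callable, Dict, Sequence


def build_prompt(demos: str, query: str, position: str) -> str:
    """Place the same demo block before ("start") or after ("end") the query."""
    if position == "start":
        return f"{demos}\n\n{query}"
    if position == "end":
        return f"{query}\n\n{demos}"
    raise ValueError(f"unknown position: {position!r}")


def position_sensitivity(
    predict: Callable[[str], str],
    demos: str,
    queries: Sequence[str],
    labels: Sequence[str],
) -> Dict[str, float]:
    """Compare predictions when only the demo block's position changes."""
    start_preds = [predict(build_prompt(demos, q, "start")) for q in queries]
    end_preds = [predict(build_prompt(demos, q, "end")) for q in queries]
    n = len(queries)
    acc_start = sum(p == y for p, y in zip(start_preds, labels)) / n
    acc_end = sum(p == y for p, y in zip(end_preds, labels)) / n
    # A "flip" is any query whose prediction differs between the two layouts.
    flip_rate = sum(a != b for a, b in zip(start_preds, end_preds)) / n
    return {
        "accuracy_start": acc_start,
        "accuracy_end": acc_end,
        "accuracy_delta": acc_end - acc_start,
        "flip_rate": flip_rate,
    }


def toy_predict(prompt: str) -> str:
    """Hypothetical stand-in for a model: copies the final demo label when
    the demo block comes last, otherwise keys on a word in the query."""
    last_line = prompt.strip().splitlines()[-1]
    if last_line.startswith("Label: "):
        return last_line.split(": ", 1)[1]
    return "positive" if "loved" in prompt else "negative"


demos = "Review: great film\nLabel: positive\nReview: dull plot\nLabel: negative"
queries = ["Review: loved it\nLabel:", "Review: hated it\nLabel:"]
labels = ["positive", "negative"]

result = position_sensitivity(toy_predict, demos, queries, labels)
print(result)
```

With a real model, `predict` would wrap whatever inference call you already use; the point of the sketch is only that demo content, queries, and labels are held fixed while the layout varies.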

The deeper question is where this sensitivity comes from, and the corpus points at the foundations rather than the surface. Cognitive biases in models appear to be planted during pretraining and merely nudged by finetuning: a causal study using random-seed variation and cross-tuning found that models sharing a pretrained backbone keep similar bias patterns no matter what instruction data they later see (Where do cognitive biases in language models come from?). Spatial bias reads as one member of this family: a structural tendency baked deep enough that you can't prompt or finetune your way around it easily. It sits alongside other content-independent swings, like emotional tone changing the substance of an answer to an otherwise identical question (Does emotional tone in prompts change what information LLMs provide?). In both cases a variable that shouldn't matter does.

What's striking is that this undercuts a quiet assumption in prompt design: that a prompt is a list of facts the model weighs on their merits. One line of work instead reframes the prompt as a single static frame that bundles utterance, context, and role all at once, which the model cannot renegotiate mid-conversation the way humans cooperatively rebuild shared context (How do prompts reshape the role of context in AI conversation?). If the whole prompt is one frozen frame, then position is part of the message whether you like it or not. This also changes how we think about "prompt quality": rather than a flat checklist, quality looks like a structured space with interacting dimensions, where one change cascades into others (Can we measure prompt quality independent of model outputs?).

The more useful turn is what you do about it. One answer is to stop hand-tuning around fragilities and train the fragility out: consistency training teaches a model to respond identically to a clean prompt and a "wrapped" or perturbed version of it, using the model's own clean answers as targets (Can models learn to ignore irrelevant prompt changes?). Spatial bias looks like exactly the kind of irrelevant perturbation that approach targets: invariance to where the demo sits is the same goal as invariance to surface wrapping. The contrast worth sitting with is the alternative the corpus flags as a dead end: ad hoc iterative prompt-tweaking by a single person, which bakes in individual bias and self-fulfilling feedback loops instead of removing them (Does iterative prompt engineering undermine scientific validity?).
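As a rough illustration of the output-level idea, the sketch below builds training pairs in which the target for a perturbed prompt is the model's own answer to the clean prompt. It covers only the data-construction step, not the fine-tuning itself or any activation-level variant. Treating "demos first" as the clean layout and "demos last" as the perturbation is an assumption made here to connect the idea to spatial bias; `render`, `build_consistency_pairs`, and `generate` are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (demo_block, query): a hypothetical structure chosen for this sketch.
Example = Tuple[str, str]


def render(demos: str, query: str, demos_first: bool) -> str:
    """Lay out the same content with the demo block first or last."""
    return f"{demos}\n\n{query}" if demos_first else f"{query}\n\n{demos}"


def build_consistency_pairs(
    generate: Callable[[str], str],
    examples: Sequence[Example],
) -> List[Dict[str, str]]:
    """Pair each perturbed prompt with the model's own clean-prompt answer."""
    pairs = []
    for demos, query in examples:
        clean_prompt = render(demos, query, demos_first=True)
        # The target comes from the model itself, not from a gold label.
        target = generate(clean_prompt)
        perturbed_prompt = render(demos, query, demos_first=False)
        pairs.append({"prompt": perturbed_prompt, "target": target})
    return pairs


def stub_generate(prompt: str) -> str:
    """Placeholder for a real model call; returns a deterministic string."""
    return f"answer-for-{len(prompt)}-chars"


pairs = build_consistency_pairs(stub_generate, [("D", "Q"), ("demo", "query")])
print(pairs[0])
```

The resulting pairs could then feed an ordinary supervised fine-tuning loop. Because the targets are regenerated from the current model, they reflect what that model already says on the clean layout, which is the property the source note highlights.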

The thing you didn't know you wanted to know: spatial bias isn't a quirk of one benchmark. It's a visible symptom of how models treat a prompt as a single physical layout rather than a bag of facts. The same demo, unchanged, is genuinely a different input depending on where you put it, and the fix is teaching invariance, not finding the "right" spot.


Sources (7 notes)

How much does demo position alone affect in-context learning accuracy?

Repositioning an identical demo block from prompt start to end shifts accuracy by up to 20% and flips nearly half of predictions. This spatial effect operates independently of demo content and spans multiple task types.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

How do prompts reshape the role of context in AI conversation?

LLM prompts bundle utterance, context assignment, and role specification into a single static frame the model cannot renegotiate, unlike human dialogue where context evolves cooperatively. This makes mid-conversation pivots require explicit re-prompting rather than implicit adjustment.

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Does iterative prompt engineering undermine scientific validity?

Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
