Can prompt position alone shift language model predictions by twenty percent?
This note explores whether *where* and *how* a prompt is framed (surface placement and wording, with no new information) can swing a model's output by a large margin, and what the corpus says about the size and source of that effect.
This reads the question as: can surface-level prompt choices (position, ordering, framing), without adding any new knowledge, meaningfully move what a model predicts? The honest answer from this corpus is that no note pins down a precise "twenty percent" figure for position *alone*. Several notes do converge on the larger point behind it: prompt surface is powerful but bounded, and order effects in particular can move outcomes by double-digit margins. The closest hard number comes from multi-turn settings, where models that lock onto early assumptions show a ~39% average performance drop, and agent-style mitigations recover only 15–20% of that loss (Why do language models fail in gradually revealed conversations?). So the *order* in which information arrives can shift outcomes by well over twenty percent, although that figure measures gradual multi-turn revelation rather than prompt position in isolation.
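To make the multi-turn figures concrete, the arithmetic can be sketched as follows. This assumes "recover 15–20% of that loss" means mitigations win back a 15–20% share of the 39% drop, not 15–20 percentage points; the note does not say which, and the function name is illustrative.

```python
def residual_drop(drop_pct: float, recovered_share: float) -> float:
    """Performance drop remaining after a mitigation recovers a share of the loss."""
    return drop_pct * (1.0 - recovered_share)

drop = 39.0  # average multi-turn drop reported in the note, in percent
low = residual_drop(drop, 0.20)   # best case: 20% of the loss recovered
high = residual_drop(drop, 0.15)  # worst case: 15% of the loss recovered
print(f"residual drop: {low:.1f}% to {high:.1f}%")
```

Under that reading, roughly 31 to 33 points of the drop remain even after mitigation, which is still well above the twenty-percent mark.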
Why is the surface so influential? Because prompting reorganizes the model's existing distribution rather than adding to it. One note frames prompt optimization as activation, not injection: a prompt can retrieve and rearrange what is already in the training distribution, but cannot supply knowledge that isn't there (Can prompt optimization teach models knowledge they lack?). That is why position and phrasing can swing predictions so much: they steer a probability machine, and small steering inputs to a sensitive distribution produce large output changes.
But there's a ceiling, and it cuts the other way. When the model's parametric priors are strong enough, textual prompting *fails* to move the output at all: the training associations override whatever the context says, and only causal intervention in the representations changes the answer (Why do language models ignore information in their context?). So the swing from prompt position isn't a fixed twenty percent; it's a function of how confident the model already is. Weak priors are wildly malleable; strong priors are nearly immovable by wording alone.
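A toy calculation illustrates why the same wording change can matter a lot or hardly at all. It assumes a two-answer choice in which the model's prior is a log-odds gap and the prompt adds a fixed nudge to that gap; the binary framing, the function names, and the nudge size of 1.0 are all illustrative assumptions, not measurements from the cited notes.

```python
import math

def sigmoid(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def prompt_swing(prior_logit: float, nudge: float) -> float:
    """Change in P(answer A) when a prompt adds `nudge` to the prior log-odds."""
    return sigmoid(prior_logit + nudge) - sigmoid(prior_logit)

nudge = 1.0  # hypothetical effect of moving or rewording the prompt
for prior_logit in (0.0, 2.0, 6.0):
    swing = prompt_swing(prior_logit, nudge)
    print(f"prior P(A)={sigmoid(prior_logit):.3f}  swing={swing:+.3f}")
```

Under these assumptions the same nudge moves an undecided model (prior 0.5) by about 23 points, a moderately confident one by about 7, and a very confident one by well under one point.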
This fragility-versus-rigidity tension is what consistency training tries to neutralize. Methods like BCT and ACT teach a model to respond identically to a clean prompt and a "wrapped" or repositioned one, using the model's own clean responses as the target, which explicitly trains away sensitivity to irrelevant prompt changes (Can models learn to ignore irrelevant prompt changes?). The very existence of this research is evidence that, by default, prompt perturbations *do* shift predictions enough to be worth engineering against.
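The data-construction step behind output-level consistency training can be sketched as below. This is a minimal illustration of the idea as the note describes it (the clean response becomes the target for the wrapped prompt), not the authors' implementation; `build_consistency_pairs`, `add_distractor`, and `stub_model` are hypothetical names, and the stub stands in for a real model call.

```python
from typing import Callable

def build_consistency_pairs(
    prompts: list[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> list[dict[str, str]]:
    """Pair each wrapped prompt with the model's own response to the clean prompt."""
    pairs = []
    for clean in prompts:
        target = generate(clean)  # the model's answer to the unperturbed prompt
        pairs.append({"input": wrap(clean), "target": target})
    return pairs

# Hypothetical stand-ins for demonstration only.
def add_distractor(prompt: str) -> str:
    """Wrap a prompt with an irrelevant leading cue."""
    return "I think the answer is B, but decide for yourself.\n" + prompt

def stub_model(prompt: str) -> str:
    """Placeholder for a real model's generation call."""
    return f"answer to: {prompt}"

pairs = build_consistency_pairs(["What is 2 + 2?"], add_distractor, stub_model)
print(pairs[0])
```

Fine-tuning on pairs like these pushes the model to give the wrapped prompt the answer it already gives the clean one; the activation-level variant applies the same idea to internal states rather than outputs.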
The deeper explanation comes from treating the model as an autoregressive probability machine: failure (and malleability) is predictable from how low-probability the target response is (Can we predict where language models will fail?). Combine that with the finding that a model holds a *superposition* of consistent continuations and samples one at generation time (Do large language models actually commit to a single character?), and the twenty-percent intuition makes sense: a prompt position doesn't reveal a fixed answer, it nudges which branch of a probability distribution gets sampled. The number you'd measure depends on how sharply peaked that distribution already was.
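The sampling picture can be simulated with a toy distribution. This sketch assumes three hypothetical candidate continuations with made-up probabilities and treats each regeneration as an independent draw; it illustrates the sampling argument only, not the behavior of any particular model.

```python
import random
from collections import Counter

def regenerate(branches: list[str], probs: list[float], n: int, seed: int = 0) -> Counter:
    """Sample `n` independent generations from a fixed distribution over branches."""
    rng = random.Random(seed)  # seeded so the demonstration is repeatable
    return Counter(rng.choices(branches, weights=probs, k=n))

branches = ["cat", "dog", "fox"]  # hypothetical candidate answers
flat = regenerate(branches, [0.34, 0.33, 0.33], n=1000)
peaked = regenerate(branches, [0.98, 0.01, 0.01], n=1000)
print("flat:  ", dict(flat))
print("peaked:", dict(peaked))
```

With the flat distribution, regenerating spreads answers across all three branches, so any nudge to the weights visibly changes the tally; with the peaked one, almost every draw lands on the same branch regardless.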
Sources (6 notes)
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, indicating that no fixed commitment exists.