How does output variability disguise confirmation bias in prompt refinement?
This note examines a feedback trap in prompt engineering: because LLM outputs naturally shift with every wording change, a user who keeps tweaking prompts until the answer matches what they expected can mistake that selection process for the model 'getting it right.' The corpus suggests the disguise works because two facts about LLMs get conflated: outputs are inherently variable, and refinement is inherently a steering process. The steering therefore hides inside the variability.
Start with the variability itself. Outputs are described as essentially mutable: they swing with sampling, prompt wording, and even audience interpretation, and this is framed not as a bug but as a defining property that resists ordinary quality assurance (see: Why does AI output change with every prompt and context?). Crucially, that swing is largest exactly where it matters: when the model is uncertain, small prompt rephrasings cause big output changes, while confident answers stay stable (see: Does model confidence predict robustness to prompt changes?). So on hard, ambiguous questions, the ones where a user most wants confirmation, the model is most willing to hand over a different answer for every reformulation.
Now add what refinement actually does. Iterative prompt engineering has been characterized as the user injecting their own anticipated answer distribution into generation: each revision narrows the gap between the output and what the user already expected, until the result is a co-production of model and user prior rather than an independent finding (see: How much does the user shape what a model generates?). Put the two together and the mechanism is clear: variability supplies an endless stream of candidate outputs, refinement selects among them by closeness to expectation, and the selection looks like discovery because the surface text genuinely changed each round. You are not lying to yourself about the words; you are misreading 'I kept going until it agreed with me' as 'it converged on the answer.'
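The selection mechanism described above can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not a model of any real LLM: each "output" is a single number drawn from a fixed distribution that never changes between rounds, and `refine_until_agreement`, `EXPECTED`, and `TOLERANCE` are hypothetical names introduced only for illustration. The point is that a stopping rule of "keep re-prompting until it agrees with me" yields accepted outputs that cluster around the user's expectation even though nothing was learned.

```python
import random


def refine_until_agreement(expected, tolerance, max_rounds, rng):
    """Re-draw an output until it lands within `tolerance` of `expected`.

    The draw distribution is fixed (standard normal), so later rounds are
    no more 'correct' than earlier ones; only the stopping rule differs.
    Returns (accepted_output_or_None, rounds_used).
    """
    for round_no in range(1, max_rounds + 1):
        output = rng.gauss(0.0, 1.0)  # hypothetical stand-in for one model answer
        if abs(output - expected) <= tolerance:
            return output, round_no
    return None, max_rounds


rng = random.Random(0)   # seeded only so reruns are repeatable
EXPECTED = 1.5           # what the simulated user already believes
TOLERANCE = 0.2          # how close an output must be to count as 'right'

# Baseline: outputs taken as they come, with no selection.
raw = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Refinement: each simulated user stops at the first agreeable output.
accepted, rounds_used = [], []
for _ in range(2000):
    output, rounds = refine_until_agreement(EXPECTED, TOLERANCE, 200, rng)
    if output is not None:
        accepted.append(output)
        rounds_used.append(rounds)

mean_raw = sum(raw) / len(raw)
mean_accepted = sum(accepted) / len(accepted)
mean_rounds = sum(rounds_used) / len(rounds_used)

print(f"mean of unselected outputs: {mean_raw:.2f}")
print(f"mean of accepted outputs:   {mean_accepted:.2f}")
print(f"average rounds to 'agree':  {mean_rounds:.1f}")
```

The unselected mean stays near the true center of the distribution, while the accepted mean sits near `EXPECTED` by construction. The gap between the two comes entirely from the stopping rule.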
The sharpest statement of the danger is the argument that ad-hoc prompt revision violates the scientific method: a single person revising prompts introduces individual bias, quietly shifts the evaluation criteria to match what the model can produce, and builds self-fulfilling feedback loops. The proposed fix is pre-specified criteria and inter-coder reliability rather than one person's iterative taste (see: Does iterative prompt engineering undermine scientific validity?). That 'shifting criteria' is the tell: confirmation bias here doesn't just pick a favorite answer, it rewrites the standard of a good answer mid-process, and variability gives it cover by making each shift look like a new data point.
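One way to make 'inter-coder reliability' concrete is to have two people rate the same outputs against criteria fixed in advance and then compute a chance-corrected agreement score. The sketch below uses Cohen's kappa, which is an assumption on my part: the source calls for inter-coder reliability but does not name a statistic. The ratings are invented illustration data, not results.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label rates.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of eight model outputs against pre-specified criteria.
coder_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
coder_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")
```

Because the criteria and the second coder are fixed before anyone sees which outputs are agreeable, a single reviser cannot quietly move the standard between rounds without the disagreement showing up in the score.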
The genuinely useful turn the corpus offers is that the cure attacks the variability-as-evidence link directly. One line of work measures prompt quality on six dimensions (communication, cognition, instruction, logic, hallucination, responsibility) entirely independent of the model's output, so you can judge a prompt before seeing whether you like what it returns (see: Can we measure prompt quality independent of model outputs?). Another trains models to respond identically to clean and reworded prompts, collapsing the perturbation swing that the bias feeds on (see: Can models learn to ignore irrelevant prompt changes?). Both make the same bet: if you either fix the prompt's quality up front or remove the model's willingness to give you a new answer per phrasing, there's nothing left for confirmation bias to hide behind.
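The 'perturbation swing' can also be measured before trusting any single answer. The following is a rough sketch, assuming exact string match is an acceptable notion of 'same answer'; `consistency_rate` and `toy_model` are hypothetical names, and the toy model is a deliberately crude stand-in that flips its answer on a leading cue. It is not the training method from the cited work, only a way to observe the swing that such training aims to remove.

```python
from typing import Callable, Sequence


def consistency_rate(
    model: Callable[[str], str],
    clean_prompt: str,
    rewordings: Sequence[str],
) -> float:
    """Fraction of rewordings whose answer equals the clean-prompt answer."""
    if not rewordings:
        raise ValueError("need at least one rewording")
    reference = model(clean_prompt)
    matches = sum(model(prompt) == reference for prompt in rewordings)
    return matches / len(rewordings)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in: answers 'yes' whenever the prompt leads with 'surely'."""
    return "yes" if "surely" in prompt.lower() else "no"


rate = consistency_rate(
    toy_model,
    "Is the claim supported?",
    [
        "Is the claim supported by the evidence?",
        "Surely the claim is supported?",
        "Would you say the claim is supported?",
        "The claim is surely supported, right?",
    ],
)
print(f"consistency across rewordings: {rate:.2f}")
```

A low rate on a given question is a warning that whichever answer the user eventually settles on may reflect the phrasing that was chosen rather than anything the model knows.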
Sources (6 notes)
AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Foundation Priors research shows prompt engineering as divergence minimization between synthetic output and user priors. The refinement process systematically steers generation toward what users already expect, making outputs co-productions of model and user subjectivity.
Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.