What measurement artifacts emerge when annotators interpret the same question differently?

This explores what goes wrong in measurement when different annotators read the same question in different ways — and whether that variation is noise to be cleaned up or signal worth keeping.

The corpus suggests the deepest artifact is a category error: we treat disagreement as failure when much of it is real signal. Start with the finding that annotation responses don't all measure the same thing. They decompose into at least three signal types: genuine preferences, non-attitudes (people answering with no real underlying view), and preferences constructed on the spot. The three are distinguishable only by whether they stay consistent when you change how you ask (see 'Do all annotation responses measure the same underlying thing?'). Pool them into one score and you contaminate everything downstream, including reward-model training and alignment. So the first artifact is invisible heterogeneity: a single number hiding three different things.
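A minimal sketch of that consistency check, assuming each annotator answered the same item under several rephrasings. The function names, the 0.8 cutoff, and the data are illustrative assumptions, not taken from the underlying note. Consistency alone only separates stable from unstable responses; telling non-attitudes apart from constructed preferences would need further evidence.

```python
from collections import Counter


def modal_share(responses):
    """Fraction of responses equal to the most common answer."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


def flag_stability(responses, threshold=0.8):
    """Label one annotator's answers to rephrasings of a single item.

    threshold is an arbitrary illustrative cutoff, not an established value.
    """
    return "stable" if modal_share(responses) >= threshold else "unstable"


# Hypothetical data: two annotators, five phrasings of the same question.
annotator_a = ["agree", "agree", "agree", "agree", "agree"]
annotator_b = ["agree", "disagree", "agree", "neutral", "disagree"]

print(flag_stability(annotator_a))
print(flag_stability(annotator_b))
```

Keeping the stability flag alongside each label, rather than discarding unstable responses, lets downstream training weight or separate the two groups instead of pooling them.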

The second artifact is treating valid perspective-difference as labeling error. For socially embedded sentences, interpretations are irreducibly multiple: a reader's social position genuinely changes what the sentence means to them, and the spread of labels carries information rather than corrupting it (see 'Why do readers interpret the same sentence so differently?'). The mirror-image failure shows up in the texts themselves: some sentences are deliberately ambiguous, and the cost of not recognizing that is steep. On the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of ambiguous cases against humans' 90%, unable to hold two readings at once (see 'Can language models recognize when text is deliberately ambiguous?'). Whether the interpreter is human or machine, collapsing multiple legitimate readings into one 'correct' answer is where the artifact is manufactured.
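One way to avoid collapsing readings is to store the full label distribution for each item and report how spread out it is. The sketch below is an illustration under assumptions: the label names and data are hypothetical, and Shannon entropy is just one common choice of spread measure, not something the notes prescribe.

```python
import math
from collections import Counter


def label_distribution(labels):
    """Relative frequency of each label assigned to one item."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def entropy_bits(dist):
    """Shannon entropy of a label distribution, in bits."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical item labeled by four annotators with an even split.
labels = ["offensive", "offensive", "not_offensive", "not_offensive"]
dist = label_distribution(labels)

# A majority vote would report a single label and hide the split.
majority_label = max(dist, key=dist.get)

print(dist)
print(majority_label, entropy_bits(dist))
```

An item with high entropy is a candidate for multiple legitimate readings, so it is worth inspecting before it is treated as annotator noise.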

Third, the asking itself isn't neutral. Prompt quality turns out to be a structured space with six measurable dimensions (communication, cognition, instruction, logic, hallucination, and responsibility), not a flat property, which means two phrasings of 'the same' question can sit at very different points in that space (see 'Can we measure prompt quality independent of model outputs?'). And the sensitivity to phrasing isn't uniform across respondents: when a model (or, by analogy, a confident annotator) is sure of its answer it resists rephrasing, while low confidence produces large swings from trivial wording changes (see 'Does model confidence predict robustness to prompt changes?'). So identical-looking questions silently sort respondents by confidence, and the variance you measure is partly an artifact of who was unsure. Emotional framing does the same thing through a different door: identical questions get systematically different answers depending on the tone they're wrapped in (see 'Does emotional tone in prompts change what information LLMs provide?').

What ties these together is the most unsettling note in the collection: emergent abilities. Sharp, dramatic capability jumps in LLMs largely vanish when you swap a discontinuous metric for a continuous one. The 'jump' was a measurement choice, not a real change in behavior (see 'Are LLM emergent abilities real or measurement artifacts?'). The lesson generalizes directly to annotation: the artifact often lives in the measurement instrument, not the thing measured. When annotators diverge, the reflex is to assume the annotators are noisy. These notes invite the opposite hypothesis first: that the scale, the aggregation rule, or the question's hidden ambiguity is what's generating the divergence.
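A toy calculation shows how the choice of metric alone can produce an apparent jump. It assumes token errors are independent and a fixed answer length of 10 tokens; both are simplifying assumptions made for illustration, not the analysis in the note.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is correct, assuming independent errors."""
    return per_token_accuracy ** answer_length


# Per-token accuracy (a continuous metric) improves in small, smooth steps...
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    # ...while all-or-nothing exact match on a 10-token answer stays near
    # zero for most of the range and then climbs steeply.
    print(f"{p:.2f}  {exact_match_rate(p, 10):.4f}")
```

Under these assumptions, a per-token accuracy of 0.5 gives an exact-match rate below 0.1%, while 0.9 gives about 35%. The underlying improvement is the same smooth curve in both columns; only the scoring rule differs.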

If you want a way out, two corpus threads point forward. One is to stop trying to fake disagreement with LLM personas: running the same persona prompt repeatedly produces variance that matches or exceeds the variance between different personas, so model uncertainty, not stable human-like difference, is driving the spread (see 'Why do LLM persona prompts produce inconsistent outputs across runs?'). The other is to engineer the question better up front: the ALFA framework decomposes 'question quality' into specific attributes like clarity, relevance, and specificity, and optimizing each attribute separately outperforms training on a single score, which suggests the same decomposition could reduce the kind of ambiguity that makes annotators read the same prompt two ways (see 'Can models learn to ask genuinely useful clarifying questions?'). The discovery worth leaving with: 'annotator disagreement' is rarely one phenomenon. It's a stack of distinct artifacts, and which fix applies depends entirely on which layer you're actually looking at.
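The persona comparison can be sketched as a simple variance decomposition. The persona names, the 1-to-5 ratings, and the function name are hypothetical, and this is one plausible way to run the check rather than the procedure used in the note.

```python
from statistics import mean, pvariance


def variance_components(runs_by_persona):
    """Return (within-persona variance, between-persona variance).

    within:  average variance across repeated runs of the same persona.
    between: variance of the per-persona mean ratings.
    """
    within = mean([pvariance(runs) for runs in runs_by_persona.values()])
    between = pvariance([mean(runs) for runs in runs_by_persona.values()])
    return within, between


# Hypothetical 1-to-5 ratings from five repeated runs per persona prompt.
ratings = {
    "persona_a": [2, 4, 3, 5, 1],
    "persona_b": [3, 5, 4, 2, 3],
}

within, between = variance_components(ratings)
print(f"within={within:.2f} between={between:.2f}")

# If run-to-run variance is at least as large as the variance between
# personas, the spread is not evidence of stable persona differences.
print(within >= between)
```

With this made-up data the within-persona variance is far larger than the between-persona variance, which is the pattern the note reports.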

Sources 9 notes

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
