Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.

Synthesis note · 2026-04-07 · sourced from Alignment

Six decades of behavioral-science research on preference elicitation yield a taxonomy that RLHF practice collapses into a single signal. The three categories matter because they require different treatment, and treating them uniformly is the upstream mistake that Are RLHF annotations actually measuring genuine human preferences? argues contaminates the entire pipeline.

Genuine preferences manifest stably across equivalent measurement conditions. Ask the same question with different surface wording, different framing, different order, and the response stays the same. This is what the reward model is supposed to be learning. Only this category is safe to aggregate in the way standard RLHF aggregates.

Non-attitudes are responses generated to satisfy the question without any stable underlying opinion. The respondent has never formed a view on the matter, but the measurement protocol demands an answer, so one gets produced. Non-attitudes are especially pervasive for value-laden questions — precisely the questions that matter most for alignment. Non-attitudes look like genuine preferences in a single measurement but fail the consistency test: re-ask the same respondent and you get a different answer because there was never a stable view to retrieve. Current RLHF treats these as noise to filter or minority views to downweight. The behavioral science view is different: non-attitudes contain no signal at all and should be excluded, not averaged with genuine preferences.
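The re-ask test described above can be sketched as a simple agreement score. This is a minimal illustration, not an established protocol: the function name `reask_agreement`, the tuple layout, and the use of modal-answer share as the score are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def reask_agreement(responses):
    """Score how stable each annotator's answer is when an item is re-asked.

    responses: iterable of (annotator_id, item_id, choice) tuples; the same
    (annotator_id, item_id) pair appears more than once if it was re-asked.
    Returns {(annotator_id, item_id): share of answers equal to the modal answer}.
    Pairs that were asked only once are omitted, since they cannot be scored.
    """
    by_pair = defaultdict(list)
    for annotator, item, choice in responses:
        by_pair[(annotator, item)].append(choice)

    scores = {}
    for pair, choices in by_pair.items():
        if len(choices) < 2:
            continue  # no re-ask, so no consistency evidence either way
        modal_count = Counter(choices).most_common(1)[0][1]
        scores[pair] = modal_count / len(choices)
    return scores


# Hypothetical data: ann1 is stable, ann2 flips, ann3 was never re-asked.
responses = [
    ("ann1", "q1", "A"), ("ann1", "q1", "A"), ("ann1", "q1", "A"),
    ("ann2", "q1", "A"), ("ann2", "q1", "B"), ("ann2", "q1", "A"),
    ("ann3", "q1", "B"),
]
scores = reask_agreement(responses)
```

A low score is only suggestive of a non-attitude. Where to draw the threshold, and how many re-asks are enough, are choices the note does not specify.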

Constructed preferences are assembled on the spot from contextual cues and framing. The respondent is not uncertain (as in a non-attitude); they are producing a coherent answer that depends on the measurement context. Change the context — different anchoring, different comparison class, different framing — and you get a different coherent answer. This category carries real information, but about the interaction between person and context, not about a stable property of the person. RLHF treats constructed preferences as context-independent preferences and trains reward models on them as if they were. The result: reward models that look good on in-distribution evaluation but fail when the deployment context differs from the annotation context.
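One way to picture the constructed-preference signature is an answer pattern that is stable within each context but differs between contexts. The sketch below assumes repeated asks under labelled framings; the function name `framing_profile` and the framing labels are hypothetical.

```python
from collections import Counter


def framing_profile(answers_by_framing):
    """Summarize one annotator's answers to one item under several framings.

    answers_by_framing: {framing_label: [choices from repeated asks]}, with
    every list assumed non-empty.
    Returns (within_stable, across_stable):
      within_stable  - every framing got the same answer on every ask
      across_stable  - the most common answer is the same in every framing
    """
    within_stable = all(len(set(choices)) == 1
                        for choices in answers_by_framing.values())
    modal_answers = {Counter(choices).most_common(1)[0][0]
                     for choices in answers_by_framing.values()}
    across_stable = len(modal_answers) == 1
    return within_stable, across_stable


# Hypothetical annotator: coherent inside each framing, different between them.
profile = framing_profile({
    "gain frame": ["A", "A"],
    "loss frame": ["B", "B"],
})
```

Under these assumptions, a result of stable within framings but unstable across them matches the description of a constructed preference, whereas instability within a single framing looks more like a non-attitude.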

Measurement artifacts form a fourth related category: same question measuring different constructs for different respondents. One annotator interprets "helpful" as "completes the task"; another interprets it as "gives correct information even when unasked"; a third interprets it as "avoids making the user feel incompetent." They provide coherent, stable responses — each tracking a real preference of theirs — but they are not tracking the same thing. RLHF aggregates them as if they were.

The diagnostic criterion that separates these is consistency across equivalent measurement conditions. Genuine preferences pass; non-attitudes, constructed preferences, and measurement artifacts each fail in distinctive ways. Non-attitudes fail on re-ask (no stable view). Constructed preferences fail on context perturbation (context-dependent). Measurement artifacts fail on question rephrasing (different construct elicited). These are distinguishable empirically, and the distinction determines what should be done with each.
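The failure patterns above can be read as a small decision procedure. The sketch below is one possible encoding: the `ConsistencyResult` fields and the order in which the checks are applied are assumptions, chosen because a response that is unstable on re-ask makes the other two tests hard to interpret.

```python
from dataclasses import dataclass


@dataclass
class ConsistencyResult:
    """Outcome of the three consistency tests for one annotator and item."""
    stable_on_reask: bool            # same answer when the item is repeated
    stable_across_contexts: bool     # same answer under perturbed framing
    stable_across_rephrasings: bool  # same answer when the question is reworded


def classify(result):
    """Map a test outcome to one of the note's four categories."""
    if not result.stable_on_reask:
        return "non-attitude"
    if not result.stable_across_contexts:
        return "constructed preference"
    if not result.stable_across_rephrasings:
        return "measurement artifact"
    return "genuine preference"


# Hypothetical outcomes, one per category.
labels = [
    classify(ConsistencyResult(True, True, True)),
    classify(ConsistencyResult(False, True, True)),
    classify(ConsistencyResult(True, False, True)),
    classify(ConsistencyResult(True, True, False)),
]
```

Real responses may fail more than one test at once. This sketch resolves such cases by check order, which is a simplification rather than something the taxonomy prescribes.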

The operational implication is a pre-aggregation filtering step that RLHF currently lacks. Before training the reward model, submit annotation tasks to consistency protocols: re-ask selected items, perturb framings, rephrase questions. Responses that fail consistency tests are not aggregated as preferences; they are either excluded (non-attitudes), contextualized (constructed preferences), or routed to separate annotators (measurement artifacts). This is operationally demanding but conceptually necessary: the alternative is the status quo, in which Why do preference models favor surface features over substance? documents 40% divergences without being able to attribute them to a specific upstream cause.
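The routing described above might look like the following pre-aggregation step. It is a sketch under stated assumptions: the category labels are taken as already assigned, and the action names (`aggregate`, `exclude`, `contextualize`, `reassign`) are illustrative shorthand for the treatments named in the paragraph.

```python
# Treatment for each category, following the paragraph above.
ACTIONS = {
    "genuine preference": "aggregate",
    "non-attitude": "exclude",
    "constructed preference": "contextualize",
    "measurement artifact": "reassign",  # route to separate annotators
}


def route_annotations(labelled):
    """Split labelled annotations into buckets before reward-model training.

    labelled: iterable of (annotation_id, category) pairs, where category is
    one of the keys of ACTIONS. Returns {action: [annotation_id, ...]}.
    """
    buckets = {action: [] for action in ACTIONS.values()}
    for annotation_id, category in labelled:
        buckets[ACTIONS[category]].append(annotation_id)
    return buckets


# Hypothetical labelled annotations.
labelled = [
    ("a1", "genuine preference"),
    ("a2", "non-attitude"),
    ("a3", "constructed preference"),
    ("a4", "measurement artifact"),
    ("a5", "genuine preference"),
]
buckets = route_annotations(labelled)
```

Only the `aggregate` bucket would feed standard preference aggregation. How the `contextualize` bucket is stored alongside its measurement context is left open here.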

The taxonomy also suggests why Can models learn to ignore irrelevant prompt changes? works as an output-side intervention. If the upstream measurement problem is consistency failure across equivalent conditions, then training models to be invariant to equivalent-condition perturbations is a downstream patch for the same underlying phenomenon: the system's current lack of robustness to irrelevant cue variation.

Related concepts in this collection 8

Are RLHF annotations actually measuring genuine human preferences? RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
the parent argument this taxonomy operationalizes
Why do preference models favor surface features over substance? Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
the 40% divergence as downstream symptom; this taxonomy points upstream
Why do reasoning models fail at predicting disagreement? RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
disagreement that should be preserved vs disagreement that signals non-attitude — current RLHF conflates them
Can models learn to ignore irrelevant prompt changes? Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
consistency-as-diagnostic maps to consistency-as-training-objective
Why do LLM persona prompts produce inconsistent outputs across runs? Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
unstable-across-runs is the constructed-preference signature in simulated annotators
Why do LLM judges fail at predicting sparse user preferences? When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
verbal uncertainty estimation as an abstention analog for identifying non-attitudes
Should AI alignment target preferences or social role norms? Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
the normative critique; this note is the measurement refinement that specifies what the inputs actually contain
Can text summaries beat embeddings for personalized reward models? When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
text summaries preserve the context that constructed preferences depend on, where scalar rewards lose it
