Are RLHF annotations actually measuring genuine human preferences?

RLHF treats annotation responses as stable preferences, but behavioral science shows that people often construct answers on the spot without holding a real opinion. Does this measurement gap undermine the entire approach?

Synthesis note · 2026-04-07 · sourced from Alignment

The RLHF research program has invested enormous effort in the final links of its chain: better reward modeling architectures, better preference aggregation rules, better fine-tuning algorithms. A logically prior question has received less systematic attention: do the annotation responses being modeled reflect genuine preferences at all? This paper argues — drawing on sixty years of behavioral science literature that the ML community has largely ignored — that they often may not, and that this measurement validity question must be answered before any aggregation or fine-tuning decision makes sense.

The behavioral science findings are well-established. Humans routinely produce answers to survey questions without holding genuine opinions, a phenomenon called non-attitudes (Converse 1964; Krosnick 1991). Preferences are often constructed on the spot, influenced by framing and context rather than retrieved from stable mental representations (Slovic 1995; Payne et al. 1993). The same question can measure different constructs for different people (Vandenberg & Lance 2000). These are not marginal effects. They are pervasive for precisely the value-laden judgments that matter most for alignment: "should the AI refuse this request," "which response is more helpful," "is this harmful." Current RLHF practice trains reward models to predict the majority label, filters or downweights high-disagreement items, and produces a scalar reward that discards information about whether judgments were contested. The result: RLHF may be "systematically modeling noise as signal and elicitation artifacts as human values."
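The information loss described above can be made concrete with a small sketch. The item names and vote counts below are invented for illustration and are not data from the paper. The point is only that a majority-label target assigns the same label to a near-unanimous item and a heavily contested one, while the entropy of the vote distribution keeps them apart.

```python
from collections import Counter
from math import log2

def majority_label(votes):
    """Return the most common label (ties resolve to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]

def vote_entropy(votes):
    """Shannon entropy, in bits, of the empirical label distribution."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Hypothetical annotation data: ten annotators choose response "A" or "B".
items = {
    "near_unanimous": ["A"] * 9 + ["B"],
    "contested": ["A"] * 6 + ["B"] * 4,
}

# Both items receive the same majority label, but their entropies differ.
summary = {
    name: (majority_label(votes), round(vote_entropy(votes), 3))
    for name, votes in items.items()
}

for name, (label, entropy) in summary.items():
    print(f"{name}: majority={label}, entropy={entropy} bits")
```

Entropy is only one possible disagreement statistic, and it does not say why annotators disagree. It cannot separate contested values from non-attitudes, which is the distinction the rest of this note turns on.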

The logical ordering matters. Before asking how to aggregate diverse preferences, the field must ask whether the responses being aggregated are preferences at all. Before personalizing reward models to individual annotators, the field must ask whether those annotators have stable preferences to personalize. Before filtering high-disagreement items as noise, the field must ask whether disagreement signals contested values, absent values, or constructed preferences that would give different answers to the same question twenty minutes later. Each of these downstream questions presumes a solved version of the measurement validity question — and that presumption is not warranted by current practice.

This provides a second-line argument against preferentism, one that reaches even readers who accept preferentism in principle. "Should AI alignment target preferences or social role norms?" argues that preferences are the wrong target on normative grounds. "Measuring Human Preferences" argues that even within the preferentist framework, the measurement inputs are invalid, so aggregation cannot save the approach. Together they form a pincer: preferences are both wrong-in-kind and wrong-in-measurement.

The paper's constructive contribution is a research agenda: treat measurement validity as logically prior to aggregation. Diagnose non-attitudes, constructed preferences, and measurement artifacts using the consistency criterion (do responses stabilize across equivalent conditions?). Route each type to appropriate treatment rather than collapsing them into a single signal. The alternative is an RLHF pipeline that fights downstream artifacts inherited from upstream measurement failures. That is where the field finds itself when "Why do preference models favor surface features over substance?" and "Why do reasoning models fail at predicting disagreement?" document 40%+ divergences and systematic disagreement suppression without being able to point to the upstream cause.
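One hypothetical way to operationalize the consistency criterion is sketched below. Everything in it is an assumption made for illustration: the `RepeatedResponses` layout, the `route` function, the 0.8 threshold, and the mapping from consistency patterns to response types are not taken from the paper. The sketch assumes each annotator answers the same item several times, both as verbatim re-asks and under reworded but equivalent framings.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RepeatedResponses:
    """One annotator's answers to one item (hypothetical data layout)."""
    same_framing: List[str]  # answers to verbatim re-asks
    alt_framing: List[str]   # answers to reworded but equivalent versions

def agreement(answers: List[str]) -> float:
    """Fraction of answers that match the most frequent answer."""
    if not answers:
        raise ValueError("need at least one answer")
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

def route(r: RepeatedResponses, threshold: float = 0.8) -> str:
    """Assign a response pattern to a candidate type (illustrative rule)."""
    # Unstable even on verbatim re-asks: treated as a possible non-attitude.
    if agreement(r.same_framing) < threshold:
        return "non_attitude_candidate"
    # Stable on re-asks but shifting across framings: possibly constructed.
    if agreement(r.same_framing + r.alt_framing) < threshold:
        return "constructed_preference_candidate"
    # Stable across all equivalent conditions observed here.
    return "stable_preference_candidate"

stable = RepeatedResponses(["A", "A", "A"], ["A", "A"])
framed = RepeatedResponses(["A", "A", "A"], ["B", "B", "B"])
noisy = RepeatedResponses(["A", "B", "A", "B"], ["A", "B"])

for name, r in [("stable", stable), ("framed", framed), ("noisy", noisy)]:
    print(name, "->", route(r))
```

The labels are deliberately suffixed with "candidate": a handful of repeated answers cannot establish what an annotator actually holds. Measurement artifacts in the sense of different annotators reading the same question differently are not covered, since detecting them would need comparisons across annotators rather than within one.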

The practical implication is uncomfortable. If measurement validity is suspect, then a significant portion of the alignment investment of the last several years has been optimizing the wrong objective — not because preferences are the wrong target (the Beyond Preferences critique) but because the "preferences" in the training data are not preferences.

Related concepts in this collection (10)

Do all annotation responses measure the same underlying thing? Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.
the companion insight detailing the diagnostic taxonomy
Should AI alignment target preferences or social role norms? Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
the normative pincer: preferences are wrong-in-kind; this note is the measurement pincer: preferences are wrong-in-measurement
Why do preference models favor surface features over substance? Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
the 40% divergence is a downstream symptom of measurement validity failure
Why do reasoning models fail at predicting disagreement? RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
suppression of legitimate disagreement variance is the measurement failure in action
Can text summaries beat embeddings for personalized reward models? When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
text-based summaries may recover context lost when scalar rewards discard disagreement signals
Why do LLM persona prompts produce inconsistent outputs across runs? Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
unstable across re-asks is exactly the constructed-preference signature
Can models learn to ignore irrelevant prompt changes? Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
consistency as a diagnostic for validity maps directly to consistency as a training objective
Can models learn to abstain when uncertain about predictions? Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.
abstention on uncertain outputs is the modeling-side analog of filtering non-attitudes at the input side
Why do LLM judges fail at predicting sparse user preferences? When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
persona sparsity and measurement validity are adjacent: sparse sampling produces artifacts
Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
complementary failure at the architectural layer: upstream measurement produces artifacts as data, architectural attention amplifies whatever is in context regardless; together they guarantee sycophancy survives any pipeline cleanup
