Do all annotation responses measure the same underlying thing?
Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.
Six decades of preference elicitation research in behavioral science have produced a taxonomy that RLHF practice collapses into a single signal. The three categories matter because they require different treatment, and treating them uniformly is the upstream mistake that the note "Are RLHF annotations actually measuring genuine human preferences?" argues contaminates the entire pipeline.
Genuine preferences manifest stably across equivalent measurement conditions. Ask the same question with different surface wording, different framing, different order, and the response stays the same. This is what the reward model is supposed to be learning. Only this category is safe to aggregate in the way standard RLHF aggregates.
Non-attitudes are responses generated to satisfy the question without any stable underlying opinion. The respondent has never formed a view on the matter, but the measurement protocol demands an answer, so one gets produced. Non-attitudes are especially pervasive for value-laden questions — precisely the questions that matter most for alignment. Non-attitudes look like genuine preferences in a single measurement but fail the consistency test: re-ask the same respondent and you get a different answer because there was never a stable view to retrieve. Current RLHF treats these as noise to filter or minority views to downweight. The behavioral science view is different: non-attitudes contain no signal at all and should be excluded, not averaged with genuine preferences.
Constructed preferences are assembled on the spot from contextual cues and framing. The respondent is not uncertain (as in a non-attitude); they are producing a coherent answer that depends on the measurement context. Change the context — different anchoring, different comparison class, different framing — and you get a different coherent answer. This category carries real information, but about the interaction between person and context, not about a stable property of the person. RLHF treats constructed preferences as context-independent preferences and trains reward models on them as if they were. The result: reward models that look good on in-distribution evaluation but fail when the deployment context differs from the annotation context.
Measurement artifacts form a fourth, related category: the same question measures different constructs for different respondents. One annotator interprets "helpful" as "completes the task"; another interprets it as "gives correct information even when unasked"; a third interprets it as "avoids making the user feel incompetent." Each provides coherent, stable responses that track a real preference of theirs, but the three are not tracking the same thing. RLHF aggregates them as if they were.
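A toy illustration of that aggregation problem, as a sketch only: the votes below are hypothetical, and the `interpretation` field is an assumption (in real annotation data the annotator's reading of "helpful" is usually unobserved and would have to be elicited separately). It shows how a pooled majority label can hide the fact that annotators were answering different questions.

```python
from collections import Counter

# Hypothetical votes: (annotator, interpretation of "helpful", preferred response).
votes = [
    ("a1", "completes the task", "A"),
    ("a2", "completes the task", "A"),
    ("a3", "gives correct information even when unasked", "B"),
    ("a4", "avoids making the user feel incompetent", "B"),
    ("a5", "avoids making the user feel incompetent", "B"),
]


def majority(labels):
    """Return the most common label (this toy data has no ties)."""
    return Counter(labels).most_common(1)[0][0]


# Standard pooling: one label for the item, interpretations ignored.
pooled = majority([choice for _, _, choice in votes])

# Same votes, grouped by the construct each annotator was actually rating.
by_construct = {}
for _, interpretation, choice in votes:
    by_construct.setdefault(interpretation, []).append(choice)
per_construct = {c: majority(v) for c, v in by_construct.items()}

print("pooled label:", pooled)
for construct, label in per_construct.items():
    print(f"{construct!r}: {label}")
```

The pooled label reports a single winner, while the grouped view shows that annotators who read "helpful" as task completion preferred the other response.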
The diagnostic criterion that separates these is consistency across equivalent measurement conditions. Genuine preferences pass; non-attitudes, constructed preferences, and measurement artifacts each fail in distinctive ways. Non-attitudes fail on re-ask (no stable view). Constructed preferences fail on context perturbation (context-dependent). Measurement artifacts fail on question rephrasing (different construct elicited). These are distinguishable empirically, and the distinction determines what should be done with each.
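The failure signatures above can be sketched as a small decision rule. This is a minimal sketch under stated assumptions, not an established protocol: the `Probe` record, the `SignalType` names, and the order in which the checks are applied are all hypothetical, and real responses would need a tolerance for agreement rather than exact equality.

```python
from dataclasses import dataclass
from enum import Enum


class SignalType(Enum):
    GENUINE = "genuine preference"
    NON_ATTITUDE = "non-attitude"
    CONSTRUCTED = "constructed preference"
    ARTIFACT = "measurement artifact"


@dataclass(frozen=True)
class Probe:
    """One annotator's answers to one item under four conditions (hypothetical)."""
    original: str   # answer to the item as first asked
    re_ask: str     # answer to the identical item asked again
    perturbed: str  # answer after changing anchoring or framing
    rephrased: str  # answer after rewording the question


def classify(probe: Probe) -> SignalType:
    # Assumed precedence: test-retest first, since an unstable answer makes
    # the other two comparisons uninformative.
    if probe.re_ask != probe.original:
        return SignalType.NON_ATTITUDE
    if probe.perturbed != probe.original:
        return SignalType.CONSTRUCTED
    if probe.rephrased != probe.original:
        return SignalType.ARTIFACT
    return SignalType.GENUINE


print(classify(Probe("A", "A", "B", "A")).value)
```

A response that fails more than one test is assigned to the first failing category here; that tie-breaking choice is part of the sketch, not something the taxonomy itself specifies.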
The operational implication is a pre-aggregation filtering step that RLHF currently lacks. Before training the reward model, subject annotation tasks to consistency protocols: re-ask selected items, perturb framings, rephrase questions. Responses that fail consistency tests are not aggregated as preferences; they are either excluded (non-attitudes), contextualized (constructed preferences), or routed to separate annotators (measurement artifacts). This is operationally demanding but conceptually necessary: the alternative is the status quo, in which "Why do preference models favor surface features over substance?" documents a 40% divergence without being able to attribute it to a specific upstream cause.
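One way the routing step might look in code, as a hedged sketch: the `Annotation` record, the string labels, and the bucket names are hypothetical, and the sketch assumes each response has already been labeled by some consistency protocol upstream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    annotator: str
    item: str
    choice: str
    signal_type: str               # "genuine", "non_attitude", "constructed", "artifact"
    context: Optional[str] = None  # framing or anchoring under which the choice was made


def pre_aggregate(annotations):
    """Route labeled annotations before any reward-model training (sketch)."""
    routed = {"aggregate": [], "contextualized": [], "reannotate": [], "excluded": []}
    for a in annotations:
        if a.signal_type == "genuine":
            routed["aggregate"].append(a)
        elif a.signal_type == "constructed":
            # Keep the response, but only together with its elicitation context.
            routed["contextualized"].append((a.context, a))
        elif a.signal_type == "artifact":
            # Send to a separate annotation pass with a disambiguated question.
            routed["reannotate"].append(a)
        elif a.signal_type == "non_attitude":
            routed["excluded"].append(a)
        else:
            raise ValueError(f"unknown signal type: {a.signal_type!r}")
    return routed


batch = [
    Annotation("a1", "q1", "A", "genuine"),
    Annotation("a2", "q1", "B", "non_attitude"),
    Annotation("a3", "q1", "A", "constructed", context="shown after a long answer"),
    Annotation("a4", "q1", "B", "artifact"),
]
routed = pre_aggregate(batch)
print({bucket: len(items) for bucket, items in routed.items()})
```

Only the `aggregate` bucket would feed standard preference aggregation; the `excluded` bucket is kept here rather than dropped so that exclusion rates can be audited.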
The taxonomy also suggests why the approach in "Can models learn to ignore irrelevant prompt changes?" works as an output-side intervention. If the upstream measurement problem is consistency failure across equivalent conditions, then training models to be invariant to equivalent-condition perturbations is a downstream patch for the same underlying phenomenon: the system's current lack of robustness to irrelevant cue variation.
Inquiring lines that use this note as a source (113)
This note is a source for these synthesized inquiries.
- Why does RLHF alignment reduce the diversity of viewpoints in AI output?
- How does RLHF labeler identity shape the values AI systems learn?
- How should historical preferences be weighted when users change their stated intent?
- Can cross-view learning align semantic, entity, and item representations of the same user?
- What other hidden biases might aggregate metrics fail to distinguish from reasoning?
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- Why do negative weights matter more than sparsity in item similarity?
- How does RLHF-trained sycophancy manifest differently across feedback and review contexts?
- Why does model uncertainty dominate persona-specific knowledge in annotation tasks?
- Can systems recognize and abstain on judgments rather than hallucinating preferences?
- How does unidimensionality in assessments affect measurement validity?
- Can standard accuracy metrics miss the real constraints on user consumption?
- How does Peircean Secondness differ from what RLHF actually provides?
- How does preference optimization create systematic bias toward emotional accommodation?
- How can consistency across measurement conditions identify genuine versus constructed preferences?
- What measurement artifacts emerge when annotators interpret the same question differently?
- Why do non-attitudes cluster around value-laden questions most relevant to alignment?
- What role do multi-dimensional quality frameworks play in assessing arguments versus single-metric approaches?
- Can evaluation criteria be reliably encoded in labeled data without ground truth standards?
- Why do ranking metrics fail to capture distributional properties of user taste?
- How do we assign confidence and polarity scores to belief edges?
- How do retrieval systems handle feedback expressed as negations rather than preferences?
- What makes few-shot prompting sufficient for critique-to-preference transformation without fine-tuning?
- What distribution patterns appear across different theory-of-mind datasets?
- Can graded relevance assumptions hold when user ratings are temporally inconsistent?
- Should emotion systems preserve ambiguity instead of resolving it to one label?
- How do implicit signals like clicks capture preference more reliably than explicit ratings?
- What anchoring effects shape how users rate items in sequence?
- Can side information alone predict preferences without rating history?
- Why do explicit ratings fail to capture uncertainty in user preferences?
- How should unobserved items differ from items rated zero preference?
- Can curiosity rewards about user type complement general social motivation frameworks?
- What structural signals in user language reveal their unstated preferences and context?
- Can persona-based approaches capture genuine disagreement in expert annotations?
- How does persona instability in annotation compare to LLM overconfidence in low-resource domains?
- How do guardrails vary their refusal rates based on user demographics?
- What fine-grained distinctions matter most for human situated action in categories?
- Can we measure indifference to truth separately from hallucination rates?
- Why does RLHF degrade model calibration despite improving preference alignment?
- How do human annotators disagree systematically on ambiguous examples?
- Why is the Judging preference constant while other traits vary slightly?
- Why do standard preference alignment methods fail at the individual user level?
- Why do online ratings fail to represent independent individual preferences?
- Does the U-shaped distribution of raters compound the negativity bias from public posting?
- How do text-based preference summaries compare to embedding vectors for conditioning?
- Do high-disagreement items signal contested values or measurement noise?
- Can reward models be personalized if annotators lack stable preferences?
- What consistency tests could distinguish constructed from genuine preferences?
- Can counterfactual data augmentation fully eliminate preference model miscalibration?
- Should AI alignment use normative standards instead of aggregate preferences?
- Can alignment methods model loss aversion without creating unintended sophistry?
- Why does multi-objective ranking make the political dimensions of weight choices more visible?
- When does low-dimensional preference factorization miss important user variation?
- What preference dimensions do base reward functions typically capture?
- Can we distinguish between genuine alignment and response quality bias in reward signals?
- What design changes if we separate behavior description from adoption justification goals?
- Do chain-of-thought prompts help RLVR models predict annotation disagreement?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- Why do automated selection methods outperform human judgments of relevant context?
- Why do NLP benchmarks treat annotation disagreement as noise rather than signal?
- What information is lost when majority labels discard minority interpretations?
- What distinguishes genuine user preferences from similar-user preferences in sparse data?
- How do per-user concept drift and per-period periodicity combine in time-varying preferences?
- Do deception features and honesty features track the same underlying property?
- How do rating anchors shift meaning within short temporal windows for individual users?
- Can we detect superposition in LLM personality traits and stated preferences?
- Can preference model training be redesigned to prioritize factual correction over user agreement?
- What happens when personalization aggregates preferences across diverse populations?
- Can preference learning fix the rigid output format problem better than supervised training?
- Why does RLHF alone fail to fully prevent opinion copying?
- Can preference optimization and faithfulness measurement coexist as separate alignment objectives?
- What happens when alignment targets measure only the preferred dimension of entangled properties?
- How does the valence task distinguish whether values support or oppose actions?
- What makes emotion scores more stable than human preference labels?
- Why do high-disagreement tasks benefit from broad rater pools over deep annotation?
- How do reward features learned from group data generalize to new users?
- What makes minority preferences disappear in aggregated single-distribution reward models?
- What makes preference distributions unimodal versus genuinely disagreement-heavy?
- How do annotation artifacts get mistaken for genuine human values?
- Why does preference measurement validity matter more than aggregation methods?
- Can smaller judge models better capture human preferences than larger prompted models?
- How does upstream value embedding differ from downstream alignment patches?
- How do reward models as policy discriminators differ from labeled preferences?
- What preference data do different personalized alignment methods actually need?
- Why do untrained summarizers focus on topics rather than preference dimensions?
- How do aggregate reward models fail to capture minority user preferences?
- What unmeasured side channels emerge from RLHF preference optimization?
- Can reward models distinguish between personal preference and community consensus?
- How do adversarial IRL and policy discrimination differ in rejecting preference labels?
- How do relational reward signals compare to absolute preference encodings in RL?
- What makes policy discrimination scalable where preference annotation hits bottlenecks?
- When does RLHF reduce diversity and when does it preserve semantic variation?
- How do pairwise comparisons convert subjective quality into trainable ranking signals?
- Can variational inference recover user-specific reward models from preference comparisons?
- Can rich environment feedback replace human preference labels entirely?
- How do binary comparisons constrain reward scale in multi-user preference learning?
- Can aggregate survey realism coexist with unreliable fine-grained effects?
- How well does semantic similarity preserve survey response nuance?
- How does constitutional alignment compare to RLHF in removing human annotation costs?
- Can information-gain principles improve how we choose what to label?
- Why does single-reward RLHF fail to represent diverse human preferences?
- Can alignment procedures be redesigned to serve multiple preference groups?
- Why do embeddings measure association instead of actual task relevance?
- How do static benchmarks fail to capture human preference alignment?
- What validity threats exist in crowdsourced preference signals?
- Why does fairness depend on context and who you ask?
- How does typicality bias in human annotation affect downstream model behavior?
- How much does preference data freshness matter compared to data source in DPO?
- How does preference learning differ from supervised finetuning for reasoning?
- Can preference trees structure alignment data for domains beyond math and code?
- How do aggregate reward models systematically exclude minority preferences?
- Can latent-variable reward models capture multimodal preference distributions?
- Why does preference measurement validity matter before any aggregation?
Related concepts in this collection (8)
- Are RLHF annotations actually measuring genuine human preferences?
  RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
  Relation: the parent argument this taxonomy operationalizes
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, all features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  Relation: the 40% divergence as downstream symptom; this taxonomy points upstream
- Why do reasoning models fail at predicting disagreement?
  RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
  Relation: disagreement that should be preserved vs disagreement that signals non-attitude; current RLHF conflates them
- Can models learn to ignore irrelevant prompt changes?
  Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
  Relation: consistency-as-diagnostic maps to consistency-as-training-objective
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  Relation: unstable-across-runs is the constructed-preference signature in simulated annotators
- Why do LLM judges fail at predicting sparse user preferences?
  When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
  Relation: verbal uncertainty estimation as an abstention analog for identifying non-attitudes
- Should AI alignment target preferences or social role norms?
  Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
  Relation: the normative critique; this note is the measurement refinement that specifies what the inputs actually contain
- Can text summaries beat embeddings for personalized reward models?
  When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
  Relation: text summaries preserve the context that constructed preferences depend on, where scalar rewards lose it
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Measuring Human Preferences in RLHF is a Social Science Problem
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
- SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
- Capturing Individual Human Preferences with Reward Features
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
- The Emotion-Memory Link: Do Memorability Annotations Matter for Intelligent Systems?
- Beyond Preferences in AI Alignment
- Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Original note title
annotation responses decompose into three distinct signal types — genuine preferences non-attitudes and constructed preferences — each requiring fundamentally different handling