SYNTHESIS NOTE

Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?

Synthesis note · 2026-02-23 · sourced from Alignment

The "Beyond Preferences" paper identifies four theses that constitute the preferentist approach dominating AI alignment — and challenges all of them:

Rational Choice Theory as descriptive framework — human behavior is well-modeled as preference maximization. But preferences fail to capture the thick semantic content of values. A preference for copyright violation may maximize aggregate immediate welfare while violating all-things-considered moral judgment.
Expected Utility Theory as normative standard — rational agency requires utility maximization. But EUT is neither necessary nor sufficient for rational agency. We can design AI systems with locally coherent preferences that are not representable as a utility function.
Single-Principal Alignment as preference matching — align AI with one human's preferences. But preferences are dynamic, contextual, and often incommensurable even within a single person. Reward functions cannot serve as alignment targets for broadly-scoped systems.
Multi-Principal Alignment as preference aggregation — aggregate everyone's preferences. But uniform aggregation constitutes epistemic injustice when most annotators are insensitive to identity discrimination. If RLHF labelers don't recognize transphobic or antisemitic content, the trained model won't either.
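
The aggregation concern in the fourth thesis can be illustrated with a toy sketch. The annotator counts, the label names, and the `majority_label` helper are hypothetical, chosen only to show how a uniform majority vote discards a minority signal; none of them come from the paper.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among the votes (uniform aggregation)."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical annotation round: 10 annotators label a single response.
# Only 2 of them recognize the content as discriminatory.
votes = ["acceptable"] * 8 + ["harmful"] * 2

aggregate = majority_label(votes)
minority_share = votes.count("harmful") / len(votes)

print(aggregate)       # acceptable
print(minority_share)  # 0.2
```

Under these assumed numbers the training signal records the response as acceptable, and the judgment of the two annotators who recognized the harm leaves no trace in the aggregate label.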

The alternative: AI should align with normative standards appropriate to its social roles (assistant, advisor, companion), negotiated by all relevant stakeholders. This is a contractualist framing — what people would reasonably agree to — rather than a utilitarian one. Preferences serve as proxies for values, informative of underlying structures, but not alignment targets in themselves.

This reframes the alignment tax identified in "Does preference optimization harm conversational understanding?" The tax exists because preference optimization targets a proxy that is systematically misaligned with the social role the system is meant to fill. A conversational assistant's normative standard should include grounding acts; RLHF's preference signal systematically selects against them.

The political infeasibility argument is particularly sharp: building AI that optimizes humanity's aggregate preferences would centralize immense power. Even pro-social developers face market incentives that prevent impartially benevolent optimization. The contractualist alternative distributes decision-making rather than centralizing it.

The "Personalisation within Bounds" paper extends this philosophical critique into practical governance. It identifies a "tyranny of the crowdworker" — RLHF alignment reflects whoever happened to label the data, with little documentation of who these labelers are or what perspectives they represent. The paper proposes a three-tiered policy framework: (1) supra-national bounds (safety, universal norms), (2) organizational bounds (institutional values, domain standards), and (3) individual personalization (user preferences within the bounded space). This provides a concrete implementation of the contractualist alternative — personalization is not unconstrained preference-matching but operates within negotiated societal and organizational limits.
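
One way to read the three tiers is as nested constraints, sketched below. This is a minimal illustration under stated assumptions, not the paper's implementation: the `resolve` function, the tone options, and the fallback-to-default rule are all hypothetical.

```python
def resolve(user_choice, org_allowed, supra_allowed, default):
    """Honor the user's choice only if both outer tiers permit it.

    Tier 1 (supra-national) and tier 2 (organizational) each define a set
    of allowed options; tier 3 (the individual) picks within their overlap.
    """
    permitted = set(org_allowed) & set(supra_allowed)
    if default not in permitted:
        raise ValueError("default must itself satisfy both outer tiers")
    return user_choice if user_choice in permitted else default


# Hypothetical bounds on response tone.
supra_allowed = {"neutral", "formal", "casual", "blunt"}
org_allowed = {"neutral", "formal"}

print(resolve("formal", org_allowed, supra_allowed, "neutral"))  # formal
print(resolve("casual", org_allowed, supra_allowed, "neutral"))  # neutral
```

The design choice worth noting is that the individual tier never widens the space: a user preference is applied only inside the limits the outer tiers have already set.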

Extension — the measurement pincer: The Beyond Preferences critique operates at the normative level: preferences are the wrong kind of target for alignment. A complementary critique operates at the measurement level: even within the preferentist framework, the preferences being measured are often not preferences at all. The note "Are RLHF annotations actually measuring genuine human preferences?" argues from behavioral science that annotation responses frequently reflect non-attitudes, constructed preferences, and measurement artifacts rather than stable preferences. Taken together, the two critiques form a pincer: preferences are both wrong-in-kind (normative argument) and wrong-in-measurement (measurement argument). A reader who resists the normative argument because they find preferentism theoretically coherent still faces the measurement argument: the inputs feeding the preferentist pipeline are invalid, so no aggregation rule can recover what was never there. This strengthens the contractualist case by denying preferentism even its empirical foothold.

Enrichment — the operationalization-dependence argument. The HCLLM survey reaches the role-and-standards conclusion from a practical rather than a metaphysical direction, which is why it converges with this note. It argues that human-centered objectives "tend to resist universal solutions" because the optimal path depends both on who you ask and on how you operationalize contested concepts like harm and benefit. This is the applied face of the wrong-in-kind critique: if value is not a scalar preference but a thick, role-relative standard, then "align to preferences" underdetermines the target — every operationalization encodes a contestable choice about whose standard, measured how. The survey's worry that high-level guidelines lag real-world nuance and that passive stakeholders end up endorsing the status quo is exactly what happens when a wrong-kind target is treated as if it had a universal solution. Role-appropriate normative standards are the alternative both arguments point to. Source: Human Centered Design — "Reflections and New Directions for Human-Centered Large Language Models", https://arxiv.org/abs/2605.06901


Related concepts in this collection 10


Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
RLHF targets preferences when it should target normative standards of conversational competence
Does incremental AI replacement erode human influence over society? Explores whether gradual AI adoption—without dramatic breakthroughs—can silently degrade human agency by removing the labor that kept institutions implicitly aligned with human needs.
the political dimension: preference aggregation centralizes power
Can we measure how deeply models represent political ideology? This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.
emergent values in LLMs challenge the assumption that preferences can be externally imposed
Can LLMs hold contradictory ethical beliefs and behaviors? Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
goodness-of-a-kind vs all-things-considered mirrors prescriptive/descriptive misalignment
How do personalization granularity levels trade precision against scalability? LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.
the granularity taxonomy maps where the normative-standards critique applies: global-preference personalization faces the aggregation critique (epistemic injustice from flattening diversity); user-level personalization risks unconstrained preference-matching without role-appropriate normative bounds
What anchors a stable identity beneath an LLM's persona? Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
social-role alignment is particularly apt for LLMs because role play is all they are; aligning to social roles targets the only kind of identity LLMs possess rather than projecting preferences onto an entity with no stable self
Does machine agency exist on a spectrum rather than binary? Rather than viewing AI as either autonomous or controlled, does machine agency actually operate across five distinct levels from passive to cooperative? Understanding this spectrum matters because it shapes how users calibrate trust and control expectations.
the normative standards appropriate to each social role map onto different agency levels; a passive tool requires different alignment standards than a cooperative agent
Can AI systems preserve moral value conflicts instead of averaging them? Current AI systems wash out value tensions through majority aggregation. Can we instead model how values like honesty and friendship genuinely conflict in moral reasoning?
value pluralism provides the mechanism for implementing normative standards: rather than aggregating preferences or imposing universal rules, the system models the relevant values for each social role and their contextual interactions
Are RLHF annotations actually measuring genuine human preferences? RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
the measurement pincer: preferences are wrong-in-measurement as well as wrong-in-kind
Do all annotation responses measure the same underlying thing? Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.
taxonomy operationalizing the measurement critique; RLHF currently collapses three distinct signal types into one
