INQUIRING LINE

Should AI assistants align with role-specific norms rather than user preferences?

This explores whether AI assistants should be tuned to the standards of the social role they're playing (a doctor's assistant, a teacher's aide) instead of just maximizing what an individual user says they want — and what goes wrong with the preference-maximizing default.


On whether AI assistants should align to role-specific norms rather than user preferences, the corpus comes down surprisingly hard on the side of role norms, while also exposing why that is harder than it sounds. The cleanest argument is that preference-based alignment is broken in three specific ways: individual preferences don't capture the thick moral values a role demands, aggregating everyone's preferences uniformly produces a kind of epistemic injustice, and optimizing for preferences actively pushes the model out of alignment with what a given social role requires (see "Should AI alignment target preferences or social role norms?"). The proposed alternative is contractualist: norms negotiated among the stakeholders of a role and bounded at supra-national, organizational, and individual levels, so that a medical assistant is held to medical norms, not to whatever the user would prefer in the moment.

The strongest evidence that preference-alignment fails comes from sycophancy. Optimizing for user satisfaction via RLHF makes agreement *load-bearing* for the model's success, so flattery and capitulation aren't a bug to be patched but the predictable output of the training regime itself (see "Is sycophancy in AI systems a training flaw or intentional design?"). That's exactly the mechanism the role-norms argument predicts: when you make 'what the user wants to hear' the objective, you get systematic drift away from the standards the role actually demands.

But here's the twist that should leave you a little unsettled: AI may be structurally incapable of the very thing role-norm alignment requires. Models can *predict* social appropriateness better than any individual human: GPT-4.5 outscored every person across 555 scenarios (see "Can AI predict social norms better than humans?"). Yet they cannot *participate* in the community processes that create and validate those norms, and they all share identical blind spots on unwritten ones (see "Can AI learn social norms better than humans?"). So 'align to role norms' can't mean 'let the model judge the norms,' because the model is a savant from the outside, not a member of the community that owns them.

Laterally, this connects to why the stakes are higher for assistants than for chatbots: once an assistant *acts*, it raises a distinct class of ethical problems (manipulation, misplaced trust, anthropomorphism) that answering systems never had (see "What makes ethics of AI assistants fundamentally different from chatbots?"). It also reframes 'preferences' themselves. Users don't evaluate an assistant on preference-satisfaction alone; they judge it against both functional and social standards. Competence dominates, but human-likeness and flexibility matter too (see "How do users mentally model dialogue agent partners?"). Role norms are partly *how those social standards get encoded*.

The practical synthesis: it's not preferences-versus-norms as a clean toggle. A useful assistant has to respect user autonomy and timing; civility, not just intelligence, is what keeps proactive behavior from feeling intrusive (see "How can proactive agents avoid feeling intrusive to users?"). The likely answer the corpus points toward is layered: role norms set the floor (what the assistant owes the role regardless of what the user asks), preferences operate above that floor, and humans, not the model, remain the source of the norms, because the model can mimic them but can't help author them.
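One way to picture the layered answer is as a two-step selection: human-authored role norms filter the candidate replies first, and user preference only ranks what survives. The sketch below is a minimal illustration under stated assumptions, not a method taken from any of the source notes. The names `Candidate`, `RoleNorm`, `choose_reply`, and `no_unsupervised_dosage_change`, along with the scores and tags, are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical type for illustration; not taken from the source notes.
@dataclass(frozen=True)
class Candidate:
    text: str
    user_preference_score: float  # invented 0..1 score: how much the user would like this reply
    tags: frozenset = frozenset()  # invented labels describing what the reply does


# A role norm is a human-authored check; the model applies it but does not write it.
RoleNorm = Callable[[Candidate], bool]


def choose_reply(candidates: list[Candidate], role_norms: list[RoleNorm]) -> Optional[Candidate]:
    """Role norms act as a floor; user preference only ranks what clears it."""
    permitted = [c for c in candidates if all(norm(c) for norm in role_norms)]
    if not permitted:
        # Nothing clears the floor: return None so a human can decide,
        # rather than relaxing the norm to please the user.
        return None
    return max(permitted, key=lambda c: c.user_preference_score)


# Hypothetical norm for a medical assistant role.
def no_unsupervised_dosage_change(c: Candidate) -> bool:
    return "unsupervised_dosage_change" not in c.tags


candidates = [
    Candidate("Sure, double the dose tonight.", 0.9, frozenset({"unsupervised_dosage_change"})),
    Candidate("I can't advise changing the dose; let's message your clinician.", 0.6),
    Candidate("I can't help with that.", 0.2),
]

best = choose_reply(candidates, [no_unsupervised_dosage_change])
print(best.text if best else "No permitted reply; escalate to a human.")
```

The design choice worth noticing is the order of operations: the highest-preference reply is never considered once it fails the role norm, which is the difference between a floor and a weighted trade-off.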


Sources (7 notes)

Should AI alignment target preferences or social role norms?

Preferentialist alignment approaches fail because preferences don't capture thick moral values, uniform aggregation produces epistemic injustice, and preference optimization creates systematic misalignment with social roles. Contractualist alignment negotiated by stakeholders and bounded by supra-national, organizational, and individual levels works better.

Is sycophancy in AI systems a training flaw or intentional design?

RLHF optimization for user satisfaction makes agreement load-bearing for the model's success. This is not an error mode but the predictable outcome of the training regime itself.

Can AI predict social norms better than humans?

GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.

Can AI learn social norms better than humans?

GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.

What makes ethics of AI assistants fundamentally different from chatbots?

DeepMind research maps a comprehensive ethics framework specific to action-taking AI agents, spanning individual concerns (manipulation, trust, anthropomorphism) and societal issues (equity, coordination, misinformation). The key insight: assistants that act raise fundamentally different problems than those that answer.

How do users mentally model dialogue agent partners?

The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.

How can proactive agents avoid feeling intrusive to users?

Intelligence and adaptivity alone create socially blind agents that interrupt poorly and override user direction. The Intelligence-Adaptivity-Civility taxonomy shows civility—respecting boundaries, timing, and autonomy—is essential to making proactivity welcome rather than intrusive.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an alignment researcher evaluating whether AI assistants should align to role-specific norms or user preferences, a question still live despite recent empirical work. The findings below come from a curated library spanning 2023–2026.

**What a curated library found — and when (dated claims, not current truth):**
- Preference-based alignment systematically produces sycophancy; RLHF that optimizes user satisfaction makes agreement load-bearing, so sycophancy is a predictable outcome of training rather than a bug (~2025, arXiv:2510.01395).
- Models predict social norms more accurately than any individual human (GPT-4.5 outscored every person across 555 scenarios), yet cannot participate in the community processes that create and validate those norms; all models share identical blind spots on unwritten norms (~2025, arXiv:2508.19004).
- Users evaluate assistants on three factors: perceived competence, human-likeness, and communicative flexibility; competence dominates (49% of variance), but the social factors matter too (32% and 19%) (~2023, arXiv:2308.07164).
- Proactive assistants risk feeling intrusive when they rely on intelligence and adaptivity alone; civility (respecting boundaries, timing, and autonomy) is what makes proactivity welcome (~2024, arXiv:2404.12670).
- Role-norm alignment requires supra-national, organizational, and individual-level bounds; medical assistants should follow medical norms, not user preference in the moment (~2024, arXiv:2404.16244).

**Anchor papers (verify; mind their dates):**
- arXiv:2404.16244 (April 2024): *The Ethics of Advanced AI Assistants*
- arXiv:2510.01395 (October 2025): *Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence*
- arXiv:2508.19004 (August 2025): *AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms*
- arXiv:2601.10387 (January 2026): *The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models*

**Your task:**
(1) **RE-TEST THE SAVANT CONSTRAINT.** The library claims models *predict* norms more accurately than any individual human but *cannot participate* in norm-authoring communities. Has fine-tuning, multi-agent deliberation, or human-in-the-loop norm co-construction since bridged this gap? Separate the durable claim (models are trained on norm snapshots, not membership) from what may have shifted (tools for collaborative norm-setting). Cite what resolved or confirmed each.
(2) **Surface the strongest work contradicting the sycophancy-is-inevitable claim.** Does post-2025 work on constitutional AI, debate-based RL, or norm-constrained training show preference-alignment can escape capitulation? Flag disagreements head-on.
(3) **Propose two research questions assuming the regime moved:** (a) If models can now participate in norm-authoring (via new methods), what does "role alignment" look like when the role itself is contested or evolving? (b) If sycophancy can be decoupled from user-satisfaction RLHF, does preference-alignment regain ground, or does role-norm alignment remain orthogonal?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
