What does egalitarian social choice theory contribute to AI alignment?
This reads the question as: what does the formal theory of fairly aggregating individual preferences into a collective choice (voting rules, equal weighting, welfare aggregation) actually buy us when we try to align AI with human values, and where does the corpus say it breaks down?
This explores what social choice theory's egalitarian impulse (give everyone equal weight, then aggregate preferences into one collective answer) contributes to AI alignment. The corpus's verdict is mostly a warning: the egalitarian move that looks fairest in theory is where alignment quietly goes wrong. The dominant 'preferentialist' approach to alignment is essentially applied social choice: collect human preference judgments (the human feedback in RLHF) and optimize a model toward their aggregate. The corpus argues this inherits social choice theory's deepest known problem. When you aggregate uniformly, you don't get a neutral average; you get the majority's values stamped onto everyone, which "Should AI alignment target preferences or social role norms?" names as epistemic injustice: minority moral framings get rounded off. Its proposed fix is anti-aggregative: contractualist alignment negotiated by stakeholders at distinct levels, closer to a bargaining table than a ballot box.
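The rounding-off effect described above can be illustrated with a toy sketch. Everything in it is hypothetical: the annotator pool, the two framings "A" and "B", and the `aggregate_uniform` helper are invented for illustration and are not taken from any of the cited notes. It assumes the simplest equal-weight rule, plurality, where each annotator counts once and the most common label wins.

```python
from collections import Counter


def aggregate_uniform(judgments):
    """Equal-weight plurality: each judgment counts once, the top label wins."""
    counts = Counter(judgments)
    winner, _ = counts.most_common(1)[0]
    return winner


# Hypothetical annotator pool: 7 hold moral framing "A", 3 hold framing "B".
pool = ["A"] * 7 + ["B"] * 3

# Ten hypothetical prompts, each labeled by aggregating the same pool.
labels = [aggregate_uniform(pool) for _ in range(10)]

# Share of the minority framing among annotators vs. among training labels.
minority_share_in_pool = pool.count("B") / len(pool)
minority_share_in_labels = labels.count("B") / len(labels)

print(minority_share_in_pool, minority_share_in_labels)
```

In this toy setup, a framing held by 30% of annotators appears in none of the aggregated labels. The result is not a 70/30 blend but the majority view applied everywhere, which is the pattern the paragraph above describes.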
The sharpest contribution comes from flipping the egalitarian goal on its head. Classic social choice wants to *resolve* disagreement into a single ranking; "Can AI systems preserve moral value conflicts instead of averaging them?" argues alignment should *preserve* it. ValuePrism tracks 218k values across 31k situations and deliberately refuses to vote them down to one answer, keeping the conflicts legible. The egalitarian intuition here isn't 'count everyone equally then collapse'; it's 'represent everyone's value even when it loses.' That reframes equality as visibility rather than aggregation, which is a genuinely different design target from maximizing a social welfare function.
There's also a participation problem that social choice theory assumes away. The whole apparatus presumes preferences exist prior to the vote, ready to be counted. But "Can AI predict social norms better than humans?" and "Can AI learn social norms better than humans?" show that norms aren't a fixed distribution to sample: GPT-4.5 can predict appropriateness better than any individual human, yet structurally can't enter the community process that *creates and validates* the norms in the first place. Egalitarian aggregation has nothing to say about who gets to author the menu of options being voted on, which may be the more decisive form of power.
Two further notes widen the frame. "Does incremental AI replacement erode human influence over society?" suggests the relevant 'votes' in real societal alignment aren't survey responses but the economic dependence on human labor: as AI removes that dependence, the implicit channel through which human preferences steer institutions decays, with no formal aggregation rule involved. And "Can models learn behavioral principles without preference labels?" (SAMI) shows you can align a model to written principles *without preference labels at all* by maximizing mutual information between a constitution and responses. That is an end-run around the entire collect-and-aggregate paradigm, in which a weaker model can even author principles that align a stronger one.
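To make "maximizing mutual information between a constitution and responses" concrete, here is a minimal sketch of one common way to do it: a symmetric, InfoNCE-style contrastive loss. This is an assumption swapped in for illustration; the notes say only that SAMI increases mutual information without preference labels, not which estimator it uses. The matrix `logp` and both example inputs are hypothetical; `logp[i][j]` stands for the log-probability a model assigns to response `j` when conditioned on constitution `i`.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def symmetric_contrastive_loss(logp):
    """Average of row-wise and column-wise cross-entropy with diagonal targets.

    logp is a square matrix; entry [i][i] pairs constitution i with the
    response that was generated under it. Lower loss corresponds to a higher
    lower bound on constitution-response mutual information.
    """
    n = len(logp)
    rows = [log_softmax(row) for row in logp]
    cols = [log_softmax([logp[i][j] for i in range(n)]) for j in range(n)]
    row_term = -sum(rows[i][i] for i in range(n)) / n
    col_term = -sum(cols[j][j] for j in range(n)) / n
    return 0.5 * (row_term + col_term)


# Hypothetical case 1: each response is far likelier under its own constitution.
matched = [[-1.0, -9.0], [-9.0, -1.0]]
# Hypothetical case 2: responses carry no information about the constitution.
unrelated = [[-5.0, -5.0], [-5.0, -5.0]]

loss_matched = symmetric_contrastive_loss(matched)
loss_unrelated = symmetric_contrastive_loss(unrelated)
print(loss_matched, loss_unrelated)
```

The point of the sketch is that the training signal comes entirely from which constitution a response was generated under. No human ever ranks two responses, so there is nothing to aggregate.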
So the contribution is largely diagnostic. Egalitarian social choice gives alignment its default vocabulary — equal weighting, preference aggregation, welfare functions — and the corpus uses that vocabulary mostly to mark its limits: uniform aggregation manufactures injustice, voting destroys pluralism it should preserve, and counting preferences ignores who gets to participate in making them. The more promising directions in the collection — contractual negotiation, explicit value-tension modeling, constitution-from-principles — are all reactions against the aggregative core, not refinements of it.
Sources (6 notes)
Preferentialist alignment approaches fail because preferences don't capture thick moral values, uniform aggregation produces epistemic injustice, and preference optimization creates systematic misalignment with social roles. Contractualist alignment negotiated by stakeholders and bounded by supra-national, organizational, and individual levels works better.
ValuePrism demonstrates that AI can track 218k values across 31k situations while preserving conflicts rather than resolving them through voting. Four modeling tasks—generation, relevance, valence, and explanation—make pluralistic moral reasoning computationally tractable.
GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.
GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.
Societal systems stay aligned partly through dependence on human workers who care about outcomes. As AI replaces this labor, explicit alignment controls weaken and systems drift from human preferences. Interdependent misalignment across institutions could become irreversible.
SAMI finetunes language models to increase mutual information between constitutions and responses without preference labels or demonstrations. A mistral-7b trained this way outperformed base and instruction-tuned baselines, and surprisingly, a weaker model could write principles to align a stronger one.