Can AI-assisted alignment eventually solve fairness at scale?
This explores whether better alignment techniques, scaled up, can eventually deliver fair AI for everyone — and the corpus suggests fairness isn't the kind of problem that 'scale' solves.
This reads the question as: can we keep improving alignment methods until fairness becomes a solved, general property of large models? The collection's strongest answer is that the framing itself is the trap. Fairness can't be certified for a general-purpose model at all: the standard group-fairness and fair-representation frameworks either fail to extend to open-ended language tasks or become intractable once you try to cover countless populations and contexts. Fairness has to be pursued per use-case, with developer responsibility and affected stakeholders in the room (Can fairness frameworks extend to general-purpose language models?). "At scale" and "fairness" pull in opposite directions.
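To make the per-use-case point concrete, here is a minimal sketch, assuming a single deployment with binary decisions and one known group label per record. It computes a demographic parity gap, one common group-fairness measure. The function name, record layout, and data are illustrative assumptions, not taken from the corpus.

```python
from collections import defaultdict

def demographic_parity_gap(records):
    """Return (gap, rates) for one fixed use-case.

    records: iterable of (group, decision) pairs, where decision is 0 or 1.
    gap is the largest difference in positive-decision rate between groups.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += decision
    if not totals:
        raise ValueError("need at least one record")
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates

# Hypothetical decisions from a single, well-defined task.
records = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
           ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
gap, rates = demographic_parity_gap(records)
print(rates, gap)
```

The measure only makes sense once the decision, the groups, and the population are pinned down. An open-ended language model has no single decision to audit, and the set of groups and contexts to check grows without bound, which is the intractability the note describes.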
Worse, the alignment process is not a neutral instrument you can simply aim at fairness: it actively manufactures disparities. RLHF and DPO measurably advantage some English dialects and global viewpoints over others, and those gaps trace back to deliberate design choices in who annotates and how tasks are framed, not to unavoidable technical limits (How does LLM alignment affect representation across dialects?). The same optimization quietly flattens what models will even say: alignment rewards hedged, calibrated neutrality, which structurally suppresses speech acts like alarm or warning (Does alignment training suppress socially necessary speech acts?). So the tool you'd use to 'solve' fairness has its own systematic biases baked in.
There's also a scaling paradox lurking. More alignment tends to make models more alike, not more representative. A study of 70+ models across 26K open-ended queries found an 'Artificial Hivemind' effect: the models converge on near-identical outputs because they share training data and alignment recipes (Do different AI models actually produce diverse outputs?). If fairness means honoring diverse perspectives, a process that homogenizes outputs is working against you even as it 'improves.'
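One way to put a number on this kind of convergence is to compare every pair of model responses to the same prompt. The sketch below is an assumption-laden illustration: it uses word-level Jaccard overlap as a stand-in similarity measure, which may differ from whatever metric the study itself used, and the responses are invented.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty responses count as identical
    return len(sa & sb) / len(sa | sb)

def mean_pairwise_similarity(outputs):
    """Average Jaccard similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical responses from three models to one open-ended prompt.
homogeneous = ["time is a river", "time is a river", "time is a river"]
diverse = ["time is a river", "clocks lie to us", "yesterday never ends"]

print(mean_pairwise_similarity(homogeneous))
print(mean_pairwise_similarity(diverse))
```

A score near 1 across many prompts would indicate the hivemind pattern. A lexical measure like this misses paraphrases, so an embedding-based similarity would be the more realistic choice in practice.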
The deepest cut comes from work on social norms. GPT-4.5 can predict what's socially appropriate more accurately than any individual human, yet it structurally cannot enter the community processes that create and validate those norms in the first place (Can AI predict social norms better than humans?; Can AI learn social norms better than humans?). Fairness is a norm we make together, not a pattern to be matched from outside. An AI that's superhuman at predicting fairness judgments is still on the wrong side of the glass when it comes to deciding what fair means.
The corpus is more hopeful about what AI-assisted alignment can do: it can be cheaper, more controllable, and more participatory. A thousand well-curated examples can rival datasets orders of magnitude larger (Can careful curation replace massive alignment datasets?), decoding-time methods can shift behavior without touching the base model's weights (Can decoding-time tuning preserve knowledge better than weight fine-tuning?), and crowd preference at scale produces a credible evaluation signal (Can crowdsourced votes reliably rank language models?). The honest takeaway is that AI-assisted alignment can make fairness more achievable case by case, but the dream of a single technique that scales to 'fair for everyone' runs into the fact that fairness is contextual, participatory, and partly defined by the very communities the model can predict but never join.
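The decoding-time idea mentioned above can be sketched as simple logit arithmetic. This assumes the usual description of proxy-tuning, in which a large base model's next-token logits are shifted by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart. The function names and the toy three-token vocabulary are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift base logits by (expert - antiexpert), then renormalize.

    The base model's weights are never modified; only its output
    distribution is adjusted at decoding time.
    """
    if not (len(base_logits) == len(expert_logits) == len(antiexpert_logits)):
        raise ValueError("all logit vectors must share one vocabulary")
    shifted = [b + (e - a)
               for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)]
    return softmax(shifted)

# Toy logits over a three-token vocabulary (hypothetical values).
base = [2.0, 1.0, 0.0]    # large untuned model prefers token 0
expert = [0.0, 2.0, 0.0]  # small tuned model prefers token 1
anti = [0.0, 0.0, 0.0]    # small untuned model has no preference

probs = proxy_tuned_probs(base, expert, anti)
print(probs)
```

In this toy case the shift moves the most likely token from index 0 to index 1, while the base model itself stays unchanged. When the expert and anti-expert agree, the shift is zero and the base distribution is returned as-is.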
Sources (9 notes)
Group fairness and fair representation frameworks break on general-purpose LLMs because they either fail to extend logically to unstructured language tasks or become intractable across countless populations and contexts. Fairness must be pursued per use-case with developer responsibility and stakeholder participation.
RLHF and DPO alignment create measurable disparities across English dialects and global opinions, while improving performance in some languages. These disparities reflect deliberate design choices in annotator selection and task definition, not inevitable outcomes.
RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts requiring overclaiming relative to baseline—like alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.
GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.
LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Chatbot Arena's 240K+ crowdsourced preference votes produce credible model rankings because the underlying questions are diverse and discriminating, and crowd judgments correlate with expert raters—validating human preference as a scalable evaluation signal.