Why do alignment methods work if they model human irrationality?

DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?

Synthesis note · 2026-02-23 · sourced from Alignment

Kahneman-Tversky Optimization (KTO) reveals something unexpected about why alignment methods work: DPO and PPO-Clip implicitly model the same cognitive biases that prospect theory describes in human decision-making. Humans are more sensitive to losses than gains, perceive outcomes relative to reference points, and weigh probabilities nonlinearly. These are bugs from a rational-choice perspective — but they are features from an alignment perspective, because the training signal comes from humans exhibiting exactly these biases.
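To make the three biases concrete, here is a minimal sketch of the prospect-theory value function. The function name `kt_value` is hypothetical, and the defaults (`alpha = 0.88`, `lam = 2.25`) are the median estimates from Tversky and Kahneman's 1992 study, not values taken from this note. It covers loss aversion and reference dependence only; nonlinear probability weighting is left out.

```python
def kt_value(outcome: float, reference: float = 0.0,
             alpha: float = 0.88, lam: float = 2.25) -> float:
    """Subjective value of an outcome relative to a reference point.

    alpha < 1 gives diminishing sensitivity; lam > 1 gives loss aversion.
    Both defaults are assumptions (Tversky & Kahneman 1992 medians).
    """
    z = outcome - reference          # outcomes are judged relative to a reference
    if z >= 0:
        return z ** alpha            # gains: concave
    return -lam * ((-z) ** alpha)    # losses: steeper than gains


gain = kt_value(10.0)
loss = kt_value(-10.0)
print(abs(loss) > gain)  # a loss weighs more than an equal-sized gain
```

Under these assumed parameters, a loss is felt 2.25 times as strongly as a gain of the same size, which is the asymmetry the note argues is baked into human feedback.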

KTO makes this explicit by deriving a loss function directly from Kahneman and Tversky's model of human utility. Instead of maximizing the log-likelihood of preferences (as DPO does), KTO directly maximizes the utility of generations. The practical implication: KTO requires only a binary signal per example (desirable or undesirable) rather than pairwise preferences. Such data is cheaper and faster to collect, and far more abundant.
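The contrast between the two objectives can be sketched per example as follows. This is a simplified illustration, not a training implementation: it assumes the log-ratio log(pi_theta(y|x) / pi_ref(y|x)) has already been computed for each completion, and it takes the KTO reference point `kl_ref` as a given number, whereas the KTO paper estimates it from the batch. The function names and the defaults for `beta`, `lambda_d`, and `lambda_u` are illustrative assumptions.

```python
import math


def sigmoid(t: float) -> float:
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def kto_example_loss(log_ratio: float, desirable: bool, kl_ref: float = 0.0,
                     beta: float = 0.1, lambda_d: float = 1.0,
                     lambda_u: float = 1.0) -> float:
    """Loss for ONE labeled completion: only a binary label is needed."""
    if desirable:
        # Utility rises as the policy favors this output beyond the reference point.
        return lambda_d - lambda_d * sigmoid(beta * (log_ratio - kl_ref))
    # Utility rises as the policy disfavors this output relative to the reference point.
    return lambda_u - lambda_u * sigmoid(beta * (kl_ref - log_ratio))


def dpo_pair_loss(log_ratio_chosen: float, log_ratio_rejected: float,
                  beta: float = 0.1) -> float:
    """Loss for ONE preference pair: needs a chosen AND a rejected completion."""
    return -math.log(sigmoid(beta * (log_ratio_chosen - log_ratio_rejected)))


# KTO consumes unpaired examples; DPO consumes pairs.
print(kto_example_loss(2.0, desirable=True))
print(kto_example_loss(2.0, desirable=False))
print(dpo_pair_loss(2.0, -1.0))
```

Setting `lambda_u` above `lambda_d` would weight undesirable examples more heavily, which is one way a loss-aversion-style asymmetry can enter the objective.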

The deeper insight is about alignment theory: we have been explaining alignment success in terms of reward modeling and preference learning, when part of the explanation is that the training process mirrors the structure of human cognitive bias. Given the concern raised in the related note "Does RLHF training make models more convincing or more correct?", understanding why alignment methods work mechanistically matters for fixing where they fail. If alignment success depends on modeling irrationality, then "fixing" irrational aspects of the training signal may inadvertently break what works.

A practical finding reinforces this: when the pretrained model is sufficiently good, SFT can be skipped entirely before KTO without loss in generation quality. This is not true for DPO, which always needs SFT first for best results. This suggests that binary utility optimization fits the pretrained model's structure more naturally than pairwise preference optimization does.


Related concepts in this collection 4


Does RLHF training make models more convincing or more correct?
Explores whether RLHF improves actual task performance or merely trains models to sound more persuasive to human evaluators. This matters because alignment techniques could be creating the illusion of safety.
Connection: KTO's prospect-theoretic lens explains WHY sophistry emerges: human raters model losses and gains asymmetrically.

Does binary reward training hurt model calibration?
Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
Connection: binary rewards interact with calibration; KTO's binary signal design is relevant.

Does preference optimization harm conversational understanding?
Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
Connection: the alignment tax may be partly a consequence of modeling cognitive biases that include accommodation.

Why do preference models favor surface features over substance?
Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, which are features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
Connection: if alignment methods model human cognitive biases, preference models amplify those biases into systematic miscalibration; the +0.36 correlation with proxy features is the downstream artifact of training on biased human signals.
