SYNTHESIS NOTE
Psychology, Society, and Alignment · Training, RL, and Test-Time Scaling

Why do alignment methods work if they model human irrationality?

DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?

Synthesis note · 2026-02-23 · sourced from Alignment

Kahneman-Tversky Optimization (KTO) reveals something unexpected about why alignment methods work: DPO and PPO-Clip implicitly model the same cognitive biases that prospect theory describes in human decision-making. Humans are more sensitive to losses than to gains, perceive outcomes relative to reference points, and weigh probabilities nonlinearly. From a rational-choice perspective these are bugs. From an alignment perspective they are features, because the training signal comes from humans who exhibit exactly these biases.
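
To make loss aversion and reference dependence concrete, here is a minimal sketch of the value function from Tversky and Kahneman's 1992 formulation of prospect theory, using their median parameter estimates (alpha = 0.88, lambda = 2.25). The function name kt_value and the example amounts are illustrative assumptions, not part of the source note, and probability weighting is left out.

```python
def kt_value(z: float, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Subjective value of an outcome z, measured relative to a reference point.

    z > 0 is a gain and z < 0 is a loss. alpha < 1 gives diminishing
    sensitivity; lam > 1 makes losses weigh more than equal-sized gains.
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** alpha)


# Hypothetical outcomes: gaining 10 versus losing 10 around the same reference point.
gain = kt_value(10.0)
loss = kt_value(-10.0)

# The loss is felt more strongly than the equal-sized gain (loss aversion).
print(abs(loss) > gain)

# Doubling a gain less than doubles its subjective value (diminishing sensitivity).
print(kt_value(20.0) < 2 * kt_value(10.0))
```

Both checks print True under these parameters: the curve is steeper for losses than for gains and flattens as outcomes move away from the reference point.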

KTO makes this explicit by deriving a loss function directly from Kahneman and Tversky's model of human utility. Instead of maximizing the log-likelihood of preferences (as DPO does), KTO directly maximizes the utility of generations. The practical implication is that KTO requires only a binary signal for each output (desirable or undesirable) rather than pairwise preferences. Such data is cheaper and faster to collect, and more abundant.
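
The per-example objective can be sketched as follows. This is a simplified illustration, assuming the commonly stated form of the KTO loss: an implied reward equal to the log-ratio of policy to reference likelihood, a logistic value function centred on a reference point z0, and separate weights for desirable and undesirable examples. All names (kto_example_loss, policy_logp, ref_logp, z0, lambda_d, lambda_u) and the default values are assumptions made for illustration. In actual training, z0 is an estimate of the KL divergence between policy and reference model computed over a batch; here it is simply passed in as a number.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; fine for the moderate inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def kto_example_loss(
    policy_logp: float,   # log pi_theta(y | x) under the model being trained
    ref_logp: float,      # log pi_ref(y | x) under the frozen reference model
    desirable: bool,      # the only label needed: thumbs up or thumbs down
    z0: float,            # reference point (assumed given; a KL estimate in practice)
    beta: float = 0.1,    # hypothetical default controlling sensitivity
    lambda_d: float = 1.0,  # weight on desirable examples (hypothetical default)
    lambda_u: float = 1.0,  # weight on undesirable examples (hypothetical default)
) -> float:
    """Loss for a single (prompt, output, binary label) example."""
    implied_reward = policy_logp - ref_logp
    if desirable:
        # Value rises as the output becomes more likely than the reference point.
        value = lambda_d * sigmoid(beta * (implied_reward - z0))
        return lambda_d - value
    # For undesirable outputs, value rises as the output becomes less likely.
    value = lambda_u * sigmoid(beta * (z0 - implied_reward))
    return lambda_u - value


# No paired comparison is needed: each example carries its own label.
raised = kto_example_loss(policy_logp=-5.0, ref_logp=-10.0, desirable=True, z0=0.0)
lowered = kto_example_loss(policy_logp=-15.0, ref_logp=-10.0, desirable=True, z0=0.0)
print(raised < lowered)
```

The comparison prints True: a desirable output that the policy has made more likely than the reference model incurs a lower loss than one it has made less likely. Setting lambda_u above lambda_d would be one way to express loss aversion in this sketch.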

The deeper insight is about alignment theory: we have been explaining alignment success in terms of reward modeling and preference learning, when part of the explanation is that the training process mirrors the structure of human cognitive bias. Given the concern raised in the related note "Does RLHF training make models more convincing or more correct?", understanding why alignment methods work mechanistically matters for fixing where they fail. If alignment success depends on modeling irrationality, then "fixing" irrational aspects of the training signal may inadvertently break what works.

A practical finding reinforces this: when the pretrained model is sufficiently good, SFT can be skipped entirely before KTO without loss in generation quality. This is not true for DPO, where SFT is always needed for best results. This suggests that binary utility optimization is a more natural fit for the pretrained model's structure than pairwise preference optimization.

Original note title

prospect theory explains why alignment methods like DPO and PPO-Clip work — they implicitly model human cognitive biases like loss aversion