SYNTHESIS NOTE

Do LLMs predict persuasion based on actual dialogue or training bias?

Why do large language models consistently predict concession-based persuasion intentions even when dialogue context suggests otherwise? Understanding this gap reveals how alignment training shapes not just model behavior but also how models perceive others' intentions.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

When asked to infer persuasion intentions from dialogue, most LLMs exhibit a systematic bias: they predict intentions "characterized by making the other person feel accepted through concessions, promises, or benefits" — regardless of whether the actual dialogue context supports this inference.

The hypothesized mechanism is RLHF (Reinforcement Learning from Human Feedback). RLHF "tends to prioritize safety and politeness" during preference optimization, and this training signal carries over into intention prediction. Having learned that human raters prefer conciliatory, benefit-oriented responses, the model applies the same preference when predicting what other agents will do: it projects its own trained disposition onto the agents it is modeling.

This is a specific, measurable instance of a broader pattern: alignment training shapes not just what the model says but how it models others. If RLHF teaches the model that accommodation is preferred, the model begins to assume accommodation is what agents do. Genuinely adversarial, manipulative, or hardball persuasion strategies become harder for the model to represent, because its own training bias assigns them lower probability when it predicts intentions.
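One way to make "measurable" concrete is to compare how often a model predicts a concession-type intention with how often annotators assign one. The sketch below is a minimal illustration under stated assumptions, not the method of any cited study: the label names, the `concession_bias` helper, and the toy lists are all hypothetical, and the numbers are made up purely to show the arithmetic.

```python
# Hypothetical label name; the note does not define an intention taxonomy.
CONCESSION = "concession"

def concession_bias(gold, predicted, label=CONCESSION):
    """Return (gold_rate, predicted_rate, false_rate) for one intention label.

    gold_rate:      share of dialogues annotated with `label`
    predicted_rate: share of dialogues where the model predicted `label`
    false_rate:     among dialogues NOT annotated with `label`,
                    the share where the model predicted it anyway
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal length")
    n = len(gold)
    gold_rate = sum(g == label for g in gold) / n
    predicted_rate = sum(p == label for p in predicted) / n
    # Restrict to dialogues whose annotated intention is something else.
    others = [p for g, p in zip(gold, predicted) if g != label]
    false_rate = sum(p == label for p in others) / len(others) if others else 0.0
    return gold_rate, predicted_rate, false_rate

# Toy data (illustrative only, not experimental results).
gold = ["concession", "threat", "deception", "threat", "concession", "pressure"]
pred = ["concession", "concession", "deception", "concession", "concession", "concession"]

gold_rate, predicted_rate, false_rate = concession_bias(gold, pred)
print(f"gold={gold_rate:.2f} predicted={predicted_rate:.2f} false={false_rate:.2f}")
```

A predicted rate well above the gold rate, together with a high false rate on dialogues annotated as adversarial, would be the signature of the bias described here. On real data the comparison would need a validated annotation scheme and enough dialogues per category to be meaningful.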

The practical consequence for persuasion-aware AI: a model biased toward predicting concessions will systematically underestimate adversarial intent. In negotiation support, threat detection, or social manipulation detection, this bias translates directly into blind spots — the model expects cooperation where exploitation is occurring.

Original note title

RLHF biases LLMs toward predicting concession-based persuasion intentions regardless of dialogue context