Do LLMs predict persuasion based on actual dialogue or training bias?

Why do large language models consistently predict concession-based persuasion intentions even when dialogue context suggests otherwise? Understanding this gap reveals how alignment training shapes not just model behavior but also how models perceive others' intentions.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

When asked to infer persuasion intentions from dialogue, most LLMs exhibit a systematic bias: they predict intentions "characterized by making the other person feel accepted through concessions, promises, or benefits" — regardless of whether the actual dialogue context supports this inference.

The hypothesized mechanism is RLHF (Reinforcement Learning from Human Feedback). RLHF "tends to prioritize safety and politeness" during preference optimization, and this training signal carries over into intention prediction. Having learned that human raters prefer conciliatory, benefit-oriented responses, the model projects its own trained disposition onto the agents it is modeling and predicts that they will behave the same way.

This is a specific, measurable instance of a broader pattern: alignment training shapes not just what the model says but how it models others. If RLHF teaches the model that accommodation is preferred, the model begins to assume accommodation is what agents do. It becomes harder for the model to represent genuinely adversarial, manipulative, or hardball persuasion strategies because its own training bias makes these strategies less probable in its prediction space.
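
One way to make "measurable" concrete is to compare how often a model predicts a concession-type intention with how often annotators assign it. The sketch below is illustrative only: the label names, the toy data, and the function `concession_overprediction` are assumptions and do not come from the source study.

```python
from collections import Counter


def concession_overprediction(gold, predicted, label="concession"):
    """Return predicted share minus gold share for `label`.

    A positive value means the model assigns `label` more often
    than the annotators did. The label name is a hypothetical placeholder.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(gold)
    gold_share = Counter(gold)[label] / n
    pred_share = Counter(predicted)[label] / n
    return pred_share - gold_share


# Toy data (hypothetical): four dialogue turns with annotated intentions.
gold = ["concession", "threat", "appeal_to_logic", "threat"]
predicted = ["concession", "concession", "concession", "threat"]

gap = concession_overprediction(gold, predicted)
print(f"Concession over-prediction: {gap:+.2f}")
```

A gap that stays positive across dialogue types with different gold distributions would be consistent with the bias described here, although it would not by itself show that RLHF is the cause.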

The practical consequence for persuasion-aware AI: a model biased toward predicting concessions will systematically underestimate adversarial intent. In negotiation support, threat detection, or social manipulation detection, this bias translates directly into blind spots — the model expects cooperation where exploitation is occurring.
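
The blind spot can be sketched as a miss rate: of the turns annotated as adversarial, what fraction does the model label as concession? This is a minimal sketch under stated assumptions. The label set, the toy data, and the function `adversarial_miss_rate` are hypothetical and not taken from the source.

```python
def adversarial_miss_rate(gold, predicted, adversarial_labels, concession_label="concession"):
    """Fraction of gold-adversarial turns that the model predicted as concession.

    Returns None when no turn is annotated as adversarial.
    All label names are hypothetical placeholders.
    """
    adversarial = [(g, p) for g, p in zip(gold, predicted) if g in adversarial_labels]
    if not adversarial:
        return None
    missed = sum(1 for _, p in adversarial if p == concession_label)
    return missed / len(adversarial)


# Toy data (hypothetical): three adversarial turns, two misread as concession.
gold = ["threat", "deception", "concession", "threat"]
predicted = ["concession", "concession", "concession", "threat"]

rate = adversarial_miss_rate(gold, predicted, adversarial_labels={"threat", "deception"})
print(f"Adversarial turns read as concession: {rate:.2f}")
```

In a negotiation-support or manipulation-detection setting, a high value on this kind of metric would indicate the model expects cooperation where annotators saw exploitation.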

Related concepts in this collection 6

Does preference optimization damage conversational grounding in large language models? Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
RLHF concession bias is a specific mechanism within the broader alignment tax: the model's grounding in actual communicative dynamics is distorted by preference training
Why do language models agree with false claims they know are wrong? Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.
concession bias + face-saving behavior compound: the model both accommodates AND predicts others will accommodate
Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
the RLHF bias operates on top of the attention-level sycophancy mechanism; multiple layers of accommodation bias stack
Why can't conversational AI agents take the initiative? Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
the concession bias is the social-modeling face of structural passivity: RLHF creates agents that are both behaviorally passive (never initiating) and perceptually biased (predicting others will also accommodate)
Where does AI's persuasive power actually come from? Explores which techniques make AI most persuasive—and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns.
the concession bias is the prediction-side consequence of the same post-training that boosts persuasiveness by 51%: RLHF trains toward accommodation, which makes the model both more persuasive and biased in modeling others' intentions toward conciliation
Do LLM arguments actually argue better than humans? LLM counter-arguments score higher on textbook quality markers like logical soundness and respectful tone, while human arguments show more creativity and emotional intensity. What does this gap reveal about how we measure argumentative quality?
the production-side fingerprint of the same RLHF bias: where this note documents predicted-intention distortion, the textbook-quality finding documents generated-output distortion — both manifestations of accommodation training producing a conciliatory voice that does not match real human argumentative behavior

Original note title

RLHF biases LLMs toward predicting concession-based persuasion intentions regardless of dialogue context
