SYNTHESIS NOTE

Does preference optimization damage conversational grounding in large language models?

Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.

Synthesis note · 2026-02-21 · sourced from Linguistics, NLP, NLU

Grounding Gaps (Shaikh et al. 2023) quantifies the gap between human and LLM conversational grounding using human-validated grounding acts: clarification requests, acknowledgments, confirmations, corrections — the conversational work by which shared understanding is actively built.
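The measurement idea can be sketched in a few lines, assuming each dialogue turn has already been labeled with the grounding acts it contains. This is a minimal illustration, not the procedure used in the paper: the function names, the per-turn rate metric, and the toy labels below are all hypothetical.

```python
from collections import Counter

# Act inventory taken from the note; the label strings are an assumption.
GROUNDING_ACTS = ("clarification", "acknowledgment", "confirmation", "correction")


def grounding_rates(turn_labels):
    """Return, for each grounding act, the fraction of turns that contain it.

    turn_labels is a list with one set of act names per turn.
    """
    counts = Counter(
        act for labels in turn_labels for act in labels if act in GROUNDING_ACTS
    )
    total = len(turn_labels)
    return {act: (counts[act] / total if total else 0.0) for act in GROUNDING_ACTS}


def grounding_gap(human_labels, model_labels):
    """Per-act difference in rate (human minus model); positive means the model uses the act less."""
    human = grounding_rates(human_labels)
    model = grounding_rates(model_labels)
    return {act: human[act] - model[act] for act in GROUNDING_ACTS}


# Toy labels for illustration only; these are not data from the study.
human_turns = [{"clarification"}, {"acknowledgment"}, set(), {"confirmation"}]
model_turns = [set(), set(), {"acknowledgment"}, set()]

gap = grounding_gap(human_turns, model_turns)
for act, value in gap.items():
    print(f"{act}: {value:+.2f}")
```

A per-turn rate is only one possible summary; any real comparison would depend on how reliably the acts are labeled in the first place.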

Key findings:

The RLHF finding deserves emphasis. Preference optimization is the dominant technique for making models more helpful and aligned. It trains on human preference data that rewards fluent, confident, complete responses. These properties work against grounding acts: clarifying questions introduce friction, acknowledgments interrupt response flow, and checking understanding costs tokens. Preference optimization therefore trains these behaviors away, precisely because they do not look helpful in single-turn evaluation.

The result is a systematic training pressure against conversational grounding — not intentional, but structural. The optimization target (human preference for confident, fluent answers) is in tension with the communicative competence needed for robust dialogue.
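One way to make this tension concrete is the Bradley-Terry reward-modelling loss commonly used in RLHF. The note does not say which preference-optimization variant the studied models used, so treat this formulation as an assumption. The symbols are illustrative: x is a prompt, y_w the response annotators preferred, y_l the response they rejected, r_phi the learned reward model, and D the preference dataset.

```latex
\mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
```

If annotators comparing single turns tend to pick a direct answer as y_w over a clarifying question as y_l, minimizing this loss pushes the reward of the direct answer above that of the clarifying question, and a policy optimized against r_phi inherits that ordering.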

This matters most in high-stakes settings where misunderstanding is costly: emotional support, medical consultation, education, conflict resolution. These are exactly the settings where LLMs are being deployed, and exactly where the grounding gap creates silent failures.

This connects to "Why do reasoning models fail differently at training versus inference?" as a third optimization failure: preference optimization narrows conversational behavior toward single-turn helpfulness, eliminating the diversity of communicative acts that grounding requires.

The FLEX Benchmark extends this finding to a more dangerous domain: preference optimization does not just reduce grounding acts, it actively reinforces accommodation of false information. Across LLMs, models show "strong preferences against rejection" even when they have the knowledge needed to reject false presuppositions embedded in questions. The face-saving bias that humans exhibit in social conversation (a preference for agreement over correction) is learned from human preference data and then reinforced. RLHF teaches the model that agreement looks helpful; the specific failure mode this creates is covered in "Why do language models avoid correcting false user claims?".

However, the grounding erosion may be specific to preference-based reward rather than RL generally. RLVER (Can emotion rewards make language models genuinely empathic?) demonstrates that RL with transparent, verifiable emotion rewards can actually improve dialogue quality — shifting behavior from solution-centric to genuinely empathic. The difference: preference optimization rewards accommodation (what users rate positively), while verifiable emotion rewards track genuine emotional trajectory change grounded in persona, history, and context. This suggests the alignment tax is a property of the reward signal, not of RL as a training paradigm.

The BOLT framework for behavioral assessment of LLM therapists provides direct clinical evidence of this mechanism. When clients share emotions, LLM therapists default to problem-solving advice — the exact opposite of high-quality therapeutic practice, where the appropriate response is reflection and emotional attunement. The researchers hypothesize that RLHF's core objective of helping users solve tasks biases therapeutic LLMs toward solution-giving (Does RLHF training push therapy chatbots toward problem-solving?). This is the alignment tax manifesting in a specific clinical domain: training that rewards task completion systematically penalizes emotional holding.


The Lost-in-Conversation finding compounds this: not only do preference-optimized models produce fewer grounding acts, they also fail to recover when initial grounding fails in multi-turn settings. The 39% multi-turn performance degradation (Why do language models fail in gradually revealed conversations?) is partly a downstream consequence of the grounding erosion: models that do not check understanding in early turns lock in incorrect assumptions, and the resulting errors compound.

Original note title

preference optimization erodes llm conversational grounding