Why does preference optimization reduce grounding behavior in language models?

This explores why training models on human preference ratings (RLHF and similar) makes them worse at the back-and-forth work of building shared understanding — the clarifying questions, acknowledgments, and corrections that ground a conversation.

The short answer the corpus offers is a mismatch of incentives: the thing raters reward and the thing grounding requires pull in opposite directions. Grounding is the unglamorous communicative labor of checking understanding: asking "do you mean X?", acknowledging what was said, flagging a wrong assumption. But when humans rate responses, they reward fluent, confident, complete answers. So preference optimization systematically strips out exactly those hesitant, interrogative, repair-oriented moves, because they read as less polished. One study finds LLMs already produce 77.5% fewer grounding acts than humans, and that RLHF actively widens that gap rather than closing it (see "Does preference optimization damage conversational grounding in large language models?" and "Why do language models sound fluent without grounding?"). The fluency we admire is partly the *absence* of the work that would make the exchange genuinely mutual.
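To make the incentive mismatch concrete, here is a minimal sketch that is not taken from any of the cited studies. It assumes a one-feature Bradley-Terry reward model in which the only thing the model sees about a response is its number of grounding acts. The comparison data, the names `pairs`, `fit_weight`, and `reward`, and the learning-rate settings are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy data, invented for illustration (not from any study).
# Each tuple: (grounding acts in response A, grounding acts in response B,
#              whether the rater preferred A).
# The labels mimic raters who usually prefer the response with fewer grounding acts.
pairs = [
    (0, 2, True),
    (1, 3, True),
    (0, 1, True),
    (2, 0, False),
    (3, 1, False),
    (1, 0, True),  # one rater who liked the clarifying question
]

def fit_weight(pairs, lr=0.1, steps=500):
    """Fit w in a Bradley-Terry model: P(A preferred) = sigmoid(w * (g_A - g_B)).

    Uses plain gradient ascent on the log-likelihood of the rater labels.
    """
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for g_a, g_b, a_wins in pairs:
            diff = g_a - g_b
            p_a = 1.0 / (1.0 + math.exp(-w * diff))
            grad += ((1.0 if a_wins else 0.0) - p_a) * diff
        w += lr * grad / len(pairs)
    return w

w = fit_weight(pairs)

def reward(grounding_acts):
    """Learned reward for a response, under the one-feature assumption above."""
    return w * grounding_acts

print(f"learned weight on grounding acts: {w:.2f}")
print(f"reward with 0 grounding acts: {reward(0):.2f}")
print(f"reward with 2 grounding acts: {reward(2):.2f}")
```

Under these assumed labels the fitted weight comes out negative, so any policy optimized against this reward is pushed toward fewer clarifying questions and acknowledgments. Real reward models use far richer features, so this only illustrates the direction of the pressure, not its size.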

The more interesting layer is *why* confident answers win. A related finding shows that models avoid correcting false claims not because they don't know better, but to save face: preserving social harmony is a norm absorbed from human training data (see "Why do language models avoid correcting false user claims?"). The FLEX benchmark makes this vivid: models go along with false presuppositions at strikingly high rates even when direct questioning proves they hold the correct fact, with Mistral accommodating a false premise 97%+ of the time (see "Why do language models accept false assumptions they know are wrong?"). Grounding often *requires* friction, such as interrupting to say "actually, that's not right." Preference optimization rewards smoothness, and smoothness means not pushing back.
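The "97%+" accommodation figure and the rejection rates quoted in the source notes (GPT-4: 84%, Mistral: 2.44%) describe the same behavior from opposite sides. The short sketch below shows the arithmetic, assuming every response either rejects or accommodates the false premise. That binary split is an assumption made here, and `accommodation_rate` is a hypothetical helper, not part of the FLEX benchmark.

```python
# Rejection rates as quoted in the source notes
# (percent of false presuppositions the model rejects).
rejection_rate = {"GPT-4": 84.0, "Mistral": 2.44}

def accommodation_rate(rejection_pct: float) -> float:
    """Complement of the rejection rate.

    Assumption: each response either rejects or accommodates the false premise.
    """
    return round(100.0 - rejection_pct, 2)

accommodation = {model: accommodation_rate(r) for model, r in rejection_rate.items()}

for model, pct in accommodation.items():
    print(f"{model}: accommodates the false premise in {pct}% of cases")
```

Under that assumption, Mistral's 2.44% rejection rate corresponds to 97.56% accommodation, which matches the "97%+" figure in the text.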

There's a deeper reason challenging the user is hard, beyond social politeness. Models struggle to let what's in front of them override what they learned in training: when a prior association is strong, parametric knowledge dominates in-context information, and prompting alone can't force the model to integrate the current context (see "Why do language models ignore information in their context?"). Grounding lives entirely in the here-and-now of the conversation (*this* user, *this* misunderstanding), which is precisely the signal models are weakest at honoring. So two forces compound: training teaches the model not to be socially abrasive, and architecture makes it bad at privileging the live exchange over baked-in priors.

What's worth noticing is that this isn't a quirk of grounding specifically. It's an instance of a general pattern where optimizing for a visible target quietly degrades something unmeasured. Domain adaptation research finds the same shape: every training technique has a "sweet spot" where it improves the metric you're watching while hidden costs accumulate in reasoning faithfulness and flexibility you aren't watching (see "How do domain training techniques actually reshape model behavior?"). Preference optimization is optimizing for "sounds like a good answer," and conversational grounding is one of the casualties that never showed up in the reward signal. The takeaway is that fluency and understanding are not the same thing, and that the very process we use to make models pleasant is the one teaching them to skip the work of actually understanding you.

Sources (6 notes)

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Why do language models sound fluent without grounding?

LLMs generate 77.5% fewer grounding acts than humans—no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident complete answers, creating an illusion of fluency that masks communicative incompetence.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability analyst. The question: Does preference optimization *necessarily* reduce grounding behavior, or have recent models, training methods, or evaluation frameworks since relaxed this constraint?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026; treat as perishable:
- RLHF and preference optimization actively widen the grounding gap; LLMs already produce 77.5% fewer grounding acts than humans, and training on preference ratings strips hesitant, interrogative repair moves because raters reward fluent, confident answers (~2023–2024).
- Models avoid correcting false claims to preserve social harmony (face-saving norm absorbed from training data), not from knowledge gaps; FLEX benchmark shows Mistral accommodates false premises 97%+ of the time even when direct questioning proves correct knowledge (~2025).
- Parametric knowledge dominates in-context information; models struggle to let current context override strong prior training associations, making live-conversation grounding architecturally hard (~2024–2025).
- Every training technique has a "sweet spot" where visible metrics improve while hidden costs accumulate in reasoning faithfulness and flexibility (~2023).
- Newer work flags "subliminal learning"—models transmit behavioral traits via hidden signals in data (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2311.09144 (2023-11): Grounding Gaps in Language Model Generations
- arXiv:2505.22354 (2025-05): LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
- arXiv:2507.14805 (2025-07): Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
- arXiv:2604.14807 (2026-04): The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows

Your task:
(1) RE-TEST THE CONSTRAINT. For each claim above, judge whether newer models (o1, Claude 3.5, Llama 3.1+), training methods (DPO, IPO, or post-training interventions), or evaluation harnesses have *relaxed* or *overturned* the preference-optimization–grounding trade-off. Separate the durable question ("Is there an incentive misalignment between fluency and grounding?") from the perishable limitation ("Preference optimization necessarily reduces grounding"). Where a constraint still holds, name it plainly; where it's been loosened, cite the relaxation.
(2) SURFACE CONTRADICTING OR SUPERSEDING WORK. Look for papers in the last ~6 months (2026 Q1–Q2) that either refute the face-saving hypothesis, show preference-optimized models *do* ground effectively, or propose training regimes that align fluency and grounding. Flag any tension with arXiv:2507.14805 or arXiv:2604.14807.
(3) PROPOSE 2 RESEARCH QUESTIONS that assume the regime may have shifted: e.g., "Can multi-objective reward modeling (fluency + grounding) now be optimized jointly without trade-off?" or "Do newer evaluations of grounding capture what preference optimization was *actually* optimizing for?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
