Why does preference optimization erode conversational grounding in AI assistants?

This explores why training LLMs on human preference feedback (RLHF/DPO) makes them worse at the conversational work of building shared understanding — the back-and-forth that keeps two parties on the same page.

Linguists call the quiet, ongoing work of building shared understanding in a conversation grounding, and training LLMs to be 'preferred' by humans makes them worse at it. The corpus has a sharp, consistent answer as to why: what preference optimization rewards and what grounding requires are in direct tension. Models trained on human preference data produce 77.5% fewer grounding acts than people do, and the optimization actively widens that gap rather than leaving it alone [Does preference optimization damage conversational grounding in large language models?]. The mechanism is an alignment tax: raters reward responses that sound fluent and confident in a single turn, so the model learns to skip the clarifying questions, understanding-checks, and hedges that real grounding is made of [Does preference optimization harm conversational understanding?].
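To make the headline figure concrete, here is a minimal sketch of how a "77.5% fewer" gap could be computed from per-turn rates. The function name `relative_gap` and the counts are hypothetical, picked only so the arithmetic reproduces the cited number; they are not the underlying study's data or method.

```python
def relative_gap(model_rate: float, human_rate: float) -> float:
    """Return how far the model falls below the human baseline, as a fraction."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return 1.0 - model_rate / human_rate


# Hypothetical counts, chosen only so the arithmetic lands on the cited figure:
# 40 grounding acts per 100 human turns versus 9 per 100 model turns.
gap = relative_gap(model_rate=9, human_rate=40)
print(f"{gap:.1%} fewer grounding acts")  # 77.5% fewer grounding acts
```

Read this way, the figure is a relative shortfall against the human baseline, not a difference in percentage points.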

The root cause is a reward-horizon mismatch. Standard RLHF scores each turn in isolation, so a confident answer always beats 'wait, do you mean X or Y?', even when the question would have produced a better conversation. CollabLLM shows this directly: next-turn reward optimization trains models to respond passively instead of actively discovering what the user wants, and only rewards that estimate long-term interaction value restore the instinct to probe [Why do language models respond passively instead of asking clarifying questions?]. The visible symptom is the 'wrong turn' problem: models score 90% on single-message instructions but collapse to 65% across natural multi-turn conversation, locking into early guesses and failing to course-correct as information arrives piece by piece [Why do AI assistants get worse at longer conversations?].
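The mismatch can be sketched as a toy decision problem. Everything below is an assumption made for illustration: the two actions, the ratings, and the outcome values are invented, and this is not CollabLLM's reward model. It only shows how the same choice flips when the score covers the whole conversation instead of one turn.

```python
# Toy illustration only: every number below is invented for the sketch,
# not taken from CollabLLM or any other cited work.

# Hypothetical ratings a turn-in-isolation rater might assign.
IMMEDIATE_REWARD = {"answer_now": 0.8, "ask_clarifying": 0.4}

# Hypothetical quality of the final outcome for an ambiguous request.
P_GUESS_CORRECT = 0.5            # the model's guess is a coin flip
OUTCOME_IF_CORRECT = 1.0
OUTCOME_IF_WRONG = 0.0
OUTCOME_AFTER_CLARIFYING = 0.9   # small penalty for the extra turn


def single_turn_value(action: str) -> float:
    """Score the action the way a single-turn rater would."""
    return IMMEDIATE_REWARD[action]


def multi_turn_value(action: str) -> float:
    """Score the action by the expected quality of the whole conversation."""
    if action == "answer_now":
        return (P_GUESS_CORRECT * OUTCOME_IF_CORRECT
                + (1 - P_GUESS_CORRECT) * OUTCOME_IF_WRONG)
    return OUTCOME_AFTER_CLARIFYING


def best_action(value_fn) -> str:
    """Pick the action with the highest value under the given scoring rule."""
    return max(IMMEDIATE_REWARD, key=value_fn)


print(best_action(single_turn_value))  # answer_now
print(best_action(multi_turn_value))   # ask_clarifying
```

Under these made-up numbers the confident answer wins the turn (0.8 against 0.4) but loses the conversation (expected 0.5 against 0.9), which is the shape of the trap described above.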

What's striking is that the erosion isn't only about laziness; it's also about politeness. Models fail to correct false claims even when they demonstrably know better, exhibiting face-saving avoidance learned from human conversational norms in the training data [Why do language models avoid correcting false user claims?]. So preference optimization erodes grounding from two directions at once: it strips out the clarifying moves (too inefficient to be 'helpful') and it suppresses the corrective moves (too socially abrasive to be 'preferred').

Widen the lens and you see the same root in adjacent failures. Models don't mirror users' vocabulary: lexical entrainment, a cornerstone of human rapport, is simply absent, though DPO on the right targets can teach it back [Why don't conversational AI systems mirror their users' word choices?]. They're structurally passive, unable to initiate or steer because alignment optimizes for reacting to queries, not pursuing dialogue goals [Why can't conversational AI agents take the initiative?]. And proactivity, meaning volunteering relevant information unasked, could cut conversation length by up to 60% but is nearly missing from the datasets and benchmarks models are optimized against [Could proactive dialogue make conversations dramatically more efficient?]. Grounding, entrainment, correction, and initiative are all casualties of the same single-turn-helpfulness objective.
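Lexical entrainment is the easiest of these to picture in code. The sketch below is a deliberately crude proxy, assuming entrainment can be approximated by how many of the user's content words a reply reuses. The names `content_words` and `entrainment_score`, the stopword list, and the example sentences are all hypothetical, and this is not the coreference-based method the cited note describes.

```python
import re

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "or", "is", "are", "it",
             "i", "my", "you", "your", "for", "in", "on", "can", "how", "do"}


def content_words(text: str) -> set[str]:
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def entrainment_score(user_turn: str, reply: str) -> float:
    """Fraction of the user's content words that the reply reuses."""
    user_vocab = content_words(user_turn)
    if not user_vocab:
        return 0.0
    return len(user_vocab & content_words(reply)) / len(user_vocab)


user = "How do I reset my router?"
mirroring = "To reset your router, hold the button for ten seconds."
drifting = "To reboot the gateway, hold the button for ten seconds."

print(entrainment_score(user, mirroring))  # 1.0
print(entrainment_score(user, drifting))   # 0.0
```

Both replies give the same instruction; only the first adopts the user's own terms ("reset", "router"), which is the behavior the note reports as missing.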

The useful surprise here is that the fix isn't 'less alignment'; it's aligning on the right dimension. Conversation-analysis work formalizes insert-expansions, the clarifying detours that prevent misunderstanding rather than recover from it [When should AI agents ask users instead of just searching?], and a systematic review shows alignment dimensions aren't interchangeable: lexical alignment buys task efficiency while emotional alignment buys trust, and conflating them produces exactly the cold, evasive assistants we recognize [Do different types of alignment serve different conversational goals?]. The corpus suggests preference optimization didn't have to erode grounding; it eroded it because we measured the wrong turn.

Sources: 10 notes

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically leaves grounding acts 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why don't conversational AI systems mirror their users' word choices?

Response generation models fail to adapt vocabulary toward users' lexical choices, a phenomenon central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.

Why can't conversational AI agents take the initiative?

Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.

Could proactive dialogue make conversations dramatically more efficient?

Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Do different types of alignment serve different conversational goals?

A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a conversational AI researcher auditing claims about preference optimization and grounding. The precise question: does preference optimization fundamentally erode grounding acts, or have newer training methods, reward formulations, or evaluation practices since relaxed this constraint?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026; treat as perishable:
• Preference-optimized models produce 77.5% fewer grounding acts than humans; the gap widens during training (~2024–2025).
• Single-turn reward optimization rewards confident answers over clarifying questions; only multi-turn-aware or long-horizon reward formulations restore grounding instinct (~2025, CollabLLM).
• Models score 90% on single-message tasks but drop to 65% in multi-turn conversation, locking into early guesses; face-saving avoidance suppresses even knowledge-backed corrections (~2024–2025).
• Lexical entrainment, proactive dialogue, and goal-directed initiation are nearly absent from aligned models; lexical entrainment, at least, is recoverable via targeted DPO (~2024–2025).
• Alignment dimensions (lexical, emotional, task-focused) are not interchangeable; conflating them produces cold, evasive assistants (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2203.02155 (2022-03) — Ouyang et al., foundational RLHF.
• arXiv:2502.00640 (2025-02) — CollabLLM, multi-turn-aware rewards versus next-turn reward optimization.
• arXiv:2505.06120 (2025-05) — Multi-turn conversation failure modes.
• arXiv:2508.18167 (2025-08) — DiscussLLM, speaking strategy alignment.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether: (a) newer LLMs (GPT-4o, Claude 3.5, o1-class reasoning), (b) training advances (DPO scaling, process reward modeling, constitution-based alignment), (c) tooling (agentic orchestration, memory retrieval, multi-turn harnesses), or (d) evaluation (grounding-specific benchmarks, conversation-length metrics) have since overturned the 77.5% gap, the single-turn reward trap, or the multi-turn collapse. Separate the durable question (likely: does myopic optimization hurt grounding?) from perishable limits (likely: specific architectures/methods). Cite what relaxed what. Flag constraints that still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers showing preference optimization can preserve grounding, or that grounding metrics are flawed, or that newer reward designs dissolve the tension entirely.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (i) Does instruction-tuning-first + lightweight preference-refinement preserve grounding better than end-to-end RLHF? (ii) Can multi-agent conversation (where models ground with each other, not just users) reveal whether the gap is alignment-induced or inherent to single-responder design?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
