Can alignment training prevent the clarification work users need?
This explores whether the very training that makes models 'helpful' (RLHF, DPO, preference optimization) actively suppresses the asking-for-clarification behavior that real conversations depend on — and the corpus says yes, fairly directly.
This reads the question as: does alignment training quietly remove a model's ability to do the clarification work users actually need, such as asking questions, checking understanding, and flagging ambiguity? The corpus answers with an unusually sharp 'yes,' and even names the mechanism. The clearest case is the 'alignment tax on communication': RLHF optimizes for single-turn helpfulness by rewarding confident, complete-looking answers over clarifying questions and understanding checks. The measured result is stark: models produce 77.5% fewer grounding acts than humans, so they look helpful while failing silently the moment a conversation requires back-and-forth (Does preference optimization harm conversational understanding?).
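To make the 77.5% figure concrete, a relative gap of this kind is one minus the ratio of the model's rate to the human rate. The sketch below assumes made-up per-turn rates chosen only so the arithmetic lands on that figure; `relative_reduction`, `human_rate`, and `model_rate` are hypothetical names and values, not numbers reported in the cited note.

```python
def relative_reduction(model_rate: float, human_rate: float) -> float:
    """Fraction by which the model's rate falls short of the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return 1.0 - model_rate / human_rate


# Hypothetical rates (grounding acts per turn), NOT values from the cited work.
human_rate = 0.40
model_rate = 0.09

gap = relative_reduction(model_rate, human_rate)
print(f"{gap:.1%} fewer grounding acts than humans")
```

Read this way, the figure means that for every hundred clarifying questions or understanding checks a human would make, the model makes fewer than twenty-five.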
What makes this more than a one-paper finding is that the same suppression shows up from completely different angles. One line of work frames it as a speech-act problem: alignment rewards calibrated neutrality and hedging, which structurally blocks any act that requires 'overclaiming' relative to baseline, such as alarm, warning, or denunciation (Does alignment training suppress socially necessary speech acts?). Asking a pointed clarifying question ('wait, do you mean X or Y?') sits in that same suppressed register: it is an assertive interruption of the user's framing, exactly the kind of move a hedge-rewarding objective trains away. The authors argue this is a consequence of the objective, not a bug you can patch.
There is also an upstream problem that alignment worsens rather than causes. Models are already poor at recognizing ambiguity in the first place: GPT-4 correctly disambiguates only 32% of cases against 90% for humans, and it appears unable to hold two readings of a sentence at once (Can language models recognize when text is deliberately ambiguous?). So the failure compounds: a model that can't see the fork in the road is then trained to answer confidently instead of stopping to ask which way you meant. A related thread shows that standard RLHF and DPO produce 'collaborators' that ignore a partner's interventions entirely, evaluating suggestions by surface plausibility rather than causal impact (Why do standard alignment methods ignore partner interventions?). Clarification requires treating the user as someone whose input changes the answer, which is precisely the disposition these methods erode.
The deeper trap is that you may not be able to simply 'add clarification back' via preferences, because preference optimization entangles the good with the bad. In AI writing assistance, users prefer the rewrites 63% of the time yet object to the persona distortions baked into those same rewrites: polish and distortion are entangled at the model level, and optimizing for one drags in the other (Can user preference guide AI writing tool alignment?). By the same logic, optimizing for the confident, satisfying-feeling answer drags in the suppression of clarifying friction. One of the things being suppressed is integration of what's actually in front of the model: models routinely override the current context with strong training priors, so they confidently answer the question they expect rather than the one you asked (Why do language models ignore information in their context?).
The corpus doesn't leave you only with the diagnosis. The most interesting exit is counterfactual-invariance training: regularize the agent so its behavior stays consistent when an intervention pathway is nullified, which forces it to weigh suggestions by genuine causal impact. Partner-awareness (the root of good clarification) then emerges as a byproduct without ever being explicitly rewarded (Why do standard alignment methods ignore partner interventions?). The unexpected takeaway: 'helpfulness' as currently rewarded is not neutral. It has a built-in bias toward the confident monologue and against the cooperative question, and fixing that may mean changing the training objective's shape rather than adding more preference data on top of it.
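One plausible reading of the counterfactual-invariance idea above can be sketched as a regularizer added to the ordinary task loss. The sketch assumes the agent outputs a discrete action distribution, and that "nullifying" a pathway means presenting a suggestion that cannot affect the outcome; the penalty is then the KL divergence between the agent's behavior with no suggestion and its behavior with the inert one. The function names, the weight `lam`, and the example distributions are all hypothetical, and the cited work may define the regularizer differently.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def invariance_penalty(baseline_dist, nullified_dist):
    """Penalize behavior change caused by a suggestion that has no causal effect."""
    return kl_divergence(baseline_dist, nullified_dist)


def total_loss(task_loss, baseline_dist, nullified_dist, lam=0.5):
    """Task loss plus the weighted invariance penalty (lam is an assumed weight)."""
    return task_loss + lam * invariance_penalty(baseline_dist, nullified_dist)


# Hypothetical action distributions over three actions.
baseline = [0.7, 0.2, 0.1]  # agent with no suggestion present
swayed = [0.2, 0.7, 0.1]    # agent that follows a plausible but inert suggestion
steady = [0.7, 0.2, 0.1]    # agent that ignores the inert suggestion

print(total_loss(1.0, baseline, swayed))  # penalized: behavior moved for no causal reason
print(total_loss(1.0, baseline, steady))  # unpenalized: behavior stayed put
```

Under these assumptions, an agent swayed by surface plausibility pays a cost whenever the suggestion turns out to be causally inert, so the cheapest strategy is to track whether a partner's input actually changes the outcome.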
Sources (6 notes)
Does preference optimization harm conversational understanding? RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment leaves models producing 77.5% fewer grounding acts than humans, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Does alignment training suppress socially necessary speech acts? RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts that require overclaiming relative to baseline, such as alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.
Can language models recognize when text is deliberately ambiguous? The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Why do standard alignment methods ignore partner interventions? Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.
Can user preference guide AI writing tool alignment? Writers prefer AI rewrites 63% of the time but object to the systematic persona distortions those same rewrites introduce. Mitigation studies show polish and distortion are entangled at the model level: preference optimization produces both simultaneously.
Why do language models ignore information in their context? Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.