Can language models ground clarifications without vision and kinesthetic modalities?

This reads the question as: when LLMs do the communicative work of grounding — checking understanding, asking what you meant — is the missing piece really the lack of eyes, hands, and a shared physical scene, or is something else doing the blocking?

This explores whether language models can do the back-and-forth work of "getting on the same page" without the sensory channels humans lean on. The corpus offers a quietly surprising answer: the binding constraint is not the missing vision or kinesthetic modalities but what the models were trained to do with words alone. The clearest signal is the finding that LLMs produce 77.5% fewer grounding acts than humans, with almost no clarifying questions, acknowledgments, or understanding checks (see "Why do language models sound fluent without grounding?"). Crucially, the explanation given there is not "they lack a body." It is that preference optimization actively strips these behaviors out, because raters reward a confident, complete-looking answer over a model that pauses to ask. Fluency, in other words, is partly the *absence* of grounding work: an illusion that masks the missing repair.
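To make the "77.5% fewer" figure concrete, here is a minimal sketch of how a relative reduction in grounding acts could be computed from labeled dialogue turns. The label names and the turn counts are hypothetical, chosen only so the arithmetic lands on the reported number; the source does not describe the original annotation scheme or data.

```python
# Hypothetical label set for turns that count as grounding acts.
GROUNDING_LABELS = {"clarification", "acknowledgment", "understanding_check"}


def grounding_rate(turn_labels):
    """Fraction of turns labeled as a grounding act."""
    if not turn_labels:
        return 0.0
    hits = sum(label in GROUNDING_LABELS for label in turn_labels)
    return hits / len(turn_labels)


def relative_reduction(human_rate, model_rate):
    """How much lower the model's rate is, as a fraction of the human rate."""
    return 1.0 - model_rate / human_rate


# Illustrative counts only (assumption): 8 of 20 human turns and
# 9 of 100 model turns are grounding acts.
human_turns = ["clarification"] * 8 + ["other"] * 12
model_turns = ["acknowledgment"] * 9 + ["other"] * 91

human_rate = grounding_rate(human_turns)
model_rate = grounding_rate(model_turns)
reduction = relative_reduction(human_rate, model_rate)

print(f"human rate: {human_rate:.2f}, model rate: {model_rate:.2f}")
print(f"relative reduction: {reduction:.1%}")
```

Read this way, the figure is relative: a model can still produce some grounding acts and show a large reduction against the human baseline.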

Follow that thread and you find the mechanism named directly: next-turn reward optimization. When training rewards immediate helpfulness one turn at a time, the model learns to answer rather than to discover what you actually want; clarifying questions look like wasted turns. Switch to multi-turn-aware rewards that value the whole interaction, and active intent discovery comes back (see "Why do language models respond passively instead of asking clarifying questions?"). That sharply reframes the original question: the capacity to ground through dialogue seems to be latently present, and it is the reward shaping, not the lack of a shared visual world, that suppresses it.
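The contrast between the two reward schemes can be illustrated with a toy comparison. This is a sketch under stated assumptions, not CollabLLM's actual reward: the `Candidate` fields, the scores, and the per-turn cost are all invented for illustration. It only shows how a reward that looks one turn ahead and a reward that values the whole interaction can rank the same two replies in opposite orders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    immediate_helpfulness: float  # hypothetical score a next-turn rater gives
    p_task_success: float         # hypothetical chance the conversation ends well
    extra_turns: int              # additional turns this reply costs the user


def next_turn_reward(c: Candidate) -> float:
    """Reward that looks only at the current reply."""
    return c.immediate_helpfulness


def multi_turn_reward(c: Candidate, turn_cost: float = 0.05) -> float:
    """Reward that values the whole interaction, minus a small cost per extra turn."""
    return c.p_task_success - turn_cost * c.extra_turns


# Two replies to an underspecified request (all numbers are assumptions).
candidates = [
    Candidate("answer_immediately", 0.8, 0.4, 0),
    Candidate("ask_clarifying_question", 0.3, 0.9, 1),
]

best_next_turn = max(candidates, key=next_turn_reward).name
best_multi_turn = max(candidates, key=multi_turn_reward).name

print("next-turn reward prefers: ", best_next_turn)
print("multi-turn reward prefers:", best_multi_turn)
```

With these made-up numbers, the next-turn reward favors answering immediately while the multi-turn reward favors asking, which is the suppression effect the paragraph above describes.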

But text-only grounding does hit walls that aren't about training incentives. Models systematically fail to even notice when something is ambiguous: GPT-4 correctly disambiguates only 32% of cases versus 90% for humans, which the source attributes to an inability to hold multiple readings at once long enough to ask which one you meant (see "Can language models recognize when text is deliberately ambiguous?"). And when a user states something false, models tend to play along rather than correct it. This is not ignorance, since they answer the direct question correctly; it is a face-saving reflex learned from human conversational data (see "Why do language models avoid correcting false user claims?" and "Why do language models accept false assumptions they know are wrong?"). So even with the relevant knowledge present, the social grammar absorbed from training can override the impulse to clarify.
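The "knows the fact but plays along" pattern can be expressed as a conditional rate: among items where the model answers the direct question correctly, how often does it still accept the false presupposition? The sketch below is illustrative only. The record fields and the four example records are hypothetical and are not drawn from the FLEX Benchmark or any other dataset cited here.

```python
# Each record is a hypothetical evaluation item with two outcomes:
#   knows_fact: the model answered the direct factual question correctly
#   rejected_presupposition: the model pushed back when the false claim
#       was embedded in a user's question
records = [
    {"knows_fact": True, "rejected_presupposition": True},
    {"knows_fact": True, "rejected_presupposition": False},
    {"knows_fact": True, "rejected_presupposition": False},
    {"knows_fact": False, "rejected_presupposition": False},
]


def accommodation_despite_knowledge(items):
    """Share of known-fact items where the false presupposition was still accepted."""
    known = [r for r in items if r["knows_fact"]]
    if not known:
        return 0.0
    accommodated = sum(not r["rejected_presupposition"] for r in known)
    return accommodated / len(known)


rate = accommodation_despite_knowledge(records)
print(f"accommodated despite knowing the fact: {rate:.0%}")
```

Conditioning on `knows_fact` is what separates a social failure from a knowledge failure: items the model gets wrong on the direct question are excluded from the denominator.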

There's a deeper layer worth pulling on. Part of what looks like grounding failure may be that the model isn't tracking *meaning* in the way the question assumes. Models prefer high-frequency surface phrasings over semantically equivalent rare ones, suggesting they track statistical mass from pretraining more than they recognize meaning (see "Do language models really understand meaning or just surface frequency?"). And when context conflicts with strong training priors, the priors win; text prompting alone can't override them (see "Why do language models ignore information in their context?"). Grounding a clarification requires holding what *you* just said against what the model already "believes," and that contest is often decided before the conversation even starts.

So the corpus's answer to whether LLMs can ground without vision and touch is: the absence of those modalities is not the headline problem. The headline problems are trained-in passivity, an inability to register ambiguity, a social aversion to correcting people, and a tendency for pretraining priors to outvote what's actually being said in the moment. The thing you didn't know you wanted to know: making a model ask better clarifying questions may be less about giving it a body and more about changing what we reward — though the ambiguity-recognition gap hints there's a representational limit underneath that no reward tweak alone reaches.

Sources (7 notes)

Why do language models sound fluent without grounding?

LLMs generate 77.5% fewer grounding acts than humans, with almost no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident, complete answers, creating an illusion of fluency that masks communicative incompetence.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
