How do humans decide which level of clarification to request?
This explores how people choose *which level* of a conversation to repair — whether they're flagging that they didn't hear, didn't understand, or didn't grasp the intent — and what the corpus reveals about the machinery behind that choice.
The most direct answer in the collection is that humans don't pick a level arbitrarily; they target the specific layer of communication that broke down. One line of work maps clarification onto Clark's "action ladder": four stacked levels (attention, signal, meaning, and action), each grounded in a different modality. Attention is tied to socioperception, signal to hearing, meaning to vision, and action to kinesthetics, so a missed *signal* is repaired through hearing and a missed *meaning* through vision (see "Why do clarification requests look different at each communication level?"). The striking detail there is that most real clarifications aren't even phrased as questions. They're declarative ("You mean the blue one"), which means the level you're repairing at shows up in the *form* of the repair, not just its content.
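As a rough illustration of "target the layer that broke down," the ladder can be sketched as an ordered lookup. The four levels and their modalities are taken from the source notes listed under Sources; everything else here is an assumption made for the example, including the names `Level`, `MODALITY`, and `level_to_repair`, and the rule that the repair aims at the lowest level that was not grounded.

```python
from enum import IntEnum
from typing import Dict, Optional


class Level(IntEnum):
    """The four ladder levels, ordered from lowest to highest."""
    ATTENTION = 1
    SIGNAL = 2
    MEANING = 3
    ACTION = 4


# Modality each level is grounded in, as listed in the source notes.
MODALITY: Dict[Level, str] = {
    Level.ATTENTION: "socioperception",
    Level.SIGNAL: "hearing",
    Level.MEANING: "vision",
    Level.ACTION: "kinesthetics",
}


def level_to_repair(grounded: Dict[Level, bool]) -> Optional[Level]:
    """Return the lowest level that was not grounded, or None if all were.

    Assumption for this sketch: a level missing from `grounded` counts
    as not grounded, and repair always targets the lowest failure.
    """
    for level in sorted(Level):
        if not grounded.get(level, False):
            return level
    return None


# Hypothetical case: the listener noticed and heard the utterance
# but could not tell which object was meant.
status = {
    Level.ATTENTION: True,
    Level.SIGNAL: True,
    Level.MEANING: False,
    Level.ACTION: False,
}
broken = level_to_repair(status)
print(broken.name, MODALITY[broken])
```

The ordering carries the idea here: the sketch only says where a repair should aim, not how it would be phrased.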
Underneath the choice of level sits a quieter cognitive step: noticing that something is underspecified at all. The corpus treats this as a separate skill from solving the problem. Models that ace complete reasoning tasks drop to 40–50% accuracy when they have to work out *what's missing* (see "Can models identify what information they actually need?"), and they fail badly at recognizing deliberate ambiguity: on the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of cases against 90% for humans (see "Can language models recognize when text is deliberately ambiguous?"). That gap is the real story: deciding which level to clarify presupposes you can hold multiple readings of the same utterance at once and see exactly where they diverge, something humans do almost automatically and current models largely can't.
Once you know *that* you need to ask, the question of *what* to ask is itself a level-selection problem. Specific, facet-targeted questions ("What kind of monitor?") beat vague "tell me more" prompts, because people engage when they can see how their answer will improve the result (see "Which clarifying questions actually improve user satisfaction?"). There's even a formal version of the calculus humans seem to run intuitively: simulate the possible answers to each candidate question and pick the question whose answers would, in expectation, shrink your uncertainty the most (see "How can models select the most informative question to ask?"). That's a principled account of "which level": you clarify at whichever level resolves the most ambiguity per question asked.
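The "simulate the answers, pick the biggest reduction in uncertainty" calculus can be sketched as expected information gain over a set of hypotheses about what the user wants. This is a simplified illustration, not the UoT method itself, which the source note describes as adding scenario simulation and reward propagation. The hypotheses, the prior, the yes/no question, and the assumption that each hypothesis produces exactly one answer per question are all invented for the example.

```python
import math
from collections import defaultdict
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy, in bits, of a {hypothesis: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_information_gain(prior: Dict[str, float],
                              answer_of: Dict[str, str]) -> float:
    """Expected entropy reduction from asking one question.

    prior:     {hypothesis: probability}, summing to 1
    answer_of: {hypothesis: the answer the user would give if it were true}
    """
    # Simulation step: group hypotheses by the answer they would produce.
    by_answer = defaultdict(dict)
    for hypothesis, p in prior.items():
        by_answer[answer_of[hypothesis]][hypothesis] = p

    expected_posterior_entropy = 0.0
    for group in by_answer.values():
        p_answer = sum(group.values())
        posterior = {h: p / p_answer for h, p in group.items()}
        expected_posterior_entropy += p_answer * entropy(posterior)
    return entropy(prior) - expected_posterior_entropy


def best_question(prior: Dict[str, float],
                  questions: Dict[str, Dict[str, str]]) -> str:
    """Pick the question with the highest expected information gain."""
    return max(questions,
               key=lambda q: expected_information_gain(prior, questions[q]))


# Hypothetical setup: four equally likely things the user might want.
PRIOR = {"4k": 0.25, "ultrawide": 0.25, "gaming": 0.25, "portable": 0.25}
QUESTIONS = {
    # Facet-targeted: every hypothesis yields a different answer.
    "What kind of monitor?": {h: h for h in PRIOR},
    # Narrower yes/no question: separates only one hypothesis from the rest.
    "Is it for gaming?": {h: ("yes" if h == "gaming" else "no") for h in PRIOR},
    # Vague prompt, modelled (as an assumption) as giving the same
    # uninformative answer whatever the user wants.
    "Can you tell me more?": {h: "it's a monitor" for h in PRIOR},
}

print(best_question(PRIOR, QUESTIONS))
```

Under these toy assumptions the facet-targeted question gains the full 2 bits, the yes/no question about 0.81 bits, and the vague prompt 0 bits, so `best_question` picks "What kind of monitor?". Real answers are noisy, so a fuller version would replace the one-answer-per-hypothesis table with answer likelihoods.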
What you didn't know you wanted to know: this decision can be taught, but it's fragile. Models can be trained to proactively spot missing information: with reinforcement learning, accuracy on deliberately flawed math problems jumped from 0.15% to about 74%. Yet without that training, inference-time scaling *degrades* the same capability rather than improving it (see "Can models learn to ask clarifying questions instead of guessing?"). A separate thread shows models can even acquire the habit of asking, rather than guessing, just by learning to treat conversation as a source of information they can draw on (see "Can models learn to ask clarifying questions without explicit training?"). And there's a framing borrowed straight from human conversation analysis: "insert-expansions," the little sub-dialogues we open to scope a request before acting, are offered as the structure agents should adopt so they consult the user *before* drifting off-intent rather than after (see "When should AI agents ask users instead of just searching?"). Taken together, the corpus reframes your question: humans pick a clarification level by detecting where understanding fractured and asking the question that most reduces that specific uncertainty. It is a two-step skill (notice the gap, then aim the repair) that the field is only beginning to reproduce.
Sources (8 notes)
"Why do clarification requests look different at each communication level?" Research maps clarification mechanisms to four levels of communication (attention, signal, meaning, action), each grounded in a different modality (socioperception, hearing, vision, kinesthetics). Most clarifications use declarative form, not questions, making them invisible to systems that detect by syntax alone.
"Can models identify what information they actually need?" Models achieving high accuracy on complete reasoning tasks drop to 40–50% accuracy when identifying what clarifying question to ask after one variable is withheld. Information gathering and problem execution are separable cognitive operations.
"Can language models recognize when text is deliberately ambiguous?" The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
"Which clarifying questions actually improve user satisfaction?" Clarifying questions that target concrete information gaps ("What type of monitor?") consistently beat those that ask users to rephrase their needs ("What are you trying to do?"). Users engage most when they can foresee how answering improves results.
"How can models select the most informative question to ask?" UoT combines uncertainty-aware scenario simulation with information-gain scoring and reward propagation to identify questions whose possible answers maximally reduce diagnostic uncertainty, providing a principled mechanism for specific, high-value clarification rather than generic prompts.
"Can models learn to ask clarifying questions instead of guessing?" Reinforcement learning increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.
"Can models learn to ask clarifying questions without explicit training?" Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.
"When should AI agents ask users instead of just searching?" Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions (clarifying intent, scoping responses, enhancing appeal) as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.