INQUIRING LINE

Which types of clarifying questions actually help users versus wasting their time?

This explores what separates a clarifying question that earns a user's time from one that wastes it — and the corpus turns out to have a clear answer plus a surprising twist about whether models even know when to ask.


The sharpest signal in the corpus is also the simplest: questions that target a concrete information gap beat questions that ask users to restate their goal. "What size monitor?" outperforms "What are you trying to do?", and the reason is psychological, not technical. Users engage when they can foresee how their answer improves the result; a vague request to rephrase makes them do the work of guessing what the system needs (Which clarifying questions actually improve user satisfaction?). So the first rule is specificity that pays off visibly.

What makes specificity hard to fake is that 'good question' isn't one thing. One line of work decomposes question quality into separate attributes — clarity, relevance, specificity — and trains on each independently rather than against a single quality score; in clinical reasoning, asking the *right* missing question directly changes the decision (Can models learn to ask genuinely useful clarifying questions?). A more formal version of the same instinct scores candidate questions by how much they'd shrink the model's uncertainty — simulating the possible answers a question could get and picking the one whose answers split the possibilities most (How can models select the most informative question to ask?). Both point the same way: a question is worth asking in proportion to how much it narrows what the system doesn't yet know.
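
The information-gain idea can be made concrete with a small sketch. This is not the UoT method itself (which also involves reward propagation); it only illustrates scoring a question by how evenly its simulated answers split the remaining possibilities. The hypotheses, questions, and simulated answers below are hypothetical, and each hypothesis is assumed to produce exactly one answer.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, simulated_answers):
    """Expected entropy reduction from asking one question.

    prior: dict mapping hypothesis -> probability (assumed to sum to 1).
    simulated_answers: dict mapping hypothesis -> the answer a user would
        give if that hypothesis were true (assumed deterministic).
    """
    # Group the prior mass by the answer each hypothesis would produce.
    by_answer = defaultdict(list)
    for hypothesis, p in prior.items():
        by_answer[simulated_answers[hypothesis]].append(p)

    # Expected entropy left over after hearing the answer.
    remaining = 0.0
    for masses in by_answer.values():
        total = sum(masses)
        remaining += total * entropy([m / total for m in masses])
    return entropy(prior.values()) - remaining


def pick_question(prior, questions):
    """Return the question with the highest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))


# Hypothetical example: four equally likely reasons for buying a monitor.
prior = {"gaming": 0.25, "coding": 0.25, "photo": 0.25, "office": 0.25}
questions = {
    "Do you want a good monitor?": {
        "gaming": "yes", "coding": "yes", "photo": "yes", "office": "yes"},
    "Do you need accurate color?": {
        "gaming": "no", "coding": "no", "photo": "yes", "office": "no"},
    "Is it mainly for work or for play?": {
        "gaming": "play", "coding": "work", "photo": "play", "office": "work"},
}

gains = {q: expected_information_gain(prior, a) for q, a in questions.items()}
best = pick_question(prior, questions)
```

In this toy setup the question every hypothesis answers the same way gains nothing, the 3-to-1 split gains less than a bit, and the even 2-to-2 split gains a full bit, so it is the one selected.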

The uncomfortable finding is that models are bad at knowing *when* a question is needed at all. Being good at solving a problem doesn't transfer to spotting that a problem is missing a piece: models that score highly on complete reasoning tasks drop to 40–50% accuracy when they have to identify which clarifying question to ask after one variable is withheld (Can models identify what information they actually need?). Reasoning-tuned models are worse still: faced with an ill-posed question, they don't reject it, they overthink it, generating long redundant chains because training rewarded producing reasoning steps and never taught them when to disengage (Why do reasoning models overthink ill-posed questions?). The capability to pause and ask is learnable but fragile. Reinforcement training pushed proactive 'something's missing here' accuracy from 0.15% to 73.98% on deliberately flawed math problems, yet in models without that training, more inference-time compute made the behavior worse (Can models learn to ask clarifying questions instead of guessing?). And it can be self-taught: STaR-GATE has a model improve its own questions by keeping the ones that raise answer quality, reaching 72% preference over its base model after two rounds with no human supervising the questions (Can models learn to ask better clarifying questions through self-improvement?).
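
The "keep the questions that raise answer quality" step can be sketched as a simple filter. This is a minimal illustration of that selection idea only, not the STaR-GATE pipeline; the `Trial` record, the scoring fields, and the numbers are all hypothetical, and how response quality is actually scored is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    task: str
    question: str
    score_with_question: float  # quality of the final response after asking
    baseline_score: float       # quality of the response with no question asked


def select_training_questions(trials, min_gain=0.0):
    """For each task, keep the question with the largest gain over baseline.

    Questions whose gain does not exceed min_gain are dropped entirely.
    """
    best = {}
    for t in trials:
        gain = t.score_with_question - t.baseline_score
        if gain <= min_gain:
            continue
        if t.task not in best or gain > best[t.task][0]:
            best[t.task] = (gain, t.question)
    return {task: question for task, (_, question) in best.items()}


# Hypothetical scores for two tasks.
trials = [
    Trial("recommend a monitor", "What are you trying to do?", 0.52, 0.50),
    Trial("recommend a monitor", "What size monitor?", 0.80, 0.50),
    Trial("plan a trip", "Can you rephrase that?", 0.40, 0.45),
]

kept = select_training_questions(trials)
```

Here the second task contributes nothing because its only candidate question made the response worse; in a self-improvement loop, only the kept questions would feed the next round of finetuning.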

The thing you didn't know you wanted to know: a clarifying question doesn't have to be a question. Mapping clarification onto Clark's levels of communication — attention, signal, meaning, action — shows most real-world clarifications are *declarative*, not interrogative ("I heard 'Tuesday'…" rather than "Did you say Tuesday?"), which means any system that detects clarification by looking for question syntax is blind to most of it (Why do clarification requests look different at each communication level?). And what counts as a good clarification depends on the kind of question underneath — comparison and debate questions need different handling than fact-lookup ones (Does question type determine the right retrieval strategy?). So the full answer to 'which clarifying questions help' is layered: ask for specific, answer-visible facets; only ask when something is genuinely missing (the hard part); and don't assume the helpful move is always phrased as a question at all.
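
To see why syntax-based detection misses declarative clarifications, consider a minimal sketch of such a detector. The regular expression and the word list are illustrative assumptions, not taken from any cited system; the two utterances reuse the pair from the paragraph above, both treated as clarifications.

```python
import re

# Hypothetical list of words that commonly open an English question.
QUESTION_START = re.compile(
    r"^(who|what|when|where|why|how|which|did|do|does|is|are|can|could|would|will)\b",
    re.IGNORECASE,
)


def looks_like_question(utterance: str) -> bool:
    """Flag an utterance as a question using surface syntax only."""
    text = utterance.strip()
    return text.endswith("?") or bool(QUESTION_START.match(text))


# Both utterances are clarifications; only one is interrogative.
clarifications = [
    "Did you say Tuesday?",
    "I heard 'Tuesday'...",
]

missed = [u for u in clarifications if not looks_like_question(u)]
```

The interrogative form is caught and the declarative one ends up in `missed`, which is the blind spot the paragraph describes.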


Sources (9 notes)

Which clarifying questions actually improve user satisfaction?

Clarifying questions that target concrete information gaps ("What type of monitor?") consistently beat those that ask users to rephrase their needs ("What are you trying to do?"). Users engage most when they can foresee how answering improves results.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

How can models select the most informative question to ask?

UoT combines uncertainty-aware scenario simulation with information-gain scoring and reward propagation to identify questions whose possible answers maximally reduce diagnostic uncertainty—providing a principled mechanism for specific, high-value clarification rather than generic prompts.

Can models identify what information they actually need?

Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Can models learn to ask better clarifying questions through self-improvement?

STaR-GATE iteratively finetunes a model on questions that increase response quality, achieving 72% preference over the base model after two iterations. The research shows preference elicitation is trainable through self-play without human question supervision.

Why do clarification requests look different at each communication level?

Research maps clarification mechanisms to four levels of communication—attention, signal, meaning, action—each grounded in a different modality (socioperception, hearing, vision, kinesthetics). Most clarifications use declarative form, not questions, making them invisible to systems that detect by syntax alone.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about which clarifying questions actually help users. The question itself—how to separate useful from wasteful clarification—remains open and durable. A curated library of arXiv papers (2021–2025) found the following, but these are dated claims; your job is to check whether newer models, training methods, or evaluation tooling have since relaxed or overturned them.

What a curated library found — and when:
• Specificity that users can foresee will improve results beats vague requests to rephrase goals; "What size monitor?" outperforms "What are you trying to do?" (2024-02, arXiv:2402.01934).
• Decomposing question quality into separate attributes (clarity, relevance, specificity) and training on each independently works better than training against a single quality score; clinical models show this directly changes reasoning decisions (2025-02, arXiv:2502.14860).
• Information-gain scoring—simulating possible answers and picking questions whose answers split the possibility space most—predicts useful questions (2024-02, arXiv:2402.03271).
• Models that solve well-specified reasoning tasks drop to 40–50% accuracy when asked to identify which clarifying question is needed after a variable is withheld; reasoning-tuned models overthink ill-posed questions instead of rejecting them (2025-04, arXiv:2505.00127).
• Reinforcement training can push proactive "something's missing" accuracy from near-zero to ~74%; this capability is learnable but fragile and absent without explicit training (2025-07, arXiv:2507.23407).
• Self-teaching via STaR-GATE (keeping questions that raise answer quality) reaches 72% preference over base models in two rounds with no human supervision (2024-03, arXiv:2403.19154).
• Most real-world clarifications are declarative, not interrogative; systems detecting clarification only via question syntax miss most of it (2021-04, arXiv:2104.08964).

Anchor papers (verify; mind their dates):
– arXiv:2402.01934 (2024-02): Clarification usefulness and user satisfaction
– arXiv:2502.14860 (2025-02): Decomposed quality training in clinical reasoning
– arXiv:2403.19154 (2024-03): Self-improving question generation via STaR-GATE
– arXiv:2507.23407 (2025-07): Proactive critical thinking and question-asking

Your task:
(1) RE-TEST EACH CONSTRAINT. For the specificity finding, test whether newer instruction-tuning or preference-learning (e.g., DPO, IPO) has made vague open-ended prompts perform better or worse. For the 40–50% identification gap, check whether recent reasoning-focused models (o1-like architectures, extended-context scaling, or chain-of-thought variants post-2025Q2) have narrowed that drop. For the fragility of proactive questioning, test whether multi-agent setups (where one agent flags missing info and another acts on it) or memory-augmented systems have made the learned behavior more robust. Separate the durable question from the perishable limitation and cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially any showing that vagueness sometimes outperforms specificity, or that models can identify missing information without explicit RL.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Does preference-learning on clarification exchanges (rather than single-turn Q&A) relax the brittleness of proactive questioning?" or "Can question-type classification (from 2025-03 work) unlock better delegation of clarification to specialized sub-models?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
