SYNTHESIS NOTE

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Synthesis note · 2026-02-22 · sourced from Reasoning Logic Internal Rules

QuestBench formalizes a capability that real-world deployment requires but standard benchmarks largely ignore: when a task is underspecified, can the model identify what information is missing and ask the right clarifying question?

The benchmark presents reasoning tasks (logic, planning, math) where exactly one piece of information is withheld. The model must select the correct clarification question from multiple options. The key finding: while current models excel on math variants (GSM-Q, GSME-Q), they achieve only 40-50% accuracy on Logic-Q and Planning-Q.

The critical insight is the separability result: models that solve the fully-specified version of a problem still fail to identify the right question when one variable is missing. Problem-solving capability and information-gathering capability are distinct cognitive operations. The ability to execute reasoning when all inputs are present does not transfer to recognizing which input is absent.
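One way to make the separability claim measurable is to condition question accuracy on solving success: among items whose fully specified version the model solved, how often did it also pick the right clarifying question? The sketch below is illustrative only. The `ItemResult` record, its field names, and the toy inputs are hypothetical and are not taken from QuestBench or its results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ItemResult:
    """Hypothetical per-item record; field names are assumptions."""
    solved_full: bool    # model answered the fully specified version correctly
    asked_correct: bool  # model chose the right clarifying question


def question_accuracy_given_solved(results: List[ItemResult]) -> Optional[float]:
    """Accuracy on the clarifying question, restricted to items the model can solve."""
    solved = [r for r in results if r.solved_full]
    if not solved:
        return None  # undefined when the model solved nothing
    return sum(r.asked_correct for r in solved) / len(solved)


# Toy inputs, made up purely to exercise the function (not benchmark data).
toy = [
    ItemResult(solved_full=True, asked_correct=True),
    ItemResult(solved_full=True, asked_correct=False),
    ItemResult(solved_full=True, asked_correct=False),
    ItemResult(solved_full=False, asked_correct=False),
]
rate = question_accuracy_given_solved(toy)
```

If the two capabilities were the same operation, this conditional rate would sit near 1.0; a value well below that on real data is what a separability result would look like.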

This extends "Why do reasoning models overthink ill-posed questions?" from a complementary angle. That note documents the BEHAVIORAL response to missing information (overthinking, redundant self-doubt). This note documents the DIAGNOSTIC failure: models cannot even identify what is missing, let alone respond appropriately. Together they describe a two-part deficit:

1. Cannot detect what information is needed (QuestBench)
2. Cannot disengage when information is absent (missing-premise overthinking)

The connection to "Can language models recognize when text is deliberately ambiguous?" is structural: both involve recognizing that the current input is insufficient for a definitive answer. Ambiguity recognition asks "is this input multiply interpretable?" while information gathering asks "is this input incomplete?" Both require meta-reasoning about the input rather than reasoning within it.

The formalization as a constraint satisfaction problem (CSP) with missing variable assignments is useful: it defines information gathering as identifying the minimal necessary question — a well-defined optimization target. This separates the problem from subjective clarification tasks where multiple valid questions exist.
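The CSP framing can be sketched in a few lines. The version below is a deliberately simplified assumption, not QuestBench's implementation: constraints are reduced to dependency rules ("once these variables are known, that one can be derived"), and a question counts as sufficient if learning that single variable makes the target derivable by forward chaining. All names (`derivable`, `sufficient_questions`, the pricing variables) are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule (inputs, output): once every variable in `inputs` is known,
# `output` can be derived. This is a simplified stand-in for a constraint.
Rule = Tuple[FrozenSet[str], str]


def derivable(known: Set[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain to a fixed point: every variable derivable from `known`."""
    closed = set(known)
    changed = True
    while changed:
        changed = False
        for inputs, output in rules:
            if output not in closed and inputs <= closed:
                closed.add(output)
                changed = True
    return closed


def sufficient_questions(
    known: Set[str], rules: List[Rule], target: str, candidates: List[str]
) -> List[str]:
    """Candidate variables whose value alone would make `target` derivable."""
    if target in derivable(known, rules):
        return []  # already fully specified; nothing needs to be asked
    return [v for v in candidates if target in derivable(known | {v}, rules)]


# Hypothetical example: total depends on price and qty; price on base and tax.
rules = [
    (frozenset({"price", "qty"}), "total"),
    (frozenset({"base", "tax"}), "price"),
]
questions = sufficient_questions(
    known={"base", "qty"},
    rules=rules,
    target="total",
    candidates=["tax", "discount", "shipping"],
)
```

Here only `tax` unlocks the target, so the right clarifying question is unambiguous. That uniqueness is what separates this setup from open-ended clarification tasks where several questions could be equally valid.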

Related concepts in this collection (11)

Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
behavioral response to missing info; this is the diagnostic failure
Can language models recognize when text is deliberately ambiguous? Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.
shared structure: recognizing input insufficiency
Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
reasoning training suppresses both abstention and information gathering
Why do LLMs struggle to connect unrelated entities speculatively? LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.
evidence organization (well-specified) vs hypothesis generation (underspecified) is the same split
Can models learn to ask clarifying questions instead of guessing? Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
proactive critical thinking is the trainable solution to the information-gathering deficit: RL training raises missing-information detection from 0.15% to 73.98%, directly addressing the capability gap QuestBench identifies
How do users actually form intent when prompting AI systems? Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
intent maturation requires recognizing what information is missing from underspecified user expressions, which is exactly the capability QuestBench shows models lack
Why do language models lose performance in longer conversations? Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
the Mediator-Assistant architecture addresses the QuestBench deficit by separating intent understanding (where missing-information detection is needed) from task execution (where well-specified reasoning suffices)
Does training objective determine which direction models fail at abstention? Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
under-abstention compounds the underspecification problem: reasoning-trained models are both unable to identify missing information (this note) and trained to force answers regardless (that note), creating a compound failure on underspecified inputs
Why do language models fail in gradually revealed conversations? Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
the conversational manifestation of the information-gathering deficit: when instructions arrive gradually (the normal case), models that cannot identify what's missing make premature assumptions instead, producing the 39% multi-turn degradation
Why do users drift away from their original information need? When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
the user-side complement: QuestBench shows AI cannot identify what information is missing; ASK shows users cannot articulate what knowledge they lack; when both sides of the interaction have information-gathering deficits, neither can help the other resolve underspecification
Why do AI agents miss most of what users actually want? UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.
UserBench quantifies the practical cost of the information-gathering deficit: models that cannot identify missing information from underspecified tasks achieve only 20% full intent alignment because three core traits of user communication (underspecification, incrementality, indirectness) demand exactly the capability QuestBench shows models lack

Original note title

solving well-specified reasoning problems is insufficient for identifying missing information in underspecified tasks