SYNTHESIS NOTE

Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This note explores why GPT-4o plateaus at 35% accuracy on number guessing and whether training or prompting can fix the underlying deficit.

Synthesis note · 2026-04-18 · sourced from Reasoning Methods CoT ToT
How do LLMs fail to know what they seem to understand? What makes chain-of-thought reasoning actually work?

AR-Bench introduces a critical distinction: passive reasoning (all information given, solve the problem) versus active reasoning (information must be sought through interaction). This distinction exposes a capability gap that standard benchmarks completely miss.

The results are stark. On number guessing, a task with well-defined information-theoretic structure, GPT-4o achieves only 35% accuracy. The information gain curve reveals why: models extract 7.7% information gain in rounds 5-10, but this drops to just 2.5% in rounds 20-25. More interaction does not proportionally reduce uncertainty. The models plateau because they cannot formulate increasingly precise questions: they ask vague, repetitive queries that fail to partition the remaining hypothesis space efficiently.
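To make the partitioning argument concrete, here is a minimal sketch of expected information gain for yes/no questions. It assumes a simplified, hypothetical game (a secret integer in 1..100 with a uniform prior), which is not necessarily the AR-Bench task format; the function names `expected_gain_bits` and `ask` are illustrative only. It shows that a question splitting the candidates in half yields one bit, a narrow question yields far less, and a repeated question yields nothing.

```python
import math


def expected_gain_bits(candidates, predicate):
    """Expected entropy reduction (in bits) from a yes/no question,
    assuming a uniform prior over the remaining candidates."""
    n = len(candidates)
    yes = sum(1 for c in candidates if predicate(c))
    if yes == 0 or yes == n:
        return 0.0  # the answer is already determined: nothing is learned
    p = yes / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def ask(candidates, predicate, secret):
    """Keep only the candidates consistent with the true answer."""
    answer = predicate(secret)
    return [c for c in candidates if predicate(c) == answer]


def halving(c):
    return c <= 50   # splits 1..100 into two equal halves


def narrow(c):
    return c == 7    # singles out one candidate in a hundred


# Hypothetical setup: secret integer in 1..100 (an assumption, not AR-Bench's format).
candidates = list(range(1, 101))
secret = 37

gain_halving = expected_gain_bits(candidates, halving)
gain_narrow = expected_gain_bits(candidates, narrow)

# Asking the same question again after it was answered gains nothing.
remaining = ask(candidates, halving, secret)
gain_repeated = expected_gain_bits(remaining, halving)

print(f"halving:  {gain_halving:.3f} bits")
print(f"narrow:   {gain_narrow:.3f} bits")
print(f"repeated: {gain_repeated:.3f} bits")
```

Under these assumptions, a questioner that keeps halving the candidate set needs about seven questions for 100 candidates, while one that repeats or asks narrow questions can spend many rounds with little reduction in uncertainty, which is the shape of plateau described above.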

What makes this finding particularly damaging is the intervention analysis. SFT, DPO, Tree-of-Thought, human-written instructions, Proactive CoT, and Uncertainty-of-Thought (UoT) all provide minimal benefit. The active reasoning deficit is not a prompting problem or a fine-tuning problem — it appears to be a structural limitation in how current models represent and reduce uncertainty through sequential interaction.

This connects directly to "Can models identify what information they actually need?", which showed that models cannot identify what information is missing even when they can solve the fully specified version. AR-Bench extends this from identification to acquisition: even when the model has the opportunity to ask questions, it cannot formulate effective ones. The deficit spans the full pipeline: detection, formulation, and iterative refinement of information needs.

The connection to "Why do RL agents stop asking informative questions?" is structural: both describe systems that fail to escape low-information states. Self-locking describes the mechanism (weak belief tracking creates a trap); AR-Bench measures the behavioral consequence (a plateau in information gain despite continued interaction).

The early plateau pattern also resonates with "Does more thinking time always improve reasoning accuracy?": both reveal non-monotonic returns to continued processing, whether through more thinking tokens or more interaction rounds. The mechanism differs (overthinking vs. question quality degradation) but the failure mode is analogous: more compute or interaction without better strategy yields diminishing or negative returns.

Set against "Can models learn to ask clarifying questions instead of guessing?", the AR-Bench results suggest that even proactive critical thinking may be insufficient: the bottleneck is not willingness to ask but the ability to ask well.

Original note title

active reasoning through interaction is dramatically harder than passive reasoning — models plateau early and ask vague repetitive questions