SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation Conversational AI and Personalization Training, RL, and Test-Time Scaling

Can models learn to ask clarifying questions without explicit training?

Do language models trained only on fully-specified problems spontaneously develop the ability to ask for missing information when facing underspecified tasks? This tests whether conversational problem-solving strategies emerge from meta-learning rather than direct instruction.

Synthesis note · 2026-05-18 · sourced from Training Fine Tuning

A surprising generalization result from the social meta-learning training paradigm. The training procedure uses only fully-specified problems — the student receives the complete problem statement from the first turn, and the teacher provides feedback during attempts to solve it. None of the training problems require the student to handle missing information. Yet the trained model performs significantly better on underspecified tasks at test time, where critical information is revealed only across multiple conversational turns.

The behavioral signature is specific: SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. They learn to recognize when they lack enough information to answer well and to extract that information from the conversation partner. This is the human pattern of "ask before answering when you're not sure" — emerging in an LLM that was never explicitly trained on the pattern.

The mechanism appears to be that SML training teaches the model a meta-strategy: use the conversation as a resource. This strategy generalizes from "use the conversation to refine an answer to a fully-specified problem" (training distribution) to "use the conversation to get missing information first, then answer" (test distribution). The student has learned not just to solicit corrective feedback but to model the conversation as a place where information flows.

The result can be sharpened with a two-stage training procedure called Q-priming. A preliminary SFT stage trains the model on dialogues where it has been explicitly prompted to ask questions, leveraging the teacher's private knowledge to generate good question examples. After Q-priming, online RL via SML refines the behavior further. The combined pipeline produces stronger clarifying-question behavior than either alone.

For conversational AI design, this is an existence proof: the structural skill of "ask before answering" can be installed via training rather than via runtime prompting. Systems that have struggled with the "LLM answers prematurely" failure mode can address it at the training level rather than relying on prompt engineering.

Inquiring lines that use this note as a source 36

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
13 direct connections · 113 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

SML produces emergent clarifying-question behavior — models trained only on fully-specified problems learn to handle underspecified tasks by asking for missing information