Can dialogue planning balance fast responses with strategic depth?

Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.

Synthesis note · 2026-02-22 · sourced from Conversation Architecture Structure

Proactive dialogue requires planning — steering conversations toward predetermined goals. LLMs typically struggle with this because of their reactive nature. The Dual-Process Dialogue Planning (DPDP) framework addresses this by implementing Kahneman's System 1/System 2 distinction:

System 1 — A neural policy language model that handles familiar dialogue contexts with quick, instinctive responses. Trained through offline RL to build a robust initial policy that mitigates suboptimal strategies from noisy training data.

System 2 — An MCTS-based planner that provides analytical, rational (but slower) planning for complex or novel scenarios where the policy model is uncertain.
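To make the System 2 side concrete, here is a minimal sketch of the selection step in a generic MCTS planner, using the standard UCT rule (mean value plus an exploration bonus). This note does not say which selection rule DPDP uses, so the rule, the `Node` structure, the `uct_select` name, the dialogue-action labels, and the exploration constant `c` are all illustrative assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Hypothetical search-tree node for one dialogue state."""
    visits: int = 0
    total_value: float = 0.0
    children: Dict[str, "Node"] = field(default_factory=dict)


def uct_select(node: Node, c: float = 1.4) -> str:
    """Pick the child action with the highest UCT score.

    Assumption: plain UCT is used here for illustration; the actual
    planner may use a different selection rule.
    """
    def score(child: Node) -> float:
        if child.visits == 0:
            # Unvisited actions are always tried first.
            return math.inf
        mean_value = child.total_value / child.visits
        exploration = c * math.sqrt(math.log(node.visits) / child.visits)
        return mean_value + exploration

    return max(node.children, key=lambda action: score(node.children[action]))


# Two candidate dialogue actions with equal mean value (0.5):
root = Node(visits=10, children={
    "empathize": Node(visits=8, total_value=4.0),
    "ask_question": Node(visits=2, total_value=1.0),
})
selected = uct_select(root)  # the less-visited action gets the larger bonus
```

With equal mean values, the exploration term favours the action that has been simulated less often, which is what lets the search probe strategies the policy model would not pick on instinct.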

Dynamic switching between systems is driven by the policy model's own uncertainty estimate. When the model is confident about the next dialogue action, System 1 fires. When uncertainty is high — novel context, complex goal structure, ambiguous user behavior — System 2 activates for deeper search.
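The switching logic can be sketched as a simple gate. This is a minimal sketch under stated assumptions: the note does not specify how the uncertainty estimate is computed, so normalized entropy of the policy's action distribution stands in for it, and `choose_action`, `plan_with_search`, and the `max_entropy_fraction` threshold are hypothetical names and values.

```python
import math
from typing import Callable, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_action(
    probs: Sequence[float],
    plan_with_search: Callable[[], int],
    max_entropy_fraction: float = 0.5,
) -> Tuple[int, str]:
    """Return (action index, name of the system that produced it).

    Assumption: uncertainty is measured as entropy divided by its
    maximum, log(n), so the threshold is comparable across action sets.
    """
    max_entropy = math.log(len(probs))
    uncertainty = entropy(probs) / max_entropy if max_entropy > 0 else 0.0

    if uncertainty <= max_entropy_fraction:
        # System 1: trust the policy model's most probable action.
        best = max(range(len(probs)), key=lambda i: probs[i])
        return best, "system1"

    # System 2: defer to the slower planner (for example, an MCTS search).
    return plan_with_search(), "system2"


confident = choose_action([0.9, 0.05, 0.05], plan_with_search=lambda: 2)
uncertain = choose_action([1 / 3, 1 / 3, 1 / 3], plan_with_search=lambda: 2)
```

The planner is passed in as a callable so that the expensive search only runs when the gate actually selects System 2.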

The two-stage training is the key innovation. Stage 1 uses offline RL to refine the policy model's base capabilities. Stage 2 uses MCTS simulations to guide the policy model toward generating superior strategies, accelerating convergence. The policy model progressively internalizes the MCTS planner's strategic depth, so over time System 1 handles more situations directly.
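One common way a policy can internalize a planner's output is to treat normalized MCTS visit counts as a training target and minimize cross-entropy against them. The note does not state DPDP's Stage 2 objective, so the following is only an illustrative sketch of that general idea; `visit_targets`, `cross_entropy`, the `temperature` parameter, and the action labels are hypothetical.

```python
import math
from typing import Dict


def visit_targets(visit_counts: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Turn MCTS visit counts into a target distribution over actions."""
    weights = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def cross_entropy(target: Dict[str, float], policy: Dict[str, float], eps: float = 1e-12) -> float:
    """Cross-entropy of the policy's distribution against the search target."""
    return -sum(t * math.log(policy.get(a, 0.0) + eps) for a, t in target.items())


# Assumed example: the search visited "ask_question" three times as often.
targets = visit_targets({"ask_question": 3, "empathize": 1})

loss_before = cross_entropy(targets, {"ask_question": 0.5, "empathize": 0.5})
loss_after = cross_entropy(targets, {"ask_question": 0.75, "empathize": 0.25})
```

The loss falls as the policy's distribution moves toward the search-derived target, which is the sense in which System 1 can come to handle more situations without invoking the planner.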

This connects directly to existing test-time compute findings. Following the adaptive-compute idea in "Can models learn when to think versus respond quickly?", DPDP applies the same principle to dialogue planning: spend more compute (MCTS) only when the policy model's uncertainty warrants it. The result is efficiency matching or exceeding pure MCTS-based methods while maintaining strategic depth.

The architecture embodies the broader principle raised in "Does RL post-training create reasoning or just deploy it?": the policy model's dialogue capabilities already exist from pretraining, and the uncertainty-switching mechanism teaches WHEN to deploy deep planning rather than how to plan. Additionally, by restricting System 2 (MCTS) to uncertain contexts, DPDP naturally avoids the overthinking threshold documented in "Does more thinking time always improve reasoning accuracy?": deep search activates only when warranted, instead of extended reasoning being applied universally, which degrades performance.

Related concepts in this collection 8

Can models learn when to think versus respond quickly? Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
same principle applied to dialogue: adaptive compute based on difficulty
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
DPDP is the dialogue-specific instance of adaptive compute allocation
Can tree search replace human feedback in LLM training? Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.
MCTS as System 2 for dialogue planning
Does RL post-training create reasoning or just deploy it? Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
DPDP is a dialogue-specific instance of the "when not how" principle: the policy model already has dialogue capabilities, and the uncertainty-based switching teaches WHEN to deploy deep planning
Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
DPDP's uncertainty-based switching to System 2 naturally avoids the overthinking threshold by restricting deep search to genuinely uncertain contexts rather than applying it universally
How can models select the most informative question to ask? Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty—moving beyond just deciding whether to ask toward deciding what to ask.
UoT (Uncertainty of Thoughts) could serve as the System 2 question-selection mechanism: when uncertainty triggers MCTS planning, information-gain scoring determines which clarifying question to generate next
When should an agent actually stop and deliberate? How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
SAND implements the same dual-process principle at a different granularity: DPDP switches between System 1 (instinctive policy) and System 2 (MCTS) based on uncertainty at the dialogue-turn level; SAND switches between direct action and deliberation at the step level within trajectories; both use uncertainty as the switching criterion
Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
DPDP's System 2 MCTS activation on uncertainty must be constrained when the uncertainty stems from ill-posed input rather than genuine decision complexity; without this distinction, the model applies deep planning to questions that require recognition of missing information, not more search
