SYNTHESIS NOTE
Conversational AI and Personalization · Agentic Systems and Tool Use · Psychology, Society, and Alignment

Can dialogue planning balance fast responses with strategic depth?

Can a system use quick, instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This note explores whether adaptive computation improves dialogue goal-reaching.

Synthesis note · 2026-02-22 · sourced from Conversation Architecture Structure
How should we allocate compute budget at inference time? Why do AI agents fail to take initiative? How should researchers navigate LLM reasoning research?

Proactive dialogue requires planning — steering conversations toward predetermined goals. LLMs typically struggle with this because of their reactive nature. The Dual-Process Dialogue Planning (DPDP) framework addresses this by implementing Kahneman's System 1/System 2 distinction:

System 1: A neural policy language model that handles familiar dialogue contexts with quick, instinctive responses. It is trained through offline RL to build a robust initial policy that mitigates suboptimal strategies learned from noisy training data.

System 2: An MCTS-based planner that provides analytical, rational (but slower) planning for complex or novel scenarios where the policy model is uncertain.
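
To make the System 2 role concrete, here is a minimal sketch of a Monte Carlo tree search over dialogue actions. The note only says the planner is MCTS-based, so everything specific here is an assumption for illustration: the PUCT-style selection rule, the uniform priors (a real system would presumably take them from the policy model), the random rollouts, and the toy simulator `toy_step`, in which "propose" succeeds only after two "empathize" turns.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical simulator signature: (state, action) -> (next_state, reward, done).
Step = Callable[[int, str], Tuple[int, float, bool]]


class Node:
    """Search statistics for one dialogue action in the tree."""

    def __init__(self, prior: float) -> None:
        self.prior = prior
        self.visits = 0
        self.value_sum = 0.0
        self.children: Dict[str, "Node"] = {}

    def value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def puct(parent: Node, child: Node, c_puct: float) -> float:
    """Mean return plus a prior-weighted exploration bonus."""
    bonus = c_puct * child.prior * math.sqrt(parent.visits + 1) / (1 + child.visits)
    return child.value() + bonus


def mcts_plan(state: int, actions: List[str], step: Step, n_sim: int = 200,
              c_puct: float = 1.0, max_depth: int = 6, seed: int = 0) -> str:
    """Return the root action with the most visits after n_sim simulations."""
    rng = random.Random(seed)
    root = Node(prior=1.0)
    for _ in range(n_sim):
        node, s, path = root, state, [root]
        ret, done, depth = 0.0, False, 0
        # Selection and expansion: descend until a leaf is added or the dialogue ends.
        while not done and depth < max_depth:
            if not node.children:
                # Assumption: uniform priors stand in for policy-model probabilities.
                node.children = {a: Node(1.0 / len(actions)) for a in actions}
            parent = node
            a = max(actions, key=lambda x: puct(parent, parent.children[x], c_puct))
            node = parent.children[a]
            s, r, done = step(s, a)
            ret += r
            depth += 1
            path.append(node)
            if node.visits == 0:
                break
        # Rollout: finish the simulated dialogue with random actions.
        while not done and depth < max_depth:
            s, r, done = step(s, rng.choice(actions))
            ret += r
            depth += 1
        # Backup: credit the total return to every node on the path.
        for n in path:
            n.visits += 1
            n.value_sum += ret
    return max(actions, key=lambda x: root.children[x].visits)


def toy_step(rapport: int, action: str) -> Tuple[int, float, bool]:
    """Toy dialogue: proposing ends the talk and pays off only at rapport >= 2."""
    if action == "propose":
        return rapport, (1.0 if rapport >= 2 else 0.0), True
    return rapport + 1, 0.0, False


best_first_move = mcts_plan(0, ["empathize", "propose"], toy_step)
```

In this toy setting the search favors "empathize" as the first move, because proposing immediately always ends the dialogue with zero reward. The cost is visible in `n_sim`: every planned turn runs hundreds of simulated continuations, which is why the framework reserves this path for uncertain contexts.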

Dynamic switching between systems is driven by the policy model's own uncertainty estimate. When the model is confident about the next dialogue action, System 1 fires. When uncertainty is high — novel context, complex goal structure, ambiguous user behavior — System 2 activates for deeper search.
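
The switching rule can be sketched as a simple gate. The note states only that the policy model's own uncertainty estimate drives the switch; the specific measure used here (normalized entropy of the action distribution), the `threshold` value, and the `planner` callback are assumptions chosen for illustration, not details confirmed by the source.

```python
import math
from typing import Callable, Dict, Tuple


def normalized_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy of the action distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs.values() if p > 0.0)
    return h / math.log(len(probs))


def choose_action(
    probs: Dict[str, float],
    planner: Callable[[Dict[str, float]], str],
    threshold: float = 0.5,  # hypothetical cutoff; would need tuning in practice
) -> Tuple[str, str]:
    """Return (action, system). Low uncertainty uses the policy directly."""
    if normalized_entropy(probs) <= threshold:
        # System 1: take the policy model's most probable action.
        return max(probs, key=probs.get), "system1"
    # System 2: defer to the slower planner (for example, an MCTS search).
    return planner(probs), "system2"


def stub_planner(probs: Dict[str, float]) -> str:
    """Placeholder for a deeper search; always returns a fixed action here."""
    return "ask_clarifying_question"


confident = {"offer_discount": 0.9, "ask_budget": 0.05, "concede": 0.05}
uncertain = {"offer_discount": 1 / 3, "ask_budget": 1 / 3, "concede": 1 / 3}

fast = choose_action(confident, stub_planner)
slow = choose_action(uncertain, stub_planner)
```

With these inputs, the peaked distribution stays on the System 1 path and the uniform one falls through to the planner. Lowering `threshold` sends more turns to System 2, trading latency for search depth.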

The two-stage training is the key innovation. Stage 1 uses offline RL to refine the policy model's base capabilities. Stage 2 uses MCTS simulations to guide the policy model toward generating superior strategies, accelerating convergence. The policy model progressively internalizes the MCTS planner's strategic depth, so over time System 1 handles more situations directly.
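
One way to picture how the policy could internalize the planner's output in Stage 2 is a distillation step that pulls the policy's action distribution toward a target derived from MCTS. This is a sketch under stated assumptions, not the framework's documented training objective: the cross-entropy loss, the tabular `logits`, the learning rate, and the made-up `mcts_target` distribution are all hypothetical.

```python
import math
from typing import Dict


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a dict of action logits."""
    m = max(logits.values())
    exps = {a: math.exp(v - m) for a, v in logits.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def distill_step(logits: Dict[str, float], target: Dict[str, float],
                 lr: float = 0.5) -> Dict[str, float]:
    """One gradient step on the cross-entropy between target and softmax(logits)."""
    probs = softmax(logits)
    # For a target that sums to 1, d(loss)/d(logit_a) = probs[a] - target[a].
    return {a: logits[a] - lr * (probs[a] - target.get(a, 0.0)) for a in logits}


# Hypothetical target, e.g. normalized MCTS visit counts for one dialogue state.
mcts_target = {"empathize": 0.8, "propose": 0.2}

logits = {"empathize": 0.0, "propose": 0.0}  # policy starts undecided
for _ in range(200):
    logits = distill_step(logits, mcts_target)

policy_after = softmax(logits)
```

After repeated steps the policy's distribution approaches the planner's target, so the same state would later look confident to the policy and be handled without search. That is the sense in which System 1 gradually covers more situations directly.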

This connects directly to existing test-time compute findings. In line with "Can models learn when to think versus respond quickly?", DPDP applies the same principle to dialogue planning: spend more compute (MCTS) only when the policy model's uncertainty warrants it. The result is efficiency that matches or exceeds pure MCTS-based methods while maintaining strategic depth.

The architecture embodies the broader principle explored in "Does RL post-training create reasoning or just deploy it?": the policy model's dialogue capabilities already exist from pretraining, and the uncertainty-switching mechanism teaches WHEN to deploy deep planning rather than how to plan. Additionally, by restricting System 2 (MCTS) to uncertain contexts, DPDP naturally avoids the overthinking threshold documented in "Does more thinking time always improve reasoning accuracy?": deep search activates only when warranted, which prevents the blanket application of extended reasoning that degrades performance.

Original note title

dual-process dialogue planning applies System 1 and System 2 cognition to conversation — instinctive policy for familiar contexts and MCTS for novel scenarios switching on uncertainty