Can dialogue planning balance fast responses with strategic depth?
Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
Proactive dialogue requires planning — steering conversations toward predetermined goals. LLMs typically struggle with this because of their reactive nature. The Dual-Process Dialogue Planning (DPDP) framework addresses this by implementing Kahneman's System 1/System 2 distinction:
System 1: A neural policy language model that handles familiar dialogue contexts with quick, instinctive responses. It is trained through offline RL to build a robust initial policy that mitigates suboptimal strategies learned from noisy training data.
System 2: A Monte Carlo Tree Search (MCTS) planner that provides analytical, rational (but slower) planning for complex or novel scenarios where the policy model is uncertain.
Dynamic switching between systems is driven by the policy model's own uncertainty estimate. When the model is confident about the next dialogue action, System 1 fires. When uncertainty is high — novel context, complex goal structure, ambiguous user behavior — System 2 activates for deeper search.
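The switching rule described above can be sketched as a simple gate. This is a minimal illustration, not the DPDP implementation: using Shannon entropy as the uncertainty estimate, the threshold value, and the names `choose_action` and `slow_planner` are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, Tuple


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a distribution over dialogue actions."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def choose_action(
    action_probs: Dict[str, float],
    slow_planner: Callable[[], str],
    threshold: float = 0.7,  # hypothetical value; would be tuned in practice
) -> Tuple[str, str]:
    """Return (action, system_used).

    Assumed gate: if the policy's entropy exceeds `threshold`, hand control
    to the slow planner (standing in for MCTS); otherwise take the policy's
    most probable action directly.
    """
    if entropy(action_probs) > threshold:
        return slow_planner(), "system2"
    best = max(action_probs, key=action_probs.get)
    return best, "system1"


# A confident distribution stays with System 1; a flat one escalates.
confident = {"ask": 0.90, "inform": 0.05, "persuade": 0.05}
uncertain = {"ask": 0.34, "inform": 0.33, "persuade": 0.33}
print(choose_action(confident, lambda: "planned_action"))
print(choose_action(uncertain, lambda: "planned_action"))
```

Any scalar uncertainty signal (for example, the margin between the top two action probabilities) could replace entropy here without changing the structure of the gate.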
The two-stage training is the key innovation. Stage 1 uses offline RL to refine the policy model's base capabilities. Stage 2 uses MCTS simulations to guide the policy model toward generating superior strategies, accelerating convergence. The policy model progressively internalizes the MCTS planner's strategic depth, so over time System 1 handles more situations directly.
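One way to picture the "internalization" in Stage 2 is as distillation: the policy is nudged toward the action the planner selected, so its confidence in that context grows and the slow path is needed less often. The toy below uses a plain cross-entropy update on a three-action softmax policy. The objective, learning rate, and action set are assumptions for illustration; the source does not specify the exact Stage 2 loss.

```python
import math
from typing import List

ACTIONS: List[str] = ["ask", "inform", "persuade"]  # hypothetical action set


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_step(logits: List[float], target: int, lr: float = 0.5) -> List[float]:
    """One gradient step on -log p[target].

    The gradient of the cross-entropy loss with respect to the logits is
    (probs - onehot(target)), so we move the logits against it.
    """
    probs = softmax(logits)
    return [
        l - lr * (p - (1.0 if i == target else 0.0))
        for i, (l, p) in enumerate(zip(logits, probs))
    ]


def train(planner_choice: int, steps: int = 20) -> List[float]:
    """Repeatedly imitate the planner's choice for a single context."""
    logits = [0.0] * len(ACTIONS)  # start from a maximally uncertain policy
    for _ in range(steps):
        logits = distill_step(logits, planner_choice)
    return softmax(logits)


probs = train(planner_choice=ACTIONS.index("persuade"))
print({a: round(p, 3) for a, p in zip(ACTIONS, probs)})
```

As the probability mass concentrates on the planner's choice, an uncertainty gate would route this context to System 1 on later encounters, which matches the note's claim that System 1 handles more situations over time.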
This connects directly to existing test-time compute findings. Following the principle explored in "Can models learn when to think versus respond quickly?", DPDP applies adaptive compute to dialogue planning: it spends more compute (MCTS) only when the policy model's uncertainty warrants it. The result is efficiency that matches or exceeds pure MCTS-based methods while preserving strategic depth.
The architecture also embodies the broader principle discussed in "Does RL post-training create reasoning or just deploy it?": the policy model's dialogue capabilities already exist from pretraining, and the uncertainty-switching mechanism teaches WHEN to deploy deep planning rather than how to plan. Additionally, by restricting System 2 (MCTS) to uncertain contexts, DPDP naturally avoids the overthinking threshold documented in "Does more thinking time always improve reasoning accuracy?": deep search activates only when warranted, which prevents the universal application of extended reasoning that degrades performance.
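The compute argument can be made concrete with a small back-of-the-envelope calculation. The cost model below is an assumption for illustration: the policy runs on every turn (it is needed to estimate uncertainty), and MCTS runs only on the fraction of turns flagged as uncertain. The numbers are hypothetical, not measurements from DPDP.

```python
def expected_turn_cost(p_uncertain: float, policy_cost: float, mcts_cost: float) -> float:
    """Expected per-turn cost when MCTS runs only on uncertain turns.

    Assumes the policy always runs once, and MCTS is added with
    probability `p_uncertain`.
    """
    if not 0.0 <= p_uncertain <= 1.0:
        raise ValueError("p_uncertain must be a probability")
    return policy_cost + p_uncertain * mcts_cost


# Hypothetical costs in arbitrary units: policy = 1, MCTS = 50.
always_plan = expected_turn_cost(1.0, 1.0, 50.0)  # MCTS on every turn
gated = expected_turn_cost(0.2, 1.0, 50.0)        # MCTS on 20% of turns
print(always_plan, gated)
```

Under this model the saving scales directly with how rarely the gate fires, so any training that lowers the policy's uncertainty also lowers the average cost per turn.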
Inquiring lines that use this note as a source (19)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can dialogue systems abstain from responding when uncertainty is too high?
- What dialogue dynamics distinguish negotiation from standard information-provision tasks?
- Can systems guide users adaptively without imposing predetermined dialogue structures?
- What speaker selection protocol prevents both stalling and premature convergence?
- Can offline reinforcement learning improve dialogue policy baseline performance?
- What does an intermediate interface between planning and grounding actually look like?
- How do graduated phase rewards emerge complex dialogue behavior from simple objectives?
- Can topic planning and response generation reduce dialogue turns?
- How does single-turn training undermine multi-turn strategic dialogue?
- Can hierarchical reinforcement learning manage phase-dependent initiative switching in dialogue?
- Why do aha moments emerge specifically during the planning phase?
- Why do conversational systems benefit from post-thinking between user turns?
- Can skipping transcription reduce speech dialogue latency below 300 milliseconds?
- When should a system choose extended thinking versus quick responses?
- Why do conversational agents lack the goal awareness needed to lead rather than just respond?
- How might dual-process dialogue use information gain to trigger clarification?
- Can emotion-grounded rewards replace coarse bonus signals in hierarchical dialogue RL?
- How does preference optimization erode the conversational grounding it aims to improve?
- How does structured self-dialogue improve uncertainty assessment over confidence scores?
Related concepts in this collection (8)
- Can models learn when to think versus respond quickly?
  Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
  Relation: same principle applied to dialogue: adaptive compute based on difficulty
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  Relation: DPDP is the dialogue-specific instance of adaptive compute allocation
- Can tree search replace human feedback in LLM training?
  Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.
  Relation: MCTS as System 2 for dialogue planning
- Does RL post-training create reasoning or just deploy it?
  Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
  Relation: DPDP is a dialogue-specific instance of the "when not how" principle: the policy model already has dialogue capabilities, and the uncertainty-based switching teaches WHEN to deploy deep planning
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  Relation: DPDP's uncertainty-based switching to System 2 naturally avoids the overthinking threshold by restricting deep search to genuinely uncertain contexts rather than applying it universally
- How can models select the most informative question to ask?
  Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty, moving beyond just deciding whether to ask toward deciding what to ask.
  Relation: UoT could serve as the System 2 question-selection mechanism: when uncertainty triggers MCTS planning, information-gain scoring determines which clarifying question to generate next
- When should an agent actually stop and deliberate?
  How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
  Relation: SAND implements the same dual-process principle at a different granularity: DPDP switches between System 1 (instinctive policy) and System 2 (MCTS) based on uncertainty at the dialogue-turn level; SAND switches between direct action and deliberation at the step level within trajectories; both use uncertainty as the switching criterion
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or inherent limitation.
  Relation: DPDP's System 2 MCTS activation on uncertainty must be constrained when the uncertainty stems from ill-posed input rather than genuine decision complexity; without this distinction, the model applies deep planning to questions that require recognition of missing information, not more search
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Planning Like Human: A Dual-process Framework for Dialogue Planning
- Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
- Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games
- Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
- POMDP-based Statistical Spoken Dialogue Systems: a Review
- Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
- ReAct: Synergizing Reasoning and Acting in Language Models
- Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals
Original note title
dual-process dialogue planning applies System 1 and System 2 cognition to conversation — instinctive policy for familiar contexts and MCTS for novel scenarios switching on uncertainty