When should an agent actually stop and deliberate?
How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
SAND (Self-taught Action Deliberation) addresses a question that recurs across the reasoning and agentic literatures: when should a model invest extra computation? In large or unbounded action spaces, deliberating over all possible actions at every step is intractable. But never deliberating misses opportunities to catch errors at critical decision points.
SAND's trigger is a self-consistency check. At each step, sample N actions from the current policy and compare them with the expert action. Define an inconsistency indicator over the resulting N+1 actions: if all of them are identical (the policy distribution is sharply peaked and agrees with the expert), set the deliberation flag to 0, because the decision is trivial or the model is confident. If any action differs, set the flag to 1: the model is uncertain, and deliberation should occur.
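The flag described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `sample_actions`, `deliberation_flag`, and `toy_policy` are hypothetical names, actions are compared by exact string match (an assumption; a real agent might normalise actions first), and the toy policy stands in for an LLM.

```python
import random
from typing import Callable, List

# Hypothetical policy signature: (state, rng) -> action string.
Policy = Callable[[str, random.Random], str]


def sample_actions(policy: Policy, state: str, n: int, seed: int = 0) -> List[str]:
    """Draw n actions from a stochastic policy for the given state."""
    rng = random.Random(seed)
    return [policy(state, rng) for _ in range(n)]


def deliberation_flag(sampled: List[str], expert_action: str) -> int:
    """Return 0 if all N+1 actions are identical, otherwise 1."""
    candidates = sampled + [expert_action]
    # Assumption: exact string equality decides whether two actions match.
    return 0 if len(set(candidates)) == 1 else 1


def toy_policy(state: str, rng: random.Random) -> str:
    # Toy stand-in for an LLM: confident in the hallway, unsure at a locked door.
    if state == "locked door":
        return rng.choice(["use key", "push door", "look around"])
    return "go north"


if __name__ == "__main__":
    for state, expert in [("hallway", "go north"), ("locked door", "use key")]:
        sampled = sample_actions(toy_policy, state, n=5)
        print(state, sampled, "flag =", deliberation_flag(sampled, expert))
```

Note that a single disagreeing sample is enough to raise the flag, so larger N makes the check stricter: more steps get flagged for deliberation.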
When deliberation triggers, SAND generates execution-guided critiques: instead of judging actions abstractly, it runs forward rollouts from each candidate action and uses the actual outcomes to inform evaluation. This is grounded assessment — not "which action looks better?" but "which action leads to better results?" The critiques are then synthesized into a deliberation thought that augments the trajectory.
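A rough sketch of execution-guided critique follows, under strong simplifying assumptions. The environment is a toy number line, critiques are templated strings rather than model-generated text, and the "best" action is simply the one whose rollout earns the highest reward. All names (`run_rollout`, `deliberate`, `toy_step`, `toy_reward`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RolloutResult:
    first_action: str
    final_state: int
    reward: float


def run_rollout(step: Callable[[int, str], int],
                reward_fn: Callable[[int], float],
                continue_policy: Callable[[int], str],
                state: int, first_action: str, horizon: int) -> RolloutResult:
    """Take first_action, then follow continue_policy for the remaining steps."""
    s = step(state, first_action)
    for _ in range(horizon - 1):
        s = step(s, continue_policy(s))
    return RolloutResult(first_action, s, float(reward_fn(s)))


def deliberate(step, reward_fn, continue_policy, state: int,
               candidates: List[str], horizon: int = 3) -> Tuple[str, str]:
    """Roll out each distinct candidate and synthesize one deliberation thought."""
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    results = [run_rollout(step, reward_fn, continue_policy, state, a, horizon)
               for a in unique]
    # Assumption: a templated sentence stands in for a model-written critique.
    critiques = [
        f"Action '{r.first_action}' ended in state {r.final_state} "
        f"with reward {r.reward:.1f}."
        for r in results
    ]
    best = max(results, key=lambda r: r.reward)
    thought = " ".join(critiques + [f"So '{best.first_action}' looks best here."])
    return thought, best.first_action


# Toy environment: walk on a number line towards position 1.
MOVES = {"left": -1, "stay": 0, "right": 1}


def toy_step(state: int, action: str) -> int:
    return state + MOVES[action]


def toy_reward(state: int) -> float:
    return -abs(1 - state)


if __name__ == "__main__":
    thought, action = deliberate(toy_step, toy_reward, lambda s: "stay",
                                 state=0, candidates=["left", "right", "right"])
    print(thought)
    print("chosen:", action)
```

The point of the sketch is the ordering: outcomes are observed first, and the critique is written from them, rather than the other way round.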
The mechanism is self-teaching: deliberation trajectories are used for iterative finetuning of the agent itself. The model learns not just what to do but when to deliberate, internalizing the meta-decision of compute allocation.
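One way to picture the self-teaching loop is sketched below. It assumes that the finetuning target at a flagged step is the deliberation thought followed by the expert action, and that unflagged steps keep the bare expert action; the note does not spell out this format, so treat it as an assumption. `build_training_examples`, `self_teach`, and the callbacks `make_sampler`, `deliberate_fn`, and `finetune_fn` are hypothetical placeholders for a real sampling, critique, and training stack.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = List[Tuple[str, str]]          # (state, expert_action) pairs
Sampler = Callable[[str, int], List[str]]   # (state, n) -> n sampled actions


def build_training_examples(trajectory: Trajectory, sampler: Sampler,
                            deliberate_fn: Callable[[str, List[str]], str],
                            n: int) -> List[Dict[str, str]]:
    """Augment expert steps with a thought only where samples are inconsistent."""
    examples = []
    for state, expert_action in trajectory:
        candidates = sampler(state, n) + [expert_action]
        if len(set(candidates)) > 1:
            # Uncertain step: prepend a deliberation thought to the target.
            thought = deliberate_fn(state, candidates)
            completion = f"Thought: {thought}\nAction: {expert_action}"
        else:
            # Confident step: no deliberation, just the action.
            completion = f"Action: {expert_action}"
        examples.append({"prompt": state, "completion": completion})
    return examples


def self_teach(policy, trajectories: List[Trajectory],
               make_sampler: Callable[[object], Sampler],
               deliberate_fn: Callable[[str, List[str]], str],
               finetune_fn: Callable[[object, List[Dict[str, str]]], object],
               n: int, rounds: int):
    """Iterate: sample from the current policy, build data, finetune, repeat."""
    for _ in range(rounds):
        sampler = make_sampler(policy)
        data: List[Dict[str, str]] = []
        for trajectory in trajectories:
            data.extend(build_training_examples(trajectory, sampler,
                                                deliberate_fn, n))
        policy = finetune_fn(policy, data)  # hypothetical training call
    return policy
```

Because the sampler is rebuilt from the updated policy each round, the set of flagged steps can shrink as the agent becomes consistent on steps it previously found uncertain.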
This connects to the adaptive compute literature at a different granularity. "Can we allocate inference compute based on prompt difficulty?" operates at the prompt level (how much total compute for this problem?). "Can models learn when to think versus respond quickly?" operates at the response level (think or not?). SAND operates at the step level within a trajectory (deliberate at this step or not?). Each solves the same fundamental problem, allocating variable compute based on difficulty, at a different scale.
The contrast with "Do reasoning models switch between ideas too frequently?" is instructive: underthinking wastes compute by abandoning a line of reasoning too early, while universal deliberation wastes compute by thinking too hard at trivial steps. Both are compute-allocation failures, but in opposite directions.
Inquiring lines that use this note as a source (8)
- When should action deliberation trigger during reasoning steps?
- Can extended deliberation in agents become counterproductive like human overthinking?
- What distinguishes redundant cycles from productive reconsidering cycles?
- How do agents decide when to abstain from contributing?
- What makes a possibility actionable versus merely metaphysically possible?
- How do agents decide when to pause and reflect on their strategy?
- Why does per-step deliberation lose global perspective compared to dynamic discovery?
- How do agents decide when to stop and reflect on failure?
Related concepts in this collection (7)
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  same principle (adaptive compute) at prompt level; SAND operates at step level
- Can models learn when to think versus respond quickly?
  Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
  response-level compute allocation; SAND adds step-level
- Do reasoning models switch between ideas too frequently?
  Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
  the complementary failure mode: too little thinking at critical moments
- Which sentences actually steer a reasoning trace?
  Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
  thought anchors may correlate with SAND's deliberation-flagged steps: high-causal-influence points are likely where action consistency diverges
- When should retrieval happen during model generation?
  Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
  same uncertainty-triggered adaptive resource allocation at a different granularity: FLARE triggers retrieval on low-probability tokens, SAND triggers deliberation on inconsistent action samples; both avoid wasting resources on confident steps
- Do iterative refinement methods suffer from overthinking?
  Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
  SAND prevents the overthinking failure by making deliberation conditional: instead of deliberating at every step (which reproduces overthinking at the action level), SAND's self-consistency check gates computation to uncertain steps only, avoiding the variance inflation that universal deliberation causes
- Can dialogue planning balance fast responses with strategic depth?
  Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
  DPDP applies the same dual-process principle to dialogue: instinctive policy (System 1) for familiar contexts, MCTS (System 2) for novel scenarios, with uncertainty-based switching; SAND operates at per-step granularity within trajectories while DPDP operates at per-turn granularity within conversations, but both implement the Kahneman insight that deliberation should be selective, not universal
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- SAND: Boosting LLM Agents with Self-Taught Action Deliberation
- Reinforcement Learning be Enough for Thinking?
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
- ReAct: Synergizing Reasoning and Acting in Language Models
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Teaching Large Language Models to Reason with Reinforcement Learning
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
Original note title
action deliberation should trigger only at uncertain steps — self-consistency sampling identifies when deliberation adds value versus wastes compute