When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

SAND (Self-taught Action Deliberation) addresses a question that recurs across the reasoning and agentic literatures: when should a model invest extra computation? In large or unbounded action spaces, deliberating over all possible actions at every step is intractable. But never deliberating misses opportunities to catch errors at critical decision points.

SAND's answer is a self-consistency check: at each step, sample N actions from the current policy and compare them with the expert action. If all N+1 actions are identical (the policy distribution is sharply peaked), the deliberation flag is set to 0: the decision is trivial or the model is confident. If any action differs, the flag is set to 1: the model is uncertain, and deliberation should occur.
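The consistency check described above can be sketched in a few lines. This is a minimal illustration, not SAND's actual code: the names `sample_actions` and `deliberation_flag` are hypothetical, actions are assumed to be plain strings, and "identical" is assumed to mean exact string equality.

```python
from typing import Callable, List


def sample_actions(policy: Callable[[str], str], state: str, n: int) -> List[str]:
    """Draw n actions from a (hypothetical) stochastic policy for one state."""
    return [policy(state) for _ in range(n)]


def deliberation_flag(sampled: List[str], expert_action: str) -> int:
    """Return 0 if all N+1 actions are identical, else 1.

    Assumption: two actions count as the same only if their strings match exactly.
    """
    candidates = sampled + [expert_action]
    return int(len(set(candidates)) > 1)


# Toy usage with a deterministic stand-in policy that always agrees with the expert.
confident_policy = lambda state: "open door"
samples = sample_actions(confident_policy, "at the door", n=4)
print(deliberation_flag(samples, "open door"))  # prints 0
print(deliberation_flag(["open door", "go north"], "open door"))  # prints 1
```

Exact matching is the strictest possible reading of "identical"; a real agent might need to normalize action strings first, which would lower the fraction of steps that get flagged.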

When deliberation triggers, SAND generates execution-guided critiques: instead of judging actions abstractly, it runs forward rollouts from each candidate action and uses the actual outcomes to inform evaluation. This is grounded assessment — not "which action looks better?" but "which action leads to better results?" The critiques are then synthesized into a deliberation thought that augments the trajectory.
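A rough sketch of execution-guided critique follows, under stated assumptions. The `rollout` callable, the `Critique` record, and the numeric outcome score are all hypothetical stand-ins; in the note, the critiques and the final deliberation thought are produced by the model, whereas here simple string templates take their place.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    action: str
    outcome: float  # assumed scalar summary of the rollout result
    text: str


def critique_candidates(
    state: str,
    candidates: List[str],
    rollout: Callable[[str, str], float],
) -> List[Critique]:
    """Roll out each distinct candidate action and record what actually happened."""
    critiques = []
    for action in dict.fromkeys(candidates):  # de-duplicate, keep first-seen order
        outcome = rollout(state, action)
        text = f"Rolling out '{action}' ended with outcome {outcome:.2f}."
        critiques.append(Critique(action, outcome, text))
    return critiques


def synthesize_thought(critiques: List[Critique]) -> str:
    """Merge per-action critiques into one deliberation thought (template stand-in)."""
    best = max(critiques, key=lambda c: c.outcome)
    body = " ".join(c.text for c in critiques)
    return f"{body} So '{best.action}' looks preferable here."


# Toy usage: a fake environment where only 'take key' leads to success.
toy_rollout = lambda state, action: 1.0 if action == "take key" else 0.0
cs = critique_candidates("locked door", ["open door", "take key", "open door"], toy_rollout)
print(synthesize_thought(cs))
```

The point of the sketch is the ordering: outcomes are observed first, and the comparison between actions is written only afterwards, so the judgment rests on what the rollouts returned rather than on how the actions look.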

The mechanism is self-teaching: deliberation trajectories are used for iterative finetuning of the agent itself. The model learns not just what to do but when to deliberate, internalizing the meta-decision of compute allocation.
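One way to picture how deliberation trajectories could become finetuning data is sketched below. The record layout, the `Thought:`/`Action:` text format, and the use of the expert action as the training target are assumptions made for illustration; the note does not specify them.

```python
from typing import Dict, List, NamedTuple


class Step(NamedTuple):
    observation: str
    expert_action: str
    flag: int      # 1 if the consistency check triggered deliberation
    thought: str   # deliberation thought; empty when flag is 0


def build_target(step: Step) -> str:
    """Include the thought only at flagged steps, so the target encodes *when* to deliberate."""
    if step.flag == 1:
        return f"Thought: {step.thought}\nAction: {step.expert_action}"
    return f"Action: {step.expert_action}"


def build_examples(trajectory: List[Step]) -> List[Dict[str, str]]:
    """Turn one trajectory into prompt/target pairs for a finetuning round."""
    return [{"prompt": s.observation, "target": build_target(s)} for s in trajectory]


# Toy usage: one trivial step and one uncertain step.
trajectory = [
    Step("hallway", "go north", 0, ""),
    Step("locked door", "take key", 1, "Opening fails without the key."),
]
for example in build_examples(trajectory):
    print(example)
```

Because unflagged steps carry no thought in the target, a model trained on such pairs would see direct actions at easy steps and thought-then-action at uncertain ones, which is the meta-decision the paragraph above describes.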

This connects to the adaptive compute literature at a different granularity. "Can we allocate inference compute based on prompt difficulty?" operates at the prompt level (how much total compute for this problem?). "Can models learn when to think versus respond quickly?" operates at the response level (think or not?). SAND operates at the step level within a trajectory (deliberate at this step or not?). Each solves the same fundamental problem, allocating variable compute based on difficulty, at a different scale.

The contrast with "Do reasoning models switch between ideas too frequently?" is instructive: underthinking wastes compute by abandoning a line of reasoning too early, while universal deliberation wastes compute by thinking too hard at trivial steps. Both are compute-allocation failures, but in opposite directions.

Related concepts in this collection 7

Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
same principle (adaptive compute) at prompt level; SAND operates at step level
Can models learn when to think versus respond quickly? Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
response-level compute allocation; SAND adds step-level
Do reasoning models switch between ideas too frequently? Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
the complementary failure mode: too little thinking at critical moments
Which sentences actually steer a reasoning trace? Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
thought anchors may correlate with SAND's deliberation-flagged steps: high-causal-influence points are likely where action consistency diverges
When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
same uncertainty-triggered adaptive resource allocation at a different granularity: FLARE triggers retrieval on low-probability tokens, SAND triggers deliberation on inconsistent action samples; both avoid wasting resources on confident steps
Do iterative refinement methods suffer from overthinking? Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
SAND prevents the overthinking failure by making deliberation conditional. Deliberating at every step would reproduce overthinking at the action level; SAND's self-consistency check instead gates computation to uncertain steps only, avoiding the variance inflation that universal deliberation causes
Can dialogue planning balance fast responses with strategic depth? Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
DPDP applies the same dual-process principle to dialogue: an instinctive policy (System 1) handles familiar contexts and MCTS (System 2) handles novel scenarios, with uncertainty-based switching between them. SAND operates at per-step granularity within trajectories and DPDP at per-turn granularity within conversations, but both implement the Kahneman insight that deliberation should be selective, not universal
