SYNTHESIS NOTE
Training, RL, and Test-Time Scaling Reasoning, Retrieval, and Evaluation Model Architecture and Internals

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback
How should we allocate compute budget at inference time?

SAND (Self-taught Action Deliberation) addresses a question that recurs across the reasoning and agentic literatures: when should a model invest extra computation? In large or unbounded action spaces, deliberating over all possible actions at every step is intractable. But never deliberating misses opportunities to catch errors at critical decision points.

The solution is elegant: at each step, sample N actions from the current policy alongside the expert action. Define an inconsistency indicator: if all N+1 actions are identical (the policy distribution is sharply peaked), set deliberation flag to 0 — the decision is trivial or the model is confident. If any actions differ, set flag to 1 — the model is uncertain, and deliberation should occur.

When deliberation triggers, SAND generates execution-guided critiques: instead of judging actions abstractly, it runs forward rollouts from each candidate action and uses the actual outcomes to inform evaluation. This is grounded assessment — not "which action looks better?" but "which action leads to better results?" The critiques are then synthesized into a deliberation thought that augments the trajectory.

The mechanism is self-teaching: deliberation trajectories are used for iterative finetuning of the agent itself. The model learns not just what to do but when to deliberate, internalizing the meta-decision of compute allocation.

This connects to the adaptive compute literature at a different granularity. Can we allocate inference compute based on prompt difficulty? operates at the prompt level (how much total compute for this problem?). Can models learn when to think versus respond quickly? operates at the response level (think or not?). SAND operates at the step level within a trajectory (deliberate at this step or not?). Each solves the same fundamental problem — allocating variable compute based on difficulty — at a different scale.

The contrast with Do reasoning models switch between ideas too frequently? is instructive: underthinking wastes compute by switching topics too early, while universal deliberation wastes compute by thinking too hard at trivial steps. Both are compute-allocation failures, but in opposite directions.

Inquiring lines that use this note as a source 8

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 7

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
20 direct connections · 163 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

action deliberation should trigger only at uncertain steps — self-consistency sampling identifies when deliberation adds value versus wastes compute