SYNTHESIS NOTE
Agentic Systems and Tool Use

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?

Synthesis note · 2026-05-03 · sourced from Visual GUI Agents

AutoGLM's first key insight from building deployable foundation agents for web browsers and Android is that planning and grounding are not just different sub-tasks: they have opposing optimization requirements, and bundling them into one end-to-end policy means each pulls against the other.

Planning demands flexibility and error recovery. The agent must construct creative paths to goals, abandon failed approaches, and recover when the environment behaves unexpectedly. Optimizing planning means tolerating exploration, allowing the model to consider multiple branches, and rewarding adaptability.

Grounding demands action accuracy. Once the plan is set, the click must hit the right pixel, the form must receive the exact right text, the API call must use the exact right argument. Optimizing grounding means narrowing variability, locking in deterministic behavior, and punishing near-misses.

These two regimes pull in opposite directions during training. A model trained for planning flexibility loses grounding precision; a model trained for grounding accuracy becomes brittle. AutoGLM's answer is an intermediate interface between the two: an architectural artifact that separates them, letting each be developed and optimized on its own terms while still composing into a complete agent.
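To make the idea of an intermediate interface concrete, here is a minimal Python sketch. It is not AutoGLM's actual design: `PlanStep`, `GroundedAction`, `toy_planner`, `toy_grounder`, and the dictionary standing in for a screen are all hypothetical names chosen for illustration. The point it tries to show is only that the planner emits steps that name a target without coordinates, and a separate grounder turns each step into an exact action, so either side could be swapped or trained without touching the other.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PlanStep:
    """Hypothetical intermediate interface: what to do, with no coordinates."""
    action: str                 # e.g. "click" or "type"
    target: str                 # element description, e.g. "search box"
    text: Optional[str] = None  # text to enter, if any


@dataclass(frozen=True)
class GroundedAction:
    """A fully specified action with exact screen coordinates."""
    action: str
    x: int
    y: int
    text: Optional[str] = None


# Toy stand-in for a screenshot: element description -> center coordinates.
Screen = Dict[str, Tuple[int, int]]


def toy_planner(goal: str) -> List[PlanStep]:
    """Illustrative planner: a fixed three-step search plan for any goal."""
    return [
        PlanStep("click", "search box"),
        PlanStep("type", "search box", text=goal),
        PlanStep("click", "search button"),
    ]


def toy_grounder(step: PlanStep, screen: Screen) -> GroundedAction:
    """Illustrative grounder: exact lookup, failing loudly on a miss."""
    if step.target not in screen:
        raise LookupError(f"cannot ground target: {step.target!r}")
    x, y = screen[step.target]
    return GroundedAction(step.action, x, y, step.text)


def run_agent(
    goal: str,
    screen: Screen,
    planner: Callable[[str], List[PlanStep]],
    grounder: Callable[[PlanStep, Screen], GroundedAction],
) -> List[GroundedAction]:
    """Compose planner and grounder through the PlanStep interface only."""
    return [grounder(step, screen) for step in planner(goal)]
```

Calling `run_agent("running shoes", {"search box": (120, 40), "search button": (300, 40)}, toy_planner, toy_grounder)` yields three grounded actions. In this sketch a grounding failure surfaces as a `LookupError` rather than a silent near-miss, which is one way the planner side could be given a signal to recover from.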

This finding generalizes a pattern visible across the GUI agent literature: the load-bearing design move is not a better single-pass policy but a clean factoring at the right joint. See "Can structured interfaces help language models control GUIs better?" for Agent S's ACI and "Why do vision-only GUI agents struggle with screen interpretation?" for OmniParser's screen parsing layer. AutoGLM's second insight is that error recovery is crucial for robustness yet difficult to acquire offline, which motivates self-evolving online curriculum RL with weak-to-strong progressive training. That second insight depends on the first: the curriculum can target planning behaviors specifically because the interface has separated them from grounding behaviors.

The transferable claim: in any agent stack where two sub-capabilities have conflicting optimization requirements, the architecture must be factored before training, not the other way around. This is the same principle behind "Does separating planning from execution improve reasoning accuracy?", except that here the joint is between planning and grounding rather than between planning and execution.

BTL-UI (https://arxiv.org/abs/2509.15566) offers a brain-inspired three-phase factoring with a process+outcome reward. Its Blink-Think-Link framework refines the same factoring move by adding a third joint and a training mechanism. It decomposes GUI interaction into three biologically motivated phases: Blink (rapid detection and attention to relevant screen regions, analogous to saccades), Think (higher-level reasoning and planning), and Link (generation of executable commands for precise motor control). Blink and Link map onto the perception and grounding stages that AutoGLM and OmniParser isolate, while Think is the planning layer, so BTL is essentially this planning-grounding factoring with an explicit attention/perception stage carved off the front. Two technical additions matter: an automated Blink-data annotation pipeline, and BTL Reward, which its authors describe as the first rule-based reward that drives RL from both process and outcome rather than outcome alone. Rewarding process lets training target the intermediate phases, the same need AutoGLM's curriculum addresses. The convergence of independent labs on "factor the GUI agent at its natural joints, then train each separately" is the durable pattern; BTL adds that the reward, not just the architecture, should also be factored across process and outcome.
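The idea of a rule-based reward with both a process term and an outcome term can be sketched as follows. This is not the BTL Reward formula, which this note does not reproduce. The `<blink>`/`<think>`/`<link>` tag format, the click-in-box outcome check, and the `process_weight` parameter are all assumptions made for illustration.

```python
import re
from typing import Tuple

# Assumed response format: three tagged sections, in Blink-Think-Link order.
_PHASES = re.compile(
    r"<blink>.*?</blink>\s*<think>.*?</think>\s*<link>.*?</link>",
    flags=re.DOTALL,
)


def process_reward(response: str) -> float:
    """Rule-based process term: 1.0 if all three phases appear in order."""
    return 1.0 if _PHASES.fullmatch(response.strip()) else 0.0


def outcome_reward(
    click: Tuple[int, int], target_box: Tuple[int, int, int, int]
) -> float:
    """Rule-based outcome term: 1.0 if the click lands inside the target box.

    target_box is (left, top, right, bottom), edges inclusive.
    """
    x, y = click
    left, top, right, bottom = target_box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def combined_reward(
    response: str,
    click: Tuple[int, int],
    target_box: Tuple[int, int, int, int],
    process_weight: float = 0.25,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of the process and outcome terms."""
    return (
        process_weight * process_reward(response)
        + (1.0 - process_weight) * outcome_reward(click, target_box)
    )
```

Under these assumptions, a well-formed response whose click misses still earns the process share of the reward, and a malformed response whose click happens to land correctly earns only the outcome share. That is the sense in which a factored reward can give training signal to the intermediate phases rather than only to the final action.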

Original note title

foundation GUI agents need an intermediate interface that disentangles planning from grounding — the two have opposing optimization requirements