SYNTHESIS NOTE

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?

Synthesis note · 2026-05-03 · sourced from Visual GUI Agents

AutoGLM's first key insight from building deployable foundation agents for web browsers and Android is that planning and grounding are not just different sub-tasks: they have opposing optimization requirements, and bundling them in one end-to-end policy means each pulls against the other.

Planning demands flexibility and error recovery. The agent must construct creative paths to goals, abandon failed approaches, and recover when the environment behaves unexpectedly. Optimizing planning means tolerating exploration, allowing the model to consider multiple branches, and rewarding adaptability.

Grounding demands action accuracy. Once the plan is set, the click must hit the right pixel, the form must receive the exact right text, the API call must use the exact right argument. Optimizing grounding means narrowing variability, locking in deterministic behavior, and punishing near-misses.

These two regimes pull in opposite directions during training. A model trained for planning flexibility loses grounding precision; a model trained for grounding accuracy becomes brittle. AutoGLM's intermediate interface is the architectural artifact that separates them, letting each be developed and optimized on its own terms while still composing into a complete agent.
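A minimal sketch of that separation may help make it concrete. It is not AutoGLM's implementation: the `Step` record, the `ScriptedPlanner` and `LabelGrounder` stand-ins, and the `run_agent` loop are all hypothetical names chosen for illustration. The point it shows is only that the planner emits steps in words, the grounder alone produces coordinates, and either side can be swapped or retrained without touching the other.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    """Hypothetical intermediate interface: what to do, with no coordinates."""
    action: str      # e.g. "click" or "type"
    target: str      # natural-language description of the element
    text: str = ""   # payload for "type" actions


@dataclass(frozen=True)
class Element:
    """A screen element as a toy grounder sees it."""
    label: str
    x: int
    y: int


class ScriptedPlanner:
    """Stand-in for a planning model: emits steps, never coordinates."""

    def __init__(self, steps: List[Step]) -> None:
        self._steps = steps

    def next_step(self, done: int) -> Optional[Step]:
        return self._steps[done] if done < len(self._steps) else None


class LabelGrounder:
    """Stand-in for a grounding model: resolves a description to an element."""

    def locate(self, step: Step, screen: List[Element]) -> Optional[Element]:
        for element in screen:
            if element.label.lower() == step.target.lower():
                return element
        return None


def run_agent(planner, grounder, screen) -> List[Tuple[str, int, int, str]]:
    """Compose the two halves; only Step objects cross the boundary."""
    executed = []
    while True:
        step = planner.next_step(len(executed))
        if step is None:
            return executed
        element = grounder.locate(step, screen)
        if element is None:
            # A grounding failure surfaces at the interface, not inside the plan.
            raise LookupError(f"could not ground target: {step.target!r}")
        executed.append((step.action, element.x, element.y, step.text))
```

In this toy version, a plan of `Step("type", "search box", "weather")` followed by `Step("click", "go")` resolves to concrete coordinates only inside `LabelGrounder`, so a stricter grounder or a more exploratory planner could replace either stand-in independently.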

This finding generalizes a pattern visible across the GUI agent literature (see "Can structured interfaces help language models control GUIs better?" for Agent S's ACI, and "Why do vision-only GUI agents struggle with screen interpretation?" for OmniParser's screen parsing layer): the load-bearing design move is not a better single-pass policy but a clean factoring at the right joint. AutoGLM's second insight is that error recovery is crucial for robustness yet difficult to acquire offline, which motivates self-evolving online curriculum RL with weak-to-strong progressive training. That second insight depends on the first: the curriculum can target planning behaviors specifically because the interface has separated them from grounding behaviors.

The transferable claim: in any agent stack where two sub-capabilities have conflicting optimization requirements, the architecture must be factored before training, not the other way around. This is the same principle behind "Does separating planning from execution improve reasoning accuracy?"; only here the joint is between planning and grounding rather than planning and execution.

Brain-inspired three-phase factoring with a process+outcome reward (BTL-UI, https://arxiv.org/abs/2509.15566).

Blink-Think-Link refines the same factoring move by adding a third joint and a training mechanism. It decomposes GUI interaction into three biologically motivated phases: Blink (rapid detection of, and attention to, relevant screen regions, analogous to saccades), Think (higher-level reasoning and planning), and Link (generation of executable commands for precise motor control). Blink and Link map onto the perception-grounding split that AutoGLM and OmniParser isolate, while Think is the planning layer, so BTL is essentially this planning-grounding factoring with an explicit attention/perception stage carved off the front. Two technical additions matter: an automated Blink-data annotation pipeline, and BTL Reward, presented as the first rule-based reward that drives RL from both process and outcome rather than outcome alone. That lets training target the intermediate phases, the same need AutoGLM's curriculum addresses. The convergence of independent labs on "factor the GUI agent at its natural joints, then train each separately" is the durable pattern; BTL adds that the reward should also be factored across process and outcome, not just the architecture.
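To illustrate what a rule-based reward over both process and outcome could look like, here is a hedged sketch. It is not the BTL Reward formula: the `<blink>`, `<think>`, and `<link>` tag format, the box-hit outcome rule, and the `process_weight` value are all assumptions made up for this example. It only shows the general shape, in which one term checks that the intermediate phases were produced and another checks whether the final action landed.

```python
import re
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive

PHASES = ("blink", "think", "link")  # hypothetical tag names, one per phase


def process_reward(response: str) -> float:
    """Fraction of the three phases present as non-empty tagged sections."""
    found = 0
    for phase in PHASES:
        match = re.search(rf"<{phase}>(.*?)</{phase}>", response, flags=re.DOTALL)
        if match and match.group(1).strip():
            found += 1
    return found / len(PHASES)


def outcome_reward(click: Tuple[int, int], target: Box) -> float:
    """1.0 if the predicted click lands inside the target box, else 0.0."""
    x, y = click
    left, top, right, bottom = target
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def combined_reward(
    response: str,
    click: Tuple[int, int],
    target: Box,
    process_weight: float = 0.25,  # assumed weighting, not from the paper
) -> float:
    """Weighted sum of the process term and the outcome term."""
    process = process_reward(response)
    outcome = outcome_reward(click, target)
    return process_weight * process + (1.0 - process_weight) * outcome
```

Under these assumptions, a response that walks through all three phases but misses the target still earns the process share of the reward, which is what allows training signal to reach the intermediate phases instead of only the final click.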

Related concepts in this collection

Can structured interfaces help language models control GUIs better? Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.
exemplifies: Agent S is the ACI instantiation of AutoGLM's general factoring claim; same architectural move applied to a specific stack.
Why do vision-only GUI agents struggle with screen interpretation? Exploring whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether factoring these tasks improves reliability.
exemplifies: OmniParser is the perception-side instantiation of the factoring principle — when foundation models fail composite tasks, factor the perception sub-problem out.
Does separating planning from execution improve reasoning accuracy? Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.
extends: same architectural principle (factor before training when sub-tasks have conflicting requirements) applied to reasoning rather than GUI agents.
Do text-based GUI agents actually work in the real world? Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.
complicates: ShowUI argues text-only interfaces are architecturally limited; AutoGLM's intermediate interface combines text and vision precisely to avoid the text-only ceiling while preserving the planning-grounding factoring.
Does gradually tightening token budgets beat fixed budget training? Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
complements: AutoGLM's "weak-to-strong progressive training" is a curriculum-based RL pattern in agentic-rollout settings; matches the broader principle that curricula outperform fixed-budget RL.
