What does an intermediate interface between planning and grounding actually look like?
This explores what the 'intermediate interface' that sits between an agent's planning (deciding what to do) and grounding (acting on a real screen or environment) is actually made of — its concrete shape, not just the claim that one should exist.
The short answer the corpus keeps arriving at: the interface looks like a language-centric description of the environment, not a pixel buffer. Several independent systems converged on this. AutoGLM's work argues that planning and grounding have opposing optimization needs and pull against each other when crammed into one policy, so it inserts a deliberate seam between them (see: Why do planning and grounding pull against each other in agents?). Agent S, AutoGLM, and OmniParser all landed on the same factoring: a planning layer, a grounding layer, and a mediating Agent-Computer Interface in between (see: How should agents split planning from visual grounding?).
Concretely, that interface is often a *structured representation of the screen* rather than the raw screen. Agent S feeds the model two things: a visual input for understanding the scene, plus an image-augmented accessibility tree (essentially a labeled map of the interface elements) that the planner reasons over and the grounder resolves into actions. That dual input improved on the baseline by roughly 9% because each half got to optimize for its own job (see: Can structured interfaces help language models control GUIs better?). So the interface 'looks like' a textual, semantic inventory of what's on screen, sitting between high-level intent and low-level coordinates.
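To make the "inventory between intent and coordinates" idea concrete, here is a minimal Python sketch. It is not Agent S's actual data model: the `UIElement` and `PlanStep` schemas and the `describe_screen` and `ground` helpers are hypothetical names invented for illustration. The assumption is that the planner only ever sees the text rendering and refers to elements by id, while the grounder alone touches pixel coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """One entry in the textual inventory of the screen (hypothetical schema)."""
    element_id: int
    role: str                        # e.g. "button", "textbox"
    label: str                       # accessible name shown to the planner
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class PlanStep:
    """What the planner emits: an intent against element ids, never pixels."""
    action: str                      # e.g. "click", "type"
    element_id: int
    text: Optional[str] = None


def describe_screen(elements: List[UIElement]) -> str:
    """Render the inventory as plain text a language-model planner could read."""
    return "\n".join(f"[{e.element_id}] {e.role}: {e.label}" for e in elements)


def ground(step: PlanStep, elements: List[UIElement]) -> Dict[str, object]:
    """Resolve a plan step into a low-level action at the element's center."""
    by_id = {e.element_id: e for e in elements}
    left, top, right, bottom = by_id[step.element_id].bbox
    return {
        "action": step.action,
        "x": (left + right) // 2,
        "y": (top + bottom) // 2,
        "text": step.text,
    }
```

In this sketch the planner would receive the output of `describe_screen` and answer with something like `PlanStep("click", 2)`; only `ground` needs to know where element 2 actually sits on screen.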
The same shape shows up far outside GUIs, which is the more interesting discovery. In multi-step reasoning, splitting a 'decomposer' from a 'solver' produces a clean interface, namely the decomposition itself (a plan in language), and the decomposer's skill even transfers across domains while the solver's doesn't (see: Does separating planning from execution improve reasoning accuracy?). RLAD makes the interface an explicit *abstraction* generated before solving, which forces breadth-first exploration that a depth-only reasoning chain wouldn't attempt (see: Can abstractions guide exploration better than depth alone?). And ReAct's classic move, interleaving a reasoning trace with tool calls, is arguably the thinnest version of this interface: the verbal reasoning step *is* the plan, and each external action grounds it before the next thought, which is what stops hallucination from compounding (see: Can interleaving reasoning with real-world feedback prevent hallucination?).
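The interleaving pattern described for ReAct can be sketched as a small loop. This is an illustrative skeleton, not the ReAct authors' code: `react_loop`, the `"finish:"` action convention, and the toy `toy_think` and `toy_act` stand-ins (which replace a real language model and a real tool) are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)


def react_loop(
    think: Callable[[str, List[Step]], Tuple[str, str]],
    act: Callable[[str], str],
    task: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Step]]:
    """Alternate a verbal thought with a grounded action until 'finish:'."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, action = think(task, trace)  # plan, expressed in language
        if action.startswith("finish:"):
            return action[len("finish:"):].strip(), trace
        observation = act(action)             # ground the plan in the environment
        trace.append((thought, action, observation))
    return None, trace                        # step budget exhausted


# Toy stand-ins (hypothetical) for a language model and a lookup tool.
def toy_think(task: str, trace: List[Step]) -> Tuple[str, str]:
    if not trace:
        return "I should look this up.", "lookup: capital of France"
    return "The observation answers the question.", f"finish: {trace[-1][2]}"


def toy_act(action: str) -> str:
    facts = {"lookup: capital of France": "Paris"}
    return facts.get(action, "no result")


answer, trace = react_loop(toy_think, toy_act, "What is the capital of France?")
```

Each thought is conditioned on the observations collected so far, so a wrong guess has a chance to be corrected by the next observation instead of being carried forward.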
Two framings deepen the picture. Dual-process dialogue planning shows the interface can be a *switch*: a fast System-1 policy handles familiar cases and hands off to slow System-2 MCTS planning only when the model's own uncertainty spikes, so the boundary between planning and execution is itself dynamic, not fixed (see: Can dialogue planning balance fast responses with strategic depth?). And the grounding side isn't monolithic: 'grounding' decomposes into functional, social, and causal kinds (see: Does semantic grounding in language models come in degrees?), which means the interface a planner needs depends on *which* grounding it's reaching for.
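The uncertainty-gated switch can be sketched in a few lines of Python. The source note says only that switching is based on the model's own uncertainty estimates; using Shannon entropy of the fast policy's action distribution, the `threshold` value, and the names `entropy` and `choose_action` are assumptions made here for illustration. The slow planner is passed in as an opaque callable, standing in for MCTS.

```python
import math
from typing import Callable, Dict, Tuple


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of an action distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def choose_action(
    fast_policy: Callable[[str], Dict[str, float]],
    slow_planner: Callable[[str], str],
    state: str,
    threshold: float = 0.5,  # assumed cutoff, not taken from the source
) -> Tuple[str, str]:
    """Use the fast policy when it is confident, else fall back to planning."""
    probs = fast_policy(state)
    if entropy(probs) <= threshold:
        # Confident: take the fast policy's most likely action (System 1).
        return max(probs, key=probs.get), "system1"
    # Uncertain: pay for deliberate search (System 2), e.g. MCTS.
    return slow_planner(state), "system2"
```

A peaked distribution such as `{"greet": 0.95, "ask": 0.05}` stays on the fast path, while a uniform one over two actions has entropy `ln 2` (about 0.69) and triggers the planner under the assumed threshold.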
Worth knowing: the value of this seam may be less about reasoning power and more about *timing and interaction*. Test-time interaction scaling (more environment steps for exploration and replanning) turns out to be a separate axis from chain-of-thought depth (see: Does agent interaction time scale separately from reasoning depth?), and a parallel finding suggests RL post-training mostly teaches a model *when* to deploy reasoning it already has, rather than creating that reasoning (see: Does RL post-training create reasoning or just deploy it?). Read together, the intermediate interface looks like the place where an agent decides *when and how much* to plan before grounding: a routing and abstraction layer, expressed in language, that keeps two differently-shaped skills from corrupting each other.
Sources (10 notes)
AutoGLM's research shows planning and grounding have opposing optimization requirements that pull against each other when bundled in one policy. An intermediate interface that separates them lets each capability be developed and optimized independently while still composing into a complete agent.
Multiple independent systems (Agent S, AutoGLM, OmniParser) converged on factoring agent reasoning into a planning layer and a grounding layer, with a language-centric Agent-Computer Interface mediating between them due to their opposing optimization requirements.
Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On the interactive decision-making benchmarks ALFWorld and WebShop, it outperforms imitation and reinforcement learning baselines by 34% and 10% absolute success rate, respectively.
A framework combining a neural policy model (System 1) for familiar contexts with MCTS planning (System 2) for novel scenarios, switching based on the model's own uncertainty estimates, matches or exceeds pure MCTS performance while reducing computational cost.
Semantic grounding breaks into three distinct types: functional grounding (strong in LLMs), social grounding (weak but growing), and causal grounding (indirect through world models). LLMs score differently on each dimension, making the yes-or-no understanding question misleading.
Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.