SYNTHESIS NOTE
Agentic Systems and Tool Use

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

Synthesis note · 2026-05-03 · sourced from Tool Computer Use

Agent S's contribution is conceptual as much as engineering: it ports the Agent-Computer Interface (ACI) idea from coding agents to GUI agents. The motivating observation is that multimodal LLMs (MLLMs) handed raw screenshots are asked to do too much at once: they must identify icon semantics and predict the next action on a specific element simultaneously, and this combined step is where they are observed to fail.

The ACI is therefore designed to factor the problem. The dual-input strategy uses visual input for understanding environmental changes (what the screen looks like, what just happened) and pairs it with an image-augmented accessibility tree for precise element grounding (which element is which, and where). The action space is bounded to language-based primitives like click(element id). It is narrow enough that an MLLM can choose among the primitives reliably with common-sense reasoning, broad enough to compose into complex tasks, and set at a temporal resolution that lets the agent observe immediate task-relevant feedback after each action.
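
A minimal sketch of what such a bounded, language-based action space could look like. This is an illustration, not Agent S's actual code: the names Element, Observation, and ground, the two primitives click and type, and the dictionary command format are all assumptions made for the example. It shows the model emitting only a short textual primitive that refers to an element id, with the interface resolving that id to coordinates through the accessibility tree.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One node of the (hypothetical) tagged accessibility tree."""
    element_id: int
    role: str
    label: str
    center: tuple[int, int]  # screen coordinates (x, y)


@dataclass(frozen=True)
class Observation:
    """Dual input: pixels for understanding, tagged elements for grounding."""
    screenshot: bytes
    elements: dict[int, Element]


# Matches a single primitive call such as "click(2)" or "type(2, 'hi')".
_ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def _lookup(raw_id: str, obs: Observation) -> Element:
    """Resolve an element id from the model's text, rejecting unknown ids."""
    try:
        return obs.elements[int(raw_id)]
    except (ValueError, KeyError):
        raise ValueError(f"unknown element id: {raw_id!r}") from None


def ground(action_text: str, obs: Observation) -> dict:
    """Turn a language primitive into a concrete, coordinate-level command."""
    match = _ACTION_RE.match(action_text)
    if match is None:
        raise ValueError(f"not a primitive call: {action_text!r}")
    name, raw_args = match.group(1), match.group(2)
    args = [a.strip() for a in raw_args.split(",", 1)] if raw_args.strip() else []

    if name == "click" and len(args) == 1:
        x, y = _lookup(args[0], obs).center
        return {"op": "click", "x": x, "y": y}
    if name == "type" and len(args) == 2:
        x, y = _lookup(args[0], obs).center
        return {"op": "type", "x": x, "y": y, "text": args[1].strip("'\"")}
    # Anything outside the bounded action space is rejected, not guessed at.
    raise ValueError(f"unsupported primitive: {action_text!r}")


if __name__ == "__main__":
    obs = Observation(b"", {2: Element(2, "button", "Save", (40, 12))})
    print(ground("click(2)", obs))
```

In this sketch the model never produces pixel coordinates; an id that is not in the tree, or a primitive outside the allowed set, fails loudly, which gives the agent immediate feedback after each action.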

This factoring matches a deeper architectural choice: planning and grounding have distinct optimization requirements. Planning needs flexibility and error recovery. Grounding needs accuracy. Mixing them in a single end-to-end policy means each pulls against the other (see Why do planning and grounding pull against each other in agents?). The ACI's job is to be the abstraction layer that lets each concern be optimized separately.
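
The separation of concerns described above can be sketched as two independently replaceable components. Everything here is hypothetical and chosen for illustration: the Planner and Grounder type aliases, run_episode, and the toy scripted_planner and label_grounder are not from Agent S. The point is only that the planner works in words and sees failures as feedback, while the grounder's sole job is mapping a step to an element id accurately.

```python
from typing import Callable

# (goal, history) -> next step in plain language, or "DONE"
Planner = Callable[[str, list[str]], str]
# (step, {element_id: label}) -> element id; raises LookupError if no match
Grounder = Callable[[str, dict[int, str]], int]


def run_episode(goal: str, planner: Planner, grounder: Grounder,
                elements: dict[int, str], max_steps: int = 5) -> list[str]:
    """Alternate planning and grounding, recording the outcome of each step."""
    history: list[str] = []
    for _ in range(max_steps):
        step = planner(goal, history)
        if step == "DONE":
            break
        try:
            element_id = grounder(step, elements)
            history.append(f"{step} -> click({element_id})")
        except LookupError as err:
            # Grounding failures go back to the planner as feedback,
            # so error recovery stays a planning concern.
            history.append(f"{step} -> FAILED: {err}")
    return history


def scripted_planner(goal: str, history: list[str]) -> str:
    """Toy planner: a fixed script standing in for an MLLM."""
    steps = ["open File menu", "press Save"]
    return steps[len(history)] if len(history) < len(steps) else "DONE"


def label_grounder(step: str, elements: dict[int, str]) -> int:
    """Toy grounder: match an element label mentioned in the step text."""
    for element_id, label in elements.items():
        if label.lower() in step.lower():
            return element_id
    raise LookupError(f"no element matches {step!r}")


if __name__ == "__main__":
    tree = {1: "File", 2: "Save"}
    for line in run_episode("save the document", scripted_planner,
                            label_grounder, tree):
        print(line)
```

Because the two callables share only a narrow textual contract, either one can be swapped or tuned without touching the other, which is the property the ACI is meant to provide.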

Empirically the design pays off: a 9.37% absolute gain in success rate over the baseline on OSWorld, plus generalization across operating systems on WindowsAgentArena. The transferable claim is that "look at the screen and act" is the wrong primitive for GUI agents at the current model frontier. The right primitive is a structured interface that hands the model what each cognitive sub-task actually needs.

Original note title

GUI agents need a language-centric Agent-Computer Interface to separate planning from grounding — visual understanding plus accessibility tree plus bounded primitives beats raw screenshots