SYNTHESIS NOTE

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

Synthesis note · 2026-05-03 · sourced from Tool Computer Use

Agent S's contribution is conceptual as much as engineering: it ports the Agent-Computer Interface (ACI) idea from coding agents to GUI agents. The motivating observation is that MLLMs handed raw screenshots are asked to do too much at once: they must identify icon semantics and predict the next action on a specific element simultaneously, and that is where they are observed to fail.

The ACI is therefore designed to factor the problem. The dual-input strategy uses visual input for understanding environmental changes (what the screen looks like, what just happened) and pairs it with an image-augmented accessibility tree for precise element grounding (which element is which, and where). The action space is bounded to language-based primitives like click(element id). These primitives are narrow enough that an MLLM can pick among them reliably with common-sense reasoning, broad enough to compose into complex tasks, and fine-grained enough in time that the agent observes immediate task-relevant feedback after each action.
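
A bounded, element-id-based action space can be sketched in a few lines. This is an illustrative sketch only, not Agent S's implementation: the `Element` record, the `ground` function, the `type` primitive, and the event dictionary format are all hypothetical names chosen for the example. It assumes the accessibility tree has already been flattened into a mapping from integer ids to elements with bounding boxes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One tagged node of a (hypothetical) flattened accessibility tree."""
    element_id: int
    role: str
    name: str
    bbox: tuple  # (x, y, width, height) in screen pixels

    def center(self):
        x, y, w, h = self.bbox
        return (x + w // 2, y + h // 2)


# The whole action vocabulary. Anything that does not match is rejected,
# which is what keeps the action space bounded.
_PRIMITIVES = {
    "click": re.compile(r"click\((\d+)\)"),
    "type": re.compile(r'type\((\d+),\s*"([^"]*)"\)'),
}


def ground(action, tree):
    """Turn a language-level primitive into a low-level input event.

    The model only names an element id; pixel coordinates come from the tree.
    """
    text = action.strip()
    for verb, pattern in _PRIMITIVES.items():
        match = pattern.fullmatch(text)
        if match is None:
            continue
        element_id = int(match.group(1))
        if element_id not in tree:
            raise KeyError(f"no element with id {element_id}")
        x, y = tree[element_id].center()
        event = {"op": verb, "x": x, "y": y}
        if verb == "type":
            event["text"] = match.group(2)
        return event
    raise ValueError(f"not a supported primitive: {action!r}")


tree = {7: Element(7, "button", "Save", (100, 200, 80, 30))}
print(ground("click(7)", tree))  # {'op': 'click', 'x': 140, 'y': 215}
```

The point of the design is visible in `ground`: the model never emits coordinates, so a grounding error shows up as an unknown id or a rejected string rather than as a silent click on the wrong pixel.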

This factoring matches a deeper architectural choice: planning and grounding have distinct optimization requirements. Planning needs flexibility and error recovery. Grounding needs accuracy. Mixing them in a single end-to-end policy means each pulls against the other (see Why do planning and grounding pull against each other in agents?). The ACI's job is to be the abstraction layer that lets each concern be optimized separately.
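
The separation of planning from grounding can be illustrated with a toy control loop. This is a hedged sketch under stated assumptions, not the architecture of Agent S or any other system: `run_episode`, `toy_planner`, and `toy_grounder` are hypothetical stand-ins, the planner and grounder are plain functions rather than models, and the screen is reduced to a mapping from element ids to labels.

```python
def run_episode(task, planner, grounder, elements, max_steps=10):
    """Alternate between a planner and a grounder that share no internals.

    planner(task, history) returns the next step as text, or None when done.
    grounder(step, elements) returns an element id, or None if it cannot
    resolve the step against the current screen.
    """
    history = []
    actions = []
    for _ in range(max_steps):
        step = planner(task, history)
        if step is None:
            break
        element_id = grounder(step, elements)
        if element_id is None:
            # A grounding failure is reported back so the planner can recover.
            history.append(f"FAILED: {step}")
            continue
        actions.append(f"click({element_id})")
        history.append(f"DONE: {step}")
    return actions


def toy_planner(task, history):
    """Flexible side: tries 'Export' first, falls back to 'Save' on failure."""
    if any(entry.startswith("DONE") for entry in history):
        return None
    if "FAILED: Export" in history:
        return "Save"
    return "Export"


def toy_grounder(step, elements):
    """Accurate side: exact label match only, never guesses."""
    for element_id, label in elements.items():
        if label == step:
            return element_id
    return None


elements = {1: "Open", 2: "Save"}
print(run_episode("save the file", toy_planner, toy_grounder, elements))
# ['click(2)']
```

In this toy run the planner's first choice has no matching element, the grounder declines to guess, and the planner recovers with a second step. Each side can be swapped out or tuned without touching the other, which is the property the interface layer is meant to provide.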

Empirically the design pays off: a 9.37% absolute gain in success rate over the OSWorld baseline, plus generalization across operating systems on WindowsAgentArena. The transferable claim is that "look at the screen and act" is the wrong primitive for GUI agents at the current model frontier. The right primitive is a structured interface that hands the model what each cognitive sub-task actually needs.

Related concepts in this collection 5

Why do planning and grounding pull against each other in agents? Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?
extends: Agent S's ACI is the concrete instantiation of the planning-grounding factoring AutoGLM generalizes; same architectural claim, narrower stack.

Why do vision-only GUI agents struggle with screen interpretation? Exploring whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether factoring these tasks improves reliability.
complements: OmniParser factors perception (parse first, then act); Agent S factors interface (vision + accessibility tree + bounded primitives). Both arrive at structured intermediate representations from different angles.

How can GUI agents adapt when software constantly changes? Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.
complements: same paper, memory-side companion. ACI factors perception and action; the memory architecture factors abstract task patterns from concrete subtask traces.

Do text-based GUI agents actually work in the real world? Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.
tension with: ShowUI argues accessibility-tree-based agents have an architectural ceiling because real users see visually; Agent S includes accessibility tree as a grounding aid alongside vision, hedging the trade-off rather than rejecting accessibility data.

Can API-first agents outperform UI-based agent interaction? This explores whether directing agents to use APIs instead of navigating UIs reduces task completion time and errors. The question matters because current LLM agents struggle with sequential UI steps that multiply latency and hallucination risk.
complements: API-first agents bypass the GUI-grounding problem entirely; ACI is the fallback architecture for when APIs aren't available.
