What role does visual perception play alongside accessibility tree information?
This explores how GUI agents combine raw visual perception (what's on screen as pixels) with accessibility trees (the structured, machine-readable element data underneath) — and whether the two are redundant or complementary.
The corpus suggests that neither alone is enough, and that the interesting work is in *dividing labor* between them. The clearest statement comes from Agent S, whose dual-input design uses visual input for understanding the environment and image-augmented accessibility trees for grounding, that is, pinning an intended action to a specific clickable element. Splitting planning and grounding into separate optimization paths, rather than forcing one model to do everything end-to-end, produced a 9.37% improvement over baseline (see "Can structured interfaces help language models control GUIs better?"). The accessibility tree isn't a backup for weak vision; it's a different *kind* of signal: symbolic and exact where vision is rich but ambiguous.
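The planning/grounding split described above can be sketched in a few lines. This is a minimal illustration, not Agent S's implementation: the `Element` record, the `ground` word-overlap matcher, and the hard-coded `plan` dictionary (standing in for what a vision-capable planner would emit) are all hypothetical names introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One node of a hypothetical, already-flattened accessibility tree."""
    element_id: int
    role: str
    name: str
    bbox: tuple[int, int, int, int]  # left, top, width, height in pixels


def ground(target_role: str, target_name: str, tree: list[Element]) -> Element | None:
    """Pick the element with the matching role whose name shares the most words
    with the target. Returns None when nothing overlaps at all."""
    wanted = set(target_name.lower().split())
    best, best_overlap = None, 0
    for el in tree:
        if el.role != target_role:
            continue
        overlap = len(wanted & set(el.name.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = el, overlap
    return best


def click_point(el: Element) -> tuple[int, int]:
    """Centre of the element's bounding box, used as the click coordinate."""
    left, top, width, height = el.bbox
    return (left + width // 2, top + height // 2)


tree = [
    Element(0, "button", "Cancel", (40, 300, 80, 30)),
    Element(1, "button", "Save draft", (140, 300, 100, 30)),
    Element(2, "textbox", "Subject", (40, 100, 400, 24)),
]

# Assumption: a planner looking at the screenshot produced this intent.
plan = {"action": "click", "role": "button", "name": "save draft"}

target = ground(plan["role"], plan["name"], tree)
```

The point of the sketch is the division of labor: the planner only has to say *what* it wants in words, and the exact coordinates come from the tree rather than from pixel-level prediction.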
Why split the work at all? Because vision-only agents buckle under a composite task. OmniParser showed that even GPT-4V fails when it has to simultaneously figure out what an icon *means* and predict what action to take from raw screenshots. Pre-parsing the screen into structured, described elements (essentially manufacturing the semantic layer that an accessibility tree would provide) let the model drop the perception burden and focus purely on deciding what to do (see "Why do vision-only GUI agents struggle with screen interpretation?"). So accessibility-tree-style structure earns its place precisely by removing a bottleneck that visual perception alone creates.
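One way to picture the pre-parsing step is as a serialization problem: parsed elements become a numbered text list, and the model answers with an ID instead of coordinates. The sketch below assumes a made-up element schema (`kind`, `description`, `bbox`) and a made-up reply format (`click [1]`); it is not OmniParser's actual output format, and the model call is replaced by a fixed string.

```python
from __future__ import annotations

import re


def render_elements(elements: list[dict]) -> str:
    """Turn parsed screen elements into a numbered list a text model can read."""
    lines = []
    for i, el in enumerate(elements):
        lines.append(f"[{i}] {el['kind']}: {el['description']}")
    return "\n".join(lines)


def parse_choice(reply: str, elements: list[dict]) -> dict:
    """Map the first bracketed ID in the model's reply back to its element."""
    match = re.search(r"\[(\d+)\]", reply)
    if match is None:
        raise ValueError("reply does not reference an element ID")
    index = int(match.group(1))
    if not 0 <= index < len(elements):
        raise ValueError(f"element ID {index} is out of range")
    return elements[index]


# Hypothetical parser output for a toolbar with two icons.
elements = [
    {"kind": "icon", "description": "gear, opens settings", "bbox": (10, 10, 24, 24)},
    {"kind": "icon", "description": "magnifier, opens search", "bbox": (44, 10, 24, 24)},
]

prompt_block = render_elements(elements)

# Assumption: stand-in for a model reply; no real model is called here.
reply = "click [1]"
chosen = parse_choice(reply, elements)
```

With this shape, the model never has to decide what the magnifier icon means or where it sits; both were settled before the action decision was made.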
Here's the part you might not expect: the bottleneck in visual perception usually isn't *reasoning*, it's *attention allocation*. Work on multimodal models found that piling on verbose chain-of-thought actually degrades fine-grained perception, because the real constraint is where the model looks, not how much it talks (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). That reframes the accessibility tree's role: it's a way to hand the model crisp, pre-localized targets so it doesn't have to spend scarce visual attention hunting for them. Vision tells you the scene; structure tells you where the actionable handles are.
The complementarity shows up in adjacent domains too, under different names. In robotics, visual similarity alone retrieves objects that look right but can't actually be acted on, so an affordance layer reranks candidates by what's physically executable, converting 'looks like a match' into 'can be grasped' (see "Can visual similarity alone guide robot object retrieval?"). That's the same move as the GUI case: a non-visual, action-grounded signal disciplines an otherwise-ungrounded perceptual one. And when the underlying tension is framed as vision and language competing for capacity, the resolution turns out to be architectural rather than inherent: give each modality its own capacity instead of forcing them to fight in shared parameters (see "Can we solve modality competition through architectural design?").
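The reranking idea can be illustrated with a toy example. The scoring rule here (drop candidates below an affordance threshold, then sort by affordance times similarity) is an assumption chosen for clarity, not AffordanceRAG's published method, and the candidate names and scores are invented.

```python
from __future__ import annotations


def rerank(candidates: list[dict], min_affordance: float = 0.5) -> list[dict]:
    """Discard candidates that are unlikely to be executable, then order the
    rest by a combined affordance-and-similarity score (highest first)."""
    feasible = [c for c in candidates if c["affordance"] >= min_affordance]
    return sorted(
        feasible,
        key=lambda c: c["affordance"] * c["similarity"],
        reverse=True,
    )


# Hypothetical retrieval results: the best visual match is out of reach.
candidates = [
    {"name": "mug on high shelf", "similarity": 0.95, "affordance": 0.10},
    {"name": "mug on table", "similarity": 0.80, "affordance": 0.90},
    {"name": "cup in sink", "similarity": 0.60, "affordance": 0.70},
]

ranked = rerank(candidates)
```

The top visual match is filtered out before ranking, which mirrors the GUI case: an element that looks right but is not exposed as actionable in the tree is not a useful target either.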
The through-line across all of these: visual perception and accessibility-tree information aren't rivals, and they aren't redundant. Vision carries open-ended environmental understanding; structured trees carry exact, executable grounding. The agents that work best are the ones that stop asking a single model to fuse both jobs and instead let each signal do what it's good at.
Sources (5 notes)
Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.
OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
AffordanceRAG reranks visually retrieved objects by affordance scores, ensuring the robot selects only physically executable actions. This architectural shift from similarity to task-grounded ranking prevents plans that fail at execution time.
Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.