Can text-based and vision-based screen understanding achieve similar performance?

This note explores whether reading a screen as structured text (accessibility trees, bounding boxes, parsed element descriptions) can match reading it as raw pixels. The corpus's blunt answer: pure vision usually *under*performs, and the fix is almost always to inject text-like structure rather than to make the vision model stronger. The clearest case is the note "Why do vision-only GUI agents struggle with screen interpretation?": GPT-4V falls down when it has to identify what an icon *means* and decide what to *do* from a screenshot at the same time. Pre-parsing the screen into named, described elements removes the composite-task bottleneck and leaves the model with action prediction alone. So the question isn't quite "text vs. vision": vision alone forces two jobs into one, and text structure splits them apart.
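To make "pre-parsing the screen into named, described elements" concrete, here is a minimal sketch of how parsed elements might be serialized into text for a language model. The `ScreenElement` fields, the line format, and the prompt wording are all illustrative assumptions, not OmniParser's actual output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScreenElement:
    """One parsed UI element (hypothetical schema, for illustration only)."""
    element_id: int
    kind: str                          # e.g. "button", "icon", "text"
    description: str                   # what the element means, from a parser
    bbox: Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels


def serialize_screen(elements: List[ScreenElement]) -> str:
    """Render parsed elements as one text line each."""
    lines = []
    for el in elements:
        left, top, right, bottom = el.bbox
        lines.append(
            f"[{el.element_id}] {el.kind}: {el.description} "
            f"@ ({left},{top},{right},{bottom})"
        )
    return "\n".join(lines)


def build_action_prompt(task: str, elements: List[ScreenElement]) -> str:
    """Combine the task and the serialized screen into a single prompt."""
    return (
        f"Task: {task}\n"
        f"Elements on screen:\n{serialize_screen(elements)}\n"
        "Answer with the id of the element to act on."
    )
```

With the icon meanings already written out, the model only has to pick an id, which is the "one job instead of two" split described above.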

The winning recipe across the corpus is hybrid, not either/or. The Agent S design in "Can structured interfaces help language models control GUIs better?" keeps the screenshot for understanding the environment but adds an image-augmented accessibility tree for grounding, so planning and grounding follow separate optimization paths instead of one end-to-end prediction. That factoring, not raw perceptual power, is what buys the gains. "Can one model understand both UIs and infographics equally well?" makes the same bet differently: ScreenAI is pretrained to emit text annotations of UI element types and locations, and a modest 5B-parameter model reaches state of the art on multiple benchmarks. In both, the text representation is what makes the vision usable.
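The planning/grounding split can be sketched as two separate functions: a planner that only decides *which* element to act on, and a grounder that turns that choice into coordinates using the accessibility tree. This is a simplified illustration under assumed names (`ground`, `act`, a dict-based tree), not Agent S's implementation; the planner here is any callable, standing in for a model call.

```python
from typing import Callable, Dict, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def ground(element_id: int, tree: Dict[int, BBox]) -> Tuple[int, int]:
    """Grounding: resolve an element id to the click point at its box center."""
    left, top, right, bottom = tree[element_id]
    return ((left + right) // 2, (top + bottom) // 2)


def act(
    task: str,
    screen_text: str,
    tree: Dict[int, BBox],
    planner: Callable[[str, str], int],
) -> Tuple[int, int]:
    """Planning picks the element; grounding supplies the coordinates."""
    element_id = planner(task, screen_text)
    return ground(element_id, tree)
```

Because the planner never emits pixel coordinates, a grounding error and a planning error can be diagnosed and improved independently.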

The striking result is that text can sometimes win outright on cost. "Can bounding boxes replace image encoders for document understanding?" shows that bounding-box coordinates plus disentangled attention capture text-and-spatial layout *without any image encoder*, reaching comparable understanding at substantially lower compute than pixel-based multimodal models. For document-like screens, the pixels turn out to carry little that text plus coordinates didn't already encode. "Can describing images in text improve zero-shot recognition?" pushes the same idea: describe an image in natural language, then retrieve against a text index, and the language bridge beats direct visual embedding similarity. Text isn't just a crutch for weak vision; it can be the better channel.
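The describe-then-retrieve idea can be illustrated with a toy text index. In this sketch the description string stands in for the output of a vision-language model, and bag-of-words cosine similarity stands in for whatever text retriever a real system would use; both are assumptions for illustration, not SignRAG's pipeline.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase, whitespace-split word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    description: str, index: Dict[str, str], top_k: int = 1
) -> List[Tuple[str, float]]:
    """Rank indexed reference descriptions against a description of the image."""
    query = bag_of_words(description)
    scored = [(name, cosine(query, bag_of_words(text))) for name, text in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Nothing in the retrieval step sees pixels: once the image has been described, recognition reduces to a text lookup, so adding a new reference class means adding a line of text rather than retraining a model.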

There's a deeper reason to distrust raw vision parity. "Does multimodal zero-shot performance actually generalize or interpolate?" finds that multimodal zero-shot ability tracks how often a concept appeared in pretraining rather than genuine generalization: each linear gain in performance needs *exponentially* more data. So a vision model that looks competent on common UI patterns may quietly fail on rare ones, while a structured text representation degrades more predictably. That reframes "similar performance": vision can match text on the head of the distribution and fall off a cliff in the tail.
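"Exponentially more data for linear gains" is what a log-linear relationship between concept frequency and accuracy implies. The sketch below assumes the form accuracy = intercept + slope * log10(frequency); the functional form's constants and the function names are illustrative assumptions, not fitted values from the cited study.

```python
import math


def predicted_accuracy(frequency: float, intercept: float, slope: float) -> float:
    """Assumed log-linear trend: accuracy grows with log10 of concept frequency."""
    return intercept + slope * math.log10(frequency)


def data_multiplier(gain: float, slope: float) -> float:
    """Factor by which concept frequency must grow to add `gain` accuracy.

    Follows from slope * log10(m) = gain, so m = 10 ** (gain / slope).
    """
    return 10 ** (gain / slope)
```

Under this assumption, with a slope of 0.1 accuracy per tenfold increase in frequency, every additional 0.1 of accuracy costs ten times the data, which is why rare UI patterns in the tail stay hard long after common ones look solved.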

The honest synthesis: text-based and vision-based screen understanding *can* reach similar performance, but rarely by competing head-on. The reliable systems convert pixels into text-like structure early, then let a language model reason over that: vision for perceiving, text for grounding and planning. Where the screen is layout-heavy and text-dense, text representations can even win on accuracy and cost both. The frontier worth watching is whether models can learn that structure from raw streams directly: "Can unlabeled UI video teach models what users intend?" describes UI-JEPA, which learns task-aware representations from unlabeled screen recordings, hinting at a future where the text-vs-vision framing dissolves into representations learned from the pixels themselves.

Sources 7 notes

Why do vision-only GUI agents struggle with screen interpretation?

OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.

Can structured interfaces help language models control GUIs better?

Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement over the baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.

Can one model understand both UIs and infographics equally well?

ScreenAI unifies UIs and infographics under one schema, using screen-annotation pretraining to identify UI element types and locations. These annotations auto-generate QA and navigation data, enabling a 5B-parameter model to achieve state-of-the-art performance on multiple benchmarks.

Can bounding boxes replace image encoders for document understanding?

DocLLM shows that bounding-box spatial information combined with decomposed transformer attention can capture text-spatial alignment in documents without pixel-based visual encoding. Pretraining on text-infilling objectives suited to irregular layouts achieves this at substantially lower computational cost than multimodal LLMs using image encoders.
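One way to write such a decomposition, as a sketch in assumed notation rather than a quotation of the paper: give each token a text query/key pair (superscript t) and a separate spatial query/key pair (superscript s) projected from its bounding box, then sum the four cross terms with scalar weights λ.

```latex
A_{i,j} = Q^{t}_{i} {K^{t}_{j}}^{\top}
        + \lambda_{t,s}\, Q^{t}_{i} {K^{s}_{j}}^{\top}
        + \lambda_{s,t}\, Q^{s}_{i} {K^{t}_{j}}^{\top}
        + \lambda_{s,s}\, Q^{s}_{i} {K^{s}_{j}}^{\top}
```

The first term is ordinary text-to-text attention; the remaining three let layout influence the score without any pixel features entering the model.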

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.

Does multimodal zero-shot performance actually generalize or interpolate?

Across 34 models and 5 datasets, multimodal models require exponentially more pretraining data for linear performance gains on downstream tasks. Performance correlates with how often test concepts appeared during pretraining, not genuine generalization ability.

Can unlabeled UI video teach models what users intend?

UI-JEPA applies JEPA-style predictive masking to screen recordings, learning task-aware temporal representations that an LLM decoder can use to infer intent with minimal paired data. This trades the bottleneck of labeled video for abundant unlabeled streams.
