Should GUI perception happen inside or outside the foundation model?
This explores a design fork in GUI agents: should the model 'see' and interpret the screen on its own, with perception built into the foundation model end-to-end, or should a separate system parse the screen into structured elements before the model ever reasons about what to do? The corpus leans, fairly consistently, toward moving perception *outside*, but for a reason more interesting than 'models are bad at vision.'
The clearest evidence is about overload. When a vision-language model must identify what every icon means *and* decide the next action from a raw screenshot at the same time, it fails. OmniParser shows GPT-4V failing precisely because it is juggling two jobs, whereas pre-parsing the screen into labeled semantic elements lets the model spend its whole budget on action prediction (Why do vision-only GUI agents struggle with screen interpretation?). Agent S reaches the same conclusion from the architecture side: feeding the model an accessibility tree alongside the pixels, and splitting planning from grounding into separate optimization paths, beat forcing one end-to-end prediction (Can structured interfaces help language models control GUIs better?). The shared lesson is that perception and planning are different optimization targets, and cramming them into one forward pass makes both worse.
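The split described above can be sketched in a few lines. This is a minimal illustration, not OmniParser's or Agent S's actual interface: the `UIElement` schema, the prompt wording, and the helper names are all hypothetical. It shows the perception output arriving as labeled elements, the model's job reduced to choosing an element id, and grounding handled by a separate lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """One parsed screen element (hypothetical schema, for illustration only)."""
    element_id: int
    role: str                         # e.g. "button", "textbox"
    description: str                  # semantic label produced by the parser
    bbox: tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def elements_to_prompt(task: str, elements: list[UIElement]) -> str:
    """Serialize parsed elements so the model only has to pick an action."""
    lines = [f"Task: {task}", "Elements:"]
    for el in elements:
        lines.append(f"[{el.element_id}] {el.role}: {el.description}")
    lines.append("Answer with the id of the element to click.")
    return "\n".join(lines)


def click_point(elements: list[UIElement], chosen_id: int) -> tuple[int, int]:
    """Grounding step: map the chosen id back to the center of its bounding box."""
    by_id = {el.element_id: el for el in elements}
    left, top, right, bottom = by_id[chosen_id].bbox
    return ((left + right) // 2, (top + bottom) // 2)


# Assumed parser output for a toy search screen.
screen = [
    UIElement(0, "textbox", "search field", (10, 10, 210, 40)),
    UIElement(1, "button", "submit search", (220, 10, 300, 40)),
]
prompt = elements_to_prompt("Search for 'weather'", screen)
```

In this arrangement the model never has to emit pixel coordinates; if it answers `1`, `click_point(screen, 1)` resolves the click location from the parser's bounding box.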
There's a deeper reason this isn't just an engineering convenience. Work on multimodal reasoning finds that the real bottleneck in perception tasks is *visual attention allocation*, not how much the model verbalizes, and that piling on long chain-of-thought rationales actually degrades fine-grained perception because it optimizes the wrong policy target (Does verbose chain-of-thought actually help multimodal perception tasks?). So 'do more reasoning inside the model' doesn't fix perception; it can hurt it. That's a strong argument for keeping perception as a crisp, dedicated step rather than something the language-reasoning machinery tries to absorb.
But the corpus also holds the counter-case: the 'inside vs. outside' split may be a symptom of today's architectures, not a law. Research on modality competition argues that vision and language aren't fundamentally incompatible; they fight because dense models allocate fixed capacity and force one representation to crowd out the other. A Mixture-of-Experts design that allocates capacity per token lets both coexist (Can we solve modality competition through architectural design?). Read against the GUI papers, this suggests the externalize-perception trend is partly a workaround for an architectural bottleneck that could, in principle, be solved internally. And structure helps even when perception stays inside: breaking visual reasoning into explicit cognitive stages outperforms flat chain-of-thought on social benchmarks (Can breaking down visual reasoning into three stages improve model performance?), hinting that the win is *structured* perception, wherever it physically lives.
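To make 'allocates capacity per token' concrete, here is a toy sketch of top-1 expert routing in plain Python. Top-1 softmax gating is an assumption chosen because it is a common Mixture-of-Experts formulation; the cited work's actual routing scheme is not described here, and the gate weights and token vectors below are hand-picked for illustration rather than learned.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token: list[float], gate_weights: list[list[float]]) -> tuple[int, float]:
    """Return (expert index, gate probability) for a single token.

    gate_weights holds one row per expert; each expert's score is the
    dot product of its row with the token vector.
    """
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hand-picked gate (assumption): expert 0 responds to feature 0, expert 1 to feature 1.
GATE = [[1.0, 0.0], [0.0, 1.0]]
tokens = {"image_patch": [2.0, 0.0], "text_token": [0.0, 2.0]}
assignments = {name: route_top1(vec, GATE)[0] for name, vec in tokens.items()}
```

Each token is routed independently, so an image-like token and a text-like token can land on different experts instead of sharing one fixed block of dense capacity.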
The most useful reframing comes from agent design more broadly. ReAct shows that interleaving reasoning with external feedback, querying the environment at each step rather than predicting blind, prevents errors from compounding (Can interleaving reasoning with real-world feedback prevent hallucination?). A parsed GUI element tree is exactly this kind of external grounding signal. So the honest answer the corpus points to is this: perception is best handled *outside* today, because end-to-end pixels overload the model and text-reasoning doesn't fix vision. But the boundary is architectural, not fundamental, and the durable principle is that perception should be a *separate, structured, grounded* step, whether or not it eventually moves back inside a better-designed model.
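The interleaving idea can be sketched as a loop that re-reads the parsed screen before every decision. Everything here is hypothetical scaffolding rather than ReAct's implementation: `ToyLoginApp` is an invented two-screen environment, and `rule_policy` is a hard-coded stand-in for the model.

```python
from typing import Callable

Observation = list[str]                      # parsed element labels for the current screen
Policy = Callable[[str, Observation], str]   # (goal, observation) -> chosen element or "DONE"


class ToyLoginApp:
    """Hypothetical two-screen environment used only for illustration."""

    def __init__(self) -> None:
        self.screen = "login"

    def observe(self) -> Observation:
        if self.screen == "login":
            return ["button: sign in"]
        return ["label: welcome", "button: log out"]

    def act(self, element: str) -> None:
        if self.screen == "login" and element == "button: sign in":
            self.screen = "home"


def rule_policy(goal: str, observation: Observation) -> str:
    """Stand-in for the model: chooses from what is on screen right now."""
    if "label: welcome" in observation:
        return "DONE"
    return observation[0]


def run_episode(env: ToyLoginApp, policy: Policy, goal: str, max_steps: int = 5) -> list[str]:
    """Alternate observing and acting so each decision uses fresh parsed state."""
    trace = []
    for _ in range(max_steps):
        observation = env.observe()          # external grounding at every step
        choice = policy(goal, observation)   # reasoning over the parsed state
        trace.append(choice)
        if choice == "DONE":
            break
        env.act(choice)
    return trace


trace = run_episode(ToyLoginApp(), rule_policy, "reach the home screen")
```

Because the policy only ever sees the current parsed observation, a wrong guess about the screen cannot silently carry forward: the next `observe()` call replaces it with what is actually there.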
Sources (6 notes)
OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.
Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.
CoCoT structures VLM reasoning through embodied perception, embedded situation analysis, and norm-grounded interpretation, achieving an 8% improvement over flat CoT on social benchmarks. The gains suggest cognitive structure matters more than reasoning volume for social tasks.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.