BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
AI-driven automation of human-GUI interaction has progressed rapidly thanks to advances in multimodal large language models and reinforcement fine-tuning techniques, yet a fundamental challenge persists: the interaction logic of current agents deviates significantly from natural human-GUI communication patterns. To address this gap, we propose Blink–Think–Link (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The framework decomposes interactions into three biologically plausible phases: (1) Blink: rapid detection of and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think: higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link: generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation, an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward, the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome.
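To make the three-phase decomposition concrete, the following minimal Python sketch parses a model response into Blink, Think, and Link parts and scores it with a toy rule-based reward that mixes process and outcome signals. It is an illustration under stated assumptions, not the BTL-UI implementation: the `<blink>`, `<think>`, and `<link>` tags, the JSON payloads, the function names `parse_btl` and `toy_reward`, and the weights 0.2, 0.3, and 0.5 are all hypothetical and do not come from the text above.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical tag names: the text above does not specify an output format.
_PATTERN = re.compile(
    r"<blink>(?P<blink>.*?)</blink>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<link>(?P<link>.*?)</link>",
    re.DOTALL,
)


@dataclass
class BTLOutput:
    blink: list  # candidate screen regions, each [x1, y1, x2, y2]
    think: str   # free-form reasoning text
    link: dict   # executable action, e.g. {"action": "click", "point": [x, y]}


def parse_btl(text):
    """Return a BTLOutput, or None if any phase is missing or malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    try:
        blink = json.loads(match.group("blink"))
        link = json.loads(match.group("link"))
    except json.JSONDecodeError:
        return None
    if not isinstance(blink, list) or not isinstance(link, dict):
        return None
    return BTLOutput(blink=blink, think=match.group("think").strip(), link=link)


def _contains(box, point):
    """True if box is a 4-element list that encloses the 2-element point."""
    if not (isinstance(box, list) and len(box) == 4):
        return False
    if not (isinstance(point, list) and len(point) == 2):
        return False
    x1, y1, x2, y2 = box
    x, y = point
    return x1 <= x <= x2 and y1 <= y <= y2


def toy_reward(text, target_box, target_action):
    """Toy rule-based reward; the weights are illustrative assumptions."""
    out = parse_btl(text)
    if out is None:
        return 0.0
    reward = 0.2  # format term: all three phases are present and parseable
    # Process term: some blinked region covers the centre of the target element.
    center = [(target_box[0] + target_box[2]) / 2,
              (target_box[1] + target_box[3]) / 2]
    if any(_contains(region, center) for region in out.blink):
        reward += 0.3
    # Outcome term: correct action type and a point inside the target element.
    if (out.link.get("action") == target_action
            and _contains(target_box, out.link.get("point"))):
        reward += 0.5
    return reward


response = (
    "<blink>[[0, 0, 100, 50]]</blink>"
    "<think>The Submit button is in the top-left region.</think>"
    '<link>{"action": "click", "point": [40, 20]}</link>'
)
print(toy_reward(response, target_box=[30, 10, 60, 30], target_action="click"))
```

In this sketch a response can earn partial credit for attending to the right region even when the final action is wrong, which is one simple way a reward can be driven by process as well as outcome.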
Introduction. Automation of graphical user interface (GUI) interactions constitutes a pivotal milestone in developing genuinely intelligent digital assistants [1, 2, 3]. Recent breakthroughs in large vision-language models (VLMs) [4, 5] and reinforcement fine-tuning techniques have substantially improved agents’ capabilities in natural language command interpretation, visual element perception, and multi-step task execution through human-like reasoning [6, 7]. However, current mainstream systems mainly adopt one of two approaches. The first relies on supervised fine-tuning (SFT) to align model behavior with task objectives, but this method faces two major limitations: a strong dependence on large-scale expert-labeled data and limited generalization to out-of-distribution scenarios.
Discussion / Conclusion. We propose the BTL framework, a GUI interaction architecture inspired by the biological cognitive paradigm of Blink–Think–Link. This framework simulates the human closed loop of visual perception, cognitive decision-making, and action execution during GUI operations, overcoming the limitations of traditional outcome-driven reinforcement fine-tuning (RFT) approaches. Experimental results show that the BTL-UI agent, developed under this framework, achieves significant performance improvements across a variety of GUI interaction tasks. We believe that the BTL framework establishes a promising and generalizable paradigm for developing digital assistants that are more natural, efficient, and aligned with human cognition. It not only benefits human-GUI interaction but can also be extended to other human-computer interaction tasks, such as embodied intelligence.