How should GUI agents remember patterns across different software environments?

This explores how GUI agents should store and reuse what they learn so that knowledge carries across changing or unfamiliar software, rather than being relearned every time the interface shifts.

The corpus converges on a striking answer: don't bake patterns into the model's weights; externalize them into structured, layered memory the agent can read, recombine, and revise. The cleanest statement of the principle is that agent reliability comes from offloading three cognitive burdens (memory, skills, and protocols) into a surrounding harness rather than from a bigger model (see 'Where does agent reliability actually come from?'). For GUI work specifically, that externalization is what lets an agent survive a software update it has never seen.

The most direct answer to 'across different environments' is to stratify memory by abstraction level. Agent S keeps three tiers (online web knowledge, high-level narrative patterns, and detailed episodic subtask traces) so that when a button moves or a menu is redesigned, the abstract pattern still transfers even though the concrete clicks don't (see 'How can GUI agents adapt when software constantly changes?'). Agent Workflow Memory pushes the same idea harder: it extracts reusable sub-task routines at finer granularity than whole tasks and strips out the example-specific values, which is why its gains *grow* as the gap between training and test environments widens. It reports relative gains of 24.6% on Mind2Web and 51.1% on WebArena (see 'Can agents learn reusable sub-task routines from past experience?'). The lesson is counterintuitive but consistent: the more you abstract away the specific environment, the better the memory travels to a new one.
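To make "strips out the example-specific values" concrete, here is a minimal sketch in Python. It is not Agent Workflow Memory's actual implementation: the `Step` type, `abstract_workflow`, and `instantiate` are hypothetical names, and the substitution is plain string replacement, an assumption made only to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str      # e.g. "click" or "type"
    target: str      # semantic name of the UI element (hypothetical naming)
    value: str = ""  # text typed, if any


def abstract_workflow(steps, task_values):
    """Replace example-specific values with named placeholders.

    task_values maps a placeholder name to the concrete value seen in one episode.
    """
    abstracted = []
    for step in steps:
        value = step.value
        for name, concrete in task_values.items():
            if concrete:
                value = value.replace(concrete, "{" + name + "}")
        abstracted.append(Step(step.action, step.target, value))
    return abstracted


def instantiate(workflow, bindings):
    """Fill the placeholders of a stored routine with values for a new task."""
    return [Step(s.action, s.target, s.value.format(**bindings)) for s in workflow]


# One concrete episode recorded on a shopping site (illustrative data).
episode = [
    Step("click", "search box"),
    Step("type", "search box", "wireless mouse"),
    Step("click", "search button"),
]

# The stored routine no longer mentions "wireless mouse".
routine = abstract_workflow(episode, {"query": "wireless mouse"})

# The same routine is reused for a different query.
replayed = instantiate(routine, {"query": "usb hub"})
```

The design point is that the stored routine is the version with placeholders, so nothing tied to the original episode leaks into later tasks.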

There's a deeper reason weights are the wrong place to store this. Updating parameters risks catastrophic forgetting, so VOYAGER instead keeps executable skills in an embedding-indexed library and composes complex behaviors from simpler stored ones, learning continuously without overwriting what it knew (see 'Can agents learn new skills without forgetting old ones?'). AgentFly formalizes this as learning entirely through memory operations over case, subtask, and tool memory, with zero parameter updates, reaching 87.88% on GAIA validation (see 'Can agents learn continuously from experience without updating weights?'). Across environments, memory-as-library beats memory-as-weights because libraries are additive where weights are destructive.
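The "additive library" idea can be sketched as follows. This is an illustration under stated assumptions, not VOYAGER's code: `SkillLibrary` is a hypothetical class, and `embed` is a toy bag-of-words stand-in for a real embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Append-only store of executable skills, retrieved by description similarity."""

    def __init__(self):
        self._skills = []  # list of (description, embedding, callable)

    def add(self, description, fn):
        # Adding a skill never modifies or removes an existing one.
        self._skills.append((description, embed(description), fn))

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self._skills, key=lambda s: cosine(q, s[1]), reverse=True)
        return [(desc, fn) for desc, _, fn in ranked[:k]]

    def __len__(self):
        return len(self._skills)


lib = SkillLibrary()
lib.add("open the file menu", lambda: ["click File"])
lib.add("save the current document", lambda: ["click File", "click Save"])

# A more complex skill composed from a simpler stored one.
lib.add(
    "export document as pdf",
    lambda: lib.retrieve("open the file menu")[0][1]() + ["click Export", "choose PDF"],
)

best_description, best_skill = lib.retrieve("save document")[0]
export_actions = lib.retrieve("export document as pdf")[0][1]()
```

Because `add` only appends, learning the export skill leaves the two earlier skills untouched, which is the property the paragraph above contrasts with weight updates.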

But not all memory is the same, and the corpus is sharp on structure. RAISE shows that agent memory decomposes into four components, split between dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory), each with its own failure mode and update policy, so 'remember patterns' isn't one problem but several (see 'How should agent memory split across time scales?'). And memory left to grow unchecked degrades; DeepAgent folds interaction history into episodic, working, and tool memory schemas, reducing token overhead while preserving the ability to pause and reconsider strategy (see 'Can agents compress their own memory without losing critical details?'). So 'how should they remember' includes 'how should they forget well.'
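As a rough illustration of folding a growing history into three compact schemas, consider the sketch below. It is an assumption-laden toy, not DeepAgent's method: the event format, the `fold` function, and the rule-based summarization are all hypothetical, and a real system would likely summarize with a model rather than with counters.

```python
from collections import defaultdict


def fold(history, keep_recent=2):
    """Fold a list of event dicts into three compact memory schemas.

    Each event is {"subtask": str, "tool": str, "ok": bool, "note": str}.
    """
    episodic = []  # each distinct subtask once, in first-seen order
    seen = set()
    tool = defaultdict(lambda: {"ok": 0, "fail": 0})  # per-tool outcome counts

    for event in history:
        if event["subtask"] not in seen:
            seen.add(event["subtask"])
            episodic.append(event["subtask"])
        tool[event["tool"]]["ok" if event["ok"] else "fail"] += 1

    # Working memory keeps only the most recent notes verbatim.
    working = [e["note"] for e in history[-keep_recent:]] if keep_recent else []
    return {"episodic": episodic, "working": working, "tool": dict(tool)}


history = [
    {"subtask": "log in", "tool": "click", "ok": True, "note": "opened login form"},
    {"subtask": "log in", "tool": "type", "ok": True, "note": "entered credentials"},
    {"subtask": "find report", "tool": "click", "ok": False, "note": "menu item missing"},
    {"subtask": "find report", "tool": "search", "ok": True, "note": "found report via search"},
]

folded = fold(history)
```

The folded form drops most per-step detail but keeps what a later strategy change would need: which subtasks were attempted, which tools have been failing, and the last few observations.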

The non-obvious doorway: for GUI agents specifically, the *quality* of what gets stored depends on how the screen is perceived in the first place. OmniParser found that GPT-4V fails when forced to identify icon meanings and predict actions from raw screenshots at the same time; pre-parsing screenshots into structured semantic elements with descriptions lets the model focus on action prediction alone (see 'Why do vision-only GUI agents struggle with screen interpretation?'). That matters for cross-environment memory because a pattern stored as 'click pixel region' is brittle, while a pattern stored over named semantic elements transfers. If you want memory that survives a new interface, structure the perception before you structure the memory.
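The brittleness of a pixel-based pattern versus a semantic one can be shown with a small sketch. The `Element` type, the two layouts, and the exact-match lookup are assumptions for illustration; they do not reproduce OmniParser's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    label: str  # semantic description produced by a screen parser (assumed)
    x: int
    y: int


def resolve(label, screen):
    """Return the click point for a stored semantic label, or None if absent."""
    for element in screen:
        if element.label == label:
            return (element.x, element.y)
    return None


def element_at(point, screen):
    """Return the label of the element located exactly at a pixel point, or None."""
    for element in screen:
        if (element.x, element.y) == point:
            return element.label
    return None


# The same application before and after a redesign (illustrative coordinates).
old_layout = [Element("Save button", 40, 20), Element("Print button", 90, 20)]
new_layout = [Element("Print button", 40, 20), Element("Save button", 300, 480)]

pixel_pattern = (40, 20)           # memory stored as a screen position
semantic_pattern = "Save button"   # memory stored over a named element

hit_by_pixels = element_at(pixel_pattern, new_layout)
hit_by_semantics = resolve(semantic_pattern, new_layout)
```

After the redesign, the stored pixel position lands on the print button, while the semantic pattern still resolves to the save button at its new location.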

Sources (8 notes)

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

How can GUI agents adapt when software constantly changes?

Agent S uses three-tier planning combining online web knowledge, high-level narrative memory patterns, and detailed episodic subtask experience. This hierarchical approach lets agents generalize across software changes while maintaining concrete execution grounding.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Why do vision-only GUI agents struggle with screen interpretation?

OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.
