How do strategy-level abstractions differ from storing raw task workflows?

This explores the trade-off between memory that stores *generalized strategies* (reusable, abstracted routines) and memory that stores *concrete task recordings* (the literal click-by-click steps), and which one actually helps an agent reuse what it learned.

The corpus stages this as a genuine fight, not a settled question. On one side, Agent Workflow Memory argues that the win comes from *abstraction*: it extracts sub-task routines at finer granularity than whole tasks, strips out example-specific values (the particular URL, the particular form field), and then compounds those routines hierarchically. The reported result is a 24.6% relative gain on Mind2Web and 51.1% on WebArena, with gains that get *larger* as the gap between training and test tasks widens (see: Can agents learn reusable sub-task routines from past experience?). The whole point is that throwing away the specifics is what makes the memory transfer.
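The "strip out example-specific values" step can be pictured with a minimal sketch. It is not Agent Workflow Memory's actual induction procedure: the `Step` type, the `abstract_workflow` and `instantiate` helpers, and the example URLs and selectors are all hypothetical, and the mapping from concrete values to placeholder names is supplied by hand here rather than induced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str       # e.g. "goto", "type", "click"
    target: str       # URL or element selector the action applies to
    value: str = ""   # text entered, if any


def _substitute(text, mapping):
    # Replace longer keys first so a key that is a substring of another
    # does not pre-empt the longer match.
    for old in sorted(mapping, key=len, reverse=True):
        text = text.replace(old, mapping[old])
    return text


def abstract_workflow(steps, specifics):
    """Turn a raw recording into a template (hypothetical helper).

    `specifics` maps each example-specific value to a placeholder name.
    """
    to_slot = {concrete: "{" + name + "}" for concrete, name in specifics.items()}
    return [Step(s.action, _substitute(s.target, to_slot), _substitute(s.value, to_slot))
            for s in steps]


def instantiate(template, bindings):
    """Fill a template's placeholders with values for a new task."""
    to_value = {"{" + name + "}": value for name, value in bindings.items()}
    return [Step(s.action, _substitute(s.target, to_value), _substitute(s.value, to_value))
            for s in template]


# A raw recording: the literal steps from one past task (made-up example).
raw = [
    Step("goto", "https://shop.example.com"),
    Step("type", "#search-box", "wireless mouse"),
    Step("click", "#search-submit"),
]

# The abstracted routine keeps the structure and drops the specifics.
template = abstract_workflow(raw, {"https://shop.example.com": "site_url",
                                   "wireless mouse": "query"})

# The same routine reused on a different site with a different query.
replayed = instantiate(template, {"site_url": "https://books.example.org",
                                  "query": "used textbooks"})
```

The raw recording can only be replayed against the original site and query; the template can be rebound to new values, which is the sense in which discarding specifics helps transfer. Note that the selectors are left untouched in this sketch, so it still assumes the new site shares them.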

But the corpus also has a sharp dissent. PRAXIS finds the opposite for web agents: indexing procedures by concrete environment state and local action pairs, which keeps the click-by-click specifics, beats higher-level workflow abstractions, which it argues *lose* exactly the detail you need to act reliably (see: Does state-indexed memory outperform high-level workflow memory for web agents?). So the difference between strategy-level and raw-workflow memory isn't 'one is better'; it's a bet about how much your future tasks will resemble your past ones. Abstraction pays off when tasks differ; concrete state-indexed recall pays off when reliable execution in a familiar environment matters more than generalizing.
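For contrast, here is a rough sketch of what "indexing by environment state and local action" could look like. It is an illustration only, not PRAXIS's implementation: the `StateIndexedMemory` class, the exact-match lookup, and the `(page, element)` state key are all assumptions made for the example.

```python
from collections import defaultdict


class StateIndexedMemory:
    """Hypothetical store mapping an observed state to actions tried there."""

    def __init__(self):
        self._by_state = defaultdict(list)

    def record(self, state_key, action, succeeded):
        # Keep the concrete action exactly as it was taken in this state.
        self._by_state[state_key].append((action, succeeded))

    def recall(self, state_key):
        """Return the action that succeeded most often in this state, or None."""
        counts = defaultdict(int)
        for action, succeeded in self._by_state.get(state_key, []):
            if succeeded:
                counts[action] += 1
        if not counts:
            return None
        return max(counts, key=counts.get)


memory = StateIndexedMemory()
# Assumed state key: (page path, identifier of the focused element).
checkout_state = ("/cart", "promo-banner-open")
memory.record(checkout_state, "click #dismiss-banner", succeeded=True)
memory.record(checkout_state, "click #dismiss-banner", succeeded=True)
memory.record(checkout_state, "click #checkout", succeeded=False)

familiar = memory.recall(checkout_state)
unfamiliar = memory.recall(("/cart", "no-banner"))
```

In a state it has seen, the memory returns the exact action that worked; in a state it has not seen, it returns nothing. That is the bet described above: high reliability in a familiar environment, no help outside it.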

The reason both can be right shows up in a third note: when you separate the planner from the executor, *decomposition ability transfers across domains but solving ability does not* (see: Does separating planning from execution improve reasoning accuracy?). That's the cleanest explanation of the whole tension. Strategy-level abstractions capture the part of skill that generalizes (how to break a problem down); raw workflows capture the part that doesn't (the exact actions that worked here). Storing them at the wrong level means either over-generalizing a brittle plan or memorizing steps that won't replay.
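One way to picture storing each part of a skill at its own level is a two-tier memory: plans keyed by task type, concrete actions keyed by domain and subgoal. The sketch below is purely illustrative; the `PLANS` and `ACTIONS` tables, the `resolve` function, and the shop names are hypothetical and not taken from any of the cited systems.

```python
# Strategy level: how to break a task down. Not tied to any one site.
PLANS = {
    "purchase": ["find_item", "add_to_cart", "checkout"],
}

# Raw level: the exact actions that worked, tied to a specific domain.
ACTIONS = {
    ("shop_a", "find_item"): ["type #search", "click #go"],
    ("shop_a", "add_to_cart"): ["click #add"],
    ("shop_a", "checkout"): ["click #cart", "click #pay"],
}


def resolve(task_type, domain):
    """Pair each subgoal with stored actions, or None if it must be solved afresh."""
    return [(subgoal, ACTIONS.get((domain, subgoal))) for subgoal in PLANS[task_type]]


seen = resolve("purchase", "shop_a")    # plan and actions both available
unseen = resolve("purchase", "shop_b")  # plan transfers, actions do not
```

In the familiar domain every subgoal comes back with replayable actions. In the new domain the decomposition still applies, but each subgoal is marked `None`, meaning the executor has to work out the actions again.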

There's a further wrinkle worth knowing: abstractions don't just compress, they *change how an agent searches*. RLAD shows that spending test-time compute on generating diverse abstractions produces structured breadth-first exploration and prevents 'underthinking', the failure where a model commits to one path too early (see: Can abstractions guide exploration better than depth alone?). A raw stored workflow can't do that; it's a single rail. So a strategy abstraction is also a tool for considering alternatives, not just a smaller way to remember one.

And at the far end, FlowReasoner abandons stored workflows entirely: instead of reusing fixed task-level templates, it generates a fresh multi-agent architecture per query (see: Can AI systems design unique multi-agent workflows per individual query?). That reframes the original question: the spectrum runs from raw recorded workflows, to abstracted reusable strategies, to no stored workflow at all but a *strategy for producing one on demand*. The further along that spectrum you go, the more 'memory' stops being storage and starts being a generative skill.

Sources (5 notes)

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Does state-indexed memory outperform high-level workflow memory for web agents?

PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can AI systems design unique multi-agent workflows per individual query?

FlowReasoner demonstrates that meta-agents trained with reinforcement learning and external execution feedback can generate unique multi-agent architectures for each user query, optimizing across performance, complexity, and efficiency—moving beyond fixed task-level workflow templates.
