INQUIRING LINE

How does PRAXIS differ architecturally from Agent Workflow Memory and causal rule learning?

This explores how agents accumulate and reuse experience — and what architectural unit each approach extracts: Agent Workflow Memory stores reusable sub-task routines, while the corpus's adjacent work stores skills, episodic cases, or pruned reasoning trees instead.


This reads the question as being about the *unit of reuse* — what an agent extracts from past experience and how it stores and recomposes it. One caveat up front: the corpus doesn't contain a note named PRAXIS or one specifically on causal rule learning, so this synthesizes the architectural landscape they'd sit inside rather than naming all three head-to-head. What it does have is a sharp spread of design choices along exactly that axis.

Agent Workflow Memory is the clearest anchor: it extracts *sub-task routines* at finer granularity than whole tasks, strips out example-specific values to make them reusable, and compounds them hierarchically, and the gains grow as the gap between training and test situations widens (see "Can agents learn reusable sub-task routines from past experience?"). The architectural commitment is procedural: the reusable thing is a *how-to*, abstracted from the specifics. VOYAGER makes a sibling choice but stores executable *skills* in an embedding-indexed library and composes complex skills from simpler ones, which is what lets it learn continuously without the catastrophic forgetting that weight-update methods suffer (see "Can agents learn new skills without forgetting old ones?"). Both externalize procedure into a library rather than baking it into model weights; the difference is granularity and how composition happens.
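As a rough illustration of what "strip example-specific values and compound hierarchically" could look like, here is a minimal sketch. It is not Agent Workflow Memory's actual implementation: the `Routine` class, the `abstract_steps` helper, and the placeholder syntax are all hypothetical names chosen for this example, and the value-stripping is plain string replacement.

```python
from dataclasses import dataclass, field


@dataclass
class Routine:
    """Hypothetical reusable sub-task routine: templated steps plus nested routines."""
    name: str
    steps: list[str]
    subroutines: list["Routine"] = field(default_factory=list)

    def render(self, **values: str) -> list[str]:
        """Expand nested routines first, then fill this routine's placeholders."""
        expanded: list[str] = []
        for sub in self.subroutines:
            expanded.extend(sub.render(**values))
        expanded.extend(step.format(**values) for step in self.steps)
        return expanded


def abstract_steps(steps: list[str], example_values: dict[str, str]) -> list[str]:
    """Replace example-specific values with named placeholders (toy string matching)."""
    abstracted = []
    for step in steps:
        for slot, value in example_values.items():
            step = step.replace(value, "{" + slot + "}")
        abstracted.append(step)
    return abstracted


# One concrete past trajectory, with "red shoes" as the example-specific value.
trace = [
    "type 'red shoes' into search box",
    "click search",
    "open first result for 'red shoes'",
]
search = Routine("search_item", abstract_steps(trace, {"query": "red shoes"}))

# A larger routine that reuses the smaller one (hierarchical composition).
buy = Routine("buy_item", ["click add to cart", "pay with {card}"], subroutines=[search])

plan = buy.render(query="blue hat", card="visa")
```

After the last line, `plan` holds the three search steps instantiated for "blue hat" followed by the two purchase steps. The point of the sketch is the separation between the stored template and the values supplied at reuse time, which is the property the paragraph above attributes to sub-task routines.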

The interesting contrast is what *else* an agent could store. AgentFly keeps three memory modules (case, subtask, and tool) and treats the whole thing as a memory-augmented decision process, improving its policy entirely through memory operations with zero parameter updates (see "Can agents learn continuously from experience without updating weights?"). DeepAgent folds raw interaction history into episodic, working, and tool schemas to stay efficient under long horizons (see "Can agents compress their own memory without losing critical details?"). A 2025 survey argues these aren't really different memory *types* at all: it reframes agent memory along forms, functions, and dynamics, showing the familiar short-term/long-term split is an emergent temporal pattern rather than an architectural fact (see "Can three axes replace the short-term long-term memory split?"). That's the lens that makes the PRAXIS-vs-AWM-vs-rules question crisp: they differ in which *form* (routine, skill, case, rule) and which *function* (experiential vs procedural) they externalize.
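To make "improving its policy entirely through memory operations" concrete, the following is a toy sketch of a case memory whose behaviour changes only because cases are written to it. It is an assumption-laden illustration, not AgentFly's method: the `CaseMemory` class, the word-overlap similarity, and the "pick the highest-reward similar case" rule are all invented here for the example.

```python
from dataclasses import dataclass


@dataclass
class Case:
    task: str
    action: str
    reward: float


def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity: word-set overlap (a stand-in for learned retrieval)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class CaseMemory:
    """Hypothetical case bank: the 'policy' is whatever retrieval returns."""

    def __init__(self) -> None:
        self.cases: list[Case] = []

    def write(self, task: str, action: str, reward: float) -> None:
        self.cases.append(Case(task, action, reward))

    def read(self, task: str, k: int = 3) -> list[Case]:
        """Return the k stored cases most similar to the new task."""
        ranked = sorted(self.cases, key=lambda c: jaccard(task, c.task), reverse=True)
        return ranked[:k]

    def act(self, task: str, default: str) -> str:
        """Reuse the best-rewarded similar case; fall back when nothing worked before."""
        successes = [c for c in self.read(task) if c.reward > 0]
        if not successes:
            return default
        return max(successes, key=lambda c: c.reward).action


memory = CaseMemory()
before = memory.act("book a flight to Paris", default="explore")

# Experience is recorded as cases; no model weights are touched.
memory.write("book a flight to Rome", "use_flight_api", 1.0)
memory.write("book a flight to Oslo", "scrape_website", 0.0)
after = memory.act("book a flight to Paris", default="explore")
```

Here `before` is the fallback action and `after` is the action copied from the successful stored case, so the change in behaviour comes entirely from the two `write` calls.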

If 'causal rule learning' is the third leg, the corpus's nearest territory is the move to externalize *cognitive burden into structure* rather than rely on model scale: reliable agents push memory, skills, and protocols into a harness layer so the model stops re-solving the same problems (see "Where does agent reliability actually come from?"). There's also a thread that replaces stored routines with *control flow*. LLM Programs embed the model inside an explicit algorithm that hides step-irrelevant context (see "Can algorithms control LLM reasoning better than LLMs alone?"), and the Thread Inference Model dispenses with a separate memory store entirely by structuring reasoning as recursive subtask trees with KV-cache pruning, letting one model do internally what multi-agent systems do across components (see "Can recursive subtask trees overcome context window limits?").
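The idea of an explicit algorithm that hides step-irrelevant context can be sketched as ordinary code wrapped around model calls. This is a minimal illustration under stated assumptions, not the LLM Programs paper's code: `LLM` is a hypothetical "prompt in, text out" callable, `fake_llm` is a deterministic stand-in so the example runs without any model, and the two-step filter-then-answer task is chosen only for illustration.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def answer_with_evidence(question: str, documents: list[str], llm: LLM) -> str:
    """Control flow lives in Python; each model call sees only its own step's context."""
    # Step 1: judge each document in isolation (other documents are hidden).
    relevant = []
    for doc in documents:
        verdict = llm(f"Question: {question}\nDocument: {doc}\nRelevant? yes/no")
        if verdict.strip().lower().startswith("yes"):
            relevant.append(doc)

    # Step 2: answer from the filtered evidence only (rejected documents are hidden).
    evidence = "\n".join(relevant)
    return llm(f"Question: {question}\nEvidence:\n{evidence}\nAnswer:")


seen_prompts: list[str] = []


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to exercise the control flow."""
    seen_prompts.append(prompt)
    if "Relevant?" in prompt:
        document_part = prompt.split("Document:")[1]
        return "yes" if "capital" in document_part else "no"
    return "Paris"


docs = ["Paris is the capital of France.", "Bananas are yellow."]
result = answer_with_evidence("What is the capital of France?", docs, fake_llm)
```

The program makes three calls in total: one relevance check per document and one final answer. The final prompt in `seen_prompts` contains the first document but not the second, which is the information-hiding property in miniature.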

So the architectural fault line the question is pointing at is real and the corpus maps it well: do you reuse *procedures* (AWM, VOYAGER), *episodes/cases* (AgentFly, DeepAgent), *control structure* (LLM Programs, Thread Inference), or *abstracted rules*, and is that store external and editable, or compiled into weights? The takeaway worth having is the survey's claim that these look like distinct architectures but are better understood as different settings of form-and-function on a shared substrate, which means the 'difference' between approaches like these is often a choice of granularity and storage medium, not a deep architectural divide (see "Can three axes replace the short-term long-term memory split?").


Sources (8 notes)

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can three axes replace the short-term long-term memory split?

A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking agent architecture design choices. The question remains open: How do PRAXIS, Agent Workflow Memory, and causal rule learning differ in their unit of reuse, storage form, and composition strategy?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library synthesis identified these as variants of a shared substrate (form × function × storage), not distinct paradigms:
• Agent Workflow Memory extracts sub-task routines, strips example-specific values, and compounds them hierarchically; gains grow as train–test gap widens (2024-09).
• VOYAGER and SkillClaw store executable skills in embedding-indexed or collectively-evolved libraries, enabling composition and lifelong learning without catastrophic forgetting (~2025–2026).
• AgentFly, DeepAgent, and Thread Inference Model use case/episode memory, episodic schemas, or recursive subtask trees with KV-cache pruning — different forms, same externalization principle (2025–2026).
• A 2025 survey reframed agent memory as three axes (form, function, dynamics) rather than type buckets, suggesting short-term/long-term is emergent, not architectural (2512.13564).
• The shift to externalizing cognitive burden into memory, skills, and control flow rather than model scale is now the consensus framing (2604.08224, 2026-04).

Anchor papers (verify; mind their dates):
• arXiv:2409.07429 – Agent Workflow Memory (2024-09)
• arXiv:2512.13564 – Memory in the Age of AI Agents: Forms, Functions, Dynamics (2025-12)
• arXiv:2604.08224 – Externalization in LLM Agents: Unified Review (2026-04)
• arXiv:2605.12978 – Useful Memories Become Faulty When Continuously Updated (2026-05)

Your task:
(1) RE-TEST THE FORM-FUNCTION CLAIM. The 2025–2026 library argues 'architectural difference' dissolves into form choice (routine vs. skill vs. case vs. rule) and storage medium (external vs. compiled). Judge whether newer work on PRAXIS or causal rule induction confirms this or reveals a genuinely distinct design principle. If either stores *learned causal structure* rather than procedures/episodes, does that break the unifying framework?
(2) SURFACE CONTRADICTIONS. The 2026-05 paper warns that continuously updated memories become faulty. Does this undermine the externalization consensus? Find work from late 2025 onward that either defends memory freshness strategies or pivots back to weight-based learning for durability.
(3) PROPOSE two questions assuming the regime has moved: (a) If form–function–dynamics is the real decomposition, what novel composition *rule* (causal, probabilistic, or symbolic) could chain across memory modules that current systems don't attempt? (b) Do causal rules learned *inside* a memory module outperform rules learned *across* a skill library, and under what scale?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
