SYNTHESIS NOTE
Agentic Systems and Tool Use

Can agents adapt without pausing service to users?

Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This note explores whether fast behavioral updates and slow policy learning can coexist across different timescales.

Synthesis note · 2026-05-18 · sourced from Agents Multi Architecture

Deployed LLM agents face a fundamental tension. They must serve users continuously without interruption, yet their capabilities grow stale as the real-world task distribution drifts. Each of the three existing approaches addresses only part of the problem. Memory-based methods store raw trajectories but cannot extract transferable behavioral patterns. Skill-based methods compress experience into reusable instructions but treat the skill library as a static database that is never coordinated with weight optimization. RL-based methods update model weights but require service downtime during retraining.

MetaClaw (2603.17187) names the structural fix: two fundamentally different timescales of adaptation are naturally complementary, and existing systems address only one. Behavioral heuristics ("always verify a file path before reading," "confirm before destructive commands") can be distilled within seconds from a single failed conversation and injected immediately. Improving the model's underlying policy across diverse task types requires gradient-based optimization over many trajectories, on a timescale of minutes to hours. The complementarity is missed when systems pick one timescale.
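The fast timescale can be illustrated with a minimal sketch of skill injection. Everything here is an assumption made for illustration: `SkillLibrary`, `add`, and `render_system_prompt` are hypothetical names, not MetaClaw's API, and the distillation step (an LLM reading a failed conversation) is left out, so the skills are written by hand.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Hypothetical in-memory skill pool (illustrative, not MetaClaw's structure)."""

    skills: list[str] = field(default_factory=list)

    def add(self, skill: str) -> bool:
        """Add a skill if it is new; return True when the library changed."""
        normalized = skill.strip()
        if not normalized or normalized in self.skills:
            return False
        self.skills.append(normalized)
        return True

    def render_system_prompt(self, base_prompt: str) -> str:
        """Append the current skills to the base prompt as a bulleted list."""
        if not self.skills:
            return base_prompt
        bullets = "\n".join(f"- {s}" for s in self.skills)
        return f"{base_prompt}\n\nLearned skills:\n{bullets}"


# Skills take effect on the very next request: no weights change.
library = SkillLibrary()
library.add("Always verify a file path before reading.")
library.add("Confirm before destructive commands.")
prompt = library.render_system_prompt("You are a helpful agent.")
```

Because the update is only a change to the prompt text, it costs nothing in availability, which is what makes the seconds-scale loop possible.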

The architecture has two mutually reinforcing mechanisms. Skill-driven fast adaptation analyzes failure trajectories and synthesizes new skills via an LLM evolver. The new skills take effect immediately with zero service downtime, simply by being added to the system prompt or skill retrieval pool. Opportunistic policy optimization performs gradient-based LoRA fine-tuning using a process reward model, but only during user-inactive windows identified by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors configurable sleep hours, system keyboard inactivity, and Google Calendar occupancy. The agent never pauses serving; weight updates happen entirely during natural downtime.
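The idle-window gate can be sketched as follows. This is a minimal illustration, not the OMLS implementation: the names `IdleSignals`, `in_sleep_hours`, and `user_inactive` are hypothetical, the default thresholds are invented, and the note does not say how the three signals are combined. The sketch assumes any single signal is enough, with a `require_all` flag for a more conservative policy. The calendar lookup is assumed to happen elsewhere and arrive as a boolean.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class IdleSignals:
    now: datetime
    last_keyboard_event: datetime
    calendar_busy: bool  # assumed: True when the calendar shows the user occupied


def in_sleep_hours(now: datetime, start: time, end: time) -> bool:
    """True if now is within [start, end), including windows that cross midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def user_inactive(
    signals: IdleSignals,
    sleep_start: time = time(23, 0),              # illustrative default
    sleep_end: time = time(7, 0),                 # illustrative default
    min_idle: timedelta = timedelta(minutes=30),  # illustrative default
    require_all: bool = False,
) -> bool:
    """Decide whether a training window may open (combination rule is an assumption)."""
    checks = [
        in_sleep_hours(signals.now, sleep_start, sleep_end),
        signals.now - signals.last_keyboard_event >= min_idle,
        signals.calendar_busy,
    ]
    return all(checks) if require_all else any(checks)


night = datetime(2026, 5, 18, 2, 0)
afternoon = datetime(2026, 5, 18, 14, 0)
night_signals = IdleSignals(night, night - timedelta(minutes=5), False)
day_signals = IdleSignals(afternoon, afternoon - timedelta(minutes=1), False)
```

With these inputs, `user_inactive(night_signals)` is true because 02:00 falls in the sleep window, while `user_inactive(day_signals)` is false because no signal indicates absence. A real scheduler would also need to interrupt or checkpoint training when the user returns, which is not shown.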

The virtuous cycle is the key claim: a better policy produces more informative failures for skill synthesis, and richer skills yield higher-reward trajectories for policy optimization. The two mechanisms feed each other across timescales.

The under-noticed contribution is the stale reward contamination problem and its fix. Once skills have evolved, trajectories collected under the old skill context carry stale rewards that would contaminate gradient updates if reused. MetaClaw introduces skill generation versioning: support data (failure trajectories consumed by skill evolution) is strictly separated from query data (post-adaptation trajectories used for RL updates). This is a non-obvious design requirement only visible once you commit to the dual-timescale architecture.
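The separation can be sketched as a filter over a trajectory buffer. This is an illustrative reading of the idea, not MetaClaw's code: `Trajectory`, its fields, and `select_query_data` are hypothetical names. The sketch assumes each trajectory is stamped with the skill generation active when it was collected, and that trajectories consumed by skill evolution are flagged as support data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    trajectory_id: str
    skill_generation: int  # generation of the skill library active at collection time
    reward: float
    used_for_skill_evolution: bool = False  # True marks support data


def select_query_data(
    buffer: list[Trajectory], current_generation: int
) -> list[Trajectory]:
    """Keep only trajectories that are safe to use for a gradient update.

    Drops (a) trajectories from older skill generations, whose rewards are stale,
    and (b) support data already consumed by skill evolution.
    """
    return [
        t
        for t in buffer
        if t.skill_generation == current_generation
        and not t.used_for_skill_evolution
    ]


buffer = [
    Trajectory("a", skill_generation=1, reward=0.2),   # stale generation
    Trajectory("b", skill_generation=2, reward=0.1,
               used_for_skill_evolution=True),          # support data
    Trajectory("c", skill_generation=2, reward=0.9),   # valid query data
]
query = select_query_data(buffer, current_generation=2)
```

Here only trajectory `c` survives: `a` was collected under an older skill context, and `b` has already been consumed to produce new skills.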

The deployment context, a single agent on OpenClaw connected to 20+ messaging channels, clarifies why the no-downtime constraint matters: the same agent must remain available across user time zones and conversational habits. Idle-window detection turns the constraint into an opportunity.

Original note title

continual agent adaptation requires two complementary timescales — fast skill injection from failures plus slow gradient updates during user-inactive windows