INQUIRING LINE

Can tool adaptation work without freezing the agent in the loop?

This explores whether an agent's tools and skills can keep evolving while the agent stays live in its loop — rather than being paused, retrained, or held frozen while adaptation happens elsewhere.


The question can be read two ways, and the corpus answers both: can tools change while the agent keeps running, and does keeping the agent's weights frozen actually block adaptation? The cleanest map of the territory is a 2x2 that splits adaptation by what you optimize (the agent vs. its tools) and what feedback you use (execution signals vs. final output) [How do agentic AI systems decompose into adaptation paradigms?]. That framing matters here because it shows 'adapt the agent' and 'adapt the tools' are separable targets: you don't have to touch the agent to get better behavior.
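The 2x2 can be written down as a small data structure. The sketch below is illustrative only: `Target`, `Feedback`, `Paradigm`, and `leaves_agent_untouched` are names invented for this note, not terminology from the taxonomy itself.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Target(Enum):
    AGENT = "agent"  # optimize the agent itself
    TOOL = "tool"    # optimize the tools, skills, or memory around the agent


class Feedback(Enum):
    EXECUTION = "execution"  # signals from running a tool or step
    OUTPUT = "output"        # quality of the final output


@dataclass(frozen=True)
class Paradigm:
    """One cell of the 2x2 (hypothetical representation)."""
    target: Target
    feedback: Feedback

    @property
    def leaves_agent_untouched(self) -> bool:
        # Tool-side adaptation does not modify the agent.
        return self.target is Target.TOOL


# Enumerate all four cells of the 2x2.
PARADIGMS = [Paradigm(t, f) for t, f in product(Target, Feedback)]

# The two cells where behavior can improve without touching the agent.
tool_side = [p for p in PARADIGMS if p.leaves_agent_untouched]
```

Half of the grid sits on the tool side, which is the region the rest of this line of inquiry explores.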

The strongest 'yes' comes from work that deliberately freezes the executor and moves all the learning into the tool layer. SkillOS trains a separate curator that reshapes the skill repository while the executing agent stays fixed, and the curator's improvements transfer across different executor backbones, which suggests the gains live in the skill layer rather than in the frozen agent [Can a separate trained curator improve skill libraries better than frozen agents?]. VOYAGER makes the same bet from the opposite direction: it stores skills in an external, composable library so the agent learns continuously without weight updates, and so avoids the catastrophic forgetting associated with gradient-based learning [Can agents learn new skills without forgetting old ones?]. Several memory-centric results push further, showing frozen models improve purely through the *shape* of what's stored: causal-form memory that preserves applicability conditions beats generic reflection by 23 points on repeated trials and gains 4-17 points when transferring to new environments [Can frozen language models continually improve through memory structure alone?], and extracting natural-language rules into reusable skills lifts a frozen GPT-4.1 from 11.1% to 16.5% on CL-bench without any retraining [Can frozen models learn better by extracting context into skills?].
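The "frozen executor, evolving library" pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not VOYAGER's or SkillOS's implementation: skills are plain Python callables, retrieval uses keyword overlap as a stand-in for an embedding index, and `SkillLibrary`, `compose`, `retrieve`, and `frozen_executor` are hypothetical names.

```python
from typing import Callable, Dict, Optional

Skill = Callable[[int], int]


class SkillLibrary:
    """External store of named skills; it grows while the executor stays fixed."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}
        self._descriptions: Dict[str, str] = {}

    def add(self, name: str, description: str, fn: Skill) -> None:
        self._skills[name] = fn
        self._descriptions[name] = description

    def compose(self, name: str, description: str, first: str, second: str) -> None:
        # Build a more complex skill out of two stored ones.
        f, g = self._skills[first], self._skills[second]
        self.add(name, description, lambda x: g(f(x)))

    def retrieve(self, task: str) -> Optional[Skill]:
        # Assumption: keyword overlap stands in for embedding similarity.
        words = set(task.lower().split())
        best, best_score = None, 0
        for name, desc in self._descriptions.items():
            score = len(words & set(desc.lower().split()))
            if score > best_score:
                best, best_score = name, score
        return self._skills[best] if best is not None else None


def frozen_executor(library: SkillLibrary, task: str, x: int) -> Optional[int]:
    # The executor's logic never changes; only the library it reads from does.
    skill = library.retrieve(task)
    return skill(x) if skill is not None else None


lib = SkillLibrary()
before = frozen_executor(lib, "double then increment", 5)  # no skill stored yet

lib.add("double", "double a number", lambda x: x * 2)
lib.add("inc", "increment a number", lambda x: x + 1)
lib.compose("double_inc", "double then increment a number", "double", "inc")
after = frozen_executor(lib, "double then increment", 5)
```

The same `frozen_executor` fails before the library is populated and succeeds afterward, so every bit of the improvement is attributable to the external store.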

But there's a sharper reading of 'in the loop': adapting tools *while the agent is mid-task*, not in some offline curation window. Here the corpus says the offline/in-loop split is itself a quality problem. MUSE-Autoskill argues that authoring skills outside the loop creates a 'situated context' mismatch, and that invoking skill creation from inside the reasoning loop, grounded in exact task state and immediate feedback, closes that gap; the resulting skills reach 87.94% task accuracy and transfer to other agents with minimal loss [Does creating skills inside the agent loop eliminate mismatches?]. DeepAgent makes the parallel case for tool *selection*: discovering tools dynamically during execution beats pre-retrieving a fixed set, because the agent keeps a global view and can change strategy mid-flight [Can agents discover tools dynamically instead of pre-selecting them?]. So adaptation in the loop isn't just possible; for long-horizon work, these results suggest it's the better design.
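The contrast between pre-retrieval and in-loop discovery can be made concrete with a toy loop. Everything here is an assumption made for illustration: the `REGISTRY` contents, the keyword-overlap `search_tools` retriever, and the deliberately weak baseline that retrieves a single tool from the first sub-goal only. It is not DeepAgent's mechanism, and it is not evidence for the claim.

```python
from typing import Callable, Dict, List, Tuple

State = List[str]

# Hypothetical tool registry: name -> (description, function on state).
REGISTRY: Dict[str, Tuple[str, Callable[[State], State]]] = {
    "fetch_page": ("fetch web page", lambda s: s + ["page"]),
    "parse_table": ("parse table from page", lambda s: s + ["table"]),
    "plot_chart": ("plot chart from table", lambda s: s + ["chart"]),
}


def search_tools(query: str, k: int = 1) -> List[str]:
    # Rank tools by keyword overlap with the query (stand-in for a retriever).
    words = set(query.split())
    ranked = sorted(
        REGISTRY,
        key=lambda name: len(words & set(REGISTRY[name][0].split())),
        reverse=True,
    )
    return ranked[:k]


def run(subgoals: List[str], in_loop: bool) -> State:
    state: State = []
    # Pre-retrieval baseline: one search up front, using only the first sub-goal.
    fixed = search_tools(subgoals[0])
    for goal in subgoals:
        # In-loop discovery: search again with the current sub-goal.
        candidates = search_tools(goal) if in_loop else fixed
        _, tool = REGISTRY[candidates[0]]
        state = tool(state)
    return state


GOALS = ["fetch web page", "parse table", "plot chart"]
dynamic = run(GOALS, in_loop=True)
static = run(GOALS, in_loop=False)
```

With in-loop search each step gets the tool that matches its sub-goal, while the fixed set keeps reapplying the first tool; a stronger pre-retrieval baseline (larger `k`, a query built from the whole task) would narrow the gap in this toy setting.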

The most direct answer to 'without freezing the agent' is MetaClaw, which refuses to choose: deployed agents run two adaptation timescales at once, fast skill injection from failures with zero downtime (seconds) and slower gradient optimization during idle windows (minutes to hours). The two reinforce each other, since better policies surface more informative failures and richer skills enable higher-reward runs [Can agents adapt without pausing service to users?]. The same continuous-feedback logic shows up in FluxMem, whose memory forms, refines, and consolidates its own links from closed-loop execution feedback [Should agent memory adapt dynamically based on execution feedback?], and in Agent Workflow Memory, which induces reusable sub-task routines on the fly for relative gains of 24.6% on Mind2Web and 51.1% on WebArena [Can agents learn reusable sub-task routines from past experience?].
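The two timescales can be sketched as a single serving loop. This is a toy under loud assumptions, not MetaClaw's implementation: a "skill" is just the task name, success means having a matching skill, and the slow path is a stub that bumps a `policy_version` counter where gradient optimization would run. `Agent`, `handle`, and `idle_update` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Agent:
    skills: Set[str] = field(default_factory=set)
    policy_version: int = 0
    failure_buffer: List[str] = field(default_factory=list)

    def handle(self, task: str) -> bool:
        # Toy success rule: the agent succeeds only if it has a matching skill.
        ok = task in self.skills
        if not ok:
            # Fast timescale: inject a skill immediately and keep serving.
            self.skills.add(task)
            # Keep the failure for the slow path to learn from later.
            self.failure_buffer.append(task)
        return ok

    def idle_update(self) -> None:
        # Slow timescale: stand-in for gradient optimization on buffered failures.
        if self.failure_buffer:
            self.policy_version += 1
            self.failure_buffer.clear()


agent = Agent()
events = ["book_flight", "book_flight", "IDLE", "file_report", "file_report"]
results: List[bool] = []
for event in events:
    if event == "IDLE":
        agent.idle_update()  # runs only when no request is waiting
    else:
        results.append(agent.handle(event))
```

Requests are never blocked: the fast path repairs a failure before the next call arrives, and the slow path consumes the buffered failures only during the idle event.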

The quiet surprise running underneath all of this: freezing the agent often *helps*. The papers that hold weights fixed and let tools, skills, and memory evolve aren't accepting a limitation — they're avoiding catastrophic forgetting, getting cross-backbone transfer for free, and keeping service live. The thing you'd assume is the bottleneck (a frozen model) turns out to be the feature that makes safe, continuous tool adaptation possible.


Sources (10 notes)

How do agentic AI systems decompose into adaptation paradigms?

A 2x2 taxonomy based on optimization target (agent vs tool) and feedback signal (execution vs output) unifies dispersed adaptation research. This framework directly maps to implementation decisions and explains trade-offs like query quality versus final answer quality.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can frozen language models continually improve through memory structure alone?

Agents using causal-form memory (preserving applicability conditions) outperform generic reflection by 23 points on repeated trials and gain 4-17 points transferring to new environments, showing memory shape matters more than parameter updates.

Can frozen models learn better by extracting context into skills?

Extracting natural-language rules from context into reusable skills improves frozen model reasoning without weight updates. On CL-bench, this lifts GPT-4.1 from 11.1% to 16.5%, with skills transferable across model backbones.

Does creating skills inside the agent loop eliminate mismatches?

MUSE-Autoskill demonstrates that invoking skill creation from within the agent's reasoning loop grounds new skills in exact task context, immediate feedback, and runtime validation. In-loop skills reach 87.94% task accuracy and transfer to other agents with minimal loss, eliminating the situated context problem of offline authoring.

Can agents discover tools dynamically instead of pre-selecting them?

DeepAgent demonstrates that discovering tools as needed—rather than pre-retrieving a fixed set—enables agents to maintain global task perspective and adapt strategy mid-execution. This approach scales better for long-horizon tasks where the tool space is too large to enumerate.

Can agents adapt without pausing service to users?

MetaClaw demonstrates that deployed agents require both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). The two mechanisms reinforce each other, with better policies producing more informative failures and richer skills enabling higher-reward trajectories.

Should agent memory adapt dynamically based on execution feedback?

FluxMem demonstrates that adaptive memory topology—where links form, refine, and consolidate based on closed-loop execution feedback—consistently reaches state-of-the-art across three distinct benchmarks. Dynamic connectivity outperforms fixed retrieval by aligning abstraction and eliminating interference.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher tasked with re-evaluating whether tool adaptation can work without freezing the agent in the loop — a question that sits at the intersection of continual learning, agentic design, and safe deployment.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as perishable assertions to be re-tested against current model capabilities and orchestration practices.

• Freezing agent weights while evolving tool/skill repositories *prevents catastrophic forgetting* and enables cross-backbone transfer; SkillOS demonstrates curator-driven skill reshaping with frozen executors (2026).
• In-loop skill creation (grounded in live task state and immediate feedback) closes the "situated context" mismatch of offline authoring and transfers to other agents with minimal loss (MUSE-Autoskill, 2026); dynamic tool discovery during execution outperforms pre-retrieved fixed toolsets (DeepAgent, 2025).
• Dual-timescale adaptation—fast skill injection on failure (seconds) + slow gradient optimization in idle windows (minutes–hours)—reinforces itself without full agent freeze; MetaClaw achieves this with zero downtime (2026).
• Memory-centric frozen-model gains: causal-form memory beats generic reflection for frozen agents, and natural-language skill extraction lifts frozen GPT-4.1 on CL-bench; Agent Workflow Memory induces sub-task routines on the fly with relative gains of 24.6% (Mind2Web) and 51.1% (WebArena) (2024–2026).
• The core tension: freezing looks like a constraint but emerges as a *feature* for safe, live adaptation.

Anchor papers (verify; mind their dates):
• arXiv:2603.17187 — MetaClaw (2026): dual-timescale agent adaptation without full freeze.
• arXiv:2605.06614 — SkillOS (2026): decoupled skill curation from frozen executor.
• arXiv:2605.27366 — MUSE-Autoskill (2026): in-loop skill creation from runtime state.
• arXiv:2510.21618 — DeepAgent (2025): dynamic tool discovery during execution.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, probe whether post-2026 models (Claude 4, GPT-5 class, or stronger open models), scaling in inference-time compute, improved memory management (e.g., persistent KV caches, structured retrieval), or multi-agent orchestration have since *relaxed* or *overturned* the requirement to freeze. Separate the durable question ("does unfrozen gradient-based learning harm long-horizon behavior?") from the perishable claim ("you must freeze to avoid forgetting"). Cite what resolved it, or state plainly where the constraint still holds.

(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months—especially anything showing unfrozen agents with comparable or better safety/transfer profiles, or evidence that in-loop weight updates now scale without catastrophic forgetting.

(3) Propose 2 new research questions that assume the regime may have shifted:
   – Can emergent planning-ahead in modern LLMs compensate for the loss of explicit catastrophic-forgetting safeguards, making unfrozen adaptation viable at scale?
   – Does the value of tool adaptation *in the loop* (vs. offline curation) depend on task horizon length or feedback granularity in ways that newer inference architectures change?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
