Should we train the evolver or the executor when building self-improving agents?
This explores a design fork in self-improving agents: do you put the learning into the executor that does the task, or into the separate 'evolver' that rewrites the agent's skills, prompts, and harness? The corpus increasingly points toward training the evolver while freezing the executor.
This reads the question as a place-the-learning problem. A self-improving agent has two layers: the executor that acts in the environment, and an evolver (or curator) that revises what the executor knows and how it is wired. The corpus's most direct answer comes from SkillOS ("Can a separate trained curator improve skill libraries better than frozen agents?"), which keeps the executor frozen and trains only the curator. The payoff is that the curator learns to push skill repositories away from generic, verbose additions toward actionable execution logic and cross-task meta-strategies, and, crucially, it generalizes across different executor backbones. That last detail is the real argument for training the evolver: the learned thing transfers, rather than being baked into one model's weights.
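The split described above can be sketched in a few lines of Python. This is a toy illustration, not SkillOS's method: the class names, the two edit styles, the substring-match success rule, and the running-average update are all assumptions made for the example. The only point it tries to show is that the executor carries no learnable state while the curator does, and that a curator trained against one executor can be reused with another.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenExecutor:
    """Hypothetical stand-in for the task-solving model; it has no trainable state."""
    name: str

    def run(self, task: str, skills: list[str]) -> bool:
        # Toy success rule (an assumption): a task succeeds only if some skill names it.
        return any(task in skill for skill in skills)


@dataclass
class Curator:
    """The only component with learnable state: one preference score per edit style."""
    prefs: dict[str, float] = field(
        default_factory=lambda: {"generic": 0.0, "actionable": 0.0}
    )
    lr: float = 0.5

    def propose(self, task: str, style: str) -> str:
        if style == "actionable":
            return f"For {task}: run the concrete steps, then verify the result."
        return "Be careful and think step by step."

    def choose_style(self) -> str:
        return max(self.prefs, key=self.prefs.get)

    def update(self, style: str, reward: float) -> None:
        # Running average of the reward observed for this edit style.
        self.prefs[style] += self.lr * (reward - self.prefs[style])


def train_curator(curator: Curator, executor: FrozenExecutor, tasks: list[str]) -> None:
    for task in tasks:
        for style in curator.prefs:  # try every edit style once per task
            skill = curator.propose(task, style)
            reward = 1.0 if executor.run(task, [skill]) else 0.0
            curator.update(style, reward)  # only the curator changes


curator = Curator()
train_curator(
    curator,
    FrozenExecutor("backbone-a"),
    ["parse_csv", "retry_http", "merge_branches"],
)

# Reuse the trained curator with a different executor on an unseen task.
library = [curator.propose("rotate_logs", curator.choose_style())]
print(curator.choose_style())                                      # actionable
print(FrozenExecutor("backbone-b").run("rotate_logs", library))    # True
```

In a real system the reward would come from task outcomes in the environment and the curator would be a learned model rather than a two-entry table; the sketch only fixes where the trainable state lives.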
A sharp supporting clue explains why the executor is the wrong place to invest. The finding that harness-improvement quality is flat across model tiers ("Do stronger models always evolve their own harnesses better?") shows that generating useful updates isn't bottlenecked by raw model strength: even smaller models write comparable edits. The bottleneck is activating and following those updates, an ability that peaks at mid-tier models rather than growing with model strength. So pouring capability into the executor buys you less than you'd think; the leverage is in the evolution loop, not the actor.
But 'train the evolver' has a precondition the corpus is blunt about: the evolver needs a real external signal, or it eats itself. The self-improvement mirage note ("Can models reliably improve themselves without external feedback?") argues that pure self-improvement stalls on the generation-verification gap, diversity collapse, and reward hacking; every method that actually works smuggles in an external anchor (past versions, third-party judges, user corrections, tool feedback). The successful evolvers in this collection all obey that rule: the Darwin Gödel Machine ("Can AI systems improve themselves through trial and error?") swaps formal proofs for empirical benchmarking against an archive of variants; FlowReasoner ("Can AI systems design unique multi-agent workflows per individual query?") trains a meta-agent on external execution feedback; SkillClaw ("How can agent systems share learned skills across users?") runs its autonomous evolver over aggregated cross-user trajectories. The evolver is trainable precisely because the environment grades it.
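To make the "external anchor" idea concrete, here is a minimal sketch of an archive-based evolution loop in the spirit of the empirical-benchmarking approach described above. It is not the Darwin Gödel Machine's algorithm: the bit-tuple "variants", the hidden-target benchmark, the single-bit mutation, and uniform parent sampling are all assumptions chosen to keep the example small. What it illustrates is that every variant's score comes from a grader the evolver does not control, and that the archive keeps all variants rather than only the current best.

```python
import random

Variant = tuple[int, ...]


def external_benchmark(variant: Variant) -> float:
    """Hypothetical external grader: fraction of hidden target bits matched."""
    target = (1, 0, 1, 1, 0, 1, 0, 1)
    return sum(v == t for v, t in zip(variant, target)) / len(target)


def mutate(variant: Variant, rng: random.Random) -> Variant:
    """Flip one randomly chosen bit to produce a child variant."""
    i = rng.randrange(len(variant))
    return variant[:i] + (1 - variant[i],) + variant[i + 1:]


def evolve(steps: int = 200, seed: int = 0) -> dict[Variant, float]:
    rng = random.Random(seed)
    start: Variant = (0,) * 8
    # The archive maps every variant ever tried to its externally measured score.
    archive = {start: external_benchmark(start)}
    for _ in range(steps):
        # Sample any archived variant as parent, not only the best one
        # (uniform sampling is a simplification for this sketch).
        parent = rng.choice(list(archive))
        child = mutate(parent, rng)
        if child not in archive:
            archive[child] = external_benchmark(child)
    return archive


archive = evolve()
best = max(archive, key=archive.get)
print(len(archive), best, archive[best])
```

Replacing `external_benchmark` with a score the evolver assigns to itself would remove the anchor, which is the failure mode the paragraph above warns about.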
The interesting wrinkle is that 'train the evolver, freeze the executor' doesn't have to involve weight updates at all. A whole branch of the corpus evolves the executor's behavior through externalized memory while leaving its parameters untouched: VOYAGER's composable skill library ("Can agents learn new skills without forgetting old ones?") avoids the catastrophic forgetting that weight updates cause, Reflexion ("Can agents learn from failure without updating their weights?") stores verbal self-diagnoses as episodic memory, and ReasoningBank ("Can agents learn better from their failures than successes?") distills strategy hints from both wins and failures. Here the 'evolver' is whatever curates that store, and the SkillOS result says: don't curate it by hand or with a frozen agent; train something to curate it.
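As a rough illustration of externalized memory, the sketch below stores strategy hints from both successes and failures and injects the most relevant ones into the executor's prompt. It does not reproduce any of the cited systems: the class and function names are hypothetical, and word overlap stands in for the embedding-based retrieval a real skill library would more plausibly use.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    task: str
    hint: str
    from_success: bool


class ExternalMemory:
    """Hypothetical hint store kept outside the executor's weights."""

    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def add(self, task: str, hint: str, from_success: bool) -> None:
        self.items.append(MemoryItem(task, hint, from_success))

    def retrieve(self, task: str, k: int = 2) -> list[MemoryItem]:
        # Word overlap is a stand-in (an assumption) for embedding similarity.
        words = set(task.lower().split())
        scored = [
            (len(words & set(m.task.lower().split())), i, m)
            for i, m in enumerate(self.items)
        ]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: (-s[0], s[1]))  # most overlap first, then oldest
        return [m for _, _, m in scored[:k]]


def build_prompt(task: str, memory: ExternalMemory) -> str:
    """Assemble the text a frozen executor would see for this task."""
    lines = [
        f"- {'do' if m.from_success else 'avoid'}: {m.hint}"
        for m in memory.retrieve(task)
    ]
    return "\n".join([f"Task: {task}", "Hints:"] + lines)


memory = ExternalMemory()
memory.add("parse csv file", "check the delimiter before reading", True)
memory.add("parse json file", "assuming every key is present", False)
memory.add("deploy web service", "pin dependency versions", True)

prompt = build_prompt("parse csv report", memory)
print(prompt)
```

The executor's behavior on "parse csv report" changes only because the prompt changed; whatever decides which hints get written into `memory` plays the evolver's role.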
Where the corpus gets honest about the ceiling is metacognition. Truly self-improving agents need intrinsic metacognition ("Can AI systems improve their own learning strategies?"); today's evolvers run on fixed, human-designed loops that break under domain shift. So the deeper answer to 'evolver or executor' is: train the evolver, but the frontier is making the evolver able to revise its own learning strategy, not just the executor's skills. RLVMR's process rewards for planning and reflection ("Can RL agents learn to reason better, not just succeed?") are an early step toward training that metacognitive layer directly rather than treating it as a fixed scaffold. The upshot: the question isn't really executor versus evolver; it's how high up the meta-ladder you can afford to put the trainable part.
Sources (11 notes)
SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.
Model strength doesn't bottleneck writing useful harness edits: even smaller models generate comparable improvements. But the ability to use those updates is non-monotonic, peaking at mid-tier models, with weak and strong models both struggling to activate and follow updated instructions.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.
FlowReasoner demonstrates that meta-agents trained with reinforcement learning and external execution feedback can generate unique multi-agent architectures for each user query, optimizing across performance, complexity, and efficiency—moving beyond fixed task-level workflow templates.
SkillClaw aggregates interaction trajectories across users, processes them through an autonomous evolver that identifies patterns and refines skills, then synchronizes updates system-wide. This converts siloed individual learning into shared capability improvement without manual curation.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.
Current self-improvement methods use extrinsic, fixed metacognitive loops designed by humans that fail under domain shift or capability changes. True self-improvement requires agents to generate their own adaptive metacognitive knowledge, planning, and evaluation—a gap confirmed as a neglected research area across neuro-symbolic AI.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.