Do stronger models always evolve their own harnesses better?
When AI agents self-improve their prompts and tools, does raw model power help as much with writing updates as with using them? Understanding this split could reshape how we design self-evolving systems.
Self-evolving agents edit an external harness (prompts, skills, memories, tools) from execution evidence, without touching model parameters. The natural assumption is that stronger base models do this better on both ends. The paper "Harness Updating Is Not Harness Benefit" disentangles two distinct capabilities and finds that neither follows that assumption.
Harness-updating — producing persistent edits that lead to gains — is flat in base capability. Models across capability tiers produce updates yielding surprisingly similar gains; even a Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6. Writing a good skill or memory is apparently not bottlenecked by raw model strength.
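One way to make "gains induced by an evolver" concrete is to hold the set of executing agents fixed and average each agent's improvement over its own no-harness baseline. The sketch below is only an illustration under that assumption: the function name `updating_gain`, the dictionary layout, and every score are hypothetical placeholders, not the paper's metric definition or its results.

```python
from statistics import mean


def updating_gain(baseline, evolved, evolver):
    """Mean improvement over baseline across all agents that run
    the harness written by `evolver` (hypothetical metric sketch)."""
    return mean(evolved[(evolver, agent)] - baseline[agent] for agent in baseline)


# Synthetic placeholder scores, NOT taken from the paper.
baseline = {"agent_a": 0.40, "agent_b": 0.60}
evolved = {
    ("evolver_small", "agent_a"): 0.50,
    ("evolver_small", "agent_b"): 0.70,
    ("evolver_large", "agent_a"): 0.52,
    ("evolver_large", "agent_b"): 0.68,
}

gains = {e: updating_gain(baseline, evolved, e)
         for e in ("evolver_small", "evolver_large")}
```

A "flat" result in this framing would mean the values in `gains` stay close to each other as the evolver changes, which is how the toy numbers above were chosen.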
Harness-benefit — actually improving when handed an updated harness — is non-monotonic. Weak-tier models gain little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. Two failure modes explain the weak end: failing to activate the relevant harness artifact, and failing to follow it faithfully once activated.
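The two weak-end failure modes can be separated mechanically if each episode logs which harness artifacts were invoked and which steps were executed. The following is a minimal sketch under that assumption; the `Episode` fields, the `Outcome` labels, and the in-order step check are hypothetical choices for illustration, not the paper's annotation scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NOT_ACTIVATED = "relevant artifact never invoked"
    NOT_FOLLOWED = "artifact invoked but its steps were not followed"
    FOLLOWED = "artifact invoked and followed"


@dataclass
class Episode:
    relevant_artifact: str   # artifact the task called for (assumed known)
    invoked_artifacts: list  # artifacts the agent actually loaded
    required_steps: list     # steps the artifact prescribes, in order
    executed_steps: list     # steps the agent actually took, in order


def classify(ep: Episode) -> Outcome:
    # Failure mode 1: the agent never activates the relevant artifact.
    if ep.relevant_artifact not in ep.invoked_artifacts:
        return Outcome.NOT_ACTIVATED
    # Failure mode 2: activated, but the prescribed steps do not appear
    # in order within the executed steps (extra steps are tolerated).
    remaining = iter(ep.executed_steps)
    if all(step in remaining for step in ep.required_steps):
        return Outcome.FOLLOWED
    return Outcome.NOT_FOLLOWED


missed = Episode("deploy_skill", ["search_skill"], ["build", "test"], ["build", "test"])
drifted = Episode("deploy_skill", ["deploy_skill"], ["build", "test"], ["test", "build"])
faithful = Episode("deploy_skill", ["deploy_skill"], ["build", "test"], ["build", "lint", "test"])
```

Counting outcomes per agent tier would show whether a low harness-benefit score comes mostly from missed activation or from unfaithful following, which matters because the two point to different training targets.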
The practical inversion is sharp: invest capability budget in the agent that uses the harness, not the evolver that writes it, and target agent training at harness invocation and long-horizon instruction-following rather than at generating cleverer updates. This complicates the "let a frontier model improve everything" intuition and connects to the related note "Why do better reasoning models ignore instructions?": strong models may benefit less precisely because the bottleneck is faithful instruction-following, which scaling erodes.
Related concepts in this collection
- Where does agent reliability actually come from?
Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.
sharpens it: which model should hold which role in a harness-evolution loop
- Can a separate trained curator improve skill libraries better than frozen agents?
Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.
same updater/executor split; this note adds the capability-tier asymmetry
- How can agent self-evolution be made safe and auditable?
As agents begin updating their own prompts and tools, how can we track these changes, measure their effects, and safely reverse problematic updates? This matters because untracked evolution leads to unmaintainable systems and makes regressions impossible to diagnose.
the protocol substrate this capability analysis runs on top of
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
- Scaling Laws for Agent Harnesses via Effective Feedback Compute
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
- From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
- Automated Alignment Researchers: Using large language models to scale scalable oversight
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
- Self-Improving Model Steering
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Original note title
the capacity to produce useful harness updates is flat across model tiers but the capacity to benefit from them peaks at mid-tier