Do stronger models always evolve their own harnesses better?

When AI agents self-improve their prompts and tools, does raw model power help as much with writing updates as with using them? Understanding this split could reshape how we design self-evolving systems.

Synthesis note · 2026-06-03 · sourced from Evolution

Self-evolving agents edit an external harness — prompts, skills, memories, tools — from execution evidence, without touching model parameters. The natural assumption is that stronger base models do this better on both ends. This paper disentangles two distinct capabilities and finds neither follows that assumption.
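To make the setup concrete, here is a minimal sketch of a harness as external, editable state. It is an illustration, not the paper's implementation: the class names, field names, and the `apply_update` helper are all assumptions introduced here. The point it shows is that an update produces a new harness while the model stays fixed.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Harness:
    """External state around a fixed model (hypothetical field names)."""
    system_prompt: str = ""
    skills: Dict[str, str] = field(default_factory=dict)  # skill name -> instructions
    memories: Tuple[str, ...] = ()                         # persistent notes
    tools: Tuple[str, ...] = ()                            # names of callable tools


@dataclass(frozen=True)
class HarnessUpdate:
    """One persistent edit an evolver proposes from execution evidence (hypothetical)."""
    add_skills: Dict[str, str] = field(default_factory=dict)
    add_memories: Tuple[str, ...] = ()
    new_system_prompt: Optional[str] = None


def apply_update(harness: Harness, update: HarnessUpdate) -> Harness:
    """Return a new harness with the edit applied; model parameters are not involved."""
    prompt = harness.system_prompt
    if update.new_system_prompt is not None:
        prompt = update.new_system_prompt
    return replace(
        harness,
        system_prompt=prompt,
        skills={**harness.skills, **update.add_skills},
        memories=harness.memories + update.add_memories,
    )
```

Keeping the harness immutable and returning a new copy is a design choice of this sketch: it leaves the pre-update harness available as the baseline for measuring gains.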

Harness-updating — producing persistent edits that lead to gains — is flat in base capability. Models across capability tiers produce updates yielding surprisingly similar gains; even a Qwen3.5-9B evolver induces gains comparable to Claude Opus 4.6. Writing a good skill or memory is apparently not bottlenecked by raw model strength.
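One way to operationalize harness-updating is the mean score gain that each evolver's updates induce, averaged over the runs that used them. The sketch below assumes a simple run record; the `Run` type, its fields, and `gain_by_evolver` are hypothetical names, and the paper's actual protocol may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, NamedTuple


class Run(NamedTuple):
    evolver: str          # model that wrote the harness update
    agent: str            # model that executed the task with the harness
    base_score: float     # task score with the original harness
    evolved_score: float  # task score with the updated harness


def gain_by_evolver(runs: List[Run]) -> Dict[str, float]:
    """Mean score gain per evolver, averaged over all runs that used its updates."""
    buckets = defaultdict(list)
    for r in runs:
        buckets[r.evolver].append(r.evolved_score - r.base_score)
    return {evolver: mean(gains) for evolver, gains in buckets.items()}
```

Under this measure, "flat in base capability" would show up as similar values across evolvers of very different strength.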

Harness-benefit — actually improving when handed an updated harness — is non-monotonic. Weak-tier models gain little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. Two failure modes explain the weak end: failing to activate the relevant harness artifact, and failing to follow it faithfully once activated.
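The two weak-tier failure modes can be told apart from an execution trace. The sketch below is one possible check, not the paper's method: the `Trace` fields are hypothetical, and "followed faithfully" is operationalized here as the artifact's required steps appearing in order within the executed steps, which is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    NOT_ACTIVATED = "not_activated"  # relevant artifact was never loaded or invoked
    NOT_FOLLOWED = "not_followed"    # artifact was activated, but its steps were not followed
    FOLLOWED = "followed"


@dataclass
class Trace:
    activated_artifacts: Sequence[str]  # names of artifacts the agent loaded or invoked
    executed_steps: Sequence[str]       # actions the agent took, in order


def classify(trace: Trace, artifact: str, required_steps: Sequence[str]) -> Outcome:
    """Label a trace by where, if anywhere, use of the artifact broke down."""
    if artifact not in trace.activated_artifacts:
        return Outcome.NOT_ACTIVATED
    # Ordered-subsequence test: each required step must occur after the previous one.
    remaining = iter(trace.executed_steps)
    followed = all(step in remaining for step in required_steps)
    return Outcome.FOLLOWED if followed else Outcome.NOT_FOLLOWED
```

Counting outcomes per agent model would show whether a low benefit comes mostly from missed activation or from unfaithful execution.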

The practical inversion is sharp: invest capability budget in the agent that uses the harness, not in the evolver that writes it, and target agent training at harness invocation and long-horizon instruction-following rather than at generating cleverer updates. This complicates the "let a frontier model improve everything" intuition. It also connects to the note "Why do better reasoning models ignore instructions?": strong models may benefit less precisely because the bottleneck is faithful instruction-following, which scaling erodes.
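A rough way to check which role deserves the capability budget in a given setup is to compare how much the mean gain varies across evolvers with how much it varies across agents. The sketch below assumes gain records of the form (evolver, agent, gain); the record layout and function names are hypothetical. Note that the spread measures sensitivity only and does not capture the non-monotonic shape.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple

# Each record: (evolver_model, agent_model, gain from using the evolved harness)
Record = Tuple[str, str, float]


def spread(records: List[Record], role: int) -> float:
    """Max minus min of mean gain when grouping by role (0 = evolver, 1 = agent)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[role]].append(rec[2])
    means = [mean(values) for values in groups.values()]
    return max(means) - min(means)


def more_sensitive_role(records: Iterable[Record]) -> str:
    """Name the role whose model choice moves the mean gain more."""
    recs = list(records)
    return "agent" if spread(recs, 1) > spread(recs, 0) else "evolver"
```

If the evolver spread is near zero while the agent spread is large, the choice of agent model is what matters, which is the pattern this note describes.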

Related concepts in this collection 3

Where does agent reliability actually come from?
Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.
sharpens it: which model should hold which role in a harness-evolution loop

Can a separate trained curator improve skill libraries better than frozen agents?
Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.
same updater/executor split; this note adds the capability-tier asymmetry

How can agent self-evolution be made safe and auditable?
As agents begin updating their own prompts and tools, how can we track these changes, measure their effects, and safely reverse problematic updates? This matters because untracked evolution leads to unmaintainable systems and makes regressions impossible to diagnose.
the protocol substrate this capability analysis runs on top of

Original note title

the capacity to produce useful harness updates is flat across model tiers but the capacity to benefit from them peaks at mid-tier
