INQUIRING LINE

When should agent-created code be promoted into permanent harness infrastructure?

This explores the lifecycle moment when code an agent writes on the fly should graduate from a throwaway artifact into durable, shared tooling that future agents depend on, and what signals justify the promotion.


This explores when agent-created code should stop being disposable and become permanent harness infrastructure. The corpus frames this as one of the least-mapped problems in agent design: of the three layers of agentic code, the agent-authored artifacts that persist and get shared across agents are the underexplored one, and the open questions cluster exactly around persistence, sharing, and lifecycle management (What makes agent-created code artifacts so hard to manage?). So there is no settled answer yet, but the surrounding research suggests a usable set of promotion criteria.

The first signal is that the code was created in context and validated against real execution. The strongest finding here is that skills authored inside the agent's own reasoning loop (grounded in exact task context, immediate feedback, and runtime validation) outperform offline-authored ones and transfer to other agents with minimal loss (Does creating skills inside the agent loop eliminate mismatches?). The implication is that promotion shouldn't be a separate curation step bolted on afterward; the code that survives is the code that already proved itself while solving the live task. Promote what the loop already verified, not what looks reusable in the abstract.

The second signal is compositionality: does the artifact become a building block? VOYAGER's lesson is that storing executable skills in an indexed library and composing complex skills from simpler ones lets an agent keep learning without the catastrophic forgetting that weight updates cause (Can agents learn new skills without forgetting old ones?). Code earns permanence when it compounds: when later tasks call it as a primitive rather than re-deriving it. This works precisely because code is an executable, inspectable, stateful medium: you can read it, run it, and check it before trusting it (Can code become the operational substrate for agent reasoning?). That is what makes promotion safe in a way that promoting opaque model weights never is.
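The compounding described above can be sketched in a few lines. This is a minimal illustration, not VOYAGER's implementation: VOYAGER retrieves skills through an embedding index, while this sketch looks them up by name to stay self-contained. `SkillLibrary`, `register`, `compose`, and `call_counts` are hypothetical names introduced here.

```python
from typing import Callable, Dict, List


class SkillLibrary:
    """Hypothetical in-memory skill store (name lookup, not embedding retrieval)."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[[int], int]] = {}
        # How often each skill was invoked, directly or as a step of another skill.
        self.call_counts: Dict[str, int] = {}

    def register(self, name: str, fn: Callable[[int], int]) -> None:
        """Store an executable skill under a name."""
        self._skills[name] = fn
        self.call_counts[name] = 0

    def call(self, name: str, value: int) -> int:
        """Run a stored skill and record that it was used."""
        self.call_counts[name] += 1
        return self._skills[name](value)

    def compose(self, name: str, steps: List[str]) -> None:
        """Register a new skill that runs existing skills in order."""

        def composed(value: int) -> int:
            for step in steps:
                value = self.call(step, value)
            return value

        self.register(name, composed)


library = SkillLibrary()
library.register("double", lambda x: x * 2)
library.register("increment", lambda x: x + 1)
library.compose("double_then_increment", ["double", "increment"])
result = library.call("double_then_increment", 5)
```

Because composed skills route through `call`, the counts show which primitives later skills actually depend on. That usage record is one possible input to a promotion decision.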

The third signal is whether the artifact clears a reliability threshold high enough for the harness to depend on it deterministically. Two notes pull in the same direction here. Execution-free verification of patch equivalence can now reach about 93% accuracy on real agent code, which is the kind of bar that turns a one-off script into something a reward signal or a downstream agent can rely on (Can structured reasoning replace code execution for RL rewards?). And production teams find that once code is load-bearing, it has to be deterministic: explicit direct function calls beat flexible protocol-mediated access, because ambiguity that is tolerable in exploration becomes failure in production (Why do protocol-based tool integrations fail in production workflows?). So a rough rule: promote when the artifact is reusable, verified above your reliability bar, and behaves deterministically enough that other agents can call it without re-checking it.
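The rough rule above can be written as an explicit gate. The sketch below is one way to encode it under stated assumptions: the record fields, the `min_reuse` default, and the `reliability_bar` default are placeholders chosen for illustration, not values taken from the cited notes, and the determinism probe only compares repeated runs on sample inputs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass(frozen=True)
class ArtifactRecord:
    """Hypothetical bookkeeping for one agent-written artifact."""

    name: str
    reuse_count: int             # later tasks that called the artifact
    verified_accuracy: float     # fraction of verification checks passed, 0.0 to 1.0
    repeat_runs_identical: bool  # same inputs gave the same outputs on repeated runs


def is_deterministic(
    fn: Callable[..., Any],
    sample_inputs: Sequence[Tuple[Any, ...]],
    runs: int = 3,
) -> bool:
    """Probe determinism by running each sample input several times."""
    for args in sample_inputs:
        outputs = [fn(*args) for _ in range(runs)]
        if any(out != outputs[0] for out in outputs[1:]):
            return False
    return True


def should_promote(
    record: ArtifactRecord,
    min_reuse: int = 2,            # assumed threshold, tune per harness
    reliability_bar: float = 0.9,  # assumed threshold, tune per harness
) -> bool:
    """Promote only when all three signals hold: reuse, reliability, determinism."""
    return (
        record.reuse_count >= min_reuse
        and record.verified_accuracy >= reliability_bar
        and record.repeat_runs_identical
    )


candidate = ArtifactRecord(
    name="parse_build_log",
    reuse_count=3,
    verified_accuracy=0.95,
    repeat_runs_identical=is_deterministic(lambda x: x + 1, [(1,), (2,)]),
)
decision = should_promote(candidate)
```

Keeping the thresholds as parameters makes the "your reliability bar" part of the rule explicit: each harness sets its own bar instead of inheriting one.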

The quieter point worth taking away: promotion isn't only a quality gate; it's a governance act. Once agent-written code becomes permanent infrastructure, it becomes part of the operating environment every future agent consults, and embedding rules into that runtime layer has proven more effective than policing them after the fact (Can governance rules embedded in runtime memory actually protect autonomous agents?). Promoting code is how an agent system writes its own constitution one function at a time, which is also why the lifecycle question is still wide open rather than solved.
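To make "embedding rules into the runtime layer" concrete, here is a small sketch in which promoted code is only reachable through a registry that checks rules on every call and logs the outcome. This illustrates the idea and is not the system described in the cited note. `GovernedRegistry`, `sandbox_only`, and the `/sandbox/` path convention are all assumptions made for this example.

```python
from typing import Any, Callable, Dict, List, Tuple

# A rule receives the tool name and call arguments; it returns True to allow the call.
Rule = Callable[[str, tuple], bool]


class GovernedRegistry:
    """Hypothetical registry where rules run at call time, not in a later audit."""

    def __init__(self, rules: List[Rule]) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._rules = rules
        self.events: List[Tuple[str, str]] = []  # (tool name, "allowed" or "blocked")

    def promote(self, name: str, fn: Callable[..., Any]) -> None:
        """Make a function part of the shared, permanent tool set."""
        self._tools[name] = fn

    def call(self, name: str, *args: Any) -> Any:
        """Check every rule before running the tool, and log the decision."""
        for rule in self._rules:
            if not rule(name, args):
                self.events.append((name, "blocked"))
                raise PermissionError(f"call to {name!r} blocked by a governance rule")
        self.events.append((name, "allowed"))
        return self._tools[name](*args)


def sandbox_only(name: str, args: tuple) -> bool:
    """Example rule (assumed): string arguments must be paths under /sandbox/."""
    return all(not isinstance(a, str) or a.startswith("/sandbox/") for a in args)


registry = GovernedRegistry(rules=[sandbox_only])
registry.promote("describe_path", lambda path: f"would read {path}")
allowed_result = registry.call("describe_path", "/sandbox/notes.txt")
```

Since every caller goes through `call`, a future agent cannot use promoted code without also passing the rules, which is the sense in which promotion doubles as governance.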


Sources (7 notes)

What makes agent-created code artifacts so hard to manage?

Of the three agentic code layers, agent-authored artifacts that persist and are shared across agents are underexplored in research. Open challenges cluster around persistence, sharing, and lifecycle management — exactly where future gains in autonomy and coordination may live.

Does creating skills inside the agent loop eliminate mismatches?

MUSE-Autoskill demonstrates that invoking skill creation from within the agent's reasoning loop grounds new skills in exact task context, immediate feedback, and runtime validation. In-loop skills reach 87.94% task accuracy and transfer to other agents with minimal loss, eliminating the situated context problem of offline authoring.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Why do protocol-based tool integrations fail in production workflows?

MCP integration caused non-deterministic failures through ambiguous tool selection and parameter inference. Replacing it with explicit direct function calls and single-tool-per-agent design restored determinism. A 306-practitioner survey confirms 85% of production teams build custom agents, forgoing frameworks.

Can governance rules embedded in runtime memory actually protect autonomous agents?

A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing whether constraints on agent-code promotion have shifted. The question remains open: When should agent-created code move from disposable to permanent harness infrastructure?

What a curated library found — and when (dated claims, not current truth):
These findings span 2025–2026; treat them as perishable claims to be verified against the latest models, training regimes, and tooling:
- Skills authored inside the agent's runtime loop outperform offline-authored ones and transfer with minimal loss (~2026).
- Compositional skill libraries let agents compound learning without catastrophic forgetting; code earns permanence when later tasks call it as a primitive (~2026).
- Execution-free code verification can reach ~93% accuracy, enabling reliable deterministic promotion (~2026).
- Production agentic workflows require deterministic function calls, not protocol-mediated access; ambiguity tolerable in exploration fails in production (~2026).
- Embedding governance rules into the runtime layer (as permanent code) outperforms after-the-fact policing (~2026).

Anchor papers (verify; mind their dates):
- arXiv:2605.18747 "Code as Agent Harness" (2026-05)
- arXiv:2604.08377 "SkillClaw: Let Skills Evolve Collectively with Agentic Evolver" (2026-04)
- arXiv:2603.01896 "Agentic Code Reasoning" (2026-03)
- arXiv:2605.27366 "MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation" (2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, training methods, SDKs, or orchestration (memory layers, caching, multi-agent coordination) have relaxed or overturned it. Separate the durable question (likely still open) from the perishable limitation (possibly resolved); cite what resolved it. Pay special attention: have SLMs or new RL pipelines made the 93% execution-free bar obsolete, or raised it? Has protocol-mediated access matured enough to rival deterministic calls in production?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper argue that promotion should remain fluid, or that code permanence introduces systemic brittleness?
(3) Propose 2 research questions that ASSUME the regime may have moved: one on the reliability threshold needed for safe promotion today, one on whether governance-as-code scales across multi-agent societies.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
