Can one-off agent code be safely promoted to durable infrastructure?

This explores whether the throwaway scripts an agent writes mid-task can be hardened into reliable, reusable infrastructure — and what the corpus says has to be true for that promotion to be safe rather than reckless.

The corpus says yes: the disposable code an agent generates to solve one problem can be promoted into durable, trusted infrastructure, but only if you change *what kind of thing* the code is on the way up. The starting advantage is that agent code isn't just text output. The note "Can code become the operational substrate for agent reasoning?" argues that code is special precisely because it is executable, inspectable, and stateful at once, which is exactly the trio of properties you need to trust something long-term. A one-off script already has the raw material of infrastructure; the question is what you bolt onto it.

The most direct answer to 'how' comes from the idea that reliability isn't in the model; it's in the scaffolding around it. "Where does agent reliability actually come from?" frames durable agents as ones that push memory, skills, and protocols out of the model's head and into a stable harness layer, so the same problem doesn't get re-solved (and re-broken) every run. Promotion, in that framing, *is* the act of moving code from the model's transient reasoning into that harness. "Can agents learn new skills without forgetting old ones?" shows the mechanism concretely: VOYAGER stores working code as executable skills in an embedding-indexed library and composes complex ones from simpler ones, accumulating capability without the catastrophic forgetting that weight updates cause. That's the safe path: a one-off becomes a named, retrievable skill rather than dissolving back into the next prompt.
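As a rough illustration of what "a named, retrievable skill" could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `SkillLibrary`, `register`, `search`, and the example skills are hypothetical names, not VOYAGER's API, and the word-overlap search is only a crude stand-in for the embedding index the note describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    fn: Callable[..., object]


class SkillLibrary:
    """Hypothetical in-memory registry; a durable one would persist to disk."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, name: str, description: str, fn: Callable[..., object]) -> None:
        # Refuse silent overwrites so a promoted skill keeps a stable meaning.
        if name in self._skills:
            raise ValueError(f"skill {name!r} already registered")
        self._skills[name] = Skill(name, description, fn)

    def get(self, name: str) -> Callable[..., object]:
        return self._skills[name].fn

    def search(self, query: str) -> List[str]:
        # Assumption: word overlap stands in for embedding similarity.
        words = set(query.lower().split())
        scored = []
        for skill in self._skills.values():
            overlap = len(words & set(skill.description.lower().split()))
            if overlap:
                scored.append((overlap, skill.name))
        # Highest overlap first; ties broken by name for a stable order.
        return [name for _, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


library = SkillLibrary()
library.register(
    "parse_csv_line",
    "split one csv line into fields",
    lambda line: line.strip().split(","),
)
library.register(
    "sum_column",
    "sum a numeric column of csv rows",
    lambda rows, idx: sum(float(row[idx]) for row in rows),
)


def total_from_text(text: str, idx: int) -> float:
    # A complex skill composed from two simpler, already-registered skills.
    parse = library.get("parse_csv_line")
    total = library.get("sum_column")
    rows = [parse(line) for line in text.splitlines() if line.strip()]
    return total(rows, idx)


library.register("total_from_text", "sum a column from raw csv text", total_from_text)
```

The design choice worth noticing is that the composed skill looks its parts up by name instead of inlining them, so a fix to `parse_csv_line` reaches every skill built on it.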

But 'safely' is the load-bearing word, and here the corpus turns cautionary. "Why do protocol-based tool integrations fail in production workflows?" reports that the flexible, protocol-mediated tool access that's fine for exploration breaks in production: ambiguous tool selection and inferred parameters create non-deterministic failures, and the fix was to fall back to explicit direct function calls, with one tool per agent, to restore predictability. The lesson: durable infrastructure demands *less* of the improvisation that made the one-off code possible. The promotion isn't a copy-paste; it's a tightening. Verification is the other gate. "Can structured reasoning replace code execution for RL rewards?" is interesting precisely because it puts a number on reliability: execution-free patch-equivalence checks reached 93% accuracy, enough to cross the threshold needed for RL reward signals. For certain task classes, that is the point at which you can start trusting code judgments without running everything, and it is the kind of bar promotion should have to clear.
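To make "explicit direct function calls" concrete, here is a small Python sketch of a dispatcher that refuses to guess. It is an illustration under stated assumptions, not the setup from the note: `TOOLS`, `promoted`, `call_tool`, and `resize_image` are hypothetical names, and the type check only handles plain class annotations such as `str` and `int`.

```python
import inspect
from typing import Any, Callable, Dict

# Hypothetical registry of promoted tools, keyed by exact name.
TOOLS: Dict[str, Callable[..., Any]] = {}


def promoted(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a promoted tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@promoted
def resize_image(path: str, width: int, height: int) -> str:
    # Placeholder body: a real tool would do the work here.
    return f"{path}@{width}x{height}"


def call_tool(name: str, **kwargs: Any) -> Any:
    """Call a tool by exact name with exactly its declared parameters."""
    if name not in TOOLS:
        # No fuzzy matching: an unknown name is an error, not a best guess.
        raise KeyError(f"unknown tool: {name!r}")
    fn = TOOLS[name]
    sig = inspect.signature(fn)
    # Raises TypeError on missing or unexpected arguments, so nothing is inferred.
    sig.bind(**kwargs)
    for key, value in kwargs.items():
        expected = sig.parameters[key].annotation
        if expected is inspect.Parameter.empty or not isinstance(expected, type):
            continue  # unannotated or non-class annotation: skipped in this sketch
        if not isinstance(value, expected):
            raise TypeError(
                f"{key} must be {expected.__name__}, got {type(value).__name__}"
            )
    return fn(**kwargs)
```

A call such as `call_tool("resize_image", path="a.png", width=10, height=20)` succeeds, while a misspelled tool name, a missing `height`, or `width="10"` fails loudly before the tool runs. That loud failure is the tightening: the same input always takes the same path.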

Two more notes reframe the stakes. "Can governance rules embedded in runtime memory actually protect autonomous agents?" found that safeguards work when they live *inside* the runtime memory the agent actually consults; bolting policy on after promotion doesn't take. So durable infrastructure should carry its guardrails as part of itself, not as documentation. And "Can one compromised agent corrupt an entire multi-agent network?" is the warning shot: once a component becomes shared infrastructure that many agents depend on, a single corrupted piece can silently propagate through the whole network. Promotion multiplies blast radius, which is exactly why the determinism and embedded-governance disciplines matter more for durable code than for the one-off.
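One way to read "carry its guardrails as part of itself" is a skill object whose checks run on every call and whose decisions are logged alongside it. The Python sketch below is hypothetical throughout: `GovernedSkill`, `only_inside_sandbox`, the `/tmp/agent` sandbox, and the dry-run `remove_file` skill are illustrative assumptions, not the mechanism used in the study.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Any, Callable, List, Optional

# A guard inspects the call arguments and returns a refusal reason, or None.
Guard = Callable[[dict], Optional[str]]

SANDBOX = PurePosixPath("/tmp/agent")  # hypothetical allowed directory


def only_inside_sandbox(args: dict) -> Optional[str]:
    path = PurePosixPath(args["path"])
    # Reject ".." outright, since pure paths do not resolve it.
    if ".." in path.parts or SANDBOX not in path.parents:
        return f"{path} is outside {SANDBOX}"
    return None


@dataclass
class GovernedSkill:
    name: str
    fn: Callable[..., Any]
    guards: List[Guard]
    events: List[str] = field(default_factory=list)  # governance log

    def __call__(self, **kwargs: Any) -> Any:
        # Guards run on every call, so the policy cannot be skipped by a caller.
        for guard in self.guards:
            reason = guard(kwargs)
            if reason is not None:
                self.events.append(f"blocked {self.name}: {reason}")
                raise PermissionError(reason)
        self.events.append(f"allowed {self.name}")
        return self.fn(**kwargs)


remove_file = GovernedSkill(
    name="remove_file",
    fn=lambda path: f"would remove {path}",  # dry run: nothing is deleted
    guards=[only_inside_sandbox],
)
```

Calling `remove_file(path="/tmp/agent/x.txt")` goes through, while `remove_file(path="/etc/passwd")` raises `PermissionError`, and both outcomes land in `remove_file.events`. Because the guard travels with the skill, every agent that reuses the skill inherits the same limit.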

The thing you might not have expected to learn: the economics quietly reward doing this. "Do persistent agents really cost less per token?" found that in a long-lived agent environment 82.9% of tokens were cache reads. When code and context persist and get reused, the meaningful cost unit stops being tokens and becomes completed artifacts. Promoting one-off code to durable infrastructure isn't only a safety-and-reliability move; it is also part of what makes persistent agents cheap in the first place.
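The shift from cost-per-token to cost-per-artifact can be shown with a little arithmetic. In the Python sketch below only the 82.9% cache-read share comes from the case study; the prices, token count, and artifact count are made-up placeholders, and the model is deliberately simplified (it treats every token as either a fresh read or a cache read and ignores output tokens and cache writes).

```python
def blended_token_price(cache_share: float, fresh_price: float, cached_price: float) -> float:
    """Average price per token when a share of tokens are cheaper cache reads."""
    if not 0.0 <= cache_share <= 1.0:
        raise ValueError("cache_share must be between 0 and 1")
    return cache_share * cached_price + (1.0 - cache_share) * fresh_price


def cost_per_artifact(
    total_tokens: int,
    artifacts: int,
    cache_share: float,
    fresh_price: float,
    cached_price: float,
) -> float:
    """Total spend divided by completed artifacts, the denominator the note proposes."""
    if artifacts <= 0:
        raise ValueError("artifacts must be positive")
    price = blended_token_price(cache_share, fresh_price, cached_price)
    return total_tokens * price / artifacts


CACHE_SHARE = 0.829   # from the case study
FRESH_PRICE = 1.0     # hypothetical price unit per fresh token
CACHED_PRICE = 0.1    # hypothetical: cache reads assumed 10x cheaper

blended = blended_token_price(CACHE_SHARE, FRESH_PRICE, CACHED_PRICE)
per_artifact = cost_per_artifact(1_000_000, 50, CACHE_SHARE, FRESH_PRICE, CACHED_PRICE)
print(f"blended price per token: {blended:.4f}")
print(f"cost per artifact: {per_artifact:.1f}")
```

Under these placeholder prices the blended token price falls to roughly a quarter of the fresh price. The per-artifact figure then depends mostly on how many finished artifacts the same persistent context yields, which is the quantity that reuse improves.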

Sources (8 notes)

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens, namely memory (state persistence), skills (procedural components), and protocols (structured interaction), into a harness layer rather than relying on model scale alone. The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Why do protocol-based tool integrations fail in production workflows?

MCP integration caused non-deterministic failures through ambiguous tool selection and parameter inference. Replacing it with explicit direct function calls and single-tool-per-agent design restored determinism. A 306-practitioner survey confirms 85% of production teams build custom agents, forgoing frameworks.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Can governance rules embedded in runtime memory actually protect autonomous agents?

A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.

Can one compromised agent corrupt an entire multi-agent network?

Research demonstrates that a single biased agent can transmit persistent behavioral corruption through six downstream agents in chain and bidirectional topologies using only normal inter-agent communication. The bias evades detection and paraphrasing defenses because it carries no explicit semantic content.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether one-off agent code can safely become durable infrastructure. The question remains open; a curated library's findings (2025–2026) are dated claims, not current truth.

What a curated library found — and when:
• Code is safe to promote because it's executable, inspectable, and stateful — the raw material of trust — but only if wrapped in a harness layer that externalizes memory, skills, and protocols outside the model (2026-04).
• VOYAGER-style skill libraries avoid catastrophic forgetting by storing working code as indexed, composable skills rather than re-solving in each prompt (2026-04).
• Production agentic workflows require deterministic function calls, not flexible protocol-mediated tool selection; ambiguous tool choice and inferred parameters cause non-deterministic failures (2025-12).
• Safeguards must live *inside* runtime memory the agent consults; bolted-on policy after promotion fails (2026-01).
• Once shared infrastructure, a single corrupted component silently propagates bias through multi-agent networks (2026-02); promotion multiplies blast radius.
• In persistent agents, 82.9% of tokens are cache reads — promotion shifts economics from cost-per-token to cost-per-artifact (2026-05).

Anchor papers (verify; mind their dates):
• arXiv:2604.08224 (Externalization in LLM Agents, 2026-04)
• arXiv:2512.08769 (Production-Grade Agentic AI, 2025-12)
• arXiv:2603.00131 (Thought Virus, 2026-02)
• arXiv:2605.26870 (Persistent AI Agents, 2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For determinism, harness design, and embedded governance: has newer tooling (SDKs, orchestration frameworks, runtime verification) relaxed the promotion friction? Have models' code generation fidelity or tool-use consistency improved enough to permit looser scaffolding? Cite what relaxed or still holds.
(2) Surface the strongest work from the last 6 months that either contradicts the "harness-is-mandatory" view or supersedes the determinism requirement with probabilistic-safety approaches.
(3) Propose 2 research questions that assume the regime may have shifted: (a) can dynamic, runtime-learned guardrails replace static embedded governance? (b) does federated skill composition (cross-org code sharing) remain safe, or does the blast-radius problem require centralized audit?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
