Should agent capability be optimized separately from general capability?

This note explores whether the things that make an agent good at acting in the world (memory, skills, coordination, interaction time) are a separate optimization target from raw model intelligence, or just a byproduct of it.

The corpus answers with a fairly consistent yes: agent capability is its own engineering problem, distinct from making the underlying model smarter. The most striking thread is that much of what makes an agent reliable lives *outside* the model entirely. One line of work argues that reliability comes from externalizing three cognitive burdens (memory, skills, and interaction protocols) into a 'harness' layer, so the model stops re-solving the same problems on every run (see "Where does agent reliability actually come from?"). If reliability is a property of the harness rather than the weights, then scaling general capability alone won't get you there, and you'd be optimizing the wrong thing.
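To make the harness idea concrete, here is a minimal sketch, not taken from the cited work. It assumes a hypothetical `Harness` class and a `fake_model` stand-in: memory is a result cache, skills are stored procedures keyed by task kind, and the protocol is a fixed `<kind>: <payload>` task format. All names and the task format are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Model = Callable[[str], str]  # stand-in for any text-in, text-out model call


@dataclass
class Harness:
    """Hypothetical harness: holds what the model would otherwise re-derive."""
    memory: Dict[str, str] = field(default_factory=dict)                   # state persistence
    skills: Dict[str, Callable[[str], str]] = field(default_factory=dict)  # procedural components
    model_calls: int = 0                                                   # how often we fell back to the model

    def run(self, task: str, model: Model) -> str:
        # Protocol: every task must arrive as "<kind>: <payload>".
        kind, sep, payload = task.partition(":")
        if not sep:
            raise ValueError("task must look like '<kind>: <payload>'")
        payload = payload.strip()

        # Memory: reuse a stored result instead of solving the task again.
        if task in self.memory:
            return self.memory[task]

        # Skills: prefer a stored procedure over free-form generation.
        if kind in self.skills:
            result = self.skills[kind](payload)
        else:
            self.model_calls += 1
            result = model(payload)

        self.memory[task] = result
        return result


def fake_model(prompt: str) -> str:
    # Assumption: a placeholder for a real model call.
    return f"model answer to {prompt!r}"


harness = Harness(skills={"upper": str.upper})
```

In this sketch, `harness.run("upper: abc", fake_model)` is served by a skill, and repeating `harness.run("ask: why", fake_model)` reaches the model only once. Both savings come from the harness, with the model unchanged.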

The separation goes deeper than 'model vs. scaffolding.' Agent efficiency itself decomposes into three structurally independent axes (memory compression, tool learning, and planning), where improving one does not automatically improve the others, so each needs its own design attention (see "Does agent efficiency really break down into three distinct components?"). Even the way you spend compute at runtime splits in two: scaling *interaction* (more environment steps, backtracking, replanning) is a distinct lever from scaling chain-of-thought reasoning depth, and on messy partial-information tasks the interaction lever wins (see "Does agent interaction time scale separately from reasoning depth?"). These are dimensions a 'just make the model better' strategy can't reach.

The most economically pointed argument is that agent work and general capability shouldn't even share the same model. Most agentic subtasks are repetitive, well-defined language operations that small models handle at 10–30× lower cost, making heterogeneous designs (small models by default, large ones only when needed) the rational pattern (see "Can small language models handle most agent tasks?"). And as agents start holding credentials and transacting, the binding constraint shifts away from raw capability toward coordination, settlement, and auditability, which are problems no amount of model intelligence solves on its own (see "When do agents need coordination more than raw capability?"). There's even direct evidence that *separating* the optimization targets helps: SkillOS trains a dedicated skill curator decoupled from a frozen executor, and the curated repositories shift toward actionable execution logic and cross-task meta-strategies. That gain comes from optimizing the curation loop, not the executor (see "Can a separate trained curator improve skill libraries better than frozen agents?").
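The small-models-by-default pattern can be sketched as a simple escalation router. This is an illustrative assumption, not the design from the cited note: `route`, `small_stub`, `large_stub`, and the `accept` check are all hypothetical names, and a real system would need a far better acceptance test than "non-empty answer".

```python
from typing import Callable, Tuple

Model = Callable[[str], str]  # stand-in for any text-in, text-out model call


def route(
    task: str,
    small: Model,
    large: Model,
    accept: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the small model first; escalate only if its answer is rejected."""
    answer = small(task)
    if accept(answer):
        return "small", answer
    return "large", large(task)


def small_stub(task: str) -> str:
    # Assumption: the small model gives up (empty string) on tasks it cannot do.
    return "" if "prove" in task else f"small:{task}"


def large_stub(task: str) -> str:
    # Assumption: the large model always produces an answer.
    return f"large:{task}"


def non_empty(answer: str) -> bool:
    # Hypothetical acceptance check; real checks would validate content.
    return bool(answer)
```

With these stubs, `route("extract date", small_stub, large_stub, non_empty)` stays on the small model, while `route("prove lemma", small_stub, large_stub, non_empty)` escalates. The cost argument rests on how often the first branch is taken, which this sketch does not measure.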

But the corpus also resists a clean firewall between the two. Computational-graph framings show that prompts and agent coordination structure can be optimized *jointly*, with node behavior and edge connectivity tuned in the same pass. That suggests the boundary between 'general' and 'agentic' capability is sometimes an optimization seam you can dissolve rather than a wall (see "Can we automatically optimize both prompts and agent coordination?"). There's also a cautionary note: a lot of what looks like agent capability is just token spend. Around 80% of multi-agent performance variance traces to token budget, not smarter coordination, so 'optimizing agent capability' can quietly become 'paying for more tokens' if you're not measuring carefully (see "How does test-time scaling work at the agent level?").

The quiet payoff here is about *measurement*: if agent capability is genuinely separate, then evaluating agents by task success alone measures the wrong variable. The argument is that you need to score trajectory quality, memory hygiene, context efficiency, and verification cost (in effect, the harness), because a single success score creates false confidence and hides exactly the dimensions you'd be optimizing separately (see "What should we actually measure in agent evaluation?"). In other words, the case for optimizing agent capability separately is also a case for evaluating it separately, and most current benchmarks don't.
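A small sketch shows how a single success score can hide the dimensions discussed above. The `Scorecard` fields, the `hidden_gaps` helper, the tolerance, and the example numbers are all made up for illustration; they are not taken from any benchmark.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-agent scores, each in [0, 1]."""
    task_success: float
    trajectory_quality: float
    memory_hygiene: float
    verification_cost: float  # assumption: lower is better


def hidden_gaps(a: Scorecard, b: Scorecard, tol: float = 0.05) -> List[str]:
    """Name the non-success dimensions on which two agents differ by more than tol."""
    da, db = asdict(a), asdict(b)
    return [
        name
        for name in da
        if name != "task_success" and abs(da[name] - db[name]) > tol
    ]


# Illustrative numbers only: identical task success, very different behavior.
agent_a = Scorecard(task_success=0.8, trajectory_quality=0.9,
                    memory_hygiene=0.9, verification_cost=0.1)
agent_b = Scorecard(task_success=0.8, trajectory_quality=0.4,
                    memory_hygiene=0.9, verification_cost=0.6)

print(hidden_gaps(agent_a, agent_b))
# prints ['trajectory_quality', 'verification_cost']
```

Both agents would tie on a success-only leaderboard, yet they differ on two of the three harness-level dimensions.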

Sources (9 notes)

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Does agent efficiency really break down into three distinct components?

Research identifies memory compression, tool learning efficiency, and planning optimization as three structurally independent components, each with distinct cost profiles (tokens, latency, and steps). Improving one axis does not automatically improve the others, requiring holistic design.

Does agent interaction time scale separately from reasoning depth?

Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

When do agents need coordination more than raw capability?

Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.
