Why does capability discovery become the bottleneck in large agent systems?

This explores why, once you have many agents, the hard part stops being building capable agents and becomes *finding* the right one for a job — and whether the corpus treats that 'discovery bottleneck' as real or as a symptom of something deeper.

Read as a question about scale, the starting point is simple: with a handful of agents you wire them together by hand, but past some threshold the limiting factor becomes knowing which agent can actually do what. The corpus suggests that bottleneck is real, but that it is a consequence of two things growing at once: the *number* of agents and the *fluidity* of what each one can do.

Start with heterogeneity. The economically rational way to build agent systems is not one big model but many small specialized ones, with large models called selectively (Can small language models handle most agent tasks?). That design choice is what creates the problem: the more varied your fleet, the less any central router can hold a hand-maintained map of who does what. This is the explicit pitch behind treating capability matching as a first-class, indexed operation. Versioned capability vectors in an HNSW index are proposed so that discovery scales sub-linearly, *precisely because* manual wiring breaks down as agent heterogeneity rises (Can semantic capability vectors replace manual agent routing?). Discovery becomes the bottleneck because the thing you're searching over got too big and too diverse to enumerate.
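To make the idea of indexed capability matching concrete, here is a minimal sketch, not the method from the cited note. It assumes each agent publishes a small card with a version, a capability embedding, a per-call cost, and a set of permitted scopes; every field name, agent name, and number is hypothetical. The linear scan stands in for an HNSW index, which would replace the scan with an approximate nearest-neighbor lookup.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class AgentCard:
    # Hypothetical schema: the source notes do not specify one.
    name: str
    version: int
    vector: tuple[float, ...]        # embedding of what the agent can do
    cost_per_call: float             # used for the budget constraint
    allowed_scopes: frozenset[str]   # used for the policy constraint


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def discover(cards, query, scope, max_cost, k=1):
    """Filter by policy and budget, then rank the survivors by similarity."""
    eligible = [
        c for c in cards
        if scope in c.allowed_scopes and c.cost_per_call <= max_cost
    ]
    # Brute-force ranking; an HNSW index would avoid scanning every card.
    ranked = sorted(eligible, key=lambda c: cosine(c.vector, query), reverse=True)
    return ranked[:k]


cards = [
    AgentCard("invoice-slm", 3, (0.9, 0.1, 0.0), 0.01, frozenset({"finance"})),
    AgentCard("general-llm", 1, (0.6, 0.6, 0.5), 0.30, frozenset({"finance", "support"})),
    AgentCard("ticket-slm", 2, (0.1, 0.9, 0.1), 0.01, frozenset({"support"})),
]

best = discover(cards, query=(1.0, 0.0, 0.0), scope="finance", max_cost=0.05)
print([c.name for c in best])
```

The design point is that semantic similarity alone is not the whole match: the large model is excluded here by budget and the ticket agent by policy before any ranking happens.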

The second pressure is that capabilities don't sit still. Agents accumulate reusable sub-task routines from past experience (Can agents learn reusable sub-task routines from past experience?), build executable skill libraries that compose simple skills into complex ones (Can agents learn new skills without forgetting old ones?), and in shared ecosystems those skills evolve across users through centralized aggregation (How can agent systems share learned skills across users?). So what an agent *can* do is a moving target. You're not indexing a fixed catalog; you're tracking a continuously changing one. That's why discovery is a standing bottleneck and not a one-time setup cost.
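A small sketch shows why a changing catalog keeps discovery from being a one-time cost. It assumes a toy skill library in which skills are plain functions, new skills are composed from existing ones, and a revision counter lets an index notice that its snapshot is out of date. The class, its methods, and the skill names are hypothetical and are not the API of any system cited above.

```python
from typing import Callable


class SkillLibrary:
    """Hypothetical skill store, for illustration only."""

    def __init__(self) -> None:
        self._skills: dict[str, Callable[[int], int]] = {}
        self.revision = 0  # bumped on every change so an index can detect staleness

    def add(self, name: str, fn: Callable[[int], int]) -> None:
        self._skills[name] = fn
        self.revision += 1

    def compose(self, name: str, parts: list[str]) -> None:
        """Register a new skill that runs existing skills in order."""
        steps = [self._skills[p] for p in parts]  # raises KeyError on an unknown part

        def run(x: int) -> int:
            for step in steps:
                x = step(x)
            return x

        self.add(name, run)

    def call(self, name: str, x: int) -> int:
        return self._skills[name](x)


lib = SkillLibrary()
lib.add("double", lambda x: x * 2)
lib.add("increment", lambda x: x + 1)
indexed_at = lib.revision  # a discovery index snapshots the catalog here

# The agent learns a composite skill after the index was built.
lib.compose("double_then_increment", ["double", "increment"])
index_is_stale = lib.revision != indexed_at
result = lib.call("double_then_increment", 5)
print(index_is_stale, result)
```

Any index built at `indexed_at` no longer describes what the agent can do, so discovery has to re-index or subscribe to changes rather than read a fixed catalog.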

But here's the turn the corpus offers: capability discovery may be the *visible* bottleneck without being the *binding* one. Several notes argue that once agents become social and economic actors, raw capability stops being the constraint and coordination, settlement, and auditable trust take over (When do agents need coordination more than raw capability?). Historical analysis finds that capable agents stall for want of ecosystem conditions such as trustworthiness, standardization, and social acceptability, rather than because of capability gaps (Why do capable AI agents still fail in real deployments?). And even when agents find each other, coordination degrades predictably with network scale: agents agree too late or adopt strategies without telling neighbors, and they accept information from peers without verification (Why do multi-agent systems fail to coordinate at scale?). In that light, 'finding the right capability' is the easy half; trusting and orchestrating it is the hard half.

The thing you might not have known you wanted: the deepest framing here is that discovery, memory, and coordination may all be the same structural pressure wearing different masks. Reliability comes from externalizing memory, skills, and protocols into a harness layer instead of cramming everything into the model (Where does agent reliability actually come from?), and efficiency techniques across memory, tool use, and planning independently converge on the same principles: bound your context, minimize external calls, control your search (Do efficiency techniques across agent components reveal shared structural constraints?). Capability discovery is exactly a 'minimize and control the search' problem. So it becomes the bottleneck not because matching is uniquely hard, but because scaling agents turns *everything* into a search-and-coordination problem the model can't hold in its head, and discovery is where that pressure surfaces first.

Sources 10 notes

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can semantic capability vectors replace manual agent routing?

Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

How can agent systems share learned skills across users?

SkillClaw aggregates interaction trajectories across users, processes them through an autonomous evolver that identifies patterns and refines skills, then synchronizes updates system-wide. This converts siloed individual learning into shared capability improvement without manual curation.

When do agents need coordination more than raw capability?

Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.

Why do capable AI agents still fail in real deployments?

Historical analysis from GPS to modern AI shows agent failures consistently result from absent ecosystem conditions—value generation, personalization, trustworthiness, social acceptability, and standardization—rather than capability gaps. Even highly capable systems stall without these five conditions.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.
