Can construction-time routing and runtime agent pruning be combined effectively?

This explores whether two different levers — deciding an agent system's shape up front (routing topology, which models, how many agents) and trimming the system as it runs (pruning agents, caches, or memory mid-task) — can work together, and what the corpus says about each half.

The question is whether you can fix an agent system's structure *before* it runs and still trim it *while* it runs: two levers pulled at two different moments. The corpus treats these as largely separate research lines, which is itself the interesting finding: most work optimizes one and assumes the other is fixed. The construction-time camp asks 'what architecture should I build for this query?' The runtime camp asks 'what can I throw away now that I'm executing?' Few notes do both, and reading them against each other suggests why combining them is harder than it looks.

On the construction side, the strongest signal is that routing is no longer a single yes/no switch. What decisions must multi-agent routing systems optimize simultaneously? (MasRouter) argues you must jointly decide collaboration topology, agent count, role allocation, *and* per-agent model assignment, and reports that doing so beats single-model routing by 3.51% accuracy while cutting HumanEval costs by 49%. Can AI systems design unique multi-agent workflows per individual query? (FlowReasoner) pushes further: a meta-agent trained with RL and external execution feedback generates a bespoke architecture per query. And Can we automatically optimize both prompts and agent coordination? reframes the whole thing as a graph you can optimize on two axes, node prompts and edge connectivity. That is essentially the formal substrate where 'routing' and 'pruning' become inverse edits on the same object: adding versus removing edges and nodes. That graph view is the closest the corpus comes to a unified answer to your question.
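The graph framing can be made concrete with a minimal sketch. Everything here is hypothetical: the `AgentGraph` class, the agent names, and the model labels are illustrative assumptions, not taken from MasRouter, FlowReasoner, or the computational-graph note. The only point is that construction-time routing (adding nodes and edges) and runtime pruning (removing them) edit the same object, so one edit can invalidate the other.

```python
from dataclasses import dataclass, field


@dataclass
class AgentGraph:
    """Hypothetical agent system: nodes are agents, edges are message flow."""
    nodes: dict[str, str] = field(default_factory=dict)  # agent name -> model label
    edges: set[tuple[str, str]] = field(default_factory=set)  # (sender, receiver)

    def add_agent(self, name: str, model: str, listens_to: tuple[str, ...] = ()) -> None:
        """Construction-time routing: add a node plus its incoming edges."""
        for src in listens_to:
            if src not in self.nodes:
                raise KeyError(f"unknown upstream agent: {src}")
        self.nodes[name] = model
        for src in listens_to:
            self.edges.add((src, name))

    def prune_agent(self, name: str) -> None:
        """Runtime pruning: remove a node and every edge that touches it."""
        del self.nodes[name]
        self.edges = {(s, d) for (s, d) in self.edges if name not in (s, d)}

    def orphans(self, sources: set[str]) -> list[str]:
        """Agents left with no input that are not designated entry points."""
        has_input = {d for _, d in self.edges}
        return sorted(n for n in self.nodes if n not in has_input and n not in sources)


graph = AgentGraph()
graph.add_agent("planner", "small-model")
graph.add_agent("coder", "large-model", listens_to=("planner",))
graph.add_agent("reviewer", "small-model", listens_to=("coder",))

graph.prune_agent("coder")                 # runtime decides the coder is expendable
print(graph.orphans(sources={"planner"}))  # the reviewer has lost its only input
```

In this toy case the pruning step is valid on its own, yet it strands an agent the construction step relied on, which is why a system doing both would need some re-routing or validation pass after each prune.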

On the runtime side, pruning shows up as cache and memory management rather than killing whole agents. Can recursive subtask trees overcome context window limits? (Thread Inference Model) prunes up to 90% of the KV cache mid-reasoning and still stays accurate — and pointedly argues a single model doing this internally can *replace* a multi-agent system. Can agents compress their own memory without losing critical details? has agents fold their own history into compact schemas as they go. Can shared-prefix trees reduce redundancy in agent rollouts? prunes redundancy by branching from shared prefixes instead of sampling independent chains. These are all 'runtime trimming,' but notice the tension: TIM's pruning argues *against* the multi-agent structures that the routing papers spend their effort constructing.
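The shared-prefix claim is easiest to see as arithmetic. The sketch below assumes an idealized case that the notes do not specify: every trajectory has the same length, and all branches share one common prefix. The function names and the example numbers are illustrative assumptions, not figures from the corpus.

```python
def trajectories_independent(budget: int, length: int) -> int:
    """Independent chains: each trajectory pays for all of its tokens."""
    return budget // length


def trajectories_shared_prefix(budget: int, length: int, prefix: int) -> int:
    """One shared prefix, paid once; each branch pays only for its own suffix."""
    if not 0 <= prefix < length:
        raise ValueError("prefix must be shorter than the full trajectory")
    if budget < length:
        return 0  # cannot afford even one complete trajectory
    return (budget - prefix) // (length - prefix)


# Assumed numbers: 5000-token budget, 1000-token trajectories, 600-token shared prefix.
print(trajectories_independent(5000, 1000))         # 5
print(trajectories_shared_prefix(5000, 1000, 600))  # 11
```

Under these assumptions the same budget yields roughly twice as many distinct trajectories, which is the sense in which branching from shared prefixes counts as runtime trimming.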

That tension is the catch in 'combined effectively.' How does test-time scaling work at the agent level? reports that ~80% of multi-agent performance variance comes from raw token spend, not coordination cleverness — which implies the highest-leverage move isn't elaborate construction-time routing *or* surgical runtime pruning but simply governing the token budget both decisions feed into. Pair that with Can small language models handle most agent tasks? (use cheap SLMs by default, escalate selectively) and Can routers select the right model before generation happens? (pick the model before generating, not after), and a coherent combined recipe emerges: route cheaply and heterogeneously at construction, then let the system prune its own cache and memory at runtime — with both stages answerable to one shared budget rather than optimized in isolation.
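The recipe can be sketched as two functions that answer to one budget object. This is a minimal illustration under stated assumptions: the `TokenBudget` class, the difficulty threshold, the per-model token costs, and the recency-based `prune_history` are all hypothetical. In particular, recency truncation is a simple stand-in and is not the rule-based KV cache pruning or memory folding described in the notes.

```python
class TokenBudget:
    """Single budget shared by construction-time and runtime decisions."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.total - self.spent

    def charge(self, tokens: int) -> None:
        if tokens > self.remaining:
            raise RuntimeError("token budget exceeded")
        self.spent += tokens


def route(difficulty: float, budget: TokenBudget,
          threshold: float = 0.7, large_cost: int = 2000) -> str:
    """Construction time: choose a model before generating anything.

    Defaults to the small model; escalates only when the predicted
    difficulty is high AND the shared budget can still afford it.
    """
    if difficulty >= threshold and budget.remaining >= large_cost:
        return "large"
    return "small"


def prune_history(history: list[tuple[str, int]], allowance: int) -> list[tuple[str, int]]:
    """Runtime: keep the most recent (text, tokens) entries that fit the allowance."""
    kept, used = [], 0
    for text, tokens in reversed(history):
        if used + tokens > allowance:
            break
        kept.append((text, tokens))
        used += tokens
    kept.reverse()
    return kept


budget = TokenBudget(total=3000)
model = route(difficulty=0.9, budget=budget)   # hard query, budget allows escalation
budget.charge(2000)                            # assumed cost of the large-model call
history = [("plan", 600), ("draft", 500), ("critique", 400)]
history = prune_history(history, allowance=budget.remaining)
```

Because both functions read `budget.remaining`, an expensive routing choice automatically tightens how much memory the runtime may keep, and a depleted budget blocks further escalation. That coupling is the design choice the paragraph above argues for; whether it is stable in practice is exactly what the corpus leaves open.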

The honest gap: the corpus has rich material on each half but no note that demonstrates the *interaction*. None shows what happens when an aggressively pruned runtime undercuts a carefully routed topology, or when capability discovery in the style of Can semantic capability vectors replace manual agent routing? has to re-route after pruning removes an agent it counted on. The computational-graph framing says they *should* compose because they're operations on the same object; nobody here has shown they compose *effectively* under load. That's the takeaway: in theory the two levers are the same lever, and the open question is whether pulling both at once is stable.

Sources 10 notes

What decisions must multi-agent routing systems optimize simultaneously?

MasRouter shows that routing in multi-agent systems must jointly optimize collaboration topology, agent count, role allocation, and per-agent LLM assignment through a cascaded controller. This unified approach surpasses single-model routing by 3.51% accuracy while cutting HumanEval costs by 49%.

Can AI systems design unique multi-agent workflows per individual query?

FlowReasoner demonstrates that meta-agents trained with reinforcement learning and external execution feedback can generate unique multi-agent architectures for each user query, optimizing across performance, complexity, and efficiency—moving beyond fixed task-level workflow templates.

Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can shared-prefix trees reduce redundancy in agent rollouts?

Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.

Can semantic capability vectors replace manual agent routing?

Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.
