SYNTHESIS NOTE

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Synthesis note · 2026-05-03 · sourced from Tool Computer Use

MCP-Zero's structural argument is that retrieval-based tool injection (matching the user query to relevant tools via semantic similarity and injecting only those) fails on realistic agent tasks for three specific reasons. First, retrieval is passive: an external retrieval system selects tools based on the initial query rather than letting the model express its evolving needs as it reasons through the task. Second, colloquial user inputs are semantically misaligned with formal API documentation, and this distributional mismatch reduces retrieval precision. Third, retrieval is single-round: it happens once per query, so it can neither accommodate progressive refinement of subtask requirements nor correct course when the initial retrievals prove inadequate.

A query like "Debug the file" needs filesystem tools, code-generation tools, and command-execution tools — three different domains that no single semantic match against the initial query can identify, because the requirements only become clear as the model reasons.
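The single-round failure can be made concrete with a toy baseline. The sketch below is not MCP-Zero's code or any real retriever: the tool names and descriptions are hypothetical, and plain token overlap (Jaccard) stands in for embedding similarity. A real embedding model may rank tools differently, but the structural point is the same: selection runs once, against the initial query only.

```python
import re

# Hypothetical tool catalog; names and descriptions are illustrative only.
TOOLS = {
    "read_file": "Read the contents of a file from the filesystem",
    "write_file": "Write text to a file on the filesystem",
    "generate_patch": "Generate a code patch that fixes a reported bug",
    "run_command": "Execute a shell command and return its output",
}

def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Token-overlap similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_once(query, k=2):
    """Passive, single-round retrieval: rank every tool against the
    initial query and return the top k. Nothing is re-ranked later."""
    q = tokens(query)
    ranked = sorted(
        TOOLS,
        key=lambda name: jaccard(q, tokens(TOOLS[name])),
        reverse=True,
    )
    return ranked[:k]

selected = retrieve_once("Debug the file")
print(selected)
```

In this toy setup the query shares words only with the filesystem descriptions, so the code-generation and command-execution tools never reach the model, and there is no later round in which they could.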

MCP-Zero's response inverts the direction with three components. Proactive Tool Request: the model emits a structured <tool_assistant>server: ... tool: ...</tool_assistant> block specifying what it needs in API-aligned vocabulary, which bypasses the colloquial-to-formal mismatch. Hierarchical Vector Routing: a coarse-to-fine retrieval first selects candidate servers, then ranks the tools within them; only the top-k tool descriptions are returned, which reduces context overhead. Iterative Proactive Invocation: the model can initiate multiple tool requests across the conversation for different subtasks, building a cross-domain toolchain progressively, and can revise a request if the returned tools are insufficient.
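The first two components can be sketched together: parse the request block out of the model's output, then route it coarse-to-fine. This is a minimal illustration under stated assumptions, not the paper's implementation. The registry contents, the function names, and the underscore spelling of the tag in the regex are all assumptions, and token overlap again stands in for vector similarity.

```python
import re

# Hypothetical registry: server -> (server description, {tool: description}).
REGISTRY = {
    "filesystem": ("Local file access server", {
        "read_file": "Read the contents of a file",
        "list_directory": "List entries in a directory",
    }),
    "shell": ("Command execution server", {
        "run_command": "Execute a shell command and return output",
    }),
}

# Assumed tag spelling; adjust if the real format differs.
REQUEST_RE = re.compile(
    r"<tool_assistant>\s*server:\s*(?P<server>.*?)\s*"
    r"tool:\s*(?P<tool>.*?)\s*</tool_assistant>",
    re.DOTALL,
)

def parse_request(model_output):
    """Return (server_text, tool_text) from the first request block, or None."""
    m = REQUEST_RE.search(model_output)
    return (m.group("server"), m.group("tool")) if m else None

def overlap(a, b):
    """Jaccard token overlap; a stand-in for embedding similarity."""
    ta = set(re.findall(r"[a-z]+", a.lower()))
    tb = set(re.findall(r"[a-z]+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def route(server_text, tool_text, top_servers=1, top_k=1):
    """Coarse stage picks candidate servers; fine stage ranks their tools."""
    servers = sorted(
        REGISTRY,
        key=lambda s: overlap(server_text, REGISTRY[s][0]),
        reverse=True,
    )[:top_servers]
    candidates = [
        (s, tool, desc)
        for s in servers
        for tool, desc in REGISTRY[s][1].items()
    ]
    candidates.sort(key=lambda c: overlap(tool_text, c[2]), reverse=True)
    # Only the top-k matches go back into the model's context.
    return [(s, tool) for s, tool, _ in candidates[:top_k]]

output = (
    "I need to inspect the file first.\n"
    "<tool_assistant>\n"
    "server: file access server for the local disk\n"
    "tool: read the contents of a file\n"
    "</tool_assistant>"
)
request = parse_request(output)
matches = route(*request)
print(request, matches)
```

Because the request is phrased by the model in documentation-like vocabulary, the match runs against text that resembles the tool descriptions rather than the user's colloquial wording.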

The deeper move is to return authority over specifying tool requirements to the LLM itself, leveraging the chain-of-thought, self-reflection, and planning abilities that modern models already have. The implication is that in ecosystems with thousands of tools, the retrieval system should be a service the model calls, not a gatekeeper that pre-selects what the model is allowed to consider. This is the same architectural move as in "Will agents compete for attention just like users do?", viewed from the supply side: tools become services that agents discover and invoke, not options pre-selected by an upstream retriever.
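The "service, not gatekeeper" framing can be sketched as a control loop in which the model drives each lookup. Everything here is hypothetical: the catalog, the scripted stand-in for an LLM, and the fallback rule exist only to show the shape of iterative, revisable requests, and none of it is taken from MCP-Zero.

```python
# Hypothetical capability index keyed by domain.
CATALOG = {
    "filesystem": ["read_file"],
    "codegen": ["generate_patch"],
    "shell": ["run_command"],
}

def lookup(domain):
    """Retrieval as a service: answers whatever the model asks for,
    returning an empty list when nothing matches."""
    return CATALOG.get(domain, [])

# Scripted stand-in for an LLM's evolving needs on "Debug the file".
PLAN = ["filesystem", "codegen", "terminal"]
FALLBACK = {"terminal": "shell"}  # how the stand-in rephrases a failed request

def scripted_model(history):
    """Return the next request, or None when no further tools are needed.
    history is a list of (request, returned_tools) pairs."""
    if history:
        last_request, last_tools = history[-1]
        if not last_tools and last_request in FALLBACK:
            return FALLBACK[last_request]  # revise an unsuccessful request
    asked = {request for request, _ in history}
    for need in PLAN:
        if need not in asked:
            return need
    return None

def run_agent(max_rounds=8):
    """The model calls the retrieval service as often as it needs."""
    history, toolchain = [], []
    for _ in range(max_rounds):
        request = scripted_model(history)
        if request is None:
            break
        tools = lookup(request)
        history.append((request, tools))
        toolchain.extend(tools)
    return toolchain, history

toolchain, history = run_agent()
print(toolchain)
print([request for request, _ in history])
```

The request for "terminal" returns nothing, so the stand-in rephrases it as "shell" on the next round. That recovery step is what a single upfront retrieval cannot offer.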

Related concepts in this collection 5

Where do traditional function calling systems actually break down? Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points—retrieval, context bloat, and output rigidity—that together explain why even the best models struggle.
extends: Floworks names retrieval as one bottleneck; MCP-Zero argues retrieval is the wrong primitive entirely — replace passive retrieval with model-initiated proactive tool requests.
Will agents compete for attention just like users do? As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.
complements: MCP-Zero is the supply-side mechanism — tools as services agents query — that the agent attention economy assumes.
Can models learn to ask clarifying questions instead of guessing? Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
extends: same proactivity move applied to a different domain — instead of asking the user for missing input, the model asks the tool registry for missing capability.
Can reasoning and tool execution be truly decoupled? Can LLM reasoning be separated from tool observations to eliminate redundant re-prompting and enable parallel execution? Two recent architectures suggest yes, but what are the tradeoffs?
complements: ReWOO/CoA decouple reasoning from tool execution at the inference layer; MCP-Zero decouples tool retrieval from query semantics at the discovery layer. Both argue for separating concerns at different points in the agent stack.
Why do capable AI agents still fail in real deployments? Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.
extends: standardization is one of the five ecosystem conditions; MCP itself is the standardization layer this paper builds on.