Can models decide better than retrievers which tools to use?
Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.
MCP-Zero's structural argument is that retrieval-based tool injection (matching the user query to tools by semantic similarity and injecting only the matches) fails on realistic agent tasks for three reasons. First, retrieval is passive: an external system selects tools from the initial query instead of letting the model express its evolving needs as it reasons through the task. Second, colloquial user inputs are semantically misaligned with formal API documentation, and this distributional mismatch reduces retrieval precision. Third, retrieval is single-round: it happens once per query, so it cannot accommodate progressive refinement of subtask requirements or correct course when the initial retrieval proves inadequate.
A query like "Debug the file" needs filesystem tools, code-generation tools, and command-execution tools — three different domains that no single semantic match against the initial query can identify, because the requirements only become clear as the model reasons.
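A toy illustration of that limitation, as a minimal sketch: the three-tool catalog is hypothetical, and a crude token-overlap score stands in for embedding similarity (neither comes from MCP-Zero). A single match against the initial query surfaces only the filesystem tool, while the needs that appear during reasoning each match a different domain.

```python
import re

# Hypothetical tool catalog; names and descriptions are illustrative only.
CATALOG = {
    "read_file": "read the contents of a file from the filesystem",
    "generate_patch": "generate a code patch that fixes a reported error",
    "run_command": "execute a shell command and return its output",
}

def tokens(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(query, description):
    """Jaccard overlap of token sets; a crude stand-in for cosine similarity."""
    q, d = tokens(query), tokens(description)
    return len(q & d) / len(q | d)

def retrieve_once(query, k=1):
    """Single-round retrieval: rank the catalog against one query, keep top k."""
    ranked = sorted(CATALOG, key=lambda name: overlap(query, CATALOG[name]), reverse=True)
    return ranked[:k]

# One retrieval against the initial query.
initial = retrieve_once("Debug the file")

# Needs that only become explicit as the task is worked through (assumed here).
later = [retrieve_once(need)[0] for need in [
    "read the file contents",
    "generate a patch for the error",
    "execute a command to run the tests",
]]

print(initial)  # only the filesystem tool
print(later)    # one tool per domain
```

In this toy setup the other two tools share no tokens with the initial query at all, so raising k would not help a real ranker distinguish them from irrelevant tools.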
MCP-Zero's response inverts the direction with three mechanisms. Proactive Tool Request: the model emits a structured <tool_assistant>server: ... tool: ...</tool_assistant> block specifying what it needs in API-aligned vocabulary, which bypasses the colloquial-to-formal mismatch. Hierarchical Vector Routing: coarse-to-fine retrieval first selects candidate servers, then ranks the tools within them; only the top-k tool descriptions are returned, which reduces context overhead. Iterative Proactive Invocation: the model can issue multiple tool requests across the conversation for different subtasks, building a cross-domain toolchain progressively and revising a request if the returned tools are insufficient.
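The first two mechanisms can be sketched as follows, under stated assumptions: the registry contents, the bag-of-words `embed` function, and the plain cosine ranking are all hypothetical stand-ins, not MCP-Zero's actual embedding model or scoring rule. The parser accepts the tag written with either a space or an underscore.

```python
import math
import re

# Hypothetical registry: server -> (server description, {tool: description}).
REGISTRY = {
    "filesystem": ("read and write files on local disk", {
        "read_file": "read the contents of a file",
        "write_file": "write text to a file",
    }),
    "shell": ("execute shell commands in a terminal", {
        "run_command": "execute a shell command and capture output",
    }),
}

def embed(text):
    """Toy bag-of-words vector; a real system would use an embedding model."""
    vec = {}
    for tok in re.findall(r"[a-z]+", text.lower()):
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

REQUEST_RE = re.compile(
    r"<tool[ _]assistant>\s*server:\s*(.*?)\s*tool:\s*(.*?)\s*</tool[ _]assistant>",
    re.DOTALL,
)

def parse_request(model_output):
    """Extract (server need, tool need) from the model's request block, if any."""
    m = REQUEST_RE.search(model_output)
    return (m.group(1), m.group(2)) if m else None

def route(server_need, tool_need, top_servers=1, top_k=1):
    """Coarse-to-fine: pick candidate servers first, then rank tools inside them."""
    s_vec, t_vec = embed(server_need), embed(tool_need)
    servers = sorted(
        REGISTRY, key=lambda s: cosine(s_vec, embed(REGISTRY[s][0])), reverse=True
    )[:top_servers]
    candidates = [
        (cosine(t_vec, embed(desc)), server, tool)
        for server in servers
        for tool, desc in REGISTRY[server][1].items()
    ]
    candidates.sort(reverse=True)
    return [(server, tool) for _, server, tool in candidates[:top_k]]

output = (
    "I need to inspect the file first.\n"
    "<tool_assistant>\nserver: read files on local disk\n"
    "tool: read the contents of a file\n</tool_assistant>"
)
request = parse_request(output)
matches = route(*request)
print(request, matches)
```

Because the request is phrased in the registry's own vocabulary, even this crude similarity measure routes it correctly; only the top-k matches would be injected into the model's context.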
The deeper move is to return the authority of tool requirement specification to the LLM itself, leveraging the chain-of-thought, self-reflection, and planning abilities that modern models already have. The implication is that in ecosystems with thousands of tools, the retrieval system should be a service the model calls, not a gatekeeper that pre-selects what the model is allowed to consider. This is the same architectural move as in the note "Will agents compete for attention just like users do?", viewed from the supply side: tools become services that agents discover and invoke, not options pre-selected by an upstream retriever.
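A minimal sketch of retrieval as a service the model calls, round by round. Everything here is an assumption made for illustration: the tool index, the word-overlap search, and the scripted `needs` and `rephrasings` that stand in for what an LLM would generate while reasoning.

```python
# Hypothetical tool index searched by the registry service.
TOOL_INDEX = {
    "read_file": "read file contents from disk",
    "generate_patch": "generate code patch",
    "run_command": "run shell command",
}

def registry_search(need):
    """Service endpoint: best tool by shared words, or None when nothing matches."""
    words = set(need.lower().split())
    best = max(TOOL_INDEX, key=lambda name: len(words & set(TOOL_INDEX[name].split())))
    return best if words & set(TOOL_INDEX[best].split()) else None

def build_toolchain(needs, rephrasings, max_rounds=10):
    """Issue one request per round; on a miss, retry with a revised request."""
    toolchain, pending, rounds = [], list(needs), 0
    while pending and rounds < max_rounds:
        rounds += 1
        need = pending.pop(0)
        tool = registry_search(need)
        if tool is None and need in rephrasings:
            # The model revises its request instead of accepting a bad result.
            pending.insert(0, rephrasings[need])
            continue
        if tool is not None:
            toolchain.append(tool)
    return toolchain, rounds

# Scripted stand-ins for model output: three subtask needs, one revision.
needs = ["inspect the source", "generate a patch", "run the tests in a shell"]
rephrasings = {"inspect the source": "read the file contents"}

toolchain, rounds = build_toolchain(needs, rephrasings)
print(toolchain, rounds)
```

The first request misses, is rephrased, and succeeds on the next round, so the three-domain toolchain takes four rounds to assemble. The registry never decides what the model may consider; it only answers requests.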
Inquiring lines that use this note as a source (20)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Should LLMs query users back when presented with under-specified scenarios?
- What makes proactive tool retrieval better than single-round semantic matching?
- How should moderator LLMs decide which speakers to query per topic?
- Should retrieval be triggered always or only for difficult questions?
- Do different domains require different types of model investment?
- How can smaller models help select useful data for larger models?
- Could eliminating retrieval entirely work better than shifting the burden?
- How does semantic mismatch between user language and API documentation degrade tool retrieval?
- When should a system decide to retrieve versus reason alone?
- How does semantic clustering help decide which model handles each query?
- How do agents discover and select which tools to invoke?
- Can models retrieve the right tool without relying on vector similarity?
- Should retrieval be triggered by model uncertainty or fixed intervals?
- How should retrieval and verification tasks be separated architecturally?
- How should retrieval systems decide when to fetch new information?
- What role does document reranking play alongside decisions about whether to retrieve?
- Which model capabilities actually matter for sustained workflow delegation?
- How should retrieval triggers use model uncertainty instead of fixed intervals?
- What are the 27 external features that predict retrieval need?
- Can retrieval systems decide when to retrieve instead of always querying?
Related concepts in this collection (5)
- Where do traditional function calling systems actually break down?
Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points—retrieval, context bloat, and output rigidity—that together explain why even the best models struggle.
extends: Floworks names retrieval as one bottleneck; MCP-Zero argues retrieval is the wrong primitive entirely — replace passive retrieval with model-initiated proactive tool requests.
- Will agents compete for attention just like users do?
As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.
complements: MCP-Zero is the supply-side mechanism — tools as services agents query — that the agent attention economy assumes.
- Can models learn to ask clarifying questions instead of guessing?
Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
extends: same proactivity move applied to a different domain — instead of asking the user for missing input, the model asks the tool registry for missing capability.
- Can reasoning and tool execution be truly decoupled?
Can LLM reasoning be separated from tool observations to eliminate redundant re-prompting and enable parallel execution? Two recent architectures suggest yes, but what are the tradeoffs?
complements: ReWOO/CoA decouple reasoning from tool execution at the inference layer; MCP-Zero decouples tool retrieval from query semantics at the discovery layer. Both argue for separating concerns at different points in the agent stack.
- Why do capable AI agents still fail in real deployments?
Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.
extends: standardization is one of the five ecosystem conditions; MCP itself is the standardization layer this paper builds on.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- MCP-Zero: Proactive Toolchain Construction for LLM Agents from Scratch
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent
- Efficient Tool Use with Chain-of-Abstraction Reasoning
- Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
- Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations
- Deep Research: A Systematic Survey
- Eliciting Reasoning in Language Models with Cognitive Tools
- QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Original note title
proactive tool retrieval lets the model itself decide when and which tools to fetch — replacing passive single-round semantic matching with iterative cross-domain toolchain construction