Why does tool use decouple factual capacity from model parameter count?
This explores why letting a model call external tools breaks the old assumption that knowing more facts requires a bigger model — and what the corpus says about where capability actually lives.
The cleanest answer in the collection is a formal one: there is a proof that facts stored *inside* a model's weights are hard-capped by its size, because each parameter can hold only so much, whereas facts retrieved *through a tool* (a lookup, a search, a database call) ride on a small, fixed circuit that does not grow with the number of facts (see "Can models store unlimited facts without growing larger?"). So the model stops being a container of facts and becomes a router to them. The same work flags a hidden cost of the alternative: cramming in more facts by fine-tuning actively overwrites what the model already knew and degrades general ability, which is another reason in-weight memorization is a bad place to scale.
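The "router, not container" idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the construction from the cited proof: `FACTS`, `lookup_tool`, and `answer` are hypothetical names, and the dictionary stands in for any external store. The only point is that the routing code stays the same size however many facts the store holds.

```python
from typing import Callable, Dict, Optional

# Hypothetical external store; stands in for a database or search index.
FACTS: Dict[str, str] = {
    "capital of France": "Paris",
    "chemical symbol for gold": "Au",
}


def lookup_tool(query: str) -> Optional[str]:
    """Toy tool: exact-match lookup in the external store."""
    return FACTS.get(query)


def answer(query: str, tool: Callable[[str], Optional[str]]) -> str:
    """Fixed-size 'router': forwards the query and returns the tool result."""
    result = tool(query)
    return result if result is not None else "unknown"


# Adding a fact grows the store, not the router.
FACTS["atomic number of carbon"] = "6"

print(answer("atomic number of carbon", lookup_tool))
print(answer("capital of Atlantis", lookup_tool))
```

In this sketch, new knowledge arrives by editing `FACTS`, with no change to `answer`. That is the loose analogue of recall scaling without touching the weights.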
The deeper pattern is that this isn't only about facts. A parallel line of work proves that giving a model tools strictly *expands what it can reason about*, not just what it can remember: some strategies are impossible or prohibitively verbose in pure text but trivial once the model can offload a step to code or a calculator (see "Do tools actually expand what language models can reason about?"). Capacity, in other words, decouples from parameter count on two fronts at once: storage and computation.
This reframes some headline failures. The much-discussed "reasoning cliff," where models collapse on hard multi-step problems, turns out to be largely an artifact of how we test them. Confined to text-only generation, a model can *know* an algorithm yet be unable to execute it at scale; hand it a tool and the cliff disappears (see "Does the reasoning cliff depend on how we test models?"). The bottleneck was never reasoning capability but execution bandwidth, the procedural grinding that weights are a wasteful place to do (see "Are reasoning model collapses really failures of reasoning?").
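To make "knowing the algorithm" versus "executing it at scale" concrete, here is a small sketch using Tower of Hanoi. The puzzle is chosen here purely as an illustration and is not named in the notes above. The procedure fits in a few lines, yet the number of moves that would have to be written out in text grows as 2^n - 1.

```python
from typing import Iterator, Tuple


def hanoi(n: int, src: str = "A", dst: str = "C", aux: str = "B") -> Iterator[Tuple[str, str]]:
    """Yield the moves that transfer n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, aux, dst)  # clear the way
    yield (src, dst)                        # move the largest disk
    yield from hanoi(n - 1, aux, dst, src)  # restack on top of it


def move_count(n: int) -> int:
    """Count the moves by actually generating them."""
    return sum(1 for _ in hanoi(n))


for n in (3, 10, 15):
    print(n, move_count(n))  # 7, 1023, 32767 moves respectively
```

A text-only model has to emit every one of those moves token by token. A model with a code tool only has to emit the short function once.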
If factual and procedural capacity live outside the weights, the engineering question shifts from "how big" to "how well do you orchestrate the outside." That's where the corpus gets practical: decoupling the model's reasoning from the tool's *responses* avoids re-feeding bulky observations through the context and lets calls run in parallel (see "Can reasoning and tool execution be truly decoupled?"), while the quality of tool use itself depends heavily on training data realism: randomly sampled, unrelated tools produce incoherent behavior, and relevance-graph sampling fixes it (see "Why does random tool sampling produce unrealistic synthetic training data?").
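The decoupling idea can be sketched as a plan written up front with placeholders that are filled in only after the tools run. This is a simplified illustration in the spirit of planning-before-execution, not the implementation of any particular system. The tool names, the `#E1`-style placeholder syntax, and the toy data are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy tools with made-up data.
TOOLS: Dict[str, Callable[[str], str]] = {
    "population": lambda city: {"CityA": "100", "CityB": "250"}.get(city, "0"),
    "add": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}

# Each step: (placeholder name, tool name, argument template).
Plan = List[Tuple[str, str, str]]

# The whole plan is produced before any tool output is seen.
plan: Plan = [
    ("#E1", "population", "CityA"),
    ("#E2", "population", "CityB"),
    ("#E3", "add", "#E1+#E2"),
]


def execute(steps: Plan) -> Dict[str, str]:
    """Run each step, substituting earlier results for placeholders."""
    evidence: Dict[str, str] = {}
    for name, tool, arg in steps:
        filled = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], arg)
        evidence[name] = TOOLS[tool](filled)
    return evidence


print(execute(plan)["#E3"])  # 350
```

The planner never reads the raw tool outputs, so they are not fed back through its context. Steps `#E1` and `#E2` do not depend on each other, so a fuller executor could run them in parallel, although this sketch runs them sequentially.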
The thing worth taking away: scaling laws made us think intelligence is something you *store*, with more parameters buying more knowledge. Tool use suggests a different architecture for capability: a modest reasoning core that knows how to *reach* for facts and computation rather than embody them. The frontier moves from the size of the model to the design of what it can call.
Sources (6 notes)
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Formal proof shows tool-integrated reasoning enables strategies impossible or prohibitively verbose in text alone, expanding both empirical and feasible support. The advantage spans abstract reasoning, not just arithmetic, and Advantage Shaping Policy Optimization stabilizes training without reward distortion.
Language models show catastrophic failure in text-only reasoning benchmarks but maintain scaling when given tool access. The cliff reflects execution constraints, not reasoning capability, making text-only evaluations systematically underestimate real-world performance.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.
Random tool sampling fails because unrelated tools cannot credibly compose, and Q&A framing ignores multi-turn dialogue coherence. ToolFlow shows that sampling tools from relevance graphs and generating with dialogue plans closes this gap.