INQUIRING LINE

Can smaller LLMs perform tool use tasks through modular decomposition?

This explores whether smaller, cheaper LLMs can handle tool-use and agent tasks when the work is broken into modular pieces — rather than asking one big model to do everything end to end.


The corpus says yes, and it points to two different reasons why: one about cost, one about architecture. The cost argument is the bluntest: most of what an agent actually does is repetitive, well-defined language work, and small models handle those subtasks at roughly 10–30× lower cost, which makes a heterogeneous design (small models by default, large ones only when needed) the economically rational pattern rather than a compromise (Can small language models handle most agent tasks?). The architectural argument is more interesting: modularity isn't just cheaper, it changes what a small model can do at all.
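The cost argument can be made concrete with a little arithmetic. The sketch below is illustrative only: `blended_cost`, `escalation_rate`, and `cost_ratio` are hypothetical names, the 20% escalation rate is made up, and it assumes the router decides up front, so an escalated call pays only the large-model price.

```python
def blended_cost(escalation_rate: float, cost_ratio: float) -> float:
    """Average cost per call, relative to an all-large-model baseline of 1.0.

    escalation_rate: fraction of calls routed to the large model (assumed
        to be decided up front, so escalated calls pay only the large price).
    cost_ratio: how many times cheaper the small model is (the text cites
        roughly 10 to 30).
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    small_cost = 1.0 / cost_ratio
    return escalation_rate * 1.0 + (1.0 - escalation_rate) * small_cost


# Illustrative numbers only: 20% of calls escalated to the large model.
for ratio in (10, 30):
    print(f"ratio {ratio}x -> relative cost {blended_cost(0.2, ratio):.3f}")
```

Under these assumptions the escalation rate, not the price of the small model, dominates the bill: once the small model is 10× cheaper or more, most of the remaining cost comes from the calls that still go to the large one.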

The key mechanism is separation. When you split the model that *plans* a task from the model that *executes* each step, accuracy improves, and notably the decomposition skill transfers across domains while the solving skill doesn't (Does separating planning from execution improve reasoning accuracy?). LLM Programs push this further by wrapping models inside explicit algorithms that hand each call only the context relevant to its step and hide everything else, which directly addresses the capability and context-window limits that hit small models hardest (Can algorithms control LLM reasoning better than LLMs alone?). Cognitive tools take the same isolation idea and show its power vividly: four reasoning operations implemented as sandboxed calls lifted GPT-4.1's AIME 2024 score from 26.7% to 43.3% with no training at all, because enforced isolation elicits reasoning that pure prompting can't reliably trigger (Can modular cognitive tools unlock reasoning without training?).
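To make the separation idea tangible, here is a minimal sketch of a plan-then-execute loop with information hiding. It is not the method from any of the cited notes. The `LLM` callable type, `build_step_prompt`, `run_program`, and both stubs are hypothetical, and the rule "show each call only its step plus the previous result" is one simple assumption about what step-specific context could mean.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def build_step_prompt(step: str, previous_result: str) -> str:
    """Expose only the current step and the previous result, nothing else."""
    return f"Step: {step}\nPrevious result: {previous_result or '(none)'}"


def run_program(task: str, planner: LLM, executor: LLM) -> List[str]:
    """Plan once, then execute each step with step-local context."""
    plan_text = planner(f"Break this task into steps, one per line:\n{task}")
    steps = [line.strip() for line in plan_text.splitlines() if line.strip()]
    results: List[str] = []
    for step in steps:
        previous = results[-1] if results else ""
        # The executor never sees the full task or the earlier history.
        results.append(executor(build_step_prompt(step, previous)))
    return results


# Deterministic stand-ins so the sketch runs without any model.
def stub_planner(prompt: str) -> str:
    return "find the file\ncount its lines"


def stub_executor(prompt: str) -> str:
    return "ok: " + prompt.splitlines()[0]


print(run_program("Report how many lines config.txt has", stub_planner, stub_executor))
```

The control flow lives in ordinary code, so the planner and executor can be different models, and the executor's prompt stays the same size no matter how long the overall task is.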

For tool use specifically, decoupling pays off twice. ReWOO and Chain-of-Abstraction both pull reasoning apart from tool responses (planning before execution, or using abstract placeholders), which kills the quadratic prompt growth and sequential latency that otherwise crush a small model's limited context (Can reasoning and tool execution be truly decoupled?). And on the raw skill of calling functions correctly, small models can be trained to match large ones: DPO on a teacher's correct-and-incorrect examples beats plain fine-tuning precisely because the negative examples target the rigid output-format mistakes where small models stumble (Can small models match large models on function calling?). Externalizing reasoning into a structure helps too: GPT-4o mini improved by 29% on GAIA Level 3 tasks by building knowledge-graph triples instead of holding everything in its head (Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?), and recursive subtask trees with cache pruning let a single model sustain reasoning past its context limit, replacing what used to need a multi-agent system (Can recursive subtask trees overcome context window limits?).
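The placeholder idea is easier to see in code. What follows is a toy sketch in the spirit of planning before execution, not ReWOO's or Chain-of-Abstraction's actual implementation. The `#E<n>` placeholder syntax, `PlanStep`, `execute_plan`, and both toy tools are assumptions made for illustration, and the plan is written by hand where a real system would get it from a single planner call.

```python
import re
from typing import Callable, Dict, List, Tuple

# One plan entry: (evidence id, tool name, argument that may cite earlier ids).
PlanStep = Tuple[str, str, str]


def execute_plan(
    plan: List[PlanStep], tools: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Run a fully written plan, filling placeholders as evidence arrives."""
    evidence: Dict[str, str] = {}
    for evidence_id, tool_name, arg_template in plan:
        # Replace each #E<n> placeholder with the evidence collected so far.
        argument = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], arg_template)
        evidence[evidence_id] = tools[tool_name](argument)
    return evidence


# Toy tools standing in for real search or calculator calls.
toy_tools = {
    "lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
    "shout": lambda text: text.upper(),
}

# Hand-written plan; in practice the planner model would emit this in one call.
toy_plan = [
    ("#E1", "lookup", "capital of France"),
    ("#E2", "shout", "#E1 is the answer"),
]

print(execute_plan(toy_plan, toy_tools))
```

Because the plan exists before any tool runs, no model call has to re-read earlier tool outputs between steps. Only a final solver call, omitted here, would need to see the collected evidence.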

Here's the thing you might not expect: modular decomposition isn't free, and the corpus quietly marks its limits. Long delegated workflows compound silent errors: frontier models corrupt about 25% of document content over extended relay tasks, and the damage doesn't plateau (Do frontier LLMs silently corrupt documents in long workflows?). So chaining many small steps trades one risk (a weak model overwhelmed) for another (errors accumulating across hand-offs). And decomposition can't manufacture capability that isn't there: LLMs plateau around 55–60% on genuine constrained optimization regardless of scale (Do larger language models solve constrained optimization better?), and they fall back to pattern-matching memorized templates rather than actually running iterative numerical methods (Do large language models actually perform iterative optimization?). The takeaway worth leaving with: modularity works because most tool-use 'reasoning' is really orchestration, meaning routing, formatting, and step management that a small model does fine once the hard cognitive load is structured away from it. Where a task needs a capability the model simply lacks, decomposition reorganizes the failure; it doesn't remove it.


Sources (11 notes)

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can reasoning and tool execution be truly decoupled?

ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.
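A back-of-the-envelope token count shows where the quadratic term comes from. This is a simplified model, not a figure from the cited work: it assumes a fixed prefix of $c$ tokens, $n$ tool steps that each add $s$ tokens of reasoning and observation, and an interleaved loop that re-sends the whole history on every call. All three symbols are introduced here for illustration.

```latex
% Interleaved loop: call k carries the prefix plus the k-1 earlier steps.
T_{\text{interleaved}} = \sum_{k=1}^{n} \bigl(c + (k-1)s\bigr)
                       = nc + \frac{s\,n(n-1)}{2}

% Decoupled: one planner call on the prefix, one solver call that
% reads the prefix plus all n pieces of evidence.
T_{\text{decoupled}} \approx c + (c + ns) = 2c + ns
```

Under these assumptions the interleaved total grows with $n^2$ while the decoupled total grows linearly in $n$, which is the gap that matters most when the context window is small.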

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
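One way to picture the training data is as preference pairs, each setting a well-formed call against a malformed one for the same prompt. The sketch below is an assumption-heavy illustration. The note does not say how teacher outputs are labeled, so this uses a simple format check as a stand-in, and the `prompt`/`chosen`/`rejected` field names, `is_valid_call`, `build_preference_pairs`, and the toy schema are all hypothetical.

```python
import json
from typing import Dict, List, Set


def is_valid_call(text: str, schema: Dict[str, Set[str]]) -> bool:
    """True if text is a JSON call with a known name and exactly its arguments."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name, args = call.get("name"), call.get("arguments")
    return (
        isinstance(name, str)
        and name in schema
        and isinstance(args, dict)
        and set(args) == schema[name]
    )


def build_preference_pairs(
    prompt: str, candidates: List[str], schema: Dict[str, Set[str]]
) -> List[Dict[str, str]]:
    """Pair every well-formed candidate with every malformed one."""
    valid = [c for c in candidates if is_valid_call(c, schema)]
    invalid = [c for c in candidates if not is_valid_call(c, schema)]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good in valid
        for bad in invalid
    ]


# Toy schema and hand-written stand-ins for teacher outputs.
toy_schema = {"add": {"a", "b"}}
toy_candidates = [
    '{"name": "add", "arguments": {"a": 1, "b": 2}}',
    "add(1, 2)",
    '{"name": "add", "arguments": {"a": 1}}',
]
pairs = build_preference_pairs("Add 1 and 2.", toy_candidates, toy_schema)
print(len(pairs))
```

The rejected examples here are exactly the rigid-format slips the note describes: a call written as free text and a call with a missing argument.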

Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?

Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
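To get a feel for how small per-step losses add up, here is a rough worked calculation. It rests on an assumption the note does not make: that each round-trip independently preserves the same fraction of the document, so retention decays geometrically. The function names are hypothetical, and the only inputs taken from the note are the ~25% loss and the 50 round-trips.

```python
def retained_fraction(per_trip_retention: float, round_trips: int) -> float:
    """Fraction of content intact after repeated, equally lossy round-trips."""
    return per_trip_retention ** round_trips


def implied_per_trip_retention(total_retained: float, round_trips: int) -> float:
    """Per-trip retention that would produce the observed total (geometric model)."""
    return total_retained ** (1.0 / round_trips)


# ~25% degradation over 50 round-trips means ~75% retained overall.
per_trip = implied_per_trip_retention(0.75, 50)
print(f"implied loss per round-trip: {(1 - per_trip) * 100:.2f}%")
print(f"retained after 100 round-trips: {retained_fraction(per_trip, 100):.3f}")
```

Under this toy model a loss of well under 1% per hand-off is enough to reach the reported total, which is why per-step spot checks can look clean while the end-to-end document still drifts.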

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Can smaller LLMs perform tool use tasks through modular decomposition?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to be re-tested.

• Modularity (decomposer–solver separation, LLM Programs, cognitive tools as sandboxed calls) reliably improves accuracy on tool-use and reasoning tasks; cognitive tools lifted GPT-4.1's AIME from 26.7% to 43.3% (~2025).
• Small models match large ones on function calling via DPO training on a teacher's correct and incorrect examples; decoupling reasoning from tool observations (ReWOO, Chain-of-Abstraction) eliminates quadratic prompt growth (~2024–2025).
• Externalizing reasoning into knowledge graphs (triples) and recursive subtask trees with KV-cache pruning extend effective reasoning past context limits (~2025).
• Critical constraint: frontier models silently corrupt ~25% of document content over long delegated workflows; errors compound across hand-offs without plateauing (~2026).
• Hard ceiling: LLMs plateau at 55–60% on genuine constrained optimization and cannot execute iterative numerical methods, falling back to memorized templates (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2506.12115 (Cognitive Tools, ~2025)
• arXiv:2506.02153 (Small LMs Future of Agentic AI, ~2025)
• arXiv:2604.15597 (Document Corruption in Delegation, ~2026)
• arXiv:2603.23004 (Constraint Satisfaction Ceiling, ~2026)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, o3, or their successors), training methods (RL beyond DPO), tooling (new SDKs, sandboxing), or orchestration (multi-turn memory, adaptive routing) have since RELAXED or OVERTURNED it. Separate the durable question (modularity's role in scaling smaller models) from perishable limitations (error accumulation, optimization ceiling). Cite what relaxed each constraint, and flag where it still appears to hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months** — any paper showing modularity fails, or that unified large models now outperform modular small ones, or that error corruption has been solved.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If error corruption is now controlled, does modularity enable small models to match frontier performance on long-horizon tool workflows?" or "Do retrieval-augmented or verifiable-execution tools bypass the optimization plateau?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
