Can breaking function calling into subtasks improve model generalization?
Does training on seven granular function-calling sub-tasks instead of one umbrella objective close the gap between open-source and proprietary models? This note explores whether decomposition surfaces failure modes that unified training misses.
The diagnosis behind Granite-20B-FunctionCalling is that "function calling" as a training target is too coarse. Models fine-tuned on umbrella function-calling datasets like ToolLLM, ToolAlpaca, and Gorilla underperform along three dimensions: they fail to generalize out-of-domain, they handle the granular sub-tasks poorly when those are tested in isolation, and they trail proprietary models like GPT, Claude, and Gemini. The pattern suggests that what looks like one capability is actually seven loosely coupled ones.
Granite's response is to make the seven explicit and train across all of them as separate tasks: (1) Nested Function Calling — using one function's output as another's input; (2) Function Chaining — sequencing dependent calls; (3) Parallel Functions — invoking multiple independent calls; (4) Function Name Detection — picking the right function from a set; (5) Parameter-Value Pair Detection — slot filling against a schema; (6) Next-Best Function — selecting the next call given partial state; (7) Response Generation — composing the user-facing reply from tool outputs.
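To make the decomposition concrete, here is a minimal sketch of how one annotated example could be split into several granular training instances. The record layout, the task labels, the `derive_instances` helper, and the tool names are all assumptions made for illustration; this is not the data format or pipeline used for Granite.

```python
# Illustrative only: the record layout and task labels are assumptions,
# not the data format used to train Granite-20B-FunctionCalling.

def derive_instances(query, gold_calls, reply):
    """Split one annotated example into granular sub-task instances."""
    instances = []
    # Function Name Detection: predict only which functions are needed.
    instances.append({
        "task": "function_name_detection",
        "input": query,
        "target": [call["name"] for call in gold_calls],
    })
    # Parameter-Value Pair Detection: predict only the argument slots.
    instances.append({
        "task": "parameter_value_pair_detection",
        "input": query,
        "target": [call["arguments"] for call in gold_calls],
    })
    # Next-Best Function: given the calls made so far, predict the next name.
    for k, call in enumerate(gold_calls):
        instances.append({
            "task": "next_best_function",
            "input": {"query": query,
                      "history": [c["name"] for c in gold_calls[:k]]},
            "target": call["name"],
        })
    # Response Generation: compose the user-facing reply from the calls.
    instances.append({
        "task": "response_generation",
        "input": {"query": query, "calls": gold_calls},
        "target": reply,
    })
    return instances


# Hypothetical annotated example; the tool names and the "$..." reference
# syntax for reusing an earlier output are made up for this sketch.
example_calls = [
    {"name": "get_coordinates", "arguments": {"city": "Lisbon"}},
    {"name": "get_weather",
     "arguments": {"lat": "$get_coordinates.lat",
                   "lon": "$get_coordinates.lon"}},
]
instances = derive_instances(
    "What is the weather in Lisbon?",
    example_calls,
    "<reply composed from the tool outputs>",
)
for inst in instances:
    print(inst["task"], "->", inst["target"])
```

Each derived instance isolates one decision (which function, which arguments, what comes next, how to answer), which is the sense in which the sub-tasks can be trained and checked separately.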
The structural claim is that an instruction-tuning mixture across granular sub-tasks generalizes better than a single umbrella objective, because each sub-task surfaces different failure modes during training. A model that has explicitly practiced nested calls, parallel calls, and chaining is more likely to learn how they compose, rather than emitting tokens that look like function calls but are structurally incorrect.
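A multi-task mixture of this kind can be sketched as weighted sampling over per-sub-task datasets. The sketch below is only an assumption about how such a mixture might be drawn: the note does not state Granite's mixing proportions, so the uniform weights, the `sample_mixture` helper, and the placeholder examples are all hypothetical.

```python
import random

# The seven sub-task names come from the note; everything else is assumed.
SUBTASKS = [
    "nested_function_calling",
    "function_chaining",
    "parallel_functions",
    "function_name_detection",
    "parameter_value_pair_detection",
    "next_best_function",
    "response_generation",
]


def sample_mixture(datasets, weights, n, seed=0):
    """Draw n (task, example) pairs, picking the task by weight each time."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    tasks = list(datasets)
    task_weights = [weights[task] for task in tasks]
    batch = []
    for task in rng.choices(tasks, weights=task_weights, k=n):
        batch.append((task, rng.choice(datasets[task])))
    return batch


# Placeholder data: three dummy examples per sub-task.
datasets = {t: [f"{t}-example-{i}" for i in range(3)] for t in SUBTASKS}
# Uniform weights are an assumption, not the paper's mixing ratio.
weights = {t: 1.0 for t in SUBTASKS}

batch = sample_mixture(datasets, weights, n=14, seed=0)
for task, example in batch:
    print(task, example)
```

Raising the weight of one sub-task is the obvious knob for giving a weak capability more training signal without changing the rest of the mixture.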
The implication for capability evaluation: a single function-calling benchmark score can mislead. A model can be strong on call-statement generation while failing on parameter slot-filling or next-best-function selection, and the average masks where the failure lies. The right unit of evaluation, and of training, is the sub-task, not the umbrella. This connects directly to the note "Where do traditional function calling systems actually break down?": Floworks names three independent failure points; Granite implicitly says there are seven, all training-addressable through multi-task decomposition.
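The masking effect is easy to see with per-sub-task scoring. The sketch below uses synthetic toy outcomes, invented purely to show the arithmetic; they are not measurements of Granite or any other model, and the helper names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_task(results):
    """Return accuracy per sub-task from (task, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, is_correct in results:
        totals[task] += 1
        correct[task] += int(is_correct)
    return {task: correct[task] / totals[task] for task in totals}


def overall_accuracy(results):
    """Return the single pooled accuracy over all examples."""
    return sum(int(ok) for _, ok in results) / len(results)


def toy_results():
    """Build synthetic outcomes: (number correct, number total) per task."""
    spec = {
        "function_name_detection": (9, 10),
        "parameter_value_pair_detection": (4, 10),
        "next_best_function": (8, 10),
    }
    results = []
    for task, (n_correct, n_total) in spec.items():
        results += [(task, True)] * n_correct
        results += [(task, False)] * (n_total - n_correct)
    return results


results = toy_results()
print("overall:", overall_accuracy(results))
for task, acc in accuracy_by_task(results).items():
    print(task, acc)
```

In this toy setup the pooled score looks moderate while one sub-task fails more often than it succeeds, which is the kind of gap a single average hides.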
Inquiring lines that use this note as a source 26
This note is a source for these synthesized inquiries.
- Why does DPO outperform SFT specifically for function calling tasks?
- What role does rigid output format play in function calling failure modes?
- Can granular sub-task training for function calling improve both open and proprietary models?
- What makes the frame problem distinct from feature-level shortcuts?
- How much of the combinatorial task space must training data cover?
- Why does full multi-task fine-tuning perform worse than sequential training?
- What task structures benefit most from geometric parameter merging?
- How does preference-based training compare to supervised fine-tuning for function calling?
- Why do single function-calling benchmarks mask model weakness in specific areas?
- Do different function-calling subtasks have different entropy profiles during training?
- What three independent failure points bottleneck traditional function calling systems?
- How do ensemble methods apply within a single model?
- What performance trade-offs emerge when composing multiple independently trained model capabilities?
- Can structured decomposition fix evaluation gaps in other research tasks?
- How do neural networks decompose complex tasks into modular subnetworks?
- Can granular function calling tasks learn composition from graph-sampled data?
- Does training on granular tasks beat training on the full function calling problem?
- Can the joint-training principle extend beyond memorization and generalization pairs?
- Can sub-task handlers be swapped between neural and symbolic systems?
- How should skill libraries coordinate with gradient-based weight optimization?
- Can training on diverse related tasks be more efficient than task-specific training?
- Why does scaling data and model size improve compositional generalization?
- How do neural networks decompose tasks into modular subnetworks that transfer?
- Can we predict which tasks will decompose into modular subnetworks?
- Can specialized components replace single fully-trained models in deployment?
- Can intentional data-mixture design replace model scaling for rare task learning?
Related concepts in this collection 5
- Where do traditional function calling systems actually break down?
  Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points (retrieval, context bloat, and output rigidity) that together explain why even the best models struggle.
  extends: Floworks names three structural failure points (retrieval, schema bloat, output format); Granite identifies seven sub-task failure modes that umbrella training conflates. Both argue function-calling is not one problem.
- Can small models match large models on function calling?
  Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.
  complements: Granite addresses the *what to train on* axis (granular sub-tasks); DPO-from-teacher addresses the *how to train* axis (preference vs SFT). Both target the open-vs-proprietary gap on function calling.
- Does training order reshape how models handle different task types?
  Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
  complements: multi-task training surfaces complementary entropy dynamics; Granite's seven-task mixture should benefit from this: different sub-tasks have different entropy profiles and training across them stabilizes.
- Does separating planning from execution improve reasoning accuracy?
  Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.
  complements: same decomposition logic applied within function-calling: slot-filling, chaining, and response generation each warrant separate training signals because their error modes differ.
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  complements: granular sub-task decomposition is what enables SLM-first deployment of function-calling: each sub-task is small enough for a fine-tuned SLM to handle.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
- Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks
- Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
- Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling
- Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
- Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
- Octopus v2: On-device language model for super agent
Original note title
function calling decomposes into seven granular tasks — multi-task learning across them generalizes where umbrella training fails