Can granular function calling tasks learn composition from graph-sampled data?
This explores whether function calling—when broken into granular subtasks like nested calls and chaining—can learn to *compose* those skills from training data sampled out of a graph (knowledge graph paths, tree expansions), and what the corpus says about when that composition actually generalizes versus when it quietly fails.
The corpus offers two halves of an answer that are worth holding side by side. First, function calling really does decompose cleanly: training Granite-20B-FunctionCalling across seven explicit subtasks (nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, response generation) generalizes better than a single umbrella dataset such as ToolLLM, closing the gap with the frontier models (see: Can breaking function calling into subtasks improve model generalization?). So the 'granular tasks' premise of the question is sound. Small models can even learn the rigid output discipline these tasks demand through DPO on a teacher's correct/incorrect pairs, where the explicit negative examples target exactly the format failures that plague composition (see: Can small models match large models on function calling?).
The graph-sampling half is where it gets interesting. Sampling structure can hand you compositional supervision almost for free: one knowledge-graph curriculum derives 24,000 reasoning tasks from medical knowledge graph *paths* and, used to fine-tune a 32B model, yields state-of-the-art performance across 15 medical domains, suggesting structured composition matters more than raw scale (see: Can knowledge graphs teach models deep domain expertise?). Even random tree expansion yields supervision at multiple granularities, with coarse strategy signals from early branches and fine detail from late ones, purely from sampling and with no annotation effort (see: Does tree depth automatically produce supervision at multiple granularities?). For function calling, where a call graph naturally encodes which functions chain into which, this is a strong hint that graph-sampled paths could teach composition rather than just memorized recipes.
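As a rough illustration of what "graph-sampled paths" could mean for function calling, the sketch below random-walks a toy call graph to produce chaining examples. The graph, the function names, and the helpers `sample_chain` and `sample_dataset` are all hypothetical; none of the cited notes specifies this procedure.

```python
import random

# Hypothetical call graph: an edge f -> g means g can consume f's output.
CALL_GRAPH = {
    "search_flights": ["get_price", "get_seat_map"],
    "get_price": ["convert_currency", "book_flight"],
    "get_seat_map": ["book_flight"],
    "convert_currency": ["book_flight"],
    "book_flight": [],
}


def sample_chain(graph, start, max_len, rng):
    """Random walk from `start`; stops at a sink or after `max_len` functions."""
    chain = [start]
    while len(chain) < max_len and graph[chain[-1]]:
        chain.append(rng.choice(graph[chain[-1]]))
    return chain


def sample_dataset(graph, n, max_len=4, seed=0):
    """Sample `n` chains, each starting from a uniformly chosen function."""
    rng = random.Random(seed)
    starts = sorted(graph)  # sorted for reproducibility across runs
    return [sample_chain(graph, rng.choice(starts), max_len, rng) for _ in range(n)]


chains = sample_dataset(CALL_GRAPH, n=5)
for chain in chains:
    print(" -> ".join(chain))
```

Each sampled chain is only the skeleton of a training example; turning it into a prompt with arguments and expected outputs is a separate step not shown here.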
The catch is that composition learned this way can be an illusion. Transformers often succeed on in-distribution compositional tasks by memorizing computation *subgraphs* from training, not by learning systematic rules, and they fail drastically on novel compositions, with errors compounding across steps (see: Do transformers actually learn systematic compositional reasoning?). So if your evaluation only covers combinations that already appear in the graph-sampled training data, you may be rewarding subgraph lookup dressed as reasoning. The antidote the corpus offers is coverage: standard MLPs *do* achieve genuine compositional generalization from data and model scaling alone, but only when the training distribution sufficiently covers combinations of the task modules (see: Can neural networks learn compositional skills without symbolic mechanisms?). That reframes the whole question: graph sampling helps precisely to the degree it covers the combinatorial space of function compositions, not because graphs are magic.
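One simple way to make "coverage" concrete is to count how many valid function-to-function transitions the training chains actually exercise. The sketch below uses pairwise edge coverage as a stand-in; that proxy, the toy graph, and the helper names are assumptions for illustration, not the measure used in the cited work.

```python
# Hypothetical call graph: an edge f -> g means g can consume f's output.
CALL_GRAPH = {
    "search_flights": ["get_price", "get_seat_map"],
    "get_price": ["convert_currency", "book_flight"],
    "get_seat_map": ["book_flight"],
    "convert_currency": ["book_flight"],
    "book_flight": [],
}


def edges_of(chain):
    """Ordered pairs of consecutive functions in a chain."""
    return set(zip(chain, chain[1:]))


def seen_edges(train_chains):
    """All consecutive pairs that occur anywhere in the training chains."""
    seen = set()
    for chain in train_chains:
        seen |= edges_of(chain)
    return seen


def pair_coverage(graph, train_chains):
    """Fraction of the graph's edges exercised by at least one training chain."""
    all_edges = {(f, g) for f, succs in graph.items() for g in succs}
    return len(seen_edges(train_chains) & all_edges) / len(all_edges)


def novel_edges(chain, train_chains):
    """Edges in `chain` that never occur in training."""
    return edges_of(chain) - seen_edges(train_chains)


train = [
    ["search_flights", "get_price", "book_flight"],
    ["search_flights", "get_seat_map", "book_flight"],
]
coverage = pair_coverage(CALL_GRAPH, train)
print(f"{coverage:.2f}")  # 4 of the 6 edges are covered -> 0.67

probe = ["search_flights", "get_price", "convert_currency", "book_flight"]
print(sorted(novel_edges(probe, train)))
```

Pairwise coverage says nothing about longer combinations, so a high score here is at best a necessary condition for the kind of coverage the paragraph above describes.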
Two deeper findings sharpen the picture, one encouraging and one cautionary. Networks do tend to implement compositional subroutines in isolated, ablatable subnetworks; modularity is natural, and pretraining makes it more reliable (see: Do neural networks naturally learn modular compositional structure?), which is encouraging for granular function calling. Yet a model can hold all the linearly decodable features a task needs while its internal organization stays fractured, leaving it brittle to the exact distribution shifts that novel function compositions represent, in ways standard accuracy metrics never reveal (see: Can models be smart without organized internal structure?). So the honest answer: yes, granular function-calling tasks can learn composition from graph-sampled data, since graph and tree sampling are an efficient source of multi-granular compositional signal. But whether that composition is real or memorized depends on how well the data covers the combination space, and you can only tell the difference by testing on genuinely held-out compositions, not just by measuring in-distribution accuracy.
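A held-out composition test can be set up by reserving specific function-to-function transitions for evaluation only. The following is a minimal sketch under that assumption; the chains, the held-out edge, and the helpers `split_by_held_out_edges` and `unseen_functions` are hypothetical and not taken from the cited notes.

```python
def split_by_held_out_edges(chains, held_out_edges):
    """Chains containing any held-out edge go to test; the rest go to train."""
    train, test = [], []
    for chain in chains:
        edges = set(zip(chain, chain[1:]))
        (test if edges & held_out_edges else train).append(chain)
    return train, test


def unseen_functions(train, test):
    """Functions that appear in test but never in train.

    This should be empty: the split is meant to probe novel *combinations*
    of known functions, not functions the model has never seen.
    """
    train_funcs = {f for chain in train for f in chain}
    test_funcs = {f for chain in test for f in chain}
    return test_funcs - train_funcs


chains = [
    ["search_flights", "get_price", "book_flight"],
    ["search_flights", "get_seat_map", "book_flight"],
    ["search_flights", "get_price", "convert_currency", "book_flight"],
    ["convert_currency", "book_flight"],
    ["get_price", "convert_currency"],
]
held_out = {("get_price", "convert_currency")}

train, test = split_by_held_out_edges(chains, held_out)
print(len(train), len(test))          # 3 2
print(unseen_functions(train, test))  # set()
```

Reporting accuracy separately on `train`-style and `test`-style chains is what exposes the gap that in-distribution accuracy alone would hide.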
Sources (8 notes)
Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.
Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.