How does preference-based training compare to supervised fine-tuning for function calling?
This explores whether learning from preference pairs (DPO/RLHF-style, where the model sees both good and bad examples) beats plain supervised fine-tuning when teaching a model to call functions correctly — and the corpus suggests the answer hinges on a quirk of what function calling actually demands: rigid output format.
This reads the question as: for the specific task of getting a model to emit well-formed function calls, does training on preferences (correct-vs-incorrect pairs) outperform straight supervised fine-tuning? The corpus has a sharp answer for at least one regime. Small models trained with DPO on preference pairs generated by a large teacher (pairs of correct and incorrect function calls) beat SFT, and the reason is mechanistic: function calling fails mostly on rigid output format, and DPO's explicit negative examples directly punish those format errors in a way SFT's positive-only examples never can (see "Can small models match large models on function calling?").
That result lands harder once you see what SFT actually teaches. There's evidence that instruction tuning transfers knowledge of the output *space* rather than task understanding: models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones, because what carries over is 'what valid output looks like,' not 'what the task means' (see "Does instruction tuning teach task understanding or output format?"). For function calling, format-fidelity is most of the battle, so a method that shapes the output distribution by contrast (here's the malformed call, don't do that) has a structural edge over one that only shows correct examples.
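To make the contrast concrete, here is a minimal sketch of one preference pair and the standard DPO loss computed on it. The example prompt, the `multiply` tool, the `is_well_formed` helper, and the log-probability values in the usage note are all hypothetical illustrations, not taken from the cited work; only the loss formula itself is the published DPO objective.

```python
import json
import math

# Hypothetical preference pair: same prompt, one well-formed call, one malformed.
pair = {
    "prompt": "What is 12 * 7? Use the multiply tool.",
    "chosen": '{"name": "multiply", "arguments": {"a": 12, "b": 7}}',
    "rejected": '{"name": "multiply", "arguments": {"a": 12, "b": 7}',  # missing closing brace
}

def is_well_formed(call_text):
    """Return True if the text parses as a JSON object."""
    try:
        return isinstance(json.loads(call_text), dict)
    except json.JSONDecodeError:
        return False

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # Margin: how much more the policy favors the chosen call over the
    # rejected one, measured relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

With identical policy and reference log-probabilities the loss is `log(2)`; it falls as the policy shifts probability toward the chosen call and away from the rejected one. SFT, by comparison, would train only on `prompt` plus `chosen` and never see the malformed string at all.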
The corpus also pushes back on treating preference/RL training as a free lunch. RL fine-tuning can sharpen template-matching rather than install genuine procedure: out-of-distribution variants reveal models leaning on memorized patterns, so the gains may not transfer to novel call structures (see "Do fine-tuned language models actually learn optimization procedures?"). RL also tends to collapse onto a single dominant output format from pretraining, amplifying one pattern while suppressing alternatives (see "Does RL training collapse format diversity in pretrained models?"), which is *helpful* when you want one canonical call format but harmful if your API surface needs varied shapes. Relatedly, preference tuning's effect flips by domain: it narrows diversity where convergence is rewarded (like code) and widens it where distinctiveness pays (see "Does preference tuning always reduce diversity the same way?"). Function calling sits firmly on the convergence side, so that narrowing is a feature.
There's a third path the corpus surfaces that sidesteps the SFT-vs-preference framing entirely: decompose function calling into its parts. One approach breaks it into seven granular subtasks (nested calls, chaining, parallel functions, name detection, parameter detection, next-best-function, response generation) and shows multi-task training across them generalizes better than umbrella tool-use datasets, closing the gap with frontier models (see "Can breaking function calling into subtasks improve model generalization?"). This hints that *how you carve the training signal* may matter as much as whether it's supervised or preference-based.
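One way to picture "carving the training signal" is a data loader that balances batches across the seven subtasks instead of sampling from one pooled dataset. The sketch below is an assumption for illustration only: the `subtask` field, the `balanced_batches` helper, and the uniform sampling scheme are hypothetical and are not the recipe used by the cited approach.

```python
import random
from collections import defaultdict

# The seven subtasks named above, as hypothetical label strings.
SUBTASKS = [
    "nested_calls", "chaining", "parallel_functions", "name_detection",
    "parameter_detection", "next_best_function", "response_generation",
]

def balanced_batches(examples, batch_size, seed=0):
    """Yield batches endlessly, picking a subtask uniformly, then an example."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["subtask"]].append(ex)
    # Only sample from subtasks that actually have data.
    tasks = [t for t in SUBTASKS if by_task[t]]
    if not tasks:
        raise ValueError("no examples carry a known subtask label")
    while True:
        yield [rng.choice(by_task[rng.choice(tasks)]) for _ in range(batch_size)]
```

Sampling the subtask first keeps a rare skill such as nested calls from being drowned out by a much larger pool of, say, response-generation examples.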
The broader thread, if you want to keep pulling: RL-style training can embed domain knowledge more effectively than SFT by rewarding reasoning quality over token-level correctness (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). Reward signals don't even have to come from humans: rule-based or task metrics can serve directly as RL rewards, eliminating the SFT distillation step from proprietary models altogether (see "Can recommendation metrics train language models directly?"). For function calling, where 'correct' is often mechanically checkable, that's a tantalizing door: the verifier you'd use to grade a call could itself become the training signal.
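As a rough sketch of what "the verifier becomes the training signal" could look like, the function below grades a raw model output against a tool schema and returns a scalar reward. The `TOOLS` schema, the `get_weather` tool, and the partial-credit levels are hypothetical choices made for illustration; a real setup would pick its own schema format and reward shaping.

```python
import json

# Hypothetical tool schema: function name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str, "days": int},
}

def call_reward(raw_output):
    """Score raw model output as a function call; 1.0 means fully valid."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # not parseable JSON at all
    if not isinstance(call, dict):
        return 0.25  # valid JSON, but not a call object
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return 0.25  # missing or unknown function name
    schema = TOOLS[name]
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(schema):
        return 0.5  # right function, wrong argument set
    # Exact type match, so a JSON boolean does not pass as an int.
    if not all(type(args[key]) is expected for key, expected in schema.items()):
        return 0.75  # right argument names, wrong value types
    return 1.0
```

For example, `call_reward('{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}')` returns `1.0`, while the same call with `"days": "3"` returns `0.75`. The same checks could also label completions as chosen or rejected when building preference pairs.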
Sources (8 notes)
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full correct instructions, with a reported 43% against a 42.6% random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.