INQUIRING LINE

How does preference-based training compare to supervised fine-tuning for function calling?

This explores whether learning from preference pairs (DPO/RLHF-style, where the model sees both good and bad examples) beats plain supervised fine-tuning when teaching a model to call functions correctly — and the corpus suggests the answer hinges on a quirk of what function calling actually demands: rigid output format.


This reads the question as: for the specific task of getting a model to emit well-formed function calls, does training on preferences (correct-vs-incorrect pairs) outperform straight supervised fine-tuning? The corpus has a sharp answer for at least one regime. Small models trained with DPO on preference pairs generated by a large teacher (pairs of correct and incorrect function calls) beat SFT, and the reason is mechanistic: function calling fails mostly on rigid output format, and DPO's explicit negative examples directly penalize those format errors, which SFT's positive-only examples cannot do (see "Can small models match large models on function calling?").
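To make the contrast concrete, here is a minimal sketch of the two per-example losses, assuming sequence log-probabilities have already been computed for each call. The function names, the `beta` value, and the numbers are illustrative assumptions rather than details from the cited note; the DPO expression is the standard published form.

```python
import math


def sft_loss(logp_chosen: float) -> float:
    """Negative log-likelihood of the correct call. The bad call never appears."""
    return -logp_chosen


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy prefers the chosen call than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Two hypothetical policies assign the same log-prob to the correct call,
# but differ in how likely they find the malformed one.
suppresses_bad = dpo_loss(-5.0, -9.0, ref_logp_chosen=-5.0, ref_logp_rejected=-7.0)
favours_bad = dpo_loss(-5.0, -6.0, ref_logp_chosen=-5.0, ref_logp_rejected=-7.0)
```

In this toy setup `sft_loss(-5.0)` is identical for both policies, because SFT only ever scores the correct call, while the DPO loss is lower for the policy that pushes the malformed call's probability down.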

That result lands harder once you see what SFT actually teaches. There's evidence that instruction tuning transfers knowledge of the output *space* rather than task understanding: models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones, because what carries over is 'what valid output looks like,' not 'what the task means' (see "Does instruction tuning teach task understanding or output format?"). For function calling, format-fidelity is most of the battle, so a method that shapes the output distribution by contrast (here's the malformed call, don't do that) has a structural edge over one that only shows correct examples.

The corpus also pushes back on treating preference/RL training as a free lunch. RL fine-tuning can sharpen template-matching rather than install genuine procedure: out-of-distribution variants reveal models leaning on memorized patterns, so the gains may not transfer to novel call structures (see "Do fine-tuned language models actually learn optimization procedures?"). RL also tends to collapse onto a single dominant output format from pretraining, amplifying one pattern while suppressing alternatives (see "Does RL training collapse format diversity in pretrained models?"), which is *helpful* when you want one canonical call format but harmful if your API surface needs varied shapes. Relatedly, preference tuning's effect flips by domain: it narrows diversity where convergence is rewarded (like code) and widens it where distinctiveness pays (see "Does preference tuning always reduce diversity the same way?"). Function calling sits firmly on the convergence side, so that narrowing is a feature.

There's a third path the corpus surfaces that sidesteps the SFT-vs-preference framing entirely: decompose function calling into its parts. One approach breaks it into seven granular subtasks (nested calls, chaining, parallel functions, name detection, parameter detection, next-best-function, and response generation) and shows that multi-task training across them generalizes better than umbrella tool-use datasets, closing the gap with frontier models (see "Can breaking function calling into subtasks improve model generalization?"). This hints that *how you carve the training signal* may matter as much as whether it's supervised or preference-based.
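One way to picture "carving the training signal" is a data mixer that rotates through the seven subtasks so no single one dominates a batch. The sketch below is an assumption for illustration only: the `subtask` field, the round-robin policy, and the function name are hypothetical, and the cited note does not say how the subtasks were actually mixed.

```python
from collections import defaultdict
from itertools import cycle

# Subtask labels taken from the list above; the identifier spelling is ours.
SUBTASKS = [
    "nested_calls", "chaining", "parallel_functions", "name_detection",
    "parameter_detection", "next_best_function", "response_generation",
]


def round_robin_mix(examples, n):
    """Return n examples, rotating through the subtasks that have data."""
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["subtask"]].append(ex)
    # One endlessly repeating stream per non-empty subtask, in a fixed order.
    streams = [cycle(by_task[t]) for t in SUBTASKS if by_task[t]]
    mixed = []
    for stream in cycle(streams):
        if len(mixed) == n:
            break
        mixed.append(next(stream))
    return mixed


# Toy data: three nested-call examples and a single chaining example.
toy = [
    {"subtask": "nested_calls", "id": 1},
    {"subtask": "nested_calls", "id": 2},
    {"subtask": "nested_calls", "id": 3},
    {"subtask": "chaining", "id": 4},
]
batch = round_robin_mix(toy, 4)
```

Here the under-represented chaining example is repeated so that it appears as often as the nested-call examples; whether that kind of rebalancing helps in practice is an empirical question the note does not settle.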

The broader thread, if you want to keep pulling: RL-style training can embed domain knowledge more effectively than SFT by rewarding reasoning quality over token-level correctness (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"), and reward signals don't even have to come from humans, since rule-based or task metrics can serve directly as RL rewards, eliminating the SFT distillation step from proprietary models altogether (see "Can recommendation metrics train language models directly?"). For function calling, where 'correct' is often mechanically checkable, that's a tantalizing door: the verifier you'd use to grade a call could itself become the training signal.
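As a rough illustration of a verifier doubling as a reward, the sketch below scores a raw model output 1.0 if it parses as JSON and matches a tool schema, and 0.0 otherwise. The `TOOLS` schema, the `name`/`arguments` call shape, and the binary reward are all assumptions made for the example; none of them come from the cited notes.

```python
import json

# Hypothetical tool schema: function name -> {parameter: (type, required)}.
TOOLS = {
    "get_weather": {"city": (str, True), "units": (str, False)},
}


def call_reward(raw: str, tools=TOOLS) -> float:
    """Return 1.0 for a well-formed, schema-valid call, else 0.0."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0  # malformed JSON, e.g. a truncated call
    if not isinstance(call, dict):
        return 0.0
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in tools or not isinstance(args, dict):
        return 0.0  # unknown function or missing argument object
    spec = tools[name]
    if any(key not in spec for key in args):
        return 0.0  # parameter that is not in the schema
    for param, (expected_type, required) in spec.items():
        if param not in args:
            if required:
                return 0.0  # missing required parameter
        elif not isinstance(args[param], expected_type):
            return 0.0  # wrong argument type
    return 1.0


good = call_reward('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
wrong_type = call_reward('{"name": "get_weather", "arguments": {"city": 42}}')
truncated = call_reward('{"name": "get_weather", "arguments": {"city": "Oslo"')
```

A check like this only covers format and schema validity. Whether the call was the right one for the user's request would need a separate signal, such as executing the call and comparing the result.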


Sources (8 notes)

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, scoring 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can recommendation metrics train language models directly?

Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: does preference-based training (DPO, RLHF) outperform supervised fine-tuning (SFT) for function calling, and under what conditions?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025; treat as perishable constraints:

• Small models trained with DPO on preference pairs (correct vs. incorrect function calls) beat SFT; DPO's negative examples directly penalize format errors SFT cannot address (2024–2025).
• Instruction tuning teaches output *format distribution*, not task understanding — models trained on semantically empty or wrong instructions perform comparably, because validity-signals, not meaning, transfer (2023).
• RL fine-tuning can memorize template patterns rather than install genuine procedure; out-of-distribution variants reveal brittleness, and gains may not transfer to novel call structures (2024–2025).
• RL post-training converges on a single dominant pretraining format, amplifying one pattern while suppressing alternatives — helpful for one canonical format, harmful if APIs need varied shapes (2025).
• Function calling decomposes into seven granular subtasks (nested calls, chaining, name/parameter detection, etc.); multi-task training across them generalizes better than umbrella tool-use datasets, closing gaps with frontier models (2024).

Anchor papers (verify; mind their dates):
- arXiv:2305.11383 (2023): Do Models Really Learn to Follow Instructions?
- arXiv:2410.18890 (2024): Improving Small-Scale LLMs Function Calling for Reasoning Tasks
- arXiv:2504.07912 (2025): Echo Chamber — RL Post-training Amplifies Pretraining Behaviors
- arXiv:2509.20162 (2025): Embedding Domain Knowledge via RL from Augmented Generation

Your task:
(1) RE-TEST EACH CONSTRAINT. For DPO's claimed advantage over SFT on format-fidelity: has the gap closed with newer SFT methods (e.g., structured output specs, grammar-constrained decoding, in-context examples)? For RL's memorization problem: do recent scaling methods (arXiv:2510.13786) or online RL techniques dissolve brittleness? For format convergence: have checkpoints or ensemble approaches (arXiv:2507.18624) recovered diversity without harming performance? Separate what is still genuinely hard from what newer tooling/methods may have solved.
(2) Surface the strongest work from the last ~4 months that contradicts or supersedes the SFT–preference framing entirely — e.g., rule-based verification signals, multi-task decomposition, or post-hoc verifiers as training signals. Flag any papers that shift the regime away from binary comparisons.
(3) Propose 2 research questions that assume the regime may have moved: one testing whether mechanically checkable correctness (the function call's validity) as a direct RL reward signal outperforms human preference pairs; another exploring whether adaptive decomposition (task-specific subtask choices) beats fixed seven-task schemas.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
