Can small models match large models on function calling?
Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.
The paper's core insight is methodological: function calling for reasoning tasks is a domain where DPO outperforms SFT for small models, because the failures come less from an inability to generate plausible text than from choosing the wrong format or call sequence. The proposed framework uses an agent that, given a problem and a set of callable functions, queries a large LLM by injecting function descriptions and examples, and manages the calls in a step-by-step reasoning chain. The byproduct is a dataset of both correct and incorrect chat completions, that is, preference pairs ready for DPO.
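A minimal sketch of how correct and incorrect teacher completions could be turned into preference pairs. The `Completion` type, the `build_preference_pairs` helper, the `max_pairs_per_prompt` cap, and the toy `add`/`mul` calls are all hypothetical illustrations, not the paper's code; the `prompt`/`chosen`/`rejected` record layout is assumed here because it is a common shape for DPO datasets.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Completion:
    prompt: str     # problem plus injected function descriptions (assumed layout)
    response: str   # the teacher's completion for that prompt
    correct: bool   # whether this completion counted as a success


def build_preference_pairs(completions, max_pairs_per_prompt=4):
    """Pair each correct completion with an incorrect one for the same prompt."""
    by_prompt = {}
    for c in completions:
        group = by_prompt.setdefault(c.prompt, {"good": [], "bad": []})
        group["good" if c.correct else "bad"].append(c.response)

    pairs = []
    for prompt, group in by_prompt.items():
        # Prompts with only successes or only failures yield no pairs.
        combos = product(group["good"], group["bad"])
        for i, (chosen, rejected) in enumerate(combos):
            if i >= max_pairs_per_prompt:
                break
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


samples = [
    Completion("Q1", '{"name": "add", "arguments": {"a": 2, "b": 3}}', True),
    Completion("Q1", '{"name": "add", "arguments": {"x": 2, "y": 3}}', False),
    Completion("Q2", '{"name": "mul", "arguments": {"a": 2, "b": 3}}', True),
]
pairs = build_preference_pairs(samples)
```

In this toy run only `Q1` produces a pair, because `Q2` has no failed completion to contrast against. A prompt needs at least one success and one failure before it contributes any preference signal.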
Why DPO rather than SFT or PPO. SFT teaches the model to imitate good examples but provides no signal about what to avoid — and rigid output formats (precise variable names, JSON, argument values) punish near-misses harshly, so explicit negative examples matter. PPO would work but requires extensive human feedback to train a reward model, making it resource-intensive. DPO removes the reward-model step by incorporating preferences directly into the training objective, with demonstrated stability advantages over PPO.
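To make "incorporating preferences directly into the training objective" concrete, here is a sketch of the standard DPO loss for a single preference pair, written in plain Python on scalar sequence log-probabilities. The function name, argument names, and the default `beta=0.1` are illustrative assumptions; a real training loop would compute these log-probabilities from the policy and a frozen reference model and average over a batch.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has raised the chosen completion and lowered the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy assigns relatively more probability to the chosen completion and less to the rejected one, so the rejected example contributes gradient signal that plain SFT on the chosen example alone would not provide. No separate reward model appears anywhere in the computation.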
The structural move is that a large LLM does double duty: it generates the candidate reasoning chains, and its successes and failures provide the preference labels for the small model's training. This is a teacher-distillation pattern with both polarities: the small model learns what the large model gets right and what it gets wrong, rather than only imitating the large model's right answers. The pattern fits the broader case made in "Can small language models handle most agent tasks?": function calling is exactly the kind of repetitive, scoped, format-rigid work where a fine-tuned small model can replace a large general-purpose one.
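One way the teacher's successes and failures could be turned into labels is a strict check on each emitted call. This note does not say how success is judged (the paper may instead compare the final answer of the chain), so the sketch below is only an assumption: the `FUNCTIONS` schema, the `name`/`arguments` JSON shape, and both helper functions are hypothetical.

```python
import json

# Hypothetical schema: function name -> exact set of required argument names.
FUNCTIONS = {"add": {"a", "b"}, "mul": {"a", "b"}}


def is_valid_call(raw, functions=FUNCTIONS):
    """Return True only if raw is well-formed JSON naming a known function
    with exactly the expected argument names."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in functions:
        return False
    if not isinstance(args, dict):
        return False
    return set(args) == functions[name]


def label_completions(raw_calls):
    """Tag each teacher completion as 'chosen' or 'rejected'."""
    return [("chosen" if is_valid_call(r) else "rejected", r) for r in raw_calls]


teacher_outputs = [
    '{"name": "add", "arguments": {"a": 1, "b": 2}}',   # well-formed
    '{"name": "add", "arguments": {"x": 1, "y": 2}}',   # wrong argument names
    '{"name": "add", "arguments": {"a": 1, "b": 2},}',  # trailing comma, invalid JSON
    '{"name": "sum", "arguments": {"a": 1, "b": 2}}',   # unknown function
]
labels = [label for label, _ in label_completions(teacher_outputs)]
```

The three rejected outputs are all near-misses that a reader would call almost right, which is the kind of negative example the double-duty teacher supplies at no extra labeling cost.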
The practical implication: when output format is rigid and small-model deployment is the goal, the question is not "can SFT close the gap" but "what's the cheapest source of preference signal." Self-generated preference pairs from a strong teacher are essentially free relative to human feedback.
Inquiring lines that use this note as a source (105)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- Why does DPO outperform SFT specifically for function calling tasks?
- Can granular sub-task training for function calling improve both open and proprietary models?
- When does the right constraint beat additional model capacity?
- Can input augmentation and rephrasing compensate for smaller model limitations?
- How do larger models maintain more parallel tasks than smaller models?
- Can likelihood choice matter more than architectural depth for CF?
- Do models learn different sophistry strategies for QA versus code generation?
- Why do method-level improvements avoid the generation-verification gap that parameter-level improvements face?
- Does correct model behavior guarantee internal alignment of learned objectives?
- Can language models learn to form ad-hoc conventions through training?
- How much of the combinatorial task space must training data cover?
- Does scaling model size solve compositional generalization problems?
- How does inference compute substitution affect the training parameter scaling trade-off?
- Do tool-enabled reasoning models close the gap on constraint satisfaction?
- Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- How does business logic specification replace annotated training datasets?
- How much alignment data does a language model actually need to specialize well?
- Does the optimal model size depend on what capabilities you actually need?
- Can smaller models actually perform well on specific downstream tasks?
- Why do open-source models trained on proprietary outputs still fail at reasoning?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- Can prompt optimization inject new knowledge into language models?
- What architectural changes would let language models develop genuine functional competence?
- Why do models fail on logically equivalent tasks with different data distributions?
- Why does NLI fine-tuning amplify frequency bias instead of teaching inference?
- How does preference-based training compare to supervised fine-tuning for function calling?
- Why do single function-calling benchmarks mask model weakness in specific areas?
- Do different function-calling subtasks have different entropy profiles during training?
- What three independent failure points bottleneck traditional function calling systems?
- Does more inference compute help reasoning models match specialized domain performance?
- Does fine-tuning models for specific tasks destroy their ability to reason?
- How can smaller models help select useful data for larger models?
- Can test-time compute on smaller models replace larger model inference?
- Which RAG sub-decisions are actually pattern matching versus reasoning intensive?
- Does architectural design matter more than model scale for reasoning tasks?
- Can small models solve complex tasks using externalized reasoning graphs?
- Can instance seeds work for tasks beyond language understanding benchmarks?
- Does training data format shape model reasoning more than domain content?
- Does knowledge structure matter more than knowledge volume for model training?
- Can smaller specialist models outperform large generalist models on domain tasks?
- How do general language model benchmarks predict specialized domain performance?
- Do instruction-tuned models learn tasks or just output format distributions?
- Why do language models struggle with formal logical reasoning and joins?
- Why does distillation transfer reasoning patterns with few examples?
- What makes reasoning-specific post-training different from standard parameter scaling?
- How much does training composition affect syntactic versus reasoning performance?
- Why do smaller and larger models converge on different output formats?
- Why do production systems optimize for three model classes instead of foundation models?
- What are the computational trade-offs between training-time vs inference-time consistency correction?
- Why do large language models outperform fine-tuned models once repeated items are removed?
- Can smaller models achieve domain expertise through focused RL training?
- Can finetuning sparse subnetworks alone match full parameter finetuning results?
- Why do smaller models favor code formats while larger models prefer natural language?
- How do routers decide when to escalate from small to large models?
- Do small models show different parameter efficiency patterns than large models?
- Can multiple small models outperform a single large model with good routing?
- What training cost tradeoffs exist between fine-tuning and other knowledge injection methods?
- What substrate do supervised models lack that makes them weaker on low-resource languages?
- Can reasoning in free text then formatting separately recover performance?
- How should tiny language models be architected differently than large ones?
- What filtering criteria best identify student-compatible refinements from teacher models?
- Why might diverse smaller models with routing beat one giant model?
- What makes routing a better investment than training larger models?
- Can weaker models match stronger ones with sufficient search and reasoning budget?
- Can granular function calling tasks learn composition from graph-sampled data?
- Can models retrieve the right tool without relying on vector similarity?
- Does training on granular tasks beat training on the full function calling problem?
- What makes a small surgical wide component sufficient with a capable deep model?
- Why do smaller LLMs fail at zero-shot argument scheme classification?
- Does supervised fine-tuning improve reasoning or just response formatting?
- Why does AI code generation lag behind pattern-matching benchmarks?
- How does the generation-verification gap prevent language models from improving themselves?
- Can smaller judge models better capture human preferences than larger prompted models?
- Can cognitive scaffolding replace tool-based reasoning augmentation in language models?
- How do retrieval and fine-tuning trade off flexibility against training cost?
- Why do smaller models lose reasoning faithfulness more than larger models?
- Can models maintain reasoning-output coupling while improving domain accuracy?
- How do pre-training and distillation enable minimal routing signals to work?
- Do different model sizes show different rates of optional field overfilling behavior?
- Does fine-tuning a small model match fine-tuning a large one?
- Can specialized components replace single fully-trained models in deployment?
- Can smaller LLMs perform tool use tasks through modular decomposition?
- What output distribution properties make smaller models better for wide sampling?
- Does joint optimization of prompts and parameters outperform separate tuning?
- Does bounding textual edits prevent skill degradation better than free rewriting?
- Does token-level loss aggregation help aligned models differently?
- Why do structure-targeted training negatives fail to fix the underlying problem?
- What architectural variables most improve inference efficiency today?
- Why does the right structural prior matter more than raw model capacity?
- Can weak models supervise the alignment of stronger models effectively?
- How can language models extract more value from fewer demonstrations?
- How does tool integration leverage comprehension without demanding perfect generation?
- Why does parameter-efficient tuning scaling fail to improve finetuning performance?
- Does pretraining data size matter less than base model scale for finetuning?
- How do task frequency and complexity interact with model capacity during training?
- Does task diversity in pretraining data transfer reasoning better than larger models?
- How does model scale affect anticipatory behavior in structured training?
- Can preference trees structure alignment data for domains beyond math and code?
- Why do small specialized models match frontier multimodal models on screen tasks?
- Why does architecture matter more than training compute for inference efficiency?
- How does tool-based reasoning expand what language models can do?
- Can smaller models produce skill updates as useful as frontier model updates?
- How does evaluation setting affect measured reasoning capabilities in language models?
Related concepts in this collection (5)
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  exemplifies: function-calling is the canonical repetitive-scoped-format-rigid task where SLM-first architectures pay off; DPO-from-teacher is one viable training recipe.
- Can breaking function calling into subtasks improve model generalization?
  Does training on seven granular function-calling subtasks instead of one umbrella objective close the gap between open-source and proprietary models? This explores whether decomposition surfaces hidden failure modes that unified training misses.
  complements: Granite argues granular sub-task training closes the open-vs-proprietary gap on the *what to train on* axis; this note argues DPO-from-teacher closes the gap on the *how to train* axis. Both target the same problem (open-source function-calling lags proprietary).
- Where do traditional function calling systems actually break down?
  Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points (retrieval, context bloat, and output rigidity) that together explain why even the best models struggle.
  extends: Floworks names rigid output format as one of three failure points; this note shows DPO-with-negative-examples is a targeted intervention against the format failure mode specifically.
- Does teacher-refined data always improve student model performance?
  Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.
  complicates: teacher-distillation effectiveness depends on student-teacher compatibility; preference-pair distillation may inherit this dependency.
- Why do alignment methods work if they model human irrationality?
  DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?
  extends: explains *why* DPO's negative-example signal works: it implicitly models loss aversion, which is exactly the asymmetry rigid output formats impose.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Learning to Reason for Factuality
- A Survey on Post-training of Large Language Models
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- An Emulator for Fine-Tuning Large Language Models using Small Language Models
- Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Original note title
DPO-trained small models can match large models on function-calling reasoning chains — preference data from a teacher beats SFT for the rigid output format