ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
Supervised fine-tuning (SFT) is a common method to enhance the tool-calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, randomly sampled tools are often unrelated to one another, which makes them difficult to combine and thus reduces the diversity of the data. Additionally, current work overlooks the coherence between dialogue turns, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline TOOLFLOW. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with TOOLFLOW.
Introduction. Enabling Large Language Models (LLMs) to perform tool calling significantly enhances their capabilities and practical applications. This requires the models to possess strong understanding, reasoning, and instruction-following abilities. Customized fine-tuning is a widely used method to improve the tool-calling capabilities of LLMs (Abdelaziz et al., 2024; Patil et al., 2023; Schick et al., 2023; Qin et al., 2023). However, access to fine-tuning data can be limited. One viable solution is to utilize LLMs for data synthesis (Basu et al., 2024; Wang et al., 2023; Xu et al., 2023; Yu et al., 2024). A typical tool-calling data synthesis process involves three steps: (1) selecting candidate tool(s), (2) generating requirements based on those tools, and (3) creating the call statements (Tang et al., 2023; Liu et al., 2024b). However, the data synthesized through this method often lacks realism and naturalness. Randomly sampled tools frequently fail to interconnect, making it difficult to combine them for complex tasks.
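To make the contrast with random sampling concrete, the following is a minimal sketch of sampling related tools from a graph. It is an illustration under stated assumptions, not the pipeline's actual procedure: the toy tool set, the function names `build_graph` and `sample_related_tools`, and the rule that two tools are related when they share a field name are all hypothetical, since the text above does not specify how tool relevance is measured.

```python
import random

# Hypothetical toy tool set: tool name -> names of its input/output fields.
TOOLS = {
    "search_flights": {"origin", "destination", "date", "flight_id"},
    "book_flight": {"flight_id", "passenger_name"},
    "get_weather": {"destination", "date"},
    "convert_currency": {"amount", "currency"},
    "search_hotels": {"destination", "date", "hotel_id"},
}


def build_graph(tools):
    """Link two tools when they share a field name (an assumed relevance proxy)."""
    graph = {name: set() for name in tools}
    names = sorted(tools)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if tools[a] & tools[b]:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def sample_related_tools(graph, k, rng):
    """Grow a connected set of at most k tools, starting from a random tool."""
    start = rng.choice(sorted(graph))
    chosen = [start]
    frontier = set(graph[start])  # neighbors of the tools chosen so far
    while len(chosen) < k and frontier:
        nxt = rng.choice(sorted(frontier))
        chosen.append(nxt)
        frontier |= graph[nxt]
        frontier -= set(chosen)
    return chosen


if __name__ == "__main__":
    graph = build_graph(TOOLS)
    print(sample_related_tools(graph, 3, random.Random(0)))
```

Every tool after the first is adjacent to at least one earlier pick, so the sampled set can plausibly be combined in a single task. An isolated tool such as `convert_currency` in this toy set yields a singleton sample, which shows how such a scheme depends on the connectivity of the underlying tool set.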
Discussion / Conclusion. In this work, we propose Graph-based Sampling and Planned Generation strategies to enhance the diversity and coherence of synthetic data. Based on these two strategies, we introduce a pipeline called TOOLFLOW for synthesizing tool-calling data and generate 8,000 training samples. Using this dataset, we conduct SFT on Llama-3.1-8B-Instruct, which improves the tool-calling capability of the model. Subsequently, we conduct a correlation analysis to demonstrate the influence of data diversity and coherence on model performance. This provides a reference for composing training data for tool-enhanced agents. We summarize the limitations in two points. As described in Section 3.4, the seed data is a pre-collected tool set of 16,000 APIs. Although TOOLFLOW can synthesize more diverse data, the size and diversity of the tool set also affect the diversity of the synthesized data. How to enrich the seed data has not been studied in this work.