Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
Large language models (LLMs) excel in most NLP tasks but require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower-cost (e.g., edge) devices tend to lag behind in response quality. In this work, we therefore propose a hybrid inference approach that combines their respective strengths to save cost while maintaining quality. Our approach uses a router that assigns each query to the small or the large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to trade quality for cost as the scenario requires. In experiments, our approach makes up to 40% fewer calls to the large model with no drop in response quality.
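The routing rule described above can be illustrated with a minimal sketch. It is not the trained router from this work: the names HybridRouter, toy_score, and large_call_fraction are hypothetical, and the scorer is a toy word-count heuristic standing in for a learned difficulty predictor. The sketch assumes the router produces a score in [0, 1] estimating how likely the small model is to answer adequately, and that the desired quality level is exposed as a threshold on that score.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def toy_score(query: str) -> float:
    """Toy stand-in for a learned router score (assumption, not the paper's model).

    Returns a value in [0, 1]; higher means the query looks easier,
    i.e. the small model is more likely to suffice. Here, shorter
    queries are simply treated as easier.
    """
    return max(0.0, 1.0 - len(query.split()) / 20)


@dataclass
class HybridRouter:
    score_fn: Callable[[str], float]  # hypothetical difficulty predictor
    threshold: float = 0.5            # desired quality level, tunable at test time

    def route(self, query: str) -> str:
        """Send the query to the small model only if its score clears the threshold."""
        return "small" if self.score_fn(query) >= self.threshold else "large"


def large_call_fraction(router: HybridRouter, queries: Sequence[str]) -> float:
    """Fraction of queries routed to the large model (a proxy for cost)."""
    if not queries:
        return 0.0
    n_large = sum(router.route(q) == "large" for q in queries)
    return n_large / len(queries)


queries = [
    "What is 2+2?",
    "Translate 'good morning' into French.",
    "Write a detailed proof that the sum of the first n odd numbers equals "
    "n squared and then explain each step carefully",
]

router = HybridRouter(score_fn=toy_score, threshold=0.5)
decisions = [router.route(q) for q in queries]
```

Raising the threshold sends more queries to the large model (higher cost, more conservative on quality), while lowering it sends more to the small model. Because the threshold is a plain attribute, it can be changed between requests without retraining the scorer.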
Introduction. Large language models (LLMs) have become the dominant force in natural language processing in recent years [Zhao et al., 2023]. Their impact has been especially striking in generative applications, where it has extended beyond standard language understanding and question-answering benchmarks [Hendrycks et al., 2020, Srivastava et al., 2022] to several successful real-world deployments. These include the wildly popular ChatGPT [OpenAI, b] and several other chatbots [Zheng et al., 2023] powered by different LLMs [Taori et al., 2023, Touvron et al., 2023, OpenAI, 2023], which allow users to engage in natural language conversations and obtain informative responses on a range of practically useful tasks such as creative writing, translation, and code completion. An important added attraction of these models is their accessibility: users can input queries and receive responses in natural language, without any specialized data or code, which has created widespread demand for their services across regions, professions, and disciplines.
Discussion / Conclusion. Motivated by the need to optimize the trade-off between LLM inference costs and response quality, we have presented a hybrid inference approach based on quality-aware query routing. We train a router to discriminate between “hard” and “easy” queries, enabling the LLM provider to make cost-efficient decisions about which model should serve a given query. Our experimental results on a variety of state-of-the-art LLMs of varying sizes show that such an optimization is possible and that we can realize cost advantages of up to 40% with no significant drop in response quality. To the best of our knowledge, this is the first work exploring the possibilities of cost-effective and quality-aware query routing between LLMs. We identify several important extensions for future work: (1) Task-aware routing. Our current routers make routing decisions based purely on query inputs. To improve routing effectiveness, we can provide more informative signals that help routers distinguish easy queries from hard ones, such as task labels for query examples, and we can also identify tasks that may be better suited to routing for a given pair of LLMs. (2) Generalizing to N-model routing.