LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226 ms.
Introduction. Large language models (LLMs), represented by ChatGPT (OpenAI, 2022), have become powerful general-purpose task solvers, capable of assisting people in daily life through conversational interactions. However, most LLMs currently support only text-based interaction, which limits their application in scenarios where text input and output are not ideal. Recently, the emergence of GPT-4o (OpenAI, 2024) has made it possible to interact with LLMs through speech, responding to users’ instructions with extremely low latency and significantly enhancing the user experience. However, there is still a lack of exploration in the open-source community on building such speech interaction models based on LLMs. Therefore, how to achieve low-latency and high-quality speech interaction with LLMs is a pressing challenge that needs to be addressed.
Discussion / Conclusion. In this paper, we propose an innovative model architecture, LLaMA-Omni, which enables low-latency and high-quality speech interaction with LLMs. LLaMA-Omni is built upon the latest Llama-3.1-8B-Instruct model, with the addition of a speech encoder for speech understanding and a streaming speech decoder, so that text and speech responses can be generated simultaneously. To align the model with speech interaction scenarios, we construct a speech instruction dataset, InstructS2S-200K, which contains 200K speech instructions along with the corresponding speech responses. Experimental results show that, compared to previous speech-language models, LLaMA-Omni delivers superior responses in both content and style, with a response latency as low as 226 ms. Moreover, training LLaMA-Omni requires less than 3 days on 4 GPUs, enabling rapid development of speech interaction models based on the latest LLMs. In the future, we plan to explore enhancing the expressiveness of generated speech responses and improving real-time interaction capabilities.
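To make the data flow of the four components concrete, the following is a minimal sketch in Python, not the actual LLaMA-Omni implementation. Every name in it (`respond`, `toy_encoder`, `toy_adaptor`, `toy_llm`, `toy_decoder`) is a hypothetical placeholder, the toy components do no real speech or language processing, and the interface between the LLM and the speech decoder is simplified to one text token per step as an assumption. The sketch only illustrates two properties stated above: no transcription step sits between the speech input and the LLM, and text and speech are emitted together as a stream.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical type alias: a sequence of feature vectors.
Frames = List[List[float]]


def respond(
    waveform: List[float],
    speech_encoder: Callable[[List[float]], Frames],
    speech_adaptor: Callable[[Frames], Frames],
    llm: Callable[[Frames], Iterator[str]],
    speech_decoder: Callable[[str], List[int]],
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (text_token, speech_chunk) pairs for one spoken instruction."""
    # 1. Encode the raw speech; no intermediate transcript is produced.
    features = speech_encoder(waveform)
    # 2. Map the speech features into the LLM's input space.
    embeddings = speech_adaptor(features)
    # 3. The LLM generates the response step by step.
    for token in llm(embeddings):
        # 4. The streaming decoder emits speech for each step right away,
        #    so text and speech come out together instead of one after the other.
        yield token, speech_decoder(token)


# Toy stand-ins (assumptions for illustration only).
def toy_encoder(waveform: List[float]) -> Frames:
    return [[sample] for sample in waveform]


def toy_adaptor(frames: Frames) -> Frames:
    return frames[::2]  # keep every second frame


def toy_llm(embeddings: Frames) -> Iterator[str]:
    for index, _ in enumerate(embeddings):
        yield f"tok{index}"


def toy_decoder(token: str) -> List[int]:
    return [len(token)]  # placeholder for discrete speech units


if __name__ == "__main__":
    stream = respond([0.1, 0.2, 0.3, 0.4], toy_encoder, toy_adaptor, toy_llm, toy_decoder)
    for text_token, speech_chunk in stream:
        print(text_token, speech_chunk)
```

Because `respond` is a generator, a caller can start playing the first speech chunk as soon as the first pair is yielded, which is the property a low response latency depends on.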