Voxtral

Paper · arXiv 2507.13264 · Published July 17, 2025

We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under the Apache 2.0 license.

Introduction. This paper describes Voxtral Mini and Voxtral Small, a pair of multimodal language models trained to understand both speech and text, released with open weights under an Apache 2.0 license. Voxtral is pretrained on a large-scale corpus of audio and text documents, and subsequently instruction-tuned on real and synthetic data. It can respond directly to audio (or text) and answer questions about audio files. With a 32K-token context window, Voxtral can process audio files up to 40 minutes long. Compared with similarly sized models in the same evaluation setting, we find that Voxtral delivers strong audio reasoning capabilities without sacrificing text-only performance. Its performance is state-of-the-art for speech transcription and translation, outperforming other open-weights and closed models. In speech question-answering (QA) and summarization, it performs comparably with closed models of a similar price class, such as GPT-4o mini [Hurst et al., 2024] and Gemini 2.5 Flash [Comanici et al., 2025].
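
The relationship between the 32K context window and the 40-minute audio limit can be checked with simple arithmetic. The sketch below is illustrative only. It assumes that "32K" means 32,768 tokens and that audio consumes a fixed 12.5 tokens per second. Neither figure is stated in the text above, and the function names are hypothetical.

```python
import math

# Assumptions (not stated in the text above): "32K" means 32 * 1024 tokens,
# and audio is encoded at a fixed rate of 12.5 tokens per second.
CONTEXT_TOKENS = 32 * 1024
AUDIO_TOKENS_PER_SECOND = 12.5


def audio_tokens(minutes: float, rate: float = AUDIO_TOKENS_PER_SECOND) -> int:
    """Context tokens consumed by `minutes` of audio at `rate` tokens/second."""
    return math.ceil(minutes * 60 * rate)


def remaining_text_tokens(
    minutes: float,
    context: int = CONTEXT_TOKENS,
    rate: float = AUDIO_TOKENS_PER_SECOND,
) -> int:
    """Tokens left for prompts and answers after the audio is placed in context.

    A negative result means the audio alone does not fit in the window.
    """
    return context - audio_tokens(minutes, rate)


if __name__ == "__main__":
    for minutes in (10, 40, 45):
        print(minutes, audio_tokens(minutes), remaining_text_tokens(minutes))
```

Under these assumptions, 40 minutes of audio occupies 30,000 tokens and leaves a few thousand tokens for the text prompt and the response, while 45 minutes would exceed the window. A different token rate or context size would shift the limit accordingly.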

Discussion / Conclusion. This paper presented Voxtral Mini and Voxtral Small, a pair of open-weights audio chat models. It demonstrated their capabilities in understanding spoken audio and text on both existing and new benchmarks. Their strengths across a wide array of speech tasks, strong instruction following, and multilingual prowess make them highly versatile for complex multimodal tasks. Both models are released under the Apache 2.0 license.
