
On Thursday, the French AI startup Mistral unveiled a new open-source text-to-speech model aimed at enterprise use cases such as customer care and voice AI assistants. The model lets businesses build voice assistants for sales and customer engagement, which puts Mistral in direct competition with companies like ElevenLabs, Deepgram, and OpenAI.
Voxtral TTS is the first open-source text-to-speech (TTS) model from Mistral AI. The model was introduced as lightweight enough to operate locally on edge devices like laptops, smartphones, and smartwatches.
Voxtral TTS supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
“A speech model has been requested by our clients. Therefore, we developed a compact speech model that can be used on laptops, smartphones, smartwatches, and other edge devices,” Pierre Stock, vice president of science operations at Mistral AI, said in a phone conversation with a news outlet. “It offers state-of-the-art performance at a fraction of the cost of anything else on the market.”
According to Mistral, the new model can capture features such as subtle accents, inflections, intonations, and irregularities in speech flow from a voice sample of less than five seconds. The model, which is based on Ministral 3B, can switch between languages without losing the voice’s characteristics, which suits use cases like dubbing or real-time translation. According to Stock, the company wanted the model to sound human rather than robotic.
The company claims that the model was designed for real-time performance. Its time-to-first-audio (TTFA), which measures how soon the model begins “speaking” after receiving input, is 90 ms for a 10-second sample of 500 characters. With a real-time factor (RTF) of 6x, meaning it generates audio about six times faster than playback speed, the model can render a 10-second clip in about 1.6 seconds.
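To make the speed figures concrete, here is a minimal sketch of the arithmetic. It assumes "6x" means audio is generated six times faster than it plays back. Some benchmarks define RTF the other way round, as processing time divided by audio duration. The function name and parameters are illustrative and are not part of any Mistral API.

```python
def render_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Estimate wall-clock synthesis time for a clip.

    Assumes `speedup` is a faster-than-real-time multiplier
    (6.0 means six seconds of audio per second of compute).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Figures quoted in the article: a 10-second clip at a 6x real-time factor.
clip_seconds = 10.0
speedup = 6.0
ttfa_seconds = 0.090  # 90 ms time-to-first-audio, as reported

total = render_time_seconds(clip_seconds, speedup)
print(f"Full clip rendered in about {total:.2f} s")          # about 1.67 s
print(f"First audio available after {ttfa_seconds * 1000:.0f} ms")
```

Under this reading, a 10-second clip works out to roughly 1.67 seconds of compute, in line with the figure of about 1.6 seconds quoted by the company. Because playback can begin once the first audio arrives, a listener would wait for the 90 ms TTFA rather than for the full render.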
The key features of Voxtral TTS are:
- Edge-friendly: 4 billion parameters, runs on just 3 GB of RAM.
- Low latency: 90 ms time-to-first-audio for real-time use.
- Voice cloning: Adapts to any voice with under 5 seconds of audio.
- Multilingual: Supports nine languages, including English, Spanish, Hindi, and Arabic.
- Expressive: Delivers human-like speech with emotion and varied tone.
Mistral introduced two transcription models earlier this year, one for large-scale batch processing and the other for low-latency real-time use cases. With the new speech model, the company is likely aiming to offer businesses a complete range of voice solutions.
“We intend to create an end-to-end platform that can manage multimodal input streams, such as text, audio, and images, as well as output. The primary advantage of that is that an end-to-end agentic system that allows audio as an input or output gives you a lot more information,” according to Stock.
Mistral is betting that because its speech models are open source and customizable, businesses will be more likely to choose them over rivals’ offerings.
Voxtral TTS rounds out Mistral’s suite of voice AI products, following the company’s recent release of speech-to-text (transcription) models, including Voxtral Realtime and Voxtral Mini Transcribe V2. Because the model is released with open weights under an Apache 2.0 license, it can be deployed privately and on-device without relying on the cloud.







