
Cohere, an enterprise AI company, released its first speech model on Thursday. Transcribe is an open-source model built specifically for automatic speech recognition (ASR), with applications such as note-taking and speech analysis.
The model is optimized for high-accuracy transcription in business settings. At just 2 billion parameters it is relatively light, and it is designed to run on consumer-grade GPUs for those who wish to self-host it. It currently supports fourteen languages: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese, and Arabic.
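To see why a 2-billion-parameter model is plausible on consumer hardware, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. The numeric precisions below are assumptions for illustration; the article does not say which precision Transcribe ships in, and real usage also includes activations and audio buffers.

```python
# Rough estimate of memory needed to hold model weights only.
# Assumption: the precisions listed here are illustrative, not confirmed by Cohere.
PARAMS = 2_000_000_000  # 2 billion parameters, as reported

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gigabytes(params: int, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_gigabytes(PARAMS, dtype):.1f} GB")
```

Under these assumptions, half-precision weights would take roughly 4 GB, which is within the memory of many consumer GPUs.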
According to Cohere, Transcribe posts an average word error rate (WER) of 5.42% on the Hugging Face Open ASR leaderboard, outperforming models such as Zoom Scribe v1, IBM Granite 4.0 1B, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B Speech.
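For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition; the function name is hypothetical, and leaderboards typically apply extra text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four gives a WER of 0.25 (25%).
    print(word_error_rate("the quick brown fox", "the quick brown box"))
```

By this definition, a WER of 5.42% corresponds to a little over five word errors per hundred reference words on average.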
According to the company, when human evaluators rated Transcribe’s transcriptions for correctness, coherence, and usability, it achieved an average win rate of 61% against rival models. However, the model lagged behind its competitors when transcribing Portuguese, German, and Spanish.
According to Cohere, Transcribe can process 525 minutes of audio per minute of compute, which is high for a model of its class.
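The throughput figure above can be turned into a rough processing-time estimate. This is simple arithmetic on Cohere's claimed number, not a benchmark; the function name is hypothetical, and actual speed will depend on hardware and batch size.

```python
# Claimed throughput: minutes of audio processed per minute of compute.
CLAIMED_SPEEDUP = 525.0


def processing_seconds(audio_minutes: float, speedup: float = CLAIMED_SPEEDUP) -> float:
    """Estimate compute time in seconds for a given amount of audio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_minutes * 60.0 / speedup


if __name__ == "__main__":
    # A one-hour recording at the claimed rate.
    print(f"{processing_seconds(60):.1f} seconds")
```

At the claimed rate, an hour-long recording would take about 6.9 seconds to transcribe.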
The company is offering the model for free through its API and plans to integrate Transcribe into North, its enterprise agent orchestration platform. The model will also be available on Model Vault, Cohere’s managed inference platform.
Speech recognition models are becoming increasingly popular as demand rises for note-taking and dictation apps such as Granola and Wispr Flow.
Cohere reportedly told investors earlier this year that it would generate $240 million in recurring revenue in 2025. Aidan Gomez, the company’s CEO, has been quoted as saying that the company would go public “soon.”
The key features of Cohere Transcribe are outlined below:
- Architecture: 2-billion-parameter conformer-based encoder-decoder model.
- Performance: #1 on Hugging Face Open ASR Leaderboard with 5.42% WER, outperforming Whisper and ElevenLabs.
- Languages: Supports 14 languages, including English, Chinese, Japanese, Arabic, and major European languages.
- Efficiency: Runs on consumer GPUs; processes 525 minutes of audio per minute of compute.
- License: Apache 2.0 for open use and modification.
The model is available for download on Hugging Face, through Cohere’s free API for experimentation, and on Model Vault for private cloud inference. It will soon be integrated into Cohere’s North platform for enterprise voice-powered workflows.
Although the model is highly accurate, it has some current limitations: it does not support timestamps, speaker diarization (the ability to distinguish between different speakers), or automatic language detection (users must supply the language code).
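Because the language has to be supplied by the caller, an application built on Transcribe would likely want to validate it before sending audio. The sketch below shows one way to do that for the fourteen supported languages. It assumes two-letter ISO 639-1 codes; the exact code format the model expects is not stated in the article, and the helper name is hypothetical.

```python
# The fourteen supported languages, keyed by ISO 639-1 code.
# Assumption: the model accepts ISO 639-1 codes; check the model card to confirm.
SUPPORTED_LANGUAGES = {
    "en": "English", "fr": "French", "de": "German", "it": "Italian",
    "es": "Spanish", "pt": "Portuguese", "el": "Greek", "nl": "Dutch",
    "pl": "Polish", "zh": "Chinese", "ja": "Japanese", "ko": "Korean",
    "vi": "Vietnamese", "ar": "Arabic",
}


def require_language_code(code: str) -> str:
    """Normalize a language code and reject unsupported languages early."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"unsupported language {code!r}; expected one of: {supported}")
    return normalized


if __name__ == "__main__":
    print(require_language_code(" EN "))
```

Failing early on an unsupported code avoids spending compute on audio the model cannot transcribe correctly.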







