OmniVoice
k2-fsa/OmniVoice
published Mar 2026 · updated Jul 2026
OmniVoice is a massively multilingual zero-shot text-to-speech model supporting over 600 languages.
specs
| Spec | Value |
| --- | --- |
| Task | Text-to-Speech (TTS) |
| Architecture | Diffusion language model-style discrete non-autoregressive (NAR) |
| Languages | 600+ |
| License | CC-BY-NC (model) |
about this model
OmniVoice is a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. Built on a diffusion language model-style discrete non-autoregressive architecture, it directly maps text to multi-codebook acoustic tokens, eliminating the performance bottlenecks of two-stage pipelines. The model is initialized from a pre-trained LLM to ensure high intelligibility and trained on a 581k-hour multilingual dataset curated entirely from open-source data.
Key capabilities
- Voice cloning from a short reference audio clip, delivering state-of-the-art quality.
- Voice design via controllable speaker attributes such as gender, age, pitch, dialect, and whisper.
- Fine-grained control including non-verbal symbols (e.g., [laughter]) and pronunciation correction using pinyin or phonemes.
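As a rough illustration of the non-verbal symbols mentioned above, the sketch below shows an input string with an inline `[laughter]` tag and a small pre-flight check that lists any bracketed symbols in the text. The helper names and the `KNOWN_SYMBOLS` set are hypothetical; `[laughter]` is the only symbol named here, and the tag pattern is an assumption about how such symbols are written.

```python
import re

# Only "laughter" is named in the description above. Extend this set only with
# symbols confirmed by the model's own documentation (assumption: tags are
# written as a single word in square brackets).
KNOWN_SYMBOLS = {"laughter"}

TAG_PATTERN = re.compile(r"\[([A-Za-z_]+)\]")


def find_symbols(text: str) -> list[str]:
    """Return the bracketed symbol names in the order they appear."""
    return TAG_PATTERN.findall(text)


def unknown_symbols(text: str) -> list[str]:
    """Return bracketed symbols that are not in KNOWN_SYMBOLS."""
    return [name for name in find_symbols(text) if name not in KNOWN_SYMBOLS]


prompt = "That was not what I expected [laughter] but it worked."
print(find_symbols(prompt))
print(unknown_symbols("Hold on [cough] one second."))
```

A check like this can catch typos in tags before the text is sent for synthesis; it says nothing about how the model itself handles an unrecognized symbol.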
Performance
The model achieves a real-time factor (RTF) as low as 0.025, meaning synthesis can run roughly 40 times faster than real time. In benchmark evaluations, OmniVoice delivers state-of-the-art performance across Chinese, English, and a range of multilingual TTS benchmarks.
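To make the RTF figure concrete, here is a small worked calculation. It assumes the common definition of RTF as synthesis time divided by the duration of the generated audio; the function names are illustrative, and the hardware behind the 0.025 figure is not stated here.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to generate audio_seconds of speech."""
    return audio_seconds * rtf


best_case_rtf = 0.025  # the "as low as" figure quoted above
print(f"speedup: {speedup_over_real_time(best_case_rtf):.1f}x")
print(f"one minute of audio: {synthesis_time(60.0, best_case_rtf):.2f} s of compute")
```

Because 0.025 is quoted as a best case, real workloads should be expected to land at a higher RTF, and therefore a lower speedup.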
OmniVoice is hosted as a managed, OpenAI-compatible API on GigaRouter, requiring no local installation or configuration. Developers can integrate it with a simple API call to access high-quality, multilingual TTS with voice cloning and design capabilities.
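A minimal sketch of what such a call might look like, using only the Python standard library. It assumes the hosted endpoint follows the shape of the OpenAI speech route (`POST /audio/speech` with `model` and `input` fields). The base URL below is a placeholder, and the model ID is assumed to match the repository name; neither is confirmed here. OpenAI's own API also expects a `voice` field, which this service may or may not require.

```python
import json
import urllib.request

# Placeholder values: the real base URL and model ID are not given in this page.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "k2-fsa/OmniVoice"


def build_speech_request(text: str, api_key: str,
                         base_url: str = BASE_URL,
                         model: str = MODEL_ID) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str, api_key: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)


# Building the request needs no network access, so it is safe to inspect.
req = build_speech_request("Hello from OmniVoice.", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Calling `synthesize(...)` would perform the actual network request; it is left uncalled here because the endpoint details above are placeholders.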
best for
- Zero-shot voice cloning from a short reference audio clip
- Voice design via speaker attributes (gender, age, pitch, dialect, whisper)
- Multilingual speech synthesis across 600+ languages
FAQ
Zero-shot voice cloning and voice design with state-of-the-art quality across 600+ languages.
The pre-trained model is released under CC-BY-NC due to training data constraints; code is Apache 2.0.
Real-time factor as low as 0.025, roughly 40x faster than real time.
Use the OpenAI-compatible endpoint with your GigaRouter API key and model ID.
Text and optional reference audio with transcription; also accepts non-verbal symbols and pronunciation correction via pinyin/phonemes.
We're benchmarking and onboarding OmniVoice as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.