OmniVoice

k2-fsa/OmniVoice

published Mar 2026 · updated Jul 2026

OmniVoice is a massively multilingual zero-shot text-to-speech model supporting over 600 languages.

est. price

~$0.0075 (estimated, set at launch)

downloads / mo

902.4K

specs

Task	Text-to-Speech (TTS)
Architecture	Diffusion language model-style discrete non-autoregressive (NAR)
Languages	600+ languages
License	CC-BY-NC (model)

about this model

OmniVoice is a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. Built on a diffusion language model-style discrete non-autoregressive architecture, it directly maps text to multi-codebook acoustic tokens, eliminating the performance bottlenecks of two-stage pipelines. The model is initialized from a pre-trained LLM to ensure high intelligibility and trained on a 581k-hour multilingual dataset curated entirely from open-source data.
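To give a feel for what "discrete non-autoregressive" decoding means in general, here is a toy sketch of iterative unmasking: every position starts masked, and each step commits the most confident proposals in parallel. This is a generic illustration and not OmniVoice's actual decoder. The stand-in predictor, the cosine schedule, the single token stream (rather than multiple codebooks), and all names are assumptions made for the example.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a real network: propose (token, confidence) for each masked slot."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def iterative_unmask(length, vocab_size=1024, steps=8, seed=0):
    """Fill `length` token slots in at most `steps` parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        proposals = toy_predict(tokens, vocab_size, rng)
        if not proposals:
            break  # everything is already decoded
        # Cosine schedule: how many slots may stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(proposals) - keep_masked)
        # Commit the proposals the (toy) model is most confident about.
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:n_commit]
        for i in most_confident:
            tokens[i] = proposals[i][0]
    return tokens


decoded = iterative_unmask(length=50)
print(len(decoded), MASK in decoded)
```

The point of the sketch is the cost profile: the number of model calls is bounded by `steps`, not by the sequence length, which is why non-autoregressive decoders can reach low real-time factors.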

Key capabilities

·Voice cloning from a short reference audio clip, delivering state-of-the-art quality.
·Voice design via controllable speaker attributes such as gender, age, pitch, dialect, and whisper.
·Fine-grained control, including non-verbal symbols (e.g., [laughter]) and pronunciation correction using pinyin or phonemes.

Performance

The model achieves a real-time factor (RTF) as low as 0.025, meaning it can generate audio about 40 times faster than real time. In benchmark evaluations, OmniVoice delivers state-of-the-art performance across Chinese, English, and a range of multilingual TTS benchmarks.
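As a quick check on the arithmetic, RTF is processing time divided by the duration of the generated audio, and the speedup over real time is its reciprocal. The sketch below uses illustrative timings (0.25 s of compute for 10 s of audio), which are assumptions chosen to reproduce an RTF of 0.025 and are not measured results.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Illustrative numbers only: 0.25 s of compute for a 10 s utterance.
rtf = real_time_factor(processing_seconds=0.25, audio_seconds=10.0)
print(f"RTF = {rtf:.3f}, speedup = {speedup_over_real_time(rtf):.0f}x")
```

An RTF below 1.0 means synthesis finishes before the audio would finish playing; at 0.025, a one-minute clip would take roughly 1.5 seconds to generate.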

GigaRouter is onboarding OmniVoice as a managed, OpenAI-compatible API that requires no local installation or configuration. Once it is live, developers will be able to integrate it with a single API call to access multilingual TTS with voice cloning and voice design.

best for

·Zero-shot voice cloning from a short reference audio clip
·Voice design via speaker attributes (gender, age, pitch, dialect, whisper)
·Multilingual speech synthesis across 600+ languages

FAQ

What is OmniVoice best used for?

Zero-shot voice cloning and voice design with state-of-the-art quality across 600+ languages.

What is the model license?

The pre-trained model is released under CC-BY-NC due to training data constraints; code is Apache 2.0.

How fast is inference?

The real-time factor is as low as 0.025, or about 40x faster than real time.

How do I call OmniVoice via the GigaRouter API?

Once the model is live, use the OpenAI-compatible endpoint with your GigaRouter API key and the model ID.
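A minimal sketch of what such a call could look like, using only the Python standard library. It follows the shape of the OpenAI speech endpoint (`POST /audio/speech` with `model`, `input`, and `voice` fields). The base URL, the model ID string, and the `voice` value are placeholders assumed for illustration; the real values would come from the provider's documentation once the model is available.

```python
import json
import urllib.request

# Placeholders, not real values: substitute the provider's documented ones.
BASE_URL = "https://api.example.invalid/v1"  # hypothetical base URL
MODEL_ID = "k2-fsa/OmniVoice"                # assumed to match the listing name


def build_speech_request(text: str, api_key: str, voice: str = "default") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("Hello from OmniVoice.", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen(request)` and write the response bytes to an audio file; that step is left out here because the endpoint is a placeholder.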

What input formats are supported?

Text and optional reference audio with transcription; also accepts non-verbal symbols and pronunciation correction via pinyin/phonemes.
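One way to picture these inputs is as a small container with the text plus an optional reference clip and its transcription. The sketch below is hypothetical: the field names, the rule that the clip and its transcription must be supplied together, and the bracket pattern used to find symbols like [laughter] are assumptions for illustration. The syntax for pinyin or phoneme corrections is not specified here, so it is not modeled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed pattern for non-verbal symbols written like [laughter].
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


@dataclass
class SynthesisInput:
    """Hypothetical container for one TTS request's inputs."""

    text: str
    reference_audio_path: Optional[str] = None   # clip of the voice to clone
    reference_transcript: Optional[str] = None   # what is said in that clip

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValueError("text must not be empty")
        # Assumption: a reference clip is only useful with its transcription.
        if (self.reference_audio_path is None) != (self.reference_transcript is None):
            raise ValueError("reference audio and its transcription go together")

    def non_verbal_tags(self) -> List[str]:
        """Return the bracketed non-verbal symbols found in the text, in order."""
        return TAG_PATTERN.findall(self.text)


example = SynthesisInput(text="That was unexpected [laughter] but welcome.")
print(example.non_verbal_tags())
```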

not yet live

We're benchmarking and onboarding OmniVoice as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice