Kimi-Audio 7B Instruct
moonshotai/Kimi-Audio-7B-Instruct
published Apr 2025 · updated May 2025
Kimi-Audio 7B Instruct is a text-to-speech model that generates natural speech from text, also supporting audio understanding, conversation, and other audio tasks.
specs
| Task | Text-to-Speech (also audio understanding, generation, conversation) |
| Architecture | Hybrid audio input (continuous acoustic + discrete semantic tokens) with LLM core and parallel heads for text and audio token generation |
| Parameters | 9.8B |
| License | MIT |
| Languages | English, Chinese |
about this model
Kimi-Audio-7B-Instruct is a text-to-speech (TTS) model built on a universal audio foundation model that handles audio understanding, generation, and conversation within a single framework. It is hosted by gigarouter as a managed, OpenAI-compatible API, allowing developers to integrate speech synthesis without managing infrastructure.
Capabilities
The model supports both English and Chinese and can generate speech from text prompts or produce text and audio outputs in conversational exchanges. Beyond TTS, it is capable of automatic speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), and sound event or scene classification. The primary deployed task on gigarouter, however, is text-to-speech.
Architecture and Training
Kimi-Audio-7B-Instruct employs a hybrid audio input combining continuous acoustic vectors with discrete semantic tokens, processed by a large language model core with parallel output heads for text and audio tokens. The model was pre-trained on over 13 million hours of diverse audio data (speech, music, environmental sounds) along with text data. It has roughly 9.77 billion parameters, which occupy about 19.5 GB in BF16. A chunk-wise streaming detokenizer based on flow matching enables low-latency inference during speech generation.
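The 19.5 GB figure can be checked from the parameter count and the two-byte width of BF16. The snippet below is a minimal sketch of that arithmetic; the function name is illustrative, and the result covers raw weights only, not activations or any runtime overhead.

```python
# Exact parameter count as listed in the FAQ on this page.
PARAM_COUNT = 9_766_336_640
# BF16 stores each parameter in 16 bits, i.e. 2 bytes.
BYTES_PER_BF16 = 2


def weight_footprint_gb(param_count: int, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return param_count * bytes_per_param / 1e9


print(f"{weight_footprint_gb(PARAM_COUNT):.1f} GB")
```

Memory needed at inference time will be higher than this, since the estimate ignores activations and caches.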
Performance
According to the model’s technical report, Kimi-Audio achieves state-of-the-art results on a wide range of audio benchmarks spanning recognition, understanding, and generation tasks. The model is released under the MIT license and has been tuned for instruction following in the instruct variant.
best for
- Building conversational agents that can respond with natural speech
- Real-time text-to-speech for applications requiring low-latency audio generation
- Handling diverse audio tasks (ASR, audio captioning, emotion recognition) in a single model
FAQ
The model outputs audio at 24 kHz.
To call the model, use the OpenAI-compatible endpoint with your gigarouter API key, specifying the model name and appropriate input parameters.
It supports English and Chinese.
It has 9.8 billion parameters (9,766,336,640).
It is released under the MIT License.
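As a rough illustration of calling an OpenAI-compatible endpoint with an API key and model name, the sketch below builds (but does not send) a speech request using only the Python standard library. The base URL, the `/audio/speech` path, and the `input` field are assumptions modeled on OpenAI's speech API, not details confirmed by this page; the hosted API may require different or additional fields, such as a voice selector.

```python
import json
import urllib.request

# Hypothetical placeholder: the real base URL is not given on this page.
BASE_URL = "https://api.gigarouter.example/v1"
# Model identifier as listed at the top of this page.
MODEL = "moonshotai/Kimi-Audio-7B-Instruct"


def build_speech_request(text: str, api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request in the style of OpenAI's speech API (assumed shape)."""
    payload = {"model": MODEL, "input": text}  # field names are assumptions
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from Kimi-Audio.", api_key="YOUR_API_KEY")

# Sending is left commented out because the endpoint above is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

Building the request separately from sending it keeps the payload easy to inspect and adjust once the actual endpoint details are known.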
We're benchmarking and onboarding Kimi-Audio 7B Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.