Kimi-Audio 7B Instruct

moonshotai/Kimi-Audio-7B-Instruct

published Apr 2025 · updated May 2025

Kimi-Audio 7B Instruct generates natural speech from text and also supports audio understanding, conversation, and other audio tasks.

est. price

~$0.0075

· estimated, set at launch

downloads / mo

79K

license

MIT

specs

Task	Text-to-Speech (also audio understanding, generation, conversation)
Architecture	Hybrid audio input (continuous acoustic + discrete semantic tokens) with LLM core and parallel heads for text and audio token generation
Parameters	9.8B
License	MIT
Languages	English, Chinese

about this model

Kimi-Audio-7B-Instruct is a text-to-speech (TTS) model built on a universal audio foundation model that handles audio understanding, generation, and conversation within a single framework. gigarouter is onboarding it as a managed, OpenAI-compatible API, so that developers can integrate speech synthesis without managing infrastructure.

Capabilities

The model supports English and Chinese. It can generate speech from text prompts, or produce both text and audio outputs in conversational exchanges. Beyond TTS, it is capable of automatic speech recognition (ASR), audio question answering (AQA), automated audio captioning (AAC), speech emotion recognition (SER), and sound event or scene classification. On gigarouter, the primary task it is offered for is text-to-speech.

Architecture and Training

Kimi-Audio-7B-Instruct takes a hybrid audio input that combines continuous acoustic vectors with discrete semantic tokens. A large language model core processes this input and feeds parallel output heads for text and audio tokens. The model was pre-trained on over 13 million hours of diverse audio data (speech, music, environmental sounds) together with text data. It has roughly 9.77 billion parameters, which comes to about 19.5 GB of weights in BF16. A chunk-wise streaming detokenizer based on flow matching enables low-latency inference during speech generation.
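The 19.5 GB figure follows from the parameter count, since BF16 stores each parameter in 2 bytes. The short check below uses the exact count quoted in the FAQ on this page. It covers weights only; memory for activations and caches at inference time is not estimated here.

```python
# Back-of-the-envelope weight size for a BF16 checkpoint.
PARAM_COUNT = 9_766_336_640   # parameter count quoted on this page
BYTES_PER_BF16 = 2            # bfloat16 is a 16-bit format


def weight_bytes(params: int, bytes_per_param: int = BYTES_PER_BF16) -> int:
    """Return the storage needed for the weights alone, in bytes."""
    return params * bytes_per_param


size = weight_bytes(PARAM_COUNT)
print(f"{size / 1e9:.1f} GB")     # decimal gigabytes: 19.5 GB
print(f"{size / 2**30:.1f} GiB")  # binary gibibytes: 18.2 GiB
```

The two units differ by about 7%, which is worth keeping in mind when comparing this number against GPU memory reported in GiB.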

Performance

According to the model's technical report, Kimi-Audio achieves state-of-the-art results across audio benchmarks spanning recognition, understanding, and generation tasks. The instruct variant is additionally tuned for instruction following, and the model is released under the MIT license.

best for

·Building conversational agents that can respond with natural speech
·Real-time text-to-speech for applications requiring low-latency audio generation
·Handling diverse audio tasks (ASR, audio captioning, emotion recognition) in a single model

FAQ

What is the output sample rate for generated speech?

The model outputs audio at 24 kHz.
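If you end up with raw samples at this rate, they need a container that records the 24 kHz rate so players do not pitch-shift the audio. The sketch below uses only the Python standard library and assumes 16-bit mono PCM; this page does not state the sample width, channel count, or the format in which audio is returned, so treat those as assumptions.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000  # output rate stated above


def pcm16_to_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # assumption: mono
        wav.setsampwidth(2)            # assumption: 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def duration_seconds(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of 16-bit mono PCM: 2 bytes per sample."""
    return len(pcm) / 2 / sample_rate


silence = b"\x00\x00" * SAMPLE_RATE_HZ   # one second of silence
wav_bytes = pcm16_to_wav(silence)
print(duration_seconds(silence))         # 1.0
```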

How do I call this model via the gigarouter API?

Once the model is live, send requests to gigarouter's OpenAI-compatible endpoint with your gigarouter API key, specifying the model name (listed on this page as moonshotai/Kimi-Audio-7B-Instruct) and the text to synthesize.
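As a rough illustration, the request might look like the sketch below, which mirrors the shape of OpenAI's speech endpoint (`POST /v1/audio/speech` with `model`, `input`, `voice`, and `response_format`). The base URL is a placeholder, and the voice name and response format are assumptions; none of them are confirmed on this page. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical values: replace with the real ones from the provider.
BASE_URL = "https://api.gigarouter.example/v1"   # placeholder host, not a real URL
MODEL = "moonshotai/Kimi-Audio-7B-Instruct"      # model id as listed on this page


def build_speech_request(text: str, api_key: str,
                         voice: str = "default") -> urllib.request.Request:
    """Build (but do not send) a request shaped like OpenAI's /audio/speech."""
    payload = {
        "model": MODEL,
        "input": text,
        "voice": voice,             # assumption: voice names are provider-specific
        "response_format": "wav",   # assumption: this format is supported
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from Kimi-Audio.", api_key="YOUR_GIGAROUTER_API_KEY")
print(req.get_method(), req.full_url)
```

If the endpoint behaves like OpenAI's, passing `req` to `urllib.request.urlopen` and reading the response would return the audio bytes.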

Which languages does Kimi-Audio 7B Instruct support?

It supports English and Chinese.

What is the parameter count of this model?

It has 9.8 billion parameters (9,766,336,640).

What license does the model use?

It is released under the MIT License.

not yet live

We're benchmarking and onboarding Kimi-Audio 7B Instruct as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice