
MOSS TTS Nano

OpenMOSS-Team/MOSS-TTS-Nano-100M

published Apr 2026 · updated Apr 2026

MOSS TTS Nano is a tiny multilingual speech generation model that produces 48 kHz stereo audio from text, supports zero-shot voice cloning, and runs on CPU without a GPU.

status: coming soon
API providers: 0
downloads / mo: 83.5K
license: apache-2.0

specs

Task: Text-to-Speech (TTS)
Architecture: Autoregressive Audio Tokenizer + LLM (CAT-based)
Parameters: 0.1B (100 million)
Output Audio: 48 kHz, 2-channel stereo
License: Apache-2.0 (see LICENSE file)

about this model

MOSS-TTS-Nano is a multilingual text-to-speech model that generates 48 kHz stereo speech using a pure autoregressive Audio Tokenizer + LLM pipeline, with only 0.1B parameters.

The model is designed for real-time speech generation with a minimal footprint. It supports streaming inference and can run on a 4-core CPU without a GPU. Voice cloning is the primary workflow: the model clones a voice zero-shot from a reference audio prompt. An ONNX CPU version is also available; it removes the PyTorch dependency and delivers nearly 2x the processing efficiency of the original PyTorch version.

MOSS-TTS-Nano is built on the MOSS-Audio-Tokenizer-Nano backbone: a lightweight tokenizer with approximately 20 million parameters that compresses 48 kHz stereo audio into a 12.5 Hz token stream using RVQ with 16 codebooks, supporting variable bitrates from 0.125 kbps to 4 kbps.
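As a rough illustration of what those figures imply, the sketch below works through the arithmetic using only the numbers quoted above. The function names are illustrative, and rounding the frame count up is an assumption about padding that the source does not confirm.

```python
import math

# Figures quoted in the description above; nothing here is measured from the model.
FRAME_RATE_HZ = 12.5      # tokenizer frames per second of audio
NUM_CODEBOOKS = 16        # RVQ codebooks per frame
SAMPLE_RATE_HZ = 48_000   # samples per second, per channel
CHANNELS = 2              # stereo


def frames_for(seconds: float) -> int:
    """Tokenizer frames covering `seconds` of audio (rounding up is an assumption)."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def codes_for(seconds: float, codebooks: int = NUM_CODEBOOKS) -> int:
    """Total RVQ codes when `codebooks` codebooks are used for every frame."""
    return frames_for(seconds) * codebooks


def bits_per_frame(bitrate_kbps: float) -> float:
    """Bits available per frame at a given bitrate."""
    return bitrate_kbps * 1000 / FRAME_RATE_HZ


def samples_per_frame() -> float:
    """Raw stereo samples summarized by a single tokenizer frame."""
    return SAMPLE_RATE_HZ * CHANNELS / FRAME_RATE_HZ


print(frames_for(10))         # 125 frames for a 10-second clip
print(codes_for(10))          # 2000 codes with all 16 codebooks
print(bits_per_frame(0.125))  # 10.0 bits per frame at the lowest bitrate
print(bits_per_frame(4))      # 320.0 bits per frame at the highest bitrate
print(samples_per_frame())    # 7680.0 raw samples per frame
```

A 10-second clip therefore becomes 125 frames. The stated bitrate range corresponds to between 10 and 320 bits per frame.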


Supported Languages

Language    Code    Language          Code    Language    Code
Chinese     zh      English           en      German      de
Spanish     es      French            fr      Japanese    ja
Italian     it      Hungarian         hu      Korean      ko
Russian     ru      Persian (Farsi)   fa      Arabic      ar
Polish      pl      Portuguese        pt      Czech       cs
Danish      da      Swedish           sv      Greek       el
Turkish     tr

[Figure: MOSS-TTS-Nano pipeline diagram]

The model also supports long-form text input with automatic chunked voice cloning, and is designed for simple local setup, web serving, and lightweight product integration. Finetuning code and a local reader application are available separately.
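The source does not describe how the automatic chunking works. As a minimal sketch of the general idea, the code below splits long text at sentence boundaries and packs sentences greedily into chunks under a character budget. `split_sentences`, `chunk_text`, and the `max_chars` default are hypothetical and are not the model's actual chunker.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, dropping empty fragments."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars`.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    Sentences are joined with a space, which is a simplification for CJK text.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

In a pipeline of this kind, each chunk would presumably be synthesized with the same reference audio prompt and the results concatenated, so the cloned voice stays consistent across the whole text.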

[Figure: architecture diagram of MOSS-Audio-Tokenizer-Nano]

FAQ

What is MOSS TTS Nano best used for?

It is best for zero-shot voice cloning and real-time multilingual speech synthesis, especially in CPU-friendly, low-latency deployments.

What languages does MOSS TTS Nano support?

It supports 20 languages including Chinese, English, German, Spanish, French, Japanese, Korean, and more.

Can MOSS TTS Nano run on CPU?

Yes. It runs streaming inference on a 4-core CPU without a GPU, and an ONNX version is nearly 2x as fast as the original on a single CPU core.

What is the output audio format?

The output is 48 kHz stereo (2-channel) WAV audio, generated via voice cloning with a reference audio prompt.

How do I use MOSS TTS Nano via the gigarouter API?

The model is not yet live on gigarouter. Once it is, send a request to the gigarouter OpenAI-compatible endpoint with your API key; the model accepts text and, optionally, reference audio for voice cloning.
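Since the hosted endpoint is not available yet, the following is only a sketch of what such a request might look like. It assumes the request shape of the OpenAI speech API (`POST /audio/speech` with `model`, `input`, `voice`, and `response_format`). The base URL is a placeholder, the model identifier is taken from this page and may differ on the live API, the `voice` value is hypothetical, and no reference-audio field is shown because the source does not say how one would be passed. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"    # placeholder, not the real gigarouter URL
MODEL_ID = "OpenMOSS-Team/MOSS-TTS-Nano-100M"  # identifier from this page; may differ on the API


def build_speech_request(
    text: str,
    api_key: str,
    voice: str = "default",  # hypothetical value; real voice names are unknown
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build, but do not send, an OpenAI-style text-to-speech request."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("Hello from MOSS TTS Nano.", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
# POST https://api.example.invalid/v1/audio/speech
```

Once a real endpoint is published, passing the request to `urllib.request.urlopen` and writing the response bytes to a `.wav` file would complete the round trip. Check the field names against the provider's own API reference before relying on them.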

not yet live

We're benchmarking and onboarding MOSS TTS Nano as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
