NeuTTS Nano (English)
neuphonic/neutts-nano
published Nov 2025 · updated Feb 2026
NeuTTS Nano is a lightweight, on-device text-to-speech model with instant voice cloning, built for real-time speech synthesis on CPUs and edge devices.
specs
| Field | Value |
|---|---|
| Task | Text-to-Speech (TTS) |
| Architecture | Compact LM backbone + NeuCodec audio codec (single codebook) |
| Parameters | ~116.8M active, ~228.7M total |
| License | NeuTTS Open License 1.0 |
| Language | English only |
about this model
NeuTTS Nano is an English-language text-to-speech (TTS) model built for on-device generation with instant voice cloning, combining a compact language model backbone with a neural audio codec. Gigarouter hosts this model as a managed, OpenAI‑compatible API, enabling developers to integrate high‑quality speech synthesis without maintaining infrastructure.
Model Details
- Active parameters: ~116.8 M (backbone only); total parameters: ~228.7 M (backbone + tied embeddings/head).
- Context window: 2048 tokens (≈30 seconds of audio including the prompt).
- Audio codec: NeuCodec, a single‑codebook codec achieving low‑bitrate, high‑quality audio.
- Optimized for real‑time CPU inference. On a 2‑thread CPU the model generates speech about twice as fast as real time (roughly 2 seconds of audio per second of compute).
- Outputs are watermarked.
Throughput Benchmarks (Q4_0 Quantisation)
Token generation speed on four devices (tokens/s, CPU‑only unless noted):
- Galaxy A25 5G: 45 t/s
- AMD Ryzen 9 HX 370: 221 t/s
- iMac M4 (16 GB): 195 t/s
- NVIDIA RTX 4090 (GPU): 19,268 t/s
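To relate the token rates above to playback speed, the sketch below divides each figure by a codec token rate. The value of 50 audio tokens per second is an assumption that is not stated on this page, and `realtime_multiple` is a hypothetical helper name; the result ignores prompt processing and codec decoding time, so treat it as a rough upper bound rather than a measured real-time factor.

```python
# Assumption (not stated on this page): the codec emits about 50 audio tokens
# per second of speech. Change this value if the true rate differs.
CODEC_TOKENS_PER_SECOND = 50.0

# Q4_0 throughput figures quoted in the list above, in tokens per second.
THROUGHPUT_TOKENS_PER_SECOND = {
    "Galaxy A25 5G": 45,
    "AMD Ryzen 9 HX 370": 221,
    "iMac M4 (16 GB)": 195,
    "NVIDIA RTX 4090 (GPU)": 19268,
}


def realtime_multiple(tokens_per_second, codec_rate=CODEC_TOKENS_PER_SECOND):
    """Seconds of audio generated per second of wall-clock time."""
    if codec_rate <= 0:
        raise ValueError("codec_rate must be positive")
    return tokens_per_second / codec_rate


if __name__ == "__main__":
    for device, tps in THROUGHPUT_TOKENS_PER_SECOND.items():
        print(f"{device}: {realtime_multiple(tps):.2f}x real time")
```

A multiple above 1 means token generation alone keeps pace with playback; below 1, audio would have to be buffered before it can play without gaps.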
Comparison with NeuTTS-Air
| Model | Active Params | Total Params | License |
|---|---|---|---|
| NeuTTS-Air | ~360 M | ~552 M | Apache 2.0 |
| NeuTTS Nano | ~120 M | ~229 M | NeuTTS Open License 1.0 |
Voice Cloning
To clone a voice, provide a reference audio sample (mono, 16–44 kHz, 3–15 s, clean, .wav) and a text prompt. The model synthesises the given text in the style of that reference speaker.
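The reference-audio constraints above can be checked locally before a file is sent anywhere. This is a minimal sketch using Python's standard `wave` module; `check_reference_wav` is a hypothetical helper, and reading "44 kHz" as 44,100 Hz is an assumption. It cannot judge whether a recording is "clean", only its channel count, sample rate, and duration.

```python
import wave

MIN_RATE_HZ = 16_000
MAX_RATE_HZ = 44_100  # assumption: "44 kHz" is read as the common 44.1 kHz rate
MIN_SECONDS = 3.0
MAX_SECONDS = 15.0


def check_reference_wav(path):
    """Return a list of problems; an empty list means the file fits the stated limits."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, got {channels} channels")
    if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is outside 16-44 kHz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s is outside 3-15 s")
    return problems
```

For example, `check_reference_wav("speaker.wav")` (with a file name of your choosing) returns `[]` for a 5-second mono recording at 16 kHz, and one message per violated constraint otherwise.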
best for
- Instant voice cloning from a few seconds of audio
- Real-time speech synthesis on laptop-class CPUs
- On-device voice agents and assistants
- Privacy-sensitive applications where audio must stay local
FAQ
The model takes a reference audio sample (mono WAV, 3-15 seconds, 16-44 kHz) and a text string. It outputs synthesized speech as a 24 kHz WAV file.
The context window is 2048 tokens, corresponding to roughly 30 seconds of audio including the prompt.
In Q4_0 quantisation, throughput is 45 tokens/s on a Galaxy A25 5G, 221 tokens/s on an AMD Ryzen 9 HX 370, and 195 tokens/s on an iMac M4 (all CPU-only).
It uses the NeuTTS Open License 1.0, which allows free non-commercial and commercial use with attribution. Check the full license text for details.
Use the Gigarouter OpenAI-compatible endpoint with an API key. Pass the reference audio as a file and the text as a prompt to generate speech.
We're benchmarking and onboarding NeuTTS Nano (English) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.