Fish Audio S2 Pro

fishaudio/s2-pro

published Mar 2026 · updated Mar 2026

Fish Audio S2 Pro is a text-to-speech model that generates expressive, multi-speaker speech with fine-grained inline control of prosody and emotion via natural-language tags.

est. price: ~$0.0075 (estimated, set at launch)
API providers: 0
downloads / mo: 434.2K
license: other

specs

Task: Text-to-Speech (TTS)
Architecture: Dual-Autoregressive Transformer (Slow AR 4B, Fast AR 400M) with RVQ audio codec
Parameters: 4.4B total
License: Fish Audio Research License (research/non-commercial free; commercial requires separate license)
Languages: 80+ languages, including English, Japanese, Chinese, Korean, Spanish, French, and German

about this model

Fish Audio S2 Pro is a text-to-speech (TTS) model that combines a dual-autoregressive (Dual-AR) transformer architecture with reinforcement learning alignment to enable fine-grained inline control of prosody, emotion, and voice style. It was trained on over 10 million hours of audio across 80+ languages. A 4-billion-parameter "Slow AR" decoder predicts the primary semantic codebook along the time axis, and a 400-million-parameter "Fast AR" decoder then generates the remaining 9 residual codebooks at each time step. This split preserves acoustic fidelity while keeping inference efficient.
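
The division of labor between the two decoders can be pictured as a nested loop. The sketch below is a toy illustration only, not Fish Audio's implementation: `slow_ar_step` and `fast_ar_step` are hypothetical stand-ins that return random codes, and the codebook size of 1024 is an assumed placeholder. Only the shape of the computation (one semantic code plus 9 residual codes per time step) comes from the description above.

```python
import random

NUM_RESIDUAL_CODEBOOKS = 9   # from the description: 9 residual codebooks per step
CODEBOOK_SIZE = 1024         # assumption: placeholder vocabulary size


def slow_ar_step(history, rng):
    """Hypothetical stand-in for the 4B Slow AR: one semantic code per time step."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_code, residuals_so_far, rng):
    """Hypothetical stand-in for the 400M Fast AR: next residual code in the frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_steps, seed=0):
    """Return num_steps frames, each holding 1 semantic + 9 residual codes."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_steps):
        # Outer loop: autoregression along the time axis.
        semantic = slow_ar_step(frames, rng)
        residuals = []
        for _ in range(NUM_RESIDUAL_CODEBOOKS):
            # Inner loop: autoregression across codebooks within one frame.
            residuals.append(fast_ar_step(semantic, residuals, rng))
        frames.append([semantic] + residuals)
    return frames


frames = generate_frames(num_steps=4)
print(len(frames), len(frames[0]))
```

Each frame ends up with 10 codes. In a real system the smaller decoder would run the inner loop, so the larger model would be invoked only once per time step.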

Key capabilities

  • Inline control through natural language: Users embed free-form textual instructions (e.g., [whisper in small voice], [professional broadcast tone]) directly in the input text. Over 15,000 unique tags are supported, allowing open-ended expression at the word level without predefined labels.
  • Multilingual support: Tier-1 languages are Japanese, English, and Chinese; Tier-2 includes Korean, Spanish, Portuguese, Arabic, Russian, French, and German, with 70+ additional languages.
  • Dual-AR architecture: Because the architecture is structurally isomorphic to autoregressive LLMs, it inherits all SGLang-native serving optimizations: continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
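
To make the inline-tag format concrete, here is a small sketch that separates bracketed instructions from the spoken text. The helper name `split_tags` and the regular expression are illustrative assumptions. The model is described as consuming the tagged string directly, so nothing here is part of the Fish Audio or gigarouter API. A helper like this might be useful for logging or validating input before sending it.

```python
import re

# Matches one [bracketed] instruction with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text):
    """Return (tags, plain_text) for a string with [bracketed] inline tags."""
    tags = [m.group(1).strip() for m in TAG_PATTERN.finditer(text)]
    plain = TAG_PATTERN.sub("", text)
    # Collapse the whitespace left behind where tags were removed.
    plain = re.sub(r"\s+", " ", plain).strip()
    return tags, plain


tagged = "[professional broadcast tone] Good evening. [whisper in small voice] Stay close."
tags, plain = split_tags(tagged)
print(tags)
print(plain)
```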

Production streaming performance (single NVIDIA H200 GPU)

  • Real-time factor (RTF): 0.195
  • Time-to-first-audio: ~100 ms
  • Throughput: 3,000+ acoustic tokens/s while maintaining RTF below 0.5
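
As a rough guide to reading these figures, the arithmetic below assumes the conventional definition of real-time factor: synthesis time divided by the duration of the audio produced. The function names are illustrative, and the numbers are simple consequences of the quoted RTF, not additional measurements.

```python
def synthesis_seconds(audio_seconds, rtf):
    """Compute time needed to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


def realtime_speedup(rtf):
    """How many times faster than playback the synthesis runs."""
    return 1.0 / rtf


RTF = 0.195  # figure quoted above for a single H200

print(round(synthesis_seconds(60.0, RTF), 2))  # seconds to synthesize one minute of audio
print(round(realtime_speedup(RTF), 2))         # speedup relative to real time
```

Under that definition, one minute of audio takes about 11.7 seconds to synthesize, roughly 5.1 times faster than playback.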

[Figure: architecture diagram of Fish Audio S2 Pro showing the Dual-AR decoder arrangement and the RVQ audio codec pipeline]

For further details, see the technical report, the model repository, and the Fish Audio playground.

FAQ

What is the architecture of Fish Audio S2 Pro?

It uses a decoder-only transformer with a Dual-Autoregressive (Dual-AR) design: a 4B-parameter Slow AR decoder predicts the primary semantic codebook, and a 400M-parameter Fast AR decoder generates the residual codebooks that carry acoustic detail.

How many parameters does the model have?

4.4B parameters total (4B Slow AR + 400M Fast AR).

What languages are supported?

Over 80 languages: Tier 1 support for Japanese, English, and Chinese; Tier 2 support for Korean, Spanish, Portuguese, Arabic, Russian, French, and German; and 70+ additional languages.

How can I control emotion or prosody in the speech?

Insert free-form natural-language tags in the text, e.g., `[whisper]`, `[angry]`, `[excited tone]`, or custom descriptions like `[professional broadcast tone]`. The model supports 15,000+ unique tags.

What is the license for commercial use?

Research and non-commercial use is free, but commercial use requires a separate license from Fish Audio. Contact Fish Audio for commercial licensing.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key. Provide the input text, with any optional tags embedded inline. Refer to the gigarouter documentation for endpoint details.
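
Since the model is not yet live, the exact request shape is unknown. The sketch below only builds (and does not send) a request modeled on the OpenAI-style `/audio/speech` route, on the assumption that "OpenAI-compatible" will mean that shape. The base URL is a placeholder, `build_speech_request` is a hypothetical helper, and the final API may require additional fields such as a voice. Check the gigarouter documentation before relying on any of it.

```python
import json
import urllib.request

# Assumption: placeholder base URL; the real one is in the gigarouter docs.
BASE_URL = "https://api.gigarouter.example/v1"


def build_speech_request(api_key, text, model="fishaudio/s2-pro"):
    """Build (but do not send) an OpenAI-style speech request."""
    payload = {"model": model, "input": text}  # inline tags travel inside `input`
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("YOUR_API_KEY", "[excited tone] Welcome back!")
print(req.get_method(), req.full_url)
```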

not yet live

We're benchmarking and onboarding Fish Audio S2 Pro as a hosted, OpenAI-compatible API.
