Question 1

What is the architecture of Fish Audio S2 Pro?

Accepted Answer

It uses a decoder-only transformer with a Dual-Autoregressive (Dual-AR) design: a 4B-parameter Slow AR for the primary semantic codebook and a 400M-parameter Fast AR for the residual acoustic details.
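To make the split between the two stages concrete, here is a toy sketch of how a Dual-AR decoding loop is commonly organized: one slow step per audio frame for the semantic token, then a short inner loop for the residual tokens of that frame. Everything in it is an assumption for illustration only. The functions `slow_ar_step` and `fast_ar_step` are random stand-ins rather than real models, and the codebook count and size are placeholders that the answer above does not specify.

```python
import random

# Placeholder values: the answer above does not state either number.
NUM_RESIDUAL_CODEBOOKS = 3
CODEBOOK_SIZE = 1024


def slow_ar_step(frames, rng):
    """Stand-in for the Slow AR: choose the next semantic token from the frames so far."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic, residuals, rng):
    """Stand-in for the Fast AR: choose the next residual token within one frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate(num_frames, seed=0):
    """Return one [semantic, residual_1, ..., residual_n] token list per frame."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        # Outer loop runs along the time axis: one semantic token per frame.
        semantic = slow_ar_step(frames, rng)
        residuals = []
        for _ in range(NUM_RESIDUAL_CODEBOOKS):
            # Inner loop runs along the codebook axis within the frame.
            residuals.append(fast_ar_step(semantic, residuals, rng))
        frames.append([semantic, *residuals])
    return frames


frames = generate(num_frames=4)
```

In a real system the per-frame token lists would be handed to the audio codec's decoder to produce a waveform; that step is omitted here.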

Question 2

How many parameters does the model have?

Accepted Answer

4.4B parameters total (4B Slow AR + 400M Fast AR).

Question 3

What languages are supported?

Accepted Answer

Over 80 languages, with Tier 1 support for Japanese, English, and Chinese, and Tier 2 for Korean, Spanish, Portuguese, Arabic, Russian, French, German, and many others.

Question 4

How can I control emotion or prosody in the speech?

Accepted Answer

Insert free-form natural-language tags in the text, e.g., `[whisper]`, `[angry]`, `[excited tone]`, or custom descriptions like `[professional broadcast tone]`. The model supports 15,000+ unique tags.
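As a small illustration, the snippet below builds tagged input strings. The helper `with_tags` is hypothetical and not part of any Fish Audio or gigarouter library, and placing the tags at the start of the text is an assumption; tags can presumably also be written inline by hand.

```python
def with_tags(text: str, *tags: str) -> str:
    """Prefix text with bracketed free-form tags, e.g. [whisper]."""
    # Skip empty or whitespace-only tags so no stray "[]" is emitted.
    prefix = " ".join(f"[{t.strip()}]" for t in tags if t.strip())
    return f"{prefix} {text}" if prefix else text


line = with_tags("I can't believe we actually won!", "excited tone")
print(line)  # [excited tone] I can't believe we actually won!
```

The resulting string is what you would pass as the input text to the model.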

Question 5

What is the license for commercial use?

Accepted Answer

Research and non-commercial use is free, but commercial use requires a separate license from Fish Audio. Contact business@fish.audio.

Question 6

How do I call this model via the gigarouter API?

Accepted Answer

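Send requests to gigarouter's OpenAI-compatible endpoint, authenticating with your gigarouter API key. Pass the text to synthesize as the input, optionally including tags such as `[whisper]`. Refer to the gigarouter documentation for the exact endpoint URL and model identifier.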
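A minimal sketch of what such a request might look like, using only the Python standard library. The base URL, model identifier, and environment variable name are placeholders, not real gigarouter values. The `/audio/speech` path and the `model`/`input` fields follow the OpenAI speech API convention, and gigarouter may require additional fields (for example a voice), so check its documentation before relying on this.

```python
import json
import os
import urllib.request

# Placeholders: take the real values from the gigarouter documentation.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "fish-audio-s2-pro"


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    body = json.dumps({"model": MODEL_ID, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# GIGAROUTER_API_KEY is an assumed variable name for illustration.
api_key = os.environ.get("GIGAROUTER_API_KEY", "sk-placeholder")
req = build_speech_request("[whisper] The meeting starts at nine.", api_key)

# To actually send it and save the audio (requires a real URL and key):
# with urllib.request.urlopen(req) as resp, open("out.audio", "wb") as f:
#     f.write(resp.read())
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.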

Task	Text-to-Speech (TTS)
Architecture	Dual-Autoregressive Transformer (Slow AR 4B, Fast AR 400M) with RVQ audio codec
Parameters	4.4B total
License	Fish Audio Research License (research/non-commercial free; commercial requires separate license)
Languages	80+ languages, including English, Japanese, Chinese, Korean, Spanish, French, German, etc.

Fish Audio S2 Pro
