Question 1

What languages does ARK ASR 3B support?

Accepted Answer

It supports Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.

Question 2

What is the average Word Error Rate on the Open ASR Leaderboard English short-form benchmark?

Accepted Answer

The model card reports an average WER of 5.04% across seven test sets: AMI, Earnings22, GigaSpeech, LibriSpeech (clean and other), SPGISpeech, and VoxPopuli. Note: this result is not yet reflected in the official leaderboard CSV.
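As a quick sanity check, the reported average can be reproduced from the per-dataset figures in the model card's benchmark table. The sketch below assumes the average is an unweighted mean over the seven listed test sets; the model card does not state the averaging method explicitly.

```python
# Per-test-set WER values (%) as listed in the model card's benchmark table.
wer = {
    "AMI": 8.79,
    "Earnings22": 8.23,
    "GigaSpeech": 6.98,
    "LibriSpeech Clean": 1.03,
    "LibriSpeech Other": 2.35,
    "SPGISpeech": 2.46,
    "VoxPopuli": 5.47,
}

# Assumption: unweighted (macro) average over the seven test sets.
average_wer = sum(wer.values()) / len(wer)
print(f"{average_wer:.2f}%")  # 5.04%
```

The unweighted mean matches the reported 5.04%, which suggests each test set contributes equally regardless of its size.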

Question 3

How do I call this model via the gigarouter API?

Accepted Answer

Use the gigarouter OpenAI-compatible endpoint with your API key. The model accepts 16 kHz mono audio and returns transcribed text. Refer to the gigarouter documentation for the exact endpoint and request format.
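Since the model expects 16 kHz mono audio, it may help to validate a file before uploading it. The following is a minimal sketch using only Python's standard `wave` module; the helper name `is_16k_mono` is hypothetical and is not part of gigarouter or the model repository. It only handles uncompressed WAV files.

```python
import wave

REQUIRED_RATE_HZ = 16_000   # sample rate stated in the answer above
REQUIRED_CHANNELS = 1       # mono


def is_16k_mono(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == REQUIRED_RATE_HZ
            and wav.getnchannels() == REQUIRED_CHANNELS
        )
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before being sent to the endpoint.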

Question 4

What is the recommended inference setup?

Accepted Answer

Load the model with trust_remote_code=True in Hugging Face Transformers, using bfloat16 on CUDA and SDPA attention. The official inference script handles processor, tokenizer, and generation cleanup.
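The settings above could be sketched roughly as follows. Several things here are assumptions rather than facts from the model card: the use of `AutoModel` (the repository's custom code may register a different auto class), the `model_id` argument (the actual Hub ID is not given here), and the `torch_dtype` keyword, which newer Transformers releases also expose as `dtype`. The official inference script remains the reference for processor and generation handling.

```python
def build_load_kwargs() -> dict:
    """Keyword arguments mirroring the setup described above."""
    return {
        "trust_remote_code": True,      # the model ships custom modeling code
        "torch_dtype": "bfloat16",      # Transformers accepts the dtype as a string
        "attn_implementation": "sdpa",  # PyTorch scaled-dot-product attention
    }


def load_model(model_id: str):
    """Load the model onto CUDA. Requires torch, transformers, and a GPU.

    `model_id` is a placeholder for the model's Hub ID or a local path.
    Assumption: AutoModel resolves to the right class via remote code.
    """
    # Imported lazily so build_load_kwargs() has no heavy dependencies.
    from transformers import AutoModel

    model = AutoModel.from_pretrained(model_id, **build_load_kwargs())
    return model.to("cuda").eval()
```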

Question 5

Can I deploy ARK ASR 3B for online serving?

Accepted Answer

Yes, the repository includes a vLLM adapter that exposes both a compact /asr endpoint and an OpenAI-style /v1/audio/transcriptions endpoint. See the vLLM serving section in the model card.
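A client for the OpenAI-style route might look like the sketch below. It assumes the adapter accepts the same multipart form fields as the OpenAI transcription API (`file` and `model`); the base URL, the model name, and the helper `build_transcription_request` are all hypothetical, so check the model card's vLLM serving section for the actual values.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, wav_bytes, model_name, filename="audio.wav"):
    """Build a multipart/form-data POST for /v1/audio/transcriptions.

    Assumption: the server expects OpenAI-style `model` and `file` fields.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Text field carrying the model name.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"'
        f"\r\n\r\n{model_name}\r\n".encode(),
        # File field carrying the raw audio bytes.
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'.encode(),
        wav_bytes,
        # Closing boundary.
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it to a running server. If the adapter follows the OpenAI response shape, the transcript would be in the `text` field of the JSON reply, but that is also an assumption.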

ARK ASR 3B

specs

Task	automatic speech recognition
Architecture	audio-capable autoregressive Transformers with Whisper-style encoder, MLP adapter, and Qwen decoder
Parameters	3B
License	Apache-2.0

English short-form benchmark (Open ASR Leaderboard)

Dataset	WER
AMI	8.79%
Earnings22	8.23%
GigaSpeech	6.98%
LibriSpeech Clean	1.03%
LibriSpeech Other	2.35%
SPGISpeech	2.46%
VoxPopuli	5.47%
Average	5.04%

Chinese CER

Dataset	CER
AISHELL-1	1.80%
WenetSpeech test meeting	4.97%
WenetSpeech test-net	4.58%