skip to content
gigarouter gigarouter
models / speech-to-text · coming soon

ARK ASR 3B

AutoArk-AI/ARK-ASR-3B

published Jun 2026 · updated Jun 2026

ARK ASR 3B is a multilingual automatic speech recognition model that combines a Whisper-style audio encoder, an MLP adapter, and a Qwen decoder to transcribe speech in 19 languages.

est. price
~$0.0034
· estimated, set at launch
API providers
0
downloads / mo
1.7K
license
apache-2.0

specs

Taskautomatic speech recognition
Architectureaudio-capable autoregressive Transformers with Whisper-style encoder, MLP adapter, and Qwen decoder
Parameters3B
LicenseApache-2.0

about this model

ARK-ASR-3B is an automatic speech recognition model that combines a Whisper-style audio encoder with a 3-billion-parameter Qwen decoder via an MLP adapter.

Architecture diagram of ARK-ASR showing audio encoder, MLP adapter, and Qwen decoder with placeholder token injection

The model supports 19 languages: Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian. Audio is processed at 16 kHz and transcribed autoregressively. The decoder replaces audio placeholder token embeddings with encoder output before generation, with custom token filtering for ASR.

English short-form benchmark (Open ASR Leaderboard)

ARK-ASR-3B achieves an average word error rate of 5.04% and real-time factor (RTFx) of 490.98 across seven datasets, claiming state-of-the-art on the Hugging Face Open ASR Leaderboard English short-form benchmark at the time of release.

DatasetWER
AMI8.79%
Earnings228.23%
GigaSpeech6.98%
LibriSpeech Clean1.03%
LibriSpeech Other2.35%
SPGISpeech2.46%
VoxPopuli5.47%
Average5.04%

Chinese CER

DatasetCER
AISHELL-11.80%
WenetSpeech test meeting4.97%
WenetSpeech test-net4.58%

The model is trained using on-policy distillation from a stronger teacher (see arXiv:2605.28139). All evaluation follows the Hugging Face open_asr_leaderboard codebase. ARK-ASR-3B is hosted on Gigarouter as a managed, OpenAI-compatible API – no local installation or GPU management required.

best for

FAQ

What languages does ARK ASR 3B support?

It supports Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.

What is the average Word Error Rate on the Open ASR Leaderboard English short-form benchmark?

The model card reports an average WER of 5.04% across AMI, Earnings22, GigaSpeech, LibriSpeech, SPGISpeech, and VoxPopuli. Note: this result is not yet reflected in the official leaderboard CSV.

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key. The model accepts 16 kHz mono audio and returns transcribed text. Refer to the gigarouter documentation for the exact endpoint and request format.

What is the recommended inference setup?

Load the model with <code>trust_remote_code=True</code> in Hugging Face Transformers, using bfloat16 on CUDA and SDPA attention. The official inference script handles processor, tokenizer, and generation cleanup.

Can I deploy ARK ASR 3B for online serving?

Yes, the repository includes a vLLM adapter that exposes both a compact /asr endpoint and an OpenAI-style /v1/audio/transcriptions endpoint. See the vLLM serving section in the model card.

not yet live

We're benchmarking and onboarding ARK ASR 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

compare all →