skip to content
gigarouter gigarouter
models / speech-to-text · coming soon

Phi-4 Multimodal Instruct

microsoft/Phi-4-multimodal-instruct

published Feb 2025 · updated Dec 2025

Phi-4 Multimodal Instruct is a lightweight open multimodal foundation model that processes text, image, and audio inputs to generate text outputs, supporting 128K token context length and multiple languages.

est. price
~$0.0034
· estimated, set at launch
API providers
0
downloads / mo
541.1K
license
mit

specs

TaskAutomatic speech recognition, speech translation, speech summarization, speech QA, image understanding, OCR, chart/table understanding, multi-image summarization
ArchitectureMixture-of-LoRAs with modality-specific adapters and routers over a frozen 3.8B-parameter language model
Parameters3.8 billion (base language model); speech/audio LoRA adapters add 460 million parameters

about this model

Phi-4-multimodal-instruct is an open multimodal foundation model designed for automatic speech recognition (ASR) and related speech tasks, hosted on gigarouter as a managed API. It processes text, image, and audio inputs, generating text outputs with a 128K token context length. The model uses a novel mixture-of-LoRAs architecture, where the speech/audio LoRA adapters contain only 460 million parameters, enabling top-tier ASR performance from a single frozen base language model. Its expanded vocabulary of 200K tokens supports multilingual speech recognition across English, Chinese, German, French, Italian, Japanese, Spanish, and Portuguese.

Speech Recognition Performance

On the HuggingFace OpenASR leaderboard, Phi-4-multimodal-instruct ranked first with a word error rate of 6.14% (as of March 2025), surpassing expert models such as WhisperV3 and SeamlessM4T-v2-Large. It also performs speech translation (bidirectional between English and seven other languages) and speech summarization, with summarization quality close to GPT‑4o.

Bar chart showing aggregated speech recognition WER across benchmarks Bar chart showing Word Error Rate by language on CommonVoice and FLEURS Bar chart showing speech translation BLEU scores from German, Spanish, French, Italian, Japanese, Portuguese, Chinese to English Bar chart showing speech translation BLEU scores from English to seven languages

Vision-Speech Task Performance

When processing image and audio inputs together, the model achieves strong results on vision-speech benchmarks:

BenchmarkPhi-4-multimodal-instructInternOmni-7BGemini-2.0-Flash-LiteGemini-2.0-FlashGemini-1.5-Pro
s_AI2D68.953.962.069.467.7
s_ChartQA69.056.135.551.346.9
s_DocVQA87.379.976.080.378.2
s_InfoVQA63.760.359.463.666.1
Average72.262.658.266.264.7

The model is available via gigarouter’s OpenAI-compatible API, requiring no local setup. Benchmark scores and additional technical details are documented in the Phi-4-Multimodal technical report.

best for

FAQ

What input modalities does Phi-4 Multimodal Instruct support?

It accepts text, image, and audio inputs, and generates text outputs. It supports vision+language, vision+speech, and speech/audio-only inference modes.

What is the context length and vocabulary size?

The model supports a 128K token context length and uses an expanded vocabulary of 200K tokens for better multilingual support.

How does Phi-4 Multimodal compare to WhisperV3 for ASR?

Phi-4 Multimodal surpasses WhisperV3 on automatic speech recognition and speech translation benchmarks, ranking first on the Hugging Face OpenASR leaderboard with a mean WER of 6.02%.

What is the license for Phi-4 Multimodal Instruct?

The model is released under the MIT license.

How can I call Phi-4 Multimodal Instruct via API?

Use the gigarouter OpenAI-compatible endpoint with your API key to send text, image, or audio inputs and receive text responses.

not yet live

We're benchmarking and onboarding Phi-4 Multimodal Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

compare all →