Gemma 4 E2B IT (Mobile Optimized)

google/gemma-4-E2B-it-qat-mobile-transformers

published Jun 2026 · updated Jun 2026

Gemma 4 E2B IT (Mobile Optimized) is a quantized, mobile-optimized instruction-tuned multimodal model that processes text, image, and audio inputs and generates text outputs.

status

coming soon

API providers

downloads / mo

22.2K

license

apache-2.0

specs

Task	Multimodal Understanding & Generation (Text, Image, Audio)
Architecture	Decoder-only transformer with Per-Layer Embeddings (PLE), hybrid sliding window + global attention
Parameters	2.3B effective (5.1B total)
License	Apache 2.0
Context Length	128K tokens
Supported Modalities	Text, Image, Audio

about this model

google/gemma-4-E2B-it-qat-mobile-transformers is an any-to-any multimodal model that processes text, image, and audio inputs to generate text output. It is a variant of the Gemma 4 E2B model built by Google DeepMind, optimized with quantization-aware training (QAT) in a mobile-optimized (wNa8o8) format for efficient on-device deployment.

Core Capabilities

Supports text, image, and audio input; generates text output.
128K token context window.
Built-in reasoning mode (configurable thinking).
Native function calling for agentic workflows.
Code generation, completion, and correction.
Multilingual support for 35+ languages out of the box, pre-trained on 140+ languages.
Automatic speech recognition (ASR) and speech-to-text translation.

Architecture

Effective 2.3B parameters (5.1B with embeddings), 35 decoder layers, sliding window attention of 512 tokens, vocabulary size 262K. Employs Per-Layer Embeddings (PLE) and a hybrid attention mechanism interleaving local sliding window with full global attention. Global layers use unified Keys and Values with proportional RoPE (p-RoPE).

Dense Model Specifications

Property	E2B
Total Parameters	2.3B effective (5.1B with embeddings)
Layers	35
Sliding Window	512 tokens
Context Length	128K tokens
Vocabulary Size	262K
Supported Modalities	Text, Image, Audio
Vision Encoder Params	~150M
Audio Encoder Params	~300M

Benchmark Results (Instruction-Tuned)

Benchmark	Gemma 4 E2B
MMLU Pro	60.0%
AIME 2026 no tools	37.5%
LiveCodeBench v6	44.0%
Codeforces ELO	633
GPQA Diamond	43.4%
Tau2 (average over 3)	24.5%
BigBench Extra Hard	21.9%
MMMU (multilingual)	67.4%
MMMU Pro (vision)	44.2%
OmniDocBench 1.5 (avg edit distance, lower↓)	0.290
MATH-Vision	52.4%
MedXPertQA MM	23.5%
CoVoST (audio)	33.47
FLEURS (audio, lower↓)	0.09
MRCR v2 8 needle 128k (long context)	19.1%

The QAT mobile-optimized variant uses a custom wNa8o8 schema featuring targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings while preserving quality comparable to bfloat16. Hosted on gigarouter as an OpenAI-compatible API, this model requires no local setup.

best for

·On-device AI assistant with image and audio input
·Document OCR and analysis on mobile devices
·Lightweight reasoning agent for edge deployments

FAQ

What is the effective parameter count of Gemma 4 E2B IT (Mobile Optimized)?

It has 2.3B effective parameters (5.1B total including embeddings).

What modalities does it support?

It supports text, image (with variable aspect ratio and resolution), and audio (ASR and speech-to-text translation).

Is this model fine-tuned for instruction following?

Yes, it is an instruction-tuned variant, supporting system prompts and configurable thinking (reasoning) modes.

How can I access this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, specifying the model name provided by gigarouter.

What is the license for this model?

It is released under the Apache 2.0 license.

not yet live

We're benchmarking and onboarding Gemma 4 E2B IT (Mobile Optimized) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related multimodal models