Gemma 4 E2B IT (Mobile Optimized)
google/gemma-4-E2B-it-qat-mobile-transformers
published Jun 2026 · updated Jun 2026
Gemma 4 E2B IT (Mobile Optimized) is a quantized, mobile-optimized instruction-tuned multimodal model that processes text, image, and audio inputs and generates text outputs.
specs
| Task | Multimodal Understanding & Generation (Text, Image, Audio) |
| Architecture | Decoder-only transformer with Per-Layer Embeddings (PLE), hybrid sliding window + global attention |
| Parameters | 2.3B effective (5.1B total) |
| License | Apache 2.0 |
| Context Length | 128K tokens |
| Supported Modalities | Text, Image, Audio |
about this model
google/gemma-4-E2B-it-qat-mobile-transformers is an any-to-any multimodal model that processes text, image, and audio inputs to generate text output. It is a variant of the Gemma 4 E2B model built by Google DeepMind, optimized with quantization-aware training (QAT) in a mobile-optimized (wNa8o8) format for efficient on-device deployment.
Core Capabilities
- Supports text, image, and audio input; generates text output.
- 128K token context window.
- Built-in reasoning mode (configurable thinking).
- Native function calling for agentic workflows.
- Code generation, completion, and correction.
- Multilingual support for 35+ languages out of the box, pre-trained on 140+ languages.
- Automatic speech recognition (ASR) and speech-to-text translation.
Architecture
Effective 2.3B parameters (5.1B with embeddings), 35 decoder layers, sliding window attention of 512 tokens, vocabulary size 262K. Employs Per-Layer Embeddings (PLE) and a hybrid attention mechanism interleaving local sliding window with full global attention. Global layers use unified Keys and Values with proportional RoPE (p-RoPE).
Dense Model Specifications
| Property | E2B |
|---|---|
| Total Parameters | 2.3B effective (5.1B with embeddings) |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 128K tokens |
| Vocabulary Size | 262K |
| Supported Modalities | Text, Image, Audio |
| Vision Encoder Params | ~150M |
| Audio Encoder Params | ~300M |
Benchmark Results (Instruction-Tuned)
| Benchmark | Gemma 4 E2B |
|---|---|
| MMLU Pro | 60.0% |
| AIME 2026 no tools | 37.5% |
| LiveCodeBench v6 | 44.0% |
| Codeforces ELO | 633 |
| GPQA Diamond | 43.4% |
| Tau2 (average over 3) | 24.5% |
| BigBench Extra Hard | 21.9% |
| MMMU (multilingual) | 67.4% |
| MMMU Pro (vision) | 44.2% |
| OmniDocBench 1.5 (avg edit distance, lower↓) | 0.290 |
| MATH-Vision | 52.4% |
| MedXPertQA MM | 23.5% |
| CoVoST (audio) | 33.47 |
| FLEURS (audio, lower↓) | 0.09 |
| MRCR v2 8 needle 128k (long context) | 19.1% |
The QAT mobile-optimized variant uses a custom wNa8o8 schema featuring targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings while preserving quality comparable to bfloat16. Hosted on gigarouter as an OpenAI-compatible API, this model requires no local setup.
best for
- ·On-device AI assistant with image and audio input
- ·Document OCR and analysis on mobile devices
- ·Lightweight reasoning agent for edge deployments
FAQ
It has 2.3B effective parameters (5.1B total including embeddings).
It supports text, image (with variable aspect ratio and resolution), and audio (ASR and speech-to-text translation).
Yes, it is an instruction-tuned variant, supporting system prompts and configurable thinking (reasoning) modes.
Use the OpenAI-compatible endpoint with your API key, specifying the model name provided by gigarouter.
It is released under the Apache 2.0 license.
We're benchmarking and onboarding Gemma 4 E2B IT (Mobile Optimized) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.