skip to content
gigarouter gigarouter
models / multimodal · coming soon

Gemma 4 E2B IT (Mobile Optimized)

google/gemma-4-E2B-it-qat-mobile-transformers

published Jun 2026 · updated Jun 2026

Gemma 4 E2B IT (Mobile Optimized) is a quantized, mobile-optimized instruction-tuned multimodal model that processes text, image, and audio inputs and generates text outputs.

status
coming soon
API providers
0
downloads / mo
22.2K
license
apache-2.0

specs

TaskMultimodal Understanding & Generation (Text, Image, Audio)
ArchitectureDecoder-only transformer with Per-Layer Embeddings (PLE), hybrid sliding window + global attention
Parameters2.3B effective (5.1B total)
LicenseApache 2.0
Context Length128K tokens
Supported ModalitiesText, Image, Audio

about this model

google/gemma-4-E2B-it-qat-mobile-transformers is an any-to-any multimodal model that processes text, image, and audio inputs to generate text output. It is a variant of the Gemma 4 E2B model built by Google DeepMind, optimized with quantization-aware training (QAT) in a mobile-optimized (wNa8o8) format for efficient on-device deployment.

Core Capabilities

  • Supports text, image, and audio input; generates text output.
  • 128K token context window.
  • Built-in reasoning mode (configurable thinking).
  • Native function calling for agentic workflows.
  • Code generation, completion, and correction.
  • Multilingual support for 35+ languages out of the box, pre-trained on 140+ languages.
  • Automatic speech recognition (ASR) and speech-to-text translation.

Architecture

Effective 2.3B parameters (5.1B with embeddings), 35 decoder layers, sliding window attention of 512 tokens, vocabulary size 262K. Employs Per-Layer Embeddings (PLE) and a hybrid attention mechanism interleaving local sliding window with full global attention. Global layers use unified Keys and Values with proportional RoPE (p-RoPE).

Dense Model Specifications

PropertyE2B
Total Parameters2.3B effective (5.1B with embeddings)
Layers35
Sliding Window512 tokens
Context Length128K tokens
Vocabulary Size262K
Supported ModalitiesText, Image, Audio
Vision Encoder Params~150M
Audio Encoder Params~300M

Benchmark Results (Instruction-Tuned)

BenchmarkGemma 4 E2B
MMLU Pro60.0%
AIME 2026 no tools37.5%
LiveCodeBench v644.0%
Codeforces ELO633
GPQA Diamond43.4%
Tau2 (average over 3)24.5%
BigBench Extra Hard21.9%
MMMU (multilingual)67.4%
MMMU Pro (vision)44.2%
OmniDocBench 1.5 (avg edit distance, lower↓)0.290
MATH-Vision52.4%
MedXPertQA MM23.5%
CoVoST (audio)33.47
FLEURS (audio, lower↓)0.09
MRCR v2 8 needle 128k (long context)19.1%

The QAT mobile-optimized variant uses a custom wNa8o8 schema featuring targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings while preserving quality comparable to bfloat16. Hosted on gigarouter as an OpenAI-compatible API, this model requires no local setup.

best for

FAQ

What is the effective parameter count of Gemma 4 E2B IT (Mobile Optimized)?

It has 2.3B effective parameters (5.1B total including embeddings).

What modalities does it support?

It supports text, image (with variable aspect ratio and resolution), and audio (ASR and speech-to-text translation).

Is this model fine-tuned for instruction following?

Yes, it is an instruction-tuned variant, supporting system prompts and configurable thinking (reasoning) modes.

How can I access this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, specifying the model name provided by gigarouter.

What is the license for this model?

It is released under the Apache 2.0 license.

not yet live

We're benchmarking and onboarding Gemma 4 E2B IT (Mobile Optimized) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related multimodal models

compare all →