Gemma 4 12B Unified
google/gemma-4-12B-it
published May 2026 · updated Jun 2026
Gemma 4 12B Unified is an any-to-any model that processes text, image, video, and audio inputs natively without separate encoders, generating text output.
specs
| Task | Multimodal understanding and text generation |
| Architecture | Encoder-free decoder-only transformer (Unified) |
| Parameters | 11.95B |
| License | Apache 2.0 |
about this model
google/gemma-4-12B-it is a multimodal model that accepts text, image, audio, and video inputs and produces text output, using an encoder-free architecture to project raw image patches and audio waveforms directly into the LLM embedding space. It is the instruction-tuned variant of the Gemma 4 12B Unified model from Google DeepMind, designed to bring advanced reasoning, coding, and vision capabilities to consumer GPUs and workstations. Gigarouter hosts this model as a managed API, accessible via a standard OpenAI-compatible endpoint.
Key capabilities
- Thinking mode: configurable step-by-step reasoning before answering.
- Long context: supports up to 256K tokens.
- Multimodal input: interleaved text, images, video frames, and audio (ASR, speech translation).
- Function calling: native tool-use support for agentic workflows.
- Coding & reasoning: enhanced code generation and mathematical problem-solving.
- Multilingual: pre-trained on 140+ languages; 35+ supported out of the box.
Benchmark results (instruction-tuned)
| Benchmark | Gemma 4 12B | Gemma 3 27B |
|---|---|---|
| MMLU Pro | 77.2% | 67.6% |
| AIME 2026 (no tools) | 77.5% | 20.8% |
| LiveCodeBench v6 | 72.0% | 29.1% |
| GPQA Diamond | 78.8% | 42.4% |
| MMMU Pro (vision) | 69.1% | 49.7% |
| MATH-Vision | 79.7% | 46.0% |
On coding and reasoning benchmarks, the model outperforms the prior Gemma 3 27B by a wide margin, while its vision scores demonstrate strong document parsing and mathematical diagram understanding.
best for
- ·Running multimodal AI locally on laptops and consumer devices
- ·Building agentic workflows with function calling and reasoning
- ·Processing mixed text, image, audio, and video inputs in a single model
FAQ
It uses an encoder-free architecture that projects raw image patches and audio waveforms directly into the LLM's embedding space, eliminating separate vision and audio encoders for lower latency and simpler fine-tuning.
It supports text, image, video, and audio inputs natively, and generates text output.
It supports up to 256K tokens.
Use the OpenAI-compatible endpoint with your gigarouter API key, specifying the model name google/gemma-4-12B-it.
Apache 2.0.
We're benchmarking and onboarding Gemma 4 12B Unified as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.