Gemma 4 12B Unified

google/gemma-4-12B-it

published May 2026 · updated Jun 2026

Gemma 4 12B Unified is an any-to-any model that processes text, image, video, and audio inputs natively without separate encoders, generating text output.

status

coming soon

API providers

downloads / mo

license

apache-2.0

specs

Task	Multimodal understanding and text generation
Architecture	Encoder-free decoder-only transformer (Unified)
Parameters	11.95B
License	Apache 2.0

about this model

google/gemma-4-12B-it is a multimodal model that accepts text, image, audio, and video inputs and produces text output, using an encoder-free architecture to project raw image patches and audio waveforms directly into the LLM embedding space. It is the instruction-tuned variant of the Gemma 4 12B Unified model from Google DeepMind, designed to bring advanced reasoning, coding, and vision capabilities to consumer GPUs and workstations. Gigarouter hosts this model as a managed API, accessible via a standard OpenAI-compatible endpoint.

Key capabilities

Thinking mode: configurable step-by-step reasoning before answering.
Long context: supports up to 256K tokens.
Multimodal input: interleaved text, images, video frames, and audio (ASR, speech translation).
Function calling: native tool-use support for agentic workflows.
Coding & reasoning: enhanced code generation and mathematical problem-solving.
Multilingual: pre-trained on 140+ languages; 35+ supported out of the box.

Benchmark results (instruction-tuned)

Benchmark	Gemma 4 12B	Gemma 3 27B
MMLU Pro	77.2%	67.6%
AIME 2026 (no tools)	77.5%	20.8%
LiveCodeBench v6	72.0%	29.1%
GPQA Diamond	78.8%	42.4%
MMMU Pro (vision)	69.1%	49.7%
MATH-Vision	79.7%	46.0%

On coding and reasoning benchmarks, the model outperforms the prior Gemma 3 27B by a wide margin, while its vision scores demonstrate strong document parsing and mathematical diagram understanding.

best for

·Running multimodal AI locally on laptops and consumer devices
·Building agentic workflows with function calling and reasoning
·Processing mixed text, image, audio, and video inputs in a single model

FAQ

What makes the 12B Unified model different from other Gemma 4 models?

It uses an encoder-free architecture that projects raw image patches and audio waveforms directly into the LLM's embedding space, eliminating separate vision and audio encoders for lower latency and simpler fine-tuning.

What input modalities does Gemma 4 12B Unified support?

It supports text, image, video, and audio inputs natively, and generates text output.

What is the context window size?

It supports up to 256K tokens.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key, specifying the model name google/gemma-4-12B-it.

What license is this model released under?

Apache 2.0.

not yet live

We're benchmarking and onboarding Gemma 4 12B Unified as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related multimodal models

compare all →

gemma-4-E4B-it

5.4M dl/mo

gemma-4-E2B-it-qat-mobile-transformers

22.2K dl/mo