skip to content
gigarouter gigarouter
models / multimodal · coming soon

Gemma 4 12B Unified

google/gemma-4-12B-it

published May 2026 · updated Jun 2026

Gemma 4 12B Unified is an any-to-any model that processes text, image, video, and audio inputs natively without separate encoders, generating text output.

status
coming soon
API providers
0
downloads / mo
3M
license
apache-2.0

specs

TaskMultimodal understanding and text generation
ArchitectureEncoder-free decoder-only transformer (Unified)
Parameters11.95B
LicenseApache 2.0

about this model

google/gemma-4-12B-it is a multimodal model that accepts text, image, audio, and video inputs and produces text output, using an encoder-free architecture to project raw image patches and audio waveforms directly into the LLM embedding space. It is the instruction-tuned variant of the Gemma 4 12B Unified model from Google DeepMind, designed to bring advanced reasoning, coding, and vision capabilities to consumer GPUs and workstations. Gigarouter hosts this model as a managed API, accessible via a standard OpenAI-compatible endpoint.

Key capabilities

  • Thinking mode: configurable step-by-step reasoning before answering.
  • Long context: supports up to 256K tokens.
  • Multimodal input: interleaved text, images, video frames, and audio (ASR, speech translation).
  • Function calling: native tool-use support for agentic workflows.
  • Coding & reasoning: enhanced code generation and mathematical problem-solving.
  • Multilingual: pre-trained on 140+ languages; 35+ supported out of the box.

Benchmark results (instruction-tuned)

Benchmark Gemma 4 12B Gemma 3 27B
MMLU Pro77.2%67.6%
AIME 2026 (no tools)77.5%20.8%
LiveCodeBench v672.0%29.1%
GPQA Diamond78.8%42.4%
MMMU Pro (vision)69.1%49.7%
MATH-Vision79.7%46.0%

On coding and reasoning benchmarks, the model outperforms the prior Gemma 3 27B by a wide margin, while its vision scores demonstrate strong document parsing and mathematical diagram understanding.

best for

FAQ

What makes the 12B Unified model different from other Gemma 4 models?

It uses an encoder-free architecture that projects raw image patches and audio waveforms directly into the LLM's embedding space, eliminating separate vision and audio encoders for lower latency and simpler fine-tuning.

What input modalities does Gemma 4 12B Unified support?

It supports text, image, video, and audio inputs natively, and generates text output.

What is the context window size?

It supports up to 256K tokens.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key, specifying the model name google/gemma-4-12B-it.

What license is this model released under?

Apache 2.0.

not yet live

We're benchmarking and onboarding Gemma 4 12B Unified as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related multimodal models

compare all →