Gemma 4 26B A4B
unsloth/gemma-4-26B-A4B-it-GGUF
published Apr 2026 · updated Jun 2026
Gemma 4 26B A4B is a multimodal vision-language model that processes text and image input and generates text output, using a Mixture-of-Experts architecture with 3.8B active parameters.
specs
| Task | Visual Language Model (text + image input, text output) |
| Architecture | Mixture-of-Experts (MoE) with hybrid attention |
| Parameters | 25.2B total / 3.8B active |
| License | Apache 2.0 |
about this model
Gemma 4 26B A4B IT (GGUF) is a vision-language model (VLM) that processes text and image inputs and generates text output. It is the instruction-tuned, Mixture-of-Experts variant of Google DeepMind's Gemma 4 family, quantized to GGUF format by Unsloth for efficient deployment.
Architecture
The model uses 25.2B total parameters with only 3.8B active per token, achieved via 8 active experts out of 128 total plus 1 shared expert. This design enables inference speeds comparable to a 4B-parameter model while retaining the capability of a much larger model. It supports a context window of 256K tokens and a vocabulary of 262K tokens. The vision encoder adds approximately 550M parameters. A hybrid attention mechanism interleaves local sliding window attention (1024 tokens) with full global attention, and the final layer is always global.
Benchmark Results
On the instruction-tuned variant, as reported by Google DeepMind:
| Benchmark | Score |
|---|---|
| MMLU Pro | 82.6% |
| AIME 2026 (no tools) | 88.3% |
| LiveCodeBench v6 | 77.1% |
| Codeforces ELO | 1718 |
| GPQA Diamond | 82.3% |
| MMMU (Multimodal) | 86.3% |
| MMMU Pro (Vision) | 73.8% |
| MATH-Vision | 82.4% |
| OmniDocBench 1.5 (edit distance, lower is better) | 0.149 |
| MRCR v2 8 needle 128k (long context) | 44.1% |
Key Capabilities
- Multimodal understanding: Accepts interleaved text and images; supports variable aspect ratios and resolutions. Capable of OCR, document parsing, chart comprehension, and handwriting recognition.
- Reasoning: Built-in configurable thinking mode for step-by-step reasoning before answering.
- Coding & agentic tasks: Native function-calling support; strong results on coding benchmarks (LiveCodeBench, Codeforces).
- Long context: 256K token context window with efficient memory handling via unified KV and proportional RoPE.
- Multilingual: Pre-trained on 140+ languages; supports 35+ languages out of the box.
best for
- ·Reasoning and step-by-step thinking tasks
- ·Coding and agentic workflows with function calling
- ·Document/PDF parsing and OCR
FAQ
It uses a Mixture-of-Experts architecture with 128 total experts, activating only 8 per token, so it runs almost as fast as a 4B-parameter model while having 25.2B total parameters.
It accepts text and image inputs (including documents, charts, screenshots) and outputs text. It does not support audio input.
It supports up to 256K tokens of context.
It is released under the Apache 2.0 license.
Use the gigarouter OpenAI-compatible endpoint with your API key; it supports chat completions and function calling.
We're benchmarking and onboarding Gemma 4 26B A4B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.