Gemma 4 26B A4B IT FP8 Dynamic
RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
published Apr 2026 · updated Jun 2026
Gemma 4 26B A4B IT FP8 Dynamic is a vision-language model that processes text and image inputs to generate text, optimized with FP8 quantization for reduced memory and faster inference.
specs
| Task | Vision-Language (Text + Image to Text) |
| Architecture | Mixture-of-Experts with 128 fine-grained experts, top-8 routing, dual attention (sliding-window and global), dynamic vision resolution |
| Parameters | 26B total, 4B active per token |
| License | Google Gemma License |
about this model
RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic is a vision-language model (VLM) that processes text and image inputs to produce text outputs, optimized via FP8 dynamic quantization of weights and activations for efficient deployment.
Architecture and Optimization
The model is a quantized version of google/gemma-4-26B-A4B-it, retaining its core design: 128 fine-grained experts with top-8 routing, 131,072 token context length, dual attention (alternating sliding-window and global attention with different head dimensions), and dynamic vision resolution with configurable token budgets (70, 140, 280, 560, or 1,120 tokens). Quantization to FP8 data type reduces disk size and GPU memory requirements by approximately 50% while preserving most of the original model's accuracy. Weights are quantized statically per channel; activations are quantized dynamically per token. Vision tower, embedding, output head, and MoE router layers remain in original precision.
Evaluation Results
All benchmarks were performed with thinking enabled, using vLLM (OpenAI-compatible API). Scores are averaged over 3 seeds (8 for AIME 2025).
| Category | Benchmark | Unquantized | FP8 Dynamic | Recovery |
|---|---|---|---|---|
| Instruction Following | IFEval (0-shot, prompt-level strict) | 89.96 | 89.34 | 99.3% |
| IFEval (0-shot, inst-level strict) | 93.21 | 92.69 | 99.4% | |
| Reasoning | GSM8K Platinum (0-shot, strict-match) | 95.43 | 95.37 | 99.9% |
| MMLU-Pro (0-shot, custom-extract) | 83.47 | 83.26 | 99.7% | |
| MATH-500 (0-shot, pass@1) | 84.80 | 85.93 | 101.3% | |
| AIME 2025 (0-shot, pass@1) | 80.00 | 80.00 | 100.0% | |
| GPQA Diamond (0-shot, pass@1) | 73.20 | 74.75 | 102.1% | |
| Coding | LiveCodeBench v6 (0-shot, pass@1) | 74.48 | 73.90 | 99.2% |
Note: Audio input is not supported on this model variant. The model requires vLLM 0.19.1 or later.
best for
- ·Multimodal reasoning with images, such as chart analysis and visual question answering
- ·Complex instruction following and tool calling with thinking/reasoning mode
- ·Code generation and mathematical problem solving
FAQ
The base model supports up to 131,072 tokens; the FP8 quantized version is deployed with a max model length of 32,768 tokens by default.
It accepts text and image inputs and generates text output. Images are processed with dynamic resolution and configurable token budgets (70 to 1120 tokens).
FP8 quantization reduces disk size and GPU memory requirements by approximately 50% compared to the 16-bit base model, while maintaining over 99% accuracy recovery on most benchmarks.
It uses the Google Gemma License.
Use the gigarouter OpenAI-compatible endpoint with your API key, sending a chat completion request with the model name RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic.
We're benchmarking and onboarding Gemma 4 26B A4B IT FP8 Dynamic as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.