Gemma 4 26B A4B IT FP8 Dynamic

RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic

published Apr 2026 · updated Jun 2026

Gemma 4 26B A4B IT FP8 Dynamic is a vision-language model that processes text and image inputs to generate text, optimized with FP8 quantization for reduced memory and faster inference.

est. price

~$1.341

/ 1k images · estimated, set at launch

API providers

downloads / mo

license

apache-2.0

specs

Task	Vision-Language (Text + Image to Text)
Architecture	Mixture-of-Experts with 128 fine-grained experts, top-8 routing, dual attention (sliding-window and global), dynamic vision resolution
Parameters	26B total, 4B active per token
License	Google Gemma License

about this model

RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic is a vision-language model (VLM) that processes text and image inputs to produce text outputs, optimized via FP8 dynamic quantization of weights and activations for efficient deployment.

Architecture and Optimization

The model is a quantized version of google/gemma-4-26B-A4B-it, retaining its core design: 128 fine-grained experts with top-8 routing, 131,072 token context length, dual attention (alternating sliding-window and global attention with different head dimensions), and dynamic vision resolution with configurable token budgets (70, 140, 280, 560, or 1,120 tokens). Quantization to FP8 data type reduces disk size and GPU memory requirements by approximately 50% while preserving most of the original model's accuracy. Weights are quantized statically per channel; activations are quantized dynamically per token. Vision tower, embedding, output head, and MoE router layers remain in original precision.

Evaluation Results

All benchmarks were performed with thinking enabled, using vLLM (OpenAI-compatible API). Scores are averaged over 3 seeds (8 for AIME 2025).

Category	Benchmark	Unquantized	FP8 Dynamic	Recovery
Instruction Following	IFEval (0-shot, prompt-level strict)	89.96	89.34	99.3%
Instruction Following	IFEval (0-shot, inst-level strict)	93.21	92.69	99.4%
Reasoning	GSM8K Platinum (0-shot, strict-match)	95.43	95.37	99.9%
	MMLU-Pro (0-shot, custom-extract)	83.47	83.26	99.7%
	MATH-500 (0-shot, pass@1)	84.80	85.93	101.3%
	AIME 2025 (0-shot, pass@1)	80.00	80.00	100.0%
	GPQA Diamond (0-shot, pass@1)	73.20	74.75	102.1%
Coding	LiveCodeBench v6 (0-shot, pass@1)	74.48	73.90	99.2%

Note: Audio input is not supported on this model variant. The model requires vLLM 0.19.1 or later.

best for

·Multimodal reasoning with images, such as chart analysis and visual question answering
·Complex instruction following and tool calling with thinking/reasoning mode
·Code generation and mathematical problem solving

FAQ

What is the context length of this model?

The base model supports up to 131,072 tokens; the FP8 quantized version is deployed with a max model length of 32,768 tokens by default.

What are the input and output formats?

It accepts text and image inputs and generates text output. Images are processed with dynamic resolution and configurable token budgets (70 to 1120 tokens).

How does this model compare in size and speed to the unquantized version?

FP8 quantization reduces disk size and GPU memory requirements by approximately 50% compared to the 16-bit base model, while maintaining over 99% accuracy recovery on most benchmarks.

What license does this model use?

It uses the Google Gemma License.

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending a chat completion request with the model name RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic.

not yet live

We're benchmarking and onboarding Gemma 4 26B A4B IT FP8 Dynamic as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit