Gemma 4 31B IT FP8
RedHatAI/gemma-4-31B-it-FP8-block
published Apr 2026 · updated Jun 2026
Gemma 4 31B IT FP8 is a vision-language model that accepts text and image inputs and generates text outputs, quantized to FP8 block format for reduced memory and faster inference.
specs
| Task | Vision-Language (Text and Image to Text) |
| Architecture | Google Gemma 4 31B Instruction-Tuned (FP8 quantized) |
| Parameters | 31 billion |
| Quantization | FP8 block (weights and activations) |
about this model
Evaluation Results
All evaluations were performed with thinking enabled, using lm-evaluation-harness and lighteval served via vLLM. Results are averaged over multiple random seeds.| Benchmark | Unquantized | FP8 Block | Recovery |
|---|---|---|---|
| IFEval (prompt-level strict) | 90.70 | 91.25 | 100.6% |
| IFEval (inst-level strict) | 93.45 | 94.00 | 100.6% |
| GSM8K Platinum | 95.78 | 95.78 | 100.0% |
| MMLU-Pro | 85.41 | 85.44 | 100.0% |
| MATH-500 | 89.40 | 88.67 | 99.2% |
| AIME 2025 | 65.83 | 68.33 | 103.8% |
| GPQA Diamond | 77.44 | 77.95 | 100.7% |
| LiveCodeBench v6 | 71.43 | 73.52 | 102.9% |
best for
- ·Visual question answering and image captioning
- ·Instruction following and tool calling with images
- ·Complex reasoning and code generation
FAQ
It reduces GPU memory and disk size by ~50% while maintaining accuracy within 1% of the unquantized version.
It accepts text and up to 4 images per prompt via the chat completions API.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name in the request.
Yes, it supports a thinking mode and tool/function calling when configured with the appropriate flags.
We're benchmarking and onboarding Gemma 4 31B IT FP8 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.