skip to content
gigarouter gigarouter
models / vision-language · coming soon

Gemma 4 31B IT FP8

RedHatAI/gemma-4-31B-it-FP8-block

published Apr 2026 · updated Jun 2026

Gemma 4 31B IT FP8 is a vision-language model that accepts text and image inputs and generates text outputs, quantized to FP8 block format for reduced memory and faster inference.

est. price
~$1.341
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
3.2M
license
apache-2.0

specs

TaskVision-Language (Text and Image to Text)
ArchitectureGoogle Gemma 4 31B Instruction-Tuned (FP8 quantized)
Parameters31 billion
QuantizationFP8 block (weights and activations)

about this model

RedHatAI/gemma-4-31B-it-FP8-block is a vision-language model (VLM) that accepts text and image inputs and produces text outputs, optimized for efficient deployment via FP8 quantization of weights and activations. The model is a quantized version of Google's gemma-4-31B-it, created using LLM Compressor with block-wise FP8 scaling (128×128 blocks) for weights and dynamic per-group quantization (group_size=128) for activations. This reduces disk size and GPU memory requirements by approximately 50% compared to the unquantized model. Vision tower, embedding, and output head layers remain in their original precision.

Evaluation Results

All evaluations were performed with thinking enabled, using lm-evaluation-harness and lighteval served via vLLM. Results are averaged over multiple random seeds.
Benchmark Unquantized FP8 Block Recovery
IFEval (prompt-level strict) 90.70 91.25 100.6%
IFEval (inst-level strict) 93.45 94.00 100.6%
GSM8K Platinum 95.78 95.78 100.0%
MMLU-Pro 85.41 85.44 100.0%
MATH-500 89.40 88.67 99.2%
AIME 2025 65.83 68.33 103.8%
GPQA Diamond 77.44 77.95 100.7%
LiveCodeBench v6 71.43 73.52 102.9%
The quantized model achieves 99–104% recovery across all benchmarks, with slight improvements on several tasks, demonstrating that FP8 block quantization preserves the original model's capabilities while reducing resource requirements.

best for

FAQ

What is the main benefit of this quantized model?

It reduces GPU memory and disk size by ~50% while maintaining accuracy within 1% of the unquantized version.

What input formats does this model support?

It accepts text and up to 4 images per prompt via the chat completions API.

How do I call this model on gigarouter?

Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name in the request.

Does the model support thinking or tool calling?

Yes, it supports a thinking mode and tool/function calling when configured with the appropriate flags.

not yet live

We're benchmarking and onboarding Gemma 4 31B IT FP8 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →