Gemma 4 31B IT FP8

RedHatAI/gemma-4-31B-it-FP8-block

published Apr 2026 · updated Jun 2026

Gemma 4 31B IT FP8 is a vision-language model that accepts text and image inputs and generates text outputs, quantized to FP8 block format for reduced memory and faster inference.

est. price

~$1.341

/ 1k images · estimated, set at launch

API providers

downloads / mo

3.2M

license

apache-2.0

specs

Task	Vision-Language (Text and Image to Text)
Architecture	Google Gemma 4 31B Instruction-Tuned (FP8 quantized)
Parameters	31 billion
Quantization	FP8 block (weights and activations)

about this model

RedHatAI/gemma-4-31B-it-FP8-block is a vision-language model (VLM) that accepts text and image inputs and produces text outputs, optimized for efficient deployment via FP8 quantization of weights and activations. The model is a quantized version of Google's gemma-4-31B-it, created using LLM Compressor with block-wise FP8 scaling (128×128 blocks) for weights and dynamic per-group quantization (group_size=128) for activations. This reduces disk size and GPU memory requirements by approximately 50% compared to the unquantized model. Vision tower, embedding, and output head layers remain in their original precision.

Evaluation Results

All evaluations were performed with thinking enabled, using lm-evaluation-harness and lighteval served via vLLM. Results are averaged over multiple random seeds.

Benchmark	Unquantized	FP8 Block	Recovery
IFEval (prompt-level strict)	90.70	91.25	100.6%
IFEval (inst-level strict)	93.45	94.00	100.6%
GSM8K Platinum	95.78	95.78	100.0%
MMLU-Pro	85.41	85.44	100.0%
MATH-500	89.40	88.67	99.2%
AIME 2025	65.83	68.33	103.8%
GPQA Diamond	77.44	77.95	100.7%
LiveCodeBench v6	71.43	73.52	102.9%

The quantized model achieves 99–104% recovery across all benchmarks, with slight improvements on several tasks, demonstrating that FP8 block quantization preserves the original model's capabilities while reducing resource requirements.

best for

·Visual question answering and image captioning
·Instruction following and tool calling with images
·Complex reasoning and code generation

FAQ

What is the main benefit of this quantized model?

It reduces GPU memory and disk size by ~50% while maintaining accuracy within 1% of the unquantized version.

What input formats does this model support?

It accepts text and up to 4 images per prompt via the chat completions API.

How do I call this model on gigarouter?

Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name in the request.

Does the model support thinking or tool calling?

Yes, it supports a thinking mode and tool/function calling when configured with the appropriate flags.

not yet live

We're benchmarking and onboarding Gemma 4 31B IT FP8 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit