Gemma 4 12B QAT Uncensored Balanced

HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced

published Jun 2026 · updated Jun 2026

Gemma 4 12B QAT Uncensored Balanced is a vlm model that removes refusals from Google's Gemma 4 12B while preserving full capabilities, supporting text, image, and audio input with QAT quantization and speculative decoding for 60% faster generation.

status

coming soon

API providers

downloads / mo

71.7K

license

gemma

specs

Task	Vision-Language (multimodal: text, image, audio)
Architecture	Dense transformer with hybrid attention (sliding window 1024 tokens + global attention)
Parameters	12B (dense)
Context Window	256K tokens
License	Apache 2.0 (with separate Gemma 4 license)
Quantization	4-bit QAT (Q4_K_M)

about this model

Model Description

Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced is a vision-language model (VLM) built from Google DeepMind's Gemma 4 12B, fine-tuned to remove refusals (0/465 on refusal benchmarks) while preserving full original capabilities. It uses quantization-aware training (QAT) at 4-bit, achieving near full-precision quality with a 6.9 GB footprint.

Key Strengths

Speculative decoding speedup: Ships with an MTP (multi-token-prediction) draft head that delivers ~60% faster generation with identical output quality. The model verifies every drafted token.
Encoded-free vision: Processes images without a separate vision encoder, via a 168 MB mmproj projector. Native audio input is also supported (as in the base Gemma 4 12B).
256K context window with hybrid attention: interleaves local sliding window attention (1024 tokens) and full global attention, using Proportional RoPE.
48 layers, 262K vocabulary, 12B dense parameters.

Recommended Sampling

temperature 0.6, top_k 64, top_p 0.9, min_p 0.05, repeat_penalty 1.1.

Balanced Variant

This recommended variant uses optimized full uncensoring tuned for agentic coding, reasoning, creative writing, and reliability-critical tasks. It reasons before answering and remains dependable and on-instruction.

Architecture & License

Base model is Google DeepMind's Gemma 4 12B (repository google/gemma-4-12B-it), licensed under Apache 2.0 with a separate Gemma 4 license. The MTP draft head is provided by Unsloth.

best for

·Agentic coding and reasoning tasks
·Creative writing and storytelling
·Vision-language applications (image/audio input)
·Reliability-critical assistant deployments

FAQ

What is the difference between this and the original Gemma 4 12B?

This version removes all refusals while preserving original capabilities, and includes QAT quantization (~6.9 GB) plus an MTP draft head for 60% faster speculative decoding.

What input formats does it support?

Text, image (via mmproj), and audio (native support on the underlying Gemma 4 12B architecture).

How do I call this model via gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key and the model name `Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced`; refer to gigarouter docs for exact endpoint and parameters.

What license applies to this model?

The base Gemma 4 model is licensed under Apache 2.0, with an additional Gemma 4 license available at https://ai.google.dev/gemma/docs/gemma_4_license.

Is the Balanced variant recommended for all use cases?

Yes, the Balanced variant is recommended for 99%+ of users, optimized for agentic coding, reasoning, creative writing, and reliability-critical tasks without aggressive deflection.

not yet live

We're benchmarking and onboarding Gemma 4 12B QAT Uncensored Balanced as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit