Gemma 4 12B QAT Uncensored Balanced
HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced
published Jun 2026 · updated Jun 2026
Gemma 4 12B QAT Uncensored Balanced is a vlm model that removes refusals from Google's Gemma 4 12B while preserving full capabilities, supporting text, image, and audio input with QAT quantization and speculative decoding for 60% faster generation.
specs
| Task | Vision-Language (multimodal: text, image, audio) |
| Architecture | Dense transformer with hybrid attention (sliding window 1024 tokens + global attention) |
| Parameters | 12B (dense) |
| Context Window | 256K tokens |
| License | Apache 2.0 (with separate Gemma 4 license) |
| Quantization | 4-bit QAT (Q4_K_M) |
about this model
Model Description
Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced is a vision-language model (VLM) built from Google DeepMind's Gemma 4 12B, fine-tuned to remove refusals (0/465 on refusal benchmarks) while preserving full original capabilities. It uses quantization-aware training (QAT) at 4-bit, achieving near full-precision quality with a 6.9 GB footprint.
Key Strengths
- Speculative decoding speedup: Ships with an MTP (multi-token-prediction) draft head that delivers ~60% faster generation with identical output quality. The model verifies every drafted token.
- Encoded-free vision: Processes images without a separate vision encoder, via a 168 MB mmproj projector. Native audio input is also supported (as in the base Gemma 4 12B).
- 256K context window with hybrid attention: interleaves local sliding window attention (1024 tokens) and full global attention, using Proportional RoPE.
- 48 layers, 262K vocabulary, 12B dense parameters.
Recommended Sampling
temperature 0.6,top_k 64,top_p 0.9,min_p 0.05,repeat_penalty 1.1.
Balanced Variant
This recommended variant uses optimized full uncensoring tuned for agentic coding, reasoning, creative writing, and reliability-critical tasks. It reasons before answering and remains dependable and on-instruction.
Architecture & License
Base model is Google DeepMind's Gemma 4 12B (repository google/gemma-4-12B-it), licensed under Apache 2.0 with a separate Gemma 4 license. The MTP draft head is provided by Unsloth.
best for
- ·Agentic coding and reasoning tasks
- ·Creative writing and storytelling
- ·Vision-language applications (image/audio input)
- ·Reliability-critical assistant deployments
FAQ
This version removes all refusals while preserving original capabilities, and includes QAT quantization (~6.9 GB) plus an MTP draft head for 60% faster speculative decoding.
Text, image (via mmproj), and audio (native support on the underlying Gemma 4 12B architecture).
Use the gigarouter OpenAI-compatible endpoint with your API key and the model name `Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced`; refer to gigarouter docs for exact endpoint and parameters.
The base Gemma 4 model is licensed under Apache 2.0, with an additional Gemma 4 license available at https://ai.google.dev/gemma/docs/gemma_4_license.
Yes, the Balanced variant is recommended for 99%+ of users, optimized for agentic coding, reasoning, creative writing, and reliability-critical tasks without aggressive deflection.
We're benchmarking and onboarding Gemma 4 12B QAT Uncensored Balanced as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.