Qwen 3.6-35B-A3B
nvidia/Qwen3.6-35B-A3B-NVFP4
published May 2026 · updated Jun 2026
Qwen 3.6-35B-A3B is a text-generation model that uses a Mixture-of-Experts transformer with Hybrid Attention, quantized to NVFP4 for efficient inference.
specs
| Task | Text Generation |
| Architecture | Mixture-of-Experts (MoE) with Hybrid Attention |
| Parameters | 35B total (3B activated) |
| License | Apache 2.0 |
| Quantization | NVFP4 (Model Optimizer) |
about this model
NVIDIA Qwen3.6-35B-A3B-NVFP4 is a text-generation model that combines a Mixture-of-Experts (MoE) transformer with hybrid attention, supporting text, image, and video inputs up to a 262K context length. It is the NVFP4-quantized version of Alibaba’s Qwen3.6-35B-A3B, optimized using NVIDIA Model Optimizer to reduce disk size and GPU memory requirements by approximately 3.06× while retaining near-lossless accuracy.
Architecture and Quantization
The model has 35B total parameters with 3B activated per token. Quantization is applied to the weights and activations of linear operators within MoE transformer blocks, converting from BF16 to NVFP4. This enables deployment on NVIDIA Hopper and Blackwell GPUs using the vLLM inference engine.
Benchmark Performance
Evaluated across reasoning, coding, long-context recall, and instruction-following benchmarks, the NVFP4 quantized model shows negligible accuracy degradation compared to the BF16 baseline. Results are reported as accuracy percentages:
| Precision | MMLU Pro | GPQA Diamond | τ²-Bench Telecom | SciCode | AIME 2025 | AA-LCR | IFBench | MMMU PRO |
|---|---|---|---|---|---|---|---|---|
| BF16 | 85.6 | 84.9 | 95.5 | 40.8 | 89.2 | 62.0 | 62.3 | 74.1 |
| NVFP4 | 85.0 | 84.8 | 94.7 | 40.6 | 88.8 | 62.0 | 62.8 | 74.5 |
Baseline: Qwen3.6-35B-A3B. SciCode uses temperature=0.6, top_p=0.95; all others temperature=1.0, top_p=0.95, max tokens 131072.
Limitations and Licensing
The base model may reflect biases and toxic language from its training data. Developers should evaluate for their specific use case. Licensed under Apache 2.0.
best for
- ·AI agent and chatbot systems
- ·Retrieval-augmented generation (RAG)
- ·Long-context reasoning (up to 262K tokens)
FAQ
It is a quantized version of Alibaba's Qwen3.6-35B-A3B, an auto-regressive language model using a MoE transformer with hybrid attention, optimized for deployment.
It reduces bits per parameter from 16 to 4, lowering disk size and GPU memory by ~3.06x, while maintaining over 98% of original accuracy on benchmarks like MMLU Pro and GPQA Diamond.
Accepts text, image, and video input; outputs text strings. Supports context length up to 262K tokens.
Use the OpenAI-compatible endpoint with your API key. Refer to gigarouter’s documentation for endpoint URL and request format.
Apache License 2.0, allowing commercial and non-commercial use.
We're benchmarking and onboarding Qwen 3.6-35B-A3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.