skip to content
gigarouter gigarouter
models / text generation · coming soon

Qwen 3.6-35B-A3B

nvidia/Qwen3.6-35B-A3B-NVFP4

published May 2026 · updated Jun 2026

Qwen 3.6-35B-A3B is a text-generation model that uses a Mixture-of-Experts transformer with Hybrid Attention, quantized to NVFP4 for efficient inference.

status
coming soon
API providers
0
downloads / mo
6.2M
license
apache-2.0

specs

TaskText Generation
ArchitectureMixture-of-Experts (MoE) with Hybrid Attention
Parameters35B total (3B activated)
LicenseApache 2.0
QuantizationNVFP4 (Model Optimizer)

about this model

NVIDIA Qwen3.6-35B-A3B-NVFP4 is a text-generation model that combines a Mixture-of-Experts (MoE) transformer with hybrid attention, supporting text, image, and video inputs up to a 262K context length. It is the NVFP4-quantized version of Alibaba’s Qwen3.6-35B-A3B, optimized using NVIDIA Model Optimizer to reduce disk size and GPU memory requirements by approximately 3.06× while retaining near-lossless accuracy.

Architecture and Quantization

The model has 35B total parameters with 3B activated per token. Quantization is applied to the weights and activations of linear operators within MoE transformer blocks, converting from BF16 to NVFP4. This enables deployment on NVIDIA Hopper and Blackwell GPUs using the vLLM inference engine.

Benchmark Performance

Evaluated across reasoning, coding, long-context recall, and instruction-following benchmarks, the NVFP4 quantized model shows negligible accuracy degradation compared to the BF16 baseline. Results are reported as accuracy percentages:

Precision MMLU Pro GPQA Diamond τ²-Bench Telecom SciCode AIME 2025 AA-LCR IFBench MMMU PRO
BF16 85.6 84.9 95.5 40.8 89.2 62.0 62.3 74.1
NVFP4 85.0 84.8 94.7 40.6 88.8 62.0 62.8 74.5

Baseline: Qwen3.6-35B-A3B. SciCode uses temperature=0.6, top_p=0.95; all others temperature=1.0, top_p=0.95, max tokens 131072.

Limitations and Licensing

The base model may reflect biases and toxic language from its training data. Developers should evaluate for their specific use case. Licensed under Apache 2.0.

best for

FAQ

What is Qwen 3.6-35B-A3B?

It is a quantized version of Alibaba's Qwen3.6-35B-A3B, an auto-regressive language model using a MoE transformer with hybrid attention, optimized for deployment.

How does the NVFP4 quantization affect performance?

It reduces bits per parameter from 16 to 4, lowering disk size and GPU memory by ~3.06x, while maintaining over 98% of original accuracy on benchmarks like MMLU Pro and GPQA Diamond.

What are the input and output formats?

Accepts text, image, and video input; outputs text strings. Supports context length up to 262K tokens.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key. Refer to gigarouter’s documentation for endpoint URL and request format.

What license is this model released under?

Apache License 2.0, allowing commercial and non-commercial use.

not yet live

We're benchmarking and onboarding Qwen 3.6-35B-A3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

compare all →