Qwen 3.6-35B-A3B

nvidia/Qwen3.6-35B-A3B-NVFP4

published May 2026 · updated Jun 2026

Qwen 3.6-35B-A3B is a text-generation model that uses a Mixture-of-Experts transformer with Hybrid Attention, quantized to NVFP4 for efficient inference.

status

coming soon

API providers

downloads / mo

6.2M

license

apache-2.0

specs

Task	Text Generation
Architecture	Mixture-of-Experts (MoE) with Hybrid Attention
Parameters	35B total (3B activated)
License	Apache 2.0
Quantization	NVFP4 (Model Optimizer)

about this model

NVIDIA Qwen3.6-35B-A3B-NVFP4 is a text-generation model that combines a Mixture-of-Experts (MoE) transformer with hybrid attention, supporting text, image, and video inputs up to a 262K context length. It is the NVFP4-quantized version of Alibaba’s Qwen3.6-35B-A3B, optimized using NVIDIA Model Optimizer to reduce disk size and GPU memory requirements by approximately 3.06× while retaining near-lossless accuracy.

Architecture and Quantization

The model has 35B total parameters with 3B activated per token. Quantization is applied to the weights and activations of linear operators within MoE transformer blocks, converting from BF16 to NVFP4. This enables deployment on NVIDIA Hopper and Blackwell GPUs using the vLLM inference engine.

Benchmark Performance

Evaluated across reasoning, coding, long-context recall, and instruction-following benchmarks, the NVFP4 quantized model shows negligible accuracy degradation compared to the BF16 baseline. Results are reported as accuracy percentages:

Precision	MMLU Pro	GPQA Diamond	τ²-Bench Telecom	SciCode	AIME 2025	AA-LCR	IFBench	MMMU PRO
BF16	85.6	84.9	95.5	40.8	89.2	62.0	62.3	74.1
NVFP4	85.0	84.8	94.7	40.6	88.8	62.0	62.8	74.5

Baseline: Qwen3.6-35B-A3B. SciCode uses temperature=0.6, top_p=0.95; all others temperature=1.0, top_p=0.95, max tokens 131072.

Limitations and Licensing

The base model may reflect biases and toxic language from its training data. Developers should evaluate for their specific use case. Licensed under Apache 2.0.

best for

·AI agent and chatbot systems
·Retrieval-augmented generation (RAG)
·Long-context reasoning (up to 262K tokens)

FAQ

What is Qwen 3.6-35B-A3B?

It is a quantized version of Alibaba's Qwen3.6-35B-A3B, an auto-regressive language model using a MoE transformer with hybrid attention, optimized for deployment.

How does the NVFP4 quantization affect performance?

It reduces bits per parameter from 16 to 4, lowering disk size and GPU memory by ~3.06x, while maintaining over 98% of original accuracy on benchmarks like MMLU Pro and GPQA Diamond.

What are the input and output formats?

Accepts text, image, and video input; outputs text strings. Supports context length up to 262K tokens.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key. Refer to gigarouter’s documentation for endpoint URL and request format.

What license is this model released under?

Apache License 2.0, allowing commercial and non-commercial use.

not yet live

We're benchmarking and onboarding Qwen 3.6-35B-A3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

tiny-Qwen2ForCausalLM-2.5

dolphin-2.9.1-yi-1.5-34b

4.6M dl/mo