Qwen3.6 35B A3B

unsloth/Qwen3.6-35B-A3B-MTP-GGUF

published May 2026 · updated May 2026

Qwen3.6 35B A3B is a vision-language model (VLM) that excels at agentic coding, repository-level reasoning, and tool calling, with Multi-Token Prediction for faster inference.

status

coming soon

API providers

downloads / mo

734.7K

license

apache-2.0

specs

Task	Vision-Language Model (Causal LM with Vision Encoder)
Architecture	Mixture of Experts (MoE) with 256 experts (8 active + 1 shared), Gated DeltaNet & Gated Attention
Parameters	35B total, 3B activated
Context Length	262,144 tokens (extensible to 1,010,000)
MTP Support	Multi-Token Prediction for ~1.5-2x faster inference

about this model

Qwen3.6-35B-A3B is a vision-language model that combines a causal language model with a vision encoder, featuring a Mixture-of-Experts architecture (35B total parameters, 3B activated) with 256 experts (8 routed + 1 shared). It supports native context lengths of 262,144 tokens, extensible to ~1M tokens.

Key capabilities

The model is optimized for agentic coding workflows, including repository-level reasoning, frontend development, and tool calling. It introduces thinking preservation, retaining reasoning context across iterative interactions. Multi-Token Prediction (MTP) enables 1.4–2.2x faster inference without accuracy loss, as implemented in the Unsloth GGUF quantizations hosted on gigarouter.

Benchmark performance

On coding agent benchmarks, Qwen3.6-35B-A3B achieves competitive or leading scores among similarly sized models (all scores from the official model card):

Benchmark	Score
SWE-bench Verified	73.4
SWE-bench Multilingual	67.2
SWE-bench Pro	49.5
Terminal-Bench 2.0	51.5
Claw-Eval (Avg)	68.7
SkillsBench (Avg5)	28.7

Benchmark results chart comparing Qwen3.6 to other models

Quantization and deployment

The gigarouter API serves Unsloth Dynamic 2.0 GGUF quantizations, which use a curated calibration dataset of over 1.5M tokens to preserve conversational quality. Memory requirements range from 17 GB (3-bit) to 70 GB (BF16). Recommended inference settings vary by mode: thinking mode uses temperature 1.0 / top_p 0.95 / top_k 20 for general tasks, or 0.6 / 0.95 / 20 for precise coding; instruct (non-thinking) mode uses 0.7 / 0.8 / 20 with presence penalty 1.5.

best for

·Agentic coding and software engineering tasks like SWE-bench
·Multi-step reasoning with thinking context preservation
·Tool calling and code execution in development workflows

FAQ

What speedup does Multi-Token Prediction (MTP) provide?

MTP enables ~1.5-2x faster inference with no accuracy loss.

What are the memory requirements for running this model?

Quantized versions require 17 GB (3-bit) to 70 GB (BF16) of total RAM+VRAM.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key.

What is the context length supported?

Native 262,144 tokens, extensible up to 1,010,000 tokens.

What are the recommended inference parameters?

For thinking mode: temperature 1.0, top_p 0.95, top_k 20. For instruct mode: temperature 0.7, top_p 0.8, top_k 20, presence_penalty 1.5.

not yet live

We're benchmarking and onboarding Qwen3.6 35B A3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit