Ornith 1.0 9B MTP

protoLabsAI/Ornith-1.0-9B-MTP-GGUF

published Jun 2026 · updated Jul 2026

Ornith 1.0 9B MTP is a text-generation model that uses a multi-token prediction (MTP) head for lossless speculative decoding, achieving up to 1.73x speedup on llama.cpp.

status

coming soon

API providers

downloads / mo

16.8K

license

mit

specs

Task	Text Generation
Architecture	Qwen3.5-9B hybrid (linear-attention + full-attention) with MTP head
Parameters	9B
License	MIT

about this model

Ornith-1.0-9B-MTP-GGUF is a text-generation model that bundles a 9B-parameter Qwen3.5-9B hybrid (linear-attention + full-attention) fine-tune with a KL-distilled multi-token prediction (MTP) draft head, enabling lossless self-speculative decoding in llama.cpp without a separate draft model.

Speculative Decoding Performance

On a single RTX A6000 (ctx 8192, flash-attn, greedy), the model achieves the following single-stream decode speedups with the MTP head:

Config	Decode tok/s	Acceptance	Speedup
Base (no MTP)	71.0	—	1.00×
MTP n-max 2	118.3	0.766	1.67×
MTP n-max 3	122.6	0.651	1.73×
MTP n-max 4	120.8	0.565	1.70×

Acceptance is quant-stable: across Q4_K_M and Q8_0 at n-max 3, acceptance remains ~0.65. The relative speedup grows with precision (Q8_0’s bandwidth-bound baseline gains 1.73× vs Q4_K_M’s 1.38×). The KL-distilled head achieves 0.765 acceptance on coding prompts and 0.762 on general corpus (vLLM reference: 0.762).

NVFP4 Quantization for Blackwell

A dedicated NVFP4 build (6.6 GB) on Blackwell hardware (RTX 50xx / PRO 6000) delivers the fastest inference in the repository:

File	Size	No-MTP tok/s	+MTP tok/s	Platform
Q4_K_M	5.8 GB	104.6	153.4	Ampere A6000
NVFP4-MTP	6.6 GB	70.7	84.8	Ampere A6000
Q4_K_M	5.8 GB	205.1	239 (216–252)	Blackwell
NVFP4-MTP	6.6 GB	201.5	306 (287–330)	Blackwell

On Blackwell, NVFP4+MTP is ~28% faster than Q4_K_M+MTP due to near-zero verify-cost overhead on tensor-core GEMMs. Draft acceptance is near-equal between the two files (0.52 vs 0.49). On Ampere and older GPUs, Q4_K_M remains smaller and faster.

Quality and Distribution-Losslessness

Speculative decoding is distribution-lossless: every drafted token is verified against the target, preserving the output distribution. (Small non-bitwise differences can occur at greedy sampling due to floating-point reduction order — both outputs are equally valid.) Quantizing the target to NVFP4 does not degrade the draft head: acceptance remains 0.76 on real text vs 0.762 on BF16. The NVFP4 quant also scores a pass rate of 0.96 on function_call (baseline BF16 0.93) and maintains coherence through 60K context.

best for

·Real-time chat applications requiring high throughput
·Code generation and autocompletion
·Long-form content generation with low latency

FAQ

What speedup does the MTP head provide?

Up to 1.73x over base decode on llama.cpp, measured with Q8_0 and n-max 3. On Q4_K_M speedup is ~1.38x.

How is this different from using a separate draft model?

The MTP head is baked into the single GGUF file, so no separate draft model is needed. It performs lossless self-speculative decoding.

What are the license terms?

MIT license — both the base model and MTP head are MIT. The GGUF builds are a derivative, also MIT.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key. The model name is the name shown on gigarouter, pass it as the model parameter.

What quantized formats are available?

GGUF quants: BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_XS, IQ3_M, IQ2_M, and an NVFP4 variant for Blackwell GPUs.

not yet live

We're benchmarking and onboarding Ornith 1.0 9B MTP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

tiny-Qwen2ForCausalLM-2.5

9.2M dl/mo

deepseek-v4-gguf

6.4M dl/mo

Qwen3.6-35B-A3B-NVFP4

6.2M dl/mo

gemma-3-270m

5.1M dl/mo