Ornith 1.0 9B MTP
protoLabsAI/Ornith-1.0-9B-MTP-GGUF
published Jun 2026 · updated Jul 2026
Ornith 1.0 9B MTP is a text-generation model that uses a multi-token prediction (MTP) head for lossless speculative decoding, achieving up to 1.73x speedup on llama.cpp.
specs
| Task | Text Generation |
| Architecture | Qwen3.5-9B hybrid (linear-attention + full-attention) with MTP head |
| Parameters | 9B |
| License | MIT |
about this model
Ornith-1.0-9B-MTP-GGUF is a text-generation model that bundles a 9B-parameter Qwen3.5-9B hybrid (linear-attention + full-attention) fine-tune with a KL-distilled multi-token prediction (MTP) draft head, enabling lossless self-speculative decoding in llama.cpp without a separate draft model.
Speculative Decoding Performance
On a single RTX A6000 (ctx 8192, flash-attn, greedy), the model achieves the following single-stream decode speedups with the MTP head:
| Config | Decode tok/s | Acceptance | Speedup |
|---|---|---|---|
| Base (no MTP) | 71.0 | — | 1.00× |
| MTP n-max 2 | 118.3 | 0.766 | 1.67× |
| MTP n-max 3 | 122.6 | 0.651 | 1.73× |
| MTP n-max 4 | 120.8 | 0.565 | 1.70× |
Acceptance is quant-stable: across Q4_K_M and Q8_0 at n-max 3, acceptance remains ~0.65. The relative speedup grows with precision (Q8_0’s bandwidth-bound baseline gains 1.73× vs Q4_K_M’s 1.38×). The KL-distilled head achieves 0.765 acceptance on coding prompts and 0.762 on general corpus (vLLM reference: 0.762).
NVFP4 Quantization for Blackwell
A dedicated NVFP4 build (6.6 GB) on Blackwell hardware (RTX 50xx / PRO 6000) delivers the fastest inference in the repository:
| File | Size | No-MTP tok/s | +MTP tok/s | Platform |
|---|---|---|---|---|
| Q4_K_M | 5.8 GB | 104.6 | 153.4 | Ampere A6000 |
| NVFP4-MTP | 6.6 GB | 70.7 | 84.8 | Ampere A6000 |
| Q4_K_M | 5.8 GB | 205.1 | 239 (216–252) | Blackwell |
| NVFP4-MTP | 6.6 GB | 201.5 | 306 (287–330) | Blackwell |
On Blackwell, NVFP4+MTP is ~28% faster than Q4_K_M+MTP due to near-zero verify-cost overhead on tensor-core GEMMs. Draft acceptance is near-equal between the two files (0.52 vs 0.49). On Ampere and older GPUs, Q4_K_M remains smaller and faster.
Quality and Distribution-Losslessness
Speculative decoding is distribution-lossless: every drafted token is verified against the target, preserving the output distribution. (Small non-bitwise differences can occur at greedy sampling due to floating-point reduction order — both outputs are equally valid.) Quantizing the target to NVFP4 does not degrade the draft head: acceptance remains 0.76 on real text vs 0.762 on BF16. The NVFP4 quant also scores a pass rate of 0.96 on function_call (baseline BF16 0.93) and maintains coherence through 60K context.
best for
- ·Real-time chat applications requiring high throughput
- ·Code generation and autocompletion
- ·Long-form content generation with low latency
FAQ
Up to 1.73x over base decode on llama.cpp, measured with Q8_0 and n-max 3. On Q4_K_M speedup is ~1.38x.
The MTP head is baked into the single GGUF file, so no separate draft model is needed. It performs lossless self-speculative decoding.
MIT license — both the base model and MTP head are MIT. The GGUF builds are a derivative, also MIT.
Use the OpenAI-compatible endpoint with your API key. The model name is the name shown on gigarouter, pass it as the model parameter.
GGUF quants: BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_XS, IQ3_M, IQ2_M, and an NVFP4 variant for Blackwell GPUs.
We're benchmarking and onboarding Ornith 1.0 9B MTP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.