skip to content
gigarouter gigarouter
models / vision-language · coming soon

Moondream 2

vikhyatk/moondream2

published Mar 2024 · updated Sep 2025

Moondream 2 is a small vision language model designed to run efficiently everywhere.

est. price
~$0.626
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
1.6M
license
apache-2.0

specs

TaskVision Language Model (VLM)
ArchitectureTransformer-based vision-language model
Parameters2 billion
Context Length2048 tokens

about this model

Moondream 2 is a vision language model (VLM) that combines image understanding, captioning, visual question answering, object detection, and pointing in a compact architecture designed for efficient deployment across devices. The model is available in 2B and 0.5B parameter variants, with the larger version offering robust general-purpose performance.

Key Capabilities

  • Grounded Reasoning: A step-by-step reasoning mode that explicitly ties answers to spatial positions in the image before responding, improving precision for tasks such as chart median calculations and accurate counting.
  • Object Detection: Uses reinforcement learning on high-quality bounding-box annotations to reduce object clumping and distinguish fine-grained categories (e.g., “blue bottle” vs. “bottle”).
  • Faster Text Generation: A “superword” tokenizer with a lightweight transfer hypernetwork yields 20–40% faster response generation without accuracy loss, and eases future multilingual support.
  • UI Understanding: ScreenSpot [email protected] improved from 60.3 to 80.4 in the latest release, making the model effective for UI element localization.
  • Visual Querying & Pointing: Supports free-form questions, long-form captioning, open-vocabulary image tagging, and point-based location (e.g., “person”).

Notable Benchmark Results

Benchmark Score (Release)
ChartQA 77.5 (82.2 with PoT) – 2025-04-15
DocVQA 79.3 – 2025-04-15
TextVQA 76.3 – 2025-04-15
CountBenchQA 86.4 – 2025-03-27
OCRBench 61.2 – 2025-03-27
COCO (object detection) 51.2 – 2025-03-27

Training and Architecture Details

Moondream 2 (2025-04-14 release) was trained on approximately 450 billion tokens, significantly fewer than comparable models. A custom second-order optimizer balances conflicting gradients across tasks, and a self-supervised auxiliary image loss accelerates convergence. Reinforcement learning fine-tuning was applied across 55 vision-language tasks, with plans to expand to ~120 tasks. The model’s context length is 2048 tokens.

best for

FAQ

What tasks does Moondream 2 support?

It supports image captioning, visual question answering, object detection, and pointing.

How does it compare to larger models in size and speed?

Moondream 2 has 2 billion parameters and is designed for efficient inference on edge devices. The June 2025 release introduced 20-40% faster text generation via a superword tokenizer.

What is the context length limit?

2048 tokens.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, passing an image URL or base64 and a text prompt.

How often is the model updated?

It is updated frequently (e.g., March 27, April 15, June 21, 2025). Specify a revision tag for production use.

not yet live

We're benchmarking and onboarding Moondream 2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →