models / vision-language · coming soon

Moondream 2

vikhyatk/moondream2

published Mar 2024 · updated Sep 2025

Moondream 2 is a small vision language model designed to run efficiently everywhere.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

1.6M

license

apache-2.0

specs

Task	Vision Language Model (VLM)
Architecture	Transformer-based vision-language model
Parameters	2 billion
Context Length	2048 tokens

about this model

Moondream 2 is a vision language model (VLM) that combines image understanding, captioning, visual question answering, object detection, and pointing in a compact architecture designed for efficient deployment across devices. The model is available in 2B and 0.5B parameter variants, with the larger version offering robust general-purpose performance.

Key Capabilities

Grounded Reasoning: A step-by-step reasoning mode that explicitly ties answers to spatial positions in the image before responding, improving precision for tasks such as chart median calculations and accurate counting.
Object Detection: Uses reinforcement learning on high-quality bounding-box annotations to reduce object clumping and distinguish fine-grained categories (e.g., “blue bottle” vs. “bottle”).
Faster Text Generation: A “superword” tokenizer with a lightweight transfer hypernetwork yields 20–40% faster response generation without accuracy loss, and eases future multilingual support.
UI Understanding: ScreenSpot [email protected] improved from 60.3 to 80.4 in the latest release, making the model effective for UI element localization.
Visual Querying & Pointing: Supports free-form questions, long-form captioning, open-vocabulary image tagging, and point-based location (e.g., “person”).

Notable Benchmark Results

Benchmark	Score (Release)
ChartQA	77.5 (82.2 with PoT) – 2025-04-15
DocVQA	79.3 – 2025-04-15
TextVQA	76.3 – 2025-04-15
CountBenchQA	86.4 – 2025-03-27
OCRBench	61.2 – 2025-03-27
COCO (object detection)	51.2 – 2025-03-27

Training and Architecture Details

Moondream 2 (2025-04-14 release) was trained on approximately 450 billion tokens, significantly fewer than comparable models. A custom second-order optimizer balances conflicting gradients across tasks, and a self-supervised auxiliary image loss accelerates convergence. Reinforcement learning fine-tuning was applied across 55 vision-language tasks, with plans to expand to ~120 tasks. The model’s context length is 2048 tokens.

best for

·Image captioning and visual question answering
·Object detection and pointing
·Chart and document understanding
·UI automation and element localization

FAQ

What tasks does Moondream 2 support?

It supports image captioning, visual question answering, object detection, and pointing.

How does it compare to larger models in size and speed?

Moondream 2 has 2 billion parameters and is designed for efficient inference on edge devices. The June 2025 release introduced 20-40% faster text generation via a superword tokenizer.

What is the context length limit?

2048 tokens.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, passing an image URL or base64 and a text prompt.

How often is the model updated?

It is updated frequently (e.g., March 27, April 15, June 21, 2025). Specify a revision tag for production use.

not yet live

We're benchmarking and onboarding Moondream 2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit