Moondream 2
vikhyatk/moondream2
published Mar 2024 · updated Sep 2025
Moondream 2 is a small vision language model designed to run efficiently everywhere.
specs
| Task | Vision Language Model (VLM) |
| Architecture | Transformer-based vision-language model |
| Parameters | 2 billion |
| Context Length | 2048 tokens |
about this model
Moondream 2 is a vision language model (VLM) that combines image understanding, captioning, visual question answering, object detection, and pointing in a compact architecture designed for efficient deployment across devices. The model is available in 2B and 0.5B parameter variants, with the larger version offering robust general-purpose performance.
Key Capabilities
- Grounded Reasoning: A step-by-step reasoning mode that explicitly ties answers to spatial positions in the image before responding, improving precision for tasks such as chart median calculations and accurate counting.
- Object Detection: Uses reinforcement learning on high-quality bounding-box annotations to reduce object clumping and distinguish fine-grained categories (e.g., “blue bottle” vs. “bottle”).
- Faster Text Generation: A “superword” tokenizer with a lightweight transfer hypernetwork yields 20–40% faster response generation without accuracy loss, and eases future multilingual support.
- UI Understanding: ScreenSpot [email protected] improved from 60.3 to 80.4 in the latest release, making the model effective for UI element localization.
- Visual Querying & Pointing: Supports free-form questions, long-form captioning, open-vocabulary image tagging, and point-based location (e.g., “person”).
Notable Benchmark Results
| Benchmark | Score (Release) |
|---|---|
| ChartQA | 77.5 (82.2 with PoT) – 2025-04-15 |
| DocVQA | 79.3 – 2025-04-15 |
| TextVQA | 76.3 – 2025-04-15 |
| CountBenchQA | 86.4 – 2025-03-27 |
| OCRBench | 61.2 – 2025-03-27 |
| COCO (object detection) | 51.2 – 2025-03-27 |
Training and Architecture Details
Moondream 2 (2025-04-14 release) was trained on approximately 450 billion tokens, significantly fewer than comparable models. A custom second-order optimizer balances conflicting gradients across tasks, and a self-supervised auxiliary image loss accelerates convergence. Reinforcement learning fine-tuning was applied across 55 vision-language tasks, with plans to expand to ~120 tasks. The model’s context length is 2048 tokens.
best for
- ·Image captioning and visual question answering
- ·Object detection and pointing
- ·Chart and document understanding
- ·UI automation and element localization
FAQ
It supports image captioning, visual question answering, object detection, and pointing.
Moondream 2 has 2 billion parameters and is designed for efficient inference on edge devices. The June 2025 release introduced 20-40% faster text generation via a superword tokenizer.
2048 tokens.
Use the OpenAI-compatible endpoint with your API key, passing an image URL or base64 and a text prompt.
It is updated frequently (e.g., March 27, April 15, June 21, 2025). Specify a revision tag for production use.
We're benchmarking and onboarding Moondream 2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.