LocateAnything 3B
nvidia/LocateAnything-3B
published Mar 2026 · updated Jun 2026
LocateAnything 3B is a vision-language grounding model that performs fast and high-quality object localization, dense detection, and point-based localization using parallel box decoding.
specs
| Task | Vision-Language Grounding (object detection, referring expression grounding, point localization) |
| Architecture | Transformer-based VLM with MoonViT vision encoder, Qwen2.5-3B language model, MLP projector, and Parallel Box Decoding (PBD) |
| Parameters | 3B |
| License | NVIDIA License for non-commercial research and development only |
about this model
LocateAnything-3B is a vision-language grounding model that performs fast and precise object localization, dense detection, and point-based grounding from natural language instructions.
Developed by NVIDIA as part of the Eagle VLM family, its core innovation is Parallel Box Decoding (PBD), which predicts complete bounding box coordinates in a single parallel step instead of token-by-token autoregressive decoding. This approach preserves geometric coherence and achieves up to 12.7 boxes per second (BPS) on a single H100 — approximately 10 times faster than Qwen3-VL (1.1 BPS) and 2.5 times faster than Rex-Omni (5.0 BPS).
The model was trained on a large-scale multi-domain dataset comprising 12 million images, over 138 million queries, and 785 million bounding boxes, covering natural scenes, robotics, driving, GUI interaction, and document understanding. It supports up to 2.5K image resolution and 24K token prompts.
Key Benchmark Results
| Benchmark | Metric | Score | Comparison |
|---|---|---|---|
| LVIS Detection | F1@Mean | 50.7 | +3.8 vs Rex-Omni |
| COCO Detection | F1@Mean | 54.7 | +1.8 vs Rex-Omni |
| Dense Detection (Dense200) | F1@Mean | 58.7 | Outperforms Rex-Omni (58.3) |
| Document Layout (DocLayNet) | F1@Mean | 76.8 | +6.1 vs Rex-Omni |
| M6Doc (OCR localization) | F1@Mean | 70.1 | +14.5 vs Rex-Omni |
| ScreenSpot-Pro (GUI grounding) | Average | 60.3 | State-of-the-art |
| HumanRef Pointing | [email protected] | 68.8 | State-of-the-art |
| RefCOCOg Referring Expression | F1@Mean | 76.7 | State-of-the-art |
LocateAnything-3B achieves state-of-the-art results across multiple grounding and detection benchmarks, with particularly strong high-IoU performance (e.g., 31.1 [email protected] on LVIS vs Rex-Omni's 20.7). The model supports three inference modes — Fast (parallel), Slow (autoregressive), and Hybrid (parallel with fallback) — and has been accepted to ECCV 2026. Gigarouter hosts this model as a managed API, providing OpenAI-compatible endpoints for efficient integration.
best for
- ·Open-set object detection in natural scenes
- ·GUI element grounding for agentic systems
- ·Automated dataset labeling and annotation
FAQ
It excels at open-set object detection, dense multi-object detection, referring expression grounding, GUI element grounding, point-based localization, and document layout understanding.
With Parallel Box Decoding, it achieves up to 2.5x higher throughput than prior approaches (e.g., 12.7 BPS on a single H100, ~10x faster than Qwen3-VL).
It is released under the NVIDIA License for non-commercial use, permitting academic and non-profit research only. Commercial use is not allowed except by NVIDIA.
Use the gigarouter OpenAI-compatible endpoint with your API key. Send an image and a text prompt; the model returns structured coordinates and labels.
Input: RGB image (up to 2.5K resolution) and natural-language text (up to 24K tokens). Output: text with bounding boxes in format <box> x1, y1, x2, y2 </box> and points as <box> x, y </box>.
We're benchmarking and onboarding LocateAnything 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.