LocateAnything 3B

nvidia/LocateAnything-3B

published Mar 2026 · updated Jun 2026

LocateAnything 3B is a vision-language grounding model that performs fast and high-quality object localization, dense detection, and point-based localization using parallel box decoding.

est. price

~$0.939

/ 1k images · estimated, set at launch

specs

Task	Vision-Language Grounding (object detection, referring expression grounding, point localization)
Architecture	Transformer-based VLM with MoonViT vision encoder, Qwen2.5-3B language model, MLP projector, and Parallel Box Decoding (PBD)
Parameters	3B
License	NVIDIA License for non-commercial research and development only

about this model

LocateAnything-3B is a vision-language grounding model that performs fast and precise object localization, dense detection, and point-based grounding from natural language instructions.

Developed by NVIDIA as part of the Eagle VLM family, its core innovation is Parallel Box Decoding (PBD), which predicts complete bounding box coordinates in a single parallel step instead of token-by-token autoregressive decoding. This approach preserves geometric coherence and achieves up to 12.7 boxes per second (BPS) on a single H100 — approximately 10 times faster than Qwen3-VL (1.1 BPS) and 2.5 times faster than Rex-Omni (5.0 BPS).

The model was trained on a large-scale multi-domain dataset comprising 12 million images, over 138 million queries, and 785 million bounding boxes, covering natural scenes, robotics, driving, GUI interaction, and document understanding. It supports up to 2.5K image resolution and 24K token prompts.

Key Benchmark Results

Benchmark	Metric	Score	Comparison
LVIS Detection	F1@Mean	50.7	+3.8 vs Rex-Omni
COCO Detection	F1@Mean	54.7	+1.8 vs Rex-Omni
Dense Detection (Dense200)	F1@Mean	58.7	Outperforms Rex-Omni (58.3)
Document Layout (DocLayNet)	F1@Mean	76.8	+6.1 vs Rex-Omni
M6Doc (OCR localization)	F1@Mean	70.1	+14.5 vs Rex-Omni
ScreenSpot-Pro (GUI grounding)	Average	60.3	State-of-the-art
HumanRef Pointing	[email protected]	68.8	State-of-the-art
RefCOCOg Referring Expression	F1@Mean	76.7	State-of-the-art

LocateAnything-3B achieves state-of-the-art results across multiple grounding and detection benchmarks, with particularly strong high-IoU performance (e.g., 31.1 [email protected] on LVIS vs Rex-Omni's 20.7). The model supports three inference modes — Fast (parallel), Slow (autoregressive), and Hybrid (parallel with fallback) — and has been accepted to ECCV 2026. Gigarouter hosts this model as a managed API, providing OpenAI-compatible endpoints for efficient integration.

best for

·Open-set object detection in natural scenes
·GUI element grounding for agentic systems
·Automated dataset labeling and annotation

FAQ

What tasks is LocateAnything 3B best suited for?

It excels at open-set object detection, dense multi-object detection, referring expression grounding, GUI element grounding, point-based localization, and document layout understanding.

How fast is LocateAnything 3B compared to other models?

With Parallel Box Decoding, it achieves up to 2.5x higher throughput than prior approaches (e.g., 12.7 BPS on a single H100, ~10x faster than Qwen3-VL).

What is the license for using this model?

It is released under the NVIDIA License for non-commercial use, permitting academic and non-profit research only. Commercial use is not allowed except by NVIDIA.

How do I call LocateAnything 3B via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key. Send an image and a text prompt; the model returns structured coordinates and labels.

What input and output formats does the model support?

Input: RGB image (up to 2.5K resolution) and natural-language text (up to 24K tokens). Output: text with bounding boxes in format <box> x1, y1, x2, y2 </box> and points as <box> x, y </box>.

not yet live

We're benchmarking and onboarding LocateAnything 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.