skip to content
gigarouter gigarouter
models / grounding · coming soon

LocateAnything 3B

nvidia/LocateAnything-3B

published Mar 2026 · updated Jun 2026

LocateAnything 3B is a vision-language grounding model that performs fast and high-quality object localization, dense detection, and point-based localization using parallel box decoding.

est. price
~$0.939
/ 1k images · estimated, set at launch

specs

TaskVision-Language Grounding (object detection, referring expression grounding, point localization)
ArchitectureTransformer-based VLM with MoonViT vision encoder, Qwen2.5-3B language model, MLP projector, and Parallel Box Decoding (PBD)
Parameters3B
LicenseNVIDIA License for non-commercial research and development only

about this model

LocateAnything-3B is a vision-language grounding model that performs fast and precise object localization, dense detection, and point-based grounding from natural language instructions.

Developed by NVIDIA as part of the Eagle VLM family, its core innovation is Parallel Box Decoding (PBD), which predicts complete bounding box coordinates in a single parallel step instead of token-by-token autoregressive decoding. This approach preserves geometric coherence and achieves up to 12.7 boxes per second (BPS) on a single H100 — approximately 10 times faster than Qwen3-VL (1.1 BPS) and 2.5 times faster than Rex-Omni (5.0 BPS).

The model was trained on a large-scale multi-domain dataset comprising 12 million images, over 138 million queries, and 785 million bounding boxes, covering natural scenes, robotics, driving, GUI interaction, and document understanding. It supports up to 2.5K image resolution and 24K token prompts.

Key Benchmark Results

BenchmarkMetricScoreComparison
LVIS DetectionF1@Mean50.7+3.8 vs Rex-Omni
COCO DetectionF1@Mean54.7+1.8 vs Rex-Omni
Dense Detection (Dense200)F1@Mean58.7Outperforms Rex-Omni (58.3)
Document Layout (DocLayNet)F1@Mean76.8+6.1 vs Rex-Omni
M6Doc (OCR localization)F1@Mean70.1+14.5 vs Rex-Omni
ScreenSpot-Pro (GUI grounding)Average60.3State-of-the-art
HumanRef Pointing[email protected]68.8State-of-the-art
RefCOCOg Referring ExpressionF1@Mean76.7State-of-the-art

LocateAnything-3B achieves state-of-the-art results across multiple grounding and detection benchmarks, with particularly strong high-IoU performance (e.g., 31.1 [email protected] on LVIS vs Rex-Omni's 20.7). The model supports three inference modes — Fast (parallel), Slow (autoregressive), and Hybrid (parallel with fallback) — and has been accepted to ECCV 2026. Gigarouter hosts this model as a managed API, providing OpenAI-compatible endpoints for efficient integration.

best for

FAQ

What tasks is LocateAnything 3B best suited for?

It excels at open-set object detection, dense multi-object detection, referring expression grounding, GUI element grounding, point-based localization, and document layout understanding.

How fast is LocateAnything 3B compared to other models?

With Parallel Box Decoding, it achieves up to 2.5x higher throughput than prior approaches (e.g., 12.7 BPS on a single H100, ~10x faster than Qwen3-VL).

What is the license for using this model?

It is released under the NVIDIA License for non-commercial use, permitting academic and non-profit research only. Commercial use is not allowed except by NVIDIA.

How do I call LocateAnything 3B via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key. Send an image and a text prompt; the model returns structured coordinates and labels.

What input and output formats does the model support?

Input: RGB image (up to 2.5K resolution) and natural-language text (up to 24K tokens). Output: text with bounding boxes in format <box> x1, y1, x2, y2 </box> and points as <box> x, y </box>.

not yet live

We're benchmarking and onboarding LocateAnything 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.