skip to content
gigarouter gigarouter
models / object detection · coming soon

Locate Anything 3B

mudler/locate-anything.cpp-gguf

published Jun 2026 · updated Jun 2026

Locate Anything 3B is an open-vocabulary detection and visual grounding model that locates objects in images based on text prompts.

status
coming soon
API providers
0
downloads / mo
4.6K
license
other

specs

TaskOpen-vocabulary detection / visual grounding
ArchitectureQwen2.5-3B (LM) + MoonViT (vision) + 2-layer MLP projector
Parameters3B
LicenseNVIDIA license (weights); MIT (GGUF repo)

about this model

locate-anything.cpp-gguf is an open-vocabulary detection model that localizes objects in images from natural-language prompts. It is a C++/ggml port of NVIDIA’s LocateAnything-3B, producing detections identical to the official PyTorch version while running faster on both CPU and GPU.

Performance

On a Ryzen 9 9950X3D (CPU, 16 threads, 448×448 fixture) the recommended q8_0 build (box-identical, 6.3 GB) delivers a 3.9× speedup over the official f32 model. q6_k (5.5 GB) is also box-identical at 4.1×. Lower-bit quantizations trade a small amount of box precision for further speed:

dtypesizeinfervs official f32boxes
f169.15 GB13.68 s1.7×identical
q8_06.26 GB6.07 s3.9×identical
q6_k5.51 GB5.77 s4.1×identical
q5_k5.10 GB5.11 s4.6×sub-pixel
q4_k4.72 GB4.29 s5.5×sub-pixel
Quantization size vs CPU speedup chart showing decreasing size and increasing speedup from f16 to q4_k

On an NVIDIA GB10 GPU (against the official bf16 model) the f16 build is ~1.7× faster; the box-identical q8_0 build is ~1.9–2.1× faster:

Bar chart showing GPU speedup of locate-anything.cpp vs official bf16 across three scenes

Quantization Policy

Only the Qwen2 language-model matmuls are quantized; the MoonViT vision tower, projector, norms, and biases remain in f32. This preserves box parity through q6_k. The model can also run in three decode modes (slow, hybrid, fast) with IoU ≥0.999 against official outputs.

best for

FAQ

What is the input and output format for this model?

Input: an image and a text prompt (e.g., "Locate all the instances that matches the following description: person</c>car."). Output: a JSON list of detections with labels and bounding boxes, plus an optional annotated PNG.

How does this model compare in speed to the official PyTorch version?

On CPU, the recommended q8_0 GGUF is ~3.9× faster than official PyTorch f32 and box-identical. On GPU (NVIDIA GB10), it is ~1.9–2.1× faster than official bf16.

What quantization levels are available and which is recommended?

Available: f16, q8_0, q6_k, q5_k, q4_k. The q8_0 GGUF is recommended as the sweet spot: box-identical to f32, less than half the size, and ~3.9× faster than official PyTorch.

What are the license terms for using this model?

The model weights are under NVIDIA's license (not standard open-source). The GGUF repository is MIT-licensed. You must comply with NVIDIA's license for the weights.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key. Send a request with the image and prompt to the hosted model endpoint.

not yet live

We're benchmarking and onboarding Locate Anything 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

compare all →