YOLOE

jameslahm/yoloe

published Mar 2025 · updated Mar 2025

YOLOE is a real-time open-vocabulary object detection and segmentation model that supports text, visual, and prompt-free prompts.

status

coming soon

API providers

downloads / mo

4.3K

license

agpl-3.0

specs

Task	Object Detection and Segmentation
Architecture	YOLOv8 / YOLO11 with RepRTA, SAVPE, LRPC
Parameters	10M to 50M (depending on variant)
License	AGPL-3.0

about this model

jameslahm/yoloe is a real-time object detection and segmentation model that supports text, visual, and prompt-free paradigms within a single unified architecture. It is hosted on Gigarouter as an OpenAI-compatible API.

Key capabilities

RepRTA (Re-parameterizable Region-Text Alignment) for text prompts: refines textual embeddings via a lightweight auxiliary network that can be re-parameterized, adding zero inference and transfer overhead.
SAVPE (Semantic-Activated Visual Prompt Encoder) for visual prompts: decouples semantic and activation branches to improve visual embedding accuracy with minimal complexity.
LRPC (Lazy Region-Prompt Contrast) for prompt-free detection: uses a built-in large vocabulary and specialized embedding to identify all objects without costly language model dependency.

Benchmark highlights

On the LVIS minival set (zero-shot detection):

Model	Size	AP (text)	AP (visual)
YOLOE-v8-S	640	27.9	26.2
YOLOE-v8-M	640	32.6	31.0
YOLOE-v8-L	640	35.9	34.2

YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP on LVIS, with 3× less training cost and 1.4× inference speedup (T4 TensorRT: 305.8 FPS). On COCO downstream transfer (full tuning, 80 epochs), YOLOE-v8-L achieves 53.0 AP and 42.7 AP, outperforming closed-set YOLOv8-L by +0.6 AP and +0.4 AP with nearly 4× less training time. Prompt-free evaluation on LVIS yields up to 27.2 AP (YOLOE-v8-L) at 25.3 FPS (T4 PyTorch).

Comparison of performance, training cost, and inference efficiency between YOLOE and YOLO-Worldv2 in terms of open text prompts.

The model is licensed under AGPL-3.0. The underlying research has been accepted at ICCV 2025. No installation or local setup is required – the model is accessed through Gigarouter's API endpoint.

best for

·Open-vocabulary real-time object detection in video surveillance
·Instance segmentation with text or visual prompts
·Prompt-free detection for general object discovery

FAQ

What prompt types does YOLOE support?

YOLOE supports text prompts, visual prompts (e.g., an image patch), and a prompt-free mode where it detects all objects using a built-in vocabulary.

How does YOLOE compare to YOLO-World in terms of speed and accuracy?

YOLOE achieves higher accuracy with less training cost and faster inference; e.g., YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP on LVIS with 3x less training cost and 1.4x inference speedup.

What is the input format for the hosted API on gigarouter?

Input is an image file or URL along with optional text or visual prompts. The API endpoint is OpenAI-compatible; use an API key for authentication.

What license governs use of YOLOE?

The model is released under the AGPL-3.0 license.

Can YOLOE be used for both detection and segmentation?

Yes, YOLOE jointly outputs bounding boxes and instance masks, supporting both tasks in a single model.

not yet live

We're benchmarking and onboarding YOLOE as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

compare all →

table-transformer-structure-recognition

1.8M dl/mo

table-transformer-detection

1.5M dl/mo

yolos-small

713.6K dl/mo

PP-DocLayoutV3_safetensors

341.1K dl/mo

rtdetr_v2_r50vd

309.8K dl/mo

rtdetr_r50vd_coco_o365

254.5K dl/mo