YOLOE
jameslahm/yoloe
published Mar 2025 · updated Mar 2025
YOLOE is a real-time open-vocabulary object detection and segmentation model that supports text, visual, and prompt-free prompts.
specs
| Task | Object Detection and Segmentation |
| Architecture | YOLOv8 / YOLO11 with RepRTA, SAVPE, LRPC |
| Parameters | 10M to 50M (depending on variant) |
| License | AGPL-3.0 |
about this model
jameslahm/yoloe is a real-time object detection and segmentation model that supports text, visual, and prompt-free paradigms within a single unified architecture. It is hosted on Gigarouter as an OpenAI-compatible API.
Key capabilities
- RepRTA (Re-parameterizable Region-Text Alignment) for text prompts: refines textual embeddings via a lightweight auxiliary network that can be re-parameterized, adding zero inference and transfer overhead.
- SAVPE (Semantic-Activated Visual Prompt Encoder) for visual prompts: decouples semantic and activation branches to improve visual embedding accuracy with minimal complexity.
- LRPC (Lazy Region-Prompt Contrast) for prompt-free detection: uses a built-in large vocabulary and specialized embedding to identify all objects without costly language model dependency.
Benchmark highlights
On the LVIS minival set (zero-shot detection):
| Model | Size | AP (text) | AP (visual) |
|---|---|---|---|
| YOLOE-v8-S | 640 | 27.9 | 26.2 |
| YOLOE-v8-M | 640 | 32.6 | 31.0 |
| YOLOE-v8-L | 640 | 35.9 | 34.2 |
YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP on LVIS, with 3× less training cost and 1.4× inference speedup (T4 TensorRT: 305.8 FPS). On COCO downstream transfer (full tuning, 80 epochs), YOLOE-v8-L achieves 53.0 AP and 42.7 AP, outperforming closed-set YOLOv8-L by +0.6 AP and +0.4 AP with nearly 4× less training time. Prompt-free evaluation on LVIS yields up to 27.2 AP (YOLOE-v8-L) at 25.3 FPS (T4 PyTorch).
The model is licensed under AGPL-3.0. The underlying research has been accepted at ICCV 2025. No installation or local setup is required – the model is accessed through Gigarouter's API endpoint.
best for
- ·Open-vocabulary real-time object detection in video surveillance
- ·Instance segmentation with text or visual prompts
- ·Prompt-free detection for general object discovery
FAQ
YOLOE supports text prompts, visual prompts (e.g., an image patch), and a prompt-free mode where it detects all objects using a built-in vocabulary.
YOLOE achieves higher accuracy with less training cost and faster inference; e.g., YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP on LVIS with 3x less training cost and 1.4x inference speedup.
Input is an image file or URL along with optional text or visual prompts. The API endpoint is OpenAI-compatible; use an API key for authentication.
The model is released under the AGPL-3.0 license.
Yes, YOLOE jointly outputs bounding boxes and instance masks, supporting both tasks in a single model.
We're benchmarking and onboarding YOLOE as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.