RT-DETR (R18)

PekingU/rtdetr_r18vd

published May 2024 · updated Jul 2024

RT-DETR (R18) is a real-time end-to-end object detection model that eliminates the need for NMS post-processing.

est. price

~$0.047

/ 1k images · estimated, set at launch
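
To get a feel for what this estimate could mean for a real workload, the sketch below scales the per-1k-image price to an arbitrary image count. The price constant is the provisional estimate shown above and may change at launch; the function name and the one-hour, 30 fps video workload are assumptions made only for illustration.

```python
# Estimated price from this page; it is set at launch, so treat it as provisional.
EST_PRICE_PER_1K_IMAGES_USD = 0.047


def estimate_cost_usd(num_images: int,
                      price_per_1k: float = EST_PRICE_PER_1K_IMAGES_USD) -> float:
    """Return the estimated cost in USD for processing num_images images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images / 1000 * price_per_1k


# Hypothetical workload: one hour of 30 fps video with every frame sent for detection.
frames = 30 * 60 * 60  # 108,000 frames
print(f"{frames} frames -> ${estimate_cost_usd(frames):.2f}")
```

Sampling every Nth frame instead of every frame divides the estimate by N, which is often the simplest cost lever for video.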

license

apache-2.0

specs

Task	Object Detection
Architecture	RT-DETR with ResNet-18 backbone
Parameters	20M
License	Apache-2.0

about this model

RT-DETR-R18vd is a real-time, end-to-end object detection model that eliminates the need for non-maximum suppression (NMS), a post-processing step that YOLO-based detectors depend on. Its authors present RT-DETR as the first real-time end-to-end Transformer-based detector. The architecture pairs an efficient hybrid encoder, which decouples intra-scale interaction from cross-scale fusion for speed, with uncertainty-minimal query selection for accuracy.
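
For context, the sketch below shows classic greedy NMS, the post-processing step that this model's end-to-end design avoids. It is not part of RT-DETR; it is a minimal pure-Python illustration, and the function names, the 0.5 IoU threshold, and the sample boxes are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object, plus one separate box.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.90, 0.85, 0.70]
kept = greedy_nms(boxes, scores)
print(kept)  # the lower-scoring duplicate (index 1) is suppressed
```

Both the IoU threshold and the score threshold that usually precedes NMS are hand-tuned hyperparameters, and the pairwise comparison loop runs after the network has finished. An end-to-end detector removes that stage from the pipeline.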

Key Strengths

The model balances speed and accuracy. On the COCO 2017 validation set, RT-DETR-R18vd reaches 46.5 AP (63.8 AP50, 50.4 AP75) at 217 FPS on a T4 GPU with batch size 1, using only 20M parameters and 60.7 GFLOPs. With Objects365 pretraining, AP rises to 49.2 at the same 217 FPS. Inference speed can also be tuned by changing the number of decoder layers used, without retraining.

Benchmark Results (COCO val2017)

Model	Params (M)	GFLOPs	FPS (bs=1)	AP	AP50	AP75
RT-DETR-R18	20	60.7	217	46.5	63.8	50.4
RT-DETR-R34	31	91.0	172	48.5	66.2	52.3
RT-DETR-R50	42	136	108	53.1	71.3	57.7
RT-DETR-R101	76	259	74	54.3	72.7	58.6

With Objects365 pretraining, RT-DETR-R18 reaches 49.2 AP, and larger variants achieve up to 56.2 AP (R101). The original paper was accepted to CVPR 2024.
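
The FPS column in the table above can be read as per-image latency, which is often easier to compare against a frame budget. The sketch below does that conversion with the table's own numbers. It assumes latency is simply the inverse of batch-size-1 throughput, which ignores pipelining and data-loading overhead, and the helper name is illustrative.

```python
# (params in millions, GFLOPs, FPS at batch size 1 on a T4, AP), copied from the table above.
BENCHMARKS = {
    "RT-DETR-R18": (20, 60.7, 217, 46.5),
    "RT-DETR-R34": (31, 91.0, 172, 48.5),
    "RT-DETR-R50": (42, 136.0, 108, 53.1),
    "RT-DETR-R101": (76, 259.0, 74, 54.3),
}


def latency_ms(fps: float) -> float:
    """Approximate per-image latency in milliseconds, assuming latency = 1 / throughput."""
    return 1000.0 / fps


for name, (params_m, gflops, fps, ap) in BENCHMARKS.items():
    print(f"{name:13s} {latency_ms(fps):6.2f} ms/image  AP {ap}")
```

By this estimate the R18 variant takes roughly 4.6 ms per image, well inside the 33.3 ms budget of a 30 fps stream on the benchmarked hardware.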

Architecture Overview

[Figure: RT-DETR architecture diagram showing the backbone, the efficient hybrid encoder with AIFI and CCFF modules, uncertainty-minimal query selection, and the decoder]

The model takes multi-scale features from the last three backbone stages and passes them through the hybrid encoder, which combines Attention-based Intra-scale Feature Interaction (AIFI) with CNN-based Cross-scale Feature Fusion (CCFF). It then selects high-quality initial queries for the decoder.
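
A quick token count helps show why decoupling intra-scale attention from cross-scale fusion saves compute. The sketch below assumes a 640x640 input (the size given in the FAQ on this page) and strides of 8, 16, and 32 for the last three ResNet stages. It also follows the RT-DETR paper's description that AIFI runs self-attention only on the coarsest feature map. The variable names are illustrative, and the ratio is a rough scaling argument, not a measured speedup.

```python
INPUT_SIZE = 640        # input resolution, per the FAQ on this page
STRIDES = (8, 16, 32)   # assumed strides of the last three ResNet stages


def tokens_per_level(input_size: int, strides) -> list:
    """Number of spatial positions (tokens) in each feature map."""
    return [(input_size // s) ** 2 for s in strides]


tokens = tokens_per_level(INPUT_SIZE, STRIDES)  # 80x80, 40x40, 20x20
all_tokens = sum(tokens)                        # attention over every scale at once
coarsest_tokens = tokens[-1]                    # attention on the coarsest map only

# Self-attention builds an N x N score matrix, so its cost grows with N squared.
cost_ratio = all_tokens ** 2 / coarsest_tokens ** 2

print(tokens, all_tokens, coarsest_tokens, cost_ratio)
```

Under these assumptions, attending over the coarsest map alone involves 400 tokens instead of 8,400, so the attention score matrix is about 441 times smaller. The finer maps are then merged by the convolutional CCFF path.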

[Figure: speed-accuracy comparison chart showing RT-DETR outperforming YOLO variants on a T4 GPU]

RT-DETR-R18vd is trained on COCO 2017 (118k training images) and is available under the Apache-2.0 license.

best for

· Real-time object detection in video streams or on edge devices
· Applications requiring end-to-end detection without NMS

FAQ

What is the input and output format for this model?

The model accepts images resized to 640x640 pixels and outputs bounding boxes, class labels, and confidence scores for detected objects.
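
As a rough sketch of how raw detector outputs become the boxes, labels, and scores described above, the code below decodes per-query predictions without any NMS step. It assumes the DETR-family convention of per-query class logits plus normalized (cx, cy, w, h) boxes. It also uses a simplified rule, best class per query followed by a score threshold, which may differ from the post-processing a given implementation applies. All names and sample values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_detections(logits: Sequence[Sequence[float]],
                      boxes: Sequence[Sequence[float]],
                      image_width: int, image_height: int,
                      score_threshold: float = 0.5) -> List[Dict]:
    """Convert per-query logits and normalized cxcywh boxes to pixel xyxy detections."""
    detections = []
    for query_logits, (cx, cy, w, h) in zip(logits, boxes):
        scores = [sigmoid(v) for v in query_logits]
        label = max(range(len(scores)), key=scores.__getitem__)
        score = scores[label]
        if score < score_threshold:
            continue  # low-confidence queries are dropped; no NMS is applied
        detections.append({
            "label": label,
            "score": score,
            "box": [(cx - w / 2) * image_width, (cy - h / 2) * image_height,
                    (cx + w / 2) * image_width, (cy + h / 2) * image_height],
        })
    return detections


# Two hypothetical queries over three classes, for a 640x480 original image.
logits = [[-4.0, 3.0, -2.0], [-3.0, -3.0, -3.0]]
boxes = [[0.5, 0.5, 0.25, 0.5], [0.1, 0.1, 0.05, 0.05]]
dets = decode_detections(logits, boxes, image_width=640, image_height=480)
print(dets)
```

Scaling by the original image size, not by the 640x640 network input, returns boxes in the coordinates of the image the caller supplied.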

How do I call this model via the gigarouter API?

Once the model is live, use the OpenAI-compatible endpoint with your API key: send an image URL or a base64-encoded image, and receive the detection results as JSON.
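
Because the endpoint is not yet live, its request schema is not documented here. The sketch below covers only the part that is certain, base64-encoding an image into a data URL, and wraps it in a hypothetical JSON body. The field names `model` and `image`, and the use of `PekingU/rtdetr_r18vd` as the model identifier, are assumptions and not the published API.

```python
import base64
import json


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_request_body(image_bytes: bytes,
                       model: str = "PekingU/rtdetr_r18vd") -> str:
    """Build a JSON request body.

    HYPOTHETICAL payload shape: the real field names are set by the provider
    and may differ once the endpoint is published.
    """
    body = {"model": model, "image": image_to_data_url(image_bytes)}
    return json.dumps(body)


# Placeholder bytes stand in for the contents of a real image file.
payload = build_request_body(b"placeholder image bytes")
print(payload[:60])
```

Sending the body would then be an ordinary authenticated HTTPS POST. Check the published API reference for the actual path, headers, and response schema before relying on any of these names.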

What is the license for RT-DETR (R18)?

The model is released under the Apache-2.0 license.

How does RT-DETR (R18) compare to YOLO in speed and accuracy?

RT-DETR (R18) achieves 46.5 AP on COCO val2017 at 217 FPS on a T4 GPU (batch size 1). The RT-DETR authors report that it outperforms earlier YOLO detectors of similar scale in both speed and accuracy, and it requires no NMS step.

What dataset was RT-DETR (R18) trained on?

The model was trained on the COCO 2017 object detection dataset (118k training images).

not yet live

We're benchmarking and onboarding RT-DETR (R18) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

table-transformer-structure-recognition

1.8M dl/mo

table-transformer-detection

1.5M dl/mo

yolos-small

713.6K dl/mo

PP-DocLayoutV3_safetensors

341.1K dl/mo

rtdetr_v2_r50vd

309.8K dl/mo

rtdetr_r50vd_coco_o365

254.5K dl/mo