RT-DETR R101

PekingU/rtdetr_r101vd_coco_o365

published Jun 2024 · updated Jul 2024

RT-DETR R101 is a real-time end-to-end object detection model using a ResNet-101-vd backbone and Transformer encoder-decoder architecture.

est. price

~$0.047 / 1k images · estimated, set at launch

downloads / mo

99.4K

license

apache-2.0

specs

Task	Object Detection
Architecture	RT-DETR with ResNet-101-vd backbone and Transformer encoder-decoder
Parameters	76 million
License	Apache 2.0
Input Size	640x640 pixels

about this model

RT-DETR (Real-Time Detection Transformer) is an object detection model that eliminates the need for non-maximum suppression (NMS) by using an end-to-end Transformer architecture, achieving real-time inference speeds competitive with YOLO while delivering higher accuracy.

Architecture and Key Strengths

The model uses an efficient hybrid encoder that decouples intra-scale interaction (AIFI, attention-based) from cross-scale fusion (CCFF, CNN-based) so that multi-scale features are processed quickly. Uncertainty-minimal query selection supplies the decoder with high-quality initial queries, which improves accuracy. Inference speed can be tuned without retraining by changing the number of decoder layers used, so one trained model can serve different latency requirements.
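The query selection step described above can be illustrated with a minimal, pure-Python sketch. It only shows the selection itself: each encoder token is ranked by its highest class score and the top-scoring tokens become the decoder's initial queries. The function name, the list-based inputs, and the default of 300 queries are assumptions made for illustration; the real model works on tensors, and the "uncertainty-minimal" part concerns how the scores are trained, which is not shown here.

```python
import heapq

def select_initial_queries(class_scores, num_queries=300):
    """Return indices of the encoder tokens with the highest top-class score.

    class_scores: one list of per-class scores for each encoder token.
    num_queries: how many tokens to keep (300 is an assumed default).
    """
    # Confidence of a token = its best score over all classes.
    confidence = [max(scores) for scores in class_scores]
    k = min(num_queries, len(confidence))
    # Indices of the k most confident tokens, most confident first.
    return heapq.nlargest(k, range(len(confidence)), key=confidence.__getitem__)

# Toy example: 4 encoder tokens, 2 classes.
scores = [
    [0.10, 0.20],  # token 0
    [0.90, 0.05],  # token 1
    [0.30, 0.60],  # token 2
    [0.02, 0.01],  # token 3
]
top = select_initial_queries(scores, num_queries=2)
```

In this toy input, tokens 1 and 2 have the highest per-token confidence, so they are the ones passed on as initial queries.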

On a T4 GPU, RT-DETR-R101 reaches 74 FPS (batch size 1) and 54.3% AP on COCO val2017. After pre-training on Objects365, the same model achieves 56.2% AP. The work was accepted to CVPR 2024.

Benchmark Results (COCO val2017 with Objects365 pre-training)

Model	#Epochs	Params (M)	GFLOPs	FPS (bs=1)	AP	AP50	AP75	AP-s	AP-m	AP-l
RT-DETR-R18 (O365)	60	20	61	217	49.2	66.6	53.5	33.2	52.3	64.8
RT-DETR-R50 (O365)	24	42	136	108	55.3	73.4	60.1	37.9	59.9	71.8
RT-DETR-R101 (O365)	24	76	259	74	56.2	74.6	61.3	38.3	60.5	73.5
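
To relate the FPS column to a latency budget, the following sketch converts the table's throughput figures into per-image latency and checks them against a frame interval. The numbers are copied from the table above; the helper names and the 30 FPS camera in the usage lines are assumptions for illustration, and the result covers the benchmark setting only (T4, batch size 1), not decoding, pre-processing, or network time.

```python
# FPS (bs=1), GFLOPs and AP copied from the benchmark table above.
benchmarks = {
    "RT-DETR-R18 (O365)": {"fps": 217, "gflops": 61, "ap": 49.2},
    "RT-DETR-R50 (O365)": {"fps": 108, "gflops": 136, "ap": 55.3},
    "RT-DETR-R101 (O365)": {"fps": 74, "gflops": 259, "ap": 56.2},
}

def latency_ms(fps):
    """Per-image latency in milliseconds at batch size 1."""
    return 1000.0 / fps

def fits_budget(fps, budget_ms):
    """True if one inference fits inside the given time budget."""
    return latency_ms(fps) <= budget_ms

# Hypothetical 30 FPS camera: one frame arrives every ~33.3 ms.
frame_interval_ms = 1000.0 / 30

for name, row in benchmarks.items():
    ok = fits_budget(row["fps"], frame_interval_ms)
    print(f"{name}: {latency_ms(row['fps']):.1f} ms/image, AP {row['ap']}, fits 30 FPS feed: {ok}")
```

At 74 FPS, RT-DETR-R101 takes roughly 13.5 ms per image, which leaves headroom inside a 33.3 ms frame interval under these assumptions.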

Model Overview and Architecture

[Figure: Comparison of RT-DETR with other real-time detectors, showing the speed-accuracy trade-off]

[Table: Training hyperparameters for RT-DETR]

[Figure: RT-DETR architecture, showing the backbone, the efficient hybrid encoder with AIFI and CCFF, uncertainty-minimal query selection, and the decoder]

best for

·Real-time object detection in video surveillance
·High-accuracy detection for autonomous driving systems
·Low-latency inference on live camera feeds

FAQ

What input format does the model expect?

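It expects an RGB image resized to 640x640 and passed as a tensor with pixel values rescaled to [0, 1]. The mean (0.485, 0.456, 0.406) and std (0.229, 0.224, 0.225) are the standard ImageNet statistics; whether they are applied depends on the preprocessing configuration, and the Hugging Face image processor for RT-DETR leaves normalization off by default.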
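
As a rough illustration of that input format, here is a dependency-free sketch that turns an H x W x 3 image (nested lists of 0-255 values) into a 3 x 640 x 640 array of floats. It is not the library's preprocessing code: the `preprocess` function is hypothetical, the nearest-neighbour resize stands in for the interpolating resize a real pipeline would use, and mean/std normalization is an optional flag because whether it is applied depends on the preprocessing configuration.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def preprocess(image, size=640, normalize=False):
    """Convert an H x W x 3 nested list of 0-255 values to 3 x size x size floats.

    Uses nearest-neighbour resizing (a simplification) and rescales to [0, 1].
    If normalize is True, the ImageNet mean/std above are applied per channel.
    """
    src_h, src_w = len(image), len(image[0])
    out = [[[0.0] * size for _ in range(size)] for _ in range(3)]
    for y in range(size):
        src_y = min(src_h - 1, y * src_h // size)
        for x in range(size):
            src_x = min(src_w - 1, x * src_w // size)
            for c in range(3):
                value = image[src_y][src_x][c] / 255.0
                if normalize:
                    value = (value - IMAGENET_MEAN[c]) / IMAGENET_STD[c]
                out[c][y][x] = value
    return out

# Tiny 2 x 2 RGB image, resized to 4 x 4 here to keep the example fast.
tiny = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]
tensor = preprocess(tiny, size=4)
```

The result is channel-first (channels, height, width), which is the layout PyTorch vision models generally take; a batch dimension would still need to be added before inference.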

What does the model output?

It outputs bounding boxes, class labels, and confidence scores for detected objects (COCO classes).
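
A minimal sketch of turning raw predictions into usable detections follows. It assumes the raw boxes are normalized (center x, center y, width, height) values in [0, 1], as is common for DETR-style models; the `postprocess` function, the dictionary layout, and the 0.5 threshold are hypothetical choices for illustration. Because the model is end-to-end, the only filtering shown is a score threshold, with no NMS step.

```python
def postprocess(boxes_cxcywh, scores, labels, image_width, image_height, threshold=0.5):
    """Keep detections above a score threshold and convert boxes to pixel corners.

    boxes_cxcywh: normalized (cx, cy, w, h) tuples in [0, 1] (assumed format).
    Returns dicts with "label", "score" and "box" = (x_min, y_min, x_max, y_max).
    """
    detections = []
    for (cx, cy, w, h), score, label in zip(boxes_cxcywh, scores, labels):
        if score < threshold:
            continue  # low-confidence query, dropped without any NMS
        detections.append({
            "label": label,
            "score": score,
            "box": (
                (cx - w / 2) * image_width,
                (cy - h / 2) * image_height,
                (cx + w / 2) * image_width,
                (cy + h / 2) * image_height,
            ),
        })
    return detections

# Toy predictions for a 1000 x 500 image.
dets = postprocess(
    boxes_cxcywh=[(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)],
    scores=[0.92, 0.30],
    labels=["person", "dog"],
    image_width=1000,
    image_height=500,
)
```

Only the first prediction survives the threshold here; its box is centred in the image and scaled back to the original pixel dimensions rather than the 640x640 model input.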

How does RT-DETR compare to YOLO in speed and accuracy?

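RT-DETR removes the NMS post-processing step that YOLO detectors rely on. The authors report that RT-DETR-R101 (54.3% AP on COCO val2017 at 74 FPS on a T4) outperforms comparable YOLO detectors in both speed and accuracy; with Objects365 pre-training, the same model reaches 56.2% AP.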

What is the license for this model?

Apache 2.0, allowing free use, modification, and distribution.

How can I call this model via the gigarouter API?

The hosted endpoint is not live yet. Once it is, call the OpenAI-compatible endpoint with your gigarouter API key and send the image as a base64-encoded string in the request body.
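
The base64 step can be sketched as below. The request schema is not documented on this page, so the JSON field names (`model`, `image`), the data-URL form, and the `build_request_body` helper are placeholders rather than the actual API; the sketch only builds a request body and makes no network call.

```python
import base64
import json

def build_request_body(image_bytes, model="PekingU/rtdetr_r101vd_coco_o365", mime_type="image/jpeg"):
    """Encode raw image bytes as base64 and wrap them in a JSON body.

    The field names below are hypothetical; check the provider's API
    reference for the real schema once the model is hosted.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,  # hypothetical field name
        "image": f"data:{mime_type};base64,{encoded}",  # hypothetical field name
    })

# Stand-in bytes; in practice these would come from open(path, "rb").read().
body = build_request_body(b"\xff\xd8\xff")
```

The API key would normally travel in an authorization header rather than in the body, but the exact header and endpoint path are likewise not specified here.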

not yet live

We're benchmarking and onboarding RT-DETR R101 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

table-transformer-structure-recognition	1.8M dl/mo
table-transformer-detection	1.5M dl/mo
yolos-small	713.6K dl/mo
PP-DocLayoutV3_safetensors	341.1K dl/mo
rtdetr_v2_r50vd	309.8K dl/mo
rtdetr_r50vd_coco_o365	254.5K dl/mo