RT-DETR R101
PekingU/rtdetr_r101vd_coco_o365
published Jun 2024 · updated Jul 2024
RT-DETR R101 is a real-time end-to-end object detection model using a ResNet-101-vd backbone and Transformer encoder-decoder architecture.
specs
| Spec | Value |
|---|---|
| Task | Object Detection |
| Architecture | RT-DETR with ResNet-101-vd backbone and Transformer encoder-decoder |
| Parameters | 76 million |
| License | Apache 2.0 |
| Input Size | 640x640 pixels |
about this model
RT-DETR (Real-Time Detection Transformer) is an object detection model that eliminates the need for non-maximum suppression (NMS) by using an end-to-end Transformer architecture, achieving real-time inference speeds competitive with YOLO while delivering higher accuracy.
Architecture and Key Strengths
The model uses an efficient hybrid encoder that decouples intra-scale interaction from cross-scale fusion, so multi-scale features are processed quickly. Uncertainty-minimal query selection supplies high-quality initial queries to the decoder, which improves accuracy. Inference speed can be tuned without retraining by changing the number of decoder layers used, so the same model can be adapted to different latency budgets.
On a T4 GPU, RT-DETR-R101 reaches 74 FPS (batch size 1) and 54.3% AP on COCO val2017. After pre-training on Objects365, the same model achieves 56.2% AP. The work was accepted to CVPR 2024.
Benchmark Results (COCO val2017 with Objects365 pre-training)
| Model | #Epochs | Params (M) | GFLOPs | FPS (bs=1) | AP | AP50 | AP75 | AP-s | AP-m | AP-l |
|---|---|---|---|---|---|---|---|---|---|---|
| RT-DETR-R18 (O365) | 60 | 20 | 61 | 217 | 49.2 | 66.6 | 53.5 | 33.2 | 52.3 | 64.8 |
| RT-DETR-R50 (O365) | 24 | 42 | 136 | 108 | 55.3 | 73.4 | 60.1 | 37.9 | 59.9 | 71.8 |
| RT-DETR-R101 (O365) | 24 | 76 | 259 | 74 | 56.2 | 74.6 | 61.3 | 38.3 | 60.5 | 73.5 |
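The FPS column can be read as a per-image latency budget, which is often the more useful number when sizing a real-time pipeline. The short sketch below converts the batch-size-1 T4 figures from the table into milliseconds per image; the dictionary and function names are illustrative only, and the result is a simple average (1000 / FPS), not a measured latency distribution.

```python
# FPS values (batch size 1, T4 GPU) copied from the benchmark table above.
FPS_BS1 = {
    "RT-DETR-R18 (O365)": 217,
    "RT-DETR-R50 (O365)": 108,
    "RT-DETR-R101 (O365)": 74,
}


def latency_ms(fps: float) -> float:
    """Average per-image latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


# Rounded to one decimal place for readability.
LATENCY_MS = {name: round(latency_ms(fps), 1) for name, fps in FPS_BS1.items()}

for name, ms in LATENCY_MS.items():
    print(f"{name}: {ms} ms per image")
```

By this estimate the R101 variant needs roughly 13.5 ms per image, which fits within the 33 ms frame budget of a 30 FPS camera feed on comparable hardware.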
best for
- Real-time object detection in video surveillance
- High-accuracy detection for autonomous driving systems
- Batch inference on live camera feeds requiring low latency
FAQ
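The model expects an RGB image resized to 640x640 and supplied as a float tensor with pixel values rescaled to [0, 1]. The checkpoint's preprocessing config lists the ImageNet mean (0.485, 0.456, 0.406) and std (0.229, 0.224, 0.225), but normalization appears to be disabled by default in the Hugging Face image processor, so check the config before applying it manually.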
The model outputs a bounding box, a class label, and a confidence score for each detected object, using the COCO classes.
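Because RT-DETR needs no NMS, turning raw predictions into final detections usually comes down to a score threshold and a coordinate conversion. The sketch below is a minimal illustration, assuming the raw boxes follow the DETR-style convention of normalized (center x, center y, width, height); that layout, the `Detection` class, and the function names are assumptions for illustration, not this page's API.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Detection:
    label: int                                   # class index (e.g. a COCO class id)
    score: float                                 # confidence in [0, 1]
    box_xyxy: Tuple[float, float, float, float]  # pixel corners (x_min, y_min, x_max, y_max)


def cxcywh_to_xyxy(box: Sequence[float], image_width: int, image_height: int):
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max).

    ASSUMPTION: boxes are normalized to [0, 1] relative to the image size.
    """
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    )


def filter_detections(boxes, scores, labels, image_width, image_height,
                      threshold: float = 0.5) -> List[Detection]:
    """Keep predictions scoring at least `threshold`; no NMS step is applied."""
    return [
        Detection(label, score, cxcywh_to_xyxy(box, image_width, image_height))
        for box, score, label in zip(boxes, scores, labels)
        if score >= threshold
    ]


# Usage with made-up predictions for a 640x480 image.
detections = filter_detections(
    boxes=[(0.5, 0.5, 0.5, 0.5), (0.25, 0.25, 0.1, 0.1)],
    scores=[0.9, 0.2],
    labels=[0, 17],
    image_width=640,
    image_height=480,
)
```

The default threshold of 0.5 is an arbitrary choice for the example; a deployment would tune it against its own precision and recall requirements.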
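Compared with YOLO detectors, RT-DETR eliminates the NMS post-processing step and reports higher accuracy at competitive real-time speeds: the R101 variant reaches 56.2% AP on COCO val2017 (with Objects365 pre-training) at 74 FPS on a T4 GPU.
The model is released under the Apache 2.0 license, which allows free use, modification, and distribution.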
To call the model through the API, use the OpenAI-compatible endpoint with your gigarouter API key and send the image as a base64-encoded string in the request.
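The base64 step can be sketched with the Python standard library alone. The request schema is not documented on this page, so the JSON field names (`model`, `image`) and the use of the Hugging Face repository id as the model name are hypothetical placeholders; only the encoding itself is standard.

```python
import base64
import json


def encode_image_base64(image_bytes: bytes) -> str:
    """Return the image bytes as an ASCII base64 string suitable for a JSON body."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_request_body(image_bytes: bytes,
                       model: str = "PekingU/rtdetr_r101vd_coco_o365") -> str:
    """Serialize a request body.

    HYPOTHETICAL layout: the real endpoint may expect different field names
    or nesting; check the provider's API reference before relying on this.
    """
    return json.dumps({
        "model": model,                              # assumed model identifier
        "image": encode_image_base64(image_bytes),   # assumed field name
    })


# Usage: in practice image_bytes would come from open(path, "rb").read().
body = build_request_body(b"hello")
```

Base64 inflates the payload by roughly one third, so resizing large frames before encoding keeps request sizes down.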
RT-DETR R101 is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API and is not yet available for requests.