DETR ResNet-101 DC5

facebook/detr-resnet-101-dc5

published Mar 2022 · updated Sep 2023

DETR ResNet-101 DC5 is an end-to-end object detection model that uses a Transformer encoder-decoder with a dilated ResNet-101 backbone to directly predict bounding boxes and class labels from images.

est. price

~$0.047

/ 1k images · estimated, set at launch

API providers

downloads / mo

7.1K

license

apache-2.0

specs

Task	Object Detection
Architecture	Transformer encoder-decoder with ResNet-101 backbone (dilated C5 stage)
Parameters	~60 million (estimated)
License	Apache 2.0

about this model

facebook/detr-resnet-101-dc5 is an object detection model that uses a Transformer encoder-decoder architecture with a dilated ResNet-101 backbone, trained end-to-end on COCO 2017 (118k images). It formulates detection as a direct set prediction problem: a fixed set of 100 learned object queries is matched to ground-truth objects via bipartite matching (Hungarian algorithm), and the model outputs class labels and bounding boxes in parallel.
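The matching step can be illustrated with a toy example. This is a minimal sketch, not the model's training code: it brute-forces the assignment with `itertools.permutations` as a stand-in for the Hungarian algorithm, which finds the same optimum far more efficiently. The function name `match_queries`, the 4-by-2 cost matrix, and its values are hypothetical.

```python
from itertools import permutations


def match_queries(cost):
    """Return the lowest-cost one-to-one assignment of targets to queries.

    cost[q][g] is the matching cost between query q and ground-truth object g.
    Brute force is only practical for toy sizes; it stands in for the
    Hungarian algorithm here so the example needs no third-party library.
    """
    num_queries = len(cost)
    num_targets = len(cost[0])
    best_total, best_pairs = float("inf"), []
    # Try every ordered choice of distinct queries, one per ground-truth object.
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total = total
            best_pairs = [(q, g) for g, q in enumerate(queries)]
    return sorted(best_pairs), best_total


# Hypothetical costs for 4 queries (rows) and 2 ground-truth objects (columns).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
    [0.5, 0.4],
]
pairs, total = match_queries(cost)

# Queries left without a ground-truth object are trained to predict "no object".
matched = {q for q, _ in pairs}
unmatched = [q for q in range(len(cost)) if q not in matched]
print(pairs, unmatched)
```

With 100 queries and only a handful of objects per image, most queries end up in the unmatched group, which is why the "no object" class matters.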

Key strengths

Simplified pipeline: no hand-crafted post-processing (NMS) or region proposal network.
Strong performance on COCO val5k: average precision (AP) of 44.9, with AP50 = 64.7, AP75 = 47.7, APS (small) = 23.8, APM (medium) = 49.0, and APL (large) = 64.2.
Inference speed: 0.097 seconds per image (first 100 COCO val5k images with TorchScript transformer). Model checkpoint size is 232 MB.
Training: 500 epochs with a learning rate drop at epoch 400, distributed across 8 nodes; with dilation enabled, the batch size is 1 per GPU.

Comparison to other DETR variants (COCO val5k box AP)

Model	Backbone	Schedule	Inference time (s)	AP
DETR	R50	500	0.036	42.0
DETR-DC5	R50	500	0.083	43.3
DETR	R101	500	0.050	43.5
DETR-DC5	R101	500	0.097	44.9
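The speed and accuracy trade-off in this table can be made concrete with a little arithmetic. The sketch below uses only the numbers from the table and takes DETR R50 as the baseline; the dictionary layout and field names are illustrative choices.

```python
# (inference time in seconds per image, box AP), copied from the table above.
variants = {
    "DETR R50": (0.036, 42.0),
    "DETR-DC5 R50": (0.083, 43.3),
    "DETR R101": (0.050, 43.5),
    "DETR-DC5 R101": (0.097, 44.9),
}

base_time, base_ap = variants["DETR R50"]
summary = {}
for name, (seconds, ap) in variants.items():
    summary[name] = {
        # Throughput implied by the per-image latency.
        "images_per_second": round(1.0 / seconds, 1),
        # Accuracy gained relative to the DETR R50 baseline.
        "ap_gain": round(ap - base_ap, 1),
        # How many times slower than the baseline.
        "slowdown": round(seconds / base_time, 2),
    }

for name, row in summary.items():
    print(name, row)
```

Read this way, DETR-DC5 R101 gains 2.9 AP over DETR R50 while running about 2.7 times slower, at roughly 10 images per second under the measurement setup quoted above.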

Additional metrics (COCO val5k)

Average recall: AR@1 = 0.354, AR@10 = 0.571, AR@100 = 0.614
AR by size: small = 0.363, medium = 0.667, large = 0.834

best for

·Detecting common objects in natural images using COCO classes
·Applications requiring a simple, end-to-end detection pipeline without hand-crafted components

FAQ

What is the input format for this model?

The model expects an image resized so that the shortest side is at least 800 pixels and the longest side is at most 1333 pixels, then normalized per channel with the ImageNet mean and standard deviation. Once the hosted endpoint is live, images are sent through the gigarouter OpenAI-compatible endpoint with an API key.
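The resize rule can be sketched in plain Python. This is an approximation under stated assumptions: the helper names `target_size` and `normalize_pixel` are hypothetical, the rounding behaviour of real preprocessors may differ by a pixel, and the mean and standard deviation are the commonly used ImageNet values (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225).

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def target_size(width, height, shortest=800, longest=1333):
    """Scale so the short side reaches `shortest`, unless that would push
    the long side past `longest`; in that case the long-side cap wins."""
    scale = shortest / min(width, height)
    if max(width, height) * scale > longest:
        scale = longest / max(width, height)
    return round(width * scale), round(height * scale)


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to ImageNet-normalized float values."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(target_size(640, 480))    # short side scaled up to 800
print(target_size(2000, 500))   # very wide image: long side capped at 1333
print(normalize_pixel((255, 255, 255)))
```

For very elongated images the long-side cap takes priority, so the short side can end up below 800 pixels.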

What output does the model produce?

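It outputs one prediction for each of the 100 object queries. Each prediction consists of a bounding box and a class label drawn from the COCO classes plus an extra "no object" class, which marks queries that did not find an object.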
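Turning the raw per-query outputs into detections might look like the sketch below. It assumes, as in DETR's reference implementation, that the "no object" class is the last logit and that boxes are normalized (center x, center y, width, height). The function name `decode_queries`, the 0.9 score threshold, and the toy two-class inputs are illustrative assumptions, not part of this page.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_queries(class_logits, boxes, image_width, image_height, threshold=0.9):
    """Keep confident queries and convert their boxes to pixel corners."""
    detections = []
    for logits, (cx, cy, w, h) in zip(class_logits, boxes):
        probs = softmax(logits)
        object_probs = probs[:-1]  # assumption: last entry is "no object"
        score = max(object_probs)
        if score < threshold:
            continue  # empty or low-confidence query
        detections.append({
            "label": object_probs.index(score),
            "score": score,
            # Normalized center format -> absolute (x_min, y_min, x_max, y_max).
            "box": (
                (cx - w / 2) * image_width,
                (cy - h / 2) * image_height,
                (cx + w / 2) * image_width,
                (cy + h / 2) * image_height,
            ),
        })
    return detections


# Toy input: 3 queries, 2 object classes plus "no object".
class_logits = [[6.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.5, 0.4, 0.3]]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1), (0.3, 0.3, 0.2, 0.2)]
detections = decode_queries(class_logits, boxes, image_width=1000, image_height=500)
print(detections)
```

Only the first query survives here: the second is dominated by "no object" and the third is too uncertain. No NMS step is needed because each ground-truth object is matched to a single query during training.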

How does this model compare to other DETR variants in speed and accuracy?

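At 44.9 AP on COCO val5k, it is the most accurate of the four DETR variants listed above, ahead of DETR R50 (42.0 AP), DETR-DC5 R50 (43.3 AP), and DETR R101 (43.5 AP). It is also the slowest, at 0.097 seconds per image versus 0.036 s, 0.083 s, and 0.050 s respectively.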

What is the license for this model?

The model is released under the Apache 2.0 license.

How can I call this model via the gigarouter API?

The hosted endpoint is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key, passing an image URL or a base64-encoded image.

not yet live

We're benchmarking and onboarding DETR ResNet-101 DC5 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

table-transformer-structure-recognition

1.8M dl/mo

table-transformer-detection

1.5M dl/mo

yolos-small

713.6K dl/mo

PP-DocLayoutV3_safetensors

341.1K dl/mo

rtdetr_v2_r50vd

309.8K dl/mo

rtdetr_r50vd_coco_o365

254.5K dl/mo