Conditional DETR ResNet-50

microsoft/conditional-detr-resnet-50

published Sep 2022 · updated May 2024

Conditional DETR ResNet-50 is an object detection model that uses a conditional cross-attention mechanism for fast training convergence. It is trained on COCO 2017.

est. price

~$0.047 / 1k images · estimated, set at launch

downloads / mo

18.5K

license

apache-2.0

specs

Task	Object Detection
Architecture	Conditional DETR with ResNet-50 backbone (transformer encoder-decoder)
Parameters	44M
License	Apache 2.0
Training Data	COCO 2017 (118k annotated images)

about this model

Conditional DETR with a ResNet-50 backbone is a transformer-based object detection model that uses a conditional cross-attention mechanism to converge quickly during training while maintaining high accuracy. It is trained end-to-end on COCO 2017 (118k annotated images). To address the slow convergence of the original DETR, it learns a conditional spatial query from the decoder embedding. This spatial query lets each cross-attention head attend to a distinct region around the object, which reduces dependence on content embeddings and eases optimization.
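The role of the conditional spatial query can be illustrated with a small sketch. It assumes, following the description in the Conditional DETR paper rather than anything stated on this page, that the cross-attention logit is the sum of a content term and a spatial term, and that the spatial query is an element-wise scaling of a reference-point embedding by a vector predicted from the decoder embedding. The names `spatial_query`, `attention_logit`, and `lam` are illustrative, and the vectors are toy values rather than model weights.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spatial_query(lam, ref_embedding):
    """Conditional spatial query: scale the reference-point embedding
    element-wise by `lam`, a vector assumed to be predicted from the
    decoder embedding (a hypothetical stand-in for a learned FFN output)."""
    return [l * p for l, p in zip(lam, ref_embedding)]


def attention_logit(content_q, spatial_q, content_k, spatial_k):
    """Logit with content and spatial parts kept separate:
    content_q . content_k + spatial_q . spatial_k."""
    return dot(content_q, content_k) + dot(spatial_q, spatial_k)


# Toy 2-d vectors (illustrative only).
content_q, content_k = [0.5, -1.0], [1.0, 2.0]
ref_embedding, spatial_k = [1.0, 0.5], [0.2, 0.8]
lam = [2.0, 0.0]  # emphasises the first positional channel, mutes the second

p_q = spatial_query(lam, ref_embedding)
logit = attention_logit(content_q, p_q, content_k, spatial_k)

# Concatenating the content and spatial parts gives the same logit.
concat_logit = dot(content_q + p_q, content_k + spatial_k)
```

Because the spatial term is computed separately from the content term, changing `lam` moves where a head looks without touching the content match, which is one way to read the "reduced dependence on content embeddings" described above.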

Key Strengths

Converges faster than DETR-R50 (a 6.7× speedup is reported): reaches 41.0 AP in 50 epochs, within 1.0 AP of DETR-R50 trained for 500 epochs.
Built on the standard ResNet-50 backbone for broad compatibility and efficient inference.
Licensed under Apache 2.0.

Benchmark Results (COCO 2017 val)

The model achieves 41.0 AP after 50 epochs of training with 44M parameters and 90G FLOPs, versus DETR-R50 at 50 epochs (34.8 AP) and DETR-R50 at 500 epochs (42.0 AP).

Method	Epochs	Params (M)	FLOPs (G)	AP	AP_S	AP_M	AP_L
Conditional DETR-R50	50	44	90	41.0	20.6	44.3	59.3
DETR-R50	50	41	86	34.8	13.9	37.3	54.4
DETR-R50	500	41	86	42.0	20.5	45.8	61.1
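The comparison in the table can be checked with a few lines of arithmetic. The figures below are copied from the table; the variable names are illustrative only.

```python
# Rows from the benchmark table: (method, epochs, AP).
rows = [
    ("Conditional DETR-R50", 50, 41.0),
    ("DETR-R50", 50, 34.8),
    ("DETR-R50", 500, 42.0),
]

cond, detr_50, detr_500 = rows

# AP gain over DETR-R50 at the same 50-epoch budget.
gain_same_budget = round(cond[2] - detr_50[2], 1)

# AP still missing relative to the 500-epoch DETR-R50 schedule.
gap_to_long_schedule = round(detr_500[2] - cond[2], 1)

# Ratio of training epochs between the long DETR schedule and this model.
epoch_ratio = detr_500[1] / cond[1]

print(gain_same_budget, gap_to_long_schedule, epoch_ratio)  # 6.2 1.0 10.0
```

At an equal 50-epoch budget the model is 6.2 AP ahead of DETR-R50, and it trails the 500-epoch DETR-R50 by 1.0 AP while using one tenth of the epochs.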

[Figure: architecture diagram of Conditional DETR showing conditional spatial query integration]

The model is being onboarded by gigarouter as a managed, OpenAI-compatible API and is not yet live. Once available, developers can deploy it for general-purpose object detection without handling model installation or inference infrastructure.

best for

·Object detection in images and video frames
·Detection of common objects (COCO categories) in photographs
·Transfer learning for custom object detection tasks

FAQ

What is Conditional DETR best for?

Object detection with faster training convergence compared to original DETR, achieving comparable accuracy in fewer epochs.

How many parameters does the model have?

44 million parameters.

What is the license for Conditional DETR ResNet-50?

Apache 2.0.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with an API key; send an image as input and receive bounding boxes and class labels in JSON format.

What input and output formats does the API support?

Input: image URL or base64-encoded image. Output: JSON with detected objects, confidence scores, and bounding box coordinates.
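As a rough illustration of the base64 input and JSON output described above, the sketch below builds a request body and filters a response by confidence score. It makes no network call. The request fields (`model`, `image`), the response shape (`detections`, `label`, `score`, `box`), and the sample response are all hypothetical placeholders, since the actual gigarouter schema is not documented here.

```python
import base64
import json

MODEL_ID = "microsoft/conditional-detr-resnet-50"


def build_payload(image_bytes):
    """Build a JSON request body with a base64-encoded image.
    The field names "model" and "image" are hypothetical."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"model": MODEL_ID, "image": encoded})


def parse_detections(response_text, min_score=0.5):
    """Keep detections at or above `min_score`.
    Assumes a hypothetical response shape:
    {"detections": [{"label": str, "score": float, "box": [x0, y0, x1, y1]}]}"""
    data = json.loads(response_text)
    return [d for d in data.get("detections", []) if d["score"] >= min_score]


# Placeholder bytes standing in for real image data.
payload = build_payload(b"not-a-real-image")

# Made-up response used only to exercise the parser.
sample_response = json.dumps({
    "detections": [
        {"label": "cat", "score": 0.92, "box": [10, 20, 110, 220]},
        {"label": "remote", "score": 0.31, "box": [40, 70, 90, 120]},
    ]
})
kept = parse_detections(sample_response, min_score=0.5)
```

Filtering on the confidence score on the client side keeps the threshold adjustable per application; the real field names should be taken from the provider's API reference once the model is live.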

not yet live

We're benchmarking and onboarding Conditional DETR ResNet-50 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

table-transformer-structure-recognition

1.8M dl/mo

table-transformer-detection

1.5M dl/mo

yolos-small

713.6K dl/mo

PP-DocLayoutV3_safetensors

341.1K dl/mo

rtdetr_v2_r50vd

309.8K dl/mo

rtdetr_r50vd_coco_o365

254.5K dl/mo