YOLOS Tiny

hustvl/yolos-tiny

published Apr 2022 · updated Apr 2024

YOLOS Tiny is a detection model that uses a Vision Transformer (ViT) trained with the DETR loss for object detection.

est. price

~$0.047 / 1k images (estimated; final price set at launch)

downloads / mo

100.9K

license

apache-2.0

specs

Task	Object Detection
Architecture	Vision Transformer (ViT)
Training Data	ImageNet-1k pre-training + COCO 2017 fine-tuning
Evaluation AP	28.7 on COCO val2017

about this model

hustvl/yolos-tiny is an object detection model that applies a Vision Transformer (ViT) architecture with a DETR-style bipartite matching loss to identify and localize objects in images. Introduced in the paper "You Only Look at One Sequence" (Fang et al., NeurIPS 2021), it treats detection as a pure sequence-to-sequence task, minimizing handcrafted 2D inductive biases.

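The model was pre-trained on ImageNet-1k for 300 epochs and then fine-tuned on COCO 2017 object detection (118k training images, 5k validation images) for a further 300 epochs. It predicts a fixed set of 100 object queries per image. During training, the Hungarian matching algorithm assigns each ground-truth object to exactly one query, and the DETR-style loss is computed on those matched pairs. At inference there is no matching step: each query directly yields a class label (or "no object") and a bounding box.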
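The one-to-one assignment between object queries and ground-truth objects can be illustrated with a small sketch. This is not the authors' implementation: it swaps the Hungarian algorithm for a brute-force search over permutations (same optimal result, only practical for a handful of objects), and it uses a simplified cost of negative class probability plus box L1 distance, leaving out the generalized IoU term that DETR-style losses normally include. The names `match_cost` and `match_queries` and the toy data are assumptions made for the example.

```python
from itertools import permutations


def box_l1(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h) in [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target):
    """Simplified cost: negative probability of the true class plus box L1."""
    return -pred["probs"][target["label"]] + box_l1(pred["box"], target["box"])


def match_queries(preds, targets):
    """Return the minimum-cost one-to-one (query_index, target_index) pairs.

    Brute force over permutations, standing in for the Hungarian algorithm.
    """
    best_cost, best_pairs = float("inf"), []
    for query_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[q], targets[t]) for t, q in enumerate(query_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = sorted((q, t) for t, q in enumerate(query_ids))
    return best_pairs


# Toy data: three query predictions over two classes, two ground-truth objects.
preds = [
    {"probs": [0.1, 0.9], "box": (0.70, 0.70, 0.20, 0.20)},
    {"probs": [0.8, 0.2], "box": (0.30, 0.30, 0.40, 0.40)},
    {"probs": [0.5, 0.5], "box": (0.50, 0.50, 0.10, 0.10)},
]
targets = [
    {"label": 0, "box": (0.30, 0.30, 0.40, 0.40)},
    {"label": 1, "box": (0.72, 0.68, 0.20, 0.22)},
]

pairs = match_queries(preds, targets)
print(pairs)  # query 0 is matched to target 1, query 1 to target 0
```

Query 2 is left unmatched here; in a DETR-style loss, unmatched queries are trained to predict "no object".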

Benchmark Performance

On COCO 2017 validation, hustvl/yolos-tiny achieves an average precision (AP) of 28.7. For reference, the base-sized YOLOS model (ViT-Base architecture) reaches 42.0 AP on the same benchmark, demonstrating that even with minimal modifications to the vanilla ViT, competitive detection results are possible.

Gigarouter plans to host the model as a managed API; it is not yet live. Once it is available, users will send images and receive predicted bounding boxes with COCO class labels, without installing Python libraries or downloading model weights.

best for

· Real-time object detection on resource-constrained devices
· Edge deployment with low latency requirements
· Efficient vision tasks where model size is critical

FAQ

What is the main advantage of YOLOS Tiny over larger detection models?

Its size. YOLOS Tiny is the smallest YOLOS variant and reaches 28.7 AP on COCO val2017. It trades accuracy (the base-sized model reaches 42.0 AP) for a much lower parameter count, which makes it better suited to fast inference on limited hardware.

What input format does the model expect?

It expects an image (e.g., PIL Image or numpy array) preprocessed with the YolosImageProcessor to match the required tensor format.

What output does the model produce?

It outputs predicted bounding boxes, class labels, and confidence scores for up to 100 object queries per image.
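As a rough illustration of how raw per-query outputs become final detections, the sketch below assumes the DETR-style conventions used by YOLOS in Hugging Face Transformers: class logits whose last entry is a "no object" class, and boxes in normalized (center x, center y, width, height) form. The function name `postprocess`, the 0.9 threshold, and the toy values are assumptions for the example; in practice the image processor's `post_process_object_detection` method performs this step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, boxes, image_size, threshold=0.9):
    """Turn per-query logits and normalized boxes into pixel-space detections.

    logits: one list per query; the last entry is the "no object" class.
    boxes: one (cx, cy, w, h) tuple per query, normalized to [0, 1].
    image_size: (width, height) of the original image in pixels.
    """
    width, height = image_size
    detections = []
    for query_logits, (cx, cy, w, h) in zip(logits, boxes):
        probs = softmax(query_logits)[:-1]  # drop the trailing "no object" class
        score = max(probs)
        if score < threshold:
            continue  # low-confidence or "no object" query
        detections.append({
            "label": probs.index(score),
            "score": score,
            # Convert center format to (x_min, y_min, x_max, y_max) in pixels.
            "box": (
                (cx - w / 2) * width,
                (cy - h / 2) * height,
                (cx + w / 2) * width,
                (cy + h / 2) * height,
            ),
        })
    return detections


# Toy example: two queries, two real classes plus "no object".
logits = [[0.0, 6.0, 0.0], [0.0, 0.0, 5.0]]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.05, 0.05)]
dets = postprocess(logits, boxes, image_size=(640, 480))
print(dets)
```

Only the first query survives the threshold in this example; the second is dominated by the "no object" class and is discarded, which is why the model returns at most 100 detections per image and usually far fewer.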

How can I call this model via the Gigarouter API?

The hosted API is not yet live. Once it launches, you will use the OpenAI-compatible endpoint with your API key, sending the image as a base64-encoded string or URL, and receive detection results in JSON format.
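The base64 step can be prepared with the Python standard library alone. The sketch below only covers the encoding; the request schema for the hosted endpoint is not documented here, so no endpoint URL or field names are shown. The helper name `encode_image` is an assumption, and whether the API expects a bare base64 string or a `data:` URL is also unknown, so the helper returns both.

```python
import base64


def encode_image(image_bytes, mime_type="image/jpeg"):
    """Return (bare base64 string, data URL) for raw image bytes."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return b64, f"data:{mime_type};base64,{b64}"


# The first four bytes of a typical JPEG file stand in for a real image here.
# With a real file: image_bytes = open("photo.jpg", "rb").read()
sample = b"\xff\xd8\xff\xe0"
b64, data_url = encode_image(sample)
print(b64)
print(data_url)
```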

What is the license for YOLOS Tiny?

YOLOS Tiny is released under the Apache 2.0 license (apache-2.0).

not yet live

We're benchmarking and onboarding YOLOS Tiny as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

table-transformer-structure-recognition

1.8M dl/mo

table-transformer-detection

1.5M dl/mo

yolos-small

713.6K dl/mo

PP-DocLayoutV3_safetensors

341.1K dl/mo

rtdetr_v2_r50vd

309.8K dl/mo

rtdetr_r50vd_coco_o365

254.5K dl/mo