YOLOS Small

hustvl/yolos-small

published Apr 2022 · updated May 2024

YOLOS Small is a detection model that uses a Vision Transformer with DETR loss to predict bounding boxes and class labels for objects in images.

est. price

~$0.047

/ 1k images · estimated, set at launch

API providers

downloads / mo

713.6K

license

apache-2.0

specs

Task	Object Detection
Architecture	Vision Transformer (ViT)
Parameters	30.7M
AP on COCO val	36.1
License	Apache-2.0

about this model

YOLOS-small is an object detection model that applies a Vision Transformer (ViT) backbone to the object detection task, trained with the DETR bipartite matching loss and fine-tuned on COCO 2017.

Architecture and training

The model uses a vanilla ViT architecture with minimal spatial priors. It is pre-trained on ImageNet-1k for 200 epochs and then fine-tuned on COCO 2017 object detection (118k training images) for 150 epochs. Detection is performed by adding 100 learnable detection tokens to the input sequence; the model predicts class and bounding box for each token. The Hungarian algorithm matches predictions to ground truth, and the loss combines cross-entropy for classes with L1 and generalized IoU for boxes.

Performance

On COCO 2017 validation, YOLOS-small achieves an average precision (AP) of 36.1. For comparison, the larger YOLOS-base variant reaches 42.0 AP, matching more complex frameworks such as Faster R-CNN and DETR. The model contains 30.7 million parameters and is released under the Apache-2.0 license.

References

Paper: You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection (Fang et al., NeurIPS 2021)
Code and pre-trained models: github.com/hustvl/YOLOS
Hugging Face model page: hustvl/yolos-small

best for

·Real-time object detection in images with low compute requirements
·Detecting common objects across 80 COCO categories
·Embedding object detection into lightweight applications

FAQ

What is YOLOS Small?

It is a small Vision Transformer model fine-tuned on COCO for object detection, using a bipartite matching loss like DETR.

How does YOLOS Small compare to YOLOS-Base?

YOLOS Small has 30.7M parameters and achieves 36.1 AP on COCO val, while YOLOS-Base achieves 42.0 AP with more parameters.

What license does YOLOS Small use?

It is released under the Apache-2.0 license.

What are the input and output formats?

Input: an image. Output: predicted bounding boxes and corresponding COCO class labels.

How can I call YOLOS Small via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, sending an image URL or base64-encoded image.

not yet live

We're benchmarking and onboarding YOLOS Small as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related object detection models

compare all →

table-transformer-structure-recognition

1.8M dl/mo

table-transformer-detection

1.5M dl/mo

PP-DocLayoutV3_safetensors

341.1K dl/mo

rtdetr_v2_r50vd

309.8K dl/mo

rtdetr_r50vd_coco_o365

254.5K dl/mo

table-transformer-structure-recognition-v1.1-all

239.5K dl/mo