YOLOS Tiny
hustvl/yolos-tiny
published Apr 2022 · updated Apr 2024
YOLOS Tiny is an object detection model: a Vision Transformer (ViT) trained with the DETR loss.
specs
| Task | Object Detection |
| Architecture | Vision Transformer (ViT) |
| Training Data | ImageNet-1k pre-training + COCO 2017 fine-tuning |
| Evaluation AP | 28.7 on COCO val2017 |
about this model
hustvl/yolos-tiny is an object detection model that applies a Vision Transformer (ViT) architecture with a DETR-style bipartite matching loss to identify and localize objects in images. Introduced in the paper "You Only Look at One Sequence" (Fang et al., NeurIPS 2021), it treats detection as a pure sequence-to-sequence task, minimizing handcrafted 2D inductive biases.
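The model was pre-trained on ImageNet-1k for 300 epochs and fine-tuned on COCO 2017 object detection (118k training images, 5k validation images) for an additional 300 epochs. It emits predictions for 100 object queries per image. During training, Hungarian matching builds a one-to-one assignment between these queries and the ground-truth annotations (padded with "no object" entries) so the loss can be computed. At inference no matching is needed: each query directly yields a class label and a bounding box.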
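To make the one-to-one assignment concrete, here is a minimal sketch of bipartite matching between predictions and ground-truth boxes. It is not the authors' implementation. It swaps the Hungarian algorithm for a brute-force search over permutations (only practical for a handful of boxes), and it uses a simplified cost of negative class probability plus a weighted L1 box distance. The helper names and the `box_weight` value are illustrative assumptions.

```python
from itertools import permutations


def l1_box_distance(pred_box, gt_box):
    """Sum of absolute differences between two (cx, cy, w, h) boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_cost(pred, gt, box_weight=5.0):
    """Simplified matching cost (assumption): class term plus weighted L1 box term."""
    class_cost = -pred["probs"][gt["label"]]
    return class_cost + box_weight * l1_box_distance(pred["box"], gt["box"])


def bipartite_match(preds, gts):
    """Brute-force stand-in for Hungarian matching.

    Tries every way of assigning a distinct prediction to each ground-truth
    object and returns the cheapest one as (pred_index, gt_index) pairs.
    """
    best_total, best_pairs = float("inf"), []
    for pred_indices in permutations(range(len(preds)), len(gts)):
        total = sum(
            match_cost(preds[p], gts[g]) for g, p in enumerate(pred_indices)
        )
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(pred_indices)]
    return best_pairs


# Three hypothetical query outputs and two hypothetical annotations.
preds = [
    {"probs": [0.1, 0.8, 0.1], "box": (0.50, 0.50, 0.20, 0.20)},
    {"probs": [0.7, 0.2, 0.1], "box": (0.25, 0.30, 0.10, 0.15)},
    {"probs": [0.3, 0.3, 0.4], "box": (0.80, 0.80, 0.10, 0.10)},
]
gts = [
    {"label": 0, "box": (0.26, 0.31, 0.10, 0.14)},
    {"label": 1, "box": (0.52, 0.49, 0.22, 0.20)},
]

matches = bipartite_match(preds, gts)
```

In this toy case the third prediction stays unmatched, which is the role the "no object" class plays for surplus queries during training.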
Benchmark Performance
On COCO 2017 validation, hustvl/yolos-tiny achieves an average precision (AP) of 28.7. For reference, the base-sized YOLOS model (ViT-Base architecture) reaches 42.0 AP on the same benchmark, which shows that a vanilla ViT with minimal modifications can produce competitive detection results.
The model is hosted as a managed API on Gigarouter. Users send images and receive predicted bounding boxes with COCO class labels, without needing to install any Python libraries or model weights.
best for
- Real-time object detection on resource-constrained devices
- Edge deployment with low latency requirements
- Efficient vision tasks where model size is critical
FAQ
YOLOS Tiny is a small Vision Transformer that reaches 28.7 AP on COCO val2017 with very few parameters, which makes it suitable for fast inference on limited hardware.
When run locally with the Hugging Face Transformers library, the model expects an image (e.g., a PIL Image or NumPy array) preprocessed with YolosImageProcessor into the tensor format it requires.
The model outputs predicted bounding boxes, class labels, and confidence scores, one candidate per object query (up to 100 per image).
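As an illustration of how per-query outputs become a final list of detections, the sketch below post-processes raw class logits and boxes in plain Python. It assumes the DETR-style convention in which the last logit is the "no object" class and boxes are normalized (center x, center y, width, height). The function names and the 0.9 threshold are illustrative choices, not part of the model card.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    )


def postprocess(query_logits, query_boxes, image_width, image_height, threshold=0.9):
    """Keep queries whose best object-class score exceeds the threshold."""
    detections = []
    for logits, box in zip(query_logits, query_boxes):
        probs = softmax(logits)
        object_probs = probs[:-1]  # assumption: last entry is "no object"
        score = max(object_probs)
        label = object_probs.index(score)
        if score > threshold:
            detections.append(
                {
                    "label": label,
                    "score": score,
                    "box": cxcywh_to_xyxy(box, image_width, image_height),
                }
            )
    return detections


# Two hypothetical queries over two object classes plus "no object".
query_logits = [[8.0, 0.0, 0.0], [0.0, 0.0, 8.0]]
query_boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)]
detections = postprocess(query_logits, query_boxes, image_width=640, image_height=480)
```

Here the second query is dominated by the "no object" class and is dropped, so most of the 100 queries typically produce no detection for a given image.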
To call the hosted model, use the OpenAI-compatible endpoint with your API key: send the image as a base64-encoded string or a URL, and the response returns the detection results as JSON.
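The base64 step can be sketched with the Python standard library alone. The helper below wraps raw image bytes in a data URL, a common way to carry base64 images in JSON requests. The request schema itself is not described here, so the sketch stops at the encoding. The function name and the default MIME type are assumptions.

```python
import base64


def image_bytes_to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URL string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes; in practice these would come from open(path, "rb").read().
example_url = image_bytes_to_data_url(b"abc", mime_type="image/png")
```

If the endpoint expects a bare base64 string instead of a data URL, the `data:...;base64,` prefix can simply be omitted.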
The upstream Hugging Face model card lists the license as Apache 2.0; check the repository for the current terms before use.
We're benchmarking and onboarding YOLOS Tiny as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.