Depth Anything Small
LiheYoung/depth_anything_vits14
published Jan 2024 · updated Jan 2024
Depth Anything Small is a monocular depth estimation model that predicts a depth map from a single image, trained on 1.5M labeled and 62M+ unlabeled images for robust zero-shot generalization.
specs
| Task | Monocular Depth Estimation |
| Architecture | Vision Transformer (ViT-S) with DPT head |
| Parameters | 24.8M |
| License | Apache 2.0 |
about this model
Depth Anything (small variant, vits14) is a monocular depth estimation model that delivers robust, generalizable depth maps from single RGB images. It was trained on 1.5 million labeled images and over 62 million unlabeled images, using a data engine that scales up data coverage and thereby reduces generalization error. The model uses a ViT-small backbone (24.8 million parameters), and the accompanying paper was accepted at CVPR 2024.
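The `vits14` suffix denotes a ViT-S backbone with 14×14 pixel patches, so each input side should be a multiple of 14. The sketch below shows the sizing arithmetic under stated assumptions: the 518-pixel short side, the helper names, and the round-to-nearest policy are illustrative and are not taken from this page.

```python
def patch_aligned_size(width, height, short_side=518, patch=14):
    """Scale so the short side is about `short_side`, then snap both sides
    to a multiple of `patch` (assumed policy: round to nearest)."""
    scale = short_side / min(width, height)

    def align(value):
        # Never go below a single patch.
        return max(patch, round(value * scale / patch) * patch)

    return align(width), align(height)


def token_count(width, height, patch=14):
    """Number of patch tokens the backbone sees for a patch-aligned image."""
    return (width // patch) * (height // patch)
```

For a 640×480 photo this gives a 686×518 input, which the backbone splits into 49 × 37 patch tokens.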
Key strengths
- Strong zero-shot performance across diverse environments, including indoor, outdoor, and synthetic scenes.
- Despite being much lighter (24.8M params), it outperforms MiDaS v3.1 BEiT L-512 (345M params) on KITTI, Sintel, and ETH3D, and is competitive on NYUv2, DDAD, and DIODE.
- The Depth Anything encoder also serves as a strong feature extractor for downstream tasks: fine-tuned for semantic segmentation, it reaches 86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K (figures reported for the larger ViT-L encoder, not this small variant).
Zero-shot benchmark results (Ours-S / vits14)
| Dataset | AbsRel | δ₁ |
|---|---|---|
| KITTI | 0.080 | 0.936 |
| NYUv2 | 0.053 | 0.972 |
| Sintel | 0.464 | 0.739 |
| DDAD | 0.247 | 0.768 |
| ETH3D | 0.127 | 0.885 |
| DIODE | 0.076 | 0.939 |
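For reading the table: AbsRel is the mean absolute relative error (lower is better), and δ₁ is the fraction of pixels whose predicted-to-true depth ratio is within a factor of 1.25 (higher is better). The sketch below computes both on flat lists of depth values. It assumes the prediction has already been aligned to the ground truth in scale and shift, as zero-shot relative-depth protocols typically do; the function names are illustrative.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid ground truth (gt > 0)."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    # A non-positive prediction cannot satisfy the ratio test; count it as a miss.
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)
```

Pixels without ground truth (here, `gt <= 0`) are excluded from both metrics.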
The model is widely adopted as the default depth processor in tools such as InstantID, InvokeAI, and ControlNet-based workflows. It is also available in ONNX and TensorRT formats for deployment flexibility.
Through gigarouter’s hosted API, you can call Depth Anything without managing dependencies or infrastructure: send an input image to the OpenAI-compatible endpoint and receive its depth map in the response.
best for
- Zero-shot depth estimation on diverse, unseen images
- Real-time or resource-constrained depth applications
- Preprocessing for depth-conditioned image generation (e.g., ControlNet)
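For depth-conditioned generation, a raw depth prediction is usually converted to an 8-bit grayscale image first. The sketch below does a plain min-max normalization on a nested list of values. It is an assumption about a typical preprocessing step, not the exact routine any particular ControlNet tool uses, and the function name is illustrative.

```python
def to_uint8_depth(depth_rows):
    """Min-max normalize a 2D depth map (list of rows) to integers in 0..255."""
    flat = [v for row in depth_rows for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no depth contrast; return all zeros.
        return [[0 for _ in row] for row in depth_rows]
    return [[round(255 * (v - lo) / span) for v in row] for row in depth_rows]
```

Whether bright pixels mean near or far depends on the model's output convention, so check it before feeding the image to a downstream generator.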
FAQ
The API accepts an image file (e.g., PNG, JPEG) and returns a depth map as a grayscale image or raw tensor, depending on the endpoint configuration.
At 24.8M parameters, it is roughly 14× smaller than MiDaS v3.1 BEiT L-512 (345M) and correspondingly lighter to run, while exceeding its zero-shot accuracy on KITTI, Sintel, and ETH3D and remaining competitive on NYUv2, DDAD, and DIODE.
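As a rough back-of-the-envelope check on the size difference, the sketch below estimates weight storage from the parameter counts quoted above. It covers weights only and ignores activations and runtime overhead; the helper name is illustrative.

```python
def weight_megabytes(params, bytes_per_param):
    """Approximate weight storage in decimal megabytes."""
    return params * bytes_per_param / 1e6


SMALL_PARAMS = 24.8e6   # Depth Anything Small, from the specs above
MIDAS_PARAMS = 345e6    # MiDaS v3.1 BEiT L-512, as quoted above

fp32_mb = weight_megabytes(SMALL_PARAMS, 4)   # 4 bytes per float32 weight
fp16_mb = weight_megabytes(SMALL_PARAMS, 2)   # 2 bytes per float16 weight
size_ratio = MIDAS_PARAMS / SMALL_PARAMS
```

Under these assumptions the small model's weights take about 99 MB in float32 and about 50 MB in float16, and the MiDaS model has roughly 14 times as many parameters.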
The model is released under the Apache 2.0 license, allowing for commercial and non-commercial use with attribution.
Use the OpenAI-compatible endpoint with your API key, sending a POST request with the image data to the designated depth estimation route.
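A minimal sketch of such a request, using only the Python standard library. The URL, the route, and the JSON field names (`model`, `image`) are hypothetical placeholders, since this page does not document the actual request schema; only the bearer-token header follows the usual OpenAI-style convention. The function builds the request without sending it.

```python
import base64
import json
import urllib.request


def build_depth_request(image_bytes, api_key,
                        url="https://api.example.com/v1/depth"):  # hypothetical route
    """Build (but do not send) a POST request carrying a base64-encoded image."""
    payload = {
        "model": "LiheYoung/depth_anything_vits14",
        # Assumed field name; the real schema may differ.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` once the real endpoint and schema are known.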
The model was trained on a combination of 1.5M labeled images and over 62 million unlabeled images, using a data engine to scale up data coverage.
We're benchmarking and onboarding Depth Anything Small as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.