Depth Anything Large
LiheYoung/depth_anything_vitl14
published Jan 2024 · updated Jan 2024
Depth Anything Large is a monocular depth estimation model built on a large Vision Transformer (ViT-L) and trained on 1.5 million labeled and more than 62 million unlabeled images.
specs
| Task | Monocular Depth Estimation |
| Architecture | Vision Transformer (ViT-L) with DPT head |
| Parameters | ~335 million |
| Training Data | 1.5M labeled images + 62M+ unlabeled images |
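As a rough back-of-the-envelope check, the parameter count above can be turned into an estimate of weight storage. This sketch counts weights only (no activations or framework overhead) and uses the approximate 335 million figure from the table; the helper name is illustrative.

```python
# Approximate weight memory for ~335 million parameters (weights only).
PARAMS = 335_000_000  # approximate count taken from the specs table

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.2f} GB")
```

Under these assumptions the weights take roughly 1.34 GB in float32 and about half that in float16.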
about this model
Depth Anything (ViT-L) is a monocular depth estimation model that produces a dense depth map from a single RGB image. It builds on the Vision Transformer-Large (ViT-L) backbone and is trained on 1.5 million labeled images from six datasets plus more than 62 million unlabeled images from eight datasets. A data engine automatically annotates the unlabeled images, and training on this much larger pool improves generalization.
Key Strengths
- Robust zero-shot relative depth estimation across diverse in-the-wild images, outperforming MiDaS v3.1; in metric depth estimation it also outperforms ZoeDepth.
- State-of-the-art metric depth after fine-tuning on NYUv2 and KITTI, setting new records on both benchmarks at the time of publication.
- The model serves as a strong visual encoder: fine-tuned for semantic segmentation, it achieves 86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K.
- The paper was accepted at CVPR 2024. The model's depth output also yields a better depth-conditioned ControlNet than the previous MiDaS-based version.
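For readers unfamiliar with the mIoU figures quoted above, the metric can be sketched as follows. This is a simplified single-image version written for illustration: benchmark tooling for Cityscapes and ADE20K typically accumulates the confusion matrix over the whole validation set and handles ignore labels, which this sketch does not. The function name is illustrative.

```python
import numpy as np


def mean_iou(pred, target, num_classes: int) -> float:
    """Mean intersection-over-union over the classes that appear at all."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both pred and target
    return float((intersection[present] / union[present]).mean())


# Tiny example with two classes; multiply by 100 for a percentage score.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(f"mIoU: {100 * score:.1f}")
```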
Zero-Shot Evaluation
Evaluated without fine-tuning on six public datasets (KITTI, NYUv2, Sintel, DDAD, ETH3D, and DIODE), the model consistently outperforms MiDaS v3.1 (BEiT L-512) on relative depth estimation.
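The page does not say which metrics or alignment procedure were used. A common convention for relative (affine-invariant) depth is to fit a scale and shift to the ground truth in inverse-depth space and then report AbsRel and δ1. The sketch below assumes that convention; the function names are illustrative and it is not taken from the official evaluation code.

```python
import numpy as np


def align_scale_shift(pred_inv, gt_inv):
    """Least-squares scale s and shift t so that s * pred_inv + t ~= gt_inv."""
    A = np.stack([pred_inv, np.ones_like(pred_inv)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt_inv, rcond=None)
    return s, t


def relative_depth_metrics(pred_inv, gt_depth, eps=1e-6):
    """Return (AbsRel, delta1) for one image.

    pred_inv: predicted inverse relative depth
    gt_depth: ground-truth depth of the same shape; values <= 0 are ignored
    """
    pred_inv = np.asarray(pred_inv, dtype=np.float64)
    gt_depth = np.asarray(gt_depth, dtype=np.float64)
    valid = gt_depth > 0
    p, g = pred_inv[valid], gt_depth[valid]

    # Align in inverse-depth space, then convert back to depth.
    s, t = align_scale_shift(p, 1.0 / g)
    pred_depth = 1.0 / np.clip(s * p + t, eps, None)

    abs_rel = np.mean(np.abs(pred_depth - g) / g)
    ratio = np.maximum(pred_depth / g, g / pred_depth)
    delta1 = np.mean(ratio < 1.25)
    return float(abs_rel), float(delta1)


# A prediction that is an exact affine transform of true inverse depth
# should score perfectly after alignment.
gt = np.random.default_rng(0).uniform(1.0, 10.0, size=(4, 4))
abs_rel, delta1 = relative_depth_metrics(3.0 / gt + 0.5, gt)
print(abs_rel, delta1)
```

Lower AbsRel and higher δ1 are better; because of the alignment step, the unknown scale and shift of a relative prediction do not affect the score.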
best for
- Monocular depth estimation under diverse and challenging conditions
- Depth conditioning for ControlNet and image generation pipelines
- Pre-training encoder for downstream tasks such as semantic segmentation
FAQ
Depth Anything Large excels at zero-shot monocular depth estimation on arbitrary images, including in-the-wild photos, and can be fine-tuned for metric depth on datasets such as NYUv2 and KITTI.
Compared with MiDaS v3.1 (BEiT L-512), Depth Anything Large is more accurate in zero-shot relative depth estimation and gives better results when used for depth-conditioned ControlNet.
The input is an RGB image (resized to 518x518 and normalized); the output is a depth map of per-pixel inverse relative depth values, so larger values indicate closer surfaces.
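The input and output handling described above could look roughly like the sketch below. Several details are assumptions rather than statements from this page: the normalization constants are the standard ImageNet mean and standard deviation, the patch size of 14 is inferred from the "vitl14" suffix in the model ID (518 = 37 × 14), and the helper names are illustrative. Resizing is left out to keep the sketch dependency-free.

```python
import numpy as np

# Assumption: ImageNet statistics, the usual choice for ViT backbones.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

PATCH = 14  # inferred from "vitl14" in the model ID
SIDE = 518  # input side length; 518 = 37 * 14, a whole number of patches


def normalize_rgb(image_u8: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB -> 3xHxW float32, scaled to [0, 1] then standardized."""
    x = image_u8.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    return np.transpose(x, (2, 0, 1))


def depth_to_uint8(inv_depth: np.ndarray) -> np.ndarray:
    """Min-max scale an inverse relative depth map to 0..255 (bright = near)."""
    d = np.asarray(inv_depth, dtype=np.float32)
    lo, hi = float(d.min()), float(d.max())
    if hi - lo < 1e-12:  # constant map: avoid division by zero
        return np.zeros(d.shape, dtype=np.uint8)
    return ((d - lo) / (hi - lo) * 255.0).astype(np.uint8)


image = np.zeros((SIDE, SIDE, 3), dtype=np.uint8)  # stand-in for a resized photo
print(normalize_rgb(image).shape, SIDE // PATCH)
```

Because the values are relative, the min-max scaling in `depth_to_uint8` is only suitable for visualization; it does not recover metric distances.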
To use the hosted model, call the OpenAI-compatible endpoint with your API key, sending an image URL or a base64-encoded image; the API returns the depth map as a tensor or an image.
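The request schema is not documented on this page, so the following only illustrates the base64 step and a possible request body. The field names `model` and `image` are hypothetical placeholders, and the endpoint URL and authentication header are deliberately omitted because the page does not give them.

```python
import base64
import json


def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_request_body(image_bytes: bytes) -> str:
    """Serialize a request body; the schema here is hypothetical."""
    body = {
        "model": "LiheYoung/depth_anything_vitl14",  # model ID from this page
        "image": image_to_data_url(image_bytes),     # hypothetical field name
    }
    return json.dumps(body)


# Stand-in bytes; in practice these would be read from an image file.
print(build_request_body(b"\x89PNG...")[:60])
```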
Video is supported: the official repository includes a script for video depth visualization (run_video.py), and the model can be applied frame by frame.
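Frame-by-frame application can be sketched as below. This does not reproduce run_video.py; `predict_depth` is a hypothetical callable standing in for a single-image model call, and a trivial stand-in is used so the sketch runs without model weights.

```python
from typing import Callable, Iterable, Iterator

import numpy as np


def depth_per_frame(
    frames: Iterable[np.ndarray],
    predict_depth: Callable[[np.ndarray], np.ndarray],
) -> Iterator[np.ndarray]:
    """Yield one depth map per input frame, in order."""
    for frame in frames:
        yield predict_depth(frame)


def fake_predict_depth(frame: np.ndarray) -> np.ndarray:
    """Stand-in predictor: channel mean, NOT a real depth estimate."""
    return frame.astype(np.float32).mean(axis=2)


frames = [np.full((4, 6, 3), v, dtype=np.uint8) for v in (0, 128, 255)]
depth_maps = list(depth_per_frame(frames, fake_predict_depth))
print(len(depth_maps), depth_maps[0].shape)
```

Each frame is processed independently here, so nothing in this sketch enforces consistency between consecutive depth maps.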
Depth Anything Large is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API.