
Depth Anything Large

LiheYoung/depth_anything_vitl14

published Jan 2024 · updated Jan 2024

Depth Anything Large is a monocular depth estimation model built on a large Vision Transformer (ViT-L) and trained on 1.5 million labeled and more than 62 million unlabeled images.

Status: coming soon
API providers: 0
Downloads per month: 13.5K

specs

Task: Monocular Depth Estimation
Architecture: Vision Transformer (ViT-L) with DPT head
Parameters: ~335 million
Training Data: 1.5M labeled images + 62M+ unlabeled images

about this model

Depth Anything (ViT-L) is a monocular depth estimation model that produces dense depth maps from a single RGB image. It builds on the Vision Transformer-Large (ViT-L) backbone and is trained on a combination of 1.5 million labeled images from six datasets and over 62 million unlabeled images from eight datasets, using a data engine that automatically annotates large-scale unlabeled data to improve generalization.

Key Strengths

  • Robust zero-shot relative depth estimation across diverse in-the-wild images, outperforming prior models such as MiDaS v3.1 and ZoeDepth in both relative and metric depth tasks.
  • State-of-the-art metric depth results on NYUv2 and KITTI at the time of publication, obtained after fine-tuning on each dataset.
  • The model serves as a strong visual encoder: fine-tuned for semantic segmentation, it achieves 86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K.
  • Its depth output improves depth-conditioned ControlNet, surpassing the previous MiDaS-based version.
  • Accepted at CVPR 2024.

Zero-Shot Evaluation

When evaluated without fine-tuning on six public datasets, the model delivers high-quality relative depth estimates. Its performance consistently surpasses that of MiDaS v3.1 (BEiT L-512) across KITTI, NYUv2, Sintel, DDAD, ETH3D, and DIODE.
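The page does not say which metrics were used for this comparison. As a minimal sketch, the block below assumes two metrics commonly reported on depth benchmarks: absolute relative error (AbsRel, lower is better) and the δ1 accuracy (share of pixels whose prediction is within a factor of 1.25 of the ground truth, higher is better). The function names are illustrative, and the sketch assumes the prediction has already been aligned to the ground-truth scale, a step that relative-depth evaluation usually needs and that is omitted here.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) is below the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Toy example on three flattened pixels (made-up values, not benchmark data).
pred = [1.0, 2.0, 4.0]
gt = [1.0, 3.0, 4.0]
print(abs_rel(pred, gt), delta1(pred, gt))
```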

FAQ

What is Depth Anything Large best for?

It excels at zero-shot monocular depth estimation on any image, including in-the-wild photos, and can be fine-tuned for metric depth on datasets like NYUv2 and KITTI.

How does Depth Anything Large compare to MiDaS?

It outperforms MiDaS v3.1 (BEiT L-512) in zero-shot relative depth estimation and achieves better results when used for depth-conditioned ControlNet.

What are the input and output formats for the model?

Input is an RGB image (resized to 518x518, normalized); output is a depth map with inverse relative depth values per pixel.
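To make the input and output formats concrete, here is a small sketch in plain Python. It assumes that "normalized" means ImageNet mean and standard deviation (a common choice for ViT encoders, not confirmed on this page) and that a simple min-max scaling is acceptable for viewing the output. Both function names are illustrative.

```python
# Assumption: ImageNet statistics; the page only says the input is "normalized".
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to normalized float channels."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


def depth_to_gray(depth):
    """Min-max scale a 2D list of inverse relative depth values to 0..255."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no depth contrast; render it as black.
        return [[0 for _ in row] for row in depth]
    return [[round(255 * (v - lo) / span) for v in row] for row in depth]


# Toy 2x2 output map (made-up values).
print(depth_to_gray([[0.0, 1.0], [3.0, 4.0]]))
```

Because the values are inverse depth, larger numbers correspond to nearer surfaces, so nearer pixels come out brighter under this scaling.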

How can I use Depth Anything Large via the gigarouter API?

The model is not yet live. Once it is hosted, call the OpenAI-compatible endpoint with your API key and send an image URL or a base64-encoded image; the API returns the depth map as a tensor or an image.
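The request schema has not been published, so the following is only a sketch of the base64 step. The payload field names ("model", "image") and the use of a data URL are hypothetical placeholders, not the gigarouter API; only the base64 encoding itself is standard.

```python
import base64
import json


def image_to_data_url(image_bytes, mime="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_request_body(image_bytes, model="LiheYoung/depth_anything_vitl14"):
    """Build a JSON body. Hypothetical shape: field names are placeholders."""
    return json.dumps({"model": model, "image": image_to_data_url(image_bytes)})


# Placeholder bytes stand in for a real image file read with open(path, "rb").
body = build_request_body(b"abc")
print(body)
```

The body would then be sent with an Authorization header carrying the API key, using whatever path and schema the hosted endpoint documents once it is available.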

Does Depth Anything Large support video depth estimation?

Yes, the official repository includes a script for video depth visualization (run_video.py), and the model can be applied frame-by-frame.
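Frame-by-frame use can be sketched as below. This is not the logic of run_video.py; `estimate_depth` is a hypothetical callable standing in for any single-image depth estimator. The second function illustrates one possible design choice: scaling all frames with a shared minimum and maximum, so that display brightness is comparable across the clip instead of being rescaled independently per frame.

```python
def estimate_clip(frames, estimate_depth):
    """Run a per-image estimator on every frame, in order."""
    return [estimate_depth(frame) for frame in frames]


def normalize_clip(depth_maps):
    """Scale every map to 0..1 using one min/max shared by the whole clip."""
    values = [v for m in depth_maps for row in m for v in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant clip
    return [[[(v - lo) / span for v in row] for row in m] for m in depth_maps]


# Toy example: "frames" are numbers and the fake estimator returns a 1x2 map.
maps = estimate_clip([1, 2], lambda f: [[f, f * 2]])
print(normalize_clip(maps))
```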

not yet live

Depth Anything Large is being benchmarked and onboarded as a hosted, OpenAI-compatible API.