Depth Anything V2 Large
depth-anything/Depth-Anything-V2-Large
published Jun 2024 · updated Jul 2024
Depth Anything V2 Large is a monocular depth estimation model that produces fine-grained, robust depth maps from a single image. It is trained on 595K synthetic labeled images and 62M+ real unlabeled images.
specs
| Task | Monocular Depth Estimation |
| Architecture | ViT-Large (Vision Transformer) with DPT head |
| Parameters | 335M |
about this model
Depth-Anything-V2-Large is a monocular depth estimation model that produces dense depth maps from a single RGB image. It is trained on 595K synthetic labeled images and over 62 million real unlabeled images, leveraging a teacher-student framework with large-scale pseudo-labeling.
Key Strengths
- Produces finer-grained details and more robust predictions than the previous version (Depth Anything V1).
- Outperforms Stable Diffusion-based depth models (e.g., Marigold, GeoWizard) in both accuracy and efficiency – over 10x faster inference with a lighter architecture.
- Available in multiple scales from 25M to 1.3B parameters, with this Large variant offering a strong balance of performance and resource use.
- Can be fine-tuned for metric depth estimation, supporting both relative and absolute depth tasks.
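To make the relative versus absolute distinction concrete: a relative prediction is usually treated as correct only up to an unknown scale and shift. The sketch below is not the fine-tuning route mentioned in the list. It is a simpler post-hoc alignment that fits a scale `s` and shift `t` by least squares against a few reference values. The function name `fit_scale_shift` and the sample numbers are illustrative assumptions, and both inputs are assumed to live in the same space (for example, both inverse depth).

```python
def fit_scale_shift(pred, target):
    """Return (s, t) minimizing sum((s * p + t - g) ** 2) over paired values."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undefined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, target))
    s = cov / var_p
    t = mean_g - s * mean_p
    return s, t


# Illustrative values only: the target is exactly 2 * pred + 1.
pred = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]
s, t = fit_scale_shift(pred, target)
aligned = [s * p + t for p in pred]
```

Alignment of this kind needs reference measurements for every image, which is why a model fine-tuned for metric depth is the more practical choice when absolute distances are required.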
Performance & Versatility
The model demonstrates strong generalization across diverse scenes. The authors constructed a new evaluation benchmark with precise annotations and varied environments to overcome limitations in existing test sets. Depth-Anything-V2-Large is suitable for applications requiring high-quality depth output without the latency of diffusion-based alternatives.
best for
- Robotics and autonomous navigation
- Augmented reality scene understanding
- 3D reconstruction from single images
FAQ
Depth Anything V2 Large is the Large variant of Depth Anything V2, a monocular depth estimation model family that scales from 25M to 1.3B parameters. With roughly 335M parameters, it produces high-quality depth maps from a single RGB image.
V2 provides finer and more robust depth predictions than V1 by replacing labeled real images with synthetic images, scaling up the teacher model, and training on large-scale pseudo-labeled real images.
Input is a single RGB image; output is a raw depth map (HxW array) where each pixel value represents relative depth.
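Because the raw values are relative rather than metric, they are usually rescaled before being viewed as a grayscale image. The following is a minimal sketch of min-max normalization to the 0..255 range, assuming the depth map arrives as a nested list of floats. The helper name `normalize_depth` and the sample values are illustrative, and the same arithmetic would apply to a NumPy array.

```python
def normalize_depth(depth):
    """Min-max scale a raw HxW relative depth map to integers in 0..255."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Constant map: there is no contrast to show, so return all zeros.
        return [[0 for _ in row] for row in depth]
    return [[round((v - lo) / span * 255) for v in row] for row in depth]


# Illustrative 2x2 map; real outputs have one value per input pixel.
raw = [[0.5, 1.0], [1.5, 2.5]]
gray = normalize_depth(raw)
```

Normalizing per image discards any comparability between images, which is acceptable for visualization but not for measuring distances.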
Depth Anything V2 is more than 10x faster and more lightweight than Stable Diffusion-based models such as Marigold or GeoWizard.
Once the hosted API is available, call the OpenAI-compatible endpoint with your API key: send an image URL or a base64-encoded image and receive the depth map in the response.
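The request described above might be assembled roughly as follows. This is a sketch under stated assumptions, not the provider's documented API: the URL `https://api.example.com/v1/depth` and the `image` field name are placeholders, since this page does not specify the route or request schema. Only the bearer-token header follows the usual OpenAI-style convention. The code builds the request without sending it.

```python
import base64
import json
import urllib.request

# Placeholder values: the real base URL, route, and field names are not
# given on this page and must come from the provider's API reference.
API_URL = "https://api.example.com/v1/depth"  # hypothetical endpoint
MODEL_ID = "depth-anything/Depth-Anything-V2-Large"


def build_depth_request(image_bytes, api_key):
    """Build (but do not send) a POST request carrying a base64 image."""
    payload = {
        "model": MODEL_ID,
        # Hypothetical field name for the base64-encoded image.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_depth_request(b"\x89PNG placeholder bytes", "YOUR_API_KEY")
```

Passing `req` to `urllib.request.urlopen` would send it; the shape of the response, including how the depth map is encoded, is likewise unspecified here.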
We're benchmarking and onboarding Depth Anything V2 Large as a hosted, OpenAI-compatible API.