Depth Anything V2 Base
depth-anything/Depth-Anything-V2-Base
published Jun 2024 · updated Jul 2024
Depth Anything V2 Base is a monocular depth estimation model that predicts a dense depth map from a single image, offering fine-grained details and robust performance.
specs
| Task | Monocular Depth Estimation |
| Architecture | ViT-B (Vision Transformer Base) encoder with DPT decoder |
| Parameters | ~97M (Base scale) |
| License | Apache 2.0 |
about this model
Depth-Anything-V2-Base is a monocular depth estimation model that predicts a dense depth map from a single RGB image. It is the base-scale variant (ViT-B encoder) of the Depth Anything V2 family, which was accepted at NeurIPS 2024.
Key Capabilities
The model is trained on 595K synthetic labeled images and over 62 million real unlabeled images. It delivers finer-grained details and more robust predictions than its predecessor, Depth Anything V1. Compared with Stable Diffusion-based models such as Marigold and GeoWizard, Depth Anything V2 is more than 10× faster and more accurate, while being significantly lighter. The training pipeline uses a teacher model scaled to 1.3B parameters and large-scale pseudo-labeled real images to achieve strong generalization.
Performance and Evaluation
A qualitative comparison (arXiv:2406.09414, Table 1) shows that Depth Anything V2 satisfies all six desirable properties: fine detail preservation, handling of transparent objects, reflections, complex scenes, efficiency, and transferability. In the same comparison, Marigold and Depth Anything V1 each satisfy only three of the six, though not the same three. The authors also constructed the DA-2K benchmark, with precise annotations and diverse scenes, to address limitations in existing test sets. Depth Anything V2 models range from 25M to 1.3B parameters; the Base model balances accuracy and speed for deployment.
best for
- Generating depth maps for augmented reality applications
- Enhancing 3D scene reconstruction from single images
- Improving object detection and segmentation with depth cues
FAQ
The model expects a single RGB image as input, typically in common formats like JPEG or PNG.
It outputs a raw depth map of the same spatial dimensions as the input image, with pixel values representing relative depth.
The Base model (ViT-B, ~97M params) offers a balance between speed and accuracy: it is faster than Large (~335M) and more capable than Small (~25M). The largest variant in the family, Giant, has 1.3B parameters.
The model is released under the Apache 2.0 license, allowing for commercial and research use.
Use the OpenAI-compatible endpoint with your API key, sending an image URL or base64-encoded image in the request.
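The input and output handling described in the answers above can be sketched with two small helpers. This is a minimal illustration, not the hosted API's client: the source does not specify the endpoint path or request fields, so none are shown. The function names `image_to_data_url` and `depth_to_grayscale` are hypothetical, and the depth values in the demo are made up. The code uses only the standard library so it runs without extra dependencies; with an array library the scaling step would usually be a single array expression.

```python
import base64


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def depth_to_grayscale(depth: list[list[float]]) -> list[list[int]]:
    """Min-max scale a relative depth map to 8-bit values in 0..255.

    Relative depth has no fixed unit, so only the ordering and spacing of
    values within one image are meaningful; scaling to the image's own
    minimum and maximum is one common way to visualize it.
    """
    flat = [value for row in depth for value in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Constant map: there is no depth variation to display.
        return [[0 for _ in row] for row in depth]
    return [[round(255 * (value - lo) / span) for value in row] for row in depth]


if __name__ == "__main__":
    # Made-up 2x3 relative depth values, for illustration only.
    toy_depth = [[0.5, 1.0, 1.5], [2.0, 2.5, 3.0]]
    print(depth_to_grayscale(toy_depth))

    # Any bytes can be encoded; a real call would read a JPEG or PNG file.
    print(image_to_data_url(b"abc", "image/jpeg"))
```

Because the scaling is per image, the resulting gray levels are not comparable across different images.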
We're benchmarking and onboarding Depth Anything V2 Base as a hosted, OpenAI-compatible API.