Depth Anything V2 Base

depth-anything/Depth-Anything-V2-Base

published Jun 2024 · updated Jul 2024

Depth Anything V2 Base is a monocular depth estimation model that predicts a dense depth map from a single image, with finer detail and more robust predictions than Depth Anything V1.

status

coming soon

license

cc-by-nc-4.0

specs

Task	Monocular Depth Estimation
Architecture	ViT-B (Vision Transformer Base) encoder with DPT decoder
Parameters	~97M (Base scale)
License	CC-BY-NC-4.0

about this model

Depth-Anything-V2-Base is a monocular depth estimation model that predicts a dense depth map from a single RGB image. It is the base-scale variant (ViT-B encoder) of the Depth Anything V2 family, which was accepted at NeurIPS 2024.

Key Capabilities

The model is trained on 595K synthetic labeled images and over 62 million real unlabeled images. It delivers finer-grained details and more robust predictions than its predecessor, Depth Anything V1. Compared with Stable Diffusion-based models such as Marigold and GeoWizard, Depth Anything V2 is more than 10× faster and more accurate, while being significantly lighter. The training pipeline uses a teacher model scaled to 1.3B parameters and large-scale pseudo-labeled real images to achieve strong generalization.

Performance and Evaluation

A qualitative comparison (arXiv:2406.09414, Table 1) shows that Depth Anything V2 satisfies all six desirable properties: fine detail preservation, handling of transparent objects, reflections, complex scenes, efficiency, and transferability. In the same comparison, Marigold and Depth Anything V1 each satisfy only three properties (different sets). The authors also constructed the DA-2K benchmark with precise annotations and diverse scenes to address limitations in existing test sets. Depth Anything V2 models are available from 25M to 1.3B parameters; the Base model offers a strong balance of performance and speed for deployment.

best for

· Generating depth maps for augmented reality applications
· Enhancing 3D scene reconstruction from single images
· Improving object detection and segmentation with depth cues

FAQ

What input format does the model expect?

The model expects a single RGB image as input, typically in common formats like JPEG or PNG.

What output does the model produce?

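It outputs a dense depth map with the same height and width as the input image. Pixel values represent relative depth, not metric distance, so they describe how surfaces are ordered and spaced relative to each other rather than measurements in meters.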
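Because the values are relative, a common way to inspect a prediction is to rescale it to an 8-bit grayscale range. The following is a minimal sketch, assuming the depth map is already available as a rectangular list of rows of floats; the function name `to_uint8` is hypothetical and is not part of any Depth Anything API.

```python
from typing import List


def to_uint8(depth: List[List[float]]) -> List[List[int]]:
    """Min-max rescale a relative depth map to integers in [0, 255].

    Assumption: `depth` is a non-empty rectangular list of rows of floats.
    """
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no relative structure; return all zeros.
        return [[0 for _ in row] for row in depth]
    return [[round((v - lo) / span * 255) for v in row] for row in depth]


# Toy 2x3 "depth map" with arbitrary relative values (not real model output).
toy = [[0.5, 1.0, 1.5], [2.0, 2.5, 0.5]]
gray = to_uint8(toy)
```

The smallest value in the map is sent to 0 and the largest to 255, so the result is only meaningful within a single image; two images rescaled this way cannot be compared pixel for pixel.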

How does Depth Anything V2 Base compare to the Small or Large versions?

The Base model (ViT-B, ~97M parameters) balances speed and accuracy: it is faster than Large (ViT-L, ~335M parameters) and more capable than Small (ViT-S, ~25M parameters).

What is the license for using this model?

The model is released under the CC-BY-NC-4.0 license, which allows research and other non-commercial use with attribution. It does not grant commercial use rights.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, sending an image URL or base64-encoded image in the request.
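One common way to send a base64-encoded image is to wrap the file bytes in a data URL. The sketch below covers only that encoding step, using the Python standard library. The helper name `image_to_data_url` is hypothetical, and the request schema for this model is not documented here, so no endpoint or payload fields are shown.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    # Guess the MIME type from the file extension (e.g. .png -> image/png).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Base64 makes the payload roughly a third larger than the original file, so an image URL may be preferable for large inputs when the image is already hosted somewhere reachable.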

not yet live

We're benchmarking and onboarding Depth Anything V2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related depth estimation models

Depth-Anything-V2-Small-hf

1.7M dl/mo

DA3METRIC-LARGE

825K dl/mo

depth-anything-large-hf

388.9K dl/mo

dpt-hybrid-midas

225.1K dl/mo

DA3NESTED-GIANT-LARGE-1.1

199.9K dl/mo

Depth-Anything-V2-Large-hf

199.1K dl/mo