DepthPro

apple/DepthPro-hf

published Nov 2024 · updated Feb 2025

DepthPro is a depth model that produces high-resolution metric depth maps from a single image without requiring camera metadata.

est. price

~$0.235

/ 1k images · estimated, set at launch

downloads / mo

23.3K

license

apple-amlr

specs

Task	Monocular Depth Estimation
Architecture	Multi-scale Vision Transformer (ViT) with DPT-like fusion
Input Resolution	1536x1536 pixels
Speed	0.3 seconds for a 2.25-megapixel depth map on a standard GPU
License	Apple-ASCL

about this model

DepthPro-hf is a foundation model for zero-shot metric monocular depth estimation that generates high-resolution depth maps with absolute scale without requiring camera intrinsics. Developed by Apple and published at ICLR 2025, it produces a 2.25-megapixel depth map in 0.3 seconds on a standard GPU.
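The depth-map size and the 1536x1536 input resolution in the specs agree if the depth map is produced at the input resolution (an assumption here) and a megapixel is counted as 2^20 pixels:

```latex
1536 \times 1536 = 2{,}359{,}296 = 2.25 \times 2^{20}\ \text{pixels}
```

In decimal units that is about 2.36 million pixels.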

Example depth map produced by DepthPro showing sharp boundaries and fine details

The model employs a multi-scale Vision Transformer (ViT) architecture. The input image is downsampled to several scales, and each scale is split into patches that are processed by a shared DINOv2 encoder. The resulting features are merged and refined through a DPT-like fusion stage. A separate head estimates focal length from the same image. Training combines real and synthetic datasets to achieve high metric accuracy while tracing fine boundaries.
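The multi-scale patching step can be sketched in plain Python. This is a minimal sketch, not the reference implementation. The three scales (1536, 768 and 384 pixels), the 384-pixel patch size and the per-scale overlap ratios are assumptions chosen for illustration. The code only computes where the overlapping patches fall at each scale. In the model, every one of those patches would then pass through the same shared encoder.

```python
import math

# Hypothetical configuration for illustration only; not taken from this page.
PATCH_SIZE = 384
SCALES = {1536: 0.25, 768: 0.5, 384: 0.0}  # image side length -> overlap ratio between patches


def patch_origins(image_size: int, patch_size: int, overlap_ratio: float) -> list[int]:
    """Offsets along one axis of square patches that tile the image with the given overlap."""
    if image_size <= patch_size:
        return [0]  # the whole downsampled image fits in a single patch
    stride = max(1, int(patch_size * (1 - overlap_ratio)))
    steps = math.ceil((image_size - patch_size) / stride) + 1
    # Clamp so the last patch never runs past the image border.
    return [min(i * stride, image_size - patch_size) for i in range(steps)]


def patch_boxes(image_size: int, patch_size: int, overlap_ratio: float) -> list[tuple[int, int, int, int]]:
    """(top, left, bottom, right) boxes of every patch at one scale."""
    origins = patch_origins(image_size, patch_size, overlap_ratio)
    return [(y, x, y + patch_size, x + patch_size) for y in origins for x in origins]


if __name__ == "__main__":
    total = 0
    for size, overlap in SCALES.items():
        boxes = patch_boxes(size, PATCH_SIZE, overlap)
        total += len(boxes)
        print(f"{size}x{size}: {len(boxes)} patches")
    # All patches, from every scale, would share one encoder's weights.
    print(f"total patches fed to the shared encoder: {total}")
```

Under these assumed settings the scales yield 25, 9 and 1 patches. Sharing one encoder across all of them is what keeps the parameter count independent of the number of scales.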

Training Data

The model was trained on the following datasets:

Training Hyperparameters

Evaluation

Depth Pro outperforms prior work along multiple dimensions, including metric accuracy, boundary sharpness, and focal length estimation. The model offered here is a re-trained reference implementation. Its performance is close to, but does not exactly match, the results reported in the paper.

Architecture

Additional technical details are available in the paper and the official repository.

best for

·Generating metric depth maps for 3D reconstruction
·Estimating absolute-scale depth from a single photo without camera intrinsics
·Focal length estimation from a single image
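For 3D reconstruction, a metric depth map plus a focal length is enough to lift each pixel to a 3D point under a pinhole camera model. The sketch below is hypothetical helper code, not part of any DepthPro API. It assumes square pixels, a principal point at the image centre, depth in metres and a focal length in pixels.

```python
# Minimal pinhole back-projection sketch (hypothetical helper, not a DepthPro API).
# Assumptions: square pixels, principal point at the image centre,
# depth in metres, focal length in pixels.
Point3D = tuple[float, float, float]


def backproject(depth: list[list[float]], focal_px: float) -> list[Point3D]:
    """Turn a metric depth map into camera-space points (X right, Y down, Z forward)."""
    height, width = len(depth), len(depth[0])
    # Centre of the pixel grid when pixel centres sit at integer coordinates.
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            points.append((x, y, z))
    return points
```

For example, `backproject([[2.0, 2.0, 2.0]], focal_px=1.0)` places the middle pixel on the optical axis at 2 m. Its neighbours are offset by 2 m to either side.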

FAQ

What input format does the model expect?

An RGB image, resized to 1536x1536 pixels and normalized with mean=[0.5,0.5,0.5] and std=[0.5,0.5,0.5].
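A dependency-free sketch of that preprocessing follows. It is illustrative only and makes several assumptions. The image is a list of rows of (R, G, B) tuples with 0-255 values. Pixels are rescaled to [0, 1] before normalisation, so mean and std of 0.5 map values to [-1, 1]. Nearest-neighbour resizing stands in for whatever interpolation the real processor uses.

```python
# Illustrative preprocessing sketch; interpolation and value range are assumptions.
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)
TARGET_SIZE = 1536

Pixel = tuple[int, int, int]


def resize_nearest(img: list[list[Pixel]], height: int, width: int) -> list[list[Pixel]]:
    """Nearest-neighbour resize of an H x W grid of RGB tuples."""
    src_h, src_w = len(img), len(img[0])
    return [[img[y * src_h // height][x * src_w // width] for x in range(width)] for y in range(height)]


def normalize(img: list[list[Pixel]]) -> list[list[tuple[float, ...]]]:
    """Rescale 0-255 to [0, 1], then apply (value - mean) / std per channel."""
    return [
        [tuple((c / 255.0 - m) / s for c, m, s in zip(px, MEAN, STD)) for px in row]
        for row in img
    ]


def to_channels_first(img: list[list[tuple[float, ...]]]) -> list[list[list[float]]]:
    """Convert H x W x 3 into 3 x H x W, the layout PyTorch vision models expect."""
    return [[[px[ch] for px in row] for row in img] for ch in range(3)]


def preprocess(img: list[list[Pixel]], size: int = TARGET_SIZE) -> list[list[list[float]]]:
    return to_channels_first(normalize(resize_nearest(img, size, size)))
```

With the default `TARGET_SIZE`, `preprocess(image)` returns a 3 x 1536 x 1536 nested list. For real images an array library is the practical choice. The pure-Python version only makes each step explicit.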

What does the model output?

It outputs a metric depth map (in meters) and an estimate of the focal length (or field of view).
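The focal length and field of view are related by the pinhole camera relation f = (W / 2) / tan(FOV / 2), where W is the image width. The helpers below sketch that conversion. They assume the field of view is horizontal and given in degrees, and that the focal length is in pixels. Check the actual output fields before relying on these conventions.

```python
import math


def focal_length_px(fov_degrees: float, image_width: int) -> float:
    """Focal length in pixels for a horizontal field of view spanning the full image width."""
    return 0.5 * image_width / math.tan(math.radians(fov_degrees) / 2.0)


def fov_degrees(focal_px: float, image_width: int) -> float:
    """Inverse of focal_length_px: horizontal field of view in degrees."""
    return math.degrees(2.0 * math.atan(0.5 * image_width / focal_px))
```

A 90-degree horizontal field of view on a 1536-pixel-wide image corresponds to a focal length of 768 pixels, because tan(45°) = 1.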

What license is used?

The model is released under the Apple-ASCL license.

Is DepthPro zero-shot?

Yes, it is a foundation model for zero-shot metric monocular depth estimation, trained on mixed real and synthetic datasets.

How can I call this model via an API?

Once DepthPro is live, call it through the gigarouter OpenAI-compatible endpoint with your API key.

not yet live

We're benchmarking and onboarding DepthPro as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related depth estimation models

Depth-Anything-V2-Small-hf	1.7M dl/mo
DA3METRIC-LARGE	825K dl/mo
depth-anything-large-hf	388.9K dl/mo
dpt-hybrid-midas	225.1K dl/mo
DA3NESTED-GIANT-LARGE-1.1	199.9K dl/mo
Depth-Anything-V2-Large-hf	199.1K dl/mo