
DPT Large

Intel/dpt-large

published Mar 2022 · updated Feb 2024

DPT Large is a monocular depth estimation model that uses a Vision Transformer backbone to predict a depth map from a single RGB image.

est. price: ~$0.094 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 67.1K
license: apache-2.0

specs

Task: Monocular Depth Estimation
Architecture: Dense Prediction Transformer (Vision Transformer backbone with convolutional decoder)
License: Apache 2.0

about this model

Intel/dpt-large is a monocular depth estimation model that uses a Vision Transformer (ViT) backbone with a convolutional decoder for dense prediction. It is the large variant of the Dense Prediction Transformer (DPT) family, also known as MiDaS 3.0, and was introduced by Ranftl et al. (2021) in the paper Vision Transformers for Dense Prediction.

Key strengths

The transformer backbone processes representations at a constant high resolution and maintains a global receptive field at every stage, enabling finer-grained and more globally coherent depth predictions compared to fully-convolutional networks. Trained on the MIX 6 dataset (1.4 million images), the model achieves up to 28% relative improvement over state-of-the-art fully-convolutional networks for monocular depth estimation (per the paper). For semantic segmentation on ADE20K, DPT set a new state of the art with 49.02% mIoU.

Zero-shot cross-dataset benchmark results

The table below compares DPT-Large with prior methods on zero-shot transfer across six datasets. Lower is better for all metrics.

Model       Training set  DIW WHDR  ETH3D AbsRel  Sintel AbsRel  KITTI δ>1.25  NYU δ>1.25  TUM δ>1.25
DPT-Large   MIX 6         10.82     0.089         0.270          8.46          8.32        9.97
DPT-Hybrid  MIX 6         11.06     0.093         0.274          11.56         8.69        10.89
MiDaS       MIX 6         12.95     0.116         0.329          16.08         8.
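
For readers unfamiliar with the column headers, here is a minimal sketch of how two of the metrics are commonly defined: AbsRel as the mean of |pred - gt| / gt, and δ>1.25 as the percentage of pixels where max(pred/gt, gt/pred) exceeds 1.25. The function names and the flat-list inputs are illustrative assumptions, not code from the paper. Zero-shot evaluations usually align predictions to ground truth before scoring, and this sketch skips that step.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_exceed_pct(pred, gt, threshold=1.25):
    """Percentage of valid pixels whose ratio max(pred/gt, gt/pred) exceeds threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    bad = 0
    for p, g in pairs:
        # Assumption: a non-positive prediction is counted as exceeding the threshold.
        if p <= 0 or max(p / g, g / p) > threshold:
            bad += 1
    return 100.0 * bad / len(pairs)


# Toy depth values, flattened to 1-D lists for illustration only.
pred = [1.0, 2.0, 4.0, 3.0]
gt = [1.0, 2.5, 2.0, 3.0]
print(abs_rel(pred, gt))           # approximately 0.3
print(delta_exceed_pct(pred, gt))  # 25.0
```

Both functions return lower values for better predictions, which matches the "lower is better" reading of the table.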


FAQ

What is the DPT Large model best used for?

DPT Large is designed for zero-shot monocular depth estimation: predicting depth from a single image without fine-tuning.
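
As a rough illustration of single-image inference, the sketch below runs the model locally with the Hugging Face transformers library rather than through gigarouter. It assumes torch, transformers, and Pillow are installed and that the Intel/dpt-large weights can be downloaded. The function names estimate_depth and target_size and the file name in the usage note are hypothetical.

```python
def target_size(pil_size):
    """Convert PIL's (width, height) to the (height, width) order torch expects."""
    width, height = pil_size
    return (height, width)


def estimate_depth(image_path, model_id="Intel/dpt-large"):
    """Return a 2-D array of relative depth values at the input image's resolution."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    from PIL import Image
    from transformers import DPTForDepthEstimation, DPTImageProcessor

    image = Image.open(image_path).convert("RGB")
    processor = DPTImageProcessor.from_pretrained(model_id)
    model = DPTForDepthEstimation.from_pretrained(model_id)
    model.eval()

    # The processor resizes and normalizes the image into model inputs.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted_depth = model(**inputs).predicted_depth  # (batch, H', W')

    # Upsample the prediction back to the original image size.
    resized = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=target_size(image.size),
        mode="bicubic",
        align_corners=False,
    )
    return resized.squeeze().cpu().numpy()
```

Calling estimate_depth("input.jpg") on a local file should return an array with the same height and width as the image. The values are relative rather than metric distances, so they are typically rescaled before being saved as a grayscale depth image.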

How does DPT Large differ from MiDaS?

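DPT Large replaces the convolutional backbone of earlier MiDaS releases with a Vision Transformer; the DPT family is itself also known as MiDaS 3.0. The paper reports up to 28% relative improvement over state-of-the-art fully-convolutional networks for monocular depth estimation.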
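
To make "relative improvement" concrete, the sketch below applies the usual formula, (baseline - candidate) / baseline, to the DPT-Large and MiDaS rows of the benchmark table on this page. Only the four columns with complete values in both rows are used. These are per-metric figures; this page does not state how the paper aggregates results into its "up to 28%" number, so the two should not be expected to match.

```python
def relative_improvement(baseline, candidate):
    """Fractional error reduction of candidate versus baseline (lower error is better)."""
    return (baseline - candidate) / baseline


# Values copied from the benchmark table on this page (both rows trained on MIX 6).
midas = {"DIW WHDR": 12.95, "ETH3D AbsRel": 0.116, "Sintel AbsRel": 0.329, "KITTI δ>1.25": 16.08}
dpt_large = {"DIW WHDR": 10.82, "ETH3D AbsRel": 0.089, "Sintel AbsRel": 0.270, "KITTI δ>1.25": 8.46}

for name in midas:
    gain = relative_improvement(midas[name], dpt_large[name])
    print(f"{name}: {gain:.1%}")
# DIW WHDR: 16.4%
# ETH3D AbsRel: 23.3%
# Sintel AbsRel: 17.9%
# KITTI δ>1.25: 47.4%
```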

What input format does the model require?

The model expects a single RGB image. During training, each image was resized so that its longer side is 384 pixels, and the model was trained on 384x384 crops. At inference, the DPTImageProcessor handles preprocessing.
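
The training-time resize rule can be sketched as simple arithmetic: scale both sides by 384 divided by the longer side. The helper name resize_longer_side and the rounding behavior are assumptions made for illustration. This is not the DPTImageProcessor implementation, which may apply different sizing rules at inference.

```python
def resize_longer_side(width, height, target=384):
    """Return (new_width, new_height) with the longer side scaled to `target` pixels."""
    longer = max(width, height)
    # Assumption: round to the nearest pixel and never go below 1.
    new_width = max(1, round(width * target / longer))
    new_height = max(1, round(height * target / longer))
    return (new_width, new_height)


print(resize_longer_side(640, 480))   # (384, 288)
print(resize_longer_side(480, 640))   # (288, 384)
print(resize_longer_side(1000, 500))  # (384, 192)
```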

What license is the model released under?

The model is released under the Apache 2.0 license according to the Hugging Face model card.

How can I use this model via the gigarouter API?

DPT Large is not yet live on gigarouter. Once it launches, you will be able to call the model through gigarouter's OpenAI-compatible endpoint using your API key, sending an image and receiving a depth map in response.

not yet live

We're benchmarking and onboarding DPT Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
