UniDepth V2 ViT-L14

lpiccinelli/unidepth-v2-vitl14

published Jun 2024 · updated Feb 2025

UniDepth V2 ViT-L14 is a monocular metric depth estimation model that predicts accurate metric depth from a single RGB image, achieving state-of-the-art results on KITTI and NYU Depth v2 benchmarks.

status

coming soon

API providers

downloads / mo

6.3M

specs

Task	Monocular Metric Depth Estimation
Architecture	ViT-L14 backbone with self-promptable camera module and pseudo-spherical output representation
Input	Single RGB image (any camera)
Output	Metric depth map (meters) and uncertainty map

about this model

lpiccinelli/unidepth-v2-vitl14 is a universal monocular depth estimation model that predicts metric depth from a single image without requiring camera intrinsics. It is built on a ViT-L/14 backbone and has two key components. A self-promptable camera module predicts a dense camera representation from the image itself. A pseudo-spherical output representation (azimuth, elevation, depth) disentangles camera parameters from depth. Together they enable zero-shot generalization across diverse datasets and camera geometries.
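To illustrate why a pseudo-spherical output separates camera from depth, the sketch below converts one (azimuth, elevation, depth) sample into a 3D point. The two angles fix only the viewing ray (the camera part), and depth fixes only how far along that ray the point lies. The axis convention, the treatment of depth as distance along the optical axis, and the function name `pseudo_spherical_to_xyz` are assumptions made for this example; the model's actual parameterization may differ.

```python
import math


def pseudo_spherical_to_xyz(azimuth, elevation, depth):
    """Convert one (azimuth, elevation, depth) sample to a 3D point in meters.

    Assumed convention (illustrative, not taken from the model card):
      - the camera looks along +z, x points right, y points up
      - azimuth is the angle in the x-z plane, measured from +z (radians)
      - elevation is the angle above the x-z plane (radians)
      - depth is the distance along the optical axis (z-buffer depth)
    """
    # Unit ray direction determined by the two angles alone.
    dx = math.cos(elevation) * math.sin(azimuth)
    dy = math.sin(elevation)
    dz = math.cos(elevation) * math.cos(azimuth)
    if dz <= 0:
        raise ValueError("ray does not point in front of the camera")
    # Scale the ray so that its z component equals the predicted depth.
    scale = depth / dz
    return (dx * scale, dy * scale, depth)


# A pixel on the optical axis, 2 m away.
print(pseudo_spherical_to_xyz(0.0, 0.0, 2.0))
# A pixel 45 degrees to the right, at 1 m depth.
print(pseudo_spherical_to_xyz(math.pi / 4, 0.0, 1.0))
```

Changing the depth moves the point along the same ray, and changing the angles moves the ray without touching the depth value, which is the sense in which the two are disentangled under this convention.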

UniDepthV2 improves on UniDepthV1, a CVPR 2024 Highlight, by adding an uncertainty output for downstream tasks that need confidence estimates and an edge-guided loss that sharpens depth boundaries. The model reports state-of-the-art monocular depth estimation results on NYU Depth v2 and KITTI Eigen and ranked 1st on the KITTI depth prediction benchmark at the time of submission. On the provided test assets, its demo achieves an absolute relative error (ARel) of 7.45%.
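For readers unfamiliar with the metric, absolute relative error is conventionally the mean of |predicted - ground truth| / ground truth over pixels that have valid ground truth. The sketch below computes it on flat lists of depths in meters. The function name `abs_rel_error`, the sample values, and the use of a non-positive ground-truth value to mark missing pixels are assumptions for illustration, not the evaluation code behind the 7.45% figure.

```python
def abs_rel_error(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid ground truth (gt > 0)."""
    ratios = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not ratios:
        raise ValueError("no valid ground-truth pixels")
    return sum(ratios) / len(ratios)


# Made-up depths in meters; 0.0 marks a pixel without ground truth.
pred = [2.0, 4.4, 9.0, 3.0]
gt = [2.0, 4.0, 10.0, 0.0]
print(f"ARel = {abs_rel_error(pred, gt):.2%}")
```

Because the error is divided by the true depth, a 10 cm mistake at 1 m counts ten times as much as the same mistake at 10 m.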

This model is being onboarded as a managed, OpenAI-compatible API on gigarouter. Once live, it will require no local installation or GPU infrastructure.

best for

·Autonomous vehicle depth perception
·3D scene reconstruction from a single image
·Augmented reality depth-aware effects
·Robotic manipulation and navigation

FAQ

What is UniDepth V2 ViT-L14 best for?

It is best for monocular metric depth estimation, providing accurate depth in meters from a single image with state-of-the-art performance on KITTI and NYU Depth v2.

What input does the model accept?

It accepts a single RGB image. The self-promptable camera module handles any camera geometry automatically.

What does the model output?

It outputs a metric depth map in meters and an optional uncertainty map for downstream tasks requiring confidence.
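One way a downstream task might use the two outputs together is to discard depth values whose uncertainty is too high. The sketch below does this on small nested lists. It assumes that larger uncertainty values mean lower confidence and that both maps have the same shape; the function name `mask_uncertain_depth` and the threshold are hypothetical, and the scale of the model's uncertainty output is not specified here.

```python
def mask_uncertain_depth(depth, uncertainty, max_uncertainty):
    """Return a copy of the depth map with low-confidence pixels set to None.

    depth and uncertainty are same-shaped lists of rows. A pixel is kept only
    if its uncertainty is at most max_uncertainty (assumed: higher = less sure).
    """
    if len(depth) != len(uncertainty):
        raise ValueError("depth and uncertainty must have the same number of rows")
    masked = []
    for d_row, u_row in zip(depth, uncertainty):
        if len(d_row) != len(u_row):
            raise ValueError("depth and uncertainty rows must have the same length")
        masked.append(
            [d if u <= max_uncertainty else None for d, u in zip(d_row, u_row)]
        )
    return masked


# Made-up 2x2 example: depths in meters, uncertainty on an arbitrary scale.
depth = [[1.0, 2.0], [3.0, 4.0]]
uncertainty = [[0.1, 0.9], [0.2, 0.5]]
print(mask_uncertain_depth(depth, uncertainty, max_uncertainty=0.5))
```

Returning a new map instead of modifying the input keeps the raw prediction available for tasks that want to weight pixels by confidence instead of dropping them.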

How can I use this model via the gigarouter API?

The model is not yet live. Once it is available, you will send an image to the gigarouter OpenAI-compatible endpoint with your API key to receive depth predictions.

How does UniDepth V2 handle different camera geometries?

It uses a self-promptable camera module that predicts dense camera representations, enabling universal metric depth estimation without camera calibration.

not yet live

We're benchmarking and onboarding UniDepth V2 ViT-L14 as a hosted, OpenAI-compatible API.

related specialist models


electra-base-discriminator

wespeaker-voxceleb-resnet34-LM

6.8M dl/mo

stable-diffusion-v1-5-archive

5.8M dl/mo