DepthCrafter
tencent/DepthCrafter
published Sep 2024 · updated Jul 2025
DepthCrafter is a depth model that generates temporally consistent long depth sequences for open-world videos.
specs
| Spec | Value |
|---|---|
| Task | Video Depth Estimation |
| Architecture | Video-to-depth diffusion model fine-tuned from a pre-trained image-to-video diffusion model |
| Max Sequence Length | 110 frames per generation |
| Training Data | Realistic and synthetic video datasets |
| Inference Strategy | Overlapped segment estimation with latent interpolation for arbitrarily long videos |
about this model
DepthCrafter is a video depth estimation model that generates temporally consistent long depth sequences with fine-grained details for open-world videos, without requiring camera poses or optical flow. It is trained from a pre-trained image-to-video diffusion model using a three-stage strategy, enabling variable-length depth sequences up to 110 frames per pass and extreme-length videos through segment-wise estimation with seamless stitching.
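The segment-wise strategy can be illustrated with a minimal sketch. It assumes a 110-frame window and a 25-frame overlap (the overlap value is an assumption, not taken from this page), and it crossfades one scalar per frame as a stand-in for the latent interpolation the model is described as using. The names `plan_segments` and `stitch` are hypothetical helpers, not part of DepthCrafter.

```python
from typing import List, Sequence, Tuple


def plan_segments(num_frames: int, window: int = 110,
                  overlap: int = 25) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges that cover the video with overlapping windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    segments = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments


def stitch(segments: Sequence[Tuple[int, int]],
           outputs: Sequence[Sequence[float]]) -> List[float]:
    """Linearly crossfade per-frame values where consecutive segments overlap.

    Each entry of `outputs` holds one value per frame of the matching segment.
    A real pipeline would blend latents or depth maps instead of scalars.
    """
    stitched = list(outputs[0])
    for (start, _), values in zip(segments[1:], outputs[1:]):
        n_overlap = len(stitched) - start
        for i in range(n_overlap):
            w = (i + 1) / (n_overlap + 1)  # weight of the newer segment
            stitched[start + i] = (1 - w) * stitched[start + i] + w * values[i]
        stitched.extend(values[n_overlap:])
    return stitched


if __name__ == "__main__":
    print(plan_segments(250))  # three windows for a 250-frame clip
```

The linear ramp gives the earlier segment most of the weight at the start of the overlap and the later segment most of the weight at the end, which avoids a visible jump at the seam.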
Key Strengths
- State-of-the-art zero-shot performance on open-world video depth estimation, validated across multiple datasets.
- Selected as a CVPR 2025 Highlight and winner of the Best Paper Award at the PixFoundation workshop.
- Handles diverse video content, motion, and camera movement without additional input.
- Faster inference than its predecessor: version 1.0.1 runs at 465.84 ms/frame at 1024×576 resolution, about 4.1× faster than the previous version (1913.92 ms/frame). For comparison, Depth-Anything-V2 takes 180.46 ms/frame and Marigold 1070.29 ms/frame.
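To put the per-frame latencies in context, the following sketch converts them into throughput, relative speedup, and total time for one 110-frame pass. It uses only the figures quoted above; the function names are illustrative.

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return old_ms / new_ms


def clip_seconds(num_frames: int, ms_per_frame: float) -> float:
    """Total processing time in seconds for a clip at the given latency."""
    return num_frames * ms_per_frame / 1000.0


if __name__ == "__main__":
    print(round(frames_per_second(465.84), 2))  # throughput of v1.0.1
    print(round(speedup(1913.92, 465.84), 2))   # v1.0.1 vs. previous version
    print(round(clip_seconds(110, 465.84), 2))  # one full 110-frame pass
```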
Benchmark Accuracy
Absolute Relative Error (AbsRel ↓) and δ₁ accuracy (↑) on four datasets:
| Dataset | AbsRel | δ₁ |
|---|---|---|
| Sintel | 0.270 | 0.697 |
| ScanNet | 0.123 | 0.856 |
| KITTI | 0.104 | 0.896 |
| Bonn | 0.071 | 0.972 |
DepthCrafter v1.0.1 outperforms Marigold, Depth-Anything-V2, and its own previous version on all four datasets in both AbsRel and δ₁.
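For readers unfamiliar with the two metrics, here is a minimal sketch of their standard definitions: AbsRel is the mean of |d − d*| / d*, and δ₁ is the fraction of pixels where max(d/d*, d*/d) < 1.25. Evaluation protocols for relative-depth models typically align predictions to ground truth (for example by scale and shift) before computing these; that step is omitted here, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def abs_rel_and_delta1(pred: Sequence[float], gt: Sequence[float],
                       threshold: float = 1.25) -> Tuple[float, float]:
    """Compute AbsRel (lower is better) and delta_1 (higher is better).

    Pixels with non-positive ground truth or prediction are skipped so that
    both ratios are well defined.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / len(pairs)
    return abs_rel, delta1


if __name__ == "__main__":
    # Toy flattened depth maps; the values are illustrative only.
    print(abs_rel_and_delta1([1.0, 2.2, 3.0, 8.0], [1.0, 2.0, 4.0, 4.0]))
```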
Output and Integration
- Supports EXR output format for high-dynamic-range depth maps.
- Community integrations available for ComfyUI and Nuke.
- For business licensing inquiries, contact [email protected].
For more visualizations and details, see the project page.
best for
- Generating consistent depth for long open-world videos
- Depth-based visual effects such as background replacement and 3D point cloud creation
- Conditional video generation guided by depth maps
- Seamless depth estimation for extremely long videos via segment-wise stitching
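As an illustration of the point cloud use case, a depth map can be lifted to 3D with a pinhole camera model. This is a sketch under stated assumptions: the depth values are assumed to already be in metric units and the intrinsics (`fx`, `fy`, `cx`, `cy`) are assumed known, neither of which this page says the model provides. `unproject` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject(depth: Sequence[Sequence[float]],
              fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift a depth map (rows of z values) to camera-space 3D points.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


if __name__ == "__main__":
    # A toy 2x2 depth map with one invalid pixel; values are illustrative.
    cloud = unproject([[2.0, 0.0], [4.0, 2.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
    print(cloud)
```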
FAQ
How many frames can DepthCrafter process at once?
It can generate depth for up to 110 frames in a single pass. Longer videos are handled by segment-wise estimation, with overlapping segments stitched together.
How does DepthCrafter differ from Depth-Anything-V2?
Depth-Anything-V2 estimates depth frame by frame and does not enforce temporal consistency. DepthCrafter produces depth that stays smooth across frames and targets open-world video depth in zero-shot settings.
How fast and how accurate is it?
At 1024×576 resolution, v1.0.1 runs at 465.84 ms/frame. Reported AbsRel / δ₁: Sintel 0.270 / 0.697, ScanNet 0.123 / 0.856, KITTI 0.104 / 0.896, Bonn 0.071 / 0.972.
Does it need camera poses or optical flow?
No. It requires only RGB video frames; no additional input such as camera poses or optical flow is needed.
How do I access DepthCrafter through the API?
Use the gigarouter OpenAI-compatible endpoint with your API key. The input is a sequence of video frames; the output is a set of depth maps in a format specified by the API.
We're benchmarking and onboarding DepthCrafter as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.