Depth Anything 3 Nested Giant Large
depth-anything/DA3NESTED-GIANT-LARGE
published Nov 2025 · updated Nov 2025
Depth Anything 3 Nested Giant Large is a depth model that combines the any-view Giant model and the metric Large model for metric-scale visual geometry reconstruction.
specs
| Task | Metric Depth Estimation, Pose Estimation, 3D Reconstruction |
| Architecture | Plain transformer with unified depth-ray representation |
| Parameters | 1.40B |
| License | CC BY-NC 4.0 (non-commercial only) |
about this model
DA3NESTED-GIANT-LARGE is a visual geometry model that recovers spatially consistent metric-scale depth, camera pose, and 3D structure from arbitrary image inputs, with or without known camera poses. It combines the any-view Giant model with the metric Large model in a nested architecture, totaling 1.40B parameters.
Developed by the ByteDance Seed Team, the model uses a single plain transformer (vanilla DINO encoder) with a unified depth-ray representation, eliminating the need for complex multi-task learning. It is trained exclusively on public academic datasets.
Capabilities
- Relative and metric depth estimation
- Camera pose estimation and pose conditioning
- 3D Gaussian reconstruction and sky segmentation
- Export formats: GLB, NPZ, PLY, mini NPZ, GS PLY, GS video
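PLY, one of the export formats listed above, is a simple point-cloud container. As a rough illustration of what such an export holds, the sketch below serializes 3D points into an ASCII PLY string using only the Python standard library. The function name points_to_ascii_ply is hypothetical and is not part of the Depth Anything 3 codebase; the model's own exporter may write binary PLY or extra attributes such as color.

```python
from typing import Iterable, Sequence


def points_to_ascii_ply(points: Iterable[Sequence[float]]) -> str:
    """Serialize (x, y, z) points as an ASCII PLY document.

    Hypothetical helper for illustration only, not the model's exporter.
    """
    rows = [tuple(float(c) for c in p) for p in points]
    for row in rows:
        if len(row) != 3:
            raise ValueError("each point needs exactly three coordinates")

    # Standard PLY header: magic line, format, vertex count, properties.
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(rows)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    # One vertex per line, in the order the properties were declared.
    body = [f"{x:.6f} {y:.6f} {z:.6f}" for x, y, z in rows]
    return "\n".join(header + body) + "\n"
```

Writing the returned string to a file with a .ply extension should produce a point cloud that common 3D viewers can open.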
Performance
Depth Anything 3 outperforms prior state-of-the-art models: against VGGT, it improves camera pose accuracy by 44.3% and geometric accuracy by 25.1% on average. It also surpasses Depth Anything 2 in monocular depth estimation.
Architecture and Training
The model uses a plain transformer backbone (vanilla DINO encoder) with a depth-ray representation, avoiding architectural specialization or complex multi-task learning. A teacher-student training paradigm is employed to achieve detail and generalization on par with Depth Anything 2.
Additional Features
- DA3-Streaming: handles ultra-long video sequence inference with less than 12GB GPU memory via sliding-window streaming
- Reference view selection for multi-view inputs via ref_view_strategy
- Evaluation benchmark pipeline with 5 datasets (ETH3D, ScanNet++, DTU, 7Scenes, HiRoom)
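The sliding-window idea behind DA3-Streaming can be sketched in a few lines: a long frame sequence is split into overlapping windows so that peak memory depends on the window size rather than the video length. This is a minimal sketch, not the DA3-Streaming implementation. The function name and the window and overlap parameters are assumptions for illustration, and the sketch does not cover how adjacent windows are aligned to one another.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that cover the sequence.

    Hypothetical helper: consecutive ranges share `overlap` frames, which
    a streaming pipeline could use to stitch per-window predictions.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    stride = window - overlap
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break  # the last window reached the end of the sequence
        start += stride
    return ranges


# Example: 10 frames, windows of 4 frames, 1 shared frame between neighbours.
print(sliding_windows(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

Each range would be passed to the model on its own, so only one window of frames needs to be held in GPU memory at a time.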
Licensed under CC BY-NC 4.0 (non-commercial use only).
best for
- Metric-scale depth estimation from single or multi-view images
- Camera pose estimation for multi-view geometry
- 3D scene reconstruction including Gaussian splatting
FAQ
The model is designed for metric-scale visual geometry reconstruction, including depth estimation, camera pose estimation, and 3D reconstruction from arbitrary views.
It significantly outperforms Depth Anything 2 in monocular depth estimation and also surpasses VGGT in multi-view depth and pose estimation.
The model is licensed under CC BY-NC 4.0, which permits non-commercial use only.
Input: image paths, PIL images, or numpy arrays. Output: depth maps, confidence maps, camera extrinsics (w2c), and intrinsics (all float32). Optional export formats include glb, npz, ply, mini_npz, gs_ply, gs_video.
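To show how these outputs fit together, the sketch below lifts one depth map into world-space 3D points using its intrinsics and world-to-camera (w2c) extrinsics. It makes several assumptions that the text above does not confirm: depth is measured along the optical axis (z-depth), the camera follows a pinhole model with a 3x3 intrinsics matrix, and the extrinsics are a 3x4 or 4x4 matrix of the form [R | t]. The function name backproject_to_world is hypothetical.

```python
import numpy as np


def backproject_to_world(depth: np.ndarray, K: np.ndarray, w2c: np.ndarray) -> np.ndarray:
    """Lift an (H, W) depth map to (H, W, 3) world-space points.

    Assumes z-depth, pinhole intrinsics K (3x3), and a world-to-camera
    matrix w2c of shape (3, 4) or (4, 4) holding [R | t].
    """
    h, w = depth.shape
    # Homogeneous pixel coordinates (u, v, 1) sampled at integer pixel positions.
    u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Camera-frame rays with unit z, then scaled by the per-pixel depth.
    rays_cam = pixels @ np.linalg.inv(K).T
    points_cam = rays_cam * depth[..., None]

    # Invert the world-to-camera transform: X_world = R^T (X_cam - t).
    R, t = w2c[:3, :3], w2c[:3, 3]
    return (points_cam - t) @ R
```

Applying this to every view and concatenating the results gives a fused point cloud in a shared world frame, provided the assumed conventions match those of the actual outputs.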
Use the OpenAI-compatible endpoint with your gigarouter API key. Send image data in the request and receive depth/pose/3D outputs in the response.
We're benchmarking and onboarding Depth Anything 3 Nested Giant Large as a hosted, OpenAI-compatible API.