Depth Anything 3 Small
depth-anything/DA3-SMALL
published Nov 2025 · updated Nov 2025
Depth Anything 3 Small is a multi-view depth estimation and camera pose estimation model that uses a unified depth-ray representation in a plain transformer.
specs
| Task | Multi-view Depth Estimation & Camera Pose Estimation |
| Architecture | Plain Transformer (Vision Transformer) with unified depth-ray representation |
| Parameters | 0.08B |
| License | Apache 2.0 |
about this model
Depth Anything 3 DA3-Small is a multi-view depth estimation and camera pose estimation model that predicts spatially consistent geometry from arbitrary visual inputs, with or without known camera poses. It uses a single plain transformer backbone with a unified depth-ray representation, trained exclusively on public academic datasets.
Key Capabilities
- Relative depth estimation from single or multiple images
- Camera pose estimation (extrinsics and intrinsics)
- Pose conditioning for geometry-aware inference
- Feed-forward 3D Gaussian estimation for novel view synthesis
- Multi-camera spatial perception (e.g., autonomous driving setups with non-overlapping views)
Performance
DA3-Small significantly outperforms Depth Anything 2 for monocular depth estimation and VGGT for multi-view depth and pose estimation. On the project page, DA3 surpasses VGGT by 35.7% in camera pose accuracy and 23.6% in geometric accuracy (arXiv reports 44.3% and 25.1% respectively). The model also reduces drift in large-scale SLAM applications, outperforming COLMAP (which requires over 48 hours).
Architecture and Training
DA3-Small has 0.08B parameters and uses a teacher-student training paradigm to achieve detail and generalization on par with DA2. A DPT head can be trained on the frozen backbone to predict 3D Gaussian parameters for generalizable novel view synthesis. The model supports DA3-Streaming for ultra-long video sequences using under 12GB GPU memory via sliding-window inference.
Benchmarks
The paper introduces a visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. A benchmark evaluation pipeline is released for pose estimation and 3D reconstruction on five datasets. The model was accepted as an ICLR 2026 Oral.
Limitations
- Trained on academic datasets; performance may vary on domain-specific images
- Results depend on image quality, lighting conditions, and scene complexity
best for
- ·Multi-view depth estimation from arbitrary images
- ·Camera pose estimation without known poses
- ·Feed-forward 3D Gaussian estimation for novel view synthesis
FAQ
It predicts spatially consistent depth maps and camera poses from one or more images, using a unified depth-ray representation.
Use the OpenAI-compatible endpoint with your API key. Send a request with image URLs or base64-encoded images.
Input: list of image paths, PIL images, or numpy arrays. Output: depth maps (float32), confidence maps, camera extrinsics and intrinsics (float32), and optional 3D export in GLB, PLY, or NPZ.
Yes, the model is licensed under Apache 2.0. The hosted API on gigarouter may have usage costs; check gigarouter pricing.
DA3 significantly outperforms DA2 for monocular depth estimation, and also surpasses VGGT for multi-view depth and pose estimation. DA3-Small has 0.08B parameters.
We're benchmarking and onboarding Depth Anything 3 Small as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.