Depth Anything With Any Prior
Zehan Wang1∗, Siyu Chen1∗, Lihe Yang2, Jialei Wang1, Ziang Zhang1, Hengshuang Zhao2, Zhou Zhao1
1 Zhejiang University; 2 The University of Hong Kong
https://prior-depth-anything.github.io/
Depth Completion  As noted in [37], SfM reconstructions from 19 images often result in depth maps with only 0.04% valid pixels. Completing such sparse depth maps with the observed RGB images is a fundamental computer vision task [8, 9, 33, 44, 61, 65]. Recent approaches like Omni-DC [66] and Marigold-DC [47] achieve a certain level of zero-shot generalization across diverse scenes and varying sparsity levels. However, due to the lack of explicit scene geometry guidance, they struggle in extremely sparse scenarios.

Depth Super-resolution  Obtaining high-resolution metric depth maps with depth cameras usually demands significant power. A more efficient alternative is to use low-power sensors to capture low-resolution maps and then enhance them with super-resolution. Early efforts [23, 53, 62, 64], however, show limited generalization to unseen scenes. The recent PromptDA [29] achieves effective zero-shot super-resolution by using the low-resolution map as a prompt for depth foundation models [56].

Depth Inpainting  As discussed in [27, 56], due to inherent limitations of stereo matching and depth sensors, even the "ground truth" depth data in real-world datasets often contain significant missing regions. Additionally, applications like 3D Gaussian editing and generation [10, 31, 59] require filling holes in depth maps. DepthLab [30] first fills holes using interpolation and then refines the result with a depth-guided diffusion model. However, interpolation errors reduce its effectiveness for large missing areas or incomplete depth ranges.

These previous methods share two main limitations: 1) poor performance when the prior is limited, and 2) difficulty generalizing to unseen prior patterns. Our approach, Prior Depth Anything, tackles these challenges by explicitly using the geometric information from depth prediction in a coarse-to-fine process, achieving strong generalization and accuracy across various patterns of prior input.

3.1. Preliminary

Given an RGB image I ∈ R^{3×H×W} and its corresponding metric depth prior D_prior ∈ R^{H×W}, prior-based monocular depth estimation takes I and D_prior as input and aims to output a depth map D_output ∈ R^{H×W} that is detailed, complete, and metrically precise. As discussed in Sec. 1, depth priors obtained by different measurement techniques often display various forms of incompleteness. To handle various priors within a unified framework, we uniformly represent the coordinates of the valid positions in D_prior as P = {(x_i, y_i)}_{i=1}^{N}, where N is the number of valid pixels.

3.2. Coarse Metric Alignment

As shown in Fig. 2, different types of depth priors exhibit distinct missing patterns (e.g., sparse points, low-resolution grids, or irregular holes). These differences in sparsity and incompleteness restrict models' ability to generalize across various priors. To tackle this, we propose pre-filling the missing regions to transform all priors into a shared intermediate domain, reducing the gap between them.

However, interpolation-based filling, used in previous methods [29, 30], preserves pixel-level metric information but ignores geometric structure, leading to significant errors in the filled areas. On the other hand, global alignment [10, 11], which scales relative depth predictions to match the prior, maintains the fine structure of the predictions but loses critical pixel-level metric details. To address these challenges, we propose pixel-level metric alignment, which aligns geometry predictions and metric priors at the pixel level, preserving both the predicted structure and the original metric information.

Pixel-level Metric Alignment  We first use a frozen MDE model to obtain a relative depth prediction D_pred ∈ R^{H×W}. Then, by explicitly utilizing the accurate geometric structure in the predicted depth, we fill the invalid regions in D_prior pixel by pixel, yielding a pre-filled coarse depth map D̂_prior that inherits all the valid pixels of D_prior.
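As a concrete illustration of this step, the sketch below shows one way pixel-level metric alignment could be implemented with a kNN search and a per-pixel least-squares scale-and-shift fit, the two components the efficiency analysis in Tab 10 attributes to this stage. The neighborhood size k, the use of spatial distance for the kNN query, and the inverse-distance re-weighting are assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def coarse_metric_alignment(d_pred, d_prior, valid_mask, k=5, reweight=True):
    """Pre-fill invalid prior pixels by locally aligning the relative
    prediction d_pred to the metric prior d_prior (illustrative sketch).

    d_pred:     (H, W) relative depth from a frozen MDE model
    d_prior:    (H, W) metric depth prior, arbitrary values where invalid
    valid_mask: (H, W) boolean, True where d_prior is valid
    """
    filled = d_prior.copy()
    valid_yx = np.argwhere(valid_mask)        # coordinates of valid prior pixels
    invalid_yx = np.argwhere(~valid_mask)     # pixels that need to be filled
    if len(invalid_yx) == 0 or len(valid_yx) < k:
        return filled

    # k spatially nearest valid prior pixels for every invalid pixel.
    tree = cKDTree(valid_yx)
    dist, idx = tree.query(invalid_yx, k=k)

    for (y, x), nn_dist, nn_idx in zip(invalid_yx, dist, idx):
        ny, nx = valid_yx[nn_idx, 0], valid_yx[nn_idx, 1]
        p = d_pred[ny, nx]                    # prediction at the neighbors
        q = d_prior[ny, nx]                   # metric prior at the neighbors

        # Weighted least squares for a local scale/shift: q ≈ s * p + t.
        w = 1.0 / (nn_dist + 1e-6) if reweight else np.ones_like(p)
        sw = np.sqrt(w)                       # scale rows by sqrt(weight)
        A = np.stack([p, np.ones_like(p)], axis=1) * sw[:, None]
        b = q * sw
        (s, t), *_ = np.linalg.lstsq(A, b, rcond=None)

        # Map the prediction at the missing pixel into the prior's metric scale.
        filled[y, x] = s * d_pred[y, x] + t

    return filled
```

Filled this way, D̂_prior keeps the prior's metric values at originally valid pixels and carries locally re-scaled predictions elsewhere, which is the shared intermediate domain described above.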
Figure 2. Prior Depth Anything (coarse metric alignment as explicit fusion; fine structure refinement as implicit fusion). Considering RGB images, any form of depth prior D_prior, and the relative prediction D_pred from a frozen MDE model, coarse metric alignment first explicitly combines the metric data in D_prior and the geometric structure in D_pred to fill the incomplete areas in D_prior. Fine structure refinement then implicitly merges the complementary information to produce the final metric depth map.
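To make the two-stage flow of Figure 2 concrete, the pseudo-code below sketches how the pieces described in the text could fit together. The callables frozen_mde and conditioned_mde, the scale_normalize helper, and the exact conditioning interface (passing the pre-filled depth as the metric condition and the prediction as the geometry condition, cf. Tab 8) are assumptions for illustration, not the paper's actual API.

```python
import numpy as np

def prior_depth_anything(image, d_prior, valid_mask,
                         frozen_mde, conditioned_mde, k=5):
    """End-to-end sketch of the coarse-to-fine pipeline (hypothetical interface).

    frozen_mde(image) -> relative depth (H, W)
    conditioned_mde(image, metric_cond, geometry_cond) -> metric depth (H, W)
    """
    # Stage 0: relative prediction from a frozen MDE model.
    d_pred = frozen_mde(image)

    # Stage 1: coarse metric alignment (explicit fusion), e.g. with the
    # kNN + least-squares routine sketched above.
    d_coarse = coarse_metric_alignment(d_pred, d_prior, valid_mask, k=k)

    # Stage 2: fine structure refinement (implicit fusion) by a conditioned
    # MDE model that sees both the metric and the geometry condition.
    d_output = conditioned_mde(image,
                               metric_cond=scale_normalize(d_coarse),
                               geometry_cond=scale_normalize(d_pred))
    return d_output

def scale_normalize(d):
    # Simple per-image normalization so the conditions are scale-comparable;
    # the paper's exact "Scale Norm" operation may differ.
    lo, hi = np.percentile(d, 2), np.percentile(d, 98)
    return np.clip((d - lo) / (hi - lo + 1e-6), 0.0, 1.0)
```

Because the frozen MDE model is only queried for a relative prediction, it can be swapped at test time without retraining, which is the module-switching flexibility examined later in Tab 9.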
Model            | Encoder          | ARKitScenes      | RGB-D-D          | NYUv2       | ScanNet     | ETH-3D      | DIODE         | KITTI
                 |                  | AbsRel↓  RMSE↓   | AbsRel↓  RMSE↓   | 8×    16×   | 8×    16×   | 8×    16×   | 8×     16×    | 8×     16×
DAv2             | ViT-L            | 3.67     0.0764  | 4.67     0.1116  | 4.77  5.13  | 4.64  4.85  | 6.27  7.38  | 12.49  11.20  | 9.54   11.22
Depth Pro        | ViT-L            | 3.25     0.0654  | 4.28     0.1030  | 4.48  4.83  | 4.17  4.40  | 5.88  6.79  | 8.20   8.33   | 6.76   9.16
Omni-DC          | -                | 2.14     0.0435  | 2.09     0.0685  | 1.57  3.11  | 1.29  2.65  | 1.86  4.09  | 2.81   4.71   | 4.05   8.35
Marigold-DC      | SDv2             | 2.17     0.0448  | 2.15     0.0672  | 1.83  3.32  | 1.63  2.83  | 2.33  4.75  | 4.28   6.60   | 5.17   9.47
DepthLab         | SDv2             | 2.10     0.0411  | 2.13     0.0624  | 2.60  3.73  | 1.89  3.19  | 2.60  4.50  | 4.42   6.16   | 17.17  22.90
PromptDA         | ViT-L            | 1.34     0.0347  | 2.79     0.0708  | 1.61  1.75  | 1.87  1.93  | 1.80  2.56  | 3.18   3.73   | 3.92   4.95
PriorDA (ours)   | DAv2-B+ViT-S     | 2.09     0.0414  | 2.07     0.0597  | 1.73  2.79  | 1.60  2.50  | 2.06  3.91  | 3.09   4.36   | 4.54   8.20
PriorDA (ours)   | DAv2-B+ViT-B     | 1.94     0.0404  | 2.02     0.0581  | 1.72  2.73  | 1.61  2.45  | 2.00  3.79  | 3.10   4.23   | 4.65   8.24
PriorDA (ours)   | Depth Pro+ViT-B  | 1.95     0.0408  | 2.02     0.0581  | 1.72  2.74  | 1.58  2.43  | 1.99  3.77  | 3.01   4.15   | 4.44   7.99

Table 4. Zero-shot depth super-resolution (i.e., low-resolution prior). ARKitScenes and RGB-D-D provide captured low-resolution depth (AbsRel↓ and RMSE↓). For the other datasets, results are reported in AbsRel↓, with 8× and 16× low-resolution maps created by downsampling the GT depths.
Model            | Encoder          | NYUv2               | ScanNet             | ETH-3D              | DIODE               | KITTI
                 |                  | Range Shape Object  | Range Shape Object  | Range Shape Object  | Range Shape Object  | Range Shape Object
DAv2             | ViT-L            | 17.40 5.24  6.56    | 16.75 4.64  6.74    | 68.76 8.23  19.22   | 51.55 29.20 13.41   | 31.12 14.93 17.94
Depth Pro        | ViT-L            | 10.89 9.20  6.52    | 16.76 15.39 6.80    | 10.37 34.28 17.28   | 37.44 34.74 13.53   | 14.51 16.11 8.19
Omni-DC          | -                | 23.24 5.94  13.79   | 22.89 5.44  8.71    | 29.47 4.81  17.97   | 38.83 7.75  25.43   | 35.42 8.94  15.06
Marigold-DC      | SDv2             | 19.83 2.37  6.18    | 17.14 1.97  6.66    | 25.36 2.15  7.72    | 39.33 7.59  18.97   | 33.44 9.21  7.72
DepthLab         | SDv2             | 23.85 2.66  10.87   | 21.17 2.08  10.40   | 30.61 2.75  10.53   | 41.01 6.51  17.17   | 40.43 13.60 18.66
PromptDA         | ViT-L            | 36.67 20.88 23.14   | 35.86 17.87 21.89   | 46.21 24.94 27.42   | 49.50 25.66 28.29   | 55.79 32.74 38.29
PriorDA (ours)   | DAv2-B+ViT-S     | 16.86 2.30  5.72    | 14.29 2.01  5.87    | 21.16 1.98  6.52    | 36.59 5.58  10.77   | 30.04 6.67  7.99
PriorDA (ours)   | DAv2-B+ViT-B     | 16.61 2.30  5.49    | 14.48 1.99  5.73    | 21.90 1.76  6.09    | 36.64 5.94  9.72    | 30.79 6.29  7.52
PriorDA (ours)   | Depth Pro+ViT-B  | 16.31 2.17  5.59    | 14.18 1.98  5.87    | 22.72 1.76  6.21    | 34.90 4.86  11.99   | 30.44 5.47  6.04

Table 5. Zero-shot depth inpainting (i.e., missing-area prior). All results are reported in AbsRel↓, computed only on the masked (inpainted) regions. "Range": masks for depth beyond 3 m (indoors) and 15 m (outdoors); "Shape": average over square masks of sizes 80, 120, 160, and 200; "Object": object segmentation masks detected by YOLO [26].
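For reference, the sketch below computes the two metrics reported in Tables 4 and 5 over an evaluation mask, using their standard definitions (AbsRel as the mean relative absolute error, RMSE as the root-mean-square error); restricting the mask to the inpainted region reproduces the masked evaluation described in the Table 5 caption. The function name and interface are illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, eval_mask):
    """AbsRel and RMSE over the pixels selected by eval_mask.

    pred, gt:  (H, W) metric depth maps
    eval_mask: (H, W) boolean mask, e.g. the inpainted region with valid GT
    """
    p, g = pred[eval_mask], gt[eval_mask]
    abs_rel = np.mean(np.abs(p - g) / g)       # mean relative absolute error
    rmse = np.sqrt(np.mean((p - g) ** 2))      # root-mean-square error
    return abs_rel, rmse

# Example: evaluate only where the square/object/range mask was applied
# and the ground truth is valid.
# abs_rel, rmse = depth_metrics(d_output, d_gt, mask & (d_gt > 0))
```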
4.2. Comparison on Mixed Depth Prior

We quantitatively evaluate the ability to handle challenging unseen mixed priors in Tab 2. In terms of absolute performance, all versions of our model outperform the compared baselines. More importantly, our model is less affected by the additional patterns. For example, compared to the setting that uses only sparse points in Tab 3, adding missing areas or low resolution causes only minor performance drops for our model (1.96→2.01, 3.08 on NYUv2). In contrast, Omni-DC (2.63→2.86, 3.81) and Marigold-DC (2.13→2.26, 3.82) show larger declines. These results highlight the robustness of our method to different prior inputs.

4.3. Comparison on Individual Depth Prior

Zero-shot Depth Completion  Tab 3 shows the zero-shot depth completion results with different kinds and sparsity levels of sparse points as priors. Compared to Omni-DC [66] and Marigold-DC [47], which are specifically designed for depth completion and rely on sophisticated, time-consuming structures, our approach achieves better overall performance with simpler and more efficient designs.

Zero-shot Depth Super-resolution  In Tab 4, we present results for depth map super-resolution. On benchmarks where the low-resolution maps are created by downsampling (e.g., NYUv2 [42], ScanNet [13]), our approach achieves performance comparable to state-of-the-art methods. However, since downsampling carries over overly specific details from the GT depths, directly replicating the noise and blurred boundaries of the GT yields better scores on these benchmarks. Therefore, ARKitScenes [1] and RGB-D-D [23] are more representative and practical, as they use low-power cameras to capture the low-resolution depths. On these two benchmarks, our method achieves leading performance compared to other zero-shot methods.

Zero-shot Depth Inpainting  In Tab 5, we evaluate the performance of inpainting missing regions in depth maps. In the practical and challenging "Range" setting, our method achieves superior results, which is highly meaningful for improving depth sensors with limited effective working ranges. Additionally, it outperforms all alternatives in filling square and object masks, demonstrating its potential for 3D content generation and editing.

4.4. Qualitative Analysis

In Fig 3, we provide a qualitative comparison of the outputs from different models. Our model consistently outperforms previous approaches, offering richer details, sharper boundaries, and more accurate metric values.
Figure 3. Qualitative comparisons with previous methods (columns: RGB image, Ours, Marigold-DC, Omni-DC, PromptDA, DepthLab). The depth prior or error map is shown below each sample.
Figure 4. Error analysis on widely used yet inherently noisy benchmarks [13, 42]. Red indicates higher error; blue indicates lower error.
Fig 4 visualizes the error maps of our model. The errors mainly occur around blurred edges in the "ground truth" of real data. Our method effectively corrects the noise in the labels while aligning with the metric information from the prior. These "beyond ground truth" cases highlight the potential of our approach for addressing the inherent noise in depth measurement techniques. More visualizations can be found in the supplementary material.
Pre-fill strategy      | S    | L    | M    | S+M  | L+M  | S+L
Interpolation          | 7.93 | 3.88 | 8.96 | 8.38 | 4.55 | 7.99
Ours (w/o re-weight)   | 2.92 | 3.44 | 6.91 | 3.22 | 4.53 | 4.36
Ours                   | 2.42 | 3.51 | 6.70 | 2.60 | 4.32 | 4.40

Table 6. Accuracy of pre-filled depth maps with different strategies. To independently compare each pre-fill strategy, we directly compare the pre-filled maps with the ground truth.

Pre-fill strategy      | Seen |             Unseen
                       | S    | L    | M     | S+M  | L+M  | S+L
None                   | 2.50 | 3.71 | 46.07 | 2.50 | 3.74 | 3.64
Interpolation          | 3.40 | 2.68 | 4.28  | 3.53 | 2.94 | 3.56
Ours (w/o re-weight)   | 2.13 | 2.86 | 2.58  | 2.19 | 2.94 | 3.25
Ours                   | 1.99 | 2.82 | 2.26  | 2.06 | 2.90 | 3.11

Table 7. Effect of pre-fill strategies on generalization. We train models with various pre-fill strategies using only sparse points and evaluate their ability to generalize to unseen types of depth priors.

Metric | Geometry | S    | L    | M    | S+M  | L+M  | S+L
✗      | ✓        | 5.46 | 5.29 | 5.48 | 5.36 | 5.30 | 5.46
✓      | ✗        | 2.10 | 2.94 | 2.58 | 2.17 | 3.02 | 3.31
✓      | ✓        | 1.96 | 2.74 | 2.48 | 2.01 | 2.82 | 3.08

Table 8. Effect of each condition for the conditioned MDE model.

Model              | Encoder | S    | L    | M    | S+M  | L+M  | S+L
Depth Anything V2  | ViT-S   | 2.15 | 2.77 | 2.68 | 2.22 | 2.87 | 3.20
Depth Anything V2  | ViT-B   | 1.97 | 2.73 | 2.50 | 2.02 | 2.82 | 3.09
Depth Anything V2  | ViT-L   | 1.92 | 2.71 | 2.29 | 1.97 | 2.79 | 3.04
Depth Anything V2  | ViT-G   | 1.87 | 2.70 | 2.22 | 1.94 | 2.76 | 3.02
Depth Pro          | ViT-L   | 1.96 | 2.74 | 2.35 | 2.01 | 2.82 | 3.08

Table 9. Effect of using different frozen MDE models. The conditioned MDE model is the ViT-B version here.

Model            | Encoder          | Param     | Latency (ms)
Omni-DC          | -                | 85M       | 334
Marigold-DC      | SDv2             | 1,290M    | 30,634
DepthLab         | SDv2             | 2,080M    | 9,310
PromptDA         | ViT-L            | 340M      | 32
PriorDA (ours)   | DAv2-B+ViT-S     | 97M+25M   | 157 (19+123+15)
PriorDA (ours)   | DAv2-B+ViT-B     | 97M+98M   | 161 (19+123+19)
PriorDA (ours)   | Depth Pro+ViT-B  | 952M+98M  | 760 (618+123+19)

Table 10. Analysis of inference efficiency. "x+x+x" represents the latency of the frozen MDE model, coarse metric alignment, and conditioned MDE model, respectively.

4.5. Ablation Study

We use Depth Anything V2 ViT-B as the frozen MDE and ViT-S as the conditioned MDE for the ablation studies by default. All results are evaluated on NYUv2.

Accuracy of different pre-fill strategies  As shown in Tab 6, our pre-fill method outperforms simple interpolation across all scenarios by explicitly utilizing the precise geometric structure in the depth prediction. Additionally, the re-weighting mechanism further enhances performance.

Pre-fill strategy for generalization  From Tab 7, we observe that our pixel-level metric alignment helps the model generalize to new prior patterns, and the re-weighting strategy further improves robustness by increasing the accuracy of the pre-filled depth map.

Effectiveness of fine structure refinement  Comparing the pre-filled coarse depth maps in Tab 6 with the final output accuracy in Tab 3, 4, 5 and 2, the performance improvements after fine structure refinement (sparse points: 2.42→2.01, low-resolution: 3.51→2.79, missing areas: 6.70→2.48, S+M: 2.60→2.09, L+M: 4.32→2.88, S+L: 4.40→3.17) demonstrate its effectiveness in rectifying misaligned geometric structures in the pre-filled depth maps while maintaining their accurate metric information.

Effectiveness of metric and geometry conditions  We evaluate the impact of metric and geometry guidance for the conditioned MDE model in Tab 8. The results show that combining both conditions achieves the best performance, emphasizing the importance of reinforcing geometric information during the fine structure refinement stage.

Testing-time improvement  We investigate the potential of test-time improvements in Tab 9. Our findings reveal that larger and stronger frozen MDE models consistently bring higher accuracy, while smaller models maintain competitive performance and improve the efficiency of the entire pipeline. These findings underscore the flexibility of our model and its adaptability to various scenarios.

Inference efficiency analysis  In Tab 10, we analyze the inference efficiency of different models on one A100 GPU at an image resolution of 480×640. Overall, compared to previous approaches, our model variants achieve leading performance while offering clear advantages in parameter count and inference latency. For a more detailed breakdown, we report the time consumption of each stage of our method. The coarse metric alignment, which relies on kNN and least squares, accounts for the majority of the inference latency. Even so, our method retains a significant efficiency advantage over the sophisticated Omni-DC and the diffusion-based DepthLab and Marigold-DC.

5. Application

To demonstrate our model's real-world applicability, we employ prior-based monocular depth estimation models to refine the depth predictions from VGGT, a state-of-the-art 3D reconstruction foundation model. VGGT provides both a depth map and a confidence map. We take the top 30% most confident pixels as the depth prior and apply different prior-based models to obtain finer depth predictions.¹

¹ For models less adept at handling missing pixels (DepthLab, PromptDA), the entire VGGT depth prediction was provided as the prior.
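A minimal sketch of this prior construction follows, assuming VGGT exposes per-pixel depth and confidence arrays; the array names and the quantile-based thresholding are illustrative, as the paper only specifies keeping the top 30% most confident pixels.

```python
import numpy as np

def vggt_confidence_prior(vggt_depth, vggt_conf, keep_ratio=0.30):
    """Build a sparse metric prior from a VGGT depth/confidence pair.

    vggt_depth: (H, W) depth prediction from VGGT
    vggt_conf:  (H, W) per-pixel confidence from VGGT
    Returns the prior map (invalid pixels set to 0) and its validity mask.
    """
    # Keep only the top `keep_ratio` most confident pixels as the prior.
    threshold = np.quantile(vggt_conf, 1.0 - keep_ratio)
    valid_mask = vggt_conf >= threshold

    d_prior = np.where(valid_mask, vggt_depth, 0.0)
    return d_prior, valid_mask

# The resulting (d_prior, valid_mask) can then be fed to a prior-based model,
# e.g. the pipeline sketched in Sec. 3, to refine the VGGT prediction.
```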
Method         | Monocular Depth Estimation                    | Multi-view Depth Estimation
               | NYU           | ETH-3D        | KITTI         | ETH-3D        | KITTI
VGGT           | 3.54 (-)      | 4.94 (-)      | 6.56 (-)      | 2.46 (-)      | 18.75 (-)
+Omni-DC       | 4.12 (+0.58)  | 6.08 (+1.14)  | 6.85 (+0.29)  | 2.64 (+0.18)  | 18.66 (-0.09)
+Marigold-DC   | 4.06 (+0.52)  | 5.43 (+0.49)  | 7.63 (+1.07)  | 2.81 (+0.35)  | 18.86 (+0.11)
+DepthLab      | 3.56 (+0.02)  | 4.92 (-0.02)  | 7.97 (+1.41)  | 2.25 (-0.21)  | 19.47 (+0.72)
+PromptDA      | 3.43 (-0.11)  | 4.97 (+0.03)  | 6.50 (-0.06)  | 2.48 (+0.02)  | 18.91 (+0.16)
+PriorDA       | 3.45 (-0.09)  | 4.43 (-0.51)  | 6.39 (-0.17)  | 1.99 (-0.47)  | 18.61 (-0.14)

Table 11. Results of refining VGGT depth predictions with different methods. All results are reported in AbsRel↓; values in parentheses are the change relative to VGGT (negative means improvement).
Table 11 reports VGGT's performance in monocular and multi-view depth estimation, along with the effectiveness of different prior-based methods as refiners. We observe that only our PriorDA consistently improves VGGT's predictions, primarily due to its ability to adapt to diverse priors. These surprising results highlight PriorDA's broad application potential.

6. Conclusion

In this work, we present Prior Depth Anything, a robust and powerful solution for prior-based monocular depth estimation. We propose a coarse-to-fine pipeline to progressively integrate the metric information from incomplete depth measurements and the geometric structure from relative depth predictions. The model offers three key advantages: 1) delivering accurate and fine-grained depth estimation with any type of depth prior; 2) offering the flexibility to adapt to extensive applications through test-time module switching; and 3) showing the potential to rectify inherent noise and blurred boundaries in real depth measurements.

References

[1] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In NeurIPS, 2021.
[2] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, 2021.
[3] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288, 2023.
[4] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In ICLR, 2025.
[5] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv:2108.07258, 2021.
[6] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv:2001.10773, 2020.
[7] Manuel Carranza-García, F Javier Galán-Sales, José María Luna-Romera, and José C Riquelme. Object detection using depth completion and camera-lidar fusion for autonomous driving. Integrated Computer-Aided Engineering, 2022.
[8] Xinjing Cheng, Peng Wang, and Ruigang Yang. Learning depth with convolutional spatial propagation network. TPAMI, 2019.
[9] Xinjing Cheng, Peng Wang, Chenye Guan, and Ruigang Yang. Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion. In AAAI, 2020.
[10] Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv:2311.13384, 2023.
[11] Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3d gaussian splatting in few-shot images. In CVPR, 2024.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[13] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
[14] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In CVPR, 2022.
[15] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
[16] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023.
[17] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
[18] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The kitti vision benchmark suite. In CVPR, 2012.
[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 2013.
[20] Christian Häne, Lionel Heng, Gim Hee Lee, Friedrich Fraundorfer, Paul Furgale, Torsten Sattler, and Marc Pollefeys. 3d visual perception for self-driving cars using a multi-camera system: Calibration, mapping, localization, and obstacle detection. Image and Vision Computing, 2017.
[21] Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. In ICLR, 2025.
[22] Lingzhi He, Hongguang Zhu, Feng Li, Huihui Bai, Runmin Cong, Chunjie Zhang, Chunyu Lin, Meiqin Liu, and Yao Zhao. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In CVPR, 2021.
[23] Lingzhi He, Hongguang Zhu, Feng Li, Huihui Bai, Runmin Cong, Chunjie Zhang, Chunyu Lin, Meiqin Liu, and Yao Zhao. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In CVPR, 2021.
[24] Daniel Herrera, Juho Kannala, and Janne Heikkilä. Joint depth and color camera calibration with distortion correction. TPAMI, 2012.
[25] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. TPAMI, 2024.
[26] Glenn Jocher, Jing Qiu, and Ayush Chaurasia. Ultralytics YOLO, 2023.
[27] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, 2024.
[28] Robert Lange and Peter Seitz. Solid-state time-of-flight range camera. IEEE Journal of Quantum Electronics, 2001.
[29] Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting depth anything for 4k resolution accurate metric depth estimation. In CVPR, 2025.
[30] Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, and Ping Luo. Depthlab: From partial to complete. arXiv:2412.18153, 2024.
[31] Zhiheng Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, and Yang Cao. Infusion: Inpainting 3d gaussians via learning depth completion from diffusion prior. arXiv:2404.11613, 2024.
[32] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[33] Jin-Hwi Park, Chanhwi Jeong, Junoh Lee, and Hae-Gon Jeon. Depth prompting for sensor-agnostic depth estimation. In CVPR, 2024.
[34] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 2020.
[35] Alex Rasla and Michael Beyeler. The relative importance of depth cues and semantic edges for indoor mobility using simulated prosthetic vision in immersive virtual reality. In Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology, 2022.
[36] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021.
[37] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In CVPR, 2022.
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[39] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In ICCV, 2011.
[40] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[41] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017.
[42] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[43] Mel Slater and Sylvia Wilbur. A framework for immersive virtual environments (five): Speculations on the role of presence in virtual environments. Presence: Teleoperators & Virtual Environments, 1997.
[44] Jie Tang, Fei-Peng Tian, Boshi An, Jian Li, and Ping Tan. Bilateral propagation network for depth completion. In CVPR, 2024.
[45] Yifu Tao, Marija Popović, Yiduo Wang, Sundara Tejaswi Digumarti, Nived Chebrolu, and Maurice Fallon. 3d lidar reconstruction with probabilistic depth completion for robotic navigation. In IROS, 2022.
[46] Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z Dai, Andrea F Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R Walter, et al. Diode: A dense indoor and outdoor depth dataset. arXiv:1908.00463, 2019.
[47] Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, and Anton Obukhov. Marigold-dc: Zero-shot monocular depth completion with guided diffusion. arXiv:2412.13389, 2024.
[48] Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. In ICME, 2021.
[49] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.
[50] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In IROS, 2020.
[51] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In CVPR, 2019.
[52] Diana Wofk, Fangchang Ma, Tien-Ju Yang, Sertac Karaman, and Vivienne Sze. Fastdepth: Fast monocular depth estimation on embedded systems. In ICRA, 2019.
[53] Chuhua Xian, Kun Qian, Zitian Zhang, and Charlie CL Wang. Multi-scale progressive fusion learning for depth map super-resolution. arXiv:2011.11865, 2020.
[54] Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen. Diffusion models trained with large data are transferable visual models. In ICLR, 2025.
[55] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.
[56] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In NeurIPS, 2024.
[57] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[58] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In ICCV, 2023.
[59] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. arXiv:2406.09394, 2024.
[60] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[61] Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu, Guan Huang, and Stefano Mattoccia. Completionformer: Depth completion with convolutions and vision transformers. In CVPR, 2023.
[62] Zixiang Zhao, Jiangshe Zhang, Shuang Xu, Zudi Lin, and Hanspeter Pfister. Discrete cosine transform network for guided depth map super-resolution. In CVPR, 2022.
[63] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. In ICML, 2024.
[64] Zhiwei Zhong, Xianming Liu, Junjun Jiang, Debin Zhao, and Xiangyang Ji. Guided depth map super-resolution: A survey. ACM Computing Surveys, 2023.
[65] Yiming Zuo and Jia Deng. Ogni-dc: Robust depth completion with optimization-guided neural iterations. In ECCV, 2024.
[66] Yiming Zuo, Willow Yang, Zeyu Ma, and Jia Deng. Omni-dc: Highly robust depth completion with multiresolution depth integration. arXiv:2411.19278, 2024.
k     | S    | L    | M    | S+M  | L+M  | S+L
k=3   | 2.00 | 2.52 | 2.74 | 2.10 | 2.83 | 3.07
k=5   | 1.97 | 2.16 | 2.73 | 2.04 | 2.82 | 3.09
k=10  | 2.00 | 2.31 | 2.74 | 2.09 | 2.83 | 3.12
k=20  | 2.10 | 2.27 | 2.76 | 2.16 | 2.83 | 3.14

Table 12. Impact of different k values on the accuracy of the final output D_output.

Prior patterns   | S    | L    | M    | S+M  | L+M  | S+L
Only Sparse      | 1.99 | 2.82 | 2.26 | 2.06 | 2.90 | 3.11
Three Patterns   | 1.96 | 2.74 | 2.48 | 2.01 | 2.82 | 3.08

Table 13. Impact of the prior patterns used during training.

As larger versions of the Depth Anything V2 model exhibit stronger capabilities, training conditioned MDE models based on larger backbones is an important direction for future work. Additionally, following Depth Anything, all training images are resized to 518×518. In contrast, PromptDA is natively trained at 1440×1920 resolution. Therefore, training at higher resolutions to better handle easily accessible high-resolution RGB images is another crucial direction for our future research.
(Supplementary qualitative results: each example shows the RGB image, the GT & prior, our prediction, and the error map.)
Synthetic data, while precise, tends to have limited diversity and repetitive patterns due to pre-defined video sampling, which restricts real-world scene coverage. In contrast, real images from large-scale datasets provide a wide array of distinct scenes. Thus, when training models for depth estimation, purely synthetic data may not be sufficient for robust generalization, which makes it necessary to incorporate unlabeled real images to enhance scene diversity and coverage.
Knowledge distillation improves small-model performance by letting these models learn from the outputs of a larger, more capable teacher model. The teacher's capabilities are transferred through unlabeled real images, bypassing a direct synthetic-to-real transfer for the student. In practice, the student mimics the teacher's predictions while being fine-tuned with additional data for robustness, equipping smaller models to better handle zero-shot depth estimation.
Real images in depth estimation training provide a broader diversity of scenes, which mitigates the distribution shift inherent in purely synthetic datasets. Unlike synthetic images, which are often redundant, unlabeled real images from large datasets cover numerous distinct and informative scenes, enhancing the model's robustness and generalization capability.
The model's generalization ability is improved by the pre-fill strategy and by incorporating multiple diverse prior patterns during training, as evaluated in the sparse-point generalization tests (Tab 7, Tab 13). Training with mixed priors lets the model adapt to various unseen conditions with smaller performance degradation than baselines such as Omni-DC or Marigold-DC. Success is quantified by measuring performance on unseen prior types versus seen ones, demonstrating robust adaptability and consistent accuracy across datasets.
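A hedged sketch of how such mixed prior patterns could be simulated from a ground-truth depth map during training is given below. The 8×/16× grid strides and the square-mask sizes follow the settings mentioned in the experiments, while the sparse sampling rate, the random-choice logic, and the function name are illustrative assumptions.

```python
import numpy as np

def simulate_prior(d_gt, rng, pattern="mixed"):
    """Create a training prior (map + validity mask) from GT depth d_gt.

    Patterns: "sparse" points, "lowres" grids, "missing" square areas,
    or a random mixture of them.
    """
    h, w = d_gt.shape
    valid = d_gt > 0                               # usable GT pixels

    if pattern == "mixed":
        pattern = rng.choice(["sparse", "lowres", "missing"])

    if pattern == "sparse":
        # Keep a small random subset of valid pixels (SfM/LiDAR-like points).
        keep = rng.random((h, w)) < 0.001
        mask = valid & keep
    elif pattern == "lowres":
        # Keep a regular low-resolution grid (8x or 16x downsampling).
        stride = rng.choice([8, 16])
        mask = np.zeros((h, w), dtype=bool)
        mask[::stride, ::stride] = True
        mask &= valid
    else:  # "missing"
        # Drop a random square region (sizes as in the "Shape" setting);
        # assumes the image is larger than the chosen square.
        size = rng.choice([80, 120, 160, 200])
        y0, x0 = rng.integers(0, h - size), rng.integers(0, w - size)
        mask = valid.copy()
        mask[y0:y0 + size, x0:x0 + size] = False

    return np.where(mask, d_gt, 0.0), mask

# rng = np.random.default_rng(0)
# d_prior, valid_mask = simulate_prior(d_gt, rng, pattern="mixed")
```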
The scale-invariant log loss is used for pixel-level supervision in depth estimation models so that predictions remain accurate despite scale variations between the model's output and the ground-truth depth. Following the ZoeDepth line of work, this loss penalizes errors up to a global scale, allowing relative depth predictions to be accurately mapped onto the ground-truth scale and improving overall performance.
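For reference, a common formulation of the scale-invariant log loss (following Eigen et al. and used in ZoeDepth-style training) is sketched below; the λ and scaling constants shown are values commonly used in that line of work and are assumptions here, not values taken from this paper.

```python
import numpy as np

def silog_loss(pred, gt, valid_mask, lam=0.85, scale=10.0):
    """Scale-invariant log loss over valid pixels.

    g_i = log(pred_i) - log(gt_i)
    loss = scale * sqrt(mean(g^2) - lam * mean(g)^2)
    With lam < 1 the loss is only partially scale-invariant, keeping some
    pressure on absolute scale while tolerating global scale errors.
    """
    g = np.log(pred[valid_mask]) - np.log(gt[valid_mask])
    return scale * np.sqrt(np.mean(g ** 2) - lam * np.mean(g) ** 2)
```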
Robustness in zero-shot depth super-resolution comes from refining the low-resolution prior with the learned scene structure rather than copying it. Unlike approaches that effectively replicate the noise and blurred boundaries of the GT depths, the model balances detail retention and noise reduction, which is most visible on the sensor-captured ARKitScenes and RGB-D-D benchmarks. This yields leading performance in preserving spatial detail and accuracy without excessive noise, outperforming other models in practical scenarios with limited input data.
Depth estimation models are evaluated across seven unseen real-world datasets: NYUv2, ScanNet, ETH-3D, DIODE, KITTI, ARKitScenes, and RGB-D-D. These datasets cover diverse indoor and outdoor environments. Evaluation criteria include zero-shot accuracy in metrics such as AbsRel and RMSE under various depth priors, including sparse points, low resolution, and missing areas. Success is determined by the ability to maintain performance across different image and prior combinations, revealing robustness and adaptability in varying real-world scenarios.
The Depth Anything V2 model is trained with a combination of synthetic and real data to leverage the strengths of both. First, a capable teacher model is trained on synthetic images to achieve high precision. The teacher's knowledge is then distilled into smaller student models using high-quality pseudo labels generated on real images, addressing the distribution shift and the limited diversity of synthetic data. This gradual transfer from large to smaller models with unlabeled real data enables a safe scale-down while preserving the smaller models' robustness.
The zero-shot inpainting capability fills missing depth data caused by limitations such as sensor range or occlusions. This is crucial for improving depth sensors in real-world applications, such as 3D modeling and augmented reality, where complete spatial data is needed for accurate scene representation and interaction. By effectively filling gaps and showing superior performance in the "Range" and "Object" settings, the model improves depth data completeness and fidelity for these applications.
The model handles mixed depth priors robustly by combining patterns such as sparse points, low-resolution inputs, and missing areas. It manages this additional complexity with only minor performance degradation, compared to the larger drops seen in alternatives like Omni-DC and Marigold-DC, highlighting its ability to integrate multiple sources of depth information effectively.