
Depth Anything with Any Prior

Zehan Wang1*, Siyu Chen1*, Lihe Yang2, Jialei Wang1, Ziang Zhang1, Hengshuang Zhao2, Zhou Zhao1
1 Zhejiang University; 2 The University of Hong Kong
https://prior-depth-anything.github.io/
arXiv:2505.10565v1 [cs.CV] 15 May 2025

Abstract

This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors; it takes RGB images and measured depth priors as inputs to output detailed and precise metric depth maps. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.

[Figure 1: Metric Measurement (SfM, Sensor, etc.) and Geometric Estimation (Depth Anything) are combined by Prior Depth Anything.]
Figure 1. Core Motivation. We progressively integrate complementary information from metric measurements (accurate metrics) and relative predictions (completeness and fine details) to produce dense and fine-grained metric depth maps.

1. Introduction

Fine-detailed and dense metric depth information is a fundamental pursuit in computer vision [11, 14, 16, 37, 49, 60] and robotics applications [51, 52, 63]. Although monocular depth estimation (MDE) models [4, 21, 25, 27, 34, 55, 56, 58] have made significant progress, enabling detailed depth predictions, the predicted depths are relative and lack precise metric information. On the other hand, depth measurement technologies, such as Structure from Motion (SfM) [40] or depth sensors [28], provide precise but often incomplete and coarse metric information.

In this paper, we explore prior-based monocular depth estimation. To clarify the scope, we first outline common types of depth priors and their primary applications:
• 1) Sparse points (depth completion): Depth from LiDAR or SfM [40] is typically sparse. Completing sparse depth priors is crucial for applications such as 3D reconstruction [11, 37] and autonomous driving [7, 20, 45].
• 2) Low-resolution (depth super-resolution): Low-power Time of Flight (ToF) cameras [28], commonly used in mobile phones, capture low-resolution depth maps. Depth super-resolution is essential for spatial perception, VR [35], and AR [43] in portable devices [22].
• 3) Missing areas (depth inpainting): Stereo matching failures [32, 39] or 3D Gaussian splatting edits [31, 59] may leave large missing areas in depth maps. Filling these gaps is vital for 3D scene generation and editing [59].
• 4) Mixed prior: In real-world scenarios, different depth priors often coexist. For instance, structured light cameras [24] often generate low-resolution and incomplete depth maps, due to their limited working range. Handling these mixed priors is vital for practical applications.

In Tab. 1, we detail the patterns of each prior. Existing methods mainly focus on specific limited priors, limiting their use in diverse, real-world scenarios.
* Equal Contribution.

Methods | Target Task | Sparse Point (SfM / LiDAR / Extreme) | Low-resolution | Missing Area (Range / Shape / Object) | Mixed
Marigold-DC [47] | Depth Completion | ✓ ✓ ✓ | ✗ | ✗ ✗ ✗ | ✗
Omni-DC [66] | Depth Completion | ✓ ✓ ✓ | ✓ | ✗ ✗ ✗ | ✗
PromptDA [29] | Depth Super-resolution | ✗ ✗ ✗ | ✓ | ✗ ✗ ✗ | ✗
DepthLab [30] | Depth Inpainting | ✗ ✗ ✗ | ✗ | ✗ ✓ ✓ | ✗
Prior Depth Anything | All-Rounder | ✓ ✓ ✓ | ✓ | ✓ ✓ ✓ | ✓
Table 1. Applicable scenarios of current prior-based monocular depth estimation models. SfM: sparse matching points from SfM, LiDAR:
sparse LiDAR line patterns, Extreme: extremely sparse points (100 points), Range: missing depth within a specific range, Shape: missing
regular-shaped areas, Object: missing depth of an object.

In this work, we propose Prior Depth Anything, motivated by the complementary advantage between predicted and measured depth maps, as illustrated in Fig. 1. Technically, we design a coarse-to-fine pipeline to explicitly and progressively combine the depth prediction with the measured depth prior, achieving impressive robustness to any image with any prior.

We first introduce coarse metric alignment to pre-fill incomplete depth priors using predicted relative depth maps, which effectively narrows the domain gap between various prior types. Next, we apply fine structure refinement to rectify misaligned geometric structures in the pre-filled depth priors caused by inherent noise in the depth measurements. Specifically, the pre-filled depth prior (with accurate metric data) and the relative depth prediction (with fine details and structure) are provided as additional inputs to a conditioned MDE model. Guided by the RGB image input, the model can combine the strengths of the two complementary depth sources for the final output.

We evaluate our model on 7 datasets with varying depth priors. It achieves zero-shot depth completion, super-resolution, and inpainting within a single model, matching or outperforming previous models that are specialized for only one of these tasks. More importantly, our model achieves significantly better results when different depth priors are mixed, highlighting its effectiveness in more practical and varying scenarios.

Our contribution can be summarized as follows:
• We propose Prior Depth Anything, a unified framework to estimate fine-detailed and complete metric depth with any depth priors. Our model can seamlessly handle zero-shot depth completion, super-resolution, and inpainting, and adapt to more varied real-world scenarios.
• We introduce coarse metric alignment to pre-fill depth priors, narrowing the domain gap between different types of depth prior and enhancing the model's generalization.
• We design fine structure refinement to alleviate the inherent noise in depth measurements. This involves a conditioned MDE model to granularly merge the pre-filled depth prior and prediction based on image content.
• Our method exhibits superior zero-shot results across various datasets and tasks, surpassing even state-of-the-art methods specifically designed for individual tasks.

2. Related Work

2.1. Monocular Depth Estimation

Monocular depth estimation (MDE) is a fundamental computer vision task that predicts the depth of each pixel from a single color image [2, 15, 17]. Recently, with the success of "foundation models" [5], some studies [4, 21, 27, 34, 54–56, 58] have attempted to build depth foundation models by scaling up data and using stronger backbones, enabling them to predict detailed geometric structures for any image.

MiDaS [34] made the pioneering study by training an MDE model on joint datasets to improve generalization. Following this line, Depth Anything v1 [55] scaled training with massive unlabeled image data, while Depth Anything v2 [56] further enhanced its ability to handle fine details, reflections, and transparent objects by incorporating highly precise synthetic data [6, 36, 48, 50, 57].

Although these methods have demonstrated high accuracy and robustness, they primarily produce unscaled relative depth maps due to the significant scale differences between indoor and outdoor scenes. While Metric3D [25, 58] and Depth Pro [4] achieve zero-shot metric depth estimation through canonical camera transformations, the precision remains limited compared to measurement technologies.

Our method builds on the strength of existing depth foundation models, which excel at precisely capturing relative geometric structures and fine details in any image. By progressively integrating accurate but incomplete metric information in the depth measurements, our model can generate dense and detailed metric depth maps for any scene.

2.2. Prior-based Monocular Depth Estimation

In practical applications, depth measurement methods like multi-view matching [12] or sensors [19, 42] can provide accurate metric information, but due to their inherent nature or cost limitations, these measurements often capture incomplete information. Some recent studies have tried to use this measurement data as prior knowledge in the depth estimation process to achieve dense and accurate metric depth. These methods, however, primarily focus on specific patterns of depth measurement, which can be categorized into three types based on their input patterns:

Depth Completion  As noted in [37], SfM reconstructions from 19 images often result in depth maps with only 0.04% valid pixels. Completing the sparse depth maps with observed RGB images is a fundamental computer vision task [8, 9, 33, 44, 61, 65]. Recent approaches like Omni-DC [66] and Marigold-DC [47] have achieved certain levels of zero-shot generalization across diverse scenes and varying sparsity levels. However, due to the lack of explicit scene geometry guidance, they face challenges in extremely sparse scenarios.

Depth Super-resolution  Obtaining high-resolution metric depth maps with depth cameras usually demands significant power. A more efficient alternative is to use low-power sensors to capture low-resolution maps and then enhance them using super-resolution. Early efforts [23, 53, 62, 64], however, show limited generalization to unseen scenes. The recent PromptDA [29] achieves effective zero-shot super-resolution by using the low-resolution map as a prompt for depth foundation models [56].

Depth Inpainting  As discussed in [27, 56], due to inherent limitations in stereo matching and depth sensors, even "ground truth" depth data in real-world datasets often have significant missing regions. Additionally, in applications like 3D Gaussian editing and generation [10, 31, 59], there is a need to fill holes in depth maps. DepthLab [30] first fills holes using interpolation and then refines the results with a depth-guided diffusion model. However, interpolation errors reduce its effectiveness for large missing areas or incomplete depth ranges.

These previous methods have two main limitations: 1) poor performance when the prior is limited; 2) difficulty generalizing to unseen prior patterns. Our approach, Prior Depth Anything, tackles these challenges by explicitly using geometric information from depth prediction in a coarse-to-fine process, achieving impressive generalization and accuracy across various patterns of prior input.

3. Prior Depth Anything

Advanced monocular depth estimation models excel in predicting detailed, complete relative depth maps with precise geometric structures for any image. In contrast, depth measurement technologies can provide metric depth maps but suffer from inherent noise and varying patterns of incompleteness. Inspired by the complementary strengths of estimated and measured depth, we introduce Prior Depth Anything to progressively and effectively merge the two depth sources. To handle diverse real-world scenarios, we take measurement depth in any form as the metric prior, producing fine-grained and complete metric depth maps for any image with any prior.

3.1. Preliminary

Given an RGB image I ∈ R^(3×H×W) and its corresponding metric depth prior Dprior ∈ R^(H×W), prior-based monocular depth estimation takes I and Dprior as input, aiming to output a depth map Doutput ∈ R^(H×W) that is detailed, complete, and metrically precise. As discussed in Sec. 1, depth priors obtained by different measurement techniques often display various forms of incompleteness. To handle various priors with a unified framework, we uniformly represent the coordinates of valid positions in Dprior as P = {(x_i, y_i)}_{i=0}^{N}, in which N pixels are valid.

3.2. Coarse Metric Alignment

As shown in Fig. 2, different types of depth priors exhibit distinct missing patterns (e.g. sparse points, low-resolution grids, or irregular holes). These differences in sparsity and incompleteness restrict models' ability to generalize across various priors. To tackle this, we propose pre-filling missing regions to transform all priors into a shared intermediate domain, reducing the gap between them.

However, interpolation-based filling, used in previous methods [29, 30], preserves pixel-level metric information but ignores geometric structure, leading to significant errors in the filled areas. On the other hand, global alignment [10, 11], which scales relative depth predictions to match priors, maintains the fine structure of predictions but loses critical pixel metric details. To address these challenges, we propose pixel-level metric alignment, which aligns geometry predictions and metric priors at the pixel level, preserving both predicted structure and original metric information.

Pixel-level Metric Alignment  We first use a frozen MDE model to obtain a relative depth prediction Dpred ∈ R^(H×W). Then, by explicitly utilizing the accurate geometric structure in the predicted depth, we fill the invalid regions in Dprior pixel by pixel. Consider the pre-filled coarse depth map D̂prior, which inherits all the valid pixels in Dprior:

    D̂prior(x, y) = Dprior(x, y),  where (x, y) ∈ P        (1)

For each missing pixel (x̂, ŷ), we first identify its K closest valid points {(x_k, y_k)}_{k=1}^{K} from the valid pixel set P using k-nearest neighbors (kNN). Then, we compute the optimal scale s and shift t parameters that minimize the least-squares error between the depth values of Dpred and Dprior at the K supporting points:

    s, t = argmin_{s,t} Σ_{k=1}^{K} ‖s · Dpred(x_k, y_k) + t − Dprior(x_k, y_k)‖²        (2)
[Figure 2: pipeline diagram with two stages — Coarse Metric Alignment (Explicit Fusion: frozen MDE prediction Dpred, pixel alignment and re-weighting against Dprior, producing the coarse filled D̂prior) and Fine Structure Refinement (Implicit Fusion: RGB conv and condition conv feeding the conditioned MDE, with scale normalization and de-normalization).]

Figure 2. Prior Depth Anything. Considering RGB images, any form of depth prior Dprior, and relative prediction Dpred from a frozen MDE model, coarse metric alignment first explicitly combines the metric data in Dprior and geometry structure in Dpred to fill the incomplete areas in Dprior. Fine structure refinement implicitly merges the complementary information to produce the final metric depth map.

With the optimal scale s and shift t, we fill the missing pixels in D̂prior by linearly aligning the predicted depth value at (x̂, ŷ) to the metric prior:

    D̂prior(x̂, ŷ) = s · Dpred(x̂, ŷ) + t        (3)

Distance-aware Re-weighting  Our pilot study shows that simple pixel-level metric alignment achieves reasonable accuracy and generalization. However, two limitations remain: 1) Discontinuity Risk: adjacent pixels in missing regions may select different k-nearest neighbors, resulting in abrupt depth changes. 2) Uniform Weighting: nearby supporting points offer more reliable metric cues than distant ones, but equal weighting in the least squares overlooks this geometrical correlation, leading to suboptimal alignment.

To handle this issue, we further introduce distance-aware weighting for smoother and more accurate alignment. Within the alignment objective of Eq. 2, we re-weight each supporting point based on its distance to the query pixel, modifying Eq. 2 to:

    s, t = argmin_{s,t} Σ_{k=1}^{K} ‖s · Dpred(x_k, y_k) + t − Dprior(x_k, y_k)‖² / ‖(x̂, ŷ) − (x_k, y_k)‖²        (4)

This simple modification ensures smoother transitions between regions and improves robustness by emphasizing geometrically closer measurements.

In summary, by explicitly integrating the accurate metric information from Dprior and the fine geometric structures from Dpred, we cultivate the pre-filled dense prior D̂prior, which offers two main advantages: 1) Similar Pattern: filling the missing area narrows the differences between various prior types, improving generalization across different scenarios. 2) Fine Geometry: the filled regions, derived from linear transformations of the depth prediction, natively preserve the fine geometric structures, significantly boosting performance when prior information is limited.

3.3. Fine Structure Refinement

Although the pre-filled coarse dense depth is generally accurate in metric terms, the parameter-free approach is sensitive to noise in the depth priors. A single noisy pixel on blurred edges can disrupt all filled regions relying on it as a supporting point. To tackle these errors, we further implicitly leverage the MDE model's ability to capture precise geometric structures in RGB images, learning to rectify the noise in priors and produce refined depth.

Metric Condition  Specifically, we incorporate the pre-filled prior D̂prior as an extra condition to the pre-trained MDE model. With the guidance of RGB images, the conditioned MDE model is trained to correct the potential noise and error in D̂prior. To this end, we introduce a condition convolutional layer parallel to the RGB input layer, as shown in Fig. 2. By initializing the condition layer to zero, our model can natively inherit the ability of the pre-trained MDE model.

Geometry Condition  In addition to leveraging the MDE model's inherent ability to capture geometric structures from RGB input, we also incorporate existing depth predictions as an external geometry condition to help refine the coarse pre-filled prior. The depth prediction Dpred, obtained from the frozen MDE model, is also passed into the conditioned MDE model through a zero-initialized conv layer.

Scale Normalization  Then, we normalize the metric condition D̂prior and geometry condition Dpred to [0, 1] for two key benefits: 1) Better Scene Generalization: different scenes (e.g. indoor vs. outdoor) have significant depth scale differences. Normalization removes this scale variance, improving performance across diverse scenes. 2) Better MDE Model Generalization: predictions from different frozen MDE models also have varying scales. Normalizing Dpred enables test-time model switching, offering flexible accuracy-efficiency trade-offs for diverse demands and enabling seamless improvements as MDE models advance.
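For concreteness, below is a minimal NumPy/SciPy sketch of the coarse metric alignment stage (Eqs. 1-4): valid prior pixels are kept as-is, and every missing pixel is filled with s·Dpred + t, where s and t are fitted by a distance-weighted least squares over its K nearest valid neighbors. The helper name, the use of zeros to mark missing depth, and the unoptimized per-pixel loop are our own illustrative assumptions, not the paper's released implementation.

import numpy as np
from scipy.spatial import cKDTree

def coarse_metric_alignment(d_prior, d_pred, k=5, eps=1e-6):
    # Pre-fill invalid pixels of a metric prior with s * d_pred + t, where
    # (s, t) come from a distance-weighted least squares over the K nearest
    # valid prior pixels (Eqs. 1-4). Assumes at least k valid prior pixels.
    filled = d_prior.copy()
    valid = d_prior > 0                               # assumption: 0 marks missing depth
    vy, vx = np.nonzero(valid)
    vals_prior = d_prior[vy, vx]
    vals_pred = d_pred[vy, vx]
    tree = cKDTree(np.stack([vx, vy], axis=1))        # KD-tree over valid pixel coordinates

    my, mx = np.nonzero(~valid)
    dists, idx = tree.query(np.stack([mx, my], axis=1), k=k)

    for q in range(len(mx)):                          # simple per-pixel loop; not optimized
        p = vals_pred[idx[q]]                         # prediction at the K supporting points
        g = vals_prior[idx[q]]                        # prior at the K supporting points
        w = np.sqrt(1.0 / (dists[q] ** 2 + eps))      # distance-aware weights (Eq. 4)
        A = np.stack([p, np.ones_like(p)], axis=1)    # unknowns: scale s and shift t
        s, t = np.linalg.lstsq(A * w[:, None], g * w, rcond=None)[0]
        filled[my[q], mx[q]] = s * d_pred[my[q], mx[q]] + t   # Eq. 3
    return filled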
NYUv2 ScanNet ETH-3D DIODE KITTI
Model Encoder
S+M L+M S+L S+M L+M S+L S+M L+M S+L S+M L+M S+L S+M L+M S+L
DAv2 ViT-L 4.59 4.95 5.08 4.34 4.55 4.62 6.82 10.99 7.49 11.57 10.23 12.46 10.20 11.12 10.97
Depth Pro ViT-L 4.46 4.87 5.41 4.22 4.44 4.51 6.39 7.08 9.29 12.92 8.42 8.91 6.29 9.24 9.18
Omni-DC - 2.86 3.26 3.81 2.88 2.76 3.64 2.09 4.17 4.59 4.23 4.80 5.40 4.36 8.63 9.02
Marigold-DC SDv2 2.26 3.38 3.82 2.19 2.87 3.37 2.15 4.77 5.13 4.98 6.73 7.25 5.82 9.67 10.05
DepthLab SDv2 6.33 3.96 5.80 5.16 3.38 5.24 7.87 4.70 7.45 8.83 6.40 8.62 39.29 23.12 30.96
PromptDA ViT-L 17.00 3.76 17.13 15.27 4.21 15.44 18.34 9.01 18.73 18.24 9.97 18.55 21.61 54.35 22.14
PriorDA (ours) DAv2-B+ViT-S 2.09 2.88 3.17 2.14 2.56 2.94 1.65 3.96 4.16 3.76 4.43 4.89 3.97 8.38 8.53
PriorDA (ours) DAv2-B+ViT-B 2.04 2.82 3.09 2.11 2.50 2.86 1.56 3.84 4.08 3.62 4.30 4.73 3.86 8.40 8.57
PriorDA (ours) Depth Pro+ViT-B 2.01 2.82 3.08 2.06 2.48 2.82 1.61 3.83 4.05 3.40 4.23 4.60 3.37 8.12 8.25
Table 2. Zero-shot depth estimation with mixed prior. All results are reported in AbsRel ↓. "S": "Extreme" setting in sparse points,
“L”: “×16” in low-Resolution, “M”: “Shape” (square masks of 160) in missing area. We highlight Best, second best results. “Depth
Pro+ViT-B” indicates the frozen MDE and conditioned MDE. DAv2-B: Depth Anything v2 ViT-B [56], SDv2: Stable Diffusion v2. [38].
NYUv2 ScanNet ETH-3D DIODE KITTI
Model Encoder
SfM LiDAR Extreme SfM LiDAR Extreme SfM LiDAR Extreme SfM LiDAR Extreme SfM LiDAR Extreme
DAv2 ViT-L 5.31 4.85 4.77 5.88 4.60 4.68 5.84 7.41 6.61 12.45 12.55 14.37 9.83 8.86 9.25
Depth Pro ViT-L 4.80 4.47 4.41 5.27 4.12 4.23 5.68 5.31 6.51 10.53 9.08 8.98 6.45 6.05 6.19
Omni-DC - 2.87 2.12 2.63 6.09 2.02 2.71 2.57 1.88 1.98 4.99 3.96 4.13 3.34 5.27 4.17
Marigold-DC SDv2 2.65 1.90 2.13 4.32 1.76 2.12 4.65 2.27 2.03 8.41 5.12 4.77 6.19 6.88 5.62
DepthLab SDv2 5.92 4.30 6.30 9.87 3.56 5.09 13.82 6.40 8.01 16.67 7.45 8.66 25.91 37.17 40.29
PromptDA ViT-L 18.68 17.59 16.96 18.13 15.99 15.18 25.02 18.86 18.18 25.46 18.58 17.93 22.26 21.96 21.39
PriorDA (ours) DAv2-B+ViT-S 2.42 2.01 2.01 3.90 2.19 2.09 3.26 1.90 1.61 5.25 3.88 3.65 3.73 4.81 3.76
PriorDA (ours) DAv2-B+ViT-B 2.38 1.95 1.97 3.85 2.15 2.04 3.10 1.82 1.50 5.16 3.77 3.64 3.74 4.73 3.76
PriorDA (ours) Depth Pro+ViT-B 2.36 1.96 1.96 3.84 2.14 2.01 3.13 1.84 1.54 5.07 3.66 3.37 3.35 4.12 3.28
Table 3. Zero-shot depth completion (e.g. sparse points prior). All results are reported in AbsRel ↓. “SfM”: points sampled with
SIFT [32] and ORB [39], “LiDAR”: 8 LiDAR lines, “Extreme”: 100 random points.
Synthetic Training Data  As discussed in [27, 56], real depth datasets often face issues such as blurred edges and missing values. Therefore, we leverage synthetic datasets, Hypersim [36] and vKITTI [6], with precise GT to drive our conditioned MDE model to rectify the noise in measurements. From the precise ground truth, we randomly sample sparse points, create square missing areas, or apply downsampling to construct different synthetic priors. To mimic real-world measurement noise, we add outliers and boundary noise to perturb the sampled prior, following [66].

Learning Objective  As mentioned earlier, both the metric and geometry conditions are normalized. Thus, we apply the de-normalization transformation to convert the output into the ground-truth scale. Following ZoeDepth [3], we use the scale-invariant log loss for pixel-level supervision.

3.4. Implementation Details

Network Design  During training, we utilize the Depth Anything V2 ViT-B model as the frozen MDE model to produce relative depth predictions. During inference, the frozen MDE model can be swapped with any other pre-trained model. The k-value of the kNN process in Sec. 3.2 is set to 5. We initialize the conditioned MDE model with two versions of Depth Anything V2: ViT-S and ViT-B.

Training Setting  We train the conditioned MDE model for 200K steps with a batch size of 64, using 8 GPUs. The AdamW optimizer with a cosine scheduler is employed, where the base learning rate is set to 5e-6 for the MDE encoder and 5e-5 for the MDE decoder.

4. Experiment

4.1. Experimental Setting

Benchmarks  Our method aims to provide accurate and complete metric depth maps in a zero-shot manner for any image with any prior. To cover "any image", we evaluate models on 7 unseen real-world datasets, including NYUv2 [42] and ScanNet [13] for indoor, ETH3D [41] and DIODE [46] for indoor/outdoor, KITTI [18] for outdoor, and ARKitScenes [1] and RGB-D-D [23] for captured low-resolution depth. To cover "any prior", we construct 9 individual patterns: sparse points (SfM, LiDAR, Extremely sparse), low-resolution (captured, ×8, ×16), and missing areas (Range, Shape, Object). We also mix these patterns to simulate more complex scenarios.

Baselines  We compare with two kinds of methods: 1) Post-aligned MDE: Depth Anything v2 (DAv2) [56] and Depth Pro [4]; and 2) Prior-based MDE: Omni-DC [66], Marigold-DC [47], DepthLab [30] and PromptDA [29].

ARKitScenes RGB-D-D NYUv2 ScanNet ETH-3D DIODE KITTI
Model Encoder
AbsRel↓ RMSE↓ AbsRel↓ RMSE↓ 8× 16× 8× 16× 8× 16× 8× 16× 8× 16×
DAv2 ViT-L 3.67 0.0764 4.67 0.1116 4.77 5.13 4.64 4.85 6.27 7.38 12.49 11.20 9.54 11.22
Depth Pro ViT-L 3.25 0.0654 4.28 0.1030 4.48 4.83 4.17 4.40 5.88 6.79 8.20 8.33 6.76 9.16
Omni-DC - 2.14 0.0435 2.09 0.0685 1.57 3.11 1.29 2.65 1.86 4.09 2.81 4.71 4.05 8.35
Marigold-DC SDv2 2.17 0.0448 2.15 0.0672 1.83 3.32 1.63 2.83 2.33 4.75 4.28 6.60 5.17 9.47
DepthLab SDv2 2.10 0.0411 2.13 0.0624 2.60 3.73 1.89 3.19 2.60 4.50 4.42 6.16 17.17 22.90
PromptDA ViT-L 1.34 0.0347 2.79 0.0708 1.61 1.75 1.87 1.93 1.80 2.56 3.18 3.73 3.92 4.95
PriorDA (ours) DAv2-B+ViT-S 2.09 0.0414 2.07 0.0597 1.73 2.79 1.60 2.50 2.06 3.91 3.09 4.36 4.54 8.20
PriorDA (ours) DAv2-B+ViT-B 1.94 0.0404 2.02 0.0581 1.72 2.73 1.61 2.45 2.00 3.79 3.10 4.23 4.65 8.24
PriorDA (ours) Depth Pro+ViT-B 1.95 0.0408 2.02 0.0581 1.72 2.74 1.58 2.43 1.99 3.77 3.01 4.15 4.44 7.99
Table 4. Zero-shot depth super-resolution (e.g. low-resolution prior). ARKitScenes and RGB-D-D provide captured low-resolution
depth. For other datasets, results are reported in AbsRel ↓, with low-resolution maps created by downsampling the GT depths.
NYUv2 ScanNet ETH-3D DIODE KITTI
Model Encoder
Range Shape Object Range Shape Object Range Shape Object Range Shape Object Range Shape Object
DAv2 ViT-L 17.40 5.24 6.56 16.75 4.64 6.74 68.76 8.23 19.22 51.55 29.20 13.41 31.12 14.93 17.94
Depth Pro ViT-L 10.89 9.20 6.52 16.76 15.39 6.80 10.37 34.28 17.28 37.44 34.74 13.53 14.51 16.11 8.19
Omni-DC - 23.24 5.94 13.79 22.89 5.44 8.71 29.47 4.81 17.97 38.83 7.75 25.43 35.42 8.94 15.06
Marigold-DC SDv2 19.83 2.37 6.18 17.14 1.97 6.66 25.36 2.15 7.72 39.33 7.59 18.97 33.44 9.21 7.72
DepthLab SDv2 23.85 2.66 10.87 21.17 2.08 10.40 30.61 2.75 10.53 41.01 6.51 17.17 40.43 13.60 18.66
PromptDA ViT-L 36.67 20.88 23.14 35.86 17.87 21.89 46.21 24.94 27.42 49.50 25.66 28.29 55.79 32.74 38.29
PriorDA (ours) DAv2-B+ViT-S 16.86 2.30 5.72 14.29 2.01 5.87 21.16 1.98 6.52 36.59 5.58 10.77 30.04 6.67 7.99
PriorDA (ours) DAv2-B+ViT-B 16.61 2.30 5.49 14.48 1.99 5.73 21.90 1.76 6.09 36.64 5.94 9.72 30.79 6.29 7.52
PriorDA (ours) Depth Pro+ViT-B 16.31 2.17 5.59 14.18 1.98 5.87 22.72 1.76 6.21 34.90 4.86 11.99 30.44 5.47 6.04
Table 5. Zero-shot depth inpainting (e.g. missing area prior). All results are reported in AbsRel ↓. Metrics are calculated only on the
masked (inpainted) regions. “Range”: masks for depth beyond 3m (indoors) and 15m (outdoors), “Shape”: average result for square masks
of sizes 80, 120, 160, and 200, “Object”: object segmentation masks detected by YOLO [26].

4.2. Comparison on Mixed Depth Prior

We quantitatively evaluate the ability to handle challenging unseen mixed priors in Tab 2. In terms of absolute performance, all versions of our model outperform the compared baselines. More importantly, our model is less impacted by the additional patterns. For example, compared to the setting that only uses sparse points in Tab 3, adding missing areas or low-resolution results in only minor performance drops (1.96→2.01, 3.08 in NYUv2). In contrast, Omni-DC (2.63→2.86, 3.81) and Marigold-DC (2.13→2.26, 3.82) show larger declines. These results highlight the robustness of our method to different prior inputs.

4.3. Comparison on Individual Depth Prior

Zero-shot Depth Completion  Tab 3 shows the zero-shot depth completion results with different kinds and sparsity levels of sparse points as priors. Compared to Omni-DC [66] and Marigold-DC [47], which are specifically designed for depth completion and rely on sophisticated, time-consuming structures, our approach achieves better overall performance with simpler and more efficient designs.

Zero-shot Depth Super-resolution  In Tab 4, we present results for depth map super-resolution. On benchmarks where low-resolution maps are created through downsampling (e.g. NYUv2 [42], ScanNet [13], etc.), our approach achieves performance comparable to state-of-the-art methods. However, since downsampling tends to include overly specific details from the GT depths, directly replicating noise and blurred boundaries from the GT leads to better results instead. Therefore, ARKitScenes [1] and RGB-D-D [23] are more representative and practical, as they use low-power cameras to capture the low-resolution depths. On these two benchmarks, our method achieves leading performance compared to other zero-shot methods.

Zero-shot Depth Inpainting  In Tab 5, we evaluate the performance of inpainting missing regions in depth maps. In the practical and challenging "Range" setting, our method achieves superior results, which is highly meaningful for improving depth sensors with limited effective working ranges. Additionally, it outperforms all alternatives in filling square and object masks, demonstrating its potential for 3D content generation and editing.

4.4. Qualitative Analysis

In Fig 3, we provide a qualitative comparison of the outputs from different models. Our model consistently outperforms previous approaches, offering richer details, sharper boundaries, and more accurate metrics.

Figure 3. Qualitative comparisons with previous methods (columns: RGB image, Ours, Marigold-DC, Omni-DC, PromptDA, DepthLab). The depth prior or error map is shown below each sample.

Figure 4. Error analysis on widely used but indeed noisy benchmarks [13, 42] (columns: RGB image, GT & Prior, Ours, Error). Red means higher error, while blue indicates lower error.

Fig 4 visualizes the error maps of our model. The errors mainly occur around blurred edges in the "ground truth" of real data. Our method effectively corrects the noise in labels while aligning with the metric information from the prior. These "beyond ground truth" cases highlight the potential of our approach in addressing the inherent noise in depth measurement techniques. More visualizations can be found in the supp.

                      S     L     M     S+M   L+M   S+L
Interpolation         7.93  3.88  8.96  8.38  4.55  7.99
Ours (w/o re-weight)  2.92  3.44  6.91  3.22  4.53  4.36
Ours                  2.42  3.51  6.70  2.60  4.32  4.40
Table 6. Accuracy of pre-filled depth maps with different strategies. To independently compare each pre-fill strategy, we directly compare the pre-filled maps with ground truth.

                      Seen                 Unseen
                      S     L     M     S+M   L+M   S+L
None                  2.50  3.71  46.07 2.50  3.74  3.64
Interpolation         3.40  2.68  4.28  3.53  2.94  3.56
Ours (w/o re-weight)  2.13  2.86  2.58  2.19  2.94  3.25
Ours                  1.99  2.82  2.26  2.06  2.90  3.11
Table 7. Effect of pre-filled strategies on generalization. We train models with various pre-fill strategies using only sparse points and evaluate their ability to generalize to unseen types of depth priors.

Metric  Geometry  S     L     M     S+M   L+M   S+L
✗       ✓         5.46  5.29  5.48  5.36  5.30  5.46
✓       ✗         2.10  2.94  2.58  2.17  3.02  3.31
✓       ✓         1.96  2.74  2.48  2.01  2.82  3.08
Table 8. Effect of each condition for conditioned MDE models.

Model              Encoder  S     L     M     S+M   L+M   S+L
Depth Anything V2  ViT-S    2.15  2.77  2.68  2.22  2.87  3.20
Depth Anything V2  ViT-B    1.97  2.73  2.50  2.02  2.82  3.09
Depth Anything V2  ViT-L    1.92  2.71  2.29  1.97  2.79  3.04
Depth Anything V2  ViT-G    1.87  2.70  2.22  1.94  2.76  3.02
Depth Pro          ViT-L    1.96  2.74  2.35  2.01  2.82  3.08
Table 9. Effect of using different frozen MDE models. The conditioned MDE model is the ViT-B version here.

Model           Encoder          Param     Latency (ms)
Omni-DC         -                85M       334
Marigold-DC     SDv2             1,290M    30,634
DepthLab        SDv2             2,080M    9,310
PromptDA        ViT-L            340M      32
PriorDA (ours)  DAv2-B+ViT-S     97M+25M   157 (19+123+15)
PriorDA (ours)  DAv2-B+ViT-B     97M+98M   161 (19+123+19)
PriorDA (ours)  Depth Pro+ViT-B  952M+98M  760 (618+123+19)
Table 10. Analysis of inference efficiency. "x+x+x" represents the latency of the frozen MDE model, coarse metric alignment, and conditioned MDE model, respectively.

4.5. Ablation Study

We use Depth Anything V2 ViT-B as the frozen MDE and ViT-S as the conditioned MDE for ablation studies by default. All results are evaluated on NYUv2.

Accuracy of different pre-fill strategies  As shown in Tab 6, our pre-fill method outperforms simple interpolation across all scenarios by explicitly utilizing the precise geometric structures in the depth prediction. Additionally, the re-weight mechanism further enhances performance.

Pre-fill strategy for generalization  From Tab 7, we observe that our pixel-level metric alignment helps the model generalize to new prior patterns, and the re-weighting strategy further enhances the robustness by improving the accuracy of the pre-filled depth map.

Effectiveness of fine structure refinement  Comparing the pre-filled coarse depth maps in Tab 6 with the final output accuracy in Tab 3, 4, 5 and 2, the performance improvements after fine structure refinement (sparse points: 2.42→2.01, low-resolution: 3.51→2.79, missing areas: 6.70→2.48, S+M: 2.60→2.09, L+M: 4.32→2.88, S+L: 4.40→3.17) demonstrate its effectiveness in rectifying misaligned geometric structures in the pre-filled depth maps while maintaining their accurate metric information.

Effectiveness of metric and geometry condition  We evaluate the impact of metric and geometry guidance for the conditioned MDE model in Tab 8. The results show that combining both conditions achieves the best performance, emphasizing the importance of reinforcing geometric information during the fine structure refinement stage.

Testing-time improvement  We investigate the potential of test-time improvements in Tab 9. Our findings reveal that larger and stronger frozen MDE models consistently bring higher accuracy, while smaller models maintain competitive performance and enhance the efficiency of the entire pipeline. These findings underscore the flexibility of our model and its adaptability to various scenarios.

Inference efficiency analysis  In Tab 10, we analyze the inference efficiency of different models on one A100 GPU for an image resolution of 480×640. Overall, compared to previous approaches, our model variants achieve leading performance while demonstrating certain advantages in parameter number and inference latency. For a more detailed breakdown, we provide the time consumption for each stage of our method. The coarse metric alignment, which relies on kNN and least squares, accounts for the majority of the inference latency. However, it still demonstrates significant efficiency advantages compared to the sophisticated Omni-DC and the diffusion-based DepthLab and Marigold-DC.

5. Application

To demonstrate our model's real-world applicability, we employ prior-based monocular depth estimation models to refine the depth predictions from VGGT, a state-of-the-art 3D reconstruction foundation model. VGGT provides both a depth and a confidence map. We take the top 30% most confident pixels as depth prior and apply different prior-based models to obtain finer depth predictions.¹

¹ For models less adept at handling missing pixels (DepthLab, PromptDA), the entire VGGT depth prediction was provided as prior.
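As an illustration of how such a confidence-based prior could be constructed, the short sketch below keeps only the top 30% most confident VGGT depth pixels and marks the rest as missing; the function name and the use of zeros for missing pixels are our own assumptions.

import numpy as np

def confidence_prior(depth: np.ndarray, confidence: np.ndarray, keep: float = 0.3):
    # Keep the `keep` fraction of most confident depth pixels as a sparse
    # metric prior; everything else is marked as missing (assumed to be 0).
    threshold = np.quantile(confidence, 1.0 - keep)
    return np.where(confidence >= threshold, depth, 0.0)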
Monocular Depth Estimation Multi-view Depth Estimation
NYU ETH-3D KITTI ETH-3D KITTI
VGGT 3.54 (-) 4.94 (-) 6.56 (-) 2.46 (-) 18.75 (-)
+Omni-DC 4.12 (0.58) 6.08 (1.14) 6.85 (0.29) 2.64 (0.18) 18.66 (-0.09)
+Marigold-DC 4.06 (0.52) 5.43 (0.49) 7.63 (1.07) 2.81 (0.35) 18.86 (0.11)
+DepthLab 3.56 (0.02 ) 4.92 (-0.02) 7.97 (1.41) 2.25 (-0.21) 19.47 (0.72)
+PromptDA 3.43 (-0.11) 4.97 (0.03) 6.50 (-0.06) 2.48 (0.02) 18.91 (0.16)
+PriorDA 3.45 (-0.09) 4.43 (-0.51) 6.39 (-0.17) 1.99 (-0.47) 18.61 (-0.14)
Table 11. Results of refining VGGT depth prediction with different methods. All results are reported as AbsRel.

Table 11 reports VGGT's performance in monocular and multi-view depth estimation, along with the effectiveness of different prior-based methods as refiners. We observe that only our PriorDA consistently improves VGGT's predictions, primarily due to its ability to adapt to diverse priors. These surprising results highlight PriorDA's broad application potential.

6. Conclusion

In this work, we present Prior Depth Anything, a robust and powerful solution for prior-based monocular depth estimation. We propose a coarse-to-fine pipeline to progressively integrate the metric information from incomplete depth measurements and the geometric structure from relative depth predictions. The model offers three key advantages: 1) delivering accurate and fine-grained depth estimation with any type of depth prior; 2) offering flexibility to adapt to extensive applications through test-time module switching; and 3) showing the potential to rectify inherent noise and blurred boundaries in real depth measurements.

References

[1] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In NeurIPS, 2021. 5, 6
[2] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, 2021. 2
[3] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288, 2023. 5
[4] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In ICLR, 2025. 1, 2, 5
[5] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv:2108.07258, 2021. 2
[6] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv:2001.10773, 2020. 2, 5
[7] Manuel Carranza-García, F Javier Galán-Sales, José María Luna-Romera, and José C Riquelme. Object detection using depth completion and camera-lidar fusion for autonomous driving. Integrated Computer-Aided Engineering, 2022. 1
[8] Xinjing Cheng, Peng Wang, and Ruigang Yang. Learning depth with convolutional spatial propagation network. TPAMI, 2019. 3
[9] Xinjing Cheng, Peng Wang, Chenye Guan, and Ruigang Yang. Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion. In AAAI, 2020. 3
[10] Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv:2311.13384, 2023. 3
[11] Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3d gaussian splatting in few-shot images. In CVPR, 2024. 1, 3
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2
[13] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 5, 6, 7
[14] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In CVPR, 2022. 1
[15] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014. 2
[16] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023. 1
[17] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018. 2
[18] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012. 5
[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 2013. 2
[20] Christian Häne, Lionel Heng, Gim Hee Lee, Friedrich Fraundorfer, Paul Furgale, Torsten Sattler, and Marc Pollefeys. 3d visual perception for self-driving cars using a multi-camera system: Calibration, mapping, localization, and obstacle detection. Image and Vision Computing, 2017. 1
[21] Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. In ICLR, 2025. 1, 2
[22] Lingzhi He, Hongguang Zhu, Feng Li, Huihui Bai, Runmin Cong, Chunjie Zhang, Chunyu Lin, Meiqin Liu, and Yao Zhao. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In CVPR, 2021. 1
[23] Lingzhi He, Hongguang Zhu, Feng Li, Huihui Bai, Runmin Cong, Chunjie Zhang, Chunyu Lin, Meiqin Liu, and Yao Zhao. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In CVPR, 2021. 3, 5, 6
[24] Daniel Herrera, Juho Kannala, and Janne Heikkilä. Joint depth and color camera calibration with distortion correction. TPAMI, 2012. 1
[25] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. TPAMI, 2024. 1, 2
[26] Glenn Jocher, Jing Qiu, and Ayush Chaurasia. Ultralytics YOLO, 2023. 6
[27] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, 2024. 1, 2, 3, 5
[28] Robert Lange and Peter Seitz. Solid-state time-of-flight range camera. IEEE Journal of Quantum Electronics, 2001. 1
[29] Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting depth anything for 4k resolution accurate metric depth estimation. In CVPR, 2025. 2, 3, 5
[30] Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, and Ping Luo. Depthlab: From partial to complete. arXiv:2412.18153, 2024. 2, 3, 5
[31] Zhiheng Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, and Yang Cao. Infusion: Inpainting 3d gaussians via learning depth completion from diffusion prior. arXiv:2404.11613, 2024. 1, 3
[32] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 1, 5
[33] Jin-Hwi Park, Chanhwi Jeong, Junoh Lee, and Hae-Gon Jeon. Depth prompting for sensor-agnostic depth estimation. In CVPR, 2024. 3
[34] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 2020. 1, 2
[35] Alex Rasla and Michael Beyeler. The relative importance of depth cues and semantic edges for indoor mobility using simulated prosthetic vision in immersive virtual reality. In Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology, 2022. 1
[36] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021. 2, 5
[37] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In CVPR, 2022. 1, 3
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 5
[39] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In ICCV, 2011. 1, 5
[40] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. 1
[41] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017. 5
[42] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012. 2, 5, 6, 7
[43] Mel Slater and Sylvia Wilbur. A framework for immersive virtual environments (five): Speculations on the role of presence in virtual environments. Presence: Teleoperators & Virtual Environments, 1997. 1
[44] Jie Tang, Fei-Peng Tian, Boshi An, Jian Li, and Ping Tan. Bilateral propagation network for depth completion. In CVPR, 2024. 3
[45] Yifu Tao, Marija Popović, Yiduo Wang, Sundara Tejaswi Digumarti, Nived Chebrolu, and Maurice Fallon. 3d lidar reconstruction with probabilistic depth completion for robotic navigation. In IROS, 2022. 1
[46] Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z Dai, Andrea F Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R Walter, et al. Diode: A dense indoor and outdoor depth dataset. arXiv:1908.00463, 2019. 5
[47] Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, and Anton Obukhov. Marigold-dc: Zero-shot monocular depth completion with guided diffusion. arXiv:2412.13389, 2024. 2, 3, 5, 6
[48] Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. In ICME, 2021. 2

10
[49] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris [65] Yiming Zuo and Jia Deng. Ogni-dc: Robust depth comple-
Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- tion with optimization-guided neural iterations. In ECCV,
sion made easy. In CVPR, 2024. 1 2024. 3
[50] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, [66] Yiming Zuo, Willow Yang, Zeyu Ma, and Jia Deng. Omni-
Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Se- dc: Highly robust depth completion with multiresolution
bastian Scherer. Tartanair: A dataset to push the limits of depth integration. arXiv:2411.19278, 2024. 2, 3, 5, 6
visual slam. In IROS, 2020. 2
[51] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariha-
ran, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar
from visual depth estimation: Bridging the gap in 3d object
detection for autonomous driving. In CVPR, 2019. 1
[52] Diana Wofk, Fangchang Ma, Tien-Ju Yang, Sertac Karaman,
and Vivienne Sze. Fastdepth: Fast monocular depth estima-
tion on embedded systems. In ICRA, 2019. 1
[53] Chuhua Xian, Kun Qian, Zitian Zhang, and Charlie CL
Wang. Multi-scale progressive fusion learning for depth map
super-resolution. arXiv:2011.11865, 2020. 3
[54] Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan,
Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen.
Diffusion models trained with large data are transferable vi-
sual models. In ICLR, 2025. 2
[55] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi
Feng, and Hengshuang Zhao. Depth anything: Unleashing
the power of large-scale unlabeled data. In CVPR, 2024. 1,
2
[56] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao-
gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any-
thing v2. In NeurIPS, 2024. 1, 2, 3, 5
[57] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren,
Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-
scale dataset for generalized multi-view stereo networks. In
CVPR, 2020. 2
[58] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu,
Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d:
Towards zero-shot metric 3d prediction from a single image.
In ICCV, 2023. 1, 2
[59] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T
Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene
generation from a single image. arXiv:2406.09394, 2024. 1,
3
[60] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. In
ICCV, 2023. 1
[61] Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu,
Guan Huang, and Stefano Mattoccia. Completionformer:
Depth completion with convolutions and vision transform-
ers. In CVPR, 2023. 3
[62] Zixiang Zhao, Jiangshe Zhang, Shuang Xu, Zudi Lin, and
Hanspeter Pfister. Discrete cosine transform network for
guided depth map super-resolution. In CVPR, 2022. 3
[63] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang,
Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-
vla: A 3d vision-language-action generative world model. In
ICML, 2024. 1
[64] Zhiwei Zhong, Xianming Liu, Junjun Jiang, Debin Zhao, and
Xiangyang Ji. Guided depth map super-resolution: A survey.
ACM Computing Surveys, 2023. 3

11
        S      L      M      S+M    L+M    S+L
k=3     2.00   2.52   2.74   2.10   2.83   3.07
k=5     1.97   2.16   2.73   2.04   2.82   3.09
k=10    2.00   2.31   2.74   2.09   2.83   3.12
k=20    2.10   2.27   2.76   2.16   2.83   3.14

Table 12. Impact of different k-values on the accuracy of the final output D_output.

               S      L      M      S+M    L+M    S+L
Only Sparse    1.99   2.82   2.26   2.06   2.90   3.11
Three Pattern  1.96   2.74   2.48   2.01   2.82   3.08

Table 13. Impact of the prior patterns used during training.

A. More Ablation Study


Effect of the k-value in Coarse Metric Alignment. In Tab 12, we analyze the impact of the k-value on the accuracy of the final output D_output. Overall, our method is insensitive to the choice of k, with most settings yielding similarly strong results. This highlights the effectiveness of the nearest-neighbor approach in preserving detailed metric information.
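To make the coarse alignment step concrete, the sketch below shows one way a k-nearest-neighbor metric alignment of this kind could be implemented: for each pixel, the k closest pixels that carry a metric prior provide a distance-weighted scale that maps the relative prediction into metric units. The function name, the inverse-distance weighting, and the per-pixel scaling are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_metric_alignment(relative_depth, prior_depth, prior_mask, k=5):
    """Coarsely align a relative depth map to sparse metric prior values.

    For every pixel, query the k nearest pixels that carry a metric prior,
    estimate a local scale from them, and rescale the relative prediction.
    Illustrative sketch only; assumes at least k valid prior pixels.
    """
    h, w = relative_depth.shape
    prior_coords = np.argwhere(prior_mask)       # (N, 2) locations with metric depth
    prior_vals = prior_depth[prior_mask]         # (N,) metric depth values
    rel_at_prior = relative_depth[prior_mask]    # (N,) relative depth at those pixels

    tree = cKDTree(prior_coords)
    all_coords = np.argwhere(np.ones((h, w), dtype=bool))   # every pixel, row-major
    dists, idx = tree.query(all_coords, k=k)                 # k nearest prior pixels

    # Per-pixel scale: inverse-distance-weighted ratio of metric to relative depth.
    weights = 1.0 / (dists + 1e-6)
    weights /= weights.sum(axis=1, keepdims=True)
    ratios = prior_vals[idx] / (rel_at_prior[idx] + 1e-6)
    scale = (weights * ratios).sum(axis=1).reshape(h, w)

    return relative_depth * scale                # coarsely metric-aligned depth map
```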

B. More Qualitative Results


To further explore the boundaries of our model's capabilities and its potential to rectify the "ground truth" depth, we provide additional error analyses with different patterns of priors on the seven unseen datasets (Figure 5 for RGB-D-D, Figure 6 for ARKitScenes, Figure 7 for NYUv2, Figure 8 for ScanNet, Figure 9 for ETH-3D, Figure 10 for DIODE, and Figure 11 for KITTI).

C. More Training Details


For each training sample, we randomly select one of three patterns (i.e., sparse points, low resolution, or missing areas) with equal probability to sample the depth prior from the ground-truth depth map. Specifically: for sparse points, we randomly select 100 to 2000 pixels as valid; for low resolution, we downsample the GT map by a factor of 8; and for missing areas, we generate a random square mask with a side length of 160 pixels. It is worth mentioning that using multiple patterns and using only sparse points lead to similar results, as shown in Tab 13. This indicates that our method's ability to generalize to any form of prior stems from the coarse metric alignment rather than from the use of multiple patterns during training.
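As a concrete illustration, the following sketch samples a prior from a GT depth map using the three patterns described above (100 to 2000 sparse points, 8x downsampling, and a random 160x160 missing square). The function name and the nearest-pixel downsampling are illustrative assumptions rather than the exact training code.

```python
import numpy as np


def sample_depth_prior(gt_depth, rng=None):
    """Sample a depth prior from a ground-truth depth map using one of the
    three patterns (sparse points, low resolution, missing areas), chosen
    with equal probability. Illustrative sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = gt_depth.shape
    pattern = rng.choice(["sparse", "low_res", "missing"])
    prior = np.zeros_like(gt_depth)

    if pattern == "sparse":
        # Keep 100 to 2000 randomly chosen pixels as valid metric measurements.
        n = rng.integers(100, 2001)
        ys = rng.integers(0, h, size=n)
        xs = rng.integers(0, w, size=n)
        prior[ys, xs] = gt_depth[ys, xs]
    elif pattern == "low_res":
        # Downsample the GT map by a factor of 8 (nearest pixel kept here;
        # block averaging would be an equally reasonable choice).
        prior[::8, ::8] = gt_depth[::8, ::8]
    else:  # "missing"
        # Zero out a random square region with a side length of 160 pixels.
        prior = gt_depth.copy()
        y0 = rng.integers(0, max(h - 160, 1))
        x0 = rng.integers(0, max(w - 160, 1))
        prior[y0:y0 + 160, x0:x0 + 160] = 0.0

    return prior, pattern
```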

D. Limitations and Future Works


Currently, our largest conditioned MDE model is initialized with Depth Anything v2 ViT-B. Given that larger versions of the Depth Anything v2 model exhibit stronger capabilities, training conditioned MDE models based on larger backbones is an important direction for future work. Additionally, following Depth Anything, all training images are resized to 518x518. In contrast, PromptDA is natively trained at 1440x1920 resolution. Therefore, training at higher resolutions to better handle easily accessible high-resolution RGB images is another crucial direction for our future research.
Figures 5-11 show, for each dataset, the RGB image, the GT & prior depth, our prediction, and the error map.

Figure 5. Error analysis on RGB-D-D.
Figure 6. Error analysis on ARKitScenes.
Figure 7. Error analysis on NYUv2.
Figure 8. Error analysis on ScanNet.
Figure 9. Error analysis on ETH-3D.
Figure 10. Error analysis on DIODE.
Figure 11. Error analysis on KITTI.
