0% found this document useful (0 votes)

94 views16 pages

Foundationstereo: Zero-Shot Stereo Matching

FoundationStereo is a new foundation model for stereo depth estimation that achieves strong zero-shot generalization without the need for per-domain fine-tuning. It utilizes a large-scale synthetic dataset of 1 million stereo pairs and incorporates innovative network components like a side-tuning feature backbone and an Attentive Hybrid Cost Volume module to enhance robustness and accuracy. The model outperforms existing methods in diverse real-world scenarios, establishing a new standard in zero-shot stereo matching.

Uploaded by

maor

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

94 views16 pages

Foundationstereo: Zero-Shot Stereo Matching

Uploaded by

maor

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen Matthew Trepte Joseph Aribido Jan Kautz

Orazio Gallo Stan Birchfield

NVIDIA
arXiv:2501.09898v4 [cs.CV] 4 Apr 2025

Figure 1. Zero-shot prediction on in-the-wild images. Our method generalizes to diverse scenarios (indoor / outdoor), objects of challenging properties
(textureless / reflective / translucent / thin-structured), complex illuminations (shadow / strong exposure), various viewing perspectives and sensing ranges.

Abstract Recent stereo algorithms can achieve amazing results, al-

Tremendous progress has been made in deep stereo match- most saturating the most challenging benchmarks—thanks
ing to excel on benchmark datasets through per-domain to the proliferation of training datasets and advances in deep
fine-tuning. However, achieving strong zero-shot general- neural network architectures. Yet, fine-tuning on the dataset
ization — a hallmark of foundation models in other com- of the target domain is still the method of choice to get com-
puter vision tasks — remains challenging for stereo match- petitive results. Given the zero-shot generalization ability
ing. We introduce FoundationStereo, a foundation model shown on other problems within computer vision via the
for stereo depth estimation designed to achieve strong zero- scaling law [32, 46, 78, 79], what prevents stereo matching
shot generalization. To this end, we first construct a large- algorithms from achieving a similar level of generalization?
scale (1M stereo pairs) synthetic training dataset featuring Leading stereo networks [11, 41, 53, 54, 73, 80] con-
large diversity and high photorealism, followed by an auto- struct cost volumes from the unary features and lever-
matic self-curation pipeline to remove ambiguous samples. age 3D CNNs for cost filtering. Refinement-based meth-
We then design a number of network architecture compo- ods [14, 21, 27, 34, 36, 60, 67, 86] iteratively refine the
nents to enhance scalability, including a side-tuning fea- disparity map based on recurrent modules such as Gated
ture backbone that adapts rich monocular priors from vi- Recurrent Units (GRU). Despite their success on public
sion foundation models to mitigate the sim-to-real gap, and benchmarks under per-domain fine-tuning setup, however,
long-range context reasoning for effective cost volume fil- they struggle to gather non-local information to effectively
tering. Together, these components lead to strong robust- scale to larger datasets. Other methods [35, 68] explore
ness and accuracy across domains, establishing a new stan- transformer architectures for unary feature extraction, while
dard in zero-shot stereo depth estimation. Project page: lacking the specialized structure afforded by cost volumes
https:// nvlabs.github.io/ FoundationStereo/ and iterative refinement to achieve high accuracy.
Such limitations have, to date, hindered the development
1. Introduction of a stereo network that generalizes well to other domains.
Since the advent of the first stereo matching algorithm While it is true that cross-domain generalization has been
nearly half a century ago [42], we have come a long way. explored by some prior works [10, 17, 37, 49, 82, 84], such
approaches have not achieved results that are competitive (DT) module that performs full self-attention over the
with those obtained by fine-tuning on the target domain, ei- disparity dimension.
ther due to insufficient structure in the network architecture,
impoverished training data, or both. These networks are 2. Related Work
generally experimented on Scene Flow [43], a rather small
Deep Stereo Matching. Recent advances in stereo match-
dataset with only 40K annotated training image pairs. As a
ing have been driven by deep learning, significantly enhanc-
result, none of these methods can be used as an off-the-shelf
ing accuracy and generalization. Cost volume aggregation
solution, as opposed to the strong generalizability of vision
methods construct cost volumes from unary features and
foundation models that have emerged in other tasks.
perform 3D CNN for volume filtering [11, 41, 53, 54, 73,
To address these limitations, we propose Foundation- 80], though the high memory consumption prevents direct
Stereo, a large foundation model for stereo depth estimation application to high resolution images. Iterative refinement
that achieves strong zero-shot generalization without per- methods, inspired by RAFT [57], bypasses the costly 4D
domain fine-tuning. We train the network on a large-scale volume construction and filtering by recurrently refining the
(1M image pairs) high-fidelity synthetic training dataset disparity [14, 21, 27, 34, 36, 60, 67, 86]. While they gener-
with high diversity and photorealism. An automatic self- alize well to various disparity range, the recurrent updates
curation pipeline is developed to eliminate the ambiguous are often time-consuming, and lack long-range context rea-
samples that are inevitably introduced during the domain soning. Recent works [71, 72] thus combine the strengths
randomized data generation process, improving both the of cost filtering and iterative refinement. With the tremen-
dataset quality and model robustness over iterate updates. dous progress made by vision transformers, another line of
To mitigate the sim-to-real gap, we propose a side-tuning research [23, 35, 68] introduces transformer architecture to
feature backbone that adapts internet-scale rich priors from stereo matching, particularly in the unary feature extrac-
DepthAnythingV2 [79] that is trained on real monocular tion stage. Despite their success on per-domain fine-tuning
images to the stereo setup. To effectively leverage these setup, zero-shot generalization still remains challenging. To
rich monocular priors embedded into the 4D cost volume, tackle this problem, [10, 17, 37, 49, 82, 84] explore learning
we then propose an Attentive Hybrid Cost Volume (AHCF) domain-invariant features for cross-domain generalization,
module, consisting of 3D Axial-Planar Convolution (APC) with a focus on training on Scene Flow [43] dataset. Con-
filtering that decouples standard 3D convolution into two current work [3] achieves remarkable zero-shot generaliza-
separate spatial- and disparity-oriented 3D convolutions, tion with monocular prior enhanced correlation volumes.
enhancing the receptive fields for volume feature aggrega- However, the strong generalizability of vision foundation
tion; and a Disparity Transformer (DT) that performs self- models emerged in other tasks that is supported by scaling
attention over the entire disparity space within the cost vol- law has yet to be fully realized in stereo matching for prac-
ume, providing long range context for global reasoning. To- tical applications.
gether, these innovations significantly enhance the represen- Stereo Matching Training Data. Training data is essen-
tation, leading to better disparity initialization, as well as tial for deep learning models. KITTI 12 [20] and KITTI
more powerful features for the subsequent iterative refine- 15 [45] provide hundreds of training pairs on driving sce-
ment process. narios. DrivingStereo [76] further scales up to 180K stereo
Our contributions can be summarized as follows: pairs. Nevertheless, the sparse ground-truth disparity ob-
• We present FoundationStereo, a zero-shot generalizable tained by LiDAR sensors hinders learning accurate and
stereo matching model that achieves comparable or even dense stereo matching. Middlebury [51] and ETH3D [52]
more favorable results to prior works fine-tuned on a develop a low number of training data covering both in-
target domain; it also significantly outperforms existing door and outdoor scenarios beyond driving. Booster [48]
methods when applied to in-the-wild data. presents a real-world dataset focusing on transparent ob-
• We create a large-scale (1M) high-fidelity synthetic jects. InStereo2K [2] presents a larger training dataset con-
dataset for stereo learning with high diversity and pho- sisting of 2K stereo pairs with denser ground-truth dispar-
torealism; and a self-curation pipeline to ensure that bad ity obtained with structured light system. However, chal-
samples are pruned. lenges of scarce data size, imperfect ground-truth disparity
• To harness internet-scale knowledge containing rich se- and lack of collection scalability in real-world have driven
mantic and geometric priors, we propose a Side-Tuning the widespread adoption of synthetic data for training. This
Adapter (STA) that adapts the ViT-based monocular includes Scene Flow [43], Sintel [6], CREStereo [34],
depth estimation model [79] to the stereo setup. IRS [64], TartanAir [66], FallingThings [61], Virtual KITTI
• We develop Attentive Hybrid Cost Filtering (AHCF), 2 [7], CARLA HR-VS [75], Dynamic Replica [28]. In
which includes an hourglass module with 3D Axial- Tab. 1, we compare our proposed FoundationStereo dataset
Planar Convolution (APC), and a Disparity Transformer (FSD) with commonly used synthetic training datasets for
Properties Sintel [6] Sceneflow [43] CREStereo [34] IRS [64] TartanAir [66] FallingThings [61] UnrealStereo4K [59] Spring [44] FSD (Ours)
Flying Objects ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓
Scenarios

Indoor ✗ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓
Outdoor ✗ ✓ ✗ ✗ ✓ ✓ ✓ ✗ ✓
Driving ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✓
Movie ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗
Simulator Blender Blender Blender Unreal Engine Unreal Engine Unreal Engine Unreal Engine Blender NVIDIA Omniverse
Rendering Realism High Low High High High High High High High
Scenes 10 9 0 4 18 3 8 47 12
Layout Realism Medium Low Low High High Medium High High High
Stereo Pairs 1K† 40K† 200K 103K† 306K† 62K 7.7K 6K† 1000K
Resolution 1024 × 436 960 × 540 1920 × 1080 960 × 540 640 × 480 960 × 540 3840 × 2160 1920 × 1080 1280 × 720
Reflections ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✗ ✓
Constant baseline, Varying baseline
Camera Params Constant Constant Constant Constant Constant Constant Constant
varying intrinsics and intrinsics
Table 1. Synthetic datasets for training stereo algorithms (excluding test images with inaccessible ground truth). † Indicates reduced diversity, caused by
including many similar frames from video sequences.

stereo matching. Our dataset encompasses a wide range of We explored multiple design choices for combining
scenarios, features the largest data volume to date, includes CNN and ViT approaches, as outlined in Fig. 3 (left). In
diverse 3D assets, captures stereo images under diversely particular, (a) directly uses the feature pyramids from the
randomized camera parameters, and achieves high fidelity DPT head in a frozen DepthAnythingV2 [79] without using
in both rendering and spatial layouts. CNN features. (b) resembles ViT-Adapter [12] by exchang-
Vision Foundation Models. Vision foundation models ing features between CNN and ViT. (c) applies a 4 × 4 con-
have significantly advanced across various vision tasks in volution with stride 4 to downscale the feature before the
2D, 3D and multi-modal alignment. CLIP [47] leverages DepthAnythingV2 final output head. The feature is then
large-scale image-text pair training to align visual and tex- concatenated with the same level CNN feature to obtain a
tual modalities, enabling zero-shot classification and facil- hybrid feature at 1/4 scale. The side CNN network is thus
itating cross-modal applications. DINO series [8, 38, 46] learned to adapt the ViT features [83] to stereo matching
employ self-supervised learning for dense representation task. Surprisingly, while being simple, we found (c) signifi-
learning, effectively capturing detailed features critical for cantly surpasses the alternative choices on the stereo match-
segmentation and recognition tasks. SAM series [32, 50, ing task, as shown in the experiments (Sec. 4.5). As a result,
77] demonstrate high versatility in segmentation driven we adopt (c) as the main design of STA module.
by various prompts such as points, bounding boxes, lan- Formally, given a pair of left and right images Il , Ir ∈
guage. Similar advancements also appear in 3D vision RH×W ×3 , we employ EdgeNeXt-S [40] as the CNN
tasks. DUSt3R [65] and MASt3R [33] present generaliz- module within STA to extract multi-level pyramid fea-
able frameworks for dense 3D reconstruction from uncali- tures, where the 1/4 level feature is equipped with
brated and unposed cameras. FoundationPose [69] develops (i) (i) H W
DepthAnythingV2 feature: fl , fr ∈ RCi × i × i , i ∈
a unified framework of 6D object pose estimation and track- {4, 8, 16, 32}. EdgeNeXt-S [40] is chosen for its memory
ing for novel objects. More closely related to this work, a efficiency and because larger CNN backbones did not yield
number of efforts [4, 29, 78, 79] demonstrated strong gener- additional benefits in our investigation. When forwarding to
alization in monocular depth estimation task and multi-view DepthAnythingV2, we first resize the image to be divisible
stereo [26]. Together, these approaches exemplify under the by 14, to be consistent with its pretrained patch size. The
scaling law, how foundation models in vision are evolving STA weights are shared when applied to Il , Ir .
to support robust applications across diverse scenarios with-
Similarly, we employ STA to extract context feature,
out tedious per-domain fine-tuning.
with the difference that the CNN module is designed with
a sequence of residual blocks [25] and down-sampling lay-
3. Approach (i)
ers. It generates context features of multiple scales: fc ∈
H W
The overall network architecture is shown in Fig. 2. The RCi × i × i , i ∈ {4, 8, 16}, as in [36]. fc participates in
rest of this section describes the various components. initializing the hidden state of the ConvGRU block and in-
putting to the ConvGRU block at each iteration, effectively
3.1. Monocular Foundation Model Adaptation guiding the iterative process with progressively refined con-
To mitigate the sim-to-real gap when the stereo network is textual information.
primarily trained on synthetic dataset, we leverage the re- Fig. 3 visualizes the power of rich monocular prior that
cent advancements on monocular depth estimation trained helps to reliably predict on ambiguous regions which is
on internet-scale real data [5, 79]. We use a CNN network challenging to deal with by naive correspondence search
to adapt the ViT-based monocular depth estimation network along the epipolar line. Instead of using the raw monocular
to the stereo setup, thus synergizing the strengths of both depth from DepthAnythingV2 which has scale ambiguity,
CNN and ViT architectures. we use its latent feature as geometric priors extracted from
CNN 1/32 CNN 1/32
1/16
Depth Depth CNN 1/16 Depth CNN 1/16
Anything 1/8 Anything Anything
CNN 1/8 CNN 1/8
V2 1/4 V2 V2
CNN 1/4 CNN 1/4

Down

(a) (b) (c)

Figure 2. Overview of our proposed FoundationStereo. The Side-Tuning Adapter (STA) adapts the rich monocular priors from a frozen DepthAny-
thingV2 [79], while combined with fine-grained high-frequency features from multi-level CNN for unary feature extraction. Attentive Hybrid Cost Filtering
(AHCF) combines the strengths of the Axial-Planar Convolution (APC) filtering and a Disparity Transformer (DT) module to effectively aggregate the
features along spatial and disparity dimensions over the 4D hybrid cost volume. An initial disparity is then predicted from the filtered cost volume, and
subsequently refined through GRU blocks. At each refinement step, the latest disparity is used to look up features from both filtered hybrid cost volume and
correlation volume to guide the next refinement. The iteratively refined disparity becomes the final output.

Figure 3. Left: Design choices for STA module. Right: Effects of the proposed STA and AHCF modules. “W/o STA” only uses CNN to extract features.
“W/o AHCF” uses conventional 3D CNN-based hourglass network for cost volume filtering. Results are obtained via zero-shot inference without fine-
tuning on target dataset. STA leverages rich monocular prior to reliably predict the lamp region with inconsistent lighting and dark guitar sound hole. AHCF
effectively aggregates the spatial and long-range disparity context to accurately predict over thin repetitive structures.

both stereo images and compared through cost filtering as costs, offering a diverse set of similarity measurement fea-
described next. tures from each group. Vcat preserves unary features includ-
ing the rich monocular priors by concatenating left and right
3.2. Attentive Hybrid Cost Filtering features at shifted disparity. To reduce memory consump-
tion, we linearly downsize the unary feature dimension to
Hybrid Cost Volume Construction. Given unary features
14 using a convolution of kernel size 1 (weights are shared
at 1/4 scale fl4 , fr4 extracted from previous step, we con-
D H W between fl4 and fr4 ) before concatenation. Next, we de-
struct the cost volume VC ∈ RC× 4 × 4 × 4 with a com- scribe two sub-modules for effective cost volume filtering.
bination of group-wise correlation and concatenation [24]:
Axial-Planar Convolution (APC) Filtering. An hourglass
&\mathbf {V_{\text {gwc}}}(g,d,h,w) = \left \langle \widehat {f}_{l,g}^{(4)}(h,w), \widehat {f}_{r,g}^{(4)}(h,w-d) \right \rangle , \notag \\ &\mathbf {V_{\text {cat}}}(d,h,w) = \left [ \text {Conv}(f_{l}^{(4)})(h,w), \text {Conv}(f_{r}^{(4)})(h,w-d) \right ], \notag \\ &\mathbf {V_C}(d,h,w) = \left [ \mathbf {V_{\text {gwc}}}(d,h,w), \mathbf {V_{\text {cat}}}(d,h,w) \right ] \label {eq:V_C} network consisting of 3D convolutions, with three down-
sampling blocks and three up-sampling blocks with residual
connections, is leveraged for cost volume filtering [1, 71].
While 3D convolutions of kernel size 3 × 3 × 3 are com-
(1) monly used for relatively small disparity sizes [9, 24, 71],
where fb denotes L2 normalized feature for better training we observe it struggles with larger disparities when applied
stability; ⟨·, ·⟩ represents dot product; g ∈ {1, 2, ..., G} is to high resolution images, especially since the disparity di-
the group index among the total G = 8 feature groups that mension is expected to model the probability distribution
we evenly divide the total features into; d ∈ {1, 2, ..., D
4 } is for the initial disparity prediction. However, it is impracti-
the disparity index. [·, ·] denotes concatenation along chan- cal to naively increase the kernel size, due to the intensive
nel dimension. The group-wise correlation Vgwc harnesses memory consumption. In fact, even when setting kernel size
the strengths of conventional correlation-based matching to 5 × 5 × 5 we observe unmanageable memory usage on an
80 GB GPU. This drastically limits the model’s representa- can be formulated as:
tion power when scaling up with large amount of training &\mathbf {V}_{\text {corr}}(w',h,w)=\left \langle f_l^{(4)}(h,w), f_r^{(4)}(h,w') \right \rangle \\ &\mathbf {F_V}(h,w)=[\mathbf {V_C'}(d_k,h,w), \mathbf {V}_{\text {corr}}(w-d_k,h,w)] \\ &x_k=[\text {Conv}_v(\mathbf {F_V}), \text {Conv}_{d}(d_k), d_k, c] \\ &z_k=\sigma \left ( \text {Conv}_z([h_{k-1}, x_k]) \right ) \\ &r_k=\sigma \left ( \text {Conv}_r([h_{k-1}, x_k]) \right ) \\ &\hat {h}_k=\text {tanh}\left ( \text {Conv}_h([r_k \odot h_{k-1}, x_k]) \right ) \\ &h_k=(1-z_k)\odot h_{k-1} + z_k \odot \hat {h}_k \\ &d_{k+1} = d_k + \text {Conv}_{\Delta }(h_k)
data. We thus develop “Axial-Planar Convolution” which
decouples a single 3 × 3 × 3 convolution into two sepa-
rate convolutions: one over spatial dimensions (kernel size
Ks × Ks × 1) and the other over disparity (1 × 1 × Kd ),
each followed by BatchNorm and ReLU. APC can be re-
garded as a 3D version of Separable Convolution [16] with
the difference that we only separate the spatial and dispar-
ity dimensions without subdividing the channel into groups
which sacrifices representation power. The disparity dimen-
sion is specially treated due to its uniquely encoded feature (10)
comparison within the cost volume. We use APC wher- where ⊙ denotes element-wise product; σ denotes sigmoid;
W H W
ever possible in the hourglass network except for the down- Vcorr ∈ R 4 × 4 × 4 is the pair-wise correlation volume;
sampling and up-sampling layers. FV represents the looked up volume features using lat-
Disparity Transformer (DT). While prior works [35, 68] est disparity; c = ReLU(fc ) encodes the context feature
introduced transformer architecture to unary feature extrac- from left image, including STA adapted features (Sec. 3.1)
tion step to scale up stereo training, the cost filtering pro- which effectively guide the refinement process leveraging
cess is often overlooked, which remains an essential step rich monocular priors.
in achieving accurate stereo matching by encapsulating cor- We use three levels of GRU blocks to perform coarse-
respondence information. Therefore, we introduce DT to to-fine hidden state update in each iteration, where the ini-
(i)
further enhance the long-range context reasoning within the tial hidden states are produced from context features h0 =
(i)
4D cost volume. Given VC obtained in Eq. (1), we first tanh(fc ), i ∈ {4, 8, 16}. At each level, attention-based se-
apply a 3D convolution of kernel size 4 × 4 × 4 with stride lection mechanism [67] is leveraged to capture information
4 to downsize the cost volume. We then reshape the vol- at different frequencies. Finally, dk is up-sampled to the full
ume into a batch of token sequences, each with length of resolution using convex sampling [57].
disparity. We apply position encoding before feeding it to a
series (4 in our case) of transformer encoder blocks, where 3.4. Loss Function
FlashAttention [18] is leveraged to perform multi-head self- The model is trained with the following objective:
attention [63]. The process can be written as:
&\mathbf {Q_0}=\text {PE}\left ( \mathbf {R}\left (\text {Conv}_{4\times 4\times 4}(\mathbf {V_C}) \right ) \right ) \in \mathbb {R}^{\left (\frac {H}{16}\times \frac {W}{16}\right ) \times C\times \frac {D}{16}} \notag \\ &\text {MultiHead}(\mathbf {Q}, \mathbf {K}, \mathbf {V}) = [\text {head}_1, \ldots , \text {head}_h]\mathbf {W}_O \notag \\ &\quad \text {where } \text {head}_i = \text {FlashAttention}(\mathbf {Q}_i, \mathbf {K}_i, \mathbf {V}_i) \notag \\ &\mathbf {Q_1}=\text {Norm}\left ( \text {MultiHead}(\mathbf {Q_0}, \mathbf {Q_0}, \mathbf {Q_0}) + \mathbf {Q_0} \right ) \notag \\ &\mathbf {Q_2}=\text {Norm}\left ( \text {FFN}(\mathbf {Q_1}) + \mathbf {Q_1} \right ) \notag \mathcal {L}=\left | d_0-\overline {d} \right |_{\text {smooth}} + \sum _{k=1}^{K}\gamma ^{K-k}\left \| d_k-\overline {d} \right \|_1 (11)

where d represents ground-truth disparity; |·|smooth denotes

smooth L1 loss; k is the iteration number; γ is set to 0.9, and
we apply exponentially increasing weights [36] to supervise
the iteratively refined disparity.
where R(·) denotes reshape operation; PE(·) represents po-
3.5. Synthetic Training Dataset
sition encoding; [·, ·] denotes concatenation along the chan-
nel dimension; WO is linear weights. The number of heads We created a large scale synthetic training dataset with
is h = 4 in our case. Finally, the DT output is up-sampled NVIDIA Omniverse. This FoundationStereo Dataset (FSD)
to the same size as VC using trilinear interpolation and accounts for crucial stereo matching challenges such as re-
summed with hourglass output, as shown in Fig. 2. flections, low-texture surfaces, and severe occlusions. We
Initial Disparity Prediction. We apply soft-argmin [30] to perform domain randomization [58] to augment dataset
the filtered volume VC ′
to produce an initial disparity: diversity, including random stereo baseline, focal length,
camera perspectives, lighting conditions and object config-
urations. Meanwhile, high-quality 3D assets with abundant
d_0=\sum _{d=0}^{\frac {D}{4}-1}d\cdot \text {Softmax}(\mathbf {V_C'})(d) (2)
textures and path-tracing rendering are leveraged to enhance
where d0 is at 1/4 scale of the original image resolution. realism in rendering and layouts. Fig. 4 displays some sam-
ples from our dataset including both structured indoor and
3.3. Iterative Refinement outdoor scenarios, as well as more diversely randomized
Given d0 , we perform iterative GRU updates to progres- flying objects with various geometries and textures under
sively refine disparity, which helps to avoid local optimum complex yet realistic lighting. See the appendix for details.
and accelerate convergence [71]. In general, the k-th update Iterative Self-Curation. While synthetic data generation
Figure 4. Left: Samples from our FoundationStereo dataset (FSD), which consists of synthetic stereo images with structured indoor / outdoor scenes (top),
as well as more randomized scenes with challenging flying objects and higher geometry and texture diversity (bottom). Right: The iterative self-curation
process removes ambiguous samples inevitably produced from the domain randomized synthetic data generation process. Example ambiguities include
severe texture repetition, ubiquitous reflections with limited surrounding context, and pure color under improper lighting.

in theory can produce unlimited amount of data and achieve Middlebury ETH3D KITTI-12 KITTI-15
Methods BP-2 BP-1 D1 D1
large diversity through randomization, ambiguities can be
CREStereo++ [27] 14.8 4.4 4.7 5.2
inevitably introduced especially for less structured scenes DSMNet [82] 13.8 6.2 6.2 6.5
with flying objects, which confuses the learning process. Mask-CFNet [49] 13.7 5.7 4.8 5.8
To eliminate those samples, we design an automatic itera- HVT-RAFT [10] 10.4 3.0 3.7 5.2
RAFT-Stereo [36] 12.6 3.3 4.7 5.5
tive self-curation strategy. Fig. 4 demonstrates this process Selective-IGEV [67] 9.2 5.7 4.5 5.6
and detected ambiguous samples. We start with training an IGEV [36] 8.8 4.0 5.2 5.7
initial version of FoundationStereo on FSD, after which it is Former-RAFT-DAM [84] 8.1 3.3 3.9 5.1
IGEV++ [72] 7.8 4.1 5.1 5.9
evaluated on FSD. Samples where BP-2 (Sec. 4.2) is larger
NMRF [22] 7.5 3.8 4.2 5.1
than 60% are regarded as ambiguous samples and replaced Ours (Scene Flow) 5.5 1.8 3.2 4.9
by regenerating new ones. The training and curation pro- Selective-IGEV* [67] 7.5 3.4 3.2 4.5
cesses are alternated to iteratively (twice in our case) update Ours 1.1 0.5 2.3 2.8
both FSD and FoundationStereo. Table 2. Zero-shot generalization results on four public datasets. The most
commonly used metrics for each dataset were adopted. In the first block,
all methods were trained only on Scene Flow. In the second block, meth-
4. Experiments ods are allowed to train on any existing datasets excluding the four target
domains. The weights and parameters are fixed for evaluation.
4.1. Implementation Details
ing both indoor and outdoor scenarios. KITTI 2012 [20]
We implement FoundationStereo in PyTorch. The foun-
and KITTI 2015 [45] datasets feature real-world driving
dation model is trained on a mixed dataset consisting of
scenes, where sparse ground-truth disparity maps are pro-
our proposed FSD, together with Scene Flow [43], Sin-
vided, which are derived from LIDAR sensors.
tel [6], CREStereo [34], FallingThings [61], InStereo2K [2]
and Virtual KITTI 2 [7]. We train FoundationStereo using Metrics. “EPE” computes average per-pixel disparity er-
AdamW optimizer [39] for 200K steps with a total batch ror. “BP-X” computes the percentage of pixels where the
size of 128 evenly distributed over 32 NVIDIA A100 GPUs. disparity error is larger than X pixels. “D1” computes the
The learning rate starts at 1e-4 and decays by 0.1 at 0.8 of percentage of pixels whose disparity error is larger than 3
the entire training process. Images are randomly cropped pixels and 5% of the ground-truth disparity.
to 320×736 before feeding to the network. Data augmen- 4.3. Zero-Shot Generalization Comparison
tations similar to [36] are performed. During training, 22
iterations are used in GRU updates. In the following, unless Benchmark Evaluation. Tab. 2 exhibits quantitative com-
otherwise mentioned, we use the same foundation model for parison of zero-shot generalization results on four public
zero-shot inference with 32 refinement iterations and 416 real-world datasets. Even when trained solely on Scene
for maximum disparity. Flow, our method outperforms the comparison methods
consistently across all datasets, thanks to the efficacy of
4.2. Benchmark Datasets and Metric adapting rich monocular priors from vision foundation
Datasets. We consider five commonly used public datasets models. We further evaluate in a more realistic setup, al-
for evaluation: Scene Flow [43] is a synthetic dataset lowing methods to train on any available dataset while ex-
including three subsets: FlyingThings3D, Driving, and cluding the target domain, to achieve optimal zero-shot in-
Monkaa. Middlebury [51] consists of indoor stereo image ference results as required in practical applications.
pairs with high-quality ground-truth disparity captured via In-the-Wild Generalization. We compare our foundation
structured light. Unless otherwise mentioned, evaluations model against recent approaches that released their check-
are performed on half resolution and non-occluded regions. points trained on a mixture of datasets, to resemble the prac-
ETH3D [52] provides grayscale stereo image pairs cover- tical zero-shot application on in-the-wild images. Com-
Figure 5. Qualitative comparison of zero-shot inference on in-the-wild images. For each comparison method we select the best performing checkpoint from
their public release, which has been trained on a mixture of public datasets. These images exhibit challenging reflection, translucency, repetitive textures,
complex illuminations and thin-structures, revealing the importance of our network architecture and large-scale training.

Method LEAStereo [15] GANet [81] ACVNet [70] IGEV-Stereo [71] NMRF [22] MoCha-Stereo [14] Selective-IGEV [67] Ours
EPE 0.78 0.84 0.48 0.47 0.45 0.41 0.44 0.34
Table 3. Comparison of methods trained / tested on the Scene Flow train / test sets, respectively.

Method Zero-Shot BP-0.5 BP-1.0 EPE ing the previous best EPE from 0.41 to 0.33. Although in-
GMStereo [74] ✗ 5.94 1.83 0.19 domain training is not the focus of this work, the results
HITNet [56] ✗ 7.83 2.79 0.20 reflect the effectiveness of our model design.
EAI-Stereo [85] ✗ 5.21 2.31 0.21
RAFT-Stereo [36] ✗ 7.04 2.44 0.18 Tab. 4 exhibits quantitative comparison on ETH3D
CREStereo [34] ✗ 3.58 0.98 0.13 leaderboard (test set). For our approach, we perform eval-
IGEV-Stereo [71] ✗ 3.52 1.12 0.14 uations in two settings. First, we fine-tune our foundation
CroCo-Stereo [68] ✗ 3.27 0.99 0.14
MoCha-Stereo [14] ✗ 3.20 1.41 0.13
model on a mixture of the default training dataset (Sec. 4.1)
Selective-IGEV [67] ✗ 3.06 1.23 0.12 and ETH3D training set for another 50K steps, using the
Ours (finetuned) ✗ 1.26 0.26 0.09 same learning rate schedule and data augmentation. Our
Ours ✓ 2.31 1.52
3912 Contributed paper
0.13 model significantly surpasses the previous best approach by
Context-Oriented Feature Fusion and
Enhancement for Robust Stereo Matching

Table 4. Results on ETH3D leaderboard (test set). All methods except for reducing more than half of the error rates and ranks 1st
the last row have used ETH3D training set for fine-tuning. Our fine-tuned on leaderboard at the time of submission. This indicates
version ranks 1st on leaderboard at the time of submission. Last row is great potential of transferring capability from our founda-
obtained via zero-shot inference from our foundation model.
tion model if in-domain fine-tuning is desired. Second, we
parison methods include CroCo v2 [68], CREStereo [34], also evaluated our foundation model without using any data
IGEV [71] and Selective-IGEV [67]. For each method, we from ETH3D. Remarkably, our foundation model’s zero-
select the best performing checkpoint from their public re- shot inference achieves comparable or even better results
lease. In this evaluation, the four real-world benchmark than leading approaches that perform in-domain training.
datasets [20, 45, 51, 52] have been used for training compar- In addition, our finetuned model also ranks 1st on the
ison methods, whereas they are not used in our fixed foun- Middlebury leaderboard. See appendix for details.
dation model. Fig. 5 displays qualitative comparison on var-
ious scenarios, including a robot scene from DROID [31] 4.5. Ablation Study
dataset and custom captures covering indoor and outdoor. We investigate different design choices for our model
and dataset. Unless otherwise mentioned, we train on
4.4. In-Domain Comparison a randomly subsampled version (100K) of FSD to make
Tab. 3 presents quantitative comparison on Scene Flow, the experiment scale more affordable. Given Middlebury
where all methods are following the same officially divided dataset’s high quality ground-truth, results are evaluated on
train and test split. Our FoundationStereo model outper- its training set to reflect zero-shot generalization. Since the
forms the comparison methods by a large margin, reduc- focus of this work is to build a stereo matching foundation
Row Variations BP-2 AHCF Row FSD BP2
Row STA BP2
APC DT 1 ✗ 2.34
1 DINOv2-L [46] 2.46
2 DepthAnythingV2-S [79] 2.22 1 2.48 2 ✓ 1.15
3 DepthAnythingV2-B [79] 2.11 2 ✓ 2.21
4 DepthAnythingV2-L [79] 1.97 3 ✓ ✓ 2.16
4 ✓ ✓ 2.05
5 STA (a) 6.48 5 ✓ ✓ ✓ 1.97
6 STA (b) 2.22
7 STA (c) 1.97 Table 7. Left: Ablation study of proposed network modules. Right: Ab-
lation study of whether to use FSD dataset when training the foundation
8 Unfreeze ViT 3.94 model described in Sec. 4.1. The choices adopted in our full model are
9 Freeze ViT 1.97 highlighted in green.
Table 5. Ablation study of STA module. Variations (a-c) correspond to ing sequence lengths. However, it does not outperform co-
Fig. 3. The choices adopted in our full model are highlighted in green.
sine position embedding, probably due to the constant dis-
Row Variations BP-2 Row Variations BP-2 parity size in 4D cost volume. While in theory, full volume
1 RoPE 2.19 10 (3,3,1), (1,1,5) 2.10 attention provides larger receptive field, it is less effective
2 Cosine 1.97 11 (3,3,1), (1,1,9) 2.06 than merely applying over the disparity dimension of the
3 1/32 2.06 12 (3,3,1), (1,1,13) 2.01 cost volume. We hypothesize the extremely large space of
4 1/16 1.97 13 (3,3,1), (1,1,17) 1.97
4D cost volume makes it less tractable, whereas attention
5 Full 2.25 14 (3,3,1), (1,1,21) 1.98 over disparity provides sufficient context for a better initial
6 Disparity 1.97 15 (7,7,1), (1,1,17) 1.99
disparity prediction and subsequent volume feature lookup
7 Pre-hourglass 2.06
8 Post-hourglass 2.20
during GRU updates. Next, we compare different kernel
9 Parallel 1.97 sizes in APC (row 10-15), where the last dimension in each
Table 6. Ablation study of AHCF module. Left corresponds to DT, while
parenthesis corresponds to disparity dimension. We observe
right corresponds to APC. The choices adopted in our full model are high- increasing benefits when enlarging disparity kernel size un-
lighted in green. til it saturates at around 17.
Effects of Proposed Modules. The quantitative effects are
model with strong generalization, we do not deliberately
shown in Tab. 7 (left). STA leverages rich monocular pri-
limit model size while pursuing better performance.
ors which greatly enhances generalization to real images
STA Design Choices. As shown in Tab. 5, we first compare
for ambiguous regions. DT and APC effectively aggregate
different vision foundation models for adapting rich monoc-
cost volume features along spatial and disparity dimensions,
ular priors, including different model sizes of DepthAny-
leading to improved context for disparity initialization and
thingV2 [79] and DINOv2-Large [46]. While DINOv2
subsequent volume feature look up during GRU updates.
previously exhibited promising results in correspondence
Fig. 3 further visualizes the resulting effects.
matching [19], it is not as effective as DepthAnythingV2
Effects of FoundationStereo Dataset. We study whether
in the stereo matching task, possibly due to its less task-
to include FSD dataset with the existing public datasets for
relevance and its limited resolution to reason high-precision
training our foundation model described in Sec. 4.1. Results
pixel-level correspondence. We then study different design
are shown in Tab. 7 (right).
choices from Fig. 3. Surprisingly, while being simple, we
found (c) significantly surpasses the alternatives. We hy-
pothesize the latest feature before the final output head pre-
5. Conclusion
serves high-resolution and fine-grained semantic and geo- We introduced FoundationStereo, a foundation model for
metric priors that are suitable for subsequent cost volume stereo depth estimation that achieves strong zero-shot gen-
construction and filtering process. We also experimented eralization across various domains without fine-tuning. We
whether to freeze the adapted ViT model. As expected, un- envision such a foundation model can facilitate broader
freezing ViT corrupts the pretrained monocular priors, lead- adoption of stereo estimation models in practical applica-
ing to degraded performance. tions. Despite its remarkable generalization, it has several
AHCF Design Choices. As shown in Tab. 6, for DT mod- limitations. First, our model is not yet optimized for ef-
ule we study different position embedding (row 1-2); dif- ficiency, which takes 0.7s on image size of 375×1242 on
ferent feature scale to perform transformer (row 3-4); trans- NVIDIA A100 GPU. Future work could explore adapting
former over the full cost-volume or only along the disparity distillation and pruning techniques applied to other vision
dimension (row 5-6); different placements of DT module foundation models [13, 87]. Second, our dataset FSD in-
relative to the hourglass network (row 7-9). Specifically, cludes a limited collection of transparent objects. Robust-
RoPE [55] encodes relative distances between tokens in- ness could be further enhanced by augmenting with a larger
stead of absolute positions, making it more adaptive to vary- diversity of fully transparent objects during training.
References nel attention network for stereo matching. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
[1] Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So Recognition (CVPR), pages 27768–27777, 2024. 1, 2, 7
Kweon, Kyung-Soo Kim, and Soohyun Kim. Correlate-and-
[15] Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao
excite: Real-time stereo matching via guided cost volume
Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and
excitation. In 2021 IEEE/RSJ International Conference on
Zongyuan Ge. Hierarchical neural architecture search for
Intelligent Robots and Systems (IROS), pages 3542–3548.
deep stereo matching. Proceedings of Neural Information
IEEE, 2021. 4
Processing Systems (NeurIPS), 33:22158–22169, 2020. 7
[2] Wei Bao, Wei Wang, Yuhua Xu, Yulan Guo, Siyu Hong, and
[16] François Chollet. Xception: Deep learning with depthwise
Xiaohu Zhang. InStereo2k: a large real dataset for stereo
separable convolutions. In Proceedings of the IEEE/CVF
matching in indoor scenes. Science China Information Sci-
Conference on Computer Vision and Pattern Recognition
ences, 63:1–11, 2020. 2, 6
(CVPR), pages 1251–1258, 2017. 5
[3] Luca Bartolomei, Fabio Tosi, Matteo Poggi, and Stefano
Mattoccia. Stereo anywhere: Robust zero-shot deep stereo [17] WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad,
matching even where either stereo or mono fail. arXiv Alireza Bab-Hadiashar, and David Suter. ITSA: An
preprint arXiv:2412.04472, 2024. 2 information-theoretic approach to automatic shortcut avoid-
ance and domain generalization in stereo matching net-
[4] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter
works. In Proceedings of the IEEE/CVF Conference on
Wonka, and Matthias Müller. ZoeDepth: Zero-shot trans-
Computer Vision and Pattern Recognition (CVPR), pages
fer by combining relative and metric depth. arXiv preprint
13022–13032, 2022. 1, 2
arXiv:2302.12288, 2023. 3
[5] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, [18] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo-
Marcel Santos, Yichao Zhou, Stephan R Richter, and pher Ré. FlashAttention: Fast and memory-efficient exact
Vladlen Koltun. Depth Pro: Sharp monocular metric depth in attention with io-awareness. Proceedings of Neural Informa-
less than a second. arXiv preprint arXiv:2410.02073, 2024. tion Processing Systems (NeurIPS), 35:16344–16359, 2022.
3 5
[6] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and [19] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Ab-
Michael J Black. A naturalistic open source movie for optical hishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun,
flow evaluation. In Proceedings of the European Conference Leonidas Guibas, Justin Johnson, and Varun Jampani. Prob-
on Computer Vision (ECCV), pages 611–625, 2012. 2, 3, 6 ing the 3D awareness of visual foundation models. In Pro-
[7] Yohann Cabon, Naila Murray, and Martin Humenberger. Vir- ceedings of the IEEE/CVF Conference on Computer Vision
tual KITTI 2. arXiv preprint arXiv:2001.10773, 2020. 2, 6 and Pattern Recognition (CVPR), pages 21795–21806, 2024.
8
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou,
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- [20] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we
ing properties in self-supervised vision transformers. In Pro- ready for autonomous driving? the KITTI vision benchmark
ceedings of the IEEE International Conference on Computer suite. In Proceedings of the IEEE/CVF Conference on Com-
Vision (ICCV), pages 9650–9660, 2021. 3 puter Vision and Pattern Recognition (CVPR), pages 3354–
[9] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo 3361, 2012. 2, 6, 7
matching network. In Proceedings of the IEEE/CVF Confer- [21] Rui Gong, Weide Liu, Zaiwang Gu, Xulei Yang, and
ence on Computer Vision and Pattern Recognition (CVPR), Jun Cheng. Learning intra-view and cross-view geomet-
pages 5410–5418, 2018. 4 ric knowledge for stereo matching. In Proceedings of
[10] Tianyu Chang, Xun Yang, Tianzhu Zhang, and Meng Wang. the IEEE/CVF Conference on Computer Vision and Pattern
Domain generalized stereo matching via hierarchical visual Recognition (CVPR), pages 20752–20762, 2024. 1, 2
transformation. In Proceedings of the IEEE/CVF Conference [22] Tongfan Guan, Chen Wang, and Yun-Hui Liu. Neural
on Computer Vision and Pattern Recognition (CVPR), pages markov random field for stereo matching. In Proceedings of
9559–9568, 2023. 1, 2, 6 the IEEE/CVF Conference on Computer Vision and Pattern
[11] Liyan Chen, Weihan Wang, and Philippos Mordohai. Learn- Recognition, pages 5459–5469, 2024. 6, 7, 1
ing the distribution of errors in stereo matching for joint [23] Weiyu Guo, Zhaoshuo Li, Yongkui Yang, Zheng Wang, Rus-
disparity and uncertainty estimation. In Proceedings of sell H Taylor, Mathias Unberath, Alan Yuille, and Yingwei
the IEEE/CVF Conference on Computer Vision and Pattern Li. Context-enhanced stereo transformer. In Proceedings
Recognition (CVPR), pages 17235–17244, 2023. 1, 2 of the European Conference on Computer Vision (ECCV),
[12] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong pages 263–279, 2022. 2
Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for [24] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang,
dense predictions. ICLR, 2023. 3 and Hongsheng Li. Group-wise correlation stereo network.
[13] Zigeng Chen, Gongfan Fang, Xinyin Ma, and Xinchao In Proceedings of the IEEE/CVF Conference on Computer
Wang. 0.1% data makes segment anything slim. NeurIPS, Vision and Pattern Recognition (CVPR), pages 3273–3282,
2023. 8 2019. 4
[14] Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif chan- Deep residual learning for image recognition. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and [36] Lahav Lipson, Zachary Teed, and Jia Deng. RAFT-Stereo:
Pattern Recognition (CVPR), pages 770–778, 2016. 3 Multilevel recurrent field transforms for stereo matching. In
[26] Sergio Izquierdo, Mohamed Sayed, Michael Firman, International Conference on 3D Vision (3DV), pages 218–
Guillermo Garcia-Hernando, Daniyar Turmukhambetov, 227, 2021. 1, 2, 3, 5, 6, 7
Javier Civera, Oisin Mac Aodha, Gabriel J. Brostow, and [37] Biyang Liu, Huimin Yu, and Guodong Qi. GraftNet: To-
Jamie Watson. MVSAnywhere: Zero shot multi-view stereo. wards domain generalized stereo matching with a broad-
In CVPR, 2025. 3 spectrum and task-oriented feature. In Proceedings of the
[27] Junpeng Jing, Jiankun Li, Pengfei Xiong, Jiangyu Liu, IEEE/CVF Conference on Computer Vision and Pattern
Shuaicheng Liu, Yichen Guo, Xin Deng, Mai Xu, Lai Jiang, Recognition (CVPR), pages 13012–13021, 2022. 1, 2
and Leonid Sigal. Uncertainty guided adaptive warping for [38] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao
robust and efficient stereo matching. In Proceedings of the Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang,
IEEE International Conference on Computer Vision (ICCV), Hang Su, et al. Grounding DINO: Marrying DINO with
pages 3318–3327, 2023. 1, 2, 6 grounded pre-training for open-set object detection. In Pro-
[28] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia ceedings of the European Conference on Computer Vision
Neverova, Andrea Vedaldi, and Christian Rupprecht. Dy- (ECCV), 2024. 3
namicStereo: Consistent dynamic depth from stereo videos. [39] I Loshchilov. Decoupled weight decay regularization. ICLR,
In Proceedings of the IEEE/CVF Conference on Com- 2019. 6
puter Vision and Pattern Recognition (CVPR), pages 13229– [40] Muhammad Maaz, Abdelrahman Shaker, Hisham
13239, 2023. 2 Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muham-
[29] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- mad Anwer, and Fahad Shahbaz Khan. EdgeNeXt:
zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- Efficiently amalgamated cnn-transformer architecture for
ing diffusion-based image generators for monocular depth mobile vision applications. In Proceedings of the European
estimation. In Proceedings of the IEEE/CVF Conference Conference on Computer Vision (ECCV), pages 3–20, 2022.
on Computer Vision and Pattern Recognition (CVPR), pages 3
9492–9502, 2024. 3 [41] Yamin Mao, Zhihua Liu, Weiming Li, Yuchao Dai, Qiang
[30] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Wang, Yun-Tae Kim, and Hong-Seok Lee. UASNet: Un-
Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. certainty adaptive sampling network for deep stereo match-
End-to-end learning of geometry and context for deep stereo ing. In Proceedings of the IEEE International Conference on
regression. In Proceedings of the IEEE International Con- Computer Vision (ICCV), pages 6311–6319, 2021. 1, 2
ference on Computer Vision (ICCV), pages 66–75, 2017. 5 [42] D. Marr and T. Poggio. Cooperative computation of stereo
[31] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- disparity. Science, 194:283–287, 1976. 1
win Balakrishna, Sudeep Dasari, Siddharth Karamcheti, [43] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer,
Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yun- Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A
liang Chen, Kirsty Ellis, et al. DROID: A large-scale large dataset to train convolutional networks for disparity,
in-the-wild robot manipulation dataset. arXiv preprint optical flow, and scene flow estimation. In Proceedings of
arXiv:2403.12945, 2024. 7 the IEEE/CVF Conference on Computer Vision and Pattern
[32] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Recognition (CVPR), pages 4040–4048, 2016. 2, 3, 6
Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- [44] Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nali-
head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- vayko, and Andrés Bruhn. Spring: A high-resolution high-
thing. In Proceedings of the IEEE International Conference detail dataset and benchmark for scene flow, optical flow and
on Computer Vision (ICCV), pages 4015–4026, 2023. 1, 3 stereo. In Proc. IEEE/CVF Conference on Computer Vision
[33] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Ground- and Pattern Recognition (CVPR), 2023. 3
ing image matching in 3D with MASt3R. arXiv preprint [45] Moritz Menze and Andreas Geiger. Object scene flow for au-
arXiv:2406.09756, 2024. 3 tonomous vehicles. In Proceedings of the IEEE/CVF Confer-
[34] Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei ence on Computer Vision and Pattern Recognition (CVPR),
Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng pages 3061–3070, 2015. 2, 6, 7
Liu. Practical stereo matching via cascaded recurrent net- [46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy
work with adaptive correlation. In Proceedings of the Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez,
IEEE/CVF Conference on Computer Vision and Pattern Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al.
Recognition (CVPR), pages 16263–16272, 2022. 1, 2, 3, 6, DINOv2: Learning robust visual features without supervi-
7 sion. TMLR, 2024. 1, 3, 8
[35] Zhaoshuo Li, Xingtong Liu, Nathan Drenkow, Andy Ding, [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Francis X Creighton, Russell H Taylor, and Mathias Un- Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
berath. Revisiting stereo depth estimation from a sequence- Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
to-sequence perspective with transformers. In Proceedings ing transferable visual models from natural language super-
of the IEEE International Conference on Computer Vision vision. In International Conference on Machine Learning
(ICCV), pages 6197–6206, 2021. 1, 2, 5 (ICML), pages 8748–8763, 2021. 3
[48] Pierluigi Zama Ramirez, Alex Costanzino, Fabio Tosi, Mat- [59] Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger.
teo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Smd-nets: Stereo mixture density networks. In Proceedings
Di Stefano. Booster: A benchmark for depth from images of the IEEE/CVF conference on computer vision and pattern
of specular and transparent surfaces. IEEE Transactions on recognition, pages 8942–8952, 2021. 3
Pattern Analysis and Machine Intelligence (PAMI), 2023. 2, [60] Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo
1 Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Ste-
[49] Zhibo Rao, Bangshu Xiong, Mingyi He, Yuchao Dai, Renjie fano. Neural disparity refinement. IEEE Transactions on
He, Zhelun Shen, and Xing Li. Masked representation learn- Pattern Analysis and Machine Intelligence (PAMI), 2024. 1,
ing for domain generalized stereo matching. In Proceedings 2
of the IEEE/CVF Conference on Computer Vision and Pat- [61] Jonathan Tremblay, Thang To, and Stan Birchfield. Falling
tern Recognition (CVPR), pages 5435–5444, 2023. 1, 2, 6 things: A synthetic dataset for 3d object detection and pose
[50] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang estimation. In Proceedings of the IEEE Conference on
Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Computer Vision and Pattern Recognition Workshops, pages
Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: 2038–2041, 2018. 2, 3, 6
Segment anything in images and videos. arXiv preprint [62] Jonathan Tremblay, Thang To, Balakumar Sundaralingam,
arXiv:2408.00714, 2024. 3 Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose
[51] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, estimation for semantic robotic grasping of household ob-
Greg Krathwohl, Nera Nešić, Xi Wang, and Porter West- jects. In Conference on Robot Learning (CoRL), pages 306–
ling. High-resolution stereo datasets with subpixel-accurate 316, 2018. 3
ground truth. In Pattern Recognition: 36th German Confer- [63] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
ence, GCPR 2014, Münster, Germany, September 2-5, 2014, reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Proceedings 36, pages 31–42. Springer, 2014. 2, 6, 7, 1 Polosukhin. Attention is all you need. Advances in Neural
Information Processing Systems (NeurIPS), 30, 2017. 5
[52] Thomas Schops, Johannes L Schonberger, Silvano Galliani,
Torsten Sattler, Konrad Schindler, Marc Pollefeys, and An- [64] Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng,
dreas Geiger. A multi-view stereo benchmark with high- Kaiyong Zhao, and Xiaowen Chu. IRS: A large naturalistic
resolution images and multi-camera videos. In Proceedings indoor robotics stereo dataset to train deep models for dis-
of the IEEE/CVF Conference on Computer Vision and Pat- parity and surface normal estimation. In IEEE International
tern Recognition (CVPR), pages 3260–3269, 2017. 2, 6, 7 Conference on Multimedia and Expo (ICME), 2021. 2, 3
[65] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris
[53] Zhelun Shen, Yuchao Dai, and Zhibo Rao. CFNet: Cascade
Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D
and fused cost volume for robust stereo matching. In Pro-
vision made easy. In Proceedings of the IEEE/CVF Confer-
ceedings of the IEEE/CVF Conference on Computer Vision
ence on Computer Vision and Pattern Recognition (CVPR),
and Pattern Recognition (CVPR), pages 13906–13915, 2021.
pages 20697–20709, 2024. 3
1, 2
[66] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu,
[54] Zhelun Shen, Yuchao Dai, Xibin Song, Zhibo Rao, Dingfu
Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Se-
Zhou, and Liangjun Zhang. PCW-Net: Pyramid combina-
bastian Scherer. TartanAir: A dataset to push the limits of
tion and warping cost volume for stereo matching. In Pro-
visual slam. In IEEE/RSJ International Conference on Intel-
ceedings of the European Conference on Computer Vision
ligent Robots and Systems (IROS), pages 4909–4916, 2020.
(ECCV), pages 280–297, 2022. 1, 2
2, 3
[55] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen [67] Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang.
Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Selective-Stereo: Adaptive frequency information selection
rotary position embedding. Neurocomputing, 568:127063, for stereo matching. In Proceedings of the IEEE/CVF
2024. 8 Conference on Computer Vision and Pattern Recognition
[56] Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh (CVPR), pages 19701–19710, 2024. 1, 2, 5, 6, 7
Kowdle, Sean Fanello, and Sofien Bouaziz. HITNet: Hier- [68] Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy,
archical iterative tile refinement network for real-time stereo Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela
matching. In Proceedings of the IEEE/CVF Conference on Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme Re-
Computer Vision and Pattern Recognition (CVPR), pages vaud. CroCo v2: Improved cross-view completion pre-
14362–14372, 2021. 7 training for stereo matching and optical flow. In Proceedings
[57] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field of the IEEE International Conference on Computer Vision
transforms for optical flow. In Proceedings of the European (ICCV), pages 17969–17980, 2023. 1, 2, 5, 7
Conference on Computer Vision (ECCV), pages 402–419, [69] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield.
2020. 2, 5 FoundationPose: Unified 6D pose estimation and tracking
[58] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Woj- of novel objects. In Proceedings of the IEEE/CVF Confer-
ciech Zaremba, and Pieter Abbeel. Domain randomization ence on Computer Vision and Pattern Recognition (CVPR),
for transferring deep neural networks from simulation to the pages 17868–17879, 2024. 3
real world. In IEEE/RSJ International Conference on Intel- [70] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Atten-
ligent Robots and Systems (IROS), pages 23–30, 2017. 5 tion concatenation volume for accurate and efficient stereo
matching. In Proceedings of the IEEE/CVF Conference on [83] Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas
Computer Vision and Pattern Recognition (CVPR), pages Guibas, and Jitendra Malik. Side-tuning: a baseline for net-
12981–12990, 2022. 7 work adaptation via additive side networks. In Proceedings
[71] Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. of the European Conference on Computer Vision (ECCV),
Iterative geometry encoding volume for stereo matching. In pages 698–714, 2020. 3
Proceedings of the IEEE/CVF Conference on Computer Vi- [84] Yongjian Zhang, Longguang Wang, Kunhong Li, Yun Wang,
sion and Pattern Recognition, pages 21919–21928, 2023. 2, and Yulan Guo. Learning representations from foundation
4, 5, 7 models for domain generalized stereo matching. In European
[72] Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Junda Cheng, Conference on Computer Vision, pages 146–162. Springer,
Chunyuan Liao, and Xin Yang. IGEV++: Iterative multi- 2024. 1, 2, 6
range geometry encoding volumes for stereo matching. [85] Haoliang Zhao, Huizhou Zhou, Yongjun Zhang, Yong Zhao,
arXiv preprint arXiv:2409.00638, 2024. 2, 6 Yitong Yang, and Ting Ouyang. EAI-Stereo: Error aware
[73] Haofei Xu and Juyong Zhang. AANet: Adaptive aggrega- iterative network for stereo matching. In Proceedings of the
tion network for efficient stereo matching. In Proceedings of Asian Conference on Computer Vision (ACCV), pages 315–
the IEEE/CVF Conference on Computer Vision and Pattern 332, 2022. 7
Recognition (CVPR), pages 1959–1968, 2020. 1, 2 [86] Haoliang Zhao, Huizhou Zhou, Yongjun Zhang, Jie Chen,
[74] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Yitong Yang, and Yong Zhao. High-frequency stereo match-
Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, ing network. In Proceedings of the IEEE/CVF Conference
stereo and depth estimation. IEEE Transactions on Pattern on Computer Vision and Pattern Recognition (CVPR), pages
Analysis and Machine Intelligence (PAMI), 2023. 7 1327–1336, 2023. 1, 2
[75] Gengshan Yang, Joshua Manela, Michael Happold, and [87] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu,
Deva Ramanan. Hierarchical deep stereo matching on high- Min Li, Ming Tang, and Jinqiao Wang. Fast segment any-
resolution images. In Proceedings of the IEEE/CVF Confer- thing. arXiv preprint arXiv:2306.12156, 2023. 8
ence on Computer Vision and Pattern Recognition (CVPR),
pages 5515–5524, 2019. 2
[76] Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng,
Jianping Shi, and Bolei Zhou. DrivingStereo: A large-scale
dataset for stereo matching in autonomous driving scenar-
ios. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages 899–
908, 2019. 2
[77] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing
Wang, and Feng Zheng. Track anything: Segment anything
meets videos. arXiv preprint arXiv:2304.11968, 2023. 3
[78] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi
Feng, and Hengshuang Zhao. Depth anything: Unleashing
the power of large-scale unlabeled data. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 10371–10381, 2024. 1, 3
[79] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao-
gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any-
thing v2. In Proceedings of Neural Information Processing
Systems (NeurIPS), 2024. 1, 2, 3, 4, 8
[80] Menglong Yang, Fangrui Wu, and Wei Li. WaveletStereo:
Learning wavelet coefficients of disparity map in stereo
matching. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages
12885–12894, 2020. 1, 2
[81] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and
Philip HS Torr. GA-Net: Guided aggregation net for end-
to-end stereo matching. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 185–194, 2019. 7
[82] Feihu Zhang, Xiaojuan Qi, Ruigang Yang, Victor Prisacariu,
Benjamin Wah, and Philip Torr. Domain-invariant stereo
matching networks. In Proceedings of the European Con-
ference on Computer Vision (ECCV), pages 420–439, 2020.
1, 2, 6
FoundationStereo: Zero-Shot Stereo Matching
Supplementary Material
6. ETH3D Leaderboard objects. We compare with the most competitive methods
from Fig. 5 (main paper) in the zero-shot setting. The quan-
At the time of submission, our fine-tuned model ranks titative and qualitative results are shown below.
1st on the ETH3D leaderboard, significantly outperforming IGEV

both published and unpublished works. The screenshot is

shown in Fig. 6. Half
Methods BP1 BP-2 BP-3 EPE
Selective-IGEV 23.8 15.0 12.0 6.6
Selective-IGEV Ours
7. Middlebury Leaderboard IGEV
Ours
30.8
19.0
22.3
9.6
19.0
6.7
22.7
2.2

At the time of submission, our fine-tuned model ranks 1st

on the Middlebury leaderboard, significantly outperforming
both published and unpublished works. The screenshot is
shown in Fig. 8. 10. More Results on Middlebury Dataset
We compare with competitive methods that released their
8. More Ablation Study on Synthetic Data public weights in zero-shot on Middlebury, shown in below
table. Since NMRF [22] did not report their evaluated Mid-
Effects of Self-Curation. We study the effectiveness of
dlebury resolution, we rerun their released weights on all
self-curation pipeline introduced in Sec. 3.5. When dis-
resolutions. At full resolution, maximum disparity 320 is
abling the self-curation while keeping the same data size,
used for FoundationStereo. Across all resolutions, ours sig-
the synthetic dataset involves ambiguous samples that con-
nificantly outperforms baselines. We also report the peak
fuse the learning process, leading to slight performance
memory usage and running time averaged across the dataset
drop when evaluated on Middlebury [51] dataset.
on the same hardware, particularly single GPU 3090. On
half and quarter resolutions, our peak memory occurs at
Variation BP2
STA module. On full resolution, it occurs at DT module.
W/ self-curation 1.15 Despite the speed limitation which is not the focus when de-
W/o self-curation 1.27
veloping this work, ours can successfully run on a desktop
Table 8. Effectiveness of self-curation pipeline when generating synthetic GPU. Pruning or distillation remains an interesting future
data.
work to improve speed and memory footprint.
Effects of FSD for Other Methods. Table 2 (main paper) Full Half Quarter
Methods peak peak peak
indicates benefits by introducing FSD for FoundationStereo BP-2
mem (G)
time (s) BP-2
mem (G)
time (s) BP-2
mem (G)
time (s)

model. To answer the question of whether FSD can benefit Selective-

IGEV[61]
12.9 6.9 2.52 9.2 1.7 0.72 7.0 0.5 0.25
other methods beyond FoundationStereo, we now train rep- IGEV[33] 13.1 6.3 2.06 8.8 1.6 0.53 6.4 0.5 0.18
IGEV++[66] 12.7 13.1 2.12 7.8 3.4 0.50 6.3 0.9 0.15
resentative works IGEV and Selective-IGEV on FSD and NMRF[20] 35.3 8.1 0.95 10.9 1.8 0.20 5.0 0.5 0.05
Ours 4.8 18.5 8.14 1.1 10.5 2.97 1.3 2.3 0.55
compare with their counterparts trained on Scene Flow. As
Table 10. Results on varying resolutions in Middlebury.
shown in the table below, for both methods, our proposed
FSD effectively boosts the performance compared to the
commonly used Scene Flow dataset.
11. More Details of Synthetic Data Generation
Middlebury ETH3D KITTI-12 KITTI-15
Methods Train data BP-2 BP-1 D1 D1 Tooling and Assets. The dataset generation is built on
IGEV Scene Flow 8.8 4.0 5.2 5.7 NVIDIA Omniverse. We use RTX path-tracing with 32
IGEV FSD 7.8 3.5 3.2 4.7 to 128 samples per pixel for high-fidelity photorealistic
Selective-IGEV Scene Flow 9.2 5.7 4.5 5.6
Selective-IGEV FSD 7.9 3.5 3.0 4.4
rendering. The data generation is performed across 48
NVIDIA A40 GPUs for 10 days. There are more than 5K
Table 9. Effects of FSD for other methods.
object assets collected from varying sources including artist
designs and 3D scanning with high-frequency geometry de-
9. Results on Translucent Objects tails. Object assets are divided into the groups of: furniture,
open containers, vehicles, robots, floor tape, free-standing
We evaluate on Booster [48] (half resolution), which is a walls, stairs, plants, forklifts, dynamically animated digi-
challenging dataset consisting of specular and transparent tal humans, other obstacles and distractors. Each group is
Figure 6. ETH3D leaderboard screenshot. Our fine-tuned foundation model (red box) ranks 1st at the time of submission.

Figure 7. Examples scene models involving factory, hospital, wood attic, office, grocery store and warehouse. In the third column, we demonstrate an
example of metallic material randomization being applied to augment scene diversity. The last column shows comparison of a warehouse between the real
(bottom) and our simulated digital twin (top) in high fidelity.

defined with a separate randomization range for sampling butions. Next, objects are spawned into the scene in two
locations, scales and appearances. In addition, we curated different methods to randomize the scene configuration: 1)
12 large scene models (Fig. 7), 16 skybox images, more camera is spawned in a random pose, and objects are added
than 150 materials, and 400 textures for tiled wrapping on relative to the camera at random locations; 2) objects are
object geometries for appearance augmentation. These tex- spawned near a random location, and the camera is spawned
tures are obtained from real-world photos and procedurally nearby and oriented to the center of mass of the object clut-
generated random patterns. ter.

Camera Configuration. For each data sample, we first ran- Layout Configuration. We generate layouts in two kinds
domly sample the stereo baseline camera focal length to di- of styles: chaotic and realistic. Such combination of the
versify the coverage of field-of-views and disparity distri- more realistic structured layouts with the more randomized
Figure 8. Middlebury leaderboard screenshot. Our fine-tuned foundation model (red box) ranks 1st at the time of submission.

Figure 9. Disparity distribution in our proposed FSD.

setups with flying objects has been shown to benefit sim-to- to 2 seconds to create physically realistic layouts with no
real generalization [62]. Specifically, chaotic-style scenes penetration, involving both settled and falling objects. Ma-
involve large number of flying distractors and simple scene terials and scales native to object assets are maintained and
layouts which consists of infinitely far skybox and a back- more natural lighting is applied. Among the realistic-style
ground plane. The lighting and object appearances (texture data, we further divide the scenes into three types which de-
and material) are highly randomized. The realistic-style termine what categories of objects are selected to compose
data uses indoor and outdoor scene models where the cam- the scene for more consistent semantics:
era is restricted to locate at predefined areas. Object assets • Navigation - camera poses are often in parallel to the
are dropped and applied with physical properties for colli- ground and objects are often spawned further away. Ob-
sion. The simulation is performed randomly between 0.25 jects such as free-standing walls, furniture, and digital
humans are sampled with higher probability.
• Driving - camera is often in parallel to the ground above
the ground and objects are often spawned further away.
Objects such as vehicles, digital humans, poles, signs and
speed bumps are sampled with higher probability.
• Manipulation - camera is oriented to face front or down-
ward as in ego-centric views and objects are often
spawned in closer range to resemble interaction scenar-
ios. Objects such as household or grocery items, open
containers, robotic arms are sampled with higher proba-
bility.
Lighting Configuration. Light types include global illu-
mination, directed sky rays, lights baked-into 3D scanned
assets, and light spheres which add dynamic lighting when
spawned near to surfaces. Light colors, intensities and di-
rections are randomized. Lighting vibes such as daytime,
dusk and night are included within the random sampling
ranges.
Disparity Distribution. Fig. 9 shows the disparity distribu-
tion of our FSD dataset.

12. Acknowledgement
We would like to thank Gordon Grigor, Jack Zhang, Xu-
tong Ren, Karsten Patzwaldt, Hammad Mazhar and other
NVIDIA Isaac team members for their tremendous engi-
neering support and valuable discussions.

Stereo Match Article
No ratings yet
Stereo Match Article
32 pages
RAFT-Stereo: Multilevel Recurrent Field Transforms For Stereo Matching
No ratings yet
RAFT-Stereo: Multilevel Recurrent Field Transforms For Stereo Matching
10 pages
Revisiting Depth Completion From A Stereo Matching Perspective For Cross-Domain Generalization
No ratings yet
Revisiting Depth Completion From A Stereo Matching Perspective For Cross-Domain Generalization
22 pages
GCNet
No ratings yet
GCNet
10 pages
2007.12140v5
No ratings yet
2007.12140v5
18 pages
3D Motion Learning from Stereo Videos
No ratings yet
3D Motion Learning from Stereo Videos
17 pages
Deep Learning Stereo Vision at The Edge: Luca Puglia and Cormac Brick
No ratings yet
Deep Learning Stereo Vision at The Edge: Luca Puglia and Cormac Brick
10 pages
Unsupervised Domain Adaptation For Depth Prediction From Images
No ratings yet
Unsupervised Domain Adaptation For Depth Prediction From Images
14 pages
Review of Stereo Matching Algorithms For 3d Vision
No ratings yet
Review of Stereo Matching Algorithms For 3d Vision
9 pages
Park Depth Prompting For Sensor-Agnostic Depth Estimation CVPR 2024 Paper
No ratings yet
Park Depth Prompting For Sensor-Agnostic Depth Estimation CVPR 2024 Paper
11 pages
ESPReSSo Efficient Slanted PatchMatch For Real-Time Spacetime Stereo
No ratings yet
ESPReSSo Efficient Slanted PatchMatch For Real-Time Spacetime Stereo
9 pages
Yang DrivingStereo A Large-Scale Dataset For Stereo Matching in Autonomous Driving CVPR 2019 Paper
No ratings yet
Yang DrivingStereo A Large-Scale Dataset For Stereo Matching in Autonomous Driving CVPR 2019 Paper
10 pages
Stereo Vision Algorithm Review
No ratings yet
Stereo Vision Algorithm Review
18 pages
A Low-Cost Stereo System For 3D Object Recognition
No ratings yet
A Low-Cost Stereo System For 3D Object Recognition
7 pages
ChiTransformer: Stereo Depth Innovation
No ratings yet
ChiTransformer: Stereo Depth Innovation
11 pages
双目-Revisiting Domain Generalized Stereo Matching Networks From a Feature Consistency Perspective
No ratings yet
双目-Revisiting Domain Generalized Stereo Matching Networks From a Feature Consistency Perspective
17 pages
A Stereo Perception Framework For Autonomous Vehicles
No ratings yet
A Stereo Perception Framework For Autonomous Vehicles
6 pages
NeRF-Supervised Deep Stereo Training
No ratings yet
NeRF-Supervised Deep Stereo Training
12 pages
Fdsafdsfsafasdfbrwa
No ratings yet
Fdsafdsfsafasdfbrwa
14 pages
Depth Estimation for Vision Experts
No ratings yet
Depth Estimation for Vision Experts
18 pages
Depthanything
No ratings yet
Depthanything
18 pages
Event-Based Stereo Depth Estimation - A Survey
No ratings yet
Event-Based Stereo Depth Estimation - A Survey
28 pages
Slide - 3DP - 09 - Beyond Classical Stereo Matching
No ratings yet
Slide - 3DP - 09 - Beyond Classical Stereo Matching
27 pages
Visually Imbalanced Stereo Matching
No ratings yet
Visually Imbalanced Stereo Matching
10 pages
Disparity Estimation of Stereo-Endoscopic Images Using Deep Generative Network
No ratings yet
Disparity Estimation of Stereo-Endoscopic Images Using Deep Generative Network
6 pages
Depth Anything - Unleashing The Power of Large-Scale Unlabeled Data
No ratings yet
Depth Anything - Unleashing The Power of Large-Scale Unlabeled Data
11 pages
Neural Networks for Depth Sensors
No ratings yet
Neural Networks for Depth Sensors
11 pages
4 3 Stereo-App
No ratings yet
4 3 Stereo-App
70 pages
Tailieuxanh Stereo Vision 498
No ratings yet
Tailieuxanh Stereo Vision 498
180 pages
Object Detection and Localization Using Stereo Cameras
No ratings yet
Object Detection and Localization Using Stereo Cameras
6 pages
3D Lidar and Stereo Fusion Using Stereo Matching Network With Conditional Cost Volume Normalization
No ratings yet
3D Lidar and Stereo Fusion Using Stereo Matching Network With Conditional Cost Volume Normalization
8 pages
Wang Selective-Stereo Adaptive Frequency Information Selection For Stereo Matching CVPR 2024 Paper
No ratings yet
Wang Selective-Stereo Adaptive Frequency Information Selection For Stereo Matching CVPR 2024 Paper
10 pages
Demon: Depth and Motion Network For Learning Monocular Stereo
No ratings yet
Demon: Depth and Motion Network For Learning Monocular Stereo
22 pages
MVD-Fusion: Single-View 3D Via Depth-Consistent Multi-View Generation
No ratings yet
MVD-Fusion: Single-View 3D Via Depth-Consistent Multi-View Generation
11 pages
AI Depth Estimation Overview
No ratings yet
AI Depth Estimation Overview
16 pages
3D Reconstruction
No ratings yet
3D Reconstruction
28 pages
Rukhovich ImVoxelNet Image To Voxels Projection For Monocular and Multi-View General-Purpose WACV 2022 Paper
No ratings yet
Rukhovich ImVoxelNet Image To Voxels Projection For Monocular and Multi-View General-Purpose WACV 2022 Paper
10 pages
CGI Stereo
No ratings yet
CGI Stereo
10 pages
Zoe Depth
No ratings yet
Zoe Depth
20 pages
2013vlsisocbc fpgaRealTime3DStereoMatch
No ratings yet
2013vlsisocbc fpgaRealTime3DStereoMatch
21 pages
A Cost Effective Estimation of Depth From Stereo Image Pairs Using Shallow Siamese Convolutional Networks
No ratings yet
A Cost Effective Estimation of Depth From Stereo Image Pairs Using Shallow Siamese Convolutional Networks
5 pages
Ark Task Documentation
No ratings yet
Ark Task Documentation
5 pages
Stereo Vision Algorithms for Resource-Limited Systems
No ratings yet
Stereo Vision Algorithms for Resource-Limited Systems
21 pages
Zero-Shot 3D Metric Prediction from Images
No ratings yet
Zero-Shot 3D Metric Prediction from Images
17 pages
Depth Estimation From A Single Image Using Deep Learned Phase Coded Mask
No ratings yet
Depth Estimation From A Single Image Using Deep Learned Phase Coded Mask
12 pages
Multi-View Stereo: A Tutorial: Washington University in St. Louis Furukawa@wustl - Edu
No ratings yet
Multi-View Stereo: A Tutorial: Washington University in St. Louis Furukawa@wustl - Edu
40 pages
The Application of Deep Learning in Stereo Matching and Disparity
No ratings yet
The Application of Deep Learning in Stereo Matching and Disparity
24 pages
Real-Time Temporal Stereo Matching
No ratings yet
Real-Time Temporal Stereo Matching
6 pages
Stereo Research Paper
No ratings yet
Stereo Research Paper
10 pages
The Fourth Monocular Depth Estimation Challenge
No ratings yet
The Fourth Monocular Depth Estimation Challenge
14 pages
Dattnet: Monocular Depth Estimation Network Based On Attention Mechanisms
No ratings yet
Dattnet: Monocular Depth Estimation Network Based On Attention Mechanisms
10 pages
Point Based Rendering Enhancement Via Deep Learning: December 16, 2018
No ratings yet
Point Based Rendering Enhancement Via Deep Learning: December 16, 2018
13 pages
双目-Robust Synthetic-To-Real Transfer for Stereo Matching
No ratings yet
双目-Robust Synthetic-To-Real Transfer for Stereo Matching
19 pages
Depth Reconstruction With Deep Neural Networks (Part 2)
No ratings yet
Depth Reconstruction With Deep Neural Networks (Part 2)
54 pages
FADNet A Fast and Accurate Network For Disparity Estimation
No ratings yet
FADNet A Fast and Accurate Network For Disparity Estimation
7 pages
Basak 2020
No ratings yet
Basak 2020
6 pages
Main Paper With Supp
No ratings yet
Main Paper With Supp
16 pages
Well Test Analysis Course Guide
No ratings yet
Well Test Analysis Course Guide
3 pages
Advanced USB Mouse DVR System
No ratings yet
Advanced USB Mouse DVR System
2 pages
HANA Security
100% (1)
HANA Security
34 pages
2021 DCME Annual Report SEC Form 17-A
No ratings yet
2021 DCME Annual Report SEC Form 17-A
294 pages
The Transmission of Electric Energy: ET2105 Electrical Power System Essentials
No ratings yet
The Transmission of Electric Energy: ET2105 Electrical Power System Essentials
39 pages
SQL Notebook by Rishabh
100% (1)
SQL Notebook by Rishabh
101 pages
93FWD 6
No ratings yet
93FWD 6
10 pages
Untitled 1
No ratings yet
Untitled 1
6 pages
Module 2: Problem Solving Techniques Unit 1
No ratings yet
Module 2: Problem Solving Techniques Unit 1
7 pages
MCM User Guides Latest
No ratings yet
MCM User Guides Latest
36 pages
Guide To The Installation of PV Systems 2nd Edition
100% (1)
Guide To The Installation of PV Systems 2nd Edition
32 pages
Jet Presentation Guidelines
No ratings yet
Jet Presentation Guidelines
5 pages
PCA Exam
No ratings yet
PCA Exam
19 pages
Horn Strobe - Sounder Cum Strob - TR-P2RL-P2GRL-CHSRL-SGWL
No ratings yet
Horn Strobe - Sounder Cum Strob - TR-P2RL-P2GRL-CHSRL-SGWL
4 pages
Minor Project Synopsis
No ratings yet
Minor Project Synopsis
6 pages
Book Review Fundamentals of Power Supply Design
No ratings yet
Book Review Fundamentals of Power Supply Design
3 pages
Solaredge 330 Kwinverterh 1300 Poweroptimizerbrochure 1706031532729
No ratings yet
Solaredge 330 Kwinverterh 1300 Poweroptimizerbrochure 1706031532729
2 pages
Modbus TCP Server - AOI Based Code For ControlLogix V 2.02.00
No ratings yet
Modbus TCP Server - AOI Based Code For ControlLogix V 2.02.00
16 pages
Viewse Um006 - en e (113 224)
No ratings yet
Viewse Um006 - en e (113 224)
112 pages
GLOBE PFG-red Vane Motors Guide
No ratings yet
GLOBE PFG-red Vane Motors Guide
26 pages
Project Management Quiz 1 SEO Title
No ratings yet
Project Management Quiz 1 SEO Title
4 pages
CorrectedROD-1 LED
No ratings yet
CorrectedROD-1 LED
3 pages
Fire Pump Installation Safety Guide
100% (4)
Fire Pump Installation Safety Guide
2 pages
SateeshReddyB (3 - 10) - Load Runner
No ratings yet
SateeshReddyB (3 - 10) - Load Runner
5 pages
Job Seekers' Resume Email Guide
100% (1)
Job Seekers' Resume Email Guide
8 pages
B.Tech Computer Science and Engineering-Swami Vivekanand Institute of Engineering and Technology, Patiala - Admission-Details - 20230405060310
No ratings yet
B.Tech Computer Science and Engineering-Swami Vivekanand Institute of Engineering and Technology, Patiala - Admission-Details - 20230405060310
6 pages
Understanding It Applications in Global Business: Understanding Information Management (IM) and Information Systems (IS)
100% (1)
Understanding It Applications in Global Business: Understanding Information Management (IM) and Information Systems (IS)
31 pages
PLC Timer and Pulse Generator Guide
No ratings yet
PLC Timer and Pulse Generator Guide
10 pages
Streaming Scan Network
0% (1)
Streaming Scan Network
10 pages
P0934-Line Pressure Sensor Circuit Low: Theory of Operation
No ratings yet
P0934-Line Pressure Sensor Circuit Low: Theory of Operation
5 pages

Foundationstereo: Zero-Shot Stereo Matching

Uploaded by

Foundationstereo: Zero-Shot Stereo Matching

Uploaded by

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen Matthew Trepte Joseph Aribido Jan Kautz

Abstract Recent stereo algorithms can achieve amazing results, al-

(a) (b) (c)

where d represents ground-truth disparity; |·|smooth denotes

both published and unpublished works. The screenshot is

At the time of submission, our fine-tuned model ranks 1st

model. To answer the question of whether FSD can benefit Selective-

Figure 9. Disparity distribution in our proposed FSD.

You might also like