FoundationStereo: Zero-Shot Stereo Matching
NVIDIA
Figure 1. Zero-shot prediction on in-the-wild images. Our method generalizes to diverse scenarios (indoor / outdoor), objects with challenging properties (textureless / reflective / translucent / thin-structured), complex illumination (shadows / strong exposure), and various viewing perspectives and sensing ranges.
Indoor:            ✗  ✗  ✗  ✓  ✓  ✓  ✓  ✗  ✓
Outdoor:           ✗  ✓  ✗  ✗  ✓  ✓  ✓  ✗  ✓
Driving:           ✗  ✓  ✗  ✗  ✗  ✗  ✗  ✗  ✓
Movie:             ✓  ✓  ✗  ✗  ✗  ✗  ✗  ✓  ✗
Simulator:         Blender, Blender, Blender, Unreal Engine, Unreal Engine, Unreal Engine, Unreal Engine, Blender, NVIDIA Omniverse
Rendering Realism: High, Low, High, High, High, High, High, High, High
Scenes:            10, 9, 0, 4, 18, 3, 8, 47, 12
Layout Realism:    Medium, Low, Low, High, High, Medium, High, High, High
Stereo Pairs:      1K†, 40K†, 200K, 103K†, 306K†, 62K, 7.7K, 6K†, 1000K
Resolution:        1024×436, 960×540, 1920×1080, 960×540, 640×480, 960×540, 3840×2160, 1920×1080, 1280×720
Reflections:       ✗  ✗  ✓  ✓  ✓  ✓  ✓  ✗  ✓
Camera Params:     Constant, Constant, Constant, Constant, Constant, Constant, Constant, Constant baseline with varying intrinsics, Varying baseline and intrinsics
Table 1. Synthetic datasets for training stereo algorithms (excluding test images with inaccessible ground truth); each column is one dataset and the last column is our FSD dataset. † indicates reduced diversity, caused by including many similar frames from video sequences.
stereo matching. Our dataset encompasses a wide range of scenarios, features the largest data volume to date, includes diverse 3D assets, captures stereo images under diversely randomized camera parameters, and achieves high fidelity in both rendering and spatial layouts.

Vision Foundation Models. Vision foundation models have advanced significantly across various vision tasks in 2D, 3D and multi-modal alignment. CLIP [47] leverages large-scale image-text pair training to align visual and textual modalities, enabling zero-shot classification and facilitating cross-modal applications. The DINO series [8, 38, 46] employs self-supervised learning for dense representation learning, effectively capturing detailed features critical for segmentation and recognition tasks. The SAM series [32, 50, 77] demonstrates high versatility in segmentation driven by various prompts such as points, bounding boxes and language. Similar advances also appear in 3D vision tasks: DUSt3R [65] and MASt3R [33] present generalizable frameworks for dense 3D reconstruction from uncalibrated and unposed cameras, and FoundationPose [69] develops a unified framework for 6D object pose estimation and tracking of novel objects. More closely related to this work, a number of efforts [4, 29, 78, 79] demonstrate strong generalization in monocular depth estimation and multi-view stereo [26]. Together, these approaches exemplify how, under the scaling law, vision foundation models are evolving to support robust applications across diverse scenarios without tedious per-domain fine-tuning.

3. Approach

The overall network architecture is shown in Fig. 2. The rest of this section describes its components.

3.1. Monocular Foundation Model Adaptation

To mitigate the sim-to-real gap when the stereo network is primarily trained on synthetic data, we leverage recent advances in monocular depth estimation trained on internet-scale real data [5, 79]. We use a CNN to adapt the ViT-based monocular depth estimation network to the stereo setup, thus synergizing the strengths of both CNN and ViT architectures.

We explored multiple design choices for combining CNN and ViT approaches, as outlined in Fig. 3 (left). In particular, (a) directly uses the feature pyramids from the DPT head of a frozen DepthAnythingV2 [79] without using CNN features; (b) resembles ViT-Adapter [12] by exchanging features between the CNN and ViT; and (c) applies a 4×4 convolution with stride 4 to downscale the feature before the DepthAnythingV2 final output head, which is then concatenated with the same-level CNN feature to obtain a hybrid feature at 1/4 scale. The side CNN network thus learns to adapt the ViT features [83] to the stereo matching task. Surprisingly, despite its simplicity, we found that (c) significantly surpasses the alternative choices on stereo matching, as shown in the experiments (Sec. 4.5). As a result, we adopt (c) as the main design of the STA module.

Formally, given a pair of left and right images I_l, I_r ∈ R^{H×W×3}, we employ EdgeNeXt-S [40] as the CNN module within STA to extract multi-level pyramid features, where the 1/4-level feature is equipped with the DepthAnythingV2 feature: f_l^{(i)}, f_r^{(i)} ∈ R^{C_i×(H/i)×(W/i)}, i ∈ {4, 8, 16, 32}. EdgeNeXt-S [40] is chosen for its memory efficiency and because larger CNN backbones did not yield additional benefits in our investigation. When forwarding to DepthAnythingV2, we first resize the image to be divisible by 14, to be consistent with its pretrained patch size. The STA weights are shared when applied to I_l and I_r.

Similarly, we employ STA to extract context features, with the difference that the CNN module consists of a sequence of residual blocks [25] and down-sampling layers. It generates context features at multiple scales: f_c^{(i)} ∈ R^{C_i×(H/i)×(W/i)}, i ∈ {4, 8, 16}, as in [36]. f_c participates in initializing the hidden state of the ConvGRU block and is also input to the ConvGRU block at each iteration, effectively guiding the iterative process with progressively refined contextual information.

Fig. 3 visualizes the power of the rich monocular prior, which helps to predict reliably in ambiguous regions that are challenging for a naive correspondence search along the epipolar line. Instead of using the raw monocular depth from DepthAnythingV2, which has scale ambiguity, we use its latent feature as a geometric prior extracted from both stereo images and compared through cost filtering, as described next.
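To make the adapter concrete, the following is a minimal PyTorch sketch of design (c) under stated assumptions: the frozen monocular backbone is treated as a black box that returns a feature map taken before its final output head, a 4×4 stride-4 convolution brings it to 1/4 resolution, and the result is concatenated with the CNN feature at the same scale. Module names, channel sizes, and the `vit_feature` argument are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class SideTuningAdapterSketch(nn.Module):
    """Hedged sketch of STA design (c): downscale a frozen ViT (monocular) feature
    with a 4x4, stride-4 convolution and fuse it with the 1/4-scale CNN feature by
    channel concatenation. Channel sizes are illustrative."""

    def __init__(self, vit_channels=128, cnn_channels=48, out_channels=128):
        super().__init__()
        # 4x4 conv, stride 4: brings the ViT feature to 1/4 of its input resolution.
        self.downscale = nn.Conv2d(vit_channels, out_channels, kernel_size=4, stride=4)
        # Light fusion of the concatenated hybrid feature (illustrative).
        self.fuse = nn.Sequential(
            nn.Conv2d(out_channels + cnn_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, vit_feature, cnn_feature_quarter):
        # vit_feature: (B, C_vit, H, W) feature taken before the monocular model's
        # final output head; cnn_feature_quarter: (B, C_cnn, H/4, W/4).
        x = self.downscale(vit_feature)                 # (B, C_out, H/4, W/4)
        x = torch.cat([x, cnn_feature_quarter], dim=1)  # hybrid 1/4-scale feature
        return self.fuse(x)

# Usage sketch: images are first resized so H and W are divisible by 14 (the ViT
# patch size) before running the frozen monocular model; STA weights are shared
# between the left and right views.
if __name__ == "__main__":
    sta = SideTuningAdapterSketch()
    vit_feat = torch.randn(1, 128, 224, 224)   # hypothetical frozen-ViT feature
    cnn_feat = torch.randn(1, 48, 56, 56)      # 1/4-scale CNN feature
    print(sta(vit_feat, cnn_feat).shape)       # torch.Size([1, 128, 56, 56])
```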
Figure 2. Overview of our proposed FoundationStereo. The Side-Tuning Adapter (STA) adapts the rich monocular priors from a frozen DepthAnythingV2 [79], combining them with fine-grained, high-frequency features from a multi-level CNN for unary feature extraction. Attentive Hybrid Cost Filtering (AHCF) combines the strengths of Axial-Planar Convolution (APC) filtering and a Disparity Transformer (DT) module to effectively aggregate features along the spatial and disparity dimensions of the 4D hybrid cost volume. An initial disparity is then predicted from the filtered cost volume and subsequently refined through GRU blocks. At each refinement step, the latest disparity is used to look up features from both the filtered hybrid cost volume and the correlation volume to guide the next refinement. The iteratively refined disparity becomes the final output.
Figure 3. Left: Design choices for the STA module. Right: Effects of the proposed STA and AHCF modules. "W/o STA" only uses a CNN to extract features. "W/o AHCF" uses a conventional 3D CNN-based hourglass network for cost volume filtering. Results are obtained via zero-shot inference without fine-tuning on the target dataset. STA leverages the rich monocular prior to reliably predict the lamp region with inconsistent lighting and the dark guitar sound hole. AHCF effectively aggregates spatial and long-range disparity context to accurately predict thin, repetitive structures.
3.2. Attentive Hybrid Cost Filtering

Hybrid Cost Volume Construction. Given the unary features f_l^{(4)}, f_r^{(4)} at 1/4 scale extracted in the previous step, we construct the cost volume V_C ∈ R^{C×(D/4)×(H/4)×(W/4)} with a combination of group-wise correlation and concatenation [24]:

\begin{aligned}
\mathbf{V}_{\text{gwc}}(g,d,h,w) &= \left\langle \widehat{f}_{l,g}^{(4)}(h,w),\; \widehat{f}_{r,g}^{(4)}(h,w-d) \right\rangle, \\
\mathbf{V}_{\text{cat}}(d,h,w) &= \left[ \text{Conv}(f_{l}^{(4)})(h,w),\; \text{Conv}(f_{r}^{(4)})(h,w-d) \right], \\
\mathbf{V}_{C}(d,h,w) &= \left[ \mathbf{V}_{\text{gwc}}(d,h,w),\; \mathbf{V}_{\text{cat}}(d,h,w) \right]
\end{aligned} \tag{1}
where f̂ denotes the L2-normalized feature for better training stability; ⟨·, ·⟩ represents the dot product; g ∈ {1, 2, ..., G} is the group index among the total G = 8 feature groups into which we evenly divide the features; d ∈ {1, 2, ..., D/4} is the disparity index; and [·, ·] denotes concatenation along the channel dimension. The group-wise correlation V_gwc harnesses the strengths of conventional correlation-based matching costs, offering a diverse set of similarity measurement features from each group. V_cat preserves unary features, including the rich monocular priors, by concatenating left and right features at shifted disparities. To reduce memory consumption, we linearly downsize the unary feature dimension to 14 using a convolution of kernel size 1 (weights are shared between f_l^{(4)} and f_r^{(4)}) before concatenation.
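A minimal sketch of the hybrid volume in Eq. (1), assuming 1/4-scale unary features and a maximum disparity of D/4; the group count, compressed channel size, tensor layout, and the inline 1×1 projection (a learned layer in the full model) are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def build_hybrid_cost_volume(f_l, f_r, max_disp, num_groups=8, cat_channels=14):
    """Sketch of Eq. (1): group-wise correlation volume + concatenation volume.
    f_l, f_r: (B, C, H, W) unary features at 1/4 resolution; max_disp: D/4."""
    B, C, H, W = f_l.shape

    # Group-wise correlation on L2-normalized features.
    fl_n = F.normalize(f_l, dim=1).view(B, num_groups, C // num_groups, H, W)
    fr_n = F.normalize(f_r, dim=1).view(B, num_groups, C // num_groups, H, W)

    # 1x1 convolution compressing the unary features before concatenation
    # (shared weights for left/right; created inline here just for illustration).
    proj = torch.nn.Conv2d(C, cat_channels, kernel_size=1)
    fl_c, fr_c = proj(f_l), proj(f_r)

    v_gwc = f_l.new_zeros(B, num_groups, max_disp, H, W)
    v_cat = f_l.new_zeros(B, 2 * cat_channels, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            v_gwc[:, :, d] = (fl_n * fr_n).mean(dim=2)
            v_cat[:, :, d] = torch.cat([fl_c, fr_c], dim=1)
        else:
            # Compare left pixel (h, w) with right pixel (h, w - d).
            v_gwc[:, :, d, :, d:] = (fl_n[..., d:] * fr_n[..., :-d]).mean(dim=2)
            v_cat[:, :, d, :, d:] = torch.cat([fl_c[..., d:], fr_c[..., :-d]], dim=1)

    # Final 4D hybrid volume: concatenate along the channel dimension.
    return torch.cat([v_gwc, v_cat], dim=1)   # (B, G + 2*cat_channels, D/4, H, W)

if __name__ == "__main__":
    fl, fr = torch.randn(1, 96, 80, 184), torch.randn(1, 96, 80, 184)
    print(build_hybrid_cost_volume(fl, fr, max_disp=104).shape)
```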
Next, we describe two sub-modules for effective cost volume filtering.

Axial-Planar Convolution (APC) Filtering. An hourglass network consisting of 3D convolutions, with three down-sampling blocks and three up-sampling blocks with residual connections, is leveraged for cost volume filtering [1, 71]. While 3D convolutions of kernel size 3×3×3 are commonly used for relatively small disparity sizes [9, 24, 71], we observe that they struggle with larger disparities when applied to high-resolution images, especially since the disparity dimension is expected to model the probability distribution for the initial disparity prediction. However, it is impractical to naively increase the kernel size due to the intensive memory consumption; in fact, even with a kernel size of 5×5×5 we observe unmanageable memory usage on an 80 GB GPU. This drastically limits the model's representation power when scaling up with a large amount of training data. We thus develop "Axial-Planar Convolution", which decouples a single 3×3×3 convolution into two separate convolutions: one over the spatial dimensions (kernel size Ks×Ks×1) and the other over disparity (1×1×Kd), each followed by BatchNorm and ReLU. APC can be regarded as a 3D version of Separable Convolution [16], with the difference that we only separate the spatial and disparity dimensions, without subdividing the channels into groups, which would sacrifice representation power. The disparity dimension is treated specially due to its uniquely encoded feature comparison within the cost volume. We use APC wherever possible in the hourglass network, except for the down-sampling and up-sampling layers.
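A sketch of one Axial-Planar Convolution block under stated assumptions: a spatial (Ks×Ks×1) convolution followed by a disparity (1×1×Kd) convolution, each with BatchNorm and ReLU. The default kernel sizes mirror the (3,3,1)/(1,1,17) choice studied in the ablations, but channel counts and the surrounding hourglass structure are illustrative.

```python
import torch
import torch.nn as nn

class AxialPlanarConv3d(nn.Module):
    """Sketch of APC: a 3D convolution decoupled into a spatial (Ks x Ks x 1)
    convolution and a disparity (1 x 1 x Kd) convolution, each followed by
    BatchNorm and ReLU. The cost volume is laid out as (B, C, D, H, W), so the
    spatial kernel is (1, Ks, Ks) and the disparity kernel is (Kd, 1, 1)."""

    def __init__(self, in_ch, out_ch, ks=3, kd=17):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(1, ks, ks),
                      padding=(0, ks // 2, ks // 2), bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.disparity = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, kernel_size=(kd, 1, 1),
                      padding=(kd // 2, 0, 0), bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, volume):
        # volume: (B, C, D/4, H/4, W/4) hybrid cost volume.
        return self.disparity(self.spatial(volume))

if __name__ == "__main__":
    apc = AxialPlanarConv3d(in_ch=36, out_ch=32)
    v = torch.randn(1, 36, 64, 40, 92)
    print(apc(v).shape)   # torch.Size([1, 32, 64, 40, 92])
```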
Disparity Transformer (DT). While prior works [35, 68] introduced transformer architectures into the unary feature extraction step to scale up stereo training, the cost filtering process is often overlooked, even though it remains an essential step in achieving accurate stereo matching by encapsulating correspondence information. We therefore introduce DT to further enhance long-range context reasoning within the 4D cost volume. Given V_C obtained in Eq. (1), we first apply a 3D convolution of kernel size 4×4×4 with stride 4 to downsize the cost volume. We then reshape the volume into a batch of token sequences, each with length equal to the disparity dimension. We apply position encoding before feeding it to a series of transformer encoder blocks (4 in our case), where FlashAttention [18] is leveraged to perform multi-head self-attention [63]. The process can be written as:

\begin{aligned}
&\mathbf{Q}_0 = \text{PE}\left( \mathbf{R}\left( \text{Conv}_{4\times4\times4}(\mathbf{V}_C) \right) \right) \in \mathbb{R}^{\left(\frac{H}{16}\times\frac{W}{16}\right) \times C \times \frac{D}{16}} \\
&\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \left[\text{head}_1, \ldots, \text{head}_h\right]\mathbf{W}_O, \quad \text{where head}_i = \text{FlashAttention}(\mathbf{Q}_i,\mathbf{K}_i,\mathbf{V}_i) \\
&\mathbf{Q}_1 = \text{Norm}\left( \text{MultiHead}(\mathbf{Q}_0,\mathbf{Q}_0,\mathbf{Q}_0) + \mathbf{Q}_0 \right) \\
&\mathbf{Q}_2 = \text{Norm}\left( \text{FFN}(\mathbf{Q}_1) + \mathbf{Q}_1 \right)
\end{aligned}
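A sketch of the Disparity Transformer under stated assumptions: the 4D volume is downsized with a 4×4×4, stride-4 3D convolution, reshaped so that each spatial location becomes a token sequence along the disparity axis, given a cosine positional encoding, and passed through a few transformer encoder layers. The standard `nn.TransformerEncoder` stands in for the FlashAttention-based blocks; layer counts and widths are illustrative.

```python
import torch
import torch.nn as nn

class DisparityTransformerSketch(nn.Module):
    """Sketch of DT: downsize the cost volume, then run self-attention along the
    disparity dimension for every spatial location."""

    def __init__(self, in_ch, embed_dim=128, num_layers=4, num_heads=4):
        super().__init__()
        self.downsize = nn.Conv3d(in_ch, embed_dim, kernel_size=4, stride=4)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, volume):
        # volume: (B, C, D, H, W) -> (B, E, D/4, H/4, W/4)
        x = self.downsize(volume)
        B, E, D, H, W = x.shape
        # One token sequence of length D/4 per spatial location.
        tokens = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, D, E)
        tokens = tokens + self._positional_encoding(D, E, tokens.device)
        tokens = self.encoder(tokens)
        return tokens.reshape(B, H, W, D, E).permute(0, 4, 3, 1, 2)

    @staticmethod
    def _positional_encoding(length, dim, device):
        # Standard sinusoidal (cosine) positional encoding over the disparity axis.
        pos = torch.arange(length, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, device=device)
                        * (-torch.log(torch.tensor(10000.0)) / dim))
        pe = torch.zeros(length, dim, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

if __name__ == "__main__":
    dt = DisparityTransformerSketch(in_ch=36)
    v = torch.randn(1, 36, 64, 40, 92)
    print(dt(v).shape)   # torch.Size([1, 128, 16, 10, 23])
```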
3.3. Iterative Refinement

An initial disparity d_0 is predicted from the filtered cost volume and subsequently refined with ConvGRU blocks (Fig. 2). At refinement iteration k, the update can be formulated as:

\begin{aligned}
&\mathbf{V}_{\text{corr}}(w',h,w) = \left\langle f_l^{(4)}(h,w),\; f_r^{(4)}(h,w') \right\rangle \\
&\mathbf{F_V}(h,w) = \left[ \mathbf{V}_C'(d_k,h,w),\; \mathbf{V}_{\text{corr}}(w-d_k,h,w) \right] \\
&x_k = \left[ \text{Conv}_v(\mathbf{F_V}),\; \text{Conv}_d(d_k),\; d_k,\; c \right] \\
&z_k = \sigma\left( \text{Conv}_z([h_{k-1}, x_k]) \right) \\
&r_k = \sigma\left( \text{Conv}_r([h_{k-1}, x_k]) \right) \\
&\hat{h}_k = \tanh\left( \text{Conv}_h([r_k \odot h_{k-1}, x_k]) \right) \\
&h_k = (1-z_k) \odot h_{k-1} + z_k \odot \hat{h}_k \\
&d_{k+1} = d_k + \text{Conv}_{\Delta}(h_k)
\end{aligned} \tag{10}

where ⊙ denotes the element-wise product; σ denotes the sigmoid; V_corr ∈ R^{(W/4)×(H/4)×(W/4)} is the pair-wise correlation volume; F_V represents the volume features looked up using the latest disparity; and c = ReLU(f_c) encodes the context feature from the left image, including the STA-adapted features (Sec. 3.1), which effectively guide the refinement process leveraging rich monocular priors.

We use three levels of GRU blocks to perform coarse-to-fine hidden state updates in each iteration, where the initial hidden states are produced from the context features h_0^{(i)} = tanh(f_c^{(i)}), i ∈ {4, 8, 16}. At each level, an attention-based selection mechanism [67] is leveraged to capture information at different frequencies. Finally, d_k is up-sampled to the full resolution using convex sampling [57].
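A sketch of a single ConvGRU refinement step following the gated update above, assuming 2D convolutions over the 1/4-scale feature maps; the cost-volume and correlation lookups are abstracted into a precomputed `x_k` input, and channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvGRUUpdateSketch(nn.Module):
    """Sketch of the ConvGRU refinement step (z, r, h-tilde gates) plus the
    disparity update head. x_k bundles looked-up volume features, the encoded
    current disparity, and the context feature c."""

    def __init__(self, hidden_ch=128, input_ch=256):
        super().__init__()
        kwargs = dict(kernel_size=3, padding=1)
        self.conv_z = nn.Conv2d(hidden_ch + input_ch, hidden_ch, **kwargs)
        self.conv_r = nn.Conv2d(hidden_ch + input_ch, hidden_ch, **kwargs)
        self.conv_h = nn.Conv2d(hidden_ch + input_ch, hidden_ch, **kwargs)
        self.delta_head = nn.Conv2d(hidden_ch, 1, kernel_size=3, padding=1)

    def forward(self, h_prev, x_k, d_k):
        z = torch.sigmoid(self.conv_z(torch.cat([h_prev, x_k], dim=1)))
        r = torch.sigmoid(self.conv_r(torch.cat([h_prev, x_k], dim=1)))
        h_tilde = torch.tanh(self.conv_h(torch.cat([r * h_prev, x_k], dim=1)))
        h_k = (1 - z) * h_prev + z * h_tilde
        d_next = d_k + self.delta_head(h_k)        # d_{k+1} = d_k + Conv_delta(h_k)
        return h_k, d_next

if __name__ == "__main__":
    gru = ConvGRUUpdateSketch()
    h = torch.zeros(1, 128, 80, 184)
    x = torch.randn(1, 256, 80, 184)
    d = torch.zeros(1, 1, 80, 184)
    h, d = gru(h, x, d)
    print(h.shape, d.shape)
```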
3.4. Loss Function

The model is trained with the following objective:

\mathcal{L} = \left| d_0 - \overline{d} \right|_{\text{smooth}} + \sum_{k=1}^{K} \gamma^{K-k} \left\| d_k - \overline{d} \right\|_1 \tag{11}

where \overline{d} denotes the ground-truth disparity.
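A sketch of the objective in Eq. (11): smooth-L1 on the initial disparity and exponentially weighted L1 over the K refined disparities. The value of γ below is an assumption (a common choice in iterative refinement methods), not a value taken from the paper.

```python
import torch
import torch.nn.functional as F

def foundation_stereo_loss(d_init, d_refined, d_gt, gamma=0.9):
    """Sketch of Eq. (11). d_init: (B,1,H,W) initial disparity; d_refined: list of
    K refined disparities d_1..d_K; d_gt: ground-truth disparity. gamma is illustrative."""
    loss = F.smooth_l1_loss(d_init, d_gt)
    K = len(d_refined)
    for k, d_k in enumerate(d_refined, start=1):
        loss = loss + gamma ** (K - k) * torch.mean(torch.abs(d_k - d_gt))
    return loss

if __name__ == "__main__":
    gt = torch.rand(1, 1, 64, 128) * 100
    d0 = torch.rand(1, 1, 64, 128) * 100
    preds = [torch.rand(1, 1, 64, 128) * 100 for _ in range(22)]
    print(foundation_stereo_loss(d0, preds, gt).item())
```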
Although such randomized data generation can in theory produce an unlimited amount of data and achieve large diversity, ambiguities are inevitably introduced, especially for less structured scenes with flying objects, which confuses the learning process. To eliminate those samples, we design an automatic iterative self-curation strategy. Fig. 4 demonstrates this process and the detected ambiguous samples. We start by training an initial version of FoundationStereo on FSD, after which it is evaluated on FSD. Samples where BP-2 (Sec. 4.2) is larger than 60% are regarded as ambiguous and are replaced by regenerating new ones. The training and curation processes are alternated to iteratively (twice in our case) update both FSD and FoundationStereo.
both FSD and FoundationStereo. Table 2. Zero-shot generalization results on four public datasets. The most
commonly used metrics for each dataset were adopted. In the first block,
all methods were trained only on Scene Flow. In the second block, meth-
4. Experiments ods are allowed to train on any existing datasets excluding the four target
domains. The weights and parameters are fixed for evaluation.
4. Experiments

4.1. Implementation Details

We implement FoundationStereo in PyTorch. The foundation model is trained on a mixed dataset consisting of our proposed FSD together with Scene Flow [43], Sintel [6], CREStereo [34], FallingThings [61], InStereo2K [2] and Virtual KITTI 2 [7]. We train FoundationStereo using the AdamW optimizer [39] for 200K steps with a total batch size of 128, evenly distributed over 32 NVIDIA A100 GPUs. The learning rate starts at 1e-4 and decays by 0.1 at 0.8 of the entire training process. Images are randomly cropped to 320×736 before being fed to the network. Data augmentations similar to [36] are performed. During training, 22 iterations are used in the GRU updates. In the following, unless otherwise mentioned, we use the same foundation model for zero-shot inference with 32 refinement iterations and a maximum disparity of 416.
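A sketch of the stated optimization setup; `model` is a stand-in module, and the milestone-based scheduler is one way to realize "decay by 0.1 at 0.8 of training", not necessarily the authors' exact implementation.

```python
import torch

model = torch.nn.Conv2d(3, 8, kernel_size=3)   # stand-in for the stereo network
total_steps = 200_000
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Decay the learning rate by 0.1 once 80% of the training steps are reached.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.8 * total_steps)], gamma=0.1)

for step in range(total_steps):
    # forward pass on a random 320x736 crop, compute loss, loss.backward(), ...
    optimizer.step()       # placeholder for the real update (no gradients here)
    optimizer.zero_grad()
    scheduler.step()
print(scheduler.get_last_lr())   # [1e-05] after the 80% milestone
```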
4.2. Benchmark Datasets and Metrics

Datasets. We consider five commonly used public datasets for evaluation. Scene Flow [43] is a synthetic dataset including three subsets: FlyingThings3D, Driving, and Monkaa. Middlebury [51] consists of indoor stereo image pairs with high-quality ground-truth disparity captured via structured light; unless otherwise mentioned, evaluations are performed at half resolution on non-occluded regions. ETH3D [52] provides grayscale stereo image pairs covering both indoor and outdoor scenarios. The KITTI 2012 [20] and KITTI 2015 [45] datasets feature real-world driving scenes, where sparse ground-truth disparity maps derived from LIDAR sensors are provided.

Metrics. "EPE" computes the average per-pixel disparity error. "BP-X" computes the percentage of pixels whose disparity error is larger than X pixels. "D1" computes the percentage of pixels whose disparity error is larger than both 3 pixels and 5% of the ground-truth disparity.
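A sketch of the three metrics under the definitions above; the per-benchmark validity and occlusion masking conventions are omitted for brevity.

```python
import torch

def epe(pred, gt):
    """Average per-pixel absolute disparity error."""
    return torch.mean(torch.abs(pred - gt))

def bp_x(pred, gt, x):
    """Percentage of pixels with disparity error larger than x pixels."""
    return 100.0 * torch.mean((torch.abs(pred - gt) > x).float())

def d1(pred, gt):
    """Percentage of pixels whose error exceeds both 3 px and 5% of the ground truth."""
    err = torch.abs(pred - gt)
    bad = (err > 3.0) & (err > 0.05 * gt)
    return 100.0 * torch.mean(bad.float())

if __name__ == "__main__":
    gt = torch.rand(1, 1, 64, 128) * 192
    pred = gt + torch.randn_like(gt)
    print(epe(pred, gt).item(), bp_x(pred, gt, 2).item(), d1(pred, gt).item())
```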
4.3. Zero-Shot Generalization Comparison

Benchmark Evaluation. Tab. 2 exhibits a quantitative comparison of zero-shot generalization results on four public real-world datasets. Even when trained solely on Scene Flow, our method consistently outperforms the comparison methods across all datasets, thanks to the efficacy of adapting rich monocular priors from vision foundation models. We further evaluate in a more realistic setup, allowing methods to train on any available dataset while excluding the target domain, to achieve optimal zero-shot inference results as required in practical applications.

In-the-Wild Generalization. We compare our foundation model against recent approaches that released checkpoints trained on a mixture of datasets, to resemble practical zero-shot application to in-the-wild images. Comparison methods include CroCo v2 [68], CREStereo [34], IGEV [71] and Selective-IGEV [67]. For each method, we select the best-performing checkpoint from its public release. In this evaluation, the four real-world benchmark datasets [20, 45, 51, 52] have been used for training the comparison methods, whereas they are not used in our fixed foundation model. Fig. 5 displays qualitative comparisons on various scenarios, including a robot scene from the DROID [31] dataset and custom captures covering indoor and outdoor settings.

Figure 5. Qualitative comparison of zero-shot inference on in-the-wild images. For each comparison method we select the best-performing checkpoint from its public release, which has been trained on a mixture of public datasets. These images exhibit challenging reflections, translucency, repetitive textures, complex illumination and thin structures, revealing the importance of our network architecture and large-scale training.
4.4. In-Domain Comparison

Tab. 3 presents a quantitative comparison on Scene Flow, where all methods follow the same official train and test split. Our FoundationStereo model outperforms the comparison methods by a large margin, reducing the previous best EPE from 0.41 to 0.33. Although in-domain training is not the focus of this work, the results reflect the effectiveness of our model design.

Method                EPE
LEAStereo [15]        0.78
GANet [81]            0.84
ACVNet [70]           0.48
IGEV-Stereo [71]      0.47
NMRF [22]             0.45
MoCha-Stereo [14]     0.41
Selective-IGEV [67]   0.44
Ours                  0.34

Table 3. Comparison of methods trained / tested on the Scene Flow train / test sets, respectively.
Tab. 4 exhibits a quantitative comparison on the ETH3D leaderboard (test set). For our approach, we perform evaluations in two settings. First, we fine-tune our foundation model on a mixture of the default training dataset (Sec. 4.1) and the ETH3D training set for another 50K steps, using the same learning rate schedule and data augmentation. Our model significantly surpasses the previous best approach by reducing more than half of the error rates, and it ranks 1st on the leaderboard at the time of submission. This indicates the great potential of transferring capability from our foundation model when in-domain fine-tuning is desired. Second, we also evaluated our foundation model without using any data from ETH3D. Remarkably, our foundation model's zero-shot inference achieves comparable or even better results than leading approaches that perform in-domain training.

Method                Zero-Shot   BP-0.5   BP-1.0   EPE
GMStereo [74]         ✗           5.94     1.83     0.19
HITNet [56]           ✗           7.83     2.79     0.20
EAI-Stereo [85]       ✗           5.21     2.31     0.21
RAFT-Stereo [36]      ✗           7.04     2.44     0.18
CREStereo [34]        ✗           3.58     0.98     0.13
IGEV-Stereo [71]      ✗           3.52     1.12     0.14
CroCo-Stereo [68]     ✗           3.27     0.99     0.14
MoCha-Stereo [14]     ✗           3.20     1.41     0.13
Selective-IGEV [67]   ✗           3.06     1.23     0.12
Ours (finetuned)      ✗           1.26     0.26     0.09
Ours                  ✓           2.31     1.52     0.13

Table 4. Results on the ETH3D leaderboard (test set). All methods except for the last row used the ETH3D training set for fine-tuning. Our fine-tuned version ranks 1st on the leaderboard at the time of submission. The last row is obtained via zero-shot inference from our foundation model.

In addition, our fine-tuned model also ranks 1st on the Middlebury leaderboard. See the appendix for details.

4.5. Ablation Study

We investigate different design choices for our model and dataset. Unless otherwise mentioned, we train on a randomly subsampled version (100K) of FSD to make the experiment scale more affordable. Given the Middlebury dataset's high-quality ground truth, results are evaluated on its training set to reflect zero-shot generalization. Since the focus of this work is to build a stereo matching foundation model with strong generalization, we do not deliberately limit model size while pursuing better performance.
Row   Variations                BP-2
1     DINOv2-L [46]             2.46
2     DepthAnythingV2-S [79]    2.22
3     DepthAnythingV2-B [79]    2.11
4     DepthAnythingV2-L [79]    1.97
5     STA (a)                   6.48
6     STA (b)                   2.22
7     STA (c)                   1.97
8     Unfreeze ViT              3.94
9     Freeze ViT                1.97

Table 5. Ablation study of the STA module. Variations (a-c) correspond to Fig. 3. The choices adopted in our full model are highlighted in green.

Row   Variations (DT)    BP-2      Row   Variations (APC)      BP-2
1     RoPE               2.19      10    (3,3,1), (1,1,5)      2.10
2     Cosine             1.97      11    (3,3,1), (1,1,9)      2.06
3     1/32               2.06      12    (3,3,1), (1,1,13)     2.01
4     1/16               1.97      13    (3,3,1), (1,1,17)     1.97
5     Full               2.25      14    (3,3,1), (1,1,21)     1.98
6     Disparity          1.97      15    (7,7,1), (1,1,17)     1.99
7     Pre-hourglass      2.06
8     Post-hourglass     2.20
9     Parallel           1.97

Table 6. Ablation study of the AHCF module. Left corresponds to DT, while right corresponds to APC. The choices adopted in our full model are highlighted in green.

Row   STA   APC   DT    BP-2      Row   FSD   BP-2
1                       2.48      1     ✗     2.34
2     ✓                 2.21      2     ✓     1.15
3     ✓     ✓           2.16
4     ✓           ✓     2.05
5     ✓     ✓     ✓     1.97

Table 7. Left: Ablation study of the proposed network modules. Right: Ablation study of whether to use the FSD dataset when training the foundation model described in Sec. 4.1. The choices adopted in our full model are highlighted in green.
STA Design Choices. As shown in Tab. 5, we first compare different vision foundation models for adapting rich monocular priors, including different model sizes of DepthAnythingV2 [79] and DINOv2-Large [46]. While DINOv2 previously exhibited promising results in correspondence matching [19], it is not as effective as DepthAnythingV2 for stereo matching, possibly due to its lower task relevance and its limited resolution for reasoning about high-precision pixel-level correspondence. We then study the different design choices from Fig. 3. Surprisingly, while being simple, we found that (c) significantly surpasses the alternatives. We hypothesize that the latest feature before the final output head preserves high-resolution, fine-grained semantic and geometric priors that are suitable for the subsequent cost volume construction and filtering process. We also experimented with whether to freeze the adapted ViT model. As expected, unfreezing the ViT corrupts the pretrained monocular priors, leading to degraded performance.

AHCF Design Choices. As shown in Tab. 6, for the DT module we study different position embeddings (rows 1-2); different feature scales at which to apply the transformer (rows 3-4); attention over the full cost volume or only along the disparity dimension (rows 5-6); and different placements of the DT module relative to the hourglass network (rows 7-9). Specifically, RoPE [55] encodes relative distances between tokens instead of absolute positions, making it more adaptive to varying sequence lengths. However, it does not outperform the cosine position embedding, probably due to the constant disparity size in the 4D cost volume. While full-volume attention in theory provides a larger receptive field, it is less effective than applying attention only over the disparity dimension of the cost volume. We hypothesize that the extremely large space of the 4D cost volume makes it less tractable, whereas attention over disparity provides sufficient context for a better initial disparity prediction and subsequent volume feature lookup during GRU updates. Next, we compare different kernel sizes in APC (rows 10-15), where the last dimension in each parenthesis corresponds to the disparity dimension. We observe increasing benefits when enlarging the disparity kernel size until it saturates at around 17.

Effects of Proposed Modules. The quantitative effects are shown in Tab. 7 (left). STA leverages rich monocular priors, which greatly enhances generalization to real images in ambiguous regions. DT and APC effectively aggregate cost volume features along the spatial and disparity dimensions, leading to improved context for disparity initialization and subsequent volume feature lookup during GRU updates. Fig. 3 further visualizes the resulting effects.

Effects of FoundationStereo Dataset. We study whether to include the FSD dataset with the existing public datasets when training our foundation model described in Sec. 4.1. Results are shown in Tab. 7 (right).

5. Conclusion

We introduced FoundationStereo, a foundation model for stereo depth estimation that achieves strong zero-shot generalization across various domains without fine-tuning. We envision that such a foundation model can facilitate broader adoption of stereo estimation models in practical applications. Despite its remarkable generalization, it has several limitations. First, our model is not yet optimized for efficiency: inference takes 0.7 s for an image of size 375×1242 on an NVIDIA A100 GPU. Future work could explore adapting the distillation and pruning techniques applied to other vision foundation models [13, 87]. Second, our dataset FSD includes a limited collection of transparent objects. Robustness could be further enhanced by augmenting with a larger diversity of fully transparent objects during training.
References nel attention network for stereo matching. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
[1] Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So Recognition (CVPR), pages 27768–27777, 2024. 1, 2, 7
Kweon, Kyung-Soo Kim, and Soohyun Kim. Correlate-and-
[15] Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao
excite: Real-time stereo matching via guided cost volume
Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and
excitation. In 2021 IEEE/RSJ International Conference on
Zongyuan Ge. Hierarchical neural architecture search for
Intelligent Robots and Systems (IROS), pages 3542–3548.
deep stereo matching. Proceedings of Neural Information
IEEE, 2021. 4
Processing Systems (NeurIPS), 33:22158–22169, 2020. 7
[2] Wei Bao, Wei Wang, Yuhua Xu, Yulan Guo, Siyu Hong, and
[16] François Chollet. Xception: Deep learning with depthwise
Xiaohu Zhang. InStereo2k: a large real dataset for stereo
separable convolutions. In Proceedings of the IEEE/CVF
matching in indoor scenes. Science China Information Sci-
Conference on Computer Vision and Pattern Recognition
ences, 63:1–11, 2020. 2, 6
(CVPR), pages 1251–1258, 2017. 5
[3] Luca Bartolomei, Fabio Tosi, Matteo Poggi, and Stefano
Mattoccia. Stereo anywhere: Robust zero-shot deep stereo [17] WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad,
matching even where either stereo or mono fail. arXiv Alireza Bab-Hadiashar, and David Suter. ITSA: An
preprint arXiv:2412.04472, 2024. 2 information-theoretic approach to automatic shortcut avoid-
ance and domain generalization in stereo matching net-
[4] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter
works. In Proceedings of the IEEE/CVF Conference on
Wonka, and Matthias Müller. ZoeDepth: Zero-shot trans-
Computer Vision and Pattern Recognition (CVPR), pages
fer by combining relative and metric depth. arXiv preprint
13022–13032, 2022. 1, 2
arXiv:2302.12288, 2023. 3
[5] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, [18] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo-
Marcel Santos, Yichao Zhou, Stephan R Richter, and pher Ré. FlashAttention: Fast and memory-efficient exact
Vladlen Koltun. Depth Pro: Sharp monocular metric depth in attention with io-awareness. Proceedings of Neural Informa-
less than a second. arXiv preprint arXiv:2410.02073, 2024. tion Processing Systems (NeurIPS), 35:16344–16359, 2022.
3 5
[6] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and [19] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Ab-
Michael J Black. A naturalistic open source movie for optical hishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun,
flow evaluation. In Proceedings of the European Conference Leonidas Guibas, Justin Johnson, and Varun Jampani. Prob-
on Computer Vision (ECCV), pages 611–625, 2012. 2, 3, 6 ing the 3D awareness of visual foundation models. In Pro-
[7] Yohann Cabon, Naila Murray, and Martin Humenberger. Vir- ceedings of the IEEE/CVF Conference on Computer Vision
tual KITTI 2. arXiv preprint arXiv:2001.10773, 2020. 2, 6 and Pattern Recognition (CVPR), pages 21795–21806, 2024.
8
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou,
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- [20] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we
ing properties in self-supervised vision transformers. In Pro- ready for autonomous driving? the KITTI vision benchmark
ceedings of the IEEE International Conference on Computer suite. In Proceedings of the IEEE/CVF Conference on Com-
Vision (ICCV), pages 9650–9660, 2021. 3 puter Vision and Pattern Recognition (CVPR), pages 3354–
[9] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo 3361, 2012. 2, 6, 7
matching network. In Proceedings of the IEEE/CVF Confer- [21] Rui Gong, Weide Liu, Zaiwang Gu, Xulei Yang, and
ence on Computer Vision and Pattern Recognition (CVPR), Jun Cheng. Learning intra-view and cross-view geomet-
pages 5410–5418, 2018. 4 ric knowledge for stereo matching. In Proceedings of
[10] Tianyu Chang, Xun Yang, Tianzhu Zhang, and Meng Wang. the IEEE/CVF Conference on Computer Vision and Pattern
Domain generalized stereo matching via hierarchical visual Recognition (CVPR), pages 20752–20762, 2024. 1, 2
transformation. In Proceedings of the IEEE/CVF Conference [22] Tongfan Guan, Chen Wang, and Yun-Hui Liu. Neural
on Computer Vision and Pattern Recognition (CVPR), pages markov random field for stereo matching. In Proceedings of
9559–9568, 2023. 1, 2, 6 the IEEE/CVF Conference on Computer Vision and Pattern
[11] Liyan Chen, Weihan Wang, and Philippos Mordohai. Learn- Recognition, pages 5459–5469, 2024. 6, 7, 1
ing the distribution of errors in stereo matching for joint [23] Weiyu Guo, Zhaoshuo Li, Yongkui Yang, Zheng Wang, Rus-
disparity and uncertainty estimation. In Proceedings of sell H Taylor, Mathias Unberath, Alan Yuille, and Yingwei
the IEEE/CVF Conference on Computer Vision and Pattern Li. Context-enhanced stereo transformer. In Proceedings
Recognition (CVPR), pages 17235–17244, 2023. 1, 2 of the European Conference on Computer Vision (ECCV),
[12] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong pages 263–279, 2022. 2
Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for [24] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang,
dense predictions. ICLR, 2023. 3 and Hongsheng Li. Group-wise correlation stereo network.
[13] Zigeng Chen, Gongfan Fang, Xinyin Ma, and Xinchao In Proceedings of the IEEE/CVF Conference on Computer
Wang. 0.1% data makes segment anything slim. NeurIPS, Vision and Pattern Recognition (CVPR), pages 3273–3282,
2023. 8 2019. 4
[14] Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif chan- Deep residual learning for image recognition. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and [36] Lahav Lipson, Zachary Teed, and Jia Deng. RAFT-Stereo:
Pattern Recognition (CVPR), pages 770–778, 2016. 3 Multilevel recurrent field transforms for stereo matching. In
[26] Sergio Izquierdo, Mohamed Sayed, Michael Firman, International Conference on 3D Vision (3DV), pages 218–
Guillermo Garcia-Hernando, Daniyar Turmukhambetov, 227, 2021. 1, 2, 3, 5, 6, 7
Javier Civera, Oisin Mac Aodha, Gabriel J. Brostow, and [37] Biyang Liu, Huimin Yu, and Guodong Qi. GraftNet: To-
Jamie Watson. MVSAnywhere: Zero shot multi-view stereo. wards domain generalized stereo matching with a broad-
In CVPR, 2025. 3 spectrum and task-oriented feature. In Proceedings of the
[27] Junpeng Jing, Jiankun Li, Pengfei Xiong, Jiangyu Liu, IEEE/CVF Conference on Computer Vision and Pattern
Shuaicheng Liu, Yichen Guo, Xin Deng, Mai Xu, Lai Jiang, Recognition (CVPR), pages 13012–13021, 2022. 1, 2
and Leonid Sigal. Uncertainty guided adaptive warping for [38] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao
robust and efficient stereo matching. In Proceedings of the Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang,
IEEE International Conference on Computer Vision (ICCV), Hang Su, et al. Grounding DINO: Marrying DINO with
pages 3318–3327, 2023. 1, 2, 6 grounded pre-training for open-set object detection. In Pro-
[28] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia ceedings of the European Conference on Computer Vision
Neverova, Andrea Vedaldi, and Christian Rupprecht. Dy- (ECCV), 2024. 3
namicStereo: Consistent dynamic depth from stereo videos. [39] I Loshchilov. Decoupled weight decay regularization. ICLR,
In Proceedings of the IEEE/CVF Conference on Com- 2019. 6
puter Vision and Pattern Recognition (CVPR), pages 13229– [40] Muhammad Maaz, Abdelrahman Shaker, Hisham
13239, 2023. 2 Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muham-
[29] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- mad Anwer, and Fahad Shahbaz Khan. EdgeNeXt:
zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- Efficiently amalgamated cnn-transformer architecture for
ing diffusion-based image generators for monocular depth mobile vision applications. In Proceedings of the European
estimation. In Proceedings of the IEEE/CVF Conference Conference on Computer Vision (ECCV), pages 3–20, 2022.
on Computer Vision and Pattern Recognition (CVPR), pages 3
9492–9502, 2024. 3 [41] Yamin Mao, Zhihua Liu, Weiming Li, Yuchao Dai, Qiang
[30] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Wang, Yun-Tae Kim, and Hong-Seok Lee. UASNet: Un-
Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. certainty adaptive sampling network for deep stereo match-
End-to-end learning of geometry and context for deep stereo ing. In Proceedings of the IEEE International Conference on
regression. In Proceedings of the IEEE International Con- Computer Vision (ICCV), pages 6311–6319, 2021. 1, 2
ference on Computer Vision (ICCV), pages 66–75, 2017. 5 [42] D. Marr and T. Poggio. Cooperative computation of stereo
[31] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- disparity. Science, 194:283–287, 1976. 1
win Balakrishna, Sudeep Dasari, Siddharth Karamcheti, [43] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer,
Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yun- Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A
liang Chen, Kirsty Ellis, et al. DROID: A large-scale large dataset to train convolutional networks for disparity,
in-the-wild robot manipulation dataset. arXiv preprint optical flow, and scene flow estimation. In Proceedings of
arXiv:2403.12945, 2024. 7 the IEEE/CVF Conference on Computer Vision and Pattern
[32] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Recognition (CVPR), pages 4040–4048, 2016. 2, 3, 6
Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- [44] Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nali-
head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- vayko, and Andrés Bruhn. Spring: A high-resolution high-
thing. In Proceedings of the IEEE International Conference detail dataset and benchmark for scene flow, optical flow and
on Computer Vision (ICCV), pages 4015–4026, 2023. 1, 3 stereo. In Proc. IEEE/CVF Conference on Computer Vision
[33] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Ground- and Pattern Recognition (CVPR), 2023. 3
ing image matching in 3D with MASt3R. arXiv preprint [45] Moritz Menze and Andreas Geiger. Object scene flow for au-
arXiv:2406.09756, 2024. 3 tonomous vehicles. In Proceedings of the IEEE/CVF Confer-
[34] Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei ence on Computer Vision and Pattern Recognition (CVPR),
Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng pages 3061–3070, 2015. 2, 6, 7
Liu. Practical stereo matching via cascaded recurrent net- [46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy
work with adaptive correlation. In Proceedings of the Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez,
IEEE/CVF Conference on Computer Vision and Pattern Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al.
Recognition (CVPR), pages 16263–16272, 2022. 1, 2, 3, 6, DINOv2: Learning robust visual features without supervi-
7 sion. TMLR, 2024. 1, 3, 8
[35] Zhaoshuo Li, Xingtong Liu, Nathan Drenkow, Andy Ding, [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Francis X Creighton, Russell H Taylor, and Mathias Un- Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
berath. Revisiting stereo depth estimation from a sequence- Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
to-sequence perspective with transformers. In Proceedings ing transferable visual models from natural language super-
of the IEEE International Conference on Computer Vision vision. In International Conference on Machine Learning
(ICCV), pages 6197–6206, 2021. 1, 2, 5 (ICML), pages 8748–8763, 2021. 3
[48] Pierluigi Zama Ramirez, Alex Costanzino, Fabio Tosi, Mat- [59] Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger.
teo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Smd-nets: Stereo mixture density networks. In Proceedings
Di Stefano. Booster: A benchmark for depth from images of the IEEE/CVF conference on computer vision and pattern
of specular and transparent surfaces. IEEE Transactions on recognition, pages 8942–8952, 2021. 3
Pattern Analysis and Machine Intelligence (PAMI), 2023. 2, [60] Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo
1 Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Ste-
[49] Zhibo Rao, Bangshu Xiong, Mingyi He, Yuchao Dai, Renjie fano. Neural disparity refinement. IEEE Transactions on
He, Zhelun Shen, and Xing Li. Masked representation learn- Pattern Analysis and Machine Intelligence (PAMI), 2024. 1,
ing for domain generalized stereo matching. In Proceedings 2
of the IEEE/CVF Conference on Computer Vision and Pat- [61] Jonathan Tremblay, Thang To, and Stan Birchfield. Falling
tern Recognition (CVPR), pages 5435–5444, 2023. 1, 2, 6 things: A synthetic dataset for 3d object detection and pose
[50] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang estimation. In Proceedings of the IEEE Conference on
Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Computer Vision and Pattern Recognition Workshops, pages
Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: 2038–2041, 2018. 2, 3, 6
Segment anything in images and videos. arXiv preprint [62] Jonathan Tremblay, Thang To, Balakumar Sundaralingam,
arXiv:2408.00714, 2024. 3 Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose
[51] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, estimation for semantic robotic grasping of household ob-
Greg Krathwohl, Nera Nešić, Xi Wang, and Porter West- jects. In Conference on Robot Learning (CoRL), pages 306–
ling. High-resolution stereo datasets with subpixel-accurate 316, 2018. 3
ground truth. In Pattern Recognition: 36th German Confer- [63] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
ence, GCPR 2014, Münster, Germany, September 2-5, 2014, reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Proceedings 36, pages 31–42. Springer, 2014. 2, 6, 7, 1 Polosukhin. Attention is all you need. Advances in Neural
Information Processing Systems (NeurIPS), 30, 2017. 5
[52] Thomas Schops, Johannes L Schonberger, Silvano Galliani,
Torsten Sattler, Konrad Schindler, Marc Pollefeys, and An- [64] Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng,
dreas Geiger. A multi-view stereo benchmark with high- Kaiyong Zhao, and Xiaowen Chu. IRS: A large naturalistic
resolution images and multi-camera videos. In Proceedings indoor robotics stereo dataset to train deep models for dis-
of the IEEE/CVF Conference on Computer Vision and Pat- parity and surface normal estimation. In IEEE International
tern Recognition (CVPR), pages 3260–3269, 2017. 2, 6, 7 Conference on Multimedia and Expo (ICME), 2021. 2, 3
[65] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris
[53] Zhelun Shen, Yuchao Dai, and Zhibo Rao. CFNet: Cascade
Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D
and fused cost volume for robust stereo matching. In Pro-
vision made easy. In Proceedings of the IEEE/CVF Confer-
ceedings of the IEEE/CVF Conference on Computer Vision
ence on Computer Vision and Pattern Recognition (CVPR),
and Pattern Recognition (CVPR), pages 13906–13915, 2021.
pages 20697–20709, 2024. 3
1, 2
[66] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu,
[54] Zhelun Shen, Yuchao Dai, Xibin Song, Zhibo Rao, Dingfu
Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Se-
Zhou, and Liangjun Zhang. PCW-Net: Pyramid combina-
bastian Scherer. TartanAir: A dataset to push the limits of
tion and warping cost volume for stereo matching. In Pro-
visual slam. In IEEE/RSJ International Conference on Intel-
ceedings of the European Conference on Computer Vision
ligent Robots and Systems (IROS), pages 4909–4916, 2020.
(ECCV), pages 280–297, 2022. 1, 2
2, 3
[55] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen [67] Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang.
Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Selective-Stereo: Adaptive frequency information selection
rotary position embedding. Neurocomputing, 568:127063, for stereo matching. In Proceedings of the IEEE/CVF
2024. 8 Conference on Computer Vision and Pattern Recognition
[56] Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh (CVPR), pages 19701–19710, 2024. 1, 2, 5, 6, 7
Kowdle, Sean Fanello, and Sofien Bouaziz. HITNet: Hier- [68] Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy,
archical iterative tile refinement network for real-time stereo Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela
matching. In Proceedings of the IEEE/CVF Conference on Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme Re-
Computer Vision and Pattern Recognition (CVPR), pages vaud. CroCo v2: Improved cross-view completion pre-
14362–14372, 2021. 7 training for stereo matching and optical flow. In Proceedings
[57] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field of the IEEE International Conference on Computer Vision
transforms for optical flow. In Proceedings of the European (ICCV), pages 17969–17980, 2023. 1, 2, 5, 7
Conference on Computer Vision (ECCV), pages 402–419, [69] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield.
2020. 2, 5 FoundationPose: Unified 6D pose estimation and tracking
[58] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Woj- of novel objects. In Proceedings of the IEEE/CVF Confer-
ciech Zaremba, and Pieter Abbeel. Domain randomization ence on Computer Vision and Pattern Recognition (CVPR),
for transferring deep neural networks from simulation to the pages 17868–17879, 2024. 3
real world. In IEEE/RSJ International Conference on Intel- [70] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Atten-
ligent Robots and Systems (IROS), pages 23–30, 2017. 5 tion concatenation volume for accurate and efficient stereo
matching. In Proceedings of the IEEE/CVF Conference on [83] Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas
Computer Vision and Pattern Recognition (CVPR), pages Guibas, and Jitendra Malik. Side-tuning: a baseline for net-
12981–12990, 2022. 7 work adaptation via additive side networks. In Proceedings
[71] Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. of the European Conference on Computer Vision (ECCV),
Iterative geometry encoding volume for stereo matching. In pages 698–714, 2020. 3
Proceedings of the IEEE/CVF Conference on Computer Vi- [84] Yongjian Zhang, Longguang Wang, Kunhong Li, Yun Wang,
sion and Pattern Recognition, pages 21919–21928, 2023. 2, and Yulan Guo. Learning representations from foundation
4, 5, 7 models for domain generalized stereo matching. In European
[72] Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Junda Cheng, Conference on Computer Vision, pages 146–162. Springer,
Chunyuan Liao, and Xin Yang. IGEV++: Iterative multi- 2024. 1, 2, 6
range geometry encoding volumes for stereo matching. [85] Haoliang Zhao, Huizhou Zhou, Yongjun Zhang, Yong Zhao,
arXiv preprint arXiv:2409.00638, 2024. 2, 6 Yitong Yang, and Ting Ouyang. EAI-Stereo: Error aware
[73] Haofei Xu and Juyong Zhang. AANet: Adaptive aggrega- iterative network for stereo matching. In Proceedings of the
tion network for efficient stereo matching. In Proceedings of Asian Conference on Computer Vision (ACCV), pages 315–
the IEEE/CVF Conference on Computer Vision and Pattern 332, 2022. 7
Recognition (CVPR), pages 1959–1968, 2020. 1, 2 [86] Haoliang Zhao, Huizhou Zhou, Yongjun Zhang, Jie Chen,
[74] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Yitong Yang, and Yong Zhao. High-frequency stereo match-
Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, ing network. In Proceedings of the IEEE/CVF Conference
stereo and depth estimation. IEEE Transactions on Pattern on Computer Vision and Pattern Recognition (CVPR), pages
Analysis and Machine Intelligence (PAMI), 2023. 7 1327–1336, 2023. 1, 2
[75] Gengshan Yang, Joshua Manela, Michael Happold, and [87] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu,
Deva Ramanan. Hierarchical deep stereo matching on high- Min Li, Ming Tang, and Jinqiao Wang. Fast segment any-
resolution images. In Proceedings of the IEEE/CVF Confer- thing. arXiv preprint arXiv:2306.12156, 2023. 8
ence on Computer Vision and Pattern Recognition (CVPR),
pages 5515–5524, 2019. 2
[76] Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng,
Jianping Shi, and Bolei Zhou. DrivingStereo: A large-scale
dataset for stereo matching in autonomous driving scenar-
ios. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages 899–
908, 2019. 2
[77] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing
Wang, and Feng Zheng. Track anything: Segment anything
meets videos. arXiv preprint arXiv:2304.11968, 2023. 3
[78] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi
Feng, and Hengshuang Zhao. Depth anything: Unleashing
the power of large-scale unlabeled data. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 10371–10381, 2024. 1, 3
[79] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao-
gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any-
thing v2. In Proceedings of Neural Information Processing
Systems (NeurIPS), 2024. 1, 2, 3, 4, 8
[80] Menglong Yang, Fangrui Wu, and Wei Li. WaveletStereo:
Learning wavelet coefficients of disparity map in stereo
matching. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages
12885–12894, 2020. 1, 2
[81] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and
Philip HS Torr. GA-Net: Guided aggregation net for end-
to-end stereo matching. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 185–194, 2019. 7
[82] Feihu Zhang, Xiaojuan Qi, Ruigang Yang, Victor Prisacariu,
Benjamin Wah, and Philip Torr. Domain-invariant stereo
matching networks. In Proceedings of the European Con-
ference on Computer Vision (ECCV), pages 420–439, 2020.
1, 2, 6
FoundationStereo: Zero-Shot Stereo Matching
Supplementary Material
6. ETH3D Leaderboard

At the time of submission, our fine-tuned model ranks 1st on the ETH3D leaderboard, significantly outperforming IGEV

objects. We compare with the most competitive methods from Fig. 5 (main paper) in the zero-shot setting. The quantitative and qualitative results are shown below.
Figure 7. Example scene models, involving a factory, hospital, wood attic, office, grocery store and warehouse. In the third column, we demonstrate an example of metallic material randomization applied to augment scene diversity. The last column shows a comparison between a real warehouse (bottom) and our high-fidelity simulated digital twin (top).
defined with a separate randomization range for sampling locations, scales and appearances. In addition, we curated 12 large scene models (Fig. 7), 16 skybox images, more than 150 materials, and 400 textures for tiled wrapping onto object geometries for appearance augmentation. These textures are obtained from real-world photos and procedurally generated random patterns.

Camera Configuration. For each data sample, we first randomly sample the stereo baseline and camera focal length to diversify the coverage of fields of view and disparity distributions. Next, objects are spawned into the scene using two different methods to randomize the scene configuration: 1) the camera is spawned at a random pose, and objects are added relative to the camera at random locations; 2) objects are spawned near a random location, and the camera is spawned nearby and oriented toward the center of mass of the object clutter.
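A sketch of the per-sample camera randomization described above; the sampling ranges are hypothetical placeholders, and the actual FSD generation uses NVIDIA Omniverse rather than this plain-Python stand-in.

```python
import random

def sample_camera_params():
    # All ranges below are hypothetical placeholders, not the values used for FSD.
    return {
        "baseline_m": random.uniform(0.05, 0.30),         # stereo baseline
        "focal_length_px": random.uniform(400.0, 1200.0), # intrinsics / field of view
        "image_size": (1280, 720),
    }

def disparity_from_depth(depth_m, cam):
    # disparity = f * b / z, which ties the sampled baseline and focal length
    # to the resulting disparity distribution.
    return cam["focal_length_px"] * cam["baseline_m"] / depth_m

if __name__ == "__main__":
    cam = sample_camera_params()
    print(cam, disparity_from_depth(5.0, cam))
```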
Figure 8. Middlebury leaderboard screenshot. Our fine-tuned foundation model (red box) ranks 1st at the time of submission.
Layout Configuration. We generate layouts in two styles: chaotic and realistic. Combining the more realistic structured layouts with the more randomized setups of flying objects has been shown to benefit sim-to-real generalization [62]. Specifically, chaotic-style scenes involve a large number of flying distractors and simple scene layouts consisting of an infinitely far skybox and a background plane; the lighting and object appearances (texture and material) are highly randomized. The realistic-style data uses indoor and outdoor scene models where the camera is restricted to predefined areas. Object assets are dropped and assigned physical properties for collision. The simulation is run for a random duration between 0.25 and 2 seconds to create physically realistic layouts with no penetration, involving both settled and falling objects. Materials and scales native to the object assets are maintained, and more natural lighting is applied. Among the realistic-style data, we further divide the scenes into three types, which determine what categories of objects are selected to compose the scene for more consistent semantics:
• Navigation - camera poses are often parallel to the ground and objects are often spawned further away. Objects such as free-standing walls, furniture, and digital humans are sampled with higher probability.
• Driving - the camera is often parallel to and above the ground and objects are often spawned further away. Objects such as vehicles, digital humans, poles, signs and speed bumps are sampled with higher probability.
• Manipulation - the camera is oriented to face front or downward as in ego-centric views, and objects are often spawned at closer range to resemble interaction scenarios. Objects such as household or grocery items, open containers, and robotic arms are sampled with higher probability.
Lighting Configuration. Light types include global illumination, directed sky rays, lights baked into 3D-scanned assets, and light spheres which add dynamic lighting when spawned near surfaces. Light colors, intensities and directions are randomized. Lighting conditions such as daytime, dusk and night are covered within the random sampling ranges.
Disparity Distribution. Fig. 9 shows the disparity distribu-
tion of our FSD dataset.
12. Acknowledgement
We would like to thank Gordon Grigor, Jack Zhang, Xu-
tong Ren, Karsten Patzwaldt, Hammad Mazhar and other
NVIDIA Isaac team members for their tremendous engi-
neering support and valuable discussions.