
SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos

Yuzheng Liu1* Siyan Dong2* Shuzhe Wang3 Yanchao Yang2† Qingnan Fan4 Baoquan Chen1†
1 Peking University   2 The University of Hong Kong   3 Aalto University   4 VIVO
* Joint first authors: liu [email protected], [email protected]
† Corresponding authors

arXiv:2412.09401v2 [cs.CV] 19 Dec 2024

Abstract

In this paper, we introduce SLAM3R, a novel and effective monocular RGB SLAM system for real-time and high-quality dense 3D reconstruction. SLAM3R provides an end-to-end solution by seamlessly integrating local 3D reconstruction and global coordinate registration through feed-forward neural networks. Given an input video, the system first converts it into overlapping clips using a sliding window mechanism. Unlike traditional pose optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB images in each window and progressively aligns and deforms these local pointmaps to create a globally consistent scene reconstruction - all without explicitly solving any camera parameters. Experiments across datasets consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS. Code and weights at: https://github.com/PKU-VCL-3DV/SLAM3R.

Figure 1. We introduce a novel dense SLAM system - SLAM3R. This system takes a monocular RGB video as input and reconstructs the scene as a dense pointcloud. The video is converted into short clips for local reconstruction (denoted as inner-window), which are then incrementally registered together (inter-window) to create a global scene model. This process runs in real-time, producing a reconstruction that is both accurate and complete.

1. Introduction

Dense 3D reconstruction, a long-standing challenge in computer vision, aims to capture and reconstruct the detailed geometry of real-world scenes. Traditional approaches have largely relied on multi-stage pipelines. These typically begin with sparse Simultaneous Localization and Mapping (SLAM) [7, 14, 23, 35, 36] or Structure-from-Motion (SfM) [29, 31, 45, 51, 64] algorithms to estimate camera parameters, followed by Multi-View Stereo (MVS) [16, 46, 58, 66] techniques to fill in scene details. While these methods offer high-quality reconstructions, they often require offline processing to produce a complete model, which limits their applicability in real-world scenarios.

In the literature, dense SLAM approaches [5, 9, 10, 15, 20, 37, 40, 53, 54, 74, 76] have been developed to address dense scene reconstruction as a complete system. However, these approaches often fall short in terms of reconstruction accuracy or completeness, or they rely heavily on depth sensors. Recently, several monocular SLAM systems [27, 44, 71, 73, 75, 77] have been proposed to tackle dense scene reconstruction from RGB videos. By incorporating advanced scene representations [22, 34, 38, 55, 60], these systems produce accurate and complete scene reconstructions. However, this comes at the cost of reduced running efficiency. For example, NICER-SLAM [77] operates at a speed significantly below 1 FPS. Therefore, current approaches struggle with at least one of three key criteria: reconstruction accuracy, completeness, or efficiency.

While monocular dense SLAM systems encounter the limitations mentioned earlier, recent advances in two-view geometry have shown promising potential. DUSt3R [62] introduces a purely end-to-end approach for learning dense reconstruction. Trained on large-scale datasets, its network is capable of producing high-quality dense reconstructions from paired images in real-time.

However, for multiple views, a global optimization step is required to align these image pairs, which significantly hampers its efficiency. A concurrent work, Spann3R [59], extends DUSt3R to multi-view (video) scenarios through a pairwise incremental reconstruction pipeline. While this method accelerates the reconstruction process, it unfortunately results in significant accumulated drift and reduced reconstruction quality.

To address these challenges, we introduce SLAM3R (pronounced "slæmər"), a novel SLAM system designed to perform dense 3D reconstruction in real-time using only RGB videos. SLAM3R comprises a two-hierarchy framework. First, it reconstructs local 3D geometry from a sliding window that processes short clips from the input video. Then, it progressively registers these local reconstructions to build a globally consistent 3D scene. Both modules are developed with simple yet effective feed-forward models, enabling end-to-end and efficient scene reconstruction. Specifically, the two modules are the Images-to-Points (I2P) network and the Local-to-World (L2W) network. The I2P module, inspired by DUSt3R, selects a keyframe in a local window as the coordinate system reference. It directly predicts the dense 3D pointmap supported by the remaining frames within that window. The L2W module incrementally fuses locally reconstructed points into a coherent global coordinate system. Both processes reconstruct the 3D points without explicitly estimating any camera parameters.

Through extensive experiments, we demonstrate that SLAM3R provides high-quality scene reconstructions with minimal drift, outperforming existing dense SLAM systems across various benchmarks. Furthermore, SLAM3R achieves these results at 20+ FPS, bridging the gap between quality and efficiency in RGB-only dense scene reconstruction. Our contributions are summarized below:
• We present a novel real-time system for end-to-end dense 3D reconstruction that directly predicts 3D pointmaps in a unified coordinate system.
• Through careful design, our Image-to-Points module can process an arbitrary number of images simultaneously, effectively extending DUSt3R to handle multiple views and produce higher-quality predictions.
• The proposed Local-to-World module directly aligns predicted local 3D pointmaps into a unified global coordinate system. This eliminates the need for explicit camera parameter estimation and costly global optimization.
• We evaluate our method on multiple public benchmarks. It achieves state-of-the-art reconstruction quality in terms of both accuracy and completeness at real-time speeds.

2. Related Work

Traditional offline approaches. Dense 3D pointcloud reconstruction is a long-standing problem in computer vision. Classical approaches to this problem first determine camera parameters using Structure from Motion (SfM) [29, 31, 45, 51, 64], followed by dense 3D point triangulation with Multi-View Stereo (MVS) [16, 46, 58, 66]. In recent years, neural implicit [8, 28, 34, 60, 63] and 3D Gaussian [11, 17, 18] representations have been applied to further enhance the quality of dense reconstruction. While these methods deliver high-quality results, they have a significant limitation: the requirement for offline processing to generate the final 3D model, which restricts their applicability in real-time scenarios. In this paper, we focus on online dense reconstruction in the context of Simultaneous Localization and Mapping (SLAM).

Dense SLAM. Early works on SLAM [4, 7, 13, 14, 23, 35, 36] focused on reconstructing the structure of unknown environments while simultaneously localizing camera poses. These approaches prioritize real-time performance but produce only sparse structures of the scene. Dense SLAM approaches [5, 9, 10, 15, 20, 25, 37, 40, 53, 54, 74, 76] incorporate detailed scene geometry information to improve pose estimation. DROID-SLAM [54] introduces recurrent iterative updates of camera poses and pixel-wise depth estimates, while TANDEM [25] proposes an online MVS module for depth prediction. These systems enable real-time dense scene reconstruction. However, their focus on camera trajectory accuracy often results in incomplete and noisy 3D reconstruction. Neural implicit and Gaussian representations have also been integrated with dense SLAM systems [9, 19-21, 30, 32, 40, 42, 43, 53, 65, 70, 76]. However, these approaches often rely on additional depth sensors or focus primarily on novel view synthesis rather than producing detailed geometric reconstruction.

More recently, several monocular dense SLAM systems [27, 44, 71, 73, 75, 77] have been developed to produce dense scene geometry reconstruction. A notable limitation of these systems is their slow runtime. Among these systems, GO-SLAM [73] achieves a speed of ∼8 FPS, which still falls short of real-time capability. Furthermore, these methods all share a common strategy: they alternate between solving for camera poses and estimating the scene representation. In contrast, this paper presents a novel approach to dense scene reconstruction that eliminates the need for explicitly solving camera parameters, offering a more efficient and streamlined solution.

End-to-end dense 3D reconstruction. DUSt3R [62] introduces the first purely end-to-end dense 3D reconstruction pipeline that does not rely on camera parameters. Recently, several works have adopted a similar approach for single-view reconstruction [61], feature matching [26], novel view synthesis [50, 68], and dynamics reconstruction [72]. These successes demonstrate the effectiveness of end-to-end dense point prediction, inspiring us to develop a dense reconstruction system with a similar methodology.

While DUSt3R operates in real-time for two-view predictions, its extension to multiple views involves exhaustively pairing images and performing an additional global optimization step. This process significantly increases computational time, thereby hindering its real-time performance. MASt3R [26] enhances the matching capability of DUSt3R by adding a match head, achieving more accurate keypoint correspondences for 3D reconstruction, but at the cost of increased computational time. More recently, the concurrent work Spann3R [59] extends DUSt3R with spatial memory. It takes a video as input and performs incremental scene reconstruction in a unified coordinate system without requiring global optimization. While this approach significantly improves runtime efficiency, the pairwise incremental reconstruction pipeline leads to noticeable accumulated drift. Unlike Spann3R, our networks at each hierarchy take multiple frames as input to minimize drift. Additionally, we propose a self-contained retrieval module that, when registering a new frame, selects not only its previous few frames but also other similar frames from the long-term history, providing a more global scene reference.

3. Method

Problem statement. Given a monocular video consisting of a sequence of RGB image frames {I_i ∈ R^(H×W×3)}_{i=1}^{N} that captures a static scene, the goal is to reconstruct its dense 3D pointcloud P ∈ R^(M×3), where M is the number of 3D points. Research in this field focuses on three key objectives: maximizing 3D point recovery for reconstruction completeness, improving the accuracy of each recovered point, and achieving these goals while preserving real-time performance.

System overview. Figure 2 illustrates an overview of the proposed dense SLAM system. It consists of two main components: an Image-to-Points (I2P) network that recovers local 3D points from video clips, and a Local-to-World (L2W) network that registers local reconstructions into a global scene coordinate system. During the reconstruction of the dense point cloud, the system does not explicitly solve any camera parameters. Instead, it directly predicts 3D point maps in unified coordinate systems.

The system starts by applying a sliding window mechanism of length L to convert the input video into short clips {W_i ∈ R^(L×H×W×3)}. The I2P network then processes each window W_i to recover local 3D pointmaps. Within each window, the system selects a keyframe to define a reference coordinate system for point reconstruction, as detailed in Sec. 3.1. By default, the stride of the sliding window is set to 1, ensuring each input frame in the video is selected at least once as a keyframe. For global scene reconstruction, we initialize the world coordinate system with the first window and use the reconstructed frames (image and local point map produced by the I2P) as input for the L2W model. The L2W model incrementally registers these local reconstructions into a unified global 3D coordinate system. To ensure both accuracy and efficiency during this process, the system maintains a limited reservoir of registered frames, called scene frames. Whenever the L2W model registers a new keyframe, we retrieve the best-correlated scene frames as a reference. The details are introduced in Sec. 3.2.

3.1. Inner-Window Local Reconstruction

The Image-to-Points (I2P) model aims to infer dense 3D pointmaps for every pixel of a keyframe in a given video clip. By default, the middle image of a window W is chosen as the keyframe I_key to define the local coordinate system, as it is most likely to have the largest overlap with the other frames. The remaining images {I_sup_i}_{i=1}^{L-1} serve as supporting frames. Note that the 3D pointmaps of supporting frames can also be reconstructed through I2P.

The I2P network draws inspiration from DUSt3R [62], originally designed for stereo 3D reconstruction. We introduce several simple yet effective modifications to extend it to multi-view scenarios. The I2P model uses a multi-branch Vision Transformer (ViT) [12] as its backbone. It consists of a shared encoder E_img, two separate decoders D_key and D_sup, and a point regression head for the final prediction. These components are detailed below.

Image encoder. For a given video clip, the image encoder E_img encodes each frame I_i to obtain token representations F_i ∈ R^(T×d), where T is the number of tokens and d is the token dimension. The encoder E_img comprises m ViT encoder blocks, each containing self-attention and feed-forward layers. The encoding process is denoted as

F_i^(T×d) = E_img(I_i^(H×W×3)),   i = 1, ..., L.

The frames are processed independently and in parallel, with the output divided into two parts: F_key for the keyframe and {F_sup_i}_{i=1}^{L-1} for the supporting frames.

Keyframe decoder. The keyframe decoder D_key consists of n ViT decoder blocks, each containing self-attention, cross-attention, and feed-forward layers. Unlike DUSt3R, which uses standard cross-attention, we introduce a novel multi-view cross-attention to combine information from different supporting frames. Given the feature tokens F_key and {F_sup_i}_{i=1}^{L-1}, the keyframe decoder D_key takes F_key as input for self-attention and performs cross-attention between F_key and {F_sup_i}_{i=1}^{L-1}. A decoder block is illustrated in Figure 3. For each cross-attention layer, queries are taken from F_key, while keys and values are extracted from the supporting tokens F_sup_i. These L−1 cross-attention layers are independent of each other, allowing for parallel processing. A max-pooling layer is then employed to aggregate features after cross-attention. We obtain the decoded keyframe tokens G_key as

G_key = D_key(F_key, F_sup_1, ..., F_sup_{L−1}).
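To make the decoder-block design above concrete, the following is a minimal PyTorch sketch of one such block written under our own assumptions (a single shared cross-attention module applied once per supporting frame, a standard pre-norm residual layout, and illustrative dimensions); it is not the released SLAM3R implementation.

```python
# Sketch of a keyframe-decoder block: self-attention on keyframe tokens, one
# cross-attention per supporting frame (queries from the keyframe, keys/values
# from that supporting frame), and max-pooling to aggregate the per-frame results.
import torch
import torch.nn as nn

class MultiViewDecoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Assumption: one cross-attention module shared across supporting frames.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, f_key, f_sups):
        # f_key: (B, T, d) keyframe tokens; f_sups: list of (B, T, d) supporting tokens.
        x = self.norm1(f_key)
        f_key = f_key + self.self_attn(x, x, x, need_weights=False)[0]
        q = self.norm2(f_key)
        # Independent cross-attention against each supporting frame, then max-pool.
        per_view = [self.cross_attn(q, f_s, f_s, need_weights=False)[0] for f_s in f_sups]
        f_key = f_key + torch.stack(per_view, dim=0).max(dim=0).values
        return f_key + self.mlp(self.norm3(f_key))
```

Because the per-frame cross-attentions are independent, they can be batched and run in parallel, and the max-pooling keeps the block's cost roughly linear in the number of supporting frames.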

Figure 2. System overview. Given an input monocular RGB video, we apply a sliding window mechanism to convert it into overlapping
clips (referred to as windows). Each window is fed into an Image-to-Points (I2P) network to recover 3D points in a local coordinate
system. Next, the local points are incrementally fed into a Local-to-World (L2W) network to create a globally consistent scene model. The
proposed I2P and L2W networks elegantly share similar architectures. In the I2P step (Sec. 3.1), we select a keyframe as a reference to set
up a local coordinate system and use the remaining frames in the window to estimate the 3D geometry captured within it. The points from
the first window are used to establish the world coordinate system. We then incrementally fuse the following windows in the L2W step
(Sec. 3.2). This process involves retrieving the most relevant already-registered keyframes as a reference, and integrating new keyframes.
Through this iterative process, we eventually obtain the full scene reconstruction.

Figure 3. Illustration of a decoder block in the proposed keyframe decoder D_key. We present a minimalist modification to integrate information from different supporting images. Our approach traverses each of them, selects its token keys and values, and uses the keyframe queries to interact with them separately across the supporting images. This multi-view information is then aggregated through max-pooling. The registration decoder D_reg and scene decoder D_sce (described in Sec. 3.2) share the same architecture.

Supporting decoder. The supporting decoder D_sup is designed to complement the keyframe decoder. It inherits the decoder architecture used in DUSt3R, consisting of n standard ViT decoder blocks. The cross-attention mechanism is applied only to exchange information with the keyframe. Note that all supporting frames share the same D_sup. This process is denoted as

G_sup_i = D_sup(F_sup_i, F_key),   i = 1, ..., L − 1.

Points reconstruction. Similar to DUSt3R, we apply a linear head [62] to regress dense 3D pointmaps in the unified coordinate system from the decoded tokens. In addition to the pointmaps, we also predict confidence maps for all frames to evaluate their reliability. The final predictions are

X̂_i^(H×W×3), Ĉ_i^(H×W×1) = H(G_i^(T×d)),   i = 1, ..., L.

Training loss. Following DUSt3R, the I2P network is trained end-to-end using ground-truth scene points {X_i}_{i=1}^{L}. Both the ground-truth and predicted point maps are normalized to a canonical scale, determined by the average distance of all valid points within the window to the origin. The confidence-aware training loss is

L_I2P = Σ_{i=1}^{L} M_i · ( Ĉ_i · L1( X̂_i / ẑ , X_i / z ) − α log Ĉ_i ),

where M_i is a mask of valid points that have ground-truth values in X_i, z and ẑ are the scale factors, Ĉ_i is the confidence map, the operator · denotes element-wise matrix multiplication, L1(·) denotes the point-wise Euclidean distance, and α is a hyper-parameter that controls the regularization term. We detail the training process in Sec. 4.
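As a companion to the points-reconstruction paragraph above, here is a minimal sketch of a DUSt3R-style linear head; the 16×16 patch size and the 1 + exp(·) confidence activation are assumptions for illustration rather than the exact released head.

```python
# Sketch: map decoded tokens to a dense pointmap and a confidence map by predicting
# 4 values (x, y, z, confidence logit) per pixel of each 16x16 patch and unpatchifying.
import torch
import torch.nn as nn

class LinearPointHead(nn.Module):
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(dim, patch * patch * 4)  # 3 coordinates + 1 confidence per pixel

    def forward(self, tokens, hw):
        # tokens: (B, T, d) with T = (H/patch) * (W/patch) in row-major patch order (assumed).
        B, T, _ = tokens.shape
        H, W = hw
        out = self.proj(tokens)                                     # (B, T, p*p*4)
        out = out.view(B, H // self.patch, W // self.patch, self.patch, self.patch, 4)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 4)     # unpatchify to pixels
        pts = out[..., :3]                                          # X̂: (B, H, W, 3)
        conf = 1 + out[..., 3].exp()                                # Ĉ >= 1 (assumed DUSt3R-style)
        return pts, conf
```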

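The confidence-aware loss can be sketched as follows; the per-window normalization by the average distance of valid points to the origin follows the description above, while the value of α and the masking details are illustrative.

```python
# Sketch of the confidence-aware I2P training loss for one window of L frames.
import torch

def i2p_loss(pred_pts, gt_pts, conf, valid_mask, alpha=0.2):
    # pred_pts, gt_pts: (L, H, W, 3); conf: (L, H, W); valid_mask: (L, H, W) bool.
    def normalize(pts):
        # Canonical scale: average distance of valid points to the window origin.
        scale = pts[valid_mask].norm(dim=-1).mean().clamp(min=1e-8)
        return pts / scale
    err = (normalize(pred_pts) - normalize(gt_pts)).norm(dim=-1)   # point-wise Euclidean distance
    loss = conf * err - alpha * torch.log(conf)                     # confidence weighting + regularizer
    return loss[valid_mask].mean()
```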
3.2. Inter-Window Global Registration

After obtaining the 3D pointmap X̂_key from the I2P network, we use the inter-window Local-to-World (L2W) model to incrementally register the newly generated pointmap into a global 3D coordinate system. Similar to the I2P network, the L2W model also relies on some frames to serve as a reference for the scene coordinate system. Furthermore, it can leverage multiple registered keyframes as a global reference. These registered keyframes are referred to as scene frames, and they are maintained in a buffering set through a sampling mechanism.

The buffering set is designed for scalability in handling long videos. We apply a reservoir strategy [57] that maintains a maximum of B registered frames in the buffering set. When a new keyframe is inferred from I2P and ready for fusion, we retrieve the top-K best-correlated scene frames from the buffering set as its support for global registration.

Scene initialization. The first window is used to initialize the scene model. It is crucial to ensure this initialization is as accurate as possible. To achieve this, we execute the I2P network L times, traversing the window and designating each frame in turn as the keyframe. We then select the result with the highest total confidence score for scene model initialization. This process results in a scene pointcloud along with a set of registered frames. All these frames are regarded as scene frames, and they are used to initialize the buffering set.

Reservoir and retrieval. Each scene frame I_sce_i is recorded with its latent feature F_sce_i and pointmaps X̃_sce_i. For efficiency, we apply reservoir sampling to store an unbiased subset of an empirical distribution in a bounded amount of memory. The first B registered frames are directly inserted into the buffering set. For each subsequent frame with id > B, the probability of inserting it is B/id. If chosen for insertion, it randomly replaces one of the current scene frames in the buffering set.

Given a new keyframe I_key to be registered, we feed its feature F_key and the features from the buffering set into a retrieval module

Retrieval(F_key^(T×d), {F_sce_i^(T×d)})

to obtain a list of correlation scores, measuring both the visual similarity and the baseline suitability between the keyframe and the scene frames in the buffering set. The retrieval module uses the first r decoder blocks from the I2P module as its backbone. A linear projection and an average-pooling layer follow, together producing an image-wise correlation score. We then select the top-K scene frames as a global reference to fuse the current keyframe. As a result, we have K scene frames and one keyframe as the input for the following L2W model.

Points embedding. The 3D pointmaps reconstructed by the I2P model are encoded into the L2W model using a patch embedding method similar to the image patchification in the ViT encoder E_img. We process the new keyframe and the K retrieved scene frames in parallel as

P_i^(T×d) = E_pts(X̂_i^(H×W×3)),   i = 1, ..., K + 1.

The encoded geometric tokens are combined with their corresponding visual tokens by

F_i^(T×d) = F_i^(T×d) + P_i^(T×d),   i = 1, ..., K + 1.

The resulting token set {F_key, {F_sce_i}_{i=1}^{K}} contains joint features of image patch appearance and 3D geometry for the keyframe and the retrieved scene frames.

Registration decoder. The registration decoder D_reg takes the feature tokens {F_key, {F_sce_i}_{i=1}^{K}} as input and aims to transform the local reconstruction of the keyframe to the scene coordinate system. It uses the same network architecture as the keyframe decoder D_key. This decoding process is denoted by

G_key = D_reg(F_key, F_sce_1, ..., F_sce_K).

Scene decoder. The scene decoder D_sce takes the token set {F_key, {F_sce_i}_{i=1}^{K}} as input to refine the scene geometry without changing the coordinate system. It uses the same network architecture as the keyframe decoder D_key, allowing us to extend to multi-keyframe co-registration (see the supplementary material for details). By default, we register one keyframe at a time. Each F_sce_i exchanges information only with F_key. This decoding process is denoted by

G_sce_i = D_sce(F_sce_i, F_key),   i = 1, ..., K.

Points reconstruction and training loss. We apply the same head design as that of the I2P network to predict all the pointmaps X̃_i in the global scene coordinate system:

X̃_i^(H×W×3), C̃_i^(H×W×1) = H(G_i^(T×d)),   i = 1, ..., K + 1.

We train the L2W network using a loss function similar to that of the I2P network. Differently, no normalization is applied to the predicted point map, as the output scale must align with the scene frames in the input. This alignment ensures that the output can be directly integrated into the existing reconstruction. The training loss of the L2W network is

L_L2W = Σ_{i=1}^{L} M_i · ( C̃_i · L1(X̃_i, X_i) − α log C̃_i ).

The following section provides a detailed discussion of the training process and its implementation details.
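The reservoir strategy for the buffering set described above can be sketched as follows; the stored payload per frame (feature tokens and registered pointmap) follows the text, while class and variable names are ours.

```python
# Sketch of the buffering set: the first B registered frames are kept, and frame
# number `frame_id` (> B, counted over registered frames) is inserted with
# probability B / frame_id, replacing a random existing entry.
import random

class FrameReservoir:
    def __init__(self, capacity_B):
        self.B = capacity_B
        self.frames = []          # each entry: (feature tokens, registered pointmap)

    def maybe_insert(self, frame_id, feat, pointmap):
        if len(self.frames) < self.B:
            self.frames.append((feat, pointmap))
        elif random.random() < self.B / frame_id:
            self.frames[random.randrange(self.B)] = (feat, pointmap)
```

This is standard reservoir sampling, so the buffering set remains an unbiased bounded-memory subset of all registered frames regardless of video length.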

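The points embedding of Sec. 3.2 can likewise be sketched with a strided convolution that mirrors ViT image patchification; the layer shapes and the assumption that geometric and visual tokens share the same patch ordering are ours.

```python
# Sketch: patchify the I2P pointmap into geometric tokens and add them to the
# visual tokens produced by the shared image encoder.
import torch
import torch.nn as nn

class PointsEmbed(nn.Module):
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, vis_tokens, pointmap):
        # vis_tokens: (B, T, d) from the image encoder; pointmap: (B, H, W, 3) from I2P.
        geo = self.proj(pointmap.permute(0, 3, 1, 2))    # (B, d, H/p, W/p)
        geo_tokens = geo.flatten(2).transpose(1, 2)       # (B, T, d), same patch order assumed
        return vis_tokens + geo_tokens                    # joint appearance + geometry tokens
```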
4. Experiments

Datasets. For both the Image-to-Points (I2P) and Local-to-World (L2W) models, we perform training with a mixture of three datasets: ScanNet++ [69], Aria Synthetic Environments [3] and CO3D-v2 [39]. These datasets vary from scene-level to object-centric, and contain both real-world and synthetic scenes. Since they are all recorded sequentially, we can easily extract video clips with a sliding-window mechanism as our training data. We select about 880K clips for training in total. To validate our reconstruction quality, we conduct quantitative evaluations on two unseen datasets: 7 Scenes [49], a real-world dataset of partial scenes, and Replica [52], a synthetic dataset of complete scenes. We also demonstrate visual reconstruction results across diverse datasets and in-the-wild captured videos to showcase the generalization ability of SLAM3R.

Implementation details. Both the I2P and L2W models build upon the architecture of DUSt3R [62] with minimal but effective modifications, making it natural for them to initialize their weights from the DUSt3R pre-trained model. We initialize our weights using the DUSt3R model trained on 224×224 resolution images, with m = 24 encoder blocks and n = 12 decoder blocks with linear heads.

To train our I2P model, we extend the training process of DUSt3R from two views to multiple views. Specifically, our I2P model takes as input a video clip of length 11 with a suitable stride between adjacent frames, and designates the middle frame as the keyframe. We train the I2P model for 100 epochs, which takes about 6 hours.

After that, we train the retrieval module built on the I2P model. During training, we freeze all other modules and use an L1 loss to supervise the correlation score against the mean confidence of the I2P model's final predictions. This module requires only 20 epochs of training.

The window size is set to 13 when training the L2W model. A confidence-aware loss without scale normalization is applied, ensuring that the predicted point maps retain a consistent scale with the input scene frames. The model is trained for 200 epochs, with the entire training process taking approximately 15 hours.

Due to computational constraints, all training images are center-cropped to 224×224 pixels. Standard data augmentation techniques [62] are applied. Our training is conducted on 8 NVIDIA 4090D GPUs, each with 24 GB of memory. For more details, please refer to our supplementary material.

4.1. Comparisons

Evaluation metrics. Following the implementation of NICER-SLAM [77] and Spann3R [59], we build a ground-truth point cloud model for each test scene by back-projecting pixels to the world using ground-truth depths and camera parameters. To address scene scale issues, we initially align our dense pointmap predictions with the ground truth using the Umeyama algorithm [56]. This alignment is further refined through an ICP [41] procedure, ensuring optimal correspondence between the predicted and ground-truth point clouds. Reconstruction quality is quantified by computing the accuracy and completeness of the two point clouds. Furthermore, we report the frames per second (FPS) achieved during the reconstruction process on a single NVIDIA 4090D GPU to demonstrate the computational efficiency of our approach.

Experiments on the 7 Scenes [49] dataset. The numerical results of scene reconstruction quality are reported in Table 1. Following Spann3R [59]'s setting, we uniformly sample one-twentieth of the frames in each test sequence as the input video. Each video is regarded as an individual scene. We evaluate SLAM3R using two settings: directly using the full pointmaps predicted for all input frames to create reconstruction results (denoted by SLAM3R-NoConf), and filtering each frame's predicted pointmaps with a simultaneously predicted confidence map before creating reconstruction results (SLAM3R). We compare our method with optimization-based reconstruction DUSt3R [62], triangulation-based MASt3R [26], and online incremental reconstruction Spann3R. DUSt3R is tested using the weight-224 model with a resolution of 224×224, the same as our input resolution, while MASt3R is tested using the weight-512 model with a resolution of 512×384. Our method outperforms all baselines in both accuracy and completeness while maintaining real-time performance. As shown in the Office-09 scene (the top row in Figure 4), our approach demonstrates much less drift compared to the concurrent work Spann3R.

Experiments on the Replica [52] dataset. Besides the baselines used in the 7 Scenes experiments, we also compare against the SLAM-based reconstruction approaches NICER-SLAM [77], DROID-SLAM [54], DIM-SLAM [27] and GO-SLAM [73] on the Replica [52] dataset. The numerical results on full scene reconstruction are reported in Table 2. Due to the memory constraint, DUSt3R [62] and MASt3R [26] process only one-twentieth of the frames for reconstruction. As shown in the table, our method surpasses all baselines with FPS greater than 1. Notably, without any optimization procedure, our method achieves reconstruction quality comparable to that of optimization-based methods such as NICER-SLAM [77] and DUSt3R [62]. The example of Office 2 (the bottom row in Figure 4) also illustrates the global consistency of our reconstruction result.
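For reference, below is a minimal sketch of the accuracy/completeness computation described under "Evaluation metrics", assuming the predicted cloud has already been aligned to the ground truth with Umeyama [56] and ICP [41]; averaging the per-point nearest-neighbor distances is our assumption about the exact aggregation.

```python
# Sketch: accuracy = distance from each predicted point to its nearest GT point,
# completeness = distance from each GT point to its nearest predicted point,
# both computed in a common (already aligned) coordinate frame.
import numpy as np
from scipy.spatial import cKDTree

def accuracy_completeness(pred_pts, gt_pts):
    # pred_pts: (N, 3) predicted point cloud; gt_pts: (M, 3) ground-truth point cloud.
    acc = cKDTree(gt_pts).query(pred_pts, k=1)[0].mean()    # predicted -> ground truth
    comp = cKDTree(pred_pts).query(gt_pts, k=1)[0].mean()   # ground truth -> predicted
    return acc, comp
```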

Method    Chess    Fire    Heads    Office    Pumpkin    RedKitchen    Stairs    Average    FPS
(each scene column reports Acc. / Comp.)
DUSt3R [62] 2.26 / 2.13 1.04 / 1.50 1.66 / 0.98 4.62 / 4.74 1.73 / 2.43 1.95 / 2.36 3.37 / 10.75 2.19 / 3.24 <1
MASt3R [26] 2.08 / 2.12 1.54 / 1.43 1.06 / 1.04 3.23 / 3.19 5.68 / 3.07 3.50 / 3.37 2.36 / 13.16 3.04 / 3.90 ≪1
Spann3R [59] 2.23 / 1.68 0.88 / 0.92 2.67 / 0.98 5.86 / 3.54 2.25 / 1.85 2.68 / 1.80 5.65 / 5.15 3.42 / 2.41 >50
SLAM3R-NoConf (Ours) 2.12 / 1.21 0.95 / 0.80 3.23 / 1.67 2.59 / 2.21 1.99 / 2.04 2.09 / 1.88 4.54 / 6.38 2.40 / 2.24 ∼25
SLAM3R (Ours) 1.63 / 1.31 0.84 / 0.83 2.95 / 1.22 2.32 / 2.26 1.81 / 2.05 1.84 / 1.94 4.19 / 6.91 2.13 / 2.34 ∼25

Table 1. Reconstruction results on the 7 Scenes [49] dataset. We report accuracy and completeness in centimeters. The best results are shown in bold. The average numbers are computed over all test sequences.

Method    Room 0    Room 1    Room 2    Office 0    Office 1    Office 2    Office 3    Office 4    Average    FPS
(each scene column reports Acc. / Comp.)
DUSt3R [62] 3.47 / 2.50 2.53 / 1.86 2.95 / 1.76 4.92 / 3.51 3.09 / 2.21 4.01 / 3.10 3.27 / 2.25 3.66 / 2.61 3.49 / 2.48 <1
MASt3R [26] 4.01 / 4.10 3.61 / 3.25 3.13 / 2.15 2.57 / 1.63 12.85 / 8.13 3.13 / 1.99 4.67 / 3.15 3.69 / 2.47 4.71 / 3.36 ≪1
NICER-SLAM [77] 2.53 / 3.04 3.93 / 4.10 3.40 / 3.42 5.49 / 6.09 3.45 / 4.42 4.02 / 4.29 3.34 / 4.03 3.03 / 3.87 3.65 / 4.16 ≪1
DROID-SLAM [54] 12.18 / 8.96 8.35 / 6.07 3.26 / 16.01 3.01 / 16.19 2.39 / 16.20 5.66 / 15.56 4.49 / 9.73 4.65 / 9.63 5.50 / 12.29 ∼20
DIM-SLAM [27]* 14.19 / 6.24 9.56 / 6.45 8.41 / 12.17 10.16 / 5.95 7.86 / 8.33 16.50 / 8.28 13.01 / 6.77 13.08 / 8.62 11.60 / 7.85 ∼3
GO-SLAM [73] - - - - - - - - 3.81 / 4.79 ∼8
Spann3R [59] 9.75 / 12.94 15.51 / 12.94 7.28 / 8.50 5.46 / 18.75 5.24 / 16.64 9.33 / 11.80 16.00 / 9.03 13.97 / 16.02 10.32 / 13.33 >50
SLAM3R-NoConf (Ours) 3.37 / 2.40 3.22 / 2.33 3.15 / 2.00 4.43 / 2.59 3.18 / 2.34 3.95 / 2.78 4.20 / 3.15 4.57 / 3.38 3.76 / 2.62 ∼24
SLAM3R (Ours) 3.19 / 2.40 3.12 / 2.34 2.72 / 2.00 4.28 / 2.60 3.17 / 2.34 3.84 / 2.78 3.90 / 3.16 4.32 / 3.36 3.57 / 2.62 ∼24

Table 2. Reconstruction results on Replica [52] dataset. We report accuracy and completeness in centimeters. The best results are shown
in bold. * denotes the results reported in NICER-SLAM.

Figure 4. We visualize the reconstruction results on two scenes: Office-09 and Office 2 from the 7-Scenes [49] and Replica [52] datasets.
Our method runs in real-time and achieves high-quality reconstruction comparable to the offline method DUSt3R [62].

4.2. Analyses

Effectiveness of the I2P model. To highlight the advantages of our multi-view I2P model over the original two-view DUSt3R [62], we evaluate the reconstruction quality of keyframes with varying numbers of supporting views. We conduct experiments on the Replica [52] dataset, where input views are sampled using a sliding window of different sizes, and the reconstruction accuracy and completeness of the keyframes are computed. The results are reported in Table 3. As the number of supporting views increases, our approach progressively improves reconstruction quality. Notably, the efficiency of our method remains stable until the window size exceeds 11, demonstrating the effectiveness of our parallel design. However, the results also show diminishing returns as the number of views increases, which we detail in the supplementary material. Visual results of I2P reconstruction can be found in Figure 1.

Method    # Frames    Acc.    Comp.    FPS
DUSt3R [62]    2    3.16    2.89    42.55
I2P    2    3.39    3.04    42.55
I2P    5    2.62    2.28    40.82
I2P    11    2.38    2.03    40.11
I2P    15    2.27    1.94    35.51
I2P    51    2.23    1.86    11.97

Table 3. Inner-window keyframe reconstruction results with various window lengths.

Figure 5. Qualitative examples. We show our reconstruction results on Tanks and Temples [24], BlendedMVS [67], Map-free Reloc [2],
LLFF [33], and ETH3D [47, 48] datasets, as well as in-the-wild captured videos, to demonstrate SLAM3R’s generalization ability.

Method    Acc.    Comp.    FPS
I2P+GA    4.87    3.00    ∼3
I2P+UI    7.47    3.86    ∼1
I2P+L2W    6.19    3.54    ∼92
I2P+L2W+Re (Full)    3.62    2.70    ∼43

Table 4. Reconstruction results using various point alignment methods and scene frame selection strategies. The FPS reported only accounts for the overhead of the alignment operation.

Advantages of the L2W model. The effectiveness of the L2W model is evaluated through ablation studies on the Replica [52] dataset. Per-window reconstructions are first generated with a window size of 11 using the I2P model. Local points are then aligned to a unified coordinate frame using different methods: global optimization from DUSt3R [62] (I2P+GA), traditional approaches such as Umeyama [56] and ICP [41] (I2P+UI), and our L2W model (I2P+L2W+Re). For consistency, we set the window size for global optimization to 10, which is equal to the number of views used to align new frames in the other methods. Results in Table 4 show that our full method achieves superior alignment accuracy and computational efficiency compared to the alternatives.

Analysis of the retrieval module. We propose a lightweight retrieval module that selects historical scene frames from the reservoir. This approach effectively performs implicit re-localization. We compare our retrieval method with a baseline approach that selects the ten nearest previous frames, named I2P+L2W. The results in Table 4 indicate a significant performance improvement with our retrieval strategy, demonstrating its effectiveness.

In-the-wild scene reconstruction. We have tested our method on a diverse range of unseen datasets and found that SLAM3R shows strong generalization capabilities. Figure 5 shows our reconstruction results on the Tanks and Temples [24], BlendedMVS [67], Map-free Reloc [2], LLFF [33], and ETH3D [47, 48] datasets, as well as in-the-wild videos we captured. These results show that our method performs reliably across different scales and scenes.

5. Conclusion

In this paper, we present SLAM3R, a novel and efficient monocular RGB SLAM system for real-time high-quality dense 3D reconstruction. It employs a two-hierarchy neural network framework to perform end-to-end 3D reconstruction through streamlined feed-forward processes, eliminating the need to explicitly solve any camera parameters. Experiments demonstrate its state-of-the-art reconstruction quality and real-time efficiency, achieving 20+ FPS.

6. Limitation and Future Work

Our system reconstructs scenes without explicitly solving or optimizing camera parameters. While this approach is efficient, it prevents us from performing global bundle adjustment. Consequently, in large-scale outdoor scenes, we still face the challenge of accumulated drift. Addressing this limitation will be a focus of our future work.

Supplementary Material

In this appendix, we first present additional implementation details and experimental settings in Sec. A and Sec. B, which were omitted from the main paper due to the page limit. We then report additional analyses in Sec. C. Finally, we show more reconstruction results of our method in Sec. D.

A. Implementation details

Retrieval module. We propose a lightweight module for efficient scene frame retrieval to support the keyframe registration. The retrieval module directly reuses I2P's decoder blocks as its backbone, followed by a linear projection and an average-pooling layer. Specifically, it uses the first two blocks from both the supporting and keyframe decoders, for scene frames and keyframes (awaiting registration), respectively. It takes as input the image features of one keyframe and all the scene frames in the buffering set, predicting correlation scores between the keyframe and each buffered frame. Notably, the correlation scores share similar behavior with the mean confidence of the I2P model's final prediction and offer unique advantages over the cosine similarity between the image features of two frames: they account for visual similarity and also reflect suitable baselines for 3D reconstruction.

The module inherits the weights of the first two layers of the decoder in the I2P model. During training, only the weights of the linear projection are updated using an L1 loss:

L_Retr = Σ_{i=1}^{R} | S'_i − Mean(C'_i) |,
S'_i = Sigmoid(S_i),
C'_i = (C_i − 1) / C_i,

where R is the number of input supporting frames, S_i is the predicted correlation score between supporting frame i and the keyframe, and C_i is the predicted confidence from the complete I2P model. Both S_i and C_i are normalized to [0, 1] before calculating the loss.

Multi-keyframe co-registration. In practice, our scene decoder in the L2W model adopts the same architecture as the keyframe decoder in the I2P model, allowing for the simultaneous input and registration of multiple keyframes. In the decoding stage, scene frames and keyframes exchange information bidirectionally: each scene frame queries features from all keyframes, and each keyframe interacts with all scene frames. Compared to single-keyframe registration, this extension significantly reduces computational overhead by registering multiple keyframes with a single pass of the scene decoder. Furthermore, incorporating information from additional keyframes enhances the refinement of scene frame features, leading to more accurate reconstruction for all input frames.

Training details. To construct the training data, we utilize all iPhone and DSLR frames registered by COLMAP [45] from the training and validation splits of ScanNet++ [69]. Additionally, we include all frames from the first 450 scenes of the Aria Synthetic Environments (ASE) [3] dataset and 41 categories from CO3D-v2 [39], with each category containing up to 50 randomly sampled scene sequences. We introduce two ways to extract video clips for training. For ScanNet++ and ASE, we adopt uniform sampling with strides of 3 and 2, respectively. For CO3D-v2, frames are randomly sampled within temporal segments covering half the length of each video. In total, we extract approximately 880K clips. During each epoch of training for the I2P model, we randomly sample 2000, 1000, and 1000 clips of length 11 from the ScanNet++, ASE, and CO3D-v2 datasets, respectively. When training the L2W model, the number of clips sampled per epoch doubles, and the clip length increases to 13. When training with ground-truth pointmaps, we set invalid points to (0, 0, 0). The effective batch sizes for training are 16 for the I2P model and 32 for the L2W model.

B. Details for experimental settings

Full video as input on Replica [52]. On the Replica dataset, we reconstruct the entire scene geometry using all video frames. The scene is initialized with a 5-frame window, and the local pointmap is recovered for each frame using the I2P model with a window size of 11. Within each window, frames are sampled at intervals of I = 20 to ensure reasonable camera motion. By default, we co-register K = 10 keyframes at a time, which share the same S = 10 scene frames as a reference. The S scene frames are selected through a two-step process. First, we calculate the correlation score between all frames in the buffering set and the K keyframes. Then, we select the S frames from the buffering set that show the highest total correlation score with these keyframes. After every R = 20 registered keyframes, we update the buffering set by retaining the keyframes with the highest reconstruction scores, where the reconstruction score of a frame is the product of its mean confidences predicted by the I2P and L2W models.
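A minimal sketch of the two-step scene-frame selection and the buffering-set refresh described above; the retrieval call and variable names are placeholders for the actual modules, not a released API.

```python
# Sketch: sum each buffered frame's correlation score over the K keyframes and keep the
# S frames with the highest totals as the shared global reference; a frame's
# "reconstruction score" is the product of its mean I2P and L2W confidences.
import torch

def select_scene_frames(retrieval, buffer_feats, keyframe_feats, S=10):
    # buffer_feats: list of buffered (T, d) feature tensors; keyframe_feats: K (T, d) tensors.
    scores = torch.stack([retrieval(k, buffer_feats) for k in keyframe_feats])  # (K, |buffer|)
    total = scores.sum(dim=0)                                                    # per-frame total
    return torch.topk(total, k=min(S, total.numel())).indices                    # scene-frame indices

def reconstruction_score(i2p_conf, l2w_conf):
    # Frames with the highest scores are retained when the buffering set is refreshed
    # after every R registered keyframes.
    return i2p_conf.mean() * l2w_conf.mean()
```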

Figure 6. The reconstruction results and the corresponding accuracy heatmaps of MASt3R [26] on Office 3 from the Replica [52] dataset under different confidence thresholds. Lighter colors indicate higher accuracy.

Figure 7. Visualization of the incremental reconstruction process of our method on Office 3 and Room 1 of the Replica [52] dataset. Our method achieves low drift without any global optimization stage.

Figure 8. Reconstruction results on unorganized image collections from the DTU [1] dataset.

Sampled frames as input on 7 Scenes [49]. Following recent research [59], we uniformly sample frames from the 7 Scenes dataset before evaluation. The frames are sampled with a stride of 20 from each test sequence, and we only reconstruct the points from the sampled frames. To handle sampled-frame-only input, we adapt our reconstruction pipeline for full-video input by setting I = 1, K = 2, S = 5, and R = 1 in practice.

Experiments on DUSt3R [62] and MASt3R [26]. The global optimization with the complete-graph setting in DUSt3R and MASt3R requires substantial GPU memory. Consequently, to evaluate the global reconstruction quality of these two methods on the Replica dataset, we uniformly sample 1/20 of the images. As MASt3R provides model weights trained exclusively at a resolution of 512px, we use images with resolutions of 512×384 and 512×288 as inputs for reconstructing the 7 Scenes [49] and Replica [52] datasets, respectively.
Method Chess Fire Heads Office Pumpkin RedKitchen Stairs Average FPS
DUSt3R [62] 9.03 8.46 3.16 15.83 13.85 13.00 25.31 12.65 <1
DUSt3R [62] (gt intrinsic) 5.08 5.08 2.47 13.19 10.96 10.71 13.27 8.68 <1
MASt3R [26] 4.31 2.55 1.78 11.37 11.31 7.84 5.27 6.35 ≪1
MASt3R [26] (gt intrinsic) 4.68 2.91 1.43 13.94 12.82 8.21 3.18 6.74 ≪1
NICER-SLAM [77] 3.28 6.85 4.16 10.84 20.00 3.94 10.81 8.55 <1
DIM-SLAM [27]* 60.13 2.41 39.42 24.15 13.07 39.11 10.45 26.96 ∼3
DROID-SLAM [54] 3.36 2.40 1.43 9.19 16.46 4.94 1.85 5.66 ∼20
DROID-SLAM [54]* 3.55 2.49 3.32 10.93 48.53 4.72 2.58 10.87 ∼20
Spann3R [59] 9.86 6.83 7.02 23.85 12.87 15.16 13.89 12.78 >50
SLAM3R (Ours) 6.59 5.99 12.24 15.52 15.05 11.44 12.19 11.27 ∼25

Table 5. Camera pose estimation results on the 7 Scenes [49] dataset, reported using the ATE-RMSE (cm) metric. The average numbers are computed over all test scenes.

Method Room 0 Room 1 Room 2 Office 0 Office 1 Office 2 Office 3 Office 4 Average FPS
DUSt3R [62] (1/20) 7.69 8.28 11.41 7.00 4.82 5.16 5.32 8.19 7.23 <1
DUSt3R [62] (gt intrinsic) 3.97 4.80 7.65 5.10 3.99 3.84 2.95 6.42 4.84 <1
MASt3R [26] 1.60 1.88 4.05 1.57 6.72 1.49 75.09 5.09 12.19 ≪1
MASt3R [26] (gt intrinsic) 1.06 1.05 0.85 0.95 7.52 1.47 2.01 1.84 2.09 ≪1
NICER-SLAM [77] 1.36 1.60 1.14 2.12 3.23 2.12 1.42 2.01 1.88 <1
GO-SLAM [73] - - - - - - - - 0.39 ∼8
DIM-SLAM [27]* 1.06 0.49 0.32 0.43 0.26 0.65 0.55 3.69 0.93 ∼3
DROID-SLAM [54] 0.34 0.13 0.27 0.25 0.42 0.32 0.52 0.40 0.33 ∼20
DROID-SLAM [54]* 0.58 0.58 0.38 1.06 0.40 0.70 0.53 1.33 0.70 ∼20
Spann3R [59] (1/20) 30.14 39.39 32.48 37.65 27.06 45.02 57.11 42.68 38.94 >50
SLAM3R (Ours) 4.62 5.95 5.46 10.96 7.18 5.85 5.27 16.33 7.70 ∼24

Table 6. Camera pose estimation results on the Replica [52] dataset reported using the ATE-RMSE (cm) metric.

In contrast, our method center-crops and resizes the input images to a fixed resolution of 224×224, which results in less overlap between adjacent frames, making reconstruction inherently more challenging.

During the evaluation, we observed that MASt3R occasionally generates floating points with high confidence scores, which are difficult to filter using confidence thresholds and significantly degrade accuracy. An example of this issue is shown in Figure 6. In contrast, our confidence scores are more effective and successfully reduce erroneous points. The results of SLAM3R reported on the 7 Scenes and Replica datasets use a fixed confidence threshold of 3.

C. Additional analyses

Diminishing return of window length. In the main paper, we report the I2P reconstruction results with different window lengths. Here, we further analyze the diminishing returns, which indicate that the window length should not be too large. As Figure 9 shows, the accuracy and completeness of the keyframe reconstruction improve rapidly at first as the number of input frames increases, but then gradually decline. This is because larger windows result in less and less overlap. Additionally, the inference time becomes significantly slower as the length increases. Consequently, we set the window size to 11 in our main experiments, balancing reconstruction quality and runtime efficiency.

Figure 9. Inner-window keyframe reconstruction results from various window lengths.

Effect of scene frame numbers on registration. We conduct experiments on the Replica [52] dataset to investigate how the number of scene frames selected as a global reference affects the registration quality of keyframes.

As reported in Table 7, the accuracy of full-scene registration initially improves as the maximum number of input scene frames increases, but eventually declines beyond a certain threshold. Retrieving too few scene frames from the buffering set risks missing suitable frames and causing keyframe registration to get stuck in local minima. Conversely, selecting too many scene frames can introduce irrelevant ones that add noise and hinder registration.

# Scene frames    Acc.    Comp.    FPS
1    4.18    2.61    ∼398
5    3.99    2.79    ∼247
10    3.57    2.62    ∼152
20    3.57    2.60    ∼86
30    3.59    2.58    ∼61
40    4.15    3.05    ∼46
50    4.27    3.15    ∼37

Table 7. Reconstruction results on the Replica [52] dataset with various maximum numbers of scene frames selected for keyframe registration. The FPS of the L2W model when aligning 10 keyframes at once with different numbers of input scene frames is also reported.

To balance reconstruction accuracy and runtime efficiency, we set the number of retrieved scene frames to 5 and 10 on the 7 Scenes [49] and Replica [52] datasets, respectively, which achieves consistent and reliable performance.

Camera tracking results. Our method is designed in a new paradigm that reconstructs 3D points end-to-end without explicitly solving camera parameters. We have found that camera poses and scene reconstruction results are not fully positively correlated. Here, following NICER-SLAM [77], we evaluate the camera pose accuracy on the 7 Scenes [49] and Replica [52] datasets using the absolute trajectory error (ATE-RMSE). It is calculated by aligning the predicted camera trajectory with the GT trajectory. The predicted trajectory is derived by estimating camera poses through the PnP-RANSAC solver in OpenCV [6], leveraging the predicted pointmaps and GT intrinsics of each frame. Camera pose translations are then used to construct the trajectory. For DUSt3R [62] and MASt3R [26], we evaluate both the camera poses generated during global optimization and those derived via the PnP-RANSAC solver with their predicted pointmaps and GT intrinsic parameters. When evaluating Spann3R [59] on the Replica [52] dataset, only one-twentieth of the frames are used, as it fails to give reasonable results with all frames as input.

The results are presented in Table 5 and Table 6. We outperform the concurrent work Spann3R [59], demonstrating the effectiveness of our hierarchical design with multi-view input and global retrieval. Among classical SLAM systems, the pose errors of GO-SLAM [73] and DROID-SLAM [54] are lower than those of NICER-SLAM. However, their reconstruction accuracy and completeness are worse. A similar pattern appears in comparison with DUSt3R: while its poses are worse than those of NICER-SLAM, it achieves the best reconstruction quality on the Replica dataset. Therefore, pose accuracy and reconstruction quality sometimes show inconsistencies. This discrepancy between pose and reconstruction errors indicates that effective end-to-end 3D reconstruction is possible and promising without first obtaining precise camera poses.

D. More visual results

Visualization of incremental reconstruction. Figure 7 visualizes the process of our incremental reconstruction on two scenes from Replica [52]. Our method achieves effective alignment at loops while experiencing minimal cumulative drift, without an offline global optimization step.

Reconstruction on the DTU [1] dataset. The results are shown in Figure 8. Note that our method does not require any camera parameters, and produces dense point cloud reconstructions end-to-end in real-time.
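A minimal sketch of the pose-evaluation protocol described under "Camera tracking results": each frame's camera is recovered from its predicted pointmap and the GT intrinsics with OpenCV's PnP-RANSAC, and the resulting camera centers form the trajectory used for ATE-RMSE. The variable names and the omission of confidence-based filtering are our simplifications.

```python
# Sketch: recover a camera center from a predicted pointmap (in scene coordinates)
# using PnP-RANSAC with ground-truth intrinsics.
import cv2
import numpy as np

def camera_center_from_pointmap(pointmap, K):
    # pointmap: (H, W, 3) predicted 3D points in the scene frame; K: (3, 3) GT intrinsics.
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v], axis=-1).reshape(-1, 2).astype(np.float64)   # 2D pixel coordinates
    obj = pointmap.reshape(-1, 3).astype(np.float64)                     # corresponding 3D points
    ok, rvec, tvec, _ = cv2.solvePnPRansac(obj, pix, K.astype(np.float64), None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)              # world-to-camera rotation
    return (-R.T @ tvec).ravel()            # camera center (translation) in scene coordinates
```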

References

[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120:153–168, 2016.
[2] Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022.
[3] Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, et al. Scenescript: Reconstructing scenes with an autoregressive structured language model. arXiv preprint arXiv:2403.13064, 2024.
[4] Tim Bailey and Hugh Durrant-Whyte. Simultaneous localization and mapping (slam): Part ii. IEEE robotics & automation magazine, 13(3):108–117, 2006.
[5] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J Davison. Codeslam—learning a compact, optimisable representation for dense visual slam. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2560–2568, 2018.
[6] G Bradski. The opencv library. Dr. Dobb's Journal of Software Tools, 2000.
[7] Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
[8] Yiyang Chen, Siyan Dong, Xulong Wang, Lulu Cai, Youyi Zheng, and Yanchao Yang. Sg-nerf: Neural surface reconstruction with scene graph optimization. arXiv preprint arXiv:2407.12667, 2024.
[9] Chi-Ming Chung, Yang-Che Tseng, Ya-Ching Hsu, Xiang-Qian Shi, Yun-Hung Hua, Jia-Fong Yeh, Wen-Chin Chen, Yi-Ting Chen, and Winston H Hsu. Orbeez-slam: A real-time monocular visual slam with orb features and nerf-realized mapping. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9400–9406. IEEE, 2023.
[10] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J Davison. Deepfactors: Real-time probabilistic dense monocular slam. IEEE Robotics and Automation Letters, 5(2):721–728, 2020.
[11] Pinxuan Dai, Jiamin Xu, Wenxiang Xie, Xinguo Liu, Huamin Wang, and Weiwei Xu. High-quality surface reconstruction using gaussian surfels. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
[12] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping: part i. IEEE robotics & automation magazine, 13(2):99–110, 2006.
[14] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European conference on computer vision, pages 834–849. Springer, 2014.
[15] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2017.
[16] Yasutaka Furukawa, Carlos Hernández, et al. Multi-view stereo: A tutorial. Foundations and Trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015.
[17] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5354–5363, 2024.
[18] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
[19] Huajian Huang, Longwei Li, Hui Cheng, and Sai-Kit Yeung. Photo-slam: Real-time simultaneous localization and photorealistic mapping for monocular stereo and rgb-d cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21584–21593, 2024.
[20] Jiahui Huang, Shi-Sheng Huang, Haoxuan Song, and Shi-Min Hu. Di-fusion: Online implicit 3d reconstruction with deep priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8932–8941, 2021.
[21] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21357–21366, 2024.
[22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
[23] Georg Klein and David Murray. Parallel tracking and mapping on a camera phone. In 2009 8th IEEE International Symposium on Mixed and Augmented Reality, pages 83–86. IEEE, 2009.
[24] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
[25] Lukas Koestler, Nan Yang, Niclas Zeller, and Daniel Cremers. Tandem: Tracking and dense mapping in real-time using deep multi-view stereo. In Conference on Robot Learning, pages 34–45. PMLR, 2022.
[26] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. arXiv preprint arXiv:2406.09756, 2024.
[27] Heng Li, Xiaodong Gu, Weihao Yuan, Luwei Yang, Zilong Dong, and Ping Tan. Dense rgb slam with neural implicit maps. In Proceedings of the International Conference on Learning Representations, 2023.
[28] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2023.
[29] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5987–5997, 2021.
[30] Lorenzo Liso, Erik Sandström, Vladimir Yugay, Luc Van Gool, and Martin R Oswald. Loopy-slam: Dense neural slam with loop closures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20363–20373, 2024.
[31] Shaohui Liu, Yifan Yu, Rémi Pautrat, Marc Pollefeys, and Viktor Larsson. 3d line mapping revisited. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21445–21455, 2023.
[32] Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J Davison. Gaussian splatting slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18039–18048, 2024.
[33] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (ToG), 38(4):1–14, 2019.
[34] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[35] Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE transactions on robotics, 33(5):1255–1262, 2017. 1, 2
[36] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015. 1, 2
[37] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In 2011 international conference on computer vision, pages 2320–2327. IEEE, 2011. 1, 2
[38] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 523–540. Springer, 2020. 1
[39] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021. 6, 9
[40] Antoni Rosinol, John J Leonard, and Luca Carlone. Nerf-slam: Real-time dense monocular slam with neural radiance fields. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3437–3444. IEEE, 2023. 1, 2
[41] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the icp algorithm. In Proceedings third international conference on 3-D digital imaging and modeling, pages 145–152. IEEE, 2001. 6, 8
[42] Erik Sandström, Yue Li, Luc Van Gool, and Martin R Oswald. Point-slam: Dense neural point cloud-based slam. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18433–18444, 2023. 2
[43] Erik Sandström, Kevin Ta, Luc Van Gool, and Martin R Oswald. Uncle-slam: Uncertainty learning for dense neural slam. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4537–4548, 2023. 2
[44] Erik Sandström, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Luc Van Gool, Martin R Oswald, and Federico Tombari. Splat-slam: Globally optimized rgb-only slam with 3d gaussians. arXiv preprint arXiv:2405.16544, 2024. 1, 2
[45] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016. 1, 2, 9
[46] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016. 1, 2
[47] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017. 8
[48] Thomas Schops, Torsten Sattler, and Marc Pollefeys. Bad slam: Bundle adjusted direct rgb-d slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 134–144, 2019. 8
[49] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2937, 2013. 6, 7, 9, 10, 11, 12
[50] Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024. 2
[51] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM siggraph 2006 papers, pages 835–846. 2006. 1, 2
[52] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, and Richard Newcombe. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019. 6, 7, 8, 9, 10, 11, 12
[53] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J Davison. imap: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6229–6238, 2021. 1, 2
[54] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021. 1, 2, 6, 7, 11, 12
[55] Fabio Tosi, Youmin Zhang, Ziren Gong, Erik Sandström, Stefano Mattoccia, Martin R Oswald, and Matteo Poggi. How nerfs and 3d gaussian splatting are reshaping slam: a survey. arXiv preprint arXiv:2402.13255, 4, 2024. 1
[56] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(04):376–380, 1991. 6, 8
[57] Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985. 5
[58] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, and Marc Pollefeys. Itermvs: Iterative probability estimation for efficient multi-view stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8606–8615, 2022. 1, 2
[59] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024. 2, 3, 6, 7, 9, 11, 12

[60] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021. 1, 2
[61] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. arXiv preprint arXiv:2410.19115, 2024. 2
[62] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 1, 2, 3, 4, 6, 7, 8, 10, 11, 12
[63] Yiming Wang, Qin Han, Marc Habermann, Kostas Daniilidis, Christian Theobalt, and Lingjie Liu. Neus2: Fast learning of neural implicit surfaces for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3295–3306, 2023. 2
[64] Changchang Wu. Visualsfm: A visual structure from motion system. http://www.cs.washington.edu/homes/ccwu/vsfm, 2011. 1, 2
[65] Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. Gs-slam: Dense visual slam with 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19595–19604, 2024. 2
[66] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018. 1, 2
[67] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1790–1799, 2020. 8
[68] Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024. 2
[69] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 6, 9
[70] Vladimir Yugay, Yue Li, Theo Gevers, and Martin R Oswald. Gaussian-slam: Photo-realistic dense slam with gaussian splatting. arXiv preprint arXiv:2312.10070, 2023. 2
[71] Ganlin Zhang, Erik Sandström, Youmin Zhang, Manthan Patel, Luc Van Gool, and Martin R Oswald. Glorie-slam: Globally optimized rgb-only implicit encoding point cloud slam. arXiv preprint arXiv:2403.19549, 2024. 1, 2
[72] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024. 2
[73] Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. Go-slam: Global optimization for consistent 3d instant reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3727–3737, 2023. 1, 2, 6, 7, 11, 12
[74] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. Deeptam: Deep tracking and mapping. In Proceedings of the European conference on computer vision (ECCV), pages 822–838, 2018. 1, 2
[75] Heng Zhou, Zhetao Guo, Shuhong Liu, Lechen Zhang, Qihao Wang, Yuxiang Ren, and Mingrui Li. Mod-slam: Monocular dense mapping for unbounded 3d scene reconstruction. arXiv preprint arXiv:2402.03762, 2024. 1, 2
[76] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Pollefeys. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12786–12796, 2022. 1, 2
[77] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R Oswald, Andreas Geiger, and Marc Pollefeys. Nicer-slam: Neural implicit scene encoding for rgb slam. In 2024 International Conference on 3D Vision (3DV), pages 42–52. IEEE, 2024. 1, 2, 6, 7, 11, 12

