Mode-GS: Monocular Depth Guided Anchored 3D Gaussian Splatting

for Robust Ground-View Scene Rendering


Yonghan Lee1,2, Jaehoon Choi1, Dongki Jung1, Jaeseong Yun2,
Soohyun Ryu2, Dinesh Manocha1, and Suyong Yeon2

1 University of Maryland, 8125 Paint Branch Dr, College Park, MD 20742, USA.
2 NAVER LABS, 95 Jeongjail-ro, Bundang-gu, Seongnam-si, Gyeonggi-do, South Korea.

arXiv:2410.04646v1 [[Link]] 6 Oct 2024

Abstract— We present a novel-view rendering algorithm, Mode-GS, for ground-robot trajectory datasets. Our approach is based on using anchored Gaussian splats, which are designed to overcome the limitations of existing 3D Gaussian splatting algorithms. Prior neural rendering methods suffer from severe splat drift due to scene complexity and insufficient multi-view observation, and can fail to fix splats on the true geometry in ground-robot datasets. Our method integrates pixel-aligned anchors from monocular depths and generates Gaussian splats around these anchors using residual-form Gaussian decoders. To address the inherent scale ambiguity of monocular depth, we parameterize anchors with per-view depth-scales and employ a scale-consistent depth loss for online scale calibration. Our method results in improved rendering performance, based on PSNR, SSIM, and LPIPS metrics, in ground scenes with free trajectory patterns, and achieves state-of-the-art rendering performance on the R3LIVE odometry dataset and the Tanks and Temples dataset.

Fig. 1. Our Mode-GS integrates monocular depth estimation with anchored Gaussian splatting, using a scale-consistent depth calibration technique and residual-based Gaussian decoders. By incorporating dense pixel-aligned anchor points from monocular depth, anchored splatting improves robustness in scenarios without dense multi-view images and mitigates the impact of inaccurate poses in complex ground-view scenes. Our method can be developed using multi-sensor odometry poses in a point-cloud-free setting. Overall, it offers a practical and robust rendering pipeline for ground-view robotic datasets, as shown in Section V.

I. INTRODUCTION

The development of navigation and perception algorithms for autonomous robots typically requires extensive and costly field experiments for training and validation. In this context, neural rendering offers a practical solution, as it can significantly reduce the time and effort based on data simulation and augmentation [1], [2], [3]. Specifically, neural rendering can be used to generate novel view images by learning a neural scene representation from a set of input training images and corresponding poses [4], [5].

Current neural rendering research is based on two popular methods: implicit Neural Radiance Fields (NeRF) [4] and explicit 3D Gaussian Splatting (3DGS) [5], [6]. NeRF gained popularity because of its high-fidelity and continuous scene representations due to the implicit nature of radiance fields. However, it is hard to scale to large scenes because of the limited representational capacity of its coordinate-based Multi-Layer Perceptron (MLP), which cannot efficiently handle the cubic growth in scene complexity. On the other hand, 3DGS provides a feasible alternative for scene-scale rendering of ground-view robot datasets by explicitly representing only the non-empty parts of the scene, and using more interpretable Gaussian splats as scene primitives.

It turns out that complex ground-view robot datasets with free trajectories [7] present two fundamental challenges with respect to 3DGS algorithms: (1) the scarcity of multi-view information; (2) the difficulty of obtaining pixel-level accurate trajectory poses [8], [9], [10].

First of all, 3DGS requires dense point clouds for splat initialization and informative multi-view photometric gradients for Adaptive Density Control (ADC) [5] to expand splats into unoccupied areas, which significantly deteriorates its performance on ground-robot datasets. Since training a neural rendering algorithm heavily relies on dense multi-view observations due to its inherent high dimensionality, previous neural rendering approaches have largely focused on datasets with structured viewing patterns, such as aerial (top-down view) [11], [12], object-centric (inward view) [13], [14], and street (forward motion) datasets [15]. However, there is relatively less work on ground-robot datasets with free trajectories [16].

The second challenge in ground-view rendering is the sensitivity of 3DGS to the pixel-level pose accuracy of training images, which is crucial due to its reliance on pixel-based photometric loss. Acquiring pixel-accurate poses in ground-view datasets is difficult. Structure from Motion (SfM) [17] or vision-based SLAM methods [18] often struggle to consistently estimate poses in ground-robot datasets without fragmenting or diverging when images lack salient features or textures to track. While multi-sensor SLAM and odometry methods [19], [20], [16] are more reliable for trajectory pose estimation, they frequently fail to achieve pixel-level accuracy due to heterogeneous sensor configurations and sensor fusion. Due to the sensitivity of 3DGS to pose accuracy, SLAM algorithms that directly integrate 3D Gaussian splats [21], [22], [23] as the scene representation often fail to reliably estimate complete trajectories.

Fig. 2. We compare the degenerate training patterns of 3DGS in scenarios without dense multi-view information. The patterns are categorized according to their type: (a) Sequential Type (MonoGS): SLAM-based Gaussian splatting utilizes sequential information by processing consecutive images with pose refinement, initially generating sharper images. However, their pose tends to drift and eventually diverges; (b) Non-Anchored Type (Original 3DGS): In the original 3DGS and its variants with ADC, the splats tend to drift from the true geometry without dense multi-view photometric information; (c) Anchored Type (Ours): Anchoring effectively prevents splats from becoming detached from the actual geometry.

Main Results: We present a novel rendering approach, Mode-GS, to address these challenges in ground-robot datasets. Our method integrates monocular depth networks with anchored Gaussian splat generation, incorporating a scale-consistent depth calibration mechanism. By utilizing monocular depths, we initialize pixel-aligned anchors that fully cover the frustums across all input images, effectively preventing drift of splats caused by degenerate densification in the absence of sufficient multi-view photometric cues (Fig. 2). Inspired by [24], we design a Residual-Form Gaussian Decoder to robustly generate Gaussian splats around these anchors. Unlike the decoder structure in [24], our novel decoder enables direct initialization of splat attributes (e.g. color) and greatly improves the training efficiency thanks to the efficient residual structure. Lastly, the inherent scale ambiguity of monocular depth is mitigated during training by our Anchor Depth-Scale Parameterization and Scale-Consistent Depth Loss, leading to consistent depth calibration. Our novel contributions include:

• Scale-Consistent Integration of Monocular Depths: We introduce a novel approach that integrates scale-ambiguous monocular depth networks to initialize pixel-aligned splats. Through our anchor depth-scale parameterization and scale-consistent depth loss, we achieve consistent depth calibration during training, eliminating the need for initial SfM or LiDAR point clouds.

• Anchor-Decoder Structure with Residual-Form MLP Decoder: We present an advanced anchor-decoder structure for 3DGS, featuring our proposed residual-form Gaussian decoder. This allows for the direct initialization of anchored Gaussian splat attributes, improving both the efficiency and accuracy of the scene training process.

• Novel-View Synthesis from Ground-Robot Datasets: Our method shows robust rendering performance on ground-robot datasets with free trajectory patterns, achieving state-of-the-art rendering performance on the R3LIVE odometry dataset [16] even without LiDAR point clouds, while maintaining comparable performance on the Tanks and Temples dataset. Note that our algorithm can be built on easily obtainable odometry poses in a point-cloud-free setting, providing a practical and robust rendering pipeline for ground-view robot datasets. We highlight the improvements over prior methods in terms of rendering metrics (PSNR, SSIM, and LPIPS) on these datasets in Section V.

II. RELATED WORKS

Neural Rendering for Robot Navigation: Neural rendering [4], [5] can achieve remarkable photo-realistic rendering quality. This photorealistic rendering can be applied to various robotics applications such as navigation [1], [2], [25], robot data simulation [3], [26], and robotic teleoperation [27]. Prior works present robot navigation methods designed for 3D environments represented as NeRF [1], [25] or 3D Gaussian splats [2]. UAV-Sim [3] and PEGASUS [26] utilize neural rendering to synthesize data for aerial perception or robotic manipulation.

3D Gaussian Splatting with Geometric Priors: 3D Gaussian Splatting (3DGS) [5] has received considerable attention. 3DGS utilizes 3D Gaussian splats [6] as 3D primitives which can be rendered through sorting and rasterization. Scaffold-GS [24] introduces anchors to cluster adjacent 3D Gaussians and uses a Multi-Layer Perceptron (MLP) to predict their attributes. However, since 3D Gaussian splats rely solely on photometric constraints, they often violate geometric coherence. Several studies [28], [29] focusing on surface extraction from Gaussian splats have addressed this issue by aligning flat 3D Gaussian splats with the geometry. 2DGS [30] directly projects 3D Gaussians onto flat 2D Gaussians for effective mesh extraction. Gaussian Opacity Fields [31] extract surfaces using ray-tracing-based volume rendering. To consider geometry regularization, prior methods have used either monocular depth estimation [32], [33] or multi-view stereo [34], [35] for designing training strategies.

3D Gaussian Splatting with SLAM: All of the previous methods depend on precise poses obtained from a Structure from Motion system [17], which requires significant computational resources. Recent research has expanded 3DGS by incorporating various SLAM systems, including those using monocular cameras [21] or RGB-D sensors [22]. The aforementioned methods are limited to small-scale scenes that allow for the collection of dense viewpoints covering large areas of the environment. Additionally, CF-3DGS [8] and InstantSplat [36] present end-to-end frameworks for joint novel view synthesis and camera pose estimation from sequential images. However, none of these methods effectively estimate a reliable camera trajectory in ground-view robot datasets for training the Gaussian splatting technique.

III. PRELIMINARY

3DGS [5] is a differentiable rendering method, which can be trained from n images {I_1, I_2, ..., I_n} to learn a volumetric scene representation G as a mixture of anisotropic Gaussian primitives {G_1, G_2, ..., G_m}. From the trained scene representation G, a tile-based differentiable rasterizer R can render an image I for a novel view T ∈ SE(3) as I = R(G | T).

Gaussian primitives G_i are initialized from the input point cloud with means µ_i at the corresponding point positions. As such, each Gaussian G(x) is represented as

    G(x) = exp( -(1/2) (x - µ)^T Σ^{-1} (x - µ) )    (1)

where x is an arbitrary position in 3D space and Σ denotes the covariance of the Gaussian kernel. Here, Σ is decomposed into a scaling matrix S and a rotation matrix R to preserve the semi-positive-definite form, Σ = R S S^T R^T. Additionally, each Gaussian stores an opacity value o_i, which is multiplied with G(x) as α_i(x) = G_i(x) o_i for the alpha-blending weight α_i, and a view-dependent color c_i represented by spherical harmonics (SH). 3D Gaussians are projected into 2D space [6], and the color C is computed through volumetric rendering, following a front-to-back depth order:

    C = Σ_{i∈N} c_i α_i Π_{j=1}^{i-1} (1 - α_j)    (2)

where N is the set of sorted Gaussians overlapping with the given pixel.

IV. METHODS

Our method aims to enable the rendering of novel-view images from ground-view robot trajectory datasets, which is challenging for the original 3DGS algorithms due to the complexity of ground-view scenes and the lack of multi-view observations under free trajectory patterns (Fig. 2). Our ground-view rendering pipeline integrates monocular depth networks [37] into an anchored Gaussian splatting [24] scheme with the proposed scale-consistent depth calibration framework. This approach leverages pixel-aligned splat initialization from monocular depth networks, thereby further improving the robustness of the anchored Gaussian structure [24].

As can be seen in Fig. 3, our pipeline consists of three main steps: 1) Per-View Anchor Initialization, 2) Anchored Gaussian Splat Generation, and 3) Training from Rendering Losses. Instead of directly initializing Gaussian splats, we initialize anchor points with embedded features and nominal Gaussian attributes, and generate child Gaussian splats by combining the nominal attributes with the residuals decoded from the Gaussian decoders. Note that our pipeline does not require initial SfM or LiDAR point clouds, but is built on easy-to-obtain odometry poses and corresponding images, which increases the practicality of our approach for real-world datasets.

Fig. 3. Our method consists of three main steps: (a) Per-View Anchor Initialization: Given monocular depth images, depth-scale-adjustable anchors are initialized from each view. Each anchor is fixed in the 3D scene except for the depth-scale toward the corresponding view. (b) Anchor Decoding with Residual-Form Gaussian Decoder: Each anchor is decoded into k Gaussian splats by our residual-form Gaussian decoders. When initialized, each anchor contains nominal Gaussian splat attributes (µ̄_j, r̄_j, c̄_j, ō_j, s̄_j) and an embedded feature f_j. The residual decoders generate k sets of residual attributes for child splats, which are combined with the nominal anchor attributes to generate child Gaussian splats. (c) Training with Scale-Consistent Depth Loss and Online Depth-Scale Calibration: We use a scale-consistent depth loss L_depth that incorporates a scale for each monocular depth supervision.

A. Per-View Anchor Initialization

Monocular Anchor Initialization: Instead of utilizing input SfM or LiDAR point clouds, we utilize monocular depth networks to initialize anchor points P fused from each training view i. First, monocular depth images [37] {D_i}_{1:n} are generated from the training images {I_i}_{1:n}. With given odometry poses {(R_i, t_i)}_{1:n}, where R_i ∈ SO(3) and t_i ∈ R^3, the depth images D_{1:n} are unprojected into 3D space. The 3D points are then voxelized with a small resolution ϵ to remove redundant anchor points, generating a cluster of per-view anchor points P = {P_i}_{i∈1:n}.

Anchor Depth-Scale Parameterization: Each set of per-view anchor points P_i has inherent scale ambiguity which needs to be calibrated for multi-view consistency. To allow depth-scale adjustment of each set of per-view anchor points P_i, we introduce a scale parameter ŝ_i for each view i. For each point p ∈ P_i, the scale parameter is applied as p_W = ŝ_i R_i p_C + t_i, transforming the point from camera coordinates C to world coordinates W. This scale parameterization of each anchor point group allows online depth-scale adjustment during training, mitigating the inherent scale-ambiguity problem of monocular depth images.

In complex ground scenes, SfM or LiDAR point cloud data often contains missing 3D structures due to frequent occlusion and lack of observation. This leads to significant drift in 3DGS algorithms when sufficient multi-view observations are unavailable, caused by degenerate cloning through ADC in 3DGS (Fig. 2). Our monocular initialization produces pixel-aligned, fully covering initial point clouds, while addressing per-view depth-scale ambiguity through the proposed parameterization and the scale-consistent depth loss that is further explained in Sec. IV-C.

B. Anchored Gaussian Splat Generation

Residual-Form Gaussian Decoder: We define our Residual-Form Gaussian Decoders to generate k Gaussian splats from each anchor. Each anchor point p_j ∈ P is associated with a feature descriptor f_j ∈ R^32 and nominal Gaussian splat attributes, including position µ̄_j ∈ R^3, color c̄_j ∈ R^3, opacity ō_j ∈ R, and scaling s̄_j ∈ R^3 for covariance composition. For clarity, we slightly modify our notation such that µ̄_j represents the position of the anchor point, while p_j denotes the point itself. We set the nominal values of the covariance-related rotation r̄_j to identity quaternions. Our residual decoders F_µ, F_o, F_c, F_s, F_r are defined for each Gaussian attribute α ∈ {µ, o, c, s, r} as lightweight 2-layer Multi-Layer
Perceptron (MLP) structures. During training and rendering, the decoders generate on-the-fly the residuals ∆α_i from the nominal attribute ᾱ_i stored in each anchor p_j, as follows:

    {∆α_0, ∆α_1, ..., ∆α_{k-1}} = F_α(f_j)    (3)

The use of residual-form decoders along with nominal attributes (µ̄_j, c̄_j, ō_j, s̄_j) enables faster training of the decoders and direct initialization of splat attributes, offering significant advantages over other anchored Gaussian methods [38], [24].

Anchored Gaussian Generation: k child Gaussian splats are spawned from each anchor p_j by combining the decoded residual attributes {∆α}_{1:k} and the nominal attributes ᾱ_j. This is expressed as:

    µ_{1:k} = µ̄_j + ∆µ_{1:k}    (4)
    c_{1:k} = c̄_j + ∆c_{1:k}    (5)
    s_{1:k} = s̄_j · ∆s_{1:k}    (6)
    o_{1:k} = ō_j + ∆o_{1:k}    (7)
    r_{1:k} = r̄_j · ∆r_{1:k}    (8)

As shown in Fig. 5(a), our residual attribute structure enables direct initialization of splat attributes α_{1:k} (e.g. color) by incorporating the reference value ᾱ. This approach accelerates the training of the decoders, as they only need to learn deviations from the reference value. By addressing one of the main weaknesses of the anchor-decoder scaffold structure [24], [38], our method improves both the training efficiency and the robustness of the original framework.

C. Training from Rendering Losses

The generated Gaussian splats provide an explicit 3D scene representation that can be rendered into novel-view color images and depth images, Î, D̂, using the tile-based rasterizer. We use the color image I and the monocular depth image D as supervision during training.

Scale-Consistent Depth Loss: Monocular depth images D inherently contain scale ambiguity, and therefore need to be calibrated with adequate scale parameters when they are used for depth supervision. Unlike previous approaches that employ depth losses based on the scale-invariant Pearson correlation [33], we define our depth loss term with a depth-scale parameter λ̂_i embedded for each monocular depth D_i, as follows:

    L_depth = Σ_{i=1}^{n} log ‖1 + (λ̂_i D_i - D̂_i)‖_2    (9)

Note that this depth-scale parameter λ̂_i differs from the scale parameter ŝ_i introduced in our anchor depth-scale parameterization. Specifically, ŝ_i allows each initialized per-view anchor group to adjust its scale toward the reference view, while λ̂_i corrects the monocular depth scale ambiguity during the loss calibration.

Full Loss Design: In addition to our proposed scaled-depth loss, the full loss function consists of a photometric loss L_photo, a volumetric regularization loss L_vol [39], and an anisotropic regularization loss L_aniso [40]. For completeness, we list these loss functions below:

    L_photo = w · D-SSIM(I_i, Î_i) + (1 - w) · ‖I_i - Î_i‖_1    (10)

In this equation, L_photo represents a combination of the L1 loss and the D-SSIM loss [5], where SSIM stands for the Structural Similarity Index Measure. The weight parameter w controls the balance between the two loss components.

    L_vol = Σ_{p∈P} Prod(s_p)    (11)

    L_aniso = (1/|P|) Σ_{p∈P} ( max{ max(s_p)/min(s_p), r } - r )    (12)
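To make the loss terms concrete, the NumPy sketch below transcribes Eqs. (9), (11), and (12). It is a minimal illustration under stated assumptions, not the authors' implementation: `scales` is assumed to be an (m, 3) array of per-splat scale vectors s_p, the depth maps are dense arrays, and Eq. (9) is read literally, with the 2-norm taken over 1 + (λ̂_i D_i − D̂_i).

```python
import numpy as np

def depth_loss(D_mono, D_render, lam):
    """Eq. (9), read literally: sum_i log ||1 + (lam_i * D_i - Dhat_i)||_2.

    D_mono, D_render: lists of (H, W) depth maps; lam: per-view depth scales.
    """
    total = 0.0
    for D_i, Dhat_i, lam_i in zip(D_mono, D_render, lam):
        total += np.log(np.linalg.norm(1.0 + (lam_i * D_i - Dhat_i)))
    return total

def vol_loss(scales):
    """Eq. (11): sum over splats of the product of each splat's scales."""
    return float(np.prod(scales, axis=1).sum())

def aniso_loss(scales, r=3.0):
    """Eq. (12): mean excess of the max/min scale ratio over a threshold r."""
    ratio = scales.max(axis=1) / scales.min(axis=1)
    return float(np.mean(np.maximum(ratio, r) - r))
```

With these terms, the overall objective of Eq. (13) is just a weighted sum, e.g. `lp * photo + ls * vol_loss(s) + ld * depth_loss(Dm, Dr, lam) + lu * aniso_loss(s)`; the weights and the ratio threshold `r` are hyperparameters whose values the excerpt does not specify.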
Both L_vol and L_aniso are applied to the splats P to regularize their shape. Here, Prod means the product of the covariance-related Gaussian scales [40]. The overall loss function is formulated as:

    L = λ_p L_photo + λ_s L_vol + λ_d L_depth + λ_u L_aniso    (13)

Here, λ_p, λ_s, λ_d, and λ_u represent the weighting factors assigned to each corresponding loss term.

V. EXPERIMENT

We evaluate our method on four challenging ground-view scenes: one indoor and one outdoor scene each from the R3LIVE odometry dataset [16] and the Tanks and Temples dataset [14]. To demonstrate the performance of our method, we report rendering metrics including PSNR, SSIM [41], and LPIPS [42], which are widely used in neural rendering benchmarks [7], [5]. Our algorithm utilized no input point clouds in any of the scenes, while the compared methods used the LiDAR point cloud for R3LIVE and the SfM point cloud for the Tanks and Temples dataset.

For effective comparison and analysis, we categorized existing 3DGS variants into four types according to the characteristics of the regularization or information that they utilize: (1) Geometric methods include 2D Gaussian Splatting (2DGS) [30] and Gaussian Opacity Fields (GOF) [31]. To align the splats with the actual geometry, 2DGS constrains splats to be flat and GOF applies a depth distortion loss. (2) Sequential methods, including COLMAP-Free 3D Gaussian Splatting (CF-3DGS) [8] and MonoGS [21], process each frame sequentially to fully exploit local information and refine image poses. (3) Anchored methods, such as Scaffold-GS [24] and our approach, generate splats anchored to points with restricted movement in 3D space. Finally, (4) Baselines include the original 3DGS [5] and Mip-Splatting [43].

A. Rendering evaluation on the R3LIVE dataset

The R3LIVE dataset is a publicly available odometry dataset captured by a hand-held device with a 15 Hz camera, a 200 Hz Inertial Measurement Unit (IMU), and a 10 Hz Livox Avia LiDAR sensor. It includes diverse indoor and outdoor scenes from the campuses of HKU and HKUST. Unlike typical neural rendering datasets with object-centric views or structured viewing patterns [14], [11], [13], the R3LIVE dataset captures many complex indoor and outdoor structures with free trajectory patterns.

We process the IMU, LiDAR, and image data using the R3LIVE [16] multi-sensor odometry pipeline to generate pose-tagged image sequences. To synchronize pose estimation with image time stamps, we slightly modified the R3LIVE odometry implementation. However, it still inevitably introduces pixel-level errors due to sensor fusion and inaccurate extrinsic calibration. To avoid redundancy from high frame-rate images, we subsample the images by selecting every 10th frame from the dataset.

Fig. 4. Qualitative comparison on two scenes from the R3LIVE dataset, Main Building (hku_main_building) and Campus (hku_campus_seq_00), rendered by 3DGS, GOF, Scaffold-GS, and our method against ground truth (GT). Non-anchored methods, such as 3DGS [5] and GOF [31], exhibit significant splat drift in the absence of dense multi-view information in sparsely captured scenes. In contrast, both Scaffold-GS [24] and our method demonstrate robust performance due to their use of anchored splatting. Our approach delivers sharper and more accurate results, attributed to fast training from the direct initialization of splat attributes and dense, pixel-aligned anchor initialization from monocular depth estimation.
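For reference, PSNR, the primary metric reported in the tables, can be computed as below. This is the standard definition shown as a minimal sketch, not the authors' exact evaluation code; images are assumed to be float arrays in [0, 1], and `subsample` is a hypothetical helper mirroring the every-10th-frame selection described above.

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    # Peak Signal-to-Noise Ratio in dB between a rendered image and its
    # ground-truth counterpart; higher is better.
    mse = np.mean((np.asarray(img, dtype=np.float64) -
                   np.asarray(ref, dtype=np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def subsample(frames, step=10):
    # Keep every `step`-th frame to reduce redundancy in high-rate sequences.
    return frames[::step]
```

For example, a uniform error of 0.5 over the whole image gives an MSE of 0.25 and hence a PSNR of 10·log10(1/0.25) ≈ 6.02 dB.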
TABLE I
RESULTS ON THE R3LIVE DATASET

                                        Main Building            Campus
Method              Type        Input   PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓
3DGS [5]            Baseline    LiDAR   15.84  0.684  0.497      21.60  0.760  0.364
Mip-Splatting [43]  Baseline    LiDAR   13.58  0.630  0.499      15.89  0.679  0.449
CF-3DGS [8]         Sequential  Mono    –      –      –          –      –      –
MonoGS [21]         Sequential  Mono    11.60  0.486  0.557      16.06  0.600  0.527
GOF [31]            Geometric   LiDAR   15.53  0.676  0.502      19.78  0.734  0.478
2DGS [30]           Geometric   LiDAR   15.72  0.674  0.507      20.67  0.743  0.418
Scaffold-GS [24]    Anchored    LiDAR   17.01  0.697  0.495      20.94  0.756  0.419
Ours w/ mono        Anchored    Mono    17.27  0.703  0.470      22.98  0.774  0.365

TABLE II
RESULTS ON THE TANKS AND TEMPLES DATASET

                                        Courthouse               Meeting Room
Method              Type        Input   PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓
3DGS [5]            Baseline    SfM     15.92  0.689  0.396      16.86  0.639  0.435
Mip-Splatting [43]  Baseline    SfM     15.44  0.683  0.366      16.39  0.646  0.417
CF-3DGS [8]         Sequential  Mono    –      –      –          15.53  0.614  0.510
MonoGS [21]         Sequential  Mono    9.78   0.453  0.637      12.01  0.488  0.591
GOF [31]            Geometric   SfM     15.49  0.680  0.377      16.56  0.636  0.459
2DGS [30]           Geometric   SfM     16.57  0.705  0.382      17.20  0.636  0.455
Scaffold-GS [24]    Anchored    SfM     17.12  0.719  0.345      17.42  0.677  0.413
Ours w/ mono        Anchored    Mono    15.70  0.682  0.442      16.66  0.641  0.462

TABLE III
ABLATION OF EACH PROPOSED MODULE

depth cal.   res. MLP   PSNR ↑   SSIM ↑   LPIPS ↓
                        16.84    0.687    0.475
✓                       16.91    0.693    0.467
             ✓          17.03    0.688    0.465
✓            ✓          17.27    0.703    0.470

Fig. 5. (a) With the Residual-Form Gaussian Decoder (top), only the residual from the nominal color is estimated and trained by the decoder, allowing direct color initialization and fast training. The Direct-Form Gaussian Decoder (bottom) [24], [38] does not allow color initialization due to its on-the-fly decoding scheme. (b) Rendering performance (PSNR) ablation between the Direct-Form Color MLP and the Residual-Form Color MLP.
As shown in Table I, our method outperforms all state-of-the-art 3DGS variants in terms of rendering performance. Notably, all algorithms except the Anchored variants exhibit significantly lower performance in the Main Building scene, which presents considerable challenges due to the complexity of its narrow corridors and hallways. In both scenes, our algorithm achieves state-of-the-art rendering performance even without using initial LiDAR point clouds. Monocular-depth-based scene initialization delivers much better or comparable performance to LiDAR-based initialization in both scenes. This is largely due to the inherent incompleteness of LiDAR point clouds in such complex environments. Our results demonstrate that effectively integrating monocular depths, which directly generate pixel-aligned and complete point clouds, can be more beneficial in these challenging scenes.

B. Rendering evaluation on the Tanks and Temples dataset

We also validate our method on the Tanks and Temples dataset, a widely recognized benchmark for neural rendering evaluation. We select the Courthouse and Meeting Room scenes, as they are geometrically the most challenging in the benchmark [31]. Similar to the evaluation on the R3LIVE dataset, we subsample every 10th image. Instead of LiDAR point clouds, we directly use the SfM point cloud for the other variants.

Unlike the R3LIVE odometry dataset, the trajectories in this dataset follow object-centric or circular patterns, providing relatively dense observations of each part of the scene. For this reason, our method does not achieve the best performance. In these viewing patterns, the initial point cloud has less impact on 3DGS algorithms, as splat cloning and splitting are effectively guided by salient multi-view photometric gradients. It has even been shown that random initialization can yield plausible results in such cases [44]. As monocular depth usually contains inevitable internal distortion that cannot be corrected by a scale factor, anchoring on these depths can be detrimental when enough photometric information is available. Nonetheless, the best-performing algorithm in this scenario is the anchor-based Scaffold-GS [24], validating our analysis of the degeneracy patterns associated with different algorithm types (Fig. 2).

As shown in Table II, our method still shows comparable performance to the other 3DGS methods. In this sense, our method achieves a suitable balance between robustness to limited multi-view information and high performance on densely captured datasets, demonstrating the most stable performance across all the scenes from the R3LIVE and Tanks and Temples datasets.

C. Ablation Studies

We evaluated our depth calibration framework and our residual-form Gaussian MLP in Table III. As shown in the ablation results, each proposed module contributes to the overall improvement in rendering performance. Additionally, our residual-form Gaussian decoder enables fast initialization of Gaussian attributes, as illustrated in Fig. 5(a). For this ablation, we used a direct-form MLP that generates color attributes from features, similar to [24], [38]. Compared to this direct-form MLP decoder, our proposed decoder significantly accelerates the training process (Fig. 5(b)).

VI. CONCLUSIONS, LIMITATIONS, AND FUTURE WORK

In this paper, we presented Mode-GS, a novel 3DGS algorithm designed for robust neural rendering from ground-robot trajectory datasets. Our algorithm introduces a practical rendering pipeline for ground-view robot datasets, utilizing easily obtainable odometry poses and operating in a point-cloud-free setting. However, our approach is less effective in scenarios where extensive multi-view data is available, such as in densely captured, object-centric datasets. Future work will focus on developing a hybrid approach that integrates our method with non-anchored splats to achieve optimal performance.
REFERENCES

[1] M. Adamkiewicz, T. Chen, A. Caccavale, R. Gardner, P. Culbertson, J. Bohg, and M. Schwager, "Vision-Only Robot Navigation in a Neural Radiance World," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4606–4613, 2022.
[2] T. Chen, O. Shorinwa, J. Bruno, J. Yu, W. Zeng, K. Nagami, P. Dames, and M. Schwager, "Splat-Nav: Safe Real-Time Robot Navigation in Gaussian Splatting Maps," arXiv preprint arXiv:2403.02751, 2024.
[3] C. Maxey, J. Choi, H. Lee, D. Manocha, and H. Kwon, "UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception," arXiv preprint arXiv:2310.16255, 2023.
[4] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[5] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, "3D Gaussian Splatting for Real-Time Radiance Field Rendering," ACM Transactions on Graphics, vol. 42, no. 4, pp. 1–14, 2023.
[6] M. Zwicker, H. Pfister, J. Van Baar, and M. Gross, "EWA splatting," IEEE Transactions on Visualization and Computer Graphics, vol. 8, no. 3, pp. 223–238, 2002.
[7] P. Wang, Y. Liu, Z. Chen, L. Liu, Z. Liu, T. Komura, C. Theobalt, and W. Wang, "F2-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[8] Y. Fu, S. Liu, A. Kulkarni, J. Kautz, A. A. Efros, and X. Wang, "COLMAP-Free 3D Gaussian Splatting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[9] C. Liu, S. Chen, Y. Bhalgat, S. Hu, Z. Wang, M. Cheng, V. A. Prisacariu, and T. Braud, "GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting," arXiv preprint arXiv:2408.11085, 2024.
[10] L. Zhao, P. Wang, and P. Liu, "BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting," arXiv preprint arXiv:2403.11831, 2024.
[11] H. Turki, D. Ramanan, and M. Satyanarayanan, "Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[12] Y. Xiangli, L. Xu, X. Pan, N. Zhao, A. Rao, C. Theobalt, B. Dai, and D. Lin, "BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering," in Proceedings of the European Conference on Computer Vision (ECCV), 2022.
[13] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, "Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[14] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, "Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction," ACM Transactions on Graphics, vol. 36, no. 4, 2017.
[15] Y. Liao, J. Xie, and A. Geiger, "KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[16] J. Lin and F. Zhang, "R3LIVE: A Robust, Real-time, RGB-colored, LiDAR-Inertial-Visual tightly-coupled state Estimation and mapping package," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2022.
[17] J. L. Schonberger and J.-M. Frahm, "Structure-From-Motion Revisited," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] C. Campos, R. Elvira, J. J. Gómez, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM," IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
[19] T. Shan and B. Englot, "LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
[20] T. Qin, P. Li, and S. Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[21] H. Matsuki, R. Murai, P. H. Kelly, and A. J. Davison, "Gaussian Splatting SLAM," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[22] N. Keetha, J. Karhade, K. M. Jatavallabhula, G. Yang, S. Scherer, D. Ramanan, and J. Luiten, "SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[23] C. Yan, D. Qu, D. Xu, B. Zhao, Z. Wang, D. Wang, and X. Li, "GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[24] T. Lu, M. Yu, L. Xu, Y. Xiangli, L. Wang, D. Lin, and B. Dai, "Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[25] Q. Liu, H. Xin, Z. Liu, and H. Wang, "Integrating Neural Radiance Fields End-to-End for Cognitive Visuomotor Navigation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[26] L. Meyer, F. Erich, Y. Yoshiyasu, M. Stamminger, N. Ando, and Y. Domae, "PEGASUS: Physically Enhanced Gaussian Splatting Simulation System for 6DOF Object Pose Dataset Generation," arXiv preprint arXiv:2401.02281, 2024.
[27] V. Patil and M. Hutter, "Radiance Fields for Robotic Teleoperation," arXiv preprint arXiv:2407.20194, 2024.
[28] A. Guédon and V. Lepetit, "SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 5354–5363.
[29] H. Chen, C. Li, and G. H. Lee, "NeuSG: Neural Implicit Surface Reconstruction with 3D Gaussian Splatting Guidance," arXiv preprint arXiv:2312.00846, 2023.
[30] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, "2D Gaussian Splatting for Geometrically Accurate Radiance Fields," in ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
[31] Z. Yu, T. Sattler, and A. Geiger, "Gaussian Opacity Fields: Efficient Adaptive Surface Reconstruction in Unbounded Scenes," arXiv preprint arXiv:2404.10772, 2024.
[32] B. Zhang, C. Fang, R. Shrestha, Y. Liang, X. Long, and P. Tan, "RaDe-GS: Rasterizing Depth in Gaussian Splatting," arXiv preprint arXiv:2406.01467, 2024.
[33] M. Turkulainen, X. Ren, I. Melekhov, O. Seiskari, E. Rahtu, and J. Kannala, "DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing," arXiv preprint arXiv:2403.17822, 2024.
[34] K. Cheng, X. Long, K. Yang, Y. Yao, W. Yin, Y. Ma, W. Wang, and X. Chen, "GaussianPro: 3D Gaussian Splatting with Progressive Propagation," in Forty-first International Conference on Machine Learning, 2024.
[35] Z. Li, S. Yao, Y. Chu, A. F. Garcia-Fernandez, Y. Yue, E. G. Lim, and X. Zhu, "MVG-Splatting: Multi-View Guided Gaussian Splatting with Adaptive Quantile-Based Geometric Consistency Densification," arXiv preprint arXiv:2407.11840, 2024.
[36] Z. Fan, W. Cong, K. Wen, K. Wang, J. Zhang, X. Ding, D. Xu, B. Ivanovic, M. Pavone, G. Pavlakos, Z. Wang, and Y. Wang, "InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds," arXiv preprint arXiv:2403.20309, 2024.
[37] W. Yin, C. Zhang, H. Chen, Z. Cai, G. Yu, K. Wang, X. Chen, and C. Shen, "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[38] E. Ververas, R. A. Potamias, J. Song, J. Deng, and S. Zafeiriou, "SAGS: Structure-Aware 3D Gaussian Splatting," arXiv preprint arXiv:2404.19149, 2024.
[39] S. Lombardi, T. Simon, G. Schwartz, M. Zollhoefer, Y. Sheikh, and J. Saragih, "Mixture of Volumetric Primitives for Efficient Neural Rendering," ACM Transactions on Graphics, vol. 40, no. 4, 2021. [Online]. Available: [Link]
[40] T. Xie, Z. Zong, Y. Qiu, X. Li, Y. Feng, Y. Yang, and C. Jiang, "PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics," arXiv preprint arXiv:2311.12198, 2023.
[41] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[42] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[43] Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger, "Mip-Splatting: Alias-free 3D Gaussian Splatting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[44] J. Jung, J. Han, H. An, J. Kang, S. Park, and S. Kim, "Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting," arXiv preprint arXiv:2403.09413, 2024.