
A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments

Yaoqing Li
1 Shenzhen University, 2 SFMAP Technology
Shenzhen, China
[email protected]

Sheng-Hua Zhong*
Shenzhen University
Shenzhen, China
[email protected]

Shuai Li
1 University of Oulu, 2 VTT-Technology Research Center of Finland
Oulu, Finland
[email protected]

Yan Liu
The Hong Kong Polytechnic University
Hong Kong, China
[email protected]

* Corresponding author.
ABSTRACT
Simultaneous Localization and Mapping (SLAM) has developed as a fundamental method for intelligent robot perception over the past decades. Most of the existing feature-based SLAM systems relied on traditional hand-crafted visual features and a strong static world assumption, which makes these systems vulnerable in complex dynamic environments. In this paper, we propose a robust monocular SLAM system by combining geometry-based methods with two convolutional neural networks. Specifically, a lightweight deep local feature detection network is proposed as the system front-end, which can efficiently generate keypoints and binary descriptors robust against variations in illumination and viewpoint. Besides, we propose a motion segmentation and depth estimation network for simultaneously predicting a pixel-wise motion object segmentation and depth map, so that our system can easily discard dynamic features and reconstruct 3D maps without dynamic objects. The comparison against state-of-the-art methods on publicly available datasets shows the effectiveness of our system in highly dynamic environments.

CCS CONCEPTS
• Computing methodologies → Computer vision; Image segmentation; Vision for robotics.

KEYWORDS
monocular SLAM, deep local feature, motion segmentation

ACM Reference Format:
Yaoqing Li, Sheng-Hua Zhong, Shuai Li, and Yan Liu. 2023. A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments. In International Conference on Multimedia Retrieval (ICMR '23), June 12–15, 2023, Thessaloniki, Greece. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3591106.3592295

ICMR '23, June 12–15, 2023, Thessaloniki, Greece
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0178-8/23/06
https://doi.org/10.1145/3591106.3592295

1 INTRODUCTION
Simultaneous Localization and Mapping (SLAM) plays an important role in the intelligent robot field: it constructs maps of unknown environments by leveraging the information collected from embedded sensors, and simultaneously estimates the position and pose of the robot. In the past decades, thanks to the low cost of cameras and the development of computer vision technologies, many feature-based visual SLAM approaches like ORB-SLAM2 [21] have been developed and achieved satisfactory performance in static environments. However, most of these SLAM systems were based on hand-crafted features, such as SIFT [18] and ORB [23], which may fail to provide consistent feature detection and association results in complex environments with changes in illumination and viewpoint. In addition, most existing SLAM systems relied on the scene rigidity assumption, also known as the static world assumption, yet moving objects are unavoidable in real-world environments and can reduce the accuracy of camera pose estimation and impair the building of maps.

To address these issues, there is a trend in SLAM to investigate deep learning enhanced methods. Some recent works [16, 19] employed local features detected by convolutional neural networks (CNNs) as a substitute for traditional hand-crafted features and demonstrated that SLAM systems based on deep local features achieve better performance in various scenes, including challenging environments with significant illumination or viewpoint changes. However, the computation cost of matching the float descriptors of these methods is high. Moreover, due to the static world assumption, the accuracy of these methods in highly dynamic scenes is still unacceptable. To eliminate the influence of moving objects, many researchers used existing semantic segmentation networks, such as Mask R-CNN [14] and SegNet [1], to detect semantically moving objects in images, and then discarded the feature points lying on moving objects before pose estimation. However, a trained semantic segmentation network cannot detect dynamic objects that are not labeled as dynamic in its training dataset. In a complex real-world environment, semantically static objects can be moved by other dynamic objects. Therefore, relying on existing semantic segmentation networks is unreliable, since they use only semantic prior information to determine whether an object is moving.


To this end, we direct our efforts to improving the robustness of features through a deep local feature detection network and to discarding dynamic features through a motion segmentation network. In summary, the main contributions of this paper are as follows:

• We introduce a deep feature detection network (DFNet) to detect robust keypoints and descriptors for SLAM. DFNet has a simplified CNN structure and a binary descriptor activation layer, which allows our system to detect and match features effectively and efficiently.
• Instead of using existing semantic segmentation networks, we propose a novel network (MDNet) to segment dynamic objects. This network can efficiently extract the appearance and motion information of image sequences and accurately segment the most discriminative dynamic objects by using an attention transfer module.
• Based on the proposed networks, we implement a robust monocular SLAM system and achieve better results when compared to state-of-the-art SLAM systems in highly dynamic environments.

2 RELATED WORK

2.1 Deep Local Feature
With the progress of deep learning, many deep local features have been proposed and have been proved to outperform hand-crafted local features in challenging environments with illumination or viewpoint changes. Yi et al. [32] proposed a convolutional replacement for SIFT, called LIFT, which combines keypoint detection, orientation estimation, and descriptor calculation modules, but requires additional supervision from a classical Structure from Motion system. DeTone et al. [8] designed a self-supervised domain adaptation framework called Homographic Adaptation and used it to train a fully convolutional neural network for local feature detection (SuperPoint). To avoid ambiguous areas and obtain reliable local features, R2D2 [22] proposed to jointly learn keypoint detection and description together with a confidence for keypoint reliability. After feature detection and description, the above-mentioned local features need a further nearest neighbor search to find matches between two images. Sun et al. [26] proposed LoFTR, which can obtain matching features directly by using self and cross attention layers [11, 27, 30] in a transformer.

2.2 Deep Learning Enhanced SLAM
With high accuracy and robustness, deep features have also been proposed to substitute hand-crafted feature descriptors in traditional SLAM. Ma et al. [19] proposed a novel deep local feature descriptor (ASD) and designed a SLAM system named ASD-SLAM based on it. Similarly, Li et al. [16] incorporated HF-Net into ORB-SLAM2 to extract both local and global features. In addition to replacing hand-crafted features with deep features, a number of works [5, 7, 28, 29] focused on how deep learning-based depth estimation can be deployed for accurate and dense reconstruction in monocular systems. CNN-SLAM [28] and CodeSLAM [5] are among the first successful systems that integrate CNN-based depth predictions into a geometric SLAM pipeline. Building on CodeSLAM, DeepFactors [7] achieved better performance by integrating learned priors with different classical SLAM formulations in a factor-graph probabilistic framework. DROID-SLAM [29] outperformed prior works by performing recurrent iterative optimization of camera poses and depth maps through a dense BA layer, but its computation cost is high. Experimental results demonstrated the effectiveness of the above-mentioned methods in challenging environments with illumination or viewpoint changes. However, under the interference of dynamic objects, their performance usually drops significantly.

Therefore, it is essential to remove the unreliable features of dynamic objects. Some previous work addressed this issue by using semantic information. Zhong et al. [36] applied the object detection network SSD [17] to detect dynamic objects; the features of semantically dynamic objects such as people or cars were then considered dynamic and discarded. Yu et al. [33] utilized the semantic segmentation network SegNet [1] to obtain prior knowledge about dynamic objects, then discarded the unreliable features on dynamic objects. Similarly, Bescos et al. [3] proposed DynaSLAM, which leverages Mask R-CNN [14] to remove potential moving objects and then reconstructs the static background via the ORB-SLAM2 [21] framework. These approaches performed well in many dynamic environments. However, because semantic-based methods can only segment the dynamic objects labeled in the training datasets, their robustness decreases when the environment contains unknown dynamic objects. Moreover, they also fail to segment dynamic objects that are semantically static, such as chairs, laptops, and boxes that can be moved by people (see Fig. 3).

3 PROPOSED METHOD
To achieve accurate and robust performance in highly dynamic environments, we propose two convolutional neural networks and integrate them into the traditional SLAM system. On the one hand, to deal with illumination or viewpoint changes, the hand-crafted feature extractor in the original SLAM system is replaced by a deep local feature detection network, which can generate keypoints and binary descriptors that are robust to environmental variations. On the other hand, a motion segmentation and depth estimation network is proposed to simultaneously generate a pixel-wise motion segmentation mask and depth map, so that our system can easily discard dynamic features and reconstruct static maps. The proposed networks as well as the workflow of our SLAM system are described in detail as follows.

3.1 Learning-based Local Feature Extraction
Network Structure. The structure of our deep local feature detection network (DFNet) is shown in Fig. 1 (a). Similar to many existing local feature detection networks [8, 19], DFNet has a shared encoder and two decoders that process a gray image, yielding a probability map of size h × w × 1 for keypoint confidences as well as a dense feature map of size h × w × 256 for descriptors. Sparse keypoints are then obtained by performing non-maximum suppression (NMS) over the probability map, and their descriptors are sampled from the feature map by bilinear sampling.
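For concreteness, the keypoint selection and descriptor sampling described above can be sketched in a few lines of PyTorch; the confidence threshold, NMS radius, and function name below are illustrative assumptions rather than the exact values used in DFNet:

```python
import torch
import torch.nn.functional as F

def extract_keypoints(prob_map, desc_map, conf_thresh=0.015, nms_radius=4, max_kpts=1000):
    """Select sparse keypoints from an h x w probability map and sample their
    descriptors from an h x w x 256 feature map (illustrative sketch)."""
    # prob_map: (1, 1, h, w), desc_map: (1, 256, h, w)
    # Non-maximum suppression: keep only local maxima inside the NMS window.
    pooled = F.max_pool2d(prob_map, kernel_size=2 * nms_radius + 1,
                          stride=1, padding=nms_radius)
    keep = (prob_map == pooled) & (prob_map > conf_thresh)
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    scores = prob_map[0, 0, ys, xs]
    order = scores.argsort(descending=True)[:max_kpts]
    ys, xs = ys[order], xs[order]

    # Bilinear sampling of the dense feature map at the keypoint locations.
    h, w = prob_map.shape[-2:]
    grid = torch.stack([xs.float() / (w - 1) * 2 - 1,
                        ys.float() / (h - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
    desc = F.grid_sample(desc_map, grid, mode='bilinear', align_corners=True)
    desc = F.normalize(desc.view(256, -1), dim=0).t()   # (N, 256) float descriptors
    return torch.stack([xs, ys], dim=-1), scores[order], desc
```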


Figure 1: Pipeline of (a) the deep local feature detection network (DFNet) and (b) the motion segmentation and depth estimation network (MDNet). (c) Illustration of the motion attention module (MAM), which leverages two spatial attention modules [31] to extract features of discriminative motion areas between two adjacent frames. (d) Illustration of the refining module (RM). All refining modules of MDNet share a similar structure, which receives the outputs from the corresponding residual block and the previous refining module, and then produces the feature maps of doubled resolution.

Binary Descriptor. Binary descriptors are more efficient than float descriptors at feature matching. Thus, different from existing works [16, 19] that directly apply the float descriptors from the network to estimate pose, we develop our network to learn binary descriptors. To this end, we perform binarization on each descriptor after bilinear sampling. The final binary descriptor is expressed as:

d(x) = (sign(f(x)) + 1) / 2    (1)

where x is the pixel coordinate of a keypoint, f(·) represents the original float descriptor, and sign(f(x)) = 1 if f(x) > 0 and -1 otherwise. Because the binarization operation is not differentiable, we leverage a straight-through estimator algorithm [6] to back-propagate the gradients.
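A minimal PyTorch sketch of such a binarization layer with a straight-through estimator is given below; the gradient clipping window follows the common practice of [6] and is an assumption, and this is not the released DFNet code:

```python
import torch

class BinaryActivation(torch.autograd.Function):
    """d(x) = (sign(f(x)) + 1) / 2 as in Eq. (1), with a straight-through gradient."""

    @staticmethod
    def forward(ctx, f):
        ctx.save_for_backward(f)
        # (sign(f) + 1) / 2 under the paper's convention sign(f) = 1 if f > 0 else -1.
        return (f > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through unchanged,
        # optionally clipped to the region where the pre-activation is small.
        (f,) = ctx.saved_tensors
        return grad_output * (f.abs() <= 1.0).float()

def binarize(float_desc):
    return BinaryActivation.apply(float_desc)
```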
Loss Function for Feature Detection. We adopt a weighted combination of the cross-entropy loss and the triplet margin loss to jointly optimize the keypoint detector and descriptor. The final loss function for DFNet is similar to [8]; the difference lies in the calculation of the distance of the negative samples. To allow the loss to converge more efficiently, instead of calculating distances between all non-matching points, we only consider the hardest negative points that lie outside of a square local neighborhood of the matching point:

x_n = argmin_{x_n ∈ I_t} ||d_t(x_n) − d_c(x_c)||_2   s.t.  ||x_n − x_t||_2 > k    (2)

where d_c and d_t are the predicted descriptors of the current frame and the target frame, respectively, and k is a threshold used to determine whether a pair of points is matching. Here, x_c denotes the coordinate of a keypoint in the current frame, x_t is the coordinate of the matching point of x_c, and x_n is the coordinate of the hardest negative point in the target frame.
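A rough sketch of this hardest-negative search, assuming the candidate descriptors and their pixel coordinates in the target frame are given, could look as follows (tensor shapes and the use of torch.cdist are illustrative):

```python
import torch

def hardest_negatives(desc_c, x_t, desc_t, cand_xy, k=8):
    """For each matched pair (x_c, x_t), pick the hardest negative in the target
    frame whose position lies farther than k pixels from x_t (Eq. (2))."""
    # desc_c:  (N, D) descriptors of keypoints in the current frame
    # x_t:     (N, 2) coordinates of their true matches in the target frame
    # desc_t:  (M, D) candidate descriptors sampled from the target frame
    # cand_xy: (M, 2) pixel coordinates of those candidates
    dist = torch.cdist(desc_c, desc_t)               # (N, M) descriptor distances
    sep = torch.cdist(x_t.float(), cand_xy.float())  # (N, M) pixel distances
    # Candidates inside the local neighborhood of the true match are not valid negatives.
    dist = dist.masked_fill(sep <= k, float('inf'))
    neg_dist, neg_idx = dist.min(dim=1)              # hardest negative per keypoint
    return neg_idx, neg_dist
```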


3.2 Motion Segmentation and Depth Estimation
Network Structure. Based on the typical encoder-decoder architecture, we construct a novel network named MDNet, which can generate a motion segmentation mask and a single-view depth map for the current frame. The overview of MDNet is shown in Fig. 1 (b). For motion segmentation, our goal is to segment moving objects by using the appearance information obtained from two consecutive frames instead of the semantic prior information [3, 33], so that we can remove features from dynamic objects accurately. Besides, to build a joint training framework and allow the monocular SLAM system to reconstruct dense maps effectively, our network is supposed to predict depth maps while performing motion segmentation.

We employ a ResNet18 [15] as the encoder to extract high-level appearance representations of the current frame and the reference frame. Features of discriminative motion areas can then be obtained by using a motion attention module (see Fig. 1 (c)). This module takes the appearance representations of the current frame F_C and the appearance representations of the last frame F_L as input, and outputs a motion attention feature F_M. Given F_M, the motion segmentation decoder can produce a probability map, where each value represents the dynamic probability of the corresponding pixel. At the same time, the appearance representations can be fed into a depth estimation decoder to generate a depth map. Both the motion segmentation decoder and the depth estimation decoder are mainly composed of four refining modules (see Fig. 1 (d)). All refining modules share a similar structure, which receives the outputs from the corresponding residual block S_i and the previous refining module R_i, then produces the feature maps of doubled resolution for the next module R_{i+1}.
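As a rough illustration of this refining-module idea (upsample the previous refinement, fuse it with the encoder skip features, and produce higher-resolution features), a PyTorch sketch might look as follows; the exact channel widths and layer counts of MDNet follow Fig. 1 (d) and are not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefiningModule(nn.Module):
    """Sketch of a refining module R_{i+1}: takes the skip features S_i from the
    encoder and the previous refinement R_i, and outputs feature maps at the
    (doubled) resolution of the skip features. Channel sizes are illustrative."""

    def __init__(self, skip_ch, prev_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(skip_ch + prev_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, skip, prev):
        # Upsample the previous refinement to the resolution of the skip features.
        prev = F.interpolate(prev, size=skip.shape[-2:], mode='bilinear',
                             align_corners=False)
        return self.fuse(torch.cat([skip, prev], dim=1))
```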

Figure 2: Overview of our SLAM system. The proposed DFNet and MDNet are used to substitute the front-end of ORB-SLAM2. Added or modified modules based on our networks are shown in blue.

Loss Function for Motion Segmentation. Motion segmentation can be regarded as a binary classification problem, so the network can learn to detect dynamic objects by using the cross-entropy loss:

L_m = L_ce(M, G)    (3)

where L_ce indicates the classic cross-entropy loss, M is the predicted segmentation mask, and G is the ground-truth binary mask. Since there is no large video dataset with motion segmentation labels, inspired by [34], we propose to generate the ground-truth binary mask by employing the optical flow residual. Specifically, for two adjacent frames I_c and I_t, the optical flow is calculated as follows:

F_of(x_c) = x_t − x_c    (4)

where x_c and x_t represent the coordinates of a pair of matching pixels in frames I_c and I_t, respectively. Given the ground-truth pose T and depth map D_c, the camera ego-motion flow is computed as follows:

F_ef(x_c) = σ(T σ^{-1}(x_c, D_c(x_c))) − x_c    (5)

where the function σ(·): R^3 → R^2 projects 3D points onto the image plane, and D_c(·) represents the depth of a given coordinate. The optical flow residual of the pixel x_c can then be defined as follows:

R(x_c) = F_of(x_c) − F_ef(x_c)    (6)

According to the above equations, the optical flow residuals of dynamic objects are high, because their optical flows come not only from the motion of the camera but also from the motion of the dynamic objects themselves, while the optical flow residuals of static objects are close to zero. Therefore, pseudo ground-truth masks for motion segmentation training can be generated by thresholding the optical flow residuals.
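A compact sketch of this pseudo-label generation (Eqs. (4)–(6)) is shown below, assuming a known intrinsic matrix K, the ground-truth pose T from the current to the target frame, the depth map D_c, and a precomputed optical flow field (e.g., from PWC-Net); the residual threshold is an illustrative value:

```python
import torch

def motion_pseudo_mask(flow, depth_c, T, K, thresh=1.0):
    """Compute the residual between the observed optical flow and the camera
    ego-motion flow (Eqs. (4)-(6)) and threshold it into a pseudo motion mask."""
    # flow:    (2, h, w) observed optical flow F_of from an optical flow network
    # depth_c: (h, w)    depth map D_c of the current frame
    # T:       (4, 4)    relative pose from the current to the target frame
    # K:       (3, 3)    camera intrinsics
    h, w = depth_c.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()       # (3, h, w)

    # Back-project each pixel to 3D with its depth (sigma^{-1} in Eq. (5)).
    cam = torch.linalg.inv(K) @ pix.reshape(3, -1) * depth_c.reshape(1, -1)
    cam = torch.cat([cam, torch.ones(1, cam.shape[1])], dim=0)            # homogeneous
    # Apply the camera motion T and project back to the image plane (sigma).
    proj = K @ (T @ cam)[:3]
    proj = proj[:2] / proj[2:].clamp(min=1e-6)

    ego_flow = proj.reshape(2, h, w) - pix[:2]        # F_ef, Eq. (5)
    residual = (flow - ego_flow).norm(dim=0)          # ||R(x_c)||, Eq. (6)
    return (residual > thresh).float()                # pseudo ground-truth mask
```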
Loss Function for Depth Estimation. We train the depth estimation model in a self-supervised manner, with the guidance of the reprojection errors and the triangulated 3D structure. The loss function for depth estimation is formulated as follows:

L_d = w_df L_df + w_ds L_ds + w_dp L_dp + w_dt L_dt    (7)

where L_df and L_ds are the widely used photometric loss and depth smoothness loss, respectively [4], L_dp is the depth reprojection error between two frames, and L_dt is the error between the triangulated depth and the depth generated by the network. In this work, we follow the designs of L_df and L_ds in [4]; L_dp and L_dt are introduced as follows.

The calculation of L_dp is based on the relative pose and depth predictions of two adjacent frames. To recover the relative pose, we first detect reliable local features by employing the pre-trained DFNet. Then we solve the fundamental matrix via the eight-point algorithm [12] and RANSAC [9] to obtain the relative camera pose. Given the recovered relative pose and the depth predictions D_c and D_t from MDNet, the depth reprojection error is calculated as:

L_dp = (1/N_1) Σ_{x_c ∈ K_1} M(x_c) | 1 − D_c^t(W(x_c)) / D_t^b(W(x_c)) |    (8)

where K_1 defines the pixel coordinates in the current frame I_c, N_1 denotes the number of pixel coordinates in K_1, W(x_c) denotes the reprojected pixel coordinate of x_c, and D_c^t is the reprojected depth map obtained by warping D_c with the recovered pose [4]. Because W(x_c) is not aligned to the pixel grid, we need to use the depth map D_t^b bilinearly interpolated from D_t. M stands for the motion segmentation mask, which allows the gradients calculated on moving objects to carry less weight.
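For illustration, a simplified re-implementation of this reprojection term (Eq. (8)) in the spirit of [4] is sketched below; it assumes known intrinsics K, the recovered relative pose T from the current to the target frame, and a per-pixel weight map derived from the motion segmentation mask, and it is not the authors' code:

```python
import torch
import torch.nn.functional as F

def depth_reprojection_loss(depth_c, depth_t, T, K, weight, eps=1e-6):
    """L_dp of Eq. (8): warp the current depth into the target view and compare
    it with the bilinearly interpolated target depth prediction."""
    # depth_c, depth_t, weight: (1, 1, h, w); T: (4, 4); K: (3, 3)
    _, _, h, w = depth_c.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)

    # Back-project, transform into the target frame, and re-project.
    cam = torch.linalg.inv(K) @ pix * depth_c.reshape(1, -1)
    cam = torch.cat([cam, torch.ones(1, cam.shape[1])], dim=0)
    cam_t = (T @ cam)[:3]
    depth_c_warped = cam_t[2].reshape(1, 1, h, w)           # D_c^t seen from the target view
    proj = K @ cam_t
    uv = proj[:2] / proj[2:].clamp(min=eps)                 # W(x_c): reprojected coordinates

    # Bilinearly interpolate the target depth prediction at W(x_c) -> D_t^b.
    grid = torch.stack([uv[0] / (w - 1) * 2 - 1,
                        uv[1] / (h - 1) * 2 - 1], dim=-1).view(1, h, w, 2)
    depth_t_interp = F.grid_sample(depth_t, grid, mode='bilinear',
                                   padding_mode='border', align_corners=True)

    ratio = depth_c_warped / depth_t_interp.clamp(min=eps)
    return (weight * (1.0 - ratio).abs()).mean()            # Eq. (8)
```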


In addition, as proposed in [35], the depths obtained from triangulation are useful for supervising the learning of depth estimation. We use the recovered pose and the matching keypoints to perform midpoint triangulation [13], so that we can obtain the triangulated depth D_tri. The error L_dt between the triangulated depth D_tri and the predicted depth D_c is then calculated as:

L_dt = (1/N_2) Σ_{p_c ∈ K_2} M(p_c) ((D_tri(p_c) − s · D_c(p_c)) / D_tri(p_c))^2    (9)

where the predicted depth D_c of the current frame is aligned to the triangulated depth with a scale factor s, which explicitly aligns the predicted depth to the triangulated structure. K_2 denotes the coordinates of the matching keypoints in the current frame, and N_2 is the number of coordinates in K_2.
By combining deep learning with traditional two-view geometry, we construct a unified framework for joint optimization of the proposed MDNet, where motion segmentation and depth estimation can benefit each other. The final loss function can be formulated as a weighted combination of the losses defined in Eq. (3) and Eq. (7):

L = μ L_m + λ L_d    (10)

where μ and λ are loss weightings for motion segmentation and depth estimation, respectively.

3.3 Deep Learning Enhanced Monocular SLAM System
As mentioned before, the proposed networks can generate robust static features and depth maps for tracking and mapping, which can substitute the feature extraction modules of any type of feature-based monocular SLAM. In this work, we provide an example of employing the monocular version of ORB-SLAM2 as the backbone to implement a robust monocular SLAM system for dynamic environments, as this system has been well explored and the feature originally used by it has the same structure as our deep feature, which makes it easier to incorporate our networks into this system. The framework of our SLAM system is shown in Fig. 2.

At the front-end of the system, DFNet is used to generate keypoints and binary descriptors for each frame. Meanwhile, the appearance representations of the current frame are extracted by the encoder of MDNet, and these representations are then fed together with the appearance representations of the last frame into the motion segmentation decoder to generate a motion segmentation mask. Through a depth estimation decoder, MDNet can also generate a dense depth map for each frame. In the tracking thread and the mapping thread, the local features belonging to moving objects are first filtered by the predicted mask, as sketched below. Only reliable static features are reserved to perform pose estimation, relocalization, and mapping. Once the pose estimation of the camera has been done, we can reconstruct static dense maps with the predicted motion segmentation masks and depth estimates using the methods described in [20].
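As an illustration of this filtering step, a minimal sketch is given below; the keypoint container and the dynamic-probability threshold are assumptions, not the actual implementation:

```python
import numpy as np

def filter_dynamic_keypoints(keypoints, descriptors, motion_mask, dyn_thresh=0.5):
    """Keep only keypoints whose location falls on a pixel predicted as static."""
    # keypoints:   (N, 2) array of (x, y) pixel coordinates
    # descriptors: (N, 32) packed binary descriptors
    # motion_mask: (h, w) dynamic-probability map from MDNet
    xs = np.clip(keypoints[:, 0].round().astype(int), 0, motion_mask.shape[1] - 1)
    ys = np.clip(keypoints[:, 1].round().astype(int), 0, motion_mask.shape[0] - 1)
    static = motion_mask[ys, xs] < dyn_thresh
    return keypoints[static], descriptors[static]
```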
4 EXPERIMENTS

4.1 Training Strategy and Implementation Details
Similar to [4, 35], we train the proposed networks on the KITTI dataset [10], which is widely used for training monocular depth estimation networks and evaluating visual SLAM systems. KITTI contains 11 sequences, while only sequences 00-06 are used for training in this work. The whole training schedule can be divided into three stages. Firstly, we train DFNet in the self-supervised manner described in [8]. Then we only train the encoder of MDNet and the motion segmentation decoder for 15 epochs. Finally, we jointly train the motion segmentation and depth estimation networks for 20 epochs. To generate pseudo-ground-truth motion segmentation labels using the optical flow residual, we employ PWCNet [25], an excellent optical flow estimation network, to generate optical flow. In our implementation, an Adam optimizer with a learning rate of 0.0001 is used to train our networks. All training samples have a fixed resolution (832 × 256), and the batch size, momentum, and weight decay are set to 16, 0.9, and 0.0005, respectively. The threshold k is 8. The loss weightings are [w_ds, w_dt, w_dp] = [0.001, 1.0, 0.1] and [μ, λ] = [0.2, 1.0]. Our networks are trained using Python and the PyTorch framework. After training, we use libtorch, a PyTorch C++ API, to implement our networks. All experiments are performed on a computer with a 2.9 GHz Intel(R) Xeon(R) CPU and an Nvidia Tesla V100 GPU.

Table 1: Feature evaluation on the HPatches dataset. The best results are highlighted in bold, while the second best results are underlined.

Method     | Repeatability (detector) | NN mAP (descriptor) | Homography est. accuracy e=1px | e=3px | e=5px
ORB        | 0.641 | 0.735 | 0.150 | 0.395 | 0.538
SIFT       | 0.495 | 0.694 | 0.424 | 0.676 | 0.759
LIFT       | 0.449 | 0.664 | 0.284 | 0.598 | 0.717
SuperPoint | 0.581 | 0.821 | 0.310 | 0.684 | 0.829
LoFTR      | 0.670 | -     | 0.498 | 0.597 | 0.626
Ours       | 0.612 | 0.813 | 0.450 | 0.770 | 0.832

Table 2: Efficiency comparison of our proposed networks with some widely used methods.

Task                          | Method                 | Time CPU (ms/frame) | Time GPU (ms/frame) | FLOPs   | Parameters
Feature Extraction & Matching | ORB                    | 23.6                | -                   | -       | -
Feature Extraction & Matching | SuperPoint             | 233.6               | 78.6                | 18.2 G  | 1.3 M
Feature Extraction & Matching | LoFTR                  | 532.5               | 32.6                | 103.1 G | 5.9 M
Feature Extraction & Matching | DFNet (Ours)           | 153.3               | 24.9                | 11.3 G  | 1.3 M
Motion Segmentation           | Mask R-CNN (DynaSLAM)  | 967.5               | 67.3                | 123.3 G | 63.2 M
Motion Segmentation           | MDNet (Ours)           | 85.7                | 8.1                 | 11.5 G  | 20.3 M

4.2 Datasets
HPatches dataset [2] includes 116 challenging scenes with 696 images. The first 57 scenes contain significant lighting changes, and the remaining 59 scenes have obvious viewpoint changes.
KITTI dataset [10] contains 11 labeled sequences captured in urban and highway environments with many moving objects like pedestrians and vehicles. Besides, the high moving speed, fast camera rotation, and repeated visual elements such as similar walls and bushes pose high challenges to monocular SLAM algorithms.
TUM RGB-D dataset [24] contains 39 sequences obtained from different indoor scenes, which can be divided into three categories: static, low dynamic, and high dynamic.


Table 3: ATE [m] of different methods on the KITTI and TUM RGB-D datasets.

Dataset | Sequence  | ORB-SLAM2 (Monocular) | DynaSLAM (Monocular) | ASD-SLAM | DROID-SLAM | Ours
KITTI   | 07        | 2.74                  | 2.82                 | 1.59     | 16.28      | 1.63
KITTI   | 08        | 58.47                 | 52.93                | 52.40    | 67.69      | 32.13
KITTI   | 09        | 50.71                 | 46.02                | 7.17     | 78.66      | 6.48
KITTI   | 10        | 8.67                  | 6.97                 | 7.15     | 18.21      | 5.36
TUM     | fr3_s_hal | 0.015                 | 0.018                | -        | 0.014      | 0.019
TUM     | fr3_s_rpy | 0.019                 | 0.013                | -        | 0.022      | 0.024
TUM     | fr3_w_hal | 0.019                 | 0.017                | -        | 0.027      | 0.016
TUM     | fr3_w_rpy | 0.075                 | 0.035                | -        | 0.039      | 0.028
Figure 3: The motion segmentation results of our method and DynaSLAM (monocular) [3] based on Mask R-CNN (panels, left to right: Reference Frame, Current Frame, DynaSLAM, Ours). Our method can accurately segment moving objects, while DynaSLAM fails to segment the dynamic objects that are semantically static, such as the chair, laptop, and box.

Table 4: ATE [m] and percentage of successfully tracked trajectory for several variants of our system on the TUM RGB-D dataset. '*' means initialization failure or tracking failure. Columns fr3_s_hal and fr3_s_rpy are low dynamic; fr3_w_hal, fr3_w_rpy, and fr3_w_xyz are high dynamic.

ORB-SLAM2 | DF | MASK | Metric  | fr3_s_hal | fr3_s_rpy | fr3_w_hal | fr3_w_rpy | fr3_w_xyz
√         |    |      | ATE (m) | 0.0152    | 0.0185    | 0.0191    | 0.0747    | *
√         |    |      | % Traj. | 94.98     | 53.93     | 82.97     | 77.80     | 0.00
√         | √  |      | ATE (m) | 0.0144    | 0.0208    | 0.0240    | 0.0577    | 0.0194
√         | √  |      | % Traj. | 91.67     | 57.14     | 77.93     | 80.55     | 78.92
√         |    | √    | ATE (m) | 0.0203    | 0.0273    | 0.0171    | 0.0305    | 0.0170
√         |    | √    | % Traj. | 97.86     | 99.15     | 97.20     | 86.31     | 84.96
√         | √  | √    | ATE (m) | 0.0189    | 0.0241    | 0.0162    | 0.0284    | 0.0132
√         | √  | √    | % Traj. | 98.11     | 99.34     | 98.93     | 82.52     | 88.17

Figure 4: Ground truth and trajectories estimated by ORB-SLAM2 (Monocular) (left) and the proposed SLAM system (right) on sequences 09 (a) and 10 (b) of the KITTI dataset. The absolute trajectory errors (ATE) are mapped onto the corresponding trajectories.

4.3 Feature Evaluation
To evaluate the performance of our deep feature (DF), we compare it with widely used hand-crafted features (SIFT [18], ORB [23]) and other state-of-the-art deep features, including LIFT [32], SuperPoint [8], and LoFTR [26], on the HPatches dataset. We follow the evaluation strategy proposed by [8] and use nearest neighbor mAP (NN mAP), repeatability, and homography estimation accuracy under different correctness distances (e = 1, 3, and 5 pixels) as the evaluation metrics. We use the nearest neighbor search to find matches for all local features (except LoFTR). For homography estimation, we use the method provided by OpenCV3 (with RANSAC) as the robust estimator. The results are shown in Table 1. It is obvious that our DF outperforms hand-crafted features in most metrics. Although ORB achieves higher repeatability, its detections tend to cluster together, which generally results in more mismatches and incorrect pose estimations. In contrast, our DF produces more distributed features and thus obtains more correct matches than ORB in challenging scenarios with illumination changes, viewpoint changes, or high repetition. In addition, as shown in Table 2, thanks to the lightweight network structure, our DFNet is much more efficient than LoFTR. Though SuperPoint is also lightweight, it is not efficient in feature matching due to its float descriptors, whereas our DF achieves a lower matching time by calculating the Hamming distance with XOR operations.
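For example, with 256-bit descriptors packed into 32 bytes, the Hamming distance reduces to an XOR followed by a bit count; the NumPy sketch below is illustrative (OpenCV's BFMatcher with NORM_HAMMING provides equivalent functionality for packed binary descriptors):

```python
import numpy as np

def hamming_match(desc_a, desc_b, max_dist=64):
    """Brute-force nearest-neighbour matching of packed binary descriptors
    (each row is a 256-bit descriptor stored as 32 uint8 values)."""
    # XOR every pair of descriptors and count the differing bits.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]            # (N, M, 32)
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1)          # (N, M) Hamming distances
    nn = dist.argmin(axis=1)
    keep = dist[np.arange(len(desc_a)), nn] <= max_dist
    return np.stack([np.where(keep)[0], nn[keep]], axis=1)   # matched index pairs
```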
4.4 Segmentation of Dynamic Objects
As shown in Fig. 3, our MDNet can effectively segment moving objects. In contrast, DynaSLAM (monocular) [3], which is based on the semantic segmentation network Mask R-CNN, fails to segment the moving chair, laptop, and box, since these objects are considered potentially static classes. Therefore, DynaSLAM cannot deal with the interference caused by these moving objects. Besides, as shown in Table 2, the FLOPs and the number of parameters of Mask R-CNN are much larger than those of our MDNet, because it relies on a more complex backbone (ResNet50/ResNet101 [15]) and some additional modules, such as RoIPooling and RoIAlign. Though the motion attention module introduces a large computational cost, our MDNet is still much more efficient than Mask R-CNN.

4.5 Full System Evaluation
To illustrate the effectiveness of our methods in pose estimation, we perform evaluations of the full system on the KITTI and TUM RGB-D datasets. The absolute trajectory RMSE (ATE) is adopted as the evaluation metric. We compare our system with four SLAM systems: ORB-SLAM2 [21], ASD-SLAM [19], DynaSLAM [3], and DROID-SLAM [29]. ORB-SLAM2 is a widely used system based on a hand-crafted feature. Built on ORB-SLAM2, ASD-SLAM handles environmental changes by leveraging a deep feature descriptor. Similarly, DynaSLAM is also built on ORB-SLAM2, but it still uses ORB features and proposes to use the semantic segmentation network Mask R-CNN to filter dynamic features. DROID-SLAM is a recent end-to-end deep learning-based SLAM. The comparison results are shown in Table 3. Our method can dramatically improve the absolute trajectory accuracy of the original ORB-SLAM2. On the one hand, as mentioned above, our deep feature can produce more distributed features and correct matches, which are useful for pose estimation and loop closure detection (see Fig. 4). On the other hand, thanks to motion segmentation, our system can filter dynamic features and achieve higher accuracy compared with ORB-SLAM2 and ASD-SLAM. Moreover, though DynaSLAM can discard dynamic features on semantically dynamic objects based on Mask R-CNN, our system still outperforms it, because our method can extract more reliable features, while DynaSLAM fails to segment the dynamic objects that are semantically static (see Fig. 3). As for the end-to-end system DROID-SLAM, due to its limited generalization ability, it performs poorly on the KITTI dataset.

4.6 Ablation Studies
In this section, we further conduct experiments to study the impacts of our deep local feature and motion segmentation mask. The performances of different variations of our system on the TUM RGB-D dataset are shown in Table 4. Here the absolute trajectory RMSE (ATE) and the percentage of the successfully tracked trajectory (Traj.) are used to evaluate the accuracy and robustness of the SLAM system. From Table 4, we can clearly see that using either our deep local feature or our motion segmentation mask improves the accuracy and robustness of the original ORB-SLAM2. Especially in highly dynamic environments, the improvement of our final system is more obvious, which illustrates the effectiveness of our methods in improving accuracy and robustness. Although the original ORB-SLAM2 achieves higher absolute trajectory accuracy in some low-dynamic scenes, this is because its successfully tracked trajectory is shorter, which results in fewer cumulative errors.

5 CONCLUSION
In this paper, a robust monocular SLAM system is presented, which can effectively reduce the influence of illumination changes, viewpoint changes, and dynamic objects in dynamic environments. This is achieved by combining the traditional SLAM system with two deep learning networks. The proposed DFNet can detect more robust and efficient features for pose estimation and loop closure detection. Besides, the proposed MDNet can segment reliable masks based on the motion information of two adjacent images, which overcomes the drawback of relying on semantic prior information. The comparison against state-of-the-art SLAM systems on publicly available datasets shows the effectiveness of our system. In the future, we seek to deploy our system on mobile robots to help them accomplish the tasks of localization and mapping.

ACKNOWLEDGMENTS
This research was funded by the Natural Science Foundation of Guangdong Province (2023A1515012685, 2023A1515011296), the Open Research Fund from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) under Grant No. GML-KF-22-28, and the National Natural Science Foundation of China (62002230, 62032015).

REFERENCES
[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 12 (2017), 2481–2495.
[2] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. 2017. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5173–5182.
[3] Berta Bescos, José M Fácil, Javier Civera, and José Neira. 2018. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters 3, 4 (2018), 4076–4083.
[4] Jiawang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, and Ian Reid. 2019. Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in Neural Information Processing Systems 32 (2019).
[5] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J Davison. 2018. CodeSLAM—learning a compact, optimisable representation for dense visual SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2560–2568.
[6] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016).
[7] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J Davison. 2020. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters 5, 2 (2020), 721–728.
[8] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2018. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 224–236.
[9] Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 6 (1981), 381–395.
[10] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32, 11 (2013), 1231–1237.
[11] Haifan Gong, Guanqi Chen, Sishuo Liu, Yizhou Yu, and Guanbin Li. 2021. Cross-modal self-attention with multi-task pre-training for medical visual question answering. In Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR). 456–460.
[12] Richard I Hartley. 1997. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 6 (1997), 580–593.
[13] Richard I Hartley and Peter Sturm. 1997. Triangulation. Computer Vision and Image Understanding 68, 2 (1997), 146–157.
[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[16] Dongjiang Li, Xuesong Shi, Qiwei Long, Shenghui Liu, Wei Yang, Fangshi Wang, Qi Wei, and Fei Qiao. 2020. DXSLAM: A robust and efficient visual SLAM system with deep features. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4958–4965.
[17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
[18] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[19] Taiyuan Ma, Yafei Wang, Zili Wang, Xulei Liu, and Huimin Zhang. 2020. ASD-SLAM: A novel adaptive-scale descriptor learning for visual SLAM. In 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 809–816.
[20] Tomoyuki Mukasa, Jiu Xu, and Bjorn Stenger. 2017. 3D scene mesh from CNN depth predictions and sparse monocular SLAM. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 921–928.


[21] Raul Mur-Artal and Juan D Tardós. 2017. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33, 5 (2017), 1255–1262.
[22] Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. 2019. R2D2: Reliable and repeatable detector and descriptor. Advances in Neural Information Processing Systems 32 (2019).
[23] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision. IEEE, 2564–2571.
[24] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. 2012. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 573–580.
[25] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. 2018. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8934–8943.
[26] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. 2021. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8922–8931.
[27] Qiang Sun and Yanwei Fu. 2019. Stacked self-attention networks for visual question answering. In Proceedings of the 2019 International Conference on Multimedia Retrieval (ICMR). 207–211.
[28] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. 2017. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6243–6252.
[29] Zachary Teed and Jia Deng. 2021. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Advances in Neural Information Processing Systems 34 (2021), 16558–16569.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[31] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV). 3–19.
[32] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. 2016. LIFT: Learned invariant feature transform. In European Conference on Computer Vision. Springer, 467–483.
[33] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi Wei, and Qiao Fei. 2018. DS-SLAM: A semantic visual SLAM towards dynamic environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1168–1174.
[34] Tianwei Zhang, Huayan Zhang, Yang Li, Yoshihiko Nakamura, and Lei Zhang. 2020. FlowFusion: Dynamic dense RGB-D SLAM based on optical flow. In 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 7322–7328.
[35] Wang Zhao, Shaohui Liu, Yezhi Shu, and Yong-Jin Liu. 2020. Towards better generalization: Joint depth-pose learning without PoseNet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9151–9161.
[36] Fangwei Zhong, Sheng Wang, Ziqi Zhang, and Yizhou Wang. 2018. Detect-SLAM: Making object detection and SLAM mutually beneficial. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1001–1010.

