A Robust Deep Learning Enhanced Monocular SLAM System For Dynamic Environments
A Robust Deep Learning Enhanced Monocular SLAM System For Dynamic Environments
Dynamic Environments
Yaoqing Li Sheng-Hua Zhong∗
1 Shenzhen
University Shenzhen University
2 SFMAP
Technology Shenzhen, China
Shenzhen, China [email protected]
[email protected]
508
ICMR ’23, June 12–15, 2023, Thessaloniki, Greece Li et al.
objects can be moved by other dynamic objects. Therefore, using are among the first successful systems that integrate CNN-based
existing semantic segmentation networks is unreliable since it sim- depth predictions into a geometric SLAM pipeline. Building on
ply relies on semantic prior information to determine whether an CodeSLAM, DeepFactors [7] achieved better performance by inte-
object is moving. grating learned priors with different classical SLAM formulations in
To this end, we direct our efforts to improve the robustness a factor-graph probabilistic framework. DROID-SLAM [29] outper-
of features through a deep local feature detection network and formed prior works by performing recurrent iterative optimizations
discard dynamic features through a motion segmentation network. of camera poses and depth maps through a dense BA layer, but the
In summary, the main contributions of this paper are as follows: computation cost is high. Experimental results demonstrated the
• We introduce a deep feature detection network (DFNet) to effectiveness of the above-mentioned methods in challenging envi-
detect robust keypoints and descriptors for SLAM. DFNet has ronments with illumination or viewpoint changes. However, under
a simplified CNN structure and a binary descriptor activation the interference of dynamic objects, their performances usually
layer, which allows our system to detect and match features drop significantly.
effectively and efficiently. Therefore, it’s essential to remove the unreliable features of dy-
• Instead of using the existing semantic segmentation net- namic objects. Some previous work addressed this issue by using
works, we propose a novel network (MDNet) to segment semantic information. Zhong et al. [36] applied the object detection
dynamic objects. This network can efficiently extract the network SSD [17] to detect dynamic objects, then the features of se-
appearance and motion information of image sequences and mantically dynamic objects such as people or cars were considered
accurately segment the most discriminative dynamic objects dynamic and discarded. Yu et al. [33] utilized a semantic segmen-
by using an attention transfer module. tation network SegNet [1] to get prior knowledge about dynamic
• Based on the proposed networks, we implement a robust objects, then discarded the unreliable features on dynamic objects.
monocular SLAM system and achieve better results when Similarly, Bescos et al. [3] proposed DynaSLAM, which leverages
compared to state-of-the-art SLAM systems in highly dy- Mask R-CNN [14] to remove the potential moving objects, then
namic environments. reconstruct the static background via ORB-SLAM2 [21] framework.
These approaches performed well in many dynamic environments.
2 RELATED WORK However, due to the semantic-based methods can only segment the
labeled dynamic objects in the training datasets, their robustness
2.1 Deep Local Feature would decrease when the environment contains unknown dynamic
With the progress of deep learning, many deep local features have objects. Moreover, they also failed to segment the dynamic objects
been proposed, which have been proved to outperform hand-crafted that are semantically static, such as chairs, laptops, and boxes that
local features in challenging environments with illumination or can be moved by people (see Fig. 3).
viewpoint changes. Yi et al. [32] proposed a convolutional replace-
ment for SIFT, called LIFT, which combines keypoint detection, di- 3 PROPOSED METHOD
rection estimation, and descriptor calculation modules, but requires
additional supervision from a classical Structure from Motion sys- To achieve accurate and robust performance in highly dynamic
tem. DeTone et al. [8] designed a self-supervised domain adaptation environments, we propose two convolutional neural networks and
framework called Homographic Adaptation and used it to train interpolate them into the traditional SLAM system. On the one
a fully convolutional neural network for local feature detection hand, to deal with the illumination or viewpoint changes, the hand-
(SuperPoint). To avoid ambiguous areas and obtain reliable local crafted feature extractor in the original SLAM system is replaced
features, R2D2 [22] proposed to jointly learn keypoint detection and by a deep local feature detection network, which can generate
description together with a confidence for keypoint reliability. Af- keypoints and binary descriptors that are robust to environmental
ter feature detection and feature description, the above-mentioned variations. On the other hand, a motion segmentation and depth
local features need to further use the nearest neighbor search to estimation network is proposed to simultaneously generate pixel-
find matches between two images. Sun et al. [26] proposed LoFTR, wise motion segmentation mask and depth map, so that our system
which can obtain matching features directly by using self and cross can easily discard dynamic features and reconstruct static maps.
attention layers [11, 27, 30] in transformer. The proposed networks as well as the workflow of our SLAM system
are described in detail as follows.
2.2 Deep Learning Enhanced SLAM
With high accuracy and robustness, deep features also have been 3.1 Learning-based Local Feature Extraction
proposed to substitute hand-crafted feature descriptors in tradi- Network Structure. The structure of our deep local feature detec-
tional SLAM. Ma et al. [19] proposed a novel deep local feature tion network (DFNet) is shown in Fig. 1 (a). Similar to many existing
descriptor (ASD) and designed a SLAM system named ASD-SLAM local feature detection networks [8, 19], DFNet has a shared encoder
based on it. Similarly, Li et al. [16] incorporated HF-Net into the and two decoders to process a gray image, yielding a probability
ORB-SLAM2 to extract both local and global features. In addition map sized Rℎ×𝑤×1 for keypoint confidences as well as a dense
to replacing hand-crafted with deep features, there are a number feature map sized Rℎ×𝑤×256 for descriptors. Then sparse keypoints
of works [5, 7, 28, 29] focused on how deep learning-based depth can be obtained by performing non-maximum suppression (NMS)
estimation can be deployed for accurate and dense reconstruc- over the probability map, and their descriptors are sampled from
tion in monocular systems. CNN-SLAM [28] and CodeSLAM [5] the feature map by using bilinear sampling.
509
A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments ICMR ’23, June 12–15, 2023, Thessaloniki, Greece
(d) RM
(d) RM
(d) RM
(d) Sigmoid
RM
256
128
64 Depth
32
16 16 ࢎ灅࢝灅
256 512 Current Frame
RGB
128 ࡲࢉ
64 (c) Motion ࡲࡹ ࡾ ࡾ ࡾ ࡾ ࡾ
(d) RM
(d) RM
(d) RM
ࢎ灅࢝灅 64 Feature Maps Attention (d) RM Sigmoid
Database Last Frame
Module 256
ࡲࡸ ࡿ 128 64
ࡿ Mask
ࡿ 32 16 16 ࢎ灅࢝灅
ࡿ
(b) MDNet
ࡲ ࡲࡸ ࡾ ࡿ
ࡴ灅ࢃ灅 ࡴ灅ࢃ灅 ࡴ /2 灅 ࢃ / 灅 ࡴ 灅 ࢃ 灅 ᇱ
: ConvBlock of ResNet18
: 2D Conv + BN + ReLU
Spatial Attention Spatial Attention Up Sample
: MaxPool2d
Conv2d Conv2d
: Motion Attention Module
ࡴ 灅 ࢃ 灅 /r ࡴ 灅 ࢃ 灅 /r
Conv2d : Refining Module
ࡴ 灅 ࢃ 灅 ( +ᇱ )
: Operations
Softmax Conv2d
ࡴ 灅 ࢃ 灅 /2 : Matrix Multiplication
: Element-wise Sum
ࡲࡹ ࡾା : Concatenation
ࡴ 灅 ࢃ 灅 ࡴ 灅 ࢃ 灅 /2
Figure 1: Pipeline of (a) the deep local feature detection network (DFNet) and (b) the motion segmentation and depth estimation
network (MDNet). (c) Illustration of the motion attention module (MAM), which leverages two spatial attention modules [31]
to extract features of discriminative motion areas between two adjacent frames. (d) Illustration of the refining module (RM).
All refining modules of MDNet share a similar structure, which receives the outputs from the corresponding residual block and
the previous refining module, and then produces the feature maps of doubled resolution.
Binary Descriptor. Binary descriptors are more efficient than allow the loss to converge more efficiently, instead of calculating dis-
float descriptors at feature matching. Thus, different from existing tances between all unmatching points, we only consider the hardest
works [16, 19] that directly apply the float descriptors from the negative points that lie outside of a square local neighborhood of
network to estimate pose, we develop our network to learn binary the matching point:
descriptors. To this end, we perform binarization on each descriptor
𝑥𝑛 = 𝑎𝑟𝑔𝑚𝑖𝑛 ∥𝑑𝑡 (𝑥𝑛 ) − 𝑑𝑐 (𝑥𝑐 )∥ 2 𝑠.𝑡 . ∥𝑥𝑛 − 𝑥𝑡 ∥ 2 > 𝑘 (2)
after bilinear sampling. The final binary descriptor is expressed as: 𝑥𝑛 ∈𝐼𝑡
510
ICMR ’23, June 12–15, 2023, Thessaloniki, Greece Li et al.
Fig. 1 (b). For motion segmentation, our goal is to segment mov- TRACKING
ing objects by using the appearance information obtained from Feature Monocular
Pose Estimation
Local Map New Keyframe
two consecutive frames instead of the semantic prior information Filtering Initialization
Relocalization
Tracking Decision
[3, 33], so that we can remove features from dynamic objects ac-
curately. Besides, to build a joint training framework and allow
MAPPING
FEATURE DETECTION
DETECTION
DFNet Detect Loop
Descriptor
We employ a ResNet18 [15] as the encoder to extract high-level Feature Maps Database
Global
appearance representations of the current frame and reference Frame
Last
Frame
Optimization
Current
Frame
Mask
frame. Then features of discriminative motion areas can be ob-
MAPPING (offline)
tained by using a motion attention module (see Fig. 1 (c)). This
DENSE
Static Map
511
A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments ICMR ’23, June 12–15, 2023, Thessaloniki, Greece
𝐷𝑡𝑟𝑖 and predicted depth 𝐷𝑐 is calculated as: Table 1: Feature evaluation on HPatches dataset. The best
results are highlighted in bold, while the second best results
1 ∑︁ 𝐷𝑡𝑟𝑖 (𝑝𝑐 ) − 𝑠𝐷𝑐 (𝑝𝑐 ) 2
L𝑑𝑡 = 𝑀 (𝑝𝑐 ) (9) are underlined.
𝑁2 𝐷𝑡𝑟𝑖 (𝑝𝑐 )
𝑝𝑐 ∈𝐾2
Homography
where the predicted depth 𝐷𝑐 of the current frame is aligned to Detector metric Descriptor metric
est. accuracy
the triangulated depth with a scale factor 𝑠, which explicitly align Repeatability NN mAP e=1px e=3px e=5px
the predicted depth to the triangulated structure. 𝐾2 denotes the ORB 0.641 0.735 0.150 0.395 0.538
SIFT 0.495 0.694 0.424 0.676 0.759
coordinates of matching keypoint in the current frame, and 𝑁 2 is LIFT 0.449 0.664 0.284 0.598 0.717
the number of coordinates in 𝐾2 . SuperPoint 0.581 0.821 0.310 0.684 0.829
By combining deep learning with traditional two-view geometry, LoFTR 0.670 - 0.498 0.597 0.626
Ours 0.612 0.813 0.450 0.770 0.832
we construct a unified framework for joint optimization of the
proposed MDNet, where motion segmentation and depth estimation
can benefit each other. The final loss function can be formulated as Table 2: Efficiency comparison of our proposed networks
a weighted combination of the losses defined in Eq. (3) and Eq. (7): with some widely used methods.
L = 𝜇L𝑚 + 𝜆L𝑑 (10)
Time(ms/frame)
where 𝜇 and 𝜆 are loss weightings for motion segmentation and Task Method
CPU GPU
FLOPs Parameters
depth estimation, respectively. ORB 23.6 - - -
Feature
SuperPoint 233.6 78.6 18.2 G 1.3 M
Extraction &
3.3 Deep Learning Enhanced Monocular SLAM LoFTR 532.5 32.6 103.1 G 5.9 M
Matching
DFNet(Ours) 153.3 24.9 11.3 G 1.3 M
System Mask R-CNN
Motion 967.5 67.3 123.3 G 63.2 M
As mentioned before, the proposed networks can generate robust (DynaSLAM)
Segmentation
static features and depth maps for tracking and mapping, which MDNet(Ours) 85.7 8.1 11.5 G 20.3 M
can substitute feature extraction modules of any type of feature-
based monocular SLAM. In this work, we provide an example of
employing the monocular version of ORB-SLAM2 as the backbone
to implement a robust monocular SLAM system for dynamic en- described in [8]. Then we only train the encoder of MDNet and
vironments, as this system has been well explored and the feature motion segmentation decoder for 15 epochs. Finally, we jointly train
originally used by it has the same structure as our deep feature, motion segmentation and depth estimation networks for 20 epochs.
which makes it easier to incorporate our networks into this system. To generate pseudo-ground-truth motion segmentation labels using
The framework of our SLAM system is shown in Fig. 2. the optical flow residual, we employ PWCNet [25], an excellent
At the front-end of the system, a DFNet is used to generate optical flow estimation network, to generate optical flow. In our
keypoints and binary descriptors for each frame. Meanwhile, the implementation, an Adam optimizer with a learning rate of 0.0001
appearance representations of the current frame are extracted by is used to train our networks. All training samples have a fixed
the encoder of MDNet, then these representations are fed together resolution (832 × 256), and the batch size, momentum, and weight
with the appearance representations of the last frame into the mo- decay are set to 16, 0.9, and 0.0005, respectively. The thresholds
tion segmentation decoder to generate a motion segmentation mask. 𝑘 is 8. The loss weightings are [𝑤𝑑𝑠 , 𝑤𝑑𝑡 , 𝑤𝑑𝑝 ] = [0.001, 1.0, 0.1]
Through a depth estimation decoder, MDNet can also generate a and [𝜇, 𝜆] = [0.2, 1.0]. Our networks are trained using Python and
dense depth map for each frame. In the tracking thread and the PyTorch framework. After training, we use libtorch, a Pytorch C++
mapping thread, the local features belonging to moving objects are API, to implement our networks. All experiments are performed
first filtered by the predicted mask. Only reliable static features are on a computer with a 2.9 GHz Intel(R) Xeon(R) CPU and an Nvidia
reserved to perform pose estimation, relocalization, and mapping. Tesla V100 GPU.
Once the pose estimation of the camera has been done, we can
reconstruct static dense maps with predicted motion segmentation
masks and depth estimates using the methods described in [20]. 4.2 Datasets
HPatches dataset [2] includes 116 challenging scenes with 696 im-
4 EXPERIMENTS ages. The first 57 scenes contain significant lighting changes, and
the remaining 59 scenes have obvious viewpoint changes.
4.1 Training Strategy and Implementation KITTI dataset [10] contains 11 labeled sequences captured in
Details urban and highway environments with many moving objects like
Similar to [4, 35], we train the proposed networks on the KITTI pedestrians and vehicles. Besides, high moving speed, fast rotation
dataset [10], which is widely used for training monocular depth of camera and repeated visual elements such as similar walls and
estimation networks and evaluating visual SLAM systems. KITTI bushes pose high challenges to monocular SLAM algorithms.
contains 11 sequences, while only sequences 00-06 are used for TUM RGB-D dataset [24] contains 39 sequences obtained from
training in this work. The whole training schedule can be divided different indoor scenes, which can be divided into three categories:
into three stages. Firstly, we train DFNet in a self-supervised manner static, low dynamic, and high dynamic.
512
ICMR ’23, June 12–15, 2023, Thessaloniki, Greece Li et al.
Figure 3: The motion segmentation results of our method Table 4: ATE [m] and percentage of successfully tracked tra-
and DynaSLAM (monocular) [3] based on Mask R-CNN. Our jectory for several variants of our system on TUM RGB-D
method can accurately segment moving objects. While Dy- dataset. ’*’ means initialization failure or tracking failure.
naSLAM fails to segment the dynamic objects that are seman-
Method Low Dynamic High Dynamic
tically static, such as the chair, laptop, and box. Metric
ORB-
DF MASK fr3_s_hal fr3_s_rpy fr3_w_hal fr3_w_rpy fr3_w_xyz
SLAM2
√ ATE(m) 0.0152 0.0185 0.0191 0.0747 *
Ground-truth Ground-truth % Traj. 94.98 53.93 82.97 77.80 0.00
ORB-SLAM2 Ours
√ √ ATE(m) 0.0144 0.0208 0.0240 0.0577 0.0194
% Traj. 91.67 57.14 77.93 80.55 78.92
√ √ ATE(m) 0.0203 0.0273 0.0171 0.0305 0.0170
% Traj. 97.86 99.15 97.20 86.31 84.96
√ √ √ ATE(m) 0.0189 0.0241 0.0162 0.0284 0.0132
% Traj. 98.11 99.34 98.93 82.52 88.17
ORB-SLAM2 Ours
local features (except LoFTR). For homography estimation, we use
(a) KITTI 09 the method provided by OpenCV3 (with RANSAC) as the robust
Ground-truth Ground-truth
estimator. The results are shown in Table 1. It is obvious that our DF
ORB-SLAM2 Ours
outperforms hand-crafted features in most metrics. Although ORB
achieves higher repeatability, its detections tend to cluster together,
which generally results in more mismatches and incorrect pose
estimations. While our DF can produce more distributed features,
thus it obtains more correct matches compared with ORB in chal-
lenging scenarios with illumination changes, viewpoint changes,
or high repetition. In addition, as shown in Table 2, thanks to the
ORB-SLAM2 Ours
lightweight network structure, our DFNet is much more efficient
(b) KITTI 10 than LoFTR. Though SuperPoint is also lightweight, it is not effi-
cient in feature matching due to its float descriptors. While our DF
Figure 4: Ground-truth and trajectories estimated by ORB- can achieve a lower matching time by calculating the Hamming
SLAM2 (Monocular) (left) and the proposed SLAM system distance with XOR operations.
(right) in the sequence 09 and the sequence 10 of KITTI
dataset. The absolute trajectory errors (ATE) are mapped 4.4 Segmentation of Dynamic Objects
onto corresponding trajectory. As shown in Fig. 3, our MDNet can effectively segment moving
objects. While based on the semantic segmentation network Mask
R-CNN, DynaSLAM (monocular) [3] fails to segment the moving
4.3 Feature Evaluation chair, laptop, and box, since these objects are considered as the
To evaluate the performance of our deep feature (DF), we compare potentially static classes. Therefore, DynaSLAM cannot deal with
it with widely used hand-crafted features (SIFT [18], ORB [23]) and the interference caused by these moving objects. Besides, as shown
other state-of-the-art deep features, including LIFT [32], SuperPoint in Table 2, the FLOPs and the number of parameters of Mask R-
[8] and LoFTR [26], on the HPatches dataset. We follow the evalua- CNN are much larger than our MDNet, because it relies on a more
tion strategy proposed by [8] and use nearest neighbor mAP (NN complex backbone (ResNet50/ResNet101[15]) and some additional
mAP), repeatability, and homography estimation accuracy under modules, such as RoIPooling and RoIAlign. Though the motion
different correct distances (e=1, 3, and 5 pixels) as the evaluation attention module introduces a large computational cost, our MDNet
metrics. We use the nearest neighbor search to find matches for all is still much more efficient than Mask R-CNN.
513
A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments ICMR ’23, June 12–15, 2023, Thessaloniki, Greece
4.5 Full System Evaluation The comparison against the state-of-the-art SLAM systems on pub-
To illustrate the effectiveness of our methods in pose estimation, licly available datasets shows the effectiveness of our system. In
we perform evaluations of the full system on the KITTI and TUM the future, we seek to deploy our system on mobile robots to help
RGB-D datasets. The absolute trajectory RMSE (ATE) is adopted them accomplish the tasks of localization and mapping.
as the evaluation metric. We compare our system with four SLAM
systems: ORB-SLAM2 [21], ASD-SLAM [19], DynaSLAM [3], and ACKNOWLEDGMENTS
DROID-SLAM [29]. ORB-SLAM2 is a widely used system using the This research was funded by Natural Science Foundation of Guang-
hand-crafted feature. Based on ORB-SLAM2, ASD-SLAM handles dong Province (2023A1515012685, 2023A1515011296), Open Re-
the environmental changes by leveraging the deep feature descrip- search Fund from Guangdong Laboratory of Artificial Intelligence
tor. Similarly, DynaSLAM is also built on ORB-SLAM2, but it still and Digital Economy (SZ) under Grant No. GML-KF-22-28, and Na-
uses ORB features and proposes to use the semantic segmentation tional Natural Science Foundation of China (62002230, 62032015).
network Mask R-CNN to filter dynamic features. DROID-SLAM is
a recent end-to-end deep learning-based SLAM. The comparison REFERENCES
results are shown in Table 3. Our method can dramatically improve [1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. Segnet: A deep
convolutional encoder-decoder architecture for image segmentation. IEEE trans-
the absolute trajectory accuracy of the original ORB-SLAM2. On actions on pattern analysis and machine intelligence 39, 12 (2017), 2481–2495.
the one hand, as mentioned above, our deep feature can produce [2] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. 2017.
more distributed features and correct matches, which are useful HPatches: A benchmark and evaluation of handcrafted and learned local de-
scriptors. In Proceedings of the IEEE conference on computer vision and pattern
for pose estimation and loop closure detection (see Fig. 4). On the recognition. 5173–5182.
other hand, thanks to motion segmentation, our system can fil- [3] Berta Bescos, José M Fácil, Javier Civera, and José Neira. 2018. DynaSLAM: Track-
ing, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation
ter dynamic features and achieve higher accuracy compared with Letters 3, 4 (2018), 4076–4083.
ORB-SLAM2 and ASD-SLAM. Moreover, though DynaSLAM can [4] Jiawang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, and Ian Reid. 2019.
discard dynamic features on semantically dynamic objects based Unsupervised scale-consistent depth and ego-motion learning from monocular
video. Advances in neural information processing systems 32 (2019).
on the Mask R-CNN, our system also outperform it. Because our [5] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and An-
method can extract more reliable features, while DynaSLAM failed drew J Davison. 2018. CodeSLAM—learning a compact, optimisable representa-
to segment the dynamic objects that are semantically static (see Fig. tion for dense visual SLAM. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 2560–2568.
3). As for the end-to-end system DROID-SLAM, due to the limited [6] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, and Yoshua Bengio. 2016.
generalization ability, it performs poorly on the KITTI dataset. Binarized neural networks: Training deep neural networks with weights and
activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830 (2016).
[7] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J Davison. 2020.
Deepfactors: Real-time probabilistic dense monocular slam. IEEE Robotics and
4.6 Ablation Studies Automation Letters 5, 2 (2020), 721–728.
In this section, we further conduct experiments to study the im- [8] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2018. Superpoint:
Self-supervised interest point detection and description. In Proceedings of the
pacts of our deep local feature and motion segmentation mask. The IEEE conference on computer vision and pattern recognition workshops. 224–236.
performances of different variations of our system on the TUM [9] Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a
paradigm for model fitting with applications to image analysis and automated
RGB-D dataset are shown in Table 4. Here the absolute trajectory cartography. Commun. ACM 24, 6 (1981), 381–395.
RMSE (ATE) and the percentage of the successfully tracked trajec- [10] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision
tory (Traj.) are used to evaluate the accuracy and robustness of the meets robotics: The kitti dataset. The International Journal of Robotics Research
32, 11 (2013), 1231–1237.
SLAM system. From Table 4, we can see clearly that both using our [11] Haifan Gong, Guanqi Chen, Sishuo Liu, Yizhou Yu, and Guanbin Li. 2021. Cross-
deep local feature and motion segmentation mask can improve the modal self-attention with multi-task pre-training for medical visual question
accuracy and robustness of the original ORB-SLAM2. Especially in answering. In Proceedings of the 2021 International Conference on Multimedia
Retrieval (ICMR). 456–460.
high dynamic environments, the improvement of our final system [12] Richard I Hartley. 1997. In defense of the eight-point algorithm. IEEE Transactions
is more obvious, which illustrates the effectiveness of our methods on pattern analysis and machine intelligence 19, 6 (1997), 580–593.
[13] Richard I Hartley and Peter Sturm. 1997. Triangulation. Computer vision and
in improving the accuracy and robustness. Although in some low- image understanding 68, 2 (1997), 146–157.
dynamic scenes, the original ORB-SLAM2 achieves higher absolute [14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn.
trajectory accuracy. It is due to the successful tracking trajectory of In Proceedings of the IEEE international conference on computer vision. 2961–2969.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual
ORB-SLAM2 is shorter, which results in fewer cumulative errors. learning for image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 770–778.
[16] Dongjiang Li, Xuesong Shi, Qiwei Long, Shenghui Liu, Wei Yang, Fangshi Wang,
5 CONCLUSION Qi Wei, and Fei Qiao. 2020. DXSLAM: A robust and efficient visual SLAM system
with deep features. In 2020 IEEE/RSJ International Conference on Intelligent Robots
In this paper, a robust monocular SLAM system is presented, which and Systems (IROS). IEEE, 4958–4965.
can effectively reduce the influence of illumination changes, view- [17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, and Alexander C Berg. 2016. Ssd: Single shot multibox detector.
point changes, and dynamic objects in dynamic environments. This In European conference on computer vision. Springer, 21–37.
is achieved by combining the traditional SLAM system with two [18] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints.
International journal of computer vision 60, 2 (2004), 91–110.
deep learning networks. The proposed DFNet can detect more [19] Taiyuan Ma, Yafei Wang, Zili Wang, Xulei Liu, and Huimin Zhang. 2020. ASD-
robust and efficient features for pose estimation and loop closure SLAM: A Novel Adaptive-Scale Descriptor Learning for Visual SLAM. In 2020
detection. Besides, the proposed MDNet can segment reliable masks IEEE Intelligent Vehicles Symposium (IV). IEEE, 809–816.
[20] Tomoyuki Mukasa, Jiu Xu, and Bjorn Stenger. 2017. 3d scene mesh from cnn depth
based on the motion information of two adjacent images, which predictions and sparse monocular slam. In Proceedings of the IEEE international
overcomes the drawback of utilizing semantic prior information. conference on computer vision workshops. 921–928.
514
ICMR ’23, June 12–15, 2023, Thessaloniki, Greece Li et al.
[21] Raul Mur-Artal and Juan D Tardós. 2017. Orb-slam2: An open-source slam system [29] Zachary Teed and Jia Deng. 2021. Droid-slam: Deep visual slam for monocular,
for monocular, stereo, and rgb-d cameras. IEEE transactions on robotics 33, 5 stereo, and rgb-d cameras. Advances in neural information processing systems 34
(2017), 1255–1262. (2021), 16558–16569.
[22] Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. [30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
2019. R2d2: Reliable and repeatable detector and descriptor. Advances in neural Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
information processing systems 32 (2019). you need. Advances in neural information processing systems 30 (2017).
[23] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An [31] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. Cbam:
efficient alternative to SIFT or SURF. In 2011 International conference on computer Convolutional block attention module. In Proceedings of the European conference
vision. Ieee, 2564–2571. on computer vision (ECCV). 3–19.
[24] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel [32] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. 2016. Lift: Learned
Cremers. 2012. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 invariant feature transform. In European conference on computer vision. Springer,
IEEE/RSJ international conference on intelligent robots and systems. IEEE, 573–580. 467–483.
[25] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. 2018. Pwc-net: Cnns [33] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi Wei, and Qiao Fei. 2018.
for optical flow using pyramid, warping, and cost volume. In Proceedings of the DS-SLAM: A semantic visual SLAM towards dynamic environments. In 2018
IEEE conference on computer vision and pattern recognition. 8934–8943. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE,
[26] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. 2021. 1168–1174.
LoFTR: Detector-free local feature matching with transformers. In Proceedings of [34] Tianwei Zhang, Huayan Zhang, Yang Li, Yoshihiko Nakamura, and Lei Zhang.
the IEEE/CVF conference on computer vision and pattern recognition. 8922–8931. 2020. Flowfusion: Dynamic dense rgb-d slam based on optical flow. In 2020 IEEE
[27] Qiang Sun and Yanwei Fu. 2019. Stacked self-attention networks for visual International Conference on Robotics and Automation (ICRA). IEEE, 7322–7328.
question answering. In Proceedings of the 2019 on International Conference on [35] Wang Zhao, Shaohui Liu, Yezhi Shu, and Yong-Jin Liu. 2020. Towards better
Multimedia Retrieval (ICMR). 207–211. generalization: Joint depth-pose learning without posenet. In Proceedings of the
[28] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. 2017. Cnn-slam: IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9151–9161.
Real-time dense monocular slam with learned depth prediction. In Proceedings of [36] Fangwei Zhong, Sheng Wang, Ziqi Zhang, and Yizhou Wang. 2018. Detect-SLAM:
the IEEE conference on computer vision and pattern recognition. 6243–6252. Making object detection and SLAM mutually beneficial. In 2018 IEEE Winter
Conference on Applications of Computer Vision (WACV). IEEE, 1001–1010.
515