SLAM Systems for Autonomous Vehicles
https://doi.org/10.1007/s10846-022-01582-8
SURVEY PAPER
Received: 14 April 2021 / Accepted: 24 January 2022 / Published online: 20 April 2022
© The Author(s), under exclusive licence to Springer Nature B.V. 2022
Abstract
Simultaneous Localization and Mapping (SLAM) has been widely studied in recent years for autonomous vehicles. SLAM achieves its purpose by constructing a map of the unknown environment while keeping track of the vehicle's location within it. A major challenge, which is paramount during the design of SLAM systems, lies in the efficient use of onboard sensors to perceive the environment. The most widely applied algorithms are camera-based SLAM and LiDAR-based SLAM, and recent research focuses on the fusion of camera-based and LiDAR-based frameworks, which shows promising results. In this paper, we first present a study of the commonly used sensors and the fundamental theories behind SLAM algorithms, together with the hardware architectures used to run these algorithms and, where available, the performance obtained. Secondly, we highlight state-of-the-art methodologies in each modality and in the multi-modal framework; a brief comparison followed by future challenges is then underlined. Additionally, we provide insights into possible fusion approaches that can increase the robustness and accuracy of modern SLAM algorithms, thereby enabling the hardware-software co-design of embedded systems that takes into account the algorithmic complexity, the embedded architectures and the real-time constraints.
being one of the most challenging, is the main scope of our article. Indeed, autonomous vehicles engender additional complexity related to the real-time constraints, the high dynamics of the scene, the large scale of the surrounding environment and the long-term variations that can occur in a visited place.

The complex problem introduced by SLAM can be addressed by integrating multiple sensors in a single framework that can offer the level of integrity and redundancy that a SLAM system for autonomous vehicles needs. However, in the literature, many existing solutions focus on improving established ones, while fewer approaches are groundbreaking. Historically, one of the most relevant solutions was to use fusion techniques to enhance accuracy. Following these same steps, it should be noted that some of the most creative techniques in SLAM were the ones that combined several modalities, even if they are very different from each other and can induce some complications. Indeed, with each sensor having its own strengths and weaknesses, formulating a solution that integrates a number of them has been shown to improve the overall robustness and accuracy.

Moreover, most of the state-of-the-art solutions focus on improving the algorithmic aspect of the SLAM problem. Even if this approach is essential to the progress of research in the robotics community, much work remains to be done on the design of specific embedded architectures and real-time systems. The hardware constraints associated with SLAM for autonomous vehicles (small design, low consumption, near-sensor processing and real-time data processing) require researchers to adopt a systemic approach in the resolution of this problem. Defining an efficient distribution of the computational load over general-purpose processing units and application-specific hardware accelerators is still an open challenge that is not explored enough despite its many promising possibilities.

Our aim in this paper is to highlight the novel, groundbreaking existing SLAM strategies, with a special focus on solutions based on both camera and LiDAR, while exposing existing problems and possible future solutions for automotive applications evolving in a large-scale dynamic environment. In contrast to existing surveys such as [27], which limit their scope to the algorithmic aspect of SLAM and omit the target application of such systems, we present a more complete survey that takes a systemic approach. Our paper presents a detailed description of both software- and hardware-coupled camera-LiDAR SLAM systems applied to autonomous vehicles. Moreover, we provide the reader with insights into current trends and future research orientations for SLAM systems based on sensor fusion. The main contributions of this work are:

1. Reviewing, discussing, and comparing the commonly used sensors in the SLAM framework.
2. Deeply analyzing state-of-the-art SLAM systems depending on the general tendency of the use case, the type of the adopted scenario, the estimated precision and, most importantly, the corresponding hardware architecture.
3. Reviewing the most recent SLAM systems depending on the sensor and software technologies used in the estimation and optimization steps.
4. Presenting a simple and comprehensive survey of SLAM frameworks using multi-modal sensors and classifying these solutions depending on the coupling strategy.
5. Concluding on current developments and proposing future directions of research on multi-modal SLAM and its hardware/software optimization to design embedded real-time systems.

The remainder of this paper is structured into five further sections organized as follows: Section 2 presents the main sensors used in the case of autonomous vehicles, details their particularities and justifies the choice of LiDAR and camera. Section 3 is a brief memo on the SLAM formulation, its main building blocks and the classification of the techniques used to solve it. Section 4 reviews the main mono-modal SLAM systems in the literature that use either a camera or a LiDAR sensor; a description of each approach is given, followed by a comparison and the future challenges to overcome. Section 5 presents and surveys different state-of-the-art multi-modal strategies using at least both camera and LiDAR, and establishes a classification of these strategies. Section 6 discusses the future challenges SLAM systems still face and introduces possible research directions to solve some of them. Figure 1 gives the general structure of the paper and allows the reader to locate each SLAM method and the different levels of data processing in a SLAM system.

2 Multi-Sensor Data Fusion in the SLAM Framework

As mentioned in the introduction (see Section 1), the complexity of SLAM applied to autonomous vehicles is a result of the duality of the problem: the vehicle needs to simultaneously localize itself and map the surrounding environment. Therefore, using data sources that can provide both abundant and reliable information is critical. The commonly used sensors can be classified as proprioceptive and exteroceptive sensors. The first category of sensors measures the state of the vehicle itself (position, speed of the wheels, etc.), which can be exploited only to acquire pertinent information
about the carrier localization. Fortunately, the second category is dual and can answer the needs of SLAM by providing information that ensures both localization and mapping, by measuring the state of the environment surrounding the carrier (temperature, range, luminosity, etc.). However, proprioceptive sensors become a necessity for a reliable system in case the exteroceptive sensors fail. For instance, this is the case when the surrounding environment is highly dynamic and the sensor reaches its data rate limits, preventing the system from locating itself.

Since the conventional sensors that will be reviewed are able to generate different kinds of information, the type and quality of the derived maps can vary greatly. The rest of this section is dedicated to the discussion of the most common sensors in the context of autonomous vehicles and the most commonly used map representations.

2.1 Active Positioning Sensors

2.1.1 Active Beacons

Active beacon navigation systems are commonly used in ships and airplanes, as well as in indoor mobile robot systems. They can be detected reliably and provide accurate positioning information with minimal processing. Two different types of active beacon systems can be distinguished: trilateration (distance measurements) and triangulation (angle measurements). Practical applications commonly use three or more active beacons in the environment, and accurate mounting of the beacons is required for accurate positioning. The position of a mobile robot in three-dimensional space can basically be calculated from the distance information from the beacons, which have their own
absolute position information. This approach allows high sampling rates and yields high reliability, but it also incurs high costs in installation and maintenance [14].
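To make the trilateration principle concrete, the sketch below (an illustration, not taken from any of the surveyed systems) recovers a position from noisy range measurements to beacons with known coordinates, using a simple Gauss-Newton least-squares refinement; the beacon layout and noise level are arbitrary:

```python
import numpy as np

def trilaterate(beacons, ranges, iters=20):
    """Estimate a 3D position from range measurements to beacons with
    known absolute positions (Gauss-Newton least squares).

    beacons : (N, 3) array of beacon coordinates, N >= 4 for an unambiguous fix
    ranges  : (N,)  array of measured distances to each beacon
    """
    p = beacons.mean(axis=0)                  # initial guess: centroid of the beacons
    for _ in range(iters):
        diff = p - beacons                    # (N, 3)
        dist = np.linalg.norm(diff, axis=1)
        r = dist - ranges                     # range residuals
        J = diff / dist[:, None]              # Jacobian of the distances w.r.t. p
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)   # Gauss-Newton step
        p = p + delta
        if np.linalg.norm(delta) < 1e-9:
            break
    return p

# Example: four beacons at known positions, noisy ranges to a vehicle at (2, 1, 0).
rng = np.random.default_rng(0)
beacons = np.array([[0., 0., 0.], [10., 0., 0.], [0., 10., 0.], [0., 0., 10.]])
true_p = np.array([2., 1., 0.])
ranges = np.linalg.norm(beacons - true_p, axis=1) + rng.normal(0, 0.05, 4)
print(trilaterate(beacons, ranges))
```

With only three beacons the 3D solution is ambiguous (two mirror solutions about the beacon plane), which is one reason practical installations use more beacons or additional prior knowledge.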
2.1.2 GNSS Receiver

The Global Navigation Satellite System (GNSS) is the most intuitive solution to provide the geo-referenced position of the vehicle. The concept is based on specialized satellites that continuously transmit a radio signal containing the current time and data about their position. Each GNSS receiver requires at least four satellites to compute four unknown quantities (three position coordinates and the shift of its clock relative to the GNSS time reference). This condition challenges the consistency of this sensor in multi-path environments and during signal outages, which limits its use to outdoor environments where the signal quality is not disturbed by high buildings or dense trees [74]. Moreover, the vehicle's orientation can be determined by measuring the relative positions of multiple GNSS antennas mounted at different positions. However, such solutions can be challenging to integrate in automotive applications considering the dimension constraints and the minimum baseline between two antennas (40 cm to 3 m).

Additionally, it should be noted that, unlike proprioceptive sensors, which only allow determining relative motion, active positioning sensors (such as GNSS receivers and active beacons) are able to provide the absolute position of their carrier.

2.2 Proprioceptive Sensors

2.2.1 Wheel Odometer

The wheels of terrestrial vehicles have encoders that make it possible to count the number of rotations completed by the wheels and thus to estimate the distance traveled. It is then possible to estimate the position and orientation of the vehicle relative to its starting point. Wheel odometry exhibits a drift that increases during movement. This drift comes in particular from the loss of grip of the wheels due to sliding; it is therefore quite variable depending on the nature of the ground (smooth or rough) and the trajectory followed by the vehicle (rather straight, or with many turns).
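As an illustration of this dead-reckoning process, the following minimal sketch assumes a differential-drive model; the encoder resolution, wheel radius and track width are made-up values, and the slip discussed above is not modelled:

```python
import math

def integrate_wheel_odometry(pose, ticks_left, ticks_right,
                             ticks_per_rev=2048, wheel_radius=0.30,
                             track_width=1.6):
    """Dead-reckoning update of a planar pose (x, y, theta) over one
    encoder sampling interval (illustrative geometric parameters)."""
    x, y, theta = pose
    # Distance travelled by each wheel during the interval
    dl = 2 * math.pi * wheel_radius * ticks_left / ticks_per_rev
    dr = 2 * math.pi * wheel_radius * ticks_right / ticks_per_rev
    ds = (dl + dr) / 2.0                 # displacement of the vehicle centre
    dtheta = (dr - dl) / track_width     # change of heading
    # Integrate along the arc (midpoint approximation)
    x += ds * math.cos(theta + dtheta / 2.0)
    y += ds * math.sin(theta + dtheta / 2.0)
    theta += dtheta
    return (x, y, theta)

pose = (0.0, 0.0, 0.0)
for tl, tr in [(100, 104), (100, 104), (100, 104)]:   # right wheel turns more: gentle left turn
    pose = integrate_wheel_odometry(pose, tl, tr)
print(pose)
```

Because the heading error enters every subsequent position update, even small per-step errors accumulate, which is exactly the drift described above.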
measurement errors. Plus, current automotive radar sensors
2.2.2 Inertial Measurement Unit make relatively few detections in one sampling loop due to
the limited sensor angle resolution [95]. Most existing meth-
An Inertial Measurement Unit (IMU) is made up of three ods that perform object recognition using radars use beat
accelerometers and three gyrometers. Accelerometers mea- frequencies in the frequency domain. Other recent research
2.3 Exteroceptive Sensors

2.3.1 Radar

A radar (an acronym for Radio Detection And Ranging) is a measuring instrument that indicates the presence of a distant object, its size, its speed, and its direction. It can operate in challenging weather conditions such as fog, rain, snow, and low/high luminosity.

In general, two radar technologies are used: Pulse-Doppler and Frequency-Modulated Continuous-Wave (FMCW). The Pulse-Doppler technology uses a transmitter that emits radio waves and a receiver that collects the reflected waves to accurately measure the object's velocity by detecting changes in the reflected wavelengths. Alternatively, FMCW technology, which is commonly used in the automotive context, is based on continuously transmitting a carrier frequency and extracting useful information from the return echo, such as the round-trip delay of the received signal or the frequency difference between the transmitted and received signals [49].
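As an illustration of the FMCW principle, for an idealized linear chirp of bandwidth B swept over a duration T_c, the beat frequency f_b measured between the transmitted and received signals relates to the target range R (neglecting the Doppler contribution) as

f_b = \frac{2 B R}{c \, T_c} \quad \Longleftrightarrow \quad R = \frac{c \, T_c \, f_b}{2 B}

where c is the speed of light; the Doppler shift superimposed on f_b is what FMCW radars additionally exploit to estimate the relative velocity.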
Consequently, the object detection accuracy depends on the object's size, the distance at which it is located, its absorption characteristics and the reflection angle. In general, radars with a large opening angle have a short range and vice versa.

Generally, radar sensors cannot distinguish between a stationary target and unwanted echoes [52]. Existing techniques tend to use additional processing to reduce the measurement errors. Moreover, current automotive radar sensors make relatively few detections in one sampling loop due to the limited angular resolution of the sensor [95]. Most existing methods that perform object recognition using radars rely on beat frequencies in the frequency domain. Other recent research analyzes the characteristics of the radar signals of stationary and moving objects, and maps stationary and moving objects onto the time-frequency spectrum [114].
Current radar sensors are usually noisy due to false positives, false negatives or range and angular errors. Users can either rely on the manufacturer's software to reduce these noises or choose to handle the raw sensor data. In the first case, the provided software usually does not specify additional information on the processing methods or the level of uncertainty associated with the processed data. In the second case, working on the raw sensor data directly does not solve the issue. In fact, radar is more prone to noise than LiDAR and camera since it is highly sensitive to the surface reflectance and the reflector pose. Moreover, multipath interference can lead to more noise and outliers since it can introduce inconsistency between consecutive frames, resulting in many false alarms [49, 104].

2.3.2 LiDAR

A LiDAR (an acronym for Light Detection And Ranging) relies on laser beams to determine an object's position. The sensor sends laser pulses towards the surrounding world and obtains the distance of the scanned points from its center by the time-of-flight method. The high frequency, the high precision, the long range and the absence of illumination influence are all known advantages of this technology. Moreover, LiDAR reflectance measurements provide additional information on the structure of the scene. Thus, LiDARs can be used in indoor or outdoor environments and can reproduce their 2D or 3D representations. LiDARs also have higher resolutions than radars due to the large number of scans and the high density of scanning points.

Unfortunately, LiDARs cannot measure velocity directly; the user therefore needs to rely on successive scans to estimate it [54, 89]. Additionally, the weather has a more significant impact on LiDARs than on radars.

2.3.3 Camera

The camera is the most used sensor in SLAM systems and the most studied by researchers. It is mainly used to achieve tasks such as target identification and tracking, object detection or environment map building. There exist different types of cameras:

Monocular Cameras Historically, in the case of a single camera, scene reconstruction requires sensor movement in order to detect and track salient spots, using image processing techniques to match the same spots in the previous frames. However, some emerging techniques using Artificial Intelligence aim to estimate depth from a single color image [127]. These matched and tracked features provide information about the vehicle's ego-motion, which can be translated into a trajectory. The main drawback of monocular cameras is the scale factor drift while calculating distances between points.

Stereoscopic Cameras This sensor consists of two monocular cameras located at a fixed and known distance with overlapping fields of view. With a stereo camera, scene depth information is estimated through matching and triangulation. However, the baseline between the two monocular cameras and their resolution limit the maximum depth range. Therefore, the broader the baseline, the more precise the estimated depth, albeit at the expense of the field of view overlapped by both cameras. In [26], De Croce et al. presented a distributed architecture for the S-PTAM stereo SLAM system. This system is designed for mobile robots with limited processing capacity since it allows the localization module to operate onboard, while the mapping module is run on a remote base station, thus lowering the computational load of the embedded processor.
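The relation between baseline and depth precision can be made explicit: for a rectified stereo pair with focal length f (in pixels), baseline b and measured disparity d, the depth of a point and its sensitivity to a disparity error Δd are

Z = \frac{f \, b}{d}, \qquad \Delta Z \approx \frac{Z^2}{f \, b} \, \Delta d,

so the depth error grows quadratically with distance and shrinks as the baseline widens, which is the trade-off described above.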
RGB-D Cameras These cameras have the advantage of providing pixel depth directly. They are able to supply the 3D structure of the environment, in addition to the texture information, in real time. Moreover, the scale is known because the acquired information is available in metric space. This is accomplished by using technologies such as stereo imaging, structured light or Time-of-Flight (ToF). RGB-D cameras can be described as the most complete visual sensor because they combine multiple vision properties (monocular and stereoscopic), and they add active support from light projectors [86]. They are also compact and can integrate an IMU as well. However, due to their limited range and their sensitivity to sunlight, they are mostly used for indoor navigation.

Event Cameras Event-based cameras, also known as dynamic vision sensors (DVSs), are novel bio-inspired sensors that provide an asynchronous stream of "events", with microsecond resolution, upon changes in light intensity at each individual pixel. These streams are recorded continuously with their positions, timestamps, and polarities. Therefore, there is no notion of "frame" as in conventional cameras. Furthermore, compared to conventional image sensors, they offer significant advantages: high temporal resolution, high dynamic range, no motion blur, and much lower latency, bandwidth and power consumption [58, 123, 137]. However, event-camera-based SLAM is still in its early stages of research.

2.4 Map Types

In [130], IEEE released a standard for the 2D map representation of environments. However, 3D maps require further
exploration since they present additional challenges in terms of capacity, performance and robustness, particularly in dynamic, large-scale and open environments. The most commonly used map representations are metric maps, topological maps and semantic maps. It should be noted that metric maps have the highest level of geometric information, while semantic maps have the highest level of abstraction. Topological and hybrid maps can achieve a certain balance between geometric information and abstraction.

2.4.1 Metric Maps

This category of maps focuses on the geometric properties of the environment. The more precise the metric map is, the more accurately the vehicle can localize itself. Unfortunately, precision comes at the cost of memory and computation. Hence, a trade-off has to be reached to maintain performance in terms of accuracy, quality, execution time, memory management and power consumption.

Landmark-Based Maps In order to meet limited storage and computational resources, landmark-based maps were introduced. This kind of map represents the environment as a set of 3D landmarks that have well-defined features and corresponding spatial coordinates. This discretization of the environment reduces the data volume of the obtained description, hence the name "sparse representation" (see the illustration of Fig. 2). SLAM systems such as [26, 86] use and build landmark-based maps under the assumption that the environment has distinctive features; otherwise, the algorithm will fail.

Grid Maps The most common scheme for 3D grid maps is OctoMap, which is based on the octree [50]. The data structure is a tree where each node can be recursively subdivided into eight sub-nodes, as illustrated in Fig. 3. Leaf nodes of the octree store the occupancy probability of the corresponding spatial cell. On one hand, this simplified representation facilitates merging information from different sensors by taking into account each sensor's reliability. On the other hand, the data volume of 3D grid maps grows at a much higher rate than the scale of the mapped region. One possible solution to this problem is adaptive resolution, where only interesting areas profit from increased resolution.
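As a rough sketch of the underlying data structure (simplified for illustration, not the actual OctoMap implementation; the resolution and log-odds increments are arbitrary), an octree node covering a cubic volume can be subdivided on demand and store occupancy at its leaves:

```python
class OctreeNode:
    """Minimal octree for occupancy mapping: each node covers a cubic cell
    that is split into eight children on demand until a target resolution
    is reached; leaves store a log-odds occupancy value."""

    def __init__(self, center, size):
        self.center, self.size = center, size
        self.children = None              # None -> currently a leaf
        self.log_odds = 0.0               # 0.0 corresponds to p(occupied) = 0.5

    def _child_index(self, p):
        cx, cy, cz = self.center
        return (p[0] > cx) + 2 * (p[1] > cy) + 4 * (p[2] > cz)

    def insert(self, p, occupied, resolution=0.2, hit=0.85, miss=-0.4):
        if self.size <= resolution:       # reached leaf resolution: update it
            self.log_odds += hit if occupied else miss
            return
        if self.children is None:         # subdivide this cell on demand
            q = self.size / 4.0
            self.children = [
                OctreeNode((self.center[0] + (q if i & 1 else -q),
                            self.center[1] + (q if i & 2 else -q),
                            self.center[2] + (q if i & 4 else -q)),
                           self.size / 2.0)
                for i in range(8)
            ]
        self.children[self._child_index(p)].insert(p, occupied, resolution, hit, miss)


# Insert a few LiDAR-like hits into a 51.2 m cube centred on the origin.
root = OctreeNode((0.0, 0.0, 0.0), 51.2)
for pt in [(1.0, 2.0, 0.3), (1.05, 2.0, 0.3), (-4.0, 0.5, 1.2)]:
    root.insert(pt, occupied=True)
```

In OctoMap, inner nodes additionally aggregate the occupancy of their children so that queries can stop at a coarser depth, which is what enables the adaptive resolution mentioned above.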
Point Cloud-Based Maps This dense representation of the environment is based directly on the 3D point cloud obtained from the sensor (see Fig. 4). SLAM systems using such techniques [47, 88] provide rich environment information and present an easier mapping process. However, data is redundant in such representations, and the data volume becomes impractical in large-scale scenarios. In modern SLAM solutions, this map representation can be carried out only on local areas, or offline for global maps, in order to respect the real-time constraints of map building.

2.4.2 Topological Maps

A topological map can be seen as a graph where a node represents a certain place and the edges represent the paths between places. Each time the vehicle reaches a new area, a new node is added. This representation is highly efficient and compact. However, this solution is not easily scalable to describe large environments. Indeed, such
minimalist descriptions can lead to erroneous localization. Hybrid maps were built on topological maps by associating metric information [13] or semantic information [9] to local spaces. An example of a hybrid (metric-topological) map is shown in Fig. 5.

2.4.3 Semantic Maps

In addition to spatial information, semantic maps provide a higher level of abstraction than the other map representations. In fact, this kind of map can facilitate more complex tasks for autonomous vehicles, such as planning and navigation, by combining the semantic and geometric information [32, 126]. Semantic elements can be either background classes (e.g., ground, vegetation, wall) or object classes (e.g., truck, person, building). Each semantic element can have several attributes and sub-elements (e.g., a building can be of a certain color and have a certain number of doors and windows). Therefore, the relations in semantic maps are hierarchical and need large computational resources. Figure 6 shows an example of a semantic map with each class label represented by a different color. One of the main challenges that remain is the estimation of more fine-grained semantic information, such as lane structure or road type [20].

Fig. 4 Point cloud map generated by LeGO-LOAM [108]

2.5 Why Choose LiDAR and Camera

To develop an ultimate SLAM for automotive applications, the system must be equipped with sensors that can provide abundant information about the surrounding space, all while being able to handle highly dynamic environments and large scale. The global framework has to be capable of obtaining information from multiple streams of data to ensure reliability and redundancy and to approach a fail-safe system. Therefore, using an RGB-D camera is a good choice due to its capacity to inherit from monocular and stereoscopic vision systems [47]. In fact, its properties are well suited to SLAM for automotive applications since it generates depth maps in addition to its stereo vision capabilities. The direct and rich measurements of the environment produced by this sensor can provide both the localization and the map. However, to deal with large scale, a sensor with long range is needed. This task can be achieved by using a LiDAR. Unfortunately, in highly dynamic scenarios exteroceptive sensors are no longer able to provide the localization and are limited by the data processing rate. Consequently, this information can be retrieved by integrating an additional IMU into the system [7, 96, 123].

Indeed, cameras are inexpensive and energy-efficient, but they suffer from environmental variations due to several sensitivities, such as low lighting conditions, dramatic changes in light intensity or spurious detections. These sensors are also affected by other factors such as the lack of overlap between successive frames, texture-less images, motion blur, etc. Additionally, cameras remain limited in range. Moreover, they can only provide a 2D representation of the world, which induces information loss and requires additional computing power to map the world in 3D or to track and classify objects [1].

In the case of radars, the range can be as short as a few meters or as long as a few hundred meters [49], although they are less effective at very short distances. Their major drawback, on the other hand, is their poor resolution and
Fig. 5 Example of metric-topological map [13]

their spurious detections, which make them less accurate than cameras and LiDARs. Besides, they cannot distinguish different targets or colors, which are essential in modern SLAM applications where tracking and classifying dynamic objects is crucial [104]. Finally, radars are still not accurate enough to provide a reliable 3D representation of the environment; therefore, their coupling with visual solutions is more challenging [49].

Alternatively, LiDARs are commonly used in autonomous vehicles because of their wide field of view and long range, plus their ability to provide direct access to 3D information. Moreover, in comparison with cameras, LiDARs have the advantage of measuring distances, with the associated noise, independently of the distance and the lighting conditions.

In addition to all these reasons, only camera and LiDAR sensors are able to produce all the map types presented above. Therefore, implementing a solution that uses a high-volume map locally (e.g., a semantic map) and a low-volume map globally (e.g., a topological map) can be easier using these two modalities. Authors have coupled other modalities, such as camera-inertial SLAM in [96, 107, 123], LiDAR-inertial SLAM in [7, 109], LiDAR SLAM aided by wheel odometers in [18] or a fusion of GNSS and SLAM systems in [3, 41]. However, we believe that using only exteroceptive sensors (camera and LiDAR) is sufficient to provide an accurate localization and a reliable map representation while avoiding the drift or masked areas introduced by proprioceptive sensors.

Fig. 6 Semantic map generated using LiDAR scans. Each color represents a class label [20]

Finally, each sensor has its strengths and weaknesses, as summarized in Table 1. Therefore, using multiple modalities to perceive the environment is a necessity to increase the reliability and accuracy (in terms of translational and rotational localization errors) of the system and to decrease its faultiness [42, 135, 139]. Beyond all the advantages mentioned above, cameras have the lead because they have been widely studied in recent years, they can provide large amounts of information about the environment and their low power consumption makes them more suited for embedded systems. Furthermore, LiDARs have other strong points, such as high accuracy, a high refresh rate and the ability to detect a high level of detail [48, 108, 134]. For all these reasons, the combination of these two sensors is the main technological choice for many modern SLAM systems.
Table 1 Strengths and weaknesses of the commonly used sensors and characteristics of the corresponding SLAM methods

| Sensor | Advantages | Disadvantages | Output size (Mbps) | Power consumption (Watts) | Method | References | Map scale / Density | Dynamic environment | Translational error |
| Radar | 1. Applicable for all-weather; 2. Long range | 1. Impractical for static objects; 2. False alarms | 0.1 to 5 | 7 - 50 | Feature | [31, 49, 104] | Medium / Sparse | Low | ~1 m |
| LiDAR | 1. Long range; 2. Brightness independent | 1. Limited by bad weather; 2. High cost | 2 to 20 | 8 - 60 | Direct | [8, 20, 29] | Large / Dense | Medium | Large: ~5 m; Small: ~4 cm |
| | | | | | Feature | [48, 54, 108, 134] | Large - Small / Dense | Low | Large: ~5 m; Small: ~4 cm |
| | | | | | DL | [32, 66, 78, 122, 126] | Large / Dense | Medium - High | ~3 m |
| Camera | 1. Easy object identification; 2. Abundant information; 3. Low cost | 1. Limited by bad weather; 2. Luminosity dependent; 3. Bad depth estimation | 60 to 400 | 0.1 - 13 | Feature | [17, 22, 25, 61, 72, 84, 86, 92, 93] | Large - Small / Sparse | Medium | Large: ~10 m; Small: ~15 cm |
| | | | | | Direct | [21, 34-36, 91, 125] | Small - Medium / Dense | Low | Medium: ~3 m; Small: ~10 cm |
| | | | | | RGBD | [47, 51, 65, 88, 96, 102] | Small / Dense | Low | ~12 cm |
| | | | | | Event | [58, 123, 137] | Small / Sparse | Low | ~15 cm |
| | | | | | DL | [5, 59, 80, 117, 129] | Small / Dense | Low | ~10 cm |
The result of the back-end is either a filter estimation, in the case of filtering methods, or a pose/map refinement, in the case of pose graph optimization methods.

3.1 Front-End
In feature-based approaches, the goal is to select areas in the acquired image that potentially disclose real-world qualities. Historically, features were defined in order to minimize the computational complexity generated by handling each pixel. Many feature extractors and descriptors have been developed, each with its own limitations and performance in terms of processing speed and rotation, scale and illumination invariance. Therefore, selecting a suitable feature detector depends on the targeted environment (aerial, terrestrial or aquatic) and on the vehicle's computational power, which ultimately creates a trade-off between computational speed and feature quality. More generally, features can be classified into three main categories: low-level features like geometric primitives (points, corners, edges, lines), middle-level features such as surface planes, and high-level features such as objects with semantic labels. In camera-based SLAM, there are several detectors and descriptors that can elicit and represent salient locations in an image, also known as keypoints. These features should be recognizable in the scene and unaffected by changes in viewpoint or illumination, in addition to being resilient to blur and noise. For instance, such features can be ORB (Oriented FAST and Rotated BRIEF) features as in ORB-SLAM2 [86], or simply blobs and corners as in SOFT-SLAM [22].

LiDAR Data Processing In the case of a 3D LiDAR, processing the point cloud requires one of two methods: inferring geometric information from the 3D discrete and unsorted points, or direct 3D point cloud registration. In the first case, geometric primitives can be categorized into shape primitives (lines, planes, cubes, etc.) or structure primitives (skeletons, 2D outlines, 3D edges, etc.). The geometric primitives (also called features) represent recognizable areas that are consistent in time and space. The extraction of these primitives can be performed using multiple methods such as segmentation, clustering, energy-based optimization, etc. Depending on the parameter tuning, these methods can be sensitive to noise and density, or time consuming. For example, in [134], the authors of LOAM used a term to evaluate the smoothness of a local surface, allowing them to classify the points located on sharp edges and on planar surface patches. In the case of direct 3D point cloud registration, the main goal is to register these point clouds consecutively in order to align them in the same coordinate frame. This method can either combine incoming scans with portions of a pre-built point cloud map, yielding the localization of the vehicle, or combine successive LiDAR scans in order to calculate the odometry of the vehicle. The strategy used in IMLS-SLAM [29] relies on removing dynamic objects and down-sampling the scan before optimizing the transformation using a scan-to-model strategy. Alternatively, in [78] the authors designed a deep learning framework that takes as input the online LiDAR point cloud and a previously built 3D point cloud map, and outputs a centimeter-level pose estimation.
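For reference, the local-surface smoothness term used in LOAM [134], mentioned above, can be written (with notation slightly adapted from the original paper) as

c = \frac{1}{|S| \, \lVert X_i \rVert} \left\lVert \sum_{j \in S,\; j \neq i} \left( X_i - X_j \right) \right\rVert

where X_i is a point of the current scan and S is the set of its neighbors in the same scan line; points with the largest c are selected as edge features and points with the smallest c as planar features.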
3.1.2 Data Association

After the identification of points that can be matched across multiple frames in feature-based approaches, or simply the acquisition and matching of the raw dense information in direct approaches, the data association module establishes measurement correspondences across different frames, that is, it associates each measurement with a specific landmark (3D point) in the environment. In other words, the set of measurements is linked to a subset of the unknown variables. These correlations are then used as inputs to other modules. The most important conditions for this module are the repeatability and the uniqueness of the key points, whether they are detectable features or simple pixel measurements. These conditions ensure successful tracking and increase certainty.

Direct Methods These are categorized into either dense or semi-dense methods. While dense methods utilize each and every pixel in a frame, semi-dense methods use only pixels with a brightness gradient exceeding a predetermined threshold. The fundamental concept of all direct methods is the brightness consistency constraint, which is based on the assumption that the pixel corresponding to a 3D point of the scene observed in an image acquired at time t_i will have the same intensity when observed in a future image acquired at time t_j. The matching problem is then solved by iteratively minimizing the squared pixel-intensity difference between the current image and the transformed preceding image. This non-linear optimization is a computationally heavy task that depends both on the number of parameters in the considered transformation and on the number of pixels in each image.
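Schematically (the exact formulation varies between systems), the quantity minimized over the relative pose \xi, and, when they are unknown, over the inverse depths d_p of the pixels, is the photometric error

E(\xi) = \sum_{p \in \mathcal{P}} \big( I_j( w(p, d_p, \xi) ) - I_i(p) \big)^2

where \mathcal{P} is the set of selected pixels (all pixels for dense methods, high-gradient pixels for semi-dense ones) and w(\cdot) warps a pixel of image I_i into image I_j; in practice, robust norms and affine brightness corrections are often added.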
Feature-Based Methods Once the features are selected, establishing correspondences for low-level features can be achieved by matching features in two images (2D-2D), by matching a 3D point in the world frame with its projection onto the image frame (3D-2D), or by matching a 3D point in the world frame with the reconstructed map (3D-3D). In the first case (2D-2D matching), the association is carried out by searching within a window surrounding the feature's location in the second image. As a result of epipolar geometry, this search is reduced to one dimension (along a line), as long as the camera pose is successfully estimated (the transformation between the two images is well known). Given the type of the chosen feature descriptor, various quantities exist to evaluate their similarity, such as the L1/L2 norms or the Hamming distance. However, such measures can reduce the real-time performance of the system
because they are time-consuming. To tackle this issue, special approaches are introduced to accelerate the search, such as KD-trees [83] or Bag-of-Words [38]. In the second case (3D-2D matching), the association is established by deducing the current camera pose from the previous poses and then projecting the 3D landmark onto the corresponding image frame. The 2D projection is then matched to the 2D features using the aforementioned 2D-2D matching technique. This type of matching is implemented when the camera pose has to be estimated from the structure of the environment. In the third case (3D-3D matching), the main purpose is to calculate and rectify the cumulative error along the path traveled. This technique, where 3D landmarks of both ends of the loop are matched in order to estimate the transformation parameters between the two ends, allows loop closure in the back-end block, which ultimately leads to a drift-free trajectory. Figure 8 shows a comparison of trajectories with and without loop closure (i.e., with and without drift), and Fig. 9 shows an illustration of the matching techniques for low-level features.

Associating middle-level features, like planes or any image regions of homogeneous texture, is accomplished by constraining the distance between the feature in the current frame and the previous ones in the constructed map to be less than a particular threshold. Otherwise, the feature is added as a new landmark to the map. The same technique is used to associate high-level features, as described in the work of Bowman et al. in [15].
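To illustrate the descriptor-based 2D-2D association step with a Hamming distance, the sketch below performs brute-force matching of binary (ORB/BRIEF-like) descriptors with a ratio test; real systems replace the exhaustive search with the KD-tree or Bag-of-Words indexing mentioned above, and the array shapes and threshold used here are illustrative:

```python
import numpy as np

def match_binary_descriptors(desc_a, desc_b, ratio=0.8):
    """Brute-force matching of binary descriptors with a Hamming distance
    and a ratio test. desc_a, desc_b: uint8 arrays of shape (N, 32) and
    (M, 32), i.e., 256-bit descriptors. Returns (index_in_a, index_in_b) pairs."""
    # Hamming distance = number of differing bits between descriptor pairs
    xor = np.bitwise_xor(desc_a[:, None, :], desc_b[None, :, :])    # (N, M, 32)
    dist = np.unpackbits(xor, axis=2).sum(axis=2)                   # (N, M) bit counts
    matches = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])
        best, second = order[0], order[1]
        # Keep the match only if it is clearly better than the runner-up
        if dist[i, best] < ratio * dist[i, second]:
            matches.append((i, int(best)))
    return matches

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(50, 32), dtype=np.uint8)
b = a.copy()                      # pretend the same keypoints were re-detected
print(len(match_binary_descriptors(a, b)))
```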
Deep Learning Methods The capabilities that Artificial Intelligence can provide to autonomous vehicles are well recognized and, as a result, it has been the subject of topical research in recent years. It has been widely used due to its ability to continuously improve the performance of some classical algorithms. Indeed, robust, real-time detection of obstacles is a critical requirement from a safety standpoint. The traditional feature extraction algorithms that classical camera-based SLAM systems rely on have difficulty dealing with texture-less regions and other complex scenes, which limits the development of camera-based SLAM. Studies of feature point extraction adopting deep learning show that this method has more advantages than traditional methods in dealing with complex scenes. Recently proposed neural networks have also demonstrated great success in estimating 6-DoF poses [90] or in pose graph optimization [68]. When compared to hand-designed methods for the same tasks, these methods are strong on robustness, since they will always make predictions that are similar to the real scenarios present in their training data. But designed methods still often have advantages in flexibility in a range of unforeseen scenarios, and in final accuracy, due to the use of precise iterative optimization.

3.1.3 Map/Trajectory Initialization

Before starting to accurately estimate the pose and the map, a definition of the coordinate system and of the unknown environment is required. It is then the final task of the front-end block to provide an initial guess for the estimated variables to be optimized in the next modules. Usually, this is accomplished by computing the spatial transformation between successive frames to calculate an initial set of map points. In ORB-SLAM [85] this step is executed by computing two geometric models in parallel: a homography assuming a planar scene and a fundamental matrix assuming a non-planar scene. The criterion used to choose the appropriate model is the symmetric transfer error [4]. When the localization process loses track, or when insufficient features are detected in subsequent frames, the initialization process is resumed.

Other SLAM algorithms such as LSD-SLAM [35], DSO [34] or DPPTAM [21] use a random initialization process taking into account only one frame, where pixels with high intensity gradients are given random depth values with an associated large variance. Consequently, many frames need to be processed before converging to a solution with minimum depth variance.
3.2 Back-End

3.2.1 Filtering Methods

Filter-based methods derive from Bayesian filtering and work as two-step iterative processes that aim to compute the probability distribution given by Eq. 1 for all times k:

P(x_k, m \mid Z_{0:k}, U_{0:k}, x_0)   (1)

The Bayes theorem is used to determine the joint posterior based on the controls u_k and the observations z_k. A recursive method that starts with an estimate of the distribution P(x_{k-1}, m \mid Z_{0:k-1}, U_{0:k-1}) at time k - 1 can solve Eq. 1. The state transition is considered under the assumption that it is a Markov process, where the next state x_k is only affected by the previous state x_{k-1} and the exerted control u_k, and is unaffected by the observations or the map. This prediction can also be expressed in terms of the state x_k and the map m at time k, based on all observations Z_{0:k} and all control inputs U_{0:k}.

Specifically, the recursive procedure for calculating the joint posterior (quantified by Eq. 1) is given by Eqs. 2 and 3, where the recursion is determined by combining the motion model P(x_k \mid x_{k-1}, u_k) and the observation model P(z_k \mid x_k, m).

Prediction (time update):

P(x_k, m \mid Z_{0:k-1}, U_{0:k}, x_0) = \int P(x_k \mid x_{k-1}, u_k) \, P(x_{k-1}, m \mid Z_{0:k-1}, U_{0:k-1}, x_0) \, dx_{k-1}   (2)

Correction (measurement update):

P(x_k, m \mid Z_{0:k}, U_{0:k}, x_0) = \frac{P(z_k \mid x_k, m) \, P(x_k, m \mid Z_{0:k-1}, U_{0:k}, x_0)}{P(z_k \mid Z_{0:k-1}, U_{0:k})}   (3)

The first branch of the filter-based methods concerns derivatives of the Kalman Filter (KF), such as the Extended Kalman Filter (EKF) [96], the Adaptive Extended Kalman Filter (AEKF) [121] or the Information Filter (IF) [73]. This method decreases the uncertainties and provides an estimation at each stage of the process. However, it is extremely susceptible to erroneous data association. Moreover, updating all landmarks and the associated covariance matrix on a regular basis necessitates a significant amount of computational power.
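The prediction and correction steps of Eqs. 2 and 3 take a particularly simple algebraic form in the Kalman family. The sketch below is a deliberately minimal, linear toy instance (a 2D position tracked with a known velocity and a direct position measurement), not a full EKF-SLAM: in EKF-SLAM the state vector is augmented with every landmark, so the covariance P grows quadratically with the map size, which is the computational burden mentioned above.

```python
import numpy as np

def ekf_predict(mu, P, f, F, Q):
    """Time update (Eq. 2): propagate the mean through the motion model f
    and the covariance through its Jacobian F, adding process noise Q."""
    return f(mu), F @ P @ F.T + Q

def ekf_update(mu, P, z, h, H, R):
    """Measurement update (Eq. 3): correct the prediction with the
    observation z using the observation model h and its Jacobian H."""
    y = z - h(mu)                              # innovation
    S = H @ P @ H.T + R                        # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    return mu + K @ y, (np.eye(len(mu)) - K @ H) @ P

# Toy instance: a planar position moving with a known constant velocity,
# observed by a noisy absolute position sensor (data association is trivial).
dt, v = 0.1, np.array([1.0, 0.5])
mu, P = np.zeros(2), np.eye(2)
Q, R = 0.01 * np.eye(2), 0.25 * np.eye(2)
F = H = np.eye(2)
rng = np.random.default_rng(1)
true_pos = np.zeros(2)
for _ in range(20):
    true_pos = true_pos + v * dt
    mu, P = ekf_predict(mu, P, lambda x: x + v * dt, F, Q)
    z = true_pos + rng.normal(0.0, 0.5, 2)
    mu, P = ekf_update(mu, P, z, lambda x: x, H, R)
print(mu, true_pos, np.diag(P))
```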
The second main branch of filtering SLAM algorithms is based on Particle Filters (PF). The underlying theory relies on a sample of particles to represent the state given its probability density. Then, as with every filter, a prediction of each particle's displacement is performed and an update is conducted based on the observation. During the update phase, particles are weighted according to their probability in relation to the measurements. The most likely particles are retained, while the others are discarded and new ones are created [44]. The particle filter has become widely used, and a large range of approaches based on it achieve excellent estimation precision and resilience.

Overall, filter-based SLAM methods are very dependent on their ability to successfully handle uncertainties and on the consistency of the probabilistic filter. However, their convergence becomes slower in higher-dimensional environments. This approach has been deprecated in recent years in favor of the emerging graph-based and deep learning methods.

3.2.2 Pose Graph Optimization Methods

The majority of optimization-oriented SLAM techniques include two subsystems. The first one uses sensor data by finding matches between newly added observations and the
map in order to identify the problem's constraints. The second subsystem calculates or refines the vehicle's pose (and previous poses) as well as the map, given the constraints, to obtain a consistent whole.

Suppose we want to estimate an unknown variable χ that depends on the vehicle's trajectory and the positions of the landmarks in the environment. We are given a set of measurements Z = {z_k : k = 1, ..., m} such that each measurement can be written as a function of χ, i.e., z_k = h_k(χ_k) + ε_k, where χ_k ⊆ χ is a subset of the variables, h_k(·) is a known function (the measurement or observation model) and ε_k represents a random noise. In Maximum A Posteriori (MAP) estimation, we estimate χ by computing the assignment of variables χ* that attains the maximum of the posterior p(χ|Z), corresponding to the belief over χ given the measurements:

\chi^* = \operatorname{argmax}_{\chi} \, p(\chi \mid Z) = \operatorname{argmax}_{\chi} \, p(Z \mid \chi) \, p(\chi)   (4)

The equality follows from the Bayes theorem. In Eq. 4, p(Z|χ) is the likelihood of the measurements Z given the assignment χ, and p(χ) is a prior probability over χ. Assuming that the measurements are independent (the corresponding noises are uncorrelated), Eq. 4 factorizes into:

\chi^* = \operatorname{argmax}_{\chi} \, p(\chi) \prod_{k=1}^{m} p(z_k \mid \chi) = \operatorname{argmax}_{\chi} \, p(\chi) \prod_{k=1}^{m} p(z_k \mid \chi_k)   (5)

The problem described by Eq. 5 can be interpreted in terms of inference over a factor graph, where the variables correspond to the nodes. The terms p(z_k|χ_k) and the prior p(χ) are called factors, and they encode probabilistic constraints over a subset of nodes. A factor graph is a graphical model that encodes the dependence between the k-th factor, its measurement z_k and the corresponding variables χ_k.
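A standard step, not spelled out above, makes the link with the optimization machinery explicit: assuming zero-mean Gaussian noise \varepsilon_k \sim \mathcal{N}(0, \Sigma_k) and taking the negative logarithm of Eq. 5, the MAP estimate becomes a nonlinear least-squares problem,

\chi^* = \operatorname{argmin}_{\chi} \sum_{k=1}^{m} \lVert h_k(\chi_k) - z_k \rVert^2_{\Sigma_k},

where \lVert e \rVert^2_{\Sigma} = e^{\top} \Sigma^{-1} e is the squared Mahalanobis distance (a prior term can be included as an additional factor); this is the form solved iteratively by the Gauss-Newton and Levenberg-Marquardt methods used in the solvers cited below.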
Like filter-based methods, we can divide optimization methods into two main branches: Bundle Adjustment and graph-SLAM. These two branches stem from a MAP formulation. However, graph-SLAM accommodates a broad variety of sensor models and does not constrain the factors in Eq. 5 to model projective geometry as in Bundle Adjustment. Moreover, graph-SLAM needs to be solved incrementally, taking into account new measurements as they become available at each step while the vehicle moves.

Fast linear solvers have been designed to update the estimate of χ as new observations are acquired (e.g., GTSAM [28], g2o [64], Ceres [2], iSAM [56], SLAM++ [94]). They can solve problems with tens of thousands of variables in a matter of a few seconds.

Bundle Adjustment is a technique that jointly optimizes the 3D structure and the pose. The majority of early research concentrated on 3D reconstruction, but it has since been applied to SLAM. The fundamental concept is to optimize an objective function, commonly using the Levenberg-Marquardt algorithm [93]. The latter reduces the reprojection error (the gap between the observations in the image and the reprojected previous features), resulting in the optimal vehicle and landmark positions. However, because it examines all of the variables at once, the Bundle Adjustment technique is computationally intensive.
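As a generic statement of that objective (the symbols T_i, X_j, u_{ij} and \pi are introduced here for illustration, not taken from a specific system), Bundle Adjustment minimizes the total reprojection error over the camera poses T_i and the landmark positions X_j:

\min_{\{T_i\},\{X_j\}} \; \sum_{(i,j)} \big\lVert \pi(T_i, X_j) - u_{ij} \big\rVert^2

where u_{ij} is the observed pixel of landmark j in image i and \pi(\cdot) projects a 3D point into the image plane; the Levenberg-Marquardt iterations mentioned above operate on this sum.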
The graphical representation of Bayesian SLAM is particularly well adapted to a resolution via optimization methods. Based on this graphical representation, a matrix describing the relationships between landmarks and vehicle poses can easily be built and used in an optimization framework.

Generally, pose-graph-based methods are highly coupled with the performance of the front-end. Additionally, they are more memory-efficient and conventionally more accurate than filtering methods [116]. Therefore, they are the state of the art in current back-end SLAM systems. We refer the reader to [43] for a tutorial on graph-based SLAM.

4 Mono-modal SLAM Algorithms

After establishing the importance of using camera and LiDAR in modern SLAM applications such as autonomous vehicles, as well as the theoretical background of SLAM, this section presents the state of the art of mono-modal SLAM algorithms, that is, frameworks that use either a camera or a LiDAR. In the case of camera-based SLAM, the classification is carried out according to the methodological approach used and the sensor family. For LiDAR-based SLAM, the classification is conducted following the global strategy. At the end of each mono-modal SLAM category, a comparison and a discussion of open problems are presented.

4.1 Camera-based SLAM

4.1.1 Research Status

Historically, camera SLAM was the first developed SLAM algorithm and it has since been a very dynamic subject in terms of research and contributions. Moreover, existing
camera sensors provide a large range of possibilities, both in the quantity and in the quality of relevant data to be processed. Indeed, they are usually less expensive than other sensors, all while being easier to use and more compact. Table 2 presents a summary of the most representative camera-based SLAM systems that can be applied in the autonomous ground vehicle context, showing the main techniques used in recent years. Previous works, such as [37, 119], presented detailed surveys on camera SLAM and summarized its technical categories as feature-based methods, direct methods and RGB-D camera-based approaches. However, recent technical achievements suggest adding event-camera [99] and deep learning based approaches to the classification [129].

Feature-Based Methods Usually seen as the classical technique, the key idea behind feature-based methods is to extract and track point features in the surroundings of the vehicle, while using geometric reprojection constraints during estimation. This can be executed using one of two existing methods: a filter or the Bundle Adjustment (BA) technique. In the first approach, the past poses are marginalized and the obtained knowledge is represented with a probability distribution; this technique was first used by MonoSLAM [24, 25] using an Extended Kalman Filter. This technique's accuracy proved to be limited when used in large-scale environments due to the limited number of features in the constructed map. In order to address this limitation, PTAM [60, 61] adopted the BA approach, where only a subset of the past frames is selected for processing. This subset is called keyframes and contains a high number of features. For a more complete comparison of the two techniques, one can refer to [116]. PTAM was also the first algorithm to propose the idea of parallel threads for tracking and mapping. ORB-SLAM [85] built on the success of this framework in small environments to propose a more stable algorithm using three threads for tracking, mapping and loop closing. Using ORB features and Bag-of-Words [38], this technique is able to efficiently prevent cumulative errors; it has been extended to stereo and RGB-D cameras in ORB-SLAM2 [86]. ORB-SLAM3 [17] has recently increased the algorithm's robustness and accuracy by using inertial information. In ENFT [131], the authors proposed a novel SIFT-based method suggesting non-consecutive feature tracking to match tracks over one or multiple sequences. Recent works like TagSLAM [92] and UcoSLAM [84] consider adding fiducial markers as artificial landmarks placed in the environment to achieve long-term robust tracking and relocalization.

Direct Methods This family of methods does not need to extract features; it instead solves the pose problem by using raw pixel intensities, which results in including a higher percentage of the available information. In 2011, Newcombe et al. proposed DTAM [88], one of the first direct methods; it computes a dense depth map by minimizing a global, spatially regularized energy function. Engel et al. proposed LSD-SLAM [35], using regions of sufficiently large intensity gradient and Lie algebra for depth reconstruction and pose tracking. This work was expanded in 2017 into DSO [34], where a sliding-window optimization algorithm is used to exclude old camera poses and points that leave the field of view of the camera. To ensure high accuracy and robustness of the motion estimation, and unlike LSD-SLAM, this model includes a complete photometric calibration that takes into account non-linear response functions, lens vignetting as well as exposure duration. In SVO [36], a hybrid approach is proposed, combining features with a direct method: some keypoints (such as corners) are tracked and then, using the information around the keypoints, the camera's velocity and location are estimated. This method is fast because it does not require extracting features in every frame, but it adds special constraints and optimizations. Furthermore, like DSO, it does not include loop closure detection. Unlike classical direct methods based on photometric error minimization, NID-SLAM [91] used the Normalised Information Distance (NID). This metric is more resistant to changes in the scene's appearance caused by lighting, weather, and structural variations. Its tracking approach incorporates depth uncertainties into the estimated pose solution. Finally, long-term accuracy is achieved by using the NID for depth map updates.

RGB-D Camera Methods KinectFusion [87] was the first method to use an RGB-D camera to construct dense 3D depth maps for tracking and mapping. The algorithm used a Truncated Signed Distance Function (TSDF) to determine the surface of the scene that can be viewed from the current perspective, by performing a global fusion of all depth maps in the volume and a ray projection technique. To cut down on the hefty computational cost, [102] proposed to unify coplanar points to decrease the amount of data processed using a GPU. In the same year, Henry et al. introduced RGB-D mapping [47] for indoor environments. The algorithm used SIFT features from the color images and the corresponding depth information. Matching of the 3D feature points was performed using the RANSAC method, which acts as an initialization for the Iterative Closest Point (ICP), a local optimizer used to achieve a more accurate pose. Binding visual and depth information together, this method is able to provide view-based loop closure detection. More recent research has focused on combining RGB-D cameras with other sensors. Laidlow et al. proposed in [65] the first tightly-coupled dense RGB-D-inertial SLAM system. The algorithm minimizes a combined photometric, geometric and inertial function in the tracking
Table 2 Summary of the most representative Camera-based SLAM systems, in chronological order. RTE is the Relative Translational Error as used in [39]

| References | Year | Method | Hardware architecture | Output rate | RTE | Loop closure | Back-end optimization | Map representation |
| D3VO [129] | 2020 | Deep learning-based | N/A | N/A | 0.7% | ✗ | – | Metric map (point cloud map) |
| SOFT-SLAM [22] | 2018 | Feature-based | PC with quad-core processor @2.4GHz | @20Hz | 0.81% | ✓ | Pose-graph optimization | Octomap |
| ORB-SLAM2 [86] | 2017 | Feature-based | PC with quad-core processor @2.4GHz and 8GB RAM | Odometry @25Hz-30Hz and Mapping @5Hz | 0.72% | ✓ | Bundle Adjustment | Metric map (landmark-based map) |
| Stereo DSO [125] | 2017 | Direct | N/A | N/A | 0.84% | ✗ | – | Metric map (landmark-based map) |
| NID-SLAM [91] | 2017 | Direct | PC with quad-core processor and AMD Radeon R9 295X2 GPU @10GHz and 8GB RAM | Odometry @10Hz and Mapping @1Hz | 0.7% | ✓ | Pose-graph optimization | Metric map (landmark-based map) |
| S-PTAM [93] | 2015 | Feature-based | PC with quad-core processor @2.20GHz and 8GB RAM | @18Hz | 1.19% | ✓ | Pose-graph optimization | Metric map (landmark-based map) |
step to concurrently estimate the camera pose and velocity. In the mapping step, it constructs a fully dense 3D reconstruction of the environment. Globally, the RGB-D-inertial system outperforms the RGB-D-only version, especially under aggressive motions and in scenarios with low photometric and geometric variations. Alternatively, to decrease the global drift caused by inertial measurement noise and biases, KO-Fusion [51] suggested using a tightly-coupled dense RGB-D SLAM system with kinematic and odometry measurements from a wheeled robot.

Event-Camera Methods SLAM is one of the typical scenarios where event cameras offer many advantages over other classical sensors. Their differential and asynchronous principle of operation suppresses temporal redundancy and emphasizes their ability to capture high-speed motions at a higher dynamic range (HDR). Therefore, event cameras show great promise for challenging scenarios such as HDR feature tracking and/or rapid movements. However, being a novel bio-inspired sensor, the main challenge remains the development of new algorithms capable of processing the unfamiliar stream of intensity changes, that is, the events, which will unlock this camera's full potential. Recent works, like [58], have addressed this challenge, inspired by successful SLAM systems that used parallel threads. This method uses three filters operating simultaneously to jointly estimate the motion of the event camera, the intensity gradients and the inverse depth of a keyframe. In order to design a more robust state estimation and propose an ultimate SLAM, [123] introduced the first hybrid pipeline in which standard frames, inertial measurements as well as events are all combined. When compared to using simply events and an IMU, this fusion showed a 130% increase in accuracy, and an 85% increase when using only conventional frames and an IMU. The quest for improving accuracy and efficiency continues with the exploration of stereo event-based SLAM [137]. In this method, mapping is achieved by optimizing an objective function intended to assess spatial and temporal consistency in stereo event streams, while tracking is performed by taking into account events and a local map to compute the pose of the stereo camera with respect to the map.

Deep Learning-Based Methods Nowadays, deep learning has become the mainstream technique used in computer vision and image recognition. By contrast to traditional methods, it can achieve higher accuracy and can easily be added to a SLAM framework as an external and/or inherent module to recognize loop closures, to estimate inter-frame motion or even to create semantic maps of the environment. Building on ORB-SLAM2, Sunderhauf et al. [117] proposed a semantic mapping system that reconstructs a 3D model of the environment including both identified objects and non-identified object structures. Their system uses a Convolutional Network to detect and classify objects. Thus, every object detected is assigned to a segment of 3D points by an unsupervised 3D segmentation algorithm. In contrast, Fusion++ [80] presented an online object-level SLAM system capable of using arbitrarily reconstructed objects to build a precise and consistent 3D graph map. They are then employed for tracking, loop closure detection and relocalization after being progressively corrected via depth fusion. However, the challenge remains to find a good balance between ensuring sufficient scene coverage and filtering detections. Apart from semantic SLAM, efforts have been made in [5] to tackle the problem of loop closure detection by proposing NetVLAD. Inspired by the Vector of Locally Aggregated Descriptors (VLAD) used in image representation, the authors designed a new Convolutional Neural Network architecture that is trained for place recognition. As this technology advances, modern research uses deep learning to govern the whole process of SLAM. The work of Liu et al. in [71] improved the performance of landmark-based methods for place recognition using CNNs. This approach suggested using feature maps from the low layers of a CNN to extract landmarks and compact landmark descriptors from high layers to perform quick matching. Finally, in the refinement step, Markov random fields (MRF), combined with Bayes filters, are exploited to perform robust matchings. The recently presented DeepSLAM [67] suggests using an unsupervised learning framework that implements several essential components, including Mapping-Net, Tracking-Net, Loop-Net and a graph optimization unit. Similarly, SimVODIS [59] is built on top of Mask-RCNN [46], which allows the design of networks that take advantage of rich features to estimate the pose, predict the depth map and detect objects. The system uses a single thread to complete all these tasks and manages to outperform or match the performance of the other state-of-the-art methods.
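To make the aggregation idea behind NetVLAD [5] concrete, the following is a minimal NumPy sketch of classic VLAD pooling only: local descriptors are assigned to visual-word centres and their residuals are summed and normalised into one global vector. It is not the learned, soft-assignment NetVLAD layer itself; the hard assignment, the cluster centres and the descriptor dimensions are illustrative assumptions of the sketch.

```python
import numpy as np

def vlad_descriptor(local_descriptors, centres):
    """Aggregate local image descriptors into a single global VLAD vector.

    local_descriptors: (N, D) local features, e.g. columns of a CNN feature map.
    centres:           (K, D) visual-word centres (k-means in classic VLAD,
                       learned and softly assigned in NetVLAD).
    """
    # Hard-assign each descriptor to its nearest centre.
    dists = np.linalg.norm(local_descriptors[:, None, :] - centres[None, :, :], axis=2)
    assignment = np.argmin(dists, axis=1)

    K, D = centres.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = local_descriptors[assignment == k]
        if len(members):
            # Sum of residuals between descriptors and their assigned centre.
            vlad[k] = (members - centres[k]).sum(axis=0)

    # Intra-normalisation per centre, then global L2 normalisation.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Place recognition then reduces to nearest-neighbour search between the VLAD
# vector of the query image and those of previously mapped keyframes.
```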
4.1.2 Open Problems

Although camera-based SLAM has been very well studied in recent decades and has achieved a lot of success, there are still many problems that remain unsolved due to hardware and/or software limitations. On one hand, even if approaches that use deep learning are successful, they are still limited by untrained cases and their learning behavior remains unpredictable. On the other hand, feature-based classical approaches are very texture- and light-dependent, while direct approaches require more photometric calibration and more computational power. Moreover, RGB-D methods are very range-limited, and event-camera techniques suffer from artifacts caused by event noise and low spatial resolution.

Robustness Achieving a more accurate camera pose estimation is conditioned by the consideration of the shutter type. While most camera-based SLAM frameworks assume a global shutter, most cameras use a rolling shutter due to its lower cost. Consequently, camera pose estimation is obtained by an interpolation approach, which limits the global accuracy. Moreover, these solutions are still prone to errors due to their multiple sensitivities, such as changing illumination conditions, highly dynamic environments, fast motions, strong rotations and low-texture environments. Even if each camera-based SLAM technique can be more resilient to some conditions, it remains fragile under other circumstances [16]. The question remains: how can we build the ideal SLAM system capable of adapting to various environments, or at least decrease the number of input parameters that must be tuned for the SLAM to operate in a given scenario?

Flexibility The majority of proposed frameworks are very sensor- and platform-dependent, causing the system to fail if the defined architecture changes. Such situations require using a modular and extensive set of sensors for fusion, or opting for a learning-based system that can extract higher semantic information independently of the sensor. In this hypothesis, the challenge becomes how to represent a unified map for the vehicle. One possible solution is to implement a cross-modal symbolism of the state and the environment, based on the employed sensing methods, that can efficiently manage data matching and storage [82]. Moreover, designing a system that can adapt the representation of the environment depending on the task, the complexity or the richness of the environment can solve problems such as memory management [50]. The automatic creation of an optimal representation will certainly facilitate long-term localization and mapping.

Scalability Modern SLAM systems still struggle with large-scale areas where the autonomous vehicle should operate for an extended duration and through a long and versatile trajectory. In fact, the memory needed to store previously discovered landmarks will grow exponentially. Even if we suppose GNSS localization to be always reliable, either outdoor or indoor, the transition from one paradigm to the other could still cause drift or additional noise. For long-term autonomy, more techniques need to be investigated in order to deal with time-varying maps. This problem is accentuated on severely resource-constrained platforms. More attempts should be made to suggest robust strategies that determine whether a piece of information should be discarded, updated or permanently remembered [16].

Hardware To a certain extent, the SLAM problem is often presented as a trade-off between real-time responsiveness and accuracy. Direct methods and some feature-based methods often take advantage of GPUs and FPGAs to speed up processing and to design low size and weight embedded SLAM systems. However, efforts still need to be made to reduce the computational cost, capitalizing on the recent advances in these processor technologies. An approach that can integrate, at design level, the strengths and weaknesses of the targeted platform will hopefully be an effective strategy to make use of domain-specific processors [1]. On another note, developing more intelligent sensors that bring the processing as close as possible to the sensing devices may also be an alternative solution to achieve hardware efficiency and concentrate efforts on the development of more reliable algorithms.

4.2 LiDAR SLAM

4.2.1 Research Status

Compared to camera SLAM, LiDAR SLAM is second in the amount of information in need of processing (see Table 1). It has become more and more common in many applications such as autonomous driving cars and unmanned aerial vehicles, mainly due to the quality of the data these sensors provide and their constant price drop across recent years. LiDAR SLAM can provide high-frequency range measurements with relatively small errors; it can track distances, capture full 360° 3D scans, extract key points belonging to edges and planes, and generate point clouds of great size. Multiple approaches have been suggested to perform motion estimation and mapping using LiDAR data. Table 3 presents a summary of the most representative LiDAR SLAM systems that can be applied in the autonomous ground vehicle context, showing the main techniques used in recent years. With the algorithms' main focus being the efficiency of their processing pipeline, LiDAR SLAM can be classified according to the granularity of the point cloud data used for estimation or according to the technical approach. In particular, the main categories are:

Registration-Based Methods By analogy with camera SLAM, these methods can be seen as "direct" methods, because they make use of all the points in the LiDAR data. These techniques take advantage of recent developments in accurate point cloud registration algorithms [120] in order to yield the transformation between two successive scans by aligning them in the same coordinate frame. This approach, also known as scan matching, is usually achieved by the Iterative Closest Point (ICP) algorithm [11].
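As a concrete illustration of the scan-matching principle, the sketch below implements basic point-to-point ICP with NumPy and SciPy: nearest-neighbour correspondences are established with a KD-tree, and the rigid transform minimising the point-to-point error is obtained in closed form by SVD. The iteration count and tolerance are illustrative, and none of the robustness refinements discussed next (point-to-plane metrics, outlier rejection) are included.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch/SVD)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def icp(source, target, iters=30, tol=1e-6):
    """Point-to-point ICP: alternate correspondence search and rigid alignment."""
    R_total, t_total = np.eye(3), np.zeros(3)
    tree = cKDTree(target)
    src = source.copy()
    prev_err = np.inf
    for _ in range(iters):
        dists, idx = tree.query(src)                    # 1) establish correspondences
        R, t = best_rigid_transform(src, target[idx])   # 2) minimise the error metric
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = dists.mean()
        if abs(prev_err - err) < tol:                   # convergence criterion
            break
        prev_err = err
    return R_total, t_total
```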
Table 3 Summary of the most representative LiDAR-based SLAM systems, in chronological order. RTE is the Relative Translational Error as used in [39]

| References | Year | Method | Hardware architecture | Output rate | RTE | Loop closure | Back-end optimization | Map representation |
|---|---|---|---|---|---|---|---|---|
| SuMa++ [20] | 2019 | Registration-based | Workstation with 8-core processor | @10Hz | 0.55% | ✓ | Pose-graph optimization | Semantic and metric map |
ICP divides the registration into two sub-problems: the establishment of correspondences and the optimization of an error metric between all the points of both point clouds. The two procedures proceed iteratively until the convergence criterion is satisfied. Derived algorithms have been proposed with the aim of enhancing ICP robustness, for instance by using point-to-plane ICP [75] or point-to-line ICP [19]. An extensive approach was proposed in Generalized ICP [105], with a probabilistic framework merging ICP and point-to-plane ICP. With the purpose of reducing the inherent drift caused by consecutive registrations, methods like [81] suggested combining a pose graph process with a loop closure mechanism using the ICP algorithm for localization. The pose graph is built by taking into account reference frames called keyframes. Even if ICP can be reasonably reliable, the point-wise matching is very sensitive to outliers and partially missing structures, as it is only effective when the consecutive scans have sufficient overlap. Moreover, initial misalignment decreases the global performance remarkably. In an attempt to tackle these issues, the Normal Distribution Transform (NDT) [12] algorithm was proposed for 2D scans before its generalization to 3D space in [79]. NDT's novelty is that it subdivides the space into cells before establishing grid-wise correspondences, unlike ICP where matching is explicit and point-wise. First, each cell is assigned a normal distribution that models the probability of measuring a point; then Newton's algorithm is used to find the spatial transformation between the consecutive scans, resulting in a more robust algorithm. Other optimization algorithms can be used to solve this transformation problem, such as the Levenberg-Marquardt method, which combines Gradient Descent and the Gauss-Newton method. It should be noted that NDT is sensitive to the cell size: if the size is too large, the precision decreases; if it is too small, larger memory is required to store the data. In practice, the cell size is fixed by means of k-means clustering [23] or an Octree structure [33]. Finally, even if NDT, like ICP, needs a sufficient overlap between the two scans, the additional required step of interpolation between cells increases the computational cost.

One of the numerous SLAM algorithms that relies on a scan-matching framework is Surfel-based Mapping (SuMa) [8]. In this method, a point cloud is converted into a vertex map and a normal map while simultaneously creating a 3D representation of the scan using a surfel (Surface element) map. The next step corresponds to using the vertex map and the normal map to estimate the odometry by using projective data association (denoted frame-to-model ICP in this context). The current pose is employed to update the surfel map by adding unseen areas and refining covered ones. Afterwards, the current surfel map is used to detect whether a place has been revisited. Parallel to this, a pose graph composed of loop closure information and relative odometry poses is optimized. The surfel map is finally updated using the resulting optimized poses. This strategy was extended in SuMa++ [20], taking advantage of several semantic classes. The point-wise labels are obtained by spherical projections of the point cloud. This additional information enables the algorithm to filter outliers caused by dynamic objects and to add more semantic constraints in the data processing step, which results in a more accurate and robust pose estimate.

A different method based on the scan-matching approach is Implicit Moving Least Squares SLAM (IMLS-SLAM) [29]. Its first step is to eliminate dynamic objects before the scan is matched to the model map. This algorithm simplifies this task by discarding small objects (assuming they could be dynamic objects) and the ground points. Then, a down-sampling strategy is applied to select only the points with more observability. Finally, a scan-to-model matching strategy is performed to optimize the transformation using the IMLS surface representation. Without loop closure detection, the proposed framework performs at the same level as state-of-the-art algorithms.

Feature-Based Methods Initially inspired by features used in the 2D context, these methods use suitable 3D features to estimate motion between scans, therefore using only a subset of the LiDAR data that corresponds to regions of interest. Descriptors are then used to establish correspondences between the detected features of consecutive point clouds. Many 3D features have been proposed in the literature: for instance, Guo et al. [45] proposed Rotational Projection Statistics (RoPS) feature descriptors, Fast Point Feature Histograms (FPFH) were suggested in [101], and the Normal Aligned Radial Feature (NARF) was presented in [115]. LOAM [134] is a feature-based robust LiDAR odometry and mapping algorithm that can achieve highly accurate results in real-time. The algorithm runs two threads in parallel: one for odometry at a high frequency but low fidelity to estimate the velocity of the LiDAR, and the other for mapping at a relatively lower frequency. The first thread starts by extracting points that are on sharp edges and planar surfaces, provided they are not parallel to laser beams or in occluded regions. These features are then matched with patches of points from the subsequent scan. Finally, the motion estimation is solved using the Levenberg-Marquardt method. The second thread runs in the background and is used for the registration of the point cloud. The combination of the two threads allows this method to achieve reasonably satisfying results. However, it still suffers from motion estimation drift caused by the absence of loop closure detection. In order to tackle this issue, LeGO-LOAM [108] proposed an extension of this method that can improve speed and reach more accurate results by adding loop closure and global optimization. Optionally, the system can take IMU data as input.
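The edge/planar selection at the heart of LOAM and LeGO-LOAM can be illustrated with a small smoothness score computed along one scan ring, as below. This is a simplified reading of the idea, not the original implementation: the window size, the per-ring quotas, and the omission of LOAM's per-sector balancing and occlusion checks are assumptions of the sketch.

```python
import numpy as np

def loam_like_features(scan_line, window=5, edge_quota=2, plane_quota=4):
    """Pick sharp-edge and planar points on one LiDAR ring from a smoothness score.

    scan_line: (N, 3) points of a single ring, ordered by azimuth.
    Returns indices of candidate edge points and planar points.
    """
    n = len(scan_line)
    curvature = np.full(n, np.nan)
    for i in range(window, n - window):
        neighbours = np.r_[scan_line[i - window:i], scan_line[i + 1:i + 1 + window]]
        diff = (neighbours - scan_line[i]).sum(axis=0)
        # Large values indicate a sharp edge, small values a locally planar surface.
        curvature[i] = np.linalg.norm(diff) / (len(neighbours) * np.linalg.norm(scan_line[i]))

    order = np.argsort(curvature)                 # NaNs (border points) sort last
    valid = order[~np.isnan(curvature[order])]
    planar_idx = valid[:plane_quota]              # smoothest points
    edge_idx = valid[-edge_quota:]                # sharpest points
    return edge_idx, planar_idx
```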
The major contribution of this framework is making use of the ground segmentation to remove unreliable features, and then using a two-step Levenberg-Marquardt optimization, first for planar features and then for edge features, to speed up the process. A recent extension, SC-LeGO-LOAM (available from https://github.com/irapkaist/SC-LeGO-LOAM), combines LeGO-LOAM [108] with Scan Context [57], a spatial descriptor that uses structural information for place recognition and global localization. The presented pipeline is lightweight and achieves reasonable time performance by using a KD-tree to search for the Scan Context while running loop detection at 10-15Hz.

Google Cartographer [48] is another LiDAR SLAM system that provides long-term loop closure. Although the original algorithm was created for 2D LiDAR, it has been generalized to 3D. In this method, a few consecutive scans are used to create a submap that is shaped as a probability grid. All the points in the grid and all the points that are nearest to each one of them are used to define the matching pixel. Their probabilities are then adjusted, depending on the scan's hits and misses, whenever the submap is updated with the insertion of a new scan. A Ceres-based [2] scan matcher is used to optimize the scan pose relative to the current submap before effectively inserting the scan into the submap. In parallel, the poses of all scans and submaps are optimized using Sparse Pose Adjustment [63]. Loop closure optimization, like scan matching, is also solved using Ceres. The search is accelerated using a branch-and-bound approach. These techniques have enabled the authors to deliver, through their algorithm, real-time fully optimized maps that cover an area of tens of thousands of square meters.

Alternatively, CPFG-SLAM [54] focused on off-road environments and suggested using surface and line features to describe the characteristics of the point cloud in a probability grid. This algorithm uses ICP and NDT as references and computes the nearest neighbor grid instead of the nearest neighbor point to reduce the feature extraction and search cost. Lastly, Expectation-Maximization (EM) is used to provide a pose estimation iteratively, while the optimization problem is solved by the Levenberg-Marquardt algorithm. Even if this algorithm is more accurate and reliable in challenging environments compared to others in the state-of-the-art, it is still not adapted to urban environments and its performance is drastically affected by dynamic objects.

Deep Learning-Based Methods New algorithms have emerged that use deep learning techniques on LiDAR data, showing promising results in feature detection, semantic segmentation or localization. These methods either try to solve the SLAM problem in an end-to-end fashion, by using the raw point clouds as inputs and directly estimating the vehicle's motion, or try to substitute certain modules of an existing classical SLAM framework that could take advantage of the achievements of deep learning networks.

For instance, PointNetVLAD [122] solves large-scale place recognition through the extraction of local features before clustering them into a VLAD [6] global descriptor. SegMap [32] proposed using the 3D point clouds to extract 3D segments to solve the localization and mapping problem by using a map representation solution. These 3D segment descriptors are extracted using a Convolutional Neural Network whose architecture was developed by varying several parameters before settling for an ultimate configuration where two fully connected layers are employed with three 3D convolutional layers placed around max-pooling layers. Localization is then achieved by identifying correspondences using k-Nearest Neighbors (k-NN). Finally, the same features are used for reconstructing the environment and for extracting semantic information. This approach addresses the computationally intensive task of processing 3D point clouds by using segments as compressed data. Although this method achieves good accuracy, it still has limited applications in real-time tasks. Inspired by RGB semantic segmentation methods, PointSeg [126] tackles this issue in order to balance efficiency and real-time performance. Firstly, the LiDAR point cloud is transformed by projection into a spherical image, then features are extracted from the dense generated data. Secondly, a point-wise semantic mask is predicted using an improved version of Squeeze-Net [53].

The common obstacle that the majority of the previously mentioned strategies face is the presence of dynamic objects (such as cars, bicycles, pedestrians, etc.) in the scene, which limits the global performance of the system. One approach adopted by many researchers to overcome this challenge is to detect and delete these dynamic objects. LO-Net [66] presented a novel encoder-decoder network architecture that estimates odometry using consecutive scans while concurrently learning to mask dynamic regions and to estimate the surface normal. Moreover, this method better formalizes the odometry learning process while providing higher-order interaction between successive scans by incorporating a spatio-temporal geometry consistency constraint into the network. Furthermore, it designs a scan-to-map module, which improves the estimate's accuracy by using the learned semantic and geometric information. However, this method is still limited by the dataset used for training, which restricts its application scenarios; it could also be more practical to use direct 3D point clouds instead of the encoded data.
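Several of the methods above (SuMa's surfel pipeline, PointSeg, and related networks) operate on a spherical range-image encoding of the scan rather than on raw 3D points. A minimal NumPy sketch of that projection is given below; the image size and vertical field of view are illustrative values roughly matching a 64-beam spinning LiDAR, not parameters taken from the cited papers.

```python
import numpy as np

def spherical_projection(points, h=64, w=1024,
                         fov_up=np.deg2rad(2.0), fov_down=np.deg2rad(-24.8)):
    """Project an (N, 3) LiDAR point cloud onto an h x w range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                  # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-9), -1.0, 1.0))

    u = 0.5 * (1.0 - yaw / np.pi) * w                       # column from azimuth
    v = (fov_up - pitch) / (fov_up - fov_down) * h          # row from elevation
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)

    image = np.zeros((h, w), dtype=np.float32)
    image[v, u] = depth                                     # later points overwrite earlier ones
    return image
```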
A novel learning-based LiDAR localization system capable of achieving centimeter-level accuracy was demonstrated in L3-Net [78]. The input of this framework is formed by a set of LiDAR points with their corresponding global coordinates and a predicted pose that is often generated by an IMU, which constitutes a map built offline, plus an online LiDAR scan. Then, local feature descriptors are extracted from a set of keypoints using a mini-version of PointNet [97]. Afterwards, a cost volume is constructed and regularized with 3D Convolutional Neural Networks (3D CNN). Later, a Recurrent Neural Network (RNN) is introduced to achieve temporal smoothness of the displacement predictions. This strategy has been extended by the same authors in DeepICP [76] and DeepVCP [77], where they were inspired by Iterative Closest Point (ICP) and Virtual Corresponding Points (VCP) and tested different Neural Network architectures to estimate the transformation and to find the target point cloud's matching points. The main originality is the addition of a weighting layer that uses a learning process to assign each extracted feature a corresponding weight. Moreover, to obtain higher accuracy in the learning-based registration task, a loss function that incorporates both the local similarity and the global geometric constraints is included.

4.2.2 Open Problems

Even if LiDAR SLAM has proven to provide quite satisfying results, the employed technology still faces some obstacles. Indeed, the sensor position and the properties of the surrounding objects can easily affect the precision of the acquired measurements. Moreover, information is lost if objects are of inconsistent shape or when occlusions occur due to sudden scene changes. Furthermore, more computationally efficient registration algorithms need to be developed in order to deal with the vast amount of acquired point clouds. While registration-based methods are considered more computationally greedy, feature-based approaches can generate accurate results in a reasonable time. However, depending on the environment, multiple parameters need to be tuned. Deep learning-based methods are starting to gain popularity in the LiDAR SLAM context, but multiple considerations should be taken into account, such as the inference time and the representativeness of the training dataset, plus the numerous architectures that should be tested before validation, which sometimes seems to be more of an ad-hoc solution.

Scale and Texture of the Environment Usually, the angular resolution of LiDAR in the vertical direction is lower than in the horizontal direction. This has drastic impacts on the overall performance of LiDAR SLAM when the autonomous vehicle is surrounded by tall repetitive structures or low-textured environments [18]. Additionally, it is affected by large-scale environments such as large fields. Consequently, the point clouds become very sparse or the distance exceeds the maximum range [54]. Moreover, operating for an extended period of time and over long trajectories will require additional memory for the system to remember previously visited areas.

Dynamic Scenarios In dynamic real-time scenarios, LiDAR transmits data in units of packets that form a complete field-of-view while constantly moving. Therefore, at each frame the point cloud is not expressed in a single coordinate system and is not equal to the point cloud of the same frame in the case of static scanning. This generates motion distortion and yields erroneous results. Although some algorithms try to correct the point cloud distortion, motion compensation cannot be guaranteed [108]. Furthermore, accurately removing dynamic objects from a scene is still an open problem. In fact, filtering moving objects from the scene can be challenging, especially when the autonomous vehicle is running at high speed. Moreover, filtering out all moving objects can limit the estimation of the pose or velocity and will surely come at an additional computational cost [20]. Other approaches try to solve this problem by maintaining the map of some location over a period of time, by adding missed features or removing transient map components [10].
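A common mitigation of this distortion is to de-skew the sweep: each point is re-expressed in a single reference frame using a motion estimate over the sweep duration. The sketch below assumes a constant-velocity model with planar (yaw-only) rotation and externally supplied velocities (e.g. from an IMU or the previous odometry step); real systems typically interpolate full 6-DoF poses per point timestamp, so this is only an illustration of the principle.

```python
import numpy as np

def deskew_scan(points, timestamps, v_lin, v_ang_z, t_end):
    """Compensate LiDAR motion distortion under a constant-velocity assumption.

    points:     (N, 3) points, each expressed in the frame where it was measured.
    timestamps: (N,) acquisition time of each point.
    v_lin:      (3,) linear velocity estimate in m/s.
    v_ang_z:    angular velocity about z in rad/s (planar motion assumption).
    t_end:      time of the reference pose (end of the sweep).
    """
    dt = t_end - timestamps                      # how far each point must be advanced
    yaw = v_ang_z * dt
    cos_y, sin_y = np.cos(yaw), np.sin(yaw)

    out = np.empty_like(points)
    # Rotate about z and translate each point into the end-of-sweep frame.
    out[:, 0] = cos_y * points[:, 0] - sin_y * points[:, 1] + v_lin[0] * dt
    out[:, 1] = sin_y * points[:, 0] + cos_y * points[:, 1] + v_lin[1] * dt
    out[:, 2] = points[:, 2] + v_lin[2] * dt
    return out
```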
Adversarial Sensor Attack The work of Shin et al. [111] presented a spoofing-by-relaying attack, which can make illusions look closer than the position of the spoofing device as well as induce spurious responses in the LiDAR output. The experiments conducted showed that the attacker can deceive the targeted sensor by simulating a fictional circumstance via a created attacking signal. When exposed to these simulated impulses, the sensor perception is unable to differentiate between a fake and a real circumstance. Thus, by exploiting this semantic gap between the real environment and the perceived sensor measurements, introduced saturation attacks can entirely disable a LiDAR's ability to detect a certain direction. Therefore, such threats need to be further investigated in order to propose more resilient systems.

5 Multi-Modal SLAM Algorithms: Camera-LiDAR Coupling Strategies

Previous sections detailed the advantages and disadvantages of using camera-based or LiDAR-based SLAM. On one hand, and depending on the chosen sensor, camera SLAM can, to an extent, still provide satisfying results. However, in the case of monocular cameras, it is still limited by scale drift and poor depth estimation. RGB-D and stereo cameras face ranging constraints in outdoor scenarios, and event cameras have a spatial resolution cap that restrains their use. On the other hand, LiDAR SLAM has proven to be efficient in facing these challenges, even if it usually needs greater computing power. Alternatively, using both techniques in a multi-modal framework has demonstrated great potential for solving many issues in the field. Specifically, using multiple sensing modalities is a necessity when the aim of SLAM is to ensure maximum data integrity in case of system failure. Moreover, the two mainly used sensing modalities in the literature are complementary, in the sense that each surpasses the limitations of the other (Table 4).

5.1 Loosely-Coupled Strategies

Loosely-coupled approaches treat camera-based SLAM and LiDAR SLAM as two separate modules running at different rates and exchanging information in a statistically optimal way. The system is therefore designed to fuse multiple sensor estimates by updating them at the correct time interval, which ultimately smooths the motion and the map based on the provided measurements and uncertainties. It is relatively easy to implement but usually suffers from accuracy problems.

One of the first attempts to take advantage of both camera and LiDAR data was carried out in [89], which used visual appearance signatures for loop detection in LiDAR-based SLAM. While laser measurements are used to build a 3D map of the environment progressively, loop closure detections are carried out using a similarity metric computed over consecutive images instead of a single image. Further, a technique based on eigenvalues, which considers only the atypical or interesting matches, is used on the similarity matrix to decrease the false positive ratio.

In 2014, Zhang et al. presented DEMO [132, 133], which used depth information from LiDAR to recover camera motion. Harris corner features are extracted and tracked from RGB images; if the corresponding depth information is available from laser data, it is directly associated to them, otherwise it is triangulated using the previously approximated motion. The core of the method is a Bundle Adjustment that refines the motion estimation concurrently by optimizing sequences of images. Although this method provides accurate results, it is still very dependent on tracking features that are not always available in homogeneous environments. Moreover, it does not take full advantage of LiDAR data and uses only a fraction of the measurements to assign their depth to 2D camera features. This strategy not only excludes most of the dense LiDAR points from use, but also increases the residual errors accumulated by the flawed depth estimation by triangulation. Moreover, it does not provide loop closure detection capabilities. This issue was later addressed by [69] in VL-SLAM, using visual information provided by ORB features and the Bag-of-Words model to detect whether the vehicle has already visited the place. Scan matching is performed using Local Registration and Global Correlation (LRGC), where correspondences are made between the point cloud and line segments. Sparse Pose Adjustment (SPA), based on the Levenberg-Marquardt algorithm, is used for pose graph-based optimization to correct the global map. The proposed method is efficient in loop detection, but using ORB features makes it vulnerable to featureless environments.

Lopez et al. presented in MS-SLAM [74] a framework that adapts to the sensory configuration of aerial vehicles by combining several cutting-edge SLAM methods based on vision, laser and/or inertial measurements. The software implementation of the proposed SLAM system is based on three blocks corresponding to available open source packages. For camera SLAM, either ORB-SLAM or LSD-SLAM was used to estimate a 6-DoF pose. For LiDAR SLAM, HectorSLAM [62] was used to generate a 2D map. Scale correction is accomplished by adding information from an IMU, and an Extended Kalman Filter (EKF) is employed to fuse all available data, resulting in an improved pose estimation and a generated 2.5D local map. Unfortunately, a globally consistent map cannot be generated with this system, as depth maps and point clouds are discarded once a new keyframe is obtained. Moreover, it cannot provide loop closure detection.

VISO2-LOAM [128] is another loosely coupled Visual-LiDAR Odometry that combines the 6-DoF motion estimation output provided by LibVISO2 [40] and LOAM. A stereo camera is used by the VISO2 module to calculate the transformation between two frames in order to correct the laser point clouds. This motion estimate, combined with the last pose calculated by LOAM, is used to initialize the current pose of the LiDAR. Then, to match the point cloud collected in the map, shape characteristics are extracted from the corrected point cloud. The matching distance is reduced using the Levenberg-Marquardt technique to produce the ideal state of the current frame relative to the starting time. This strategy faces several challenges due to drift accumulated over distance and in environments lacking colors and textures.

LIMO [42] used a frame-to-frame motion estimation from viso2 features and then leveraged LiDAR data to estimate the depth corresponding to those features. First, LiDAR points are projected onto the image, and a subset of points around a given feature and within a fixed-size rectangle is chosen in order to extract local planes; a minimal version of this projection and depth association is sketched below. To estimate the depth associated with features lying on corners and edges, the algorithm uses histograms of depths with fixed bin width. Frame-to-frame odometry estimation is carried out using the Perspective-n-Point problem [118]. Moreover, landmarks are classified into near, middle and far points and are evenly distributed and employed to increase efficiency and robustness.
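The depth-association step that LIMO performs can be pictured as follows: LiDAR points are transformed into the camera frame with the extrinsic calibration, projected with the camera intrinsics, and the points falling in a small window around a tracked feature provide its depth. In this sketch the plane fitting and depth histograms used by LIMO are replaced by a simple median, and the window size and all names are illustrative assumptions.

```python
import numpy as np

def feature_depth_from_lidar(feature_uv, lidar_points, T_cam_lidar, K, half_win=4):
    """Assign a depth to one tracked image feature from projected LiDAR points.

    feature_uv:   (2,) pixel position of the visual feature.
    lidar_points: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) extrinsic calibration, LiDAR -> camera.
    K:            (3, 3) camera intrinsics.
    Returns the median depth of the points falling in the window, or None.
    """
    pts_h = np.hstack([lidar_points, np.ones((len(lidar_points), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]          # keep points in front of the camera

    proj = (K @ pts_cam.T).T
    uv = proj[:, :2] / proj[:, 2:3]

    close = np.all(np.abs(uv - feature_uv) <= half_win, axis=1)
    if not np.any(close):
        return None
    # LIMO fits local planes / depth histograms here; the median is a crude stand-in.
    return float(np.median(pts_cam[close, 2]))
```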
| Coupling | References | Year | Method | Fusion level | Hardware Architecture | Output rate | Map Scale / Density | Dynamic | Map representation | Back-end optimization | RTE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Tightly-Coupled | R²LIVE [70] | 2021 | Features | Odometry estimation | Onboard computer with quad-core processor @1.80GHz and 8GB RAM; PC with quad-core processor @3.6GHz and 32GB RAM | @10Hz; @25Hz | Medium / Dense | Low | Point cloud | – | 0.21% |
| Tightly-Coupled | LIC-Fusion 2.0 [138, 139] | 2020 | Features | Data association | PC with quad-core processor @4.0GHz | C @20Hz, L @10Hz | Small / Semi-Dense | Low | Landmark-based | – | 0.17% |
| Tightly-Coupled | TVLO [106] | 2019 | Features | Odometry estimation and Mapping | N/A | N/A | Large / Sparse | Low | Grid and Landmark-based | Pose-graph optimization | 0.74% |
| Tightly-Coupled | LC-LVF [55] | 2019 | Features | Odometry estimation | PC with quad-core processor @2.4GHz and 8GB RAM | @7Hz | Small / Semi-Dense | Low | Grid and Landmark-based | Pose-graph optimization | 0.25% |
| Tightly-Coupled | VIL-SLAM [110] | 2019 | Features | Mapping | PC with quad-core processor @3.6GHz | N/A | Medium / Dense | Low | Grid | Pose-graph optimization | 0.5% |
| Tightly-Coupled | VLO-IMU [7] | 2016 | Features | Odometry estimation | N/A | N/A | Large / Sparse | Low | Landmark-based | Bundle Adjustment | 0.8% |
| Loosely-Coupled | LIV-LAM [98] | 2020 | Features | Back-end optimization | PC with quad-core processor @2.4GHz and 8GB RAM | N/A | Small / Dense | Low | Point cloud | Pose-graph optimization | 1% |
| Loosely-Coupled | DVL-SLAM [112, 113] | 2020 | Direct | Data association | PC with quad-core processor @3.8GHz and 8GB RAM | C @20Hz, L @10Hz | Medium / Sparse | Low | Point cloud | Pose-graph optimization | 0.98% |
| Loosely-Coupled | Pilatus Driverless [3] | 2020 | Deep learning | Data association | PC with quad-core processor @2.8GHz and 16GB RAM | @10Hz | Medium / Sparse | Low | Landmark-based | Pose-graph optimization | 0.15% |
| Loosely-Coupled | LIMO [42] | 2018 | Features | Data association | PC with quad-core processor @3.5GHz | @5Hz | Large / Sparse | High | Landmark-based | – | 1.22% |
| Loosely-Coupled | V-LOAM [132, 133, 135] | 2017 | Features | Data association | PC with quad-core processor @2.5GHz and 6GB RAM | C @60Hz, L @1Hz | Large - Small / Dense | High | Point cloud | – | 1.16% |
| Loosely-Coupled | MS-SLAM [74] | 2017 | Features | Odometry estimation | Distributed computing with onboard dual-core controller @1GHz and a PC with quad-core processor @2.4GHz | @25Hz | Small / Semi-Dense | Low | Landmark-based | Pose-graph optimization | 1.25% |
| Loosely-Coupled | VISO2-LOAM [128] | 2017 | Features | Odometry estimation | PC with quad-core processor @2.4GHz and 8GB RAM | @10Hz | Large / Dense | High | Point cloud | – | 1.37% |
| Loosely-Coupled | VL-SLAM [69] | 2015 | Features | Data association | PC with dual-core processor @3.3GHz and 4GB RAM | @5Hz | Small / Dense | Low | Grid | Pose-graph optimization | 1.5% |
Furthermore, semantic information is used to remove dynamic objects and determine the weight of landmarks. Finally, Bundle Adjustment (BA) is used on the selected keyframes to estimate the ego motion. Even if this strategy shows good performance, it does not solve the loop closure problem, and using only a fraction of the LiDAR measurements makes accumulated errors increase significantly the longer the traveled path. Similarly, DVL-SLAM [112, 113] used the depth from LiDAR in a direct camera-based SLAM method, where photometric errors were minimized in an iterative way. This direct method is more advantageous in the case of a low-resolution camera and sparse LiDAR measurements (reduced number of laser beams). Tracking is performed using a sliding-window optimization that processes image frames and depth measurements. When performing pose-graph optimization, the pose of the oldest frame is marginalized. A place recognition-based module is added to perform loop closure. Finally, pose-graph optimization includes loop constraints and uses g²o [64].

LIV-LAM [98] is a recently presented framework for SLAM that uses LiDAR-based odometry combined with monocular camera-based object detection and association in a pose-graph optimization fashion with loop closure detection. Unsupervised feature learning is performed on the fly at three levels: low-level features, temporally-salient objects, and discovered class correspondences, all without prior knowledge of the context. Every n frames, a new keyframe is selected, from which L2-normalized SIFT and L2-normalized color are extracted and fed into stream clustering to progressively learn a feature codebook (similar to a Bag-of-Features paradigm). Each tracked object is associated with a Tracker Confidence Curve that can indicate whether a new object has been discovered and should be integrated, and that is used for inferring object correspondences. Following the same methodology as LOAM and LeGO-LOAM, the authors use planar and edge features computed from LiDAR points to perform odometry. Next, these features are aligned and combined with the features learned via camera-based odometry, then added to the point-cloud map to find the position transformation. Finally, the pose-graph optimization process is carried out using the Levenberg-Marquardt algorithm. This method can still be improved to deal with outdoor scenarios and featureless areas.

One of the most recent works is presented in Pilatus Driverless [3], which introduces the perception, mapping and planning pipeline implemented on an autonomous race car. The perception setup is designed to detect track boundaries over a long range; consequently, two monocular cameras with global shutter, in addition to two 32-channel rotating LiDARs, are used. The longitudinal, lateral, and rotational velocities of the car are estimated by fusing raw data from all the input sensors, including an optical speed sensor, dual antenna GNSS, an IMU and motor resolvers, with an Unscented Kalman Filter. The vision pipeline is based on a Convolutional Neural Network, Tiny YOLOv3 [100], that is trained to detect the cones that limit the track. The distance to the boundary is then estimated and the detected cone is projected into 3D space. The LiDAR pipeline uses another Neural Network to predict the most likely cone color based on the LiDAR intensity pattern. Both pipelines can run autonomously, which increases the system robustness. Point clouds that correspond to the cones are then projected onto both image planes to classify the far-situated cones based on their color. Cone positions and uncertainties are filtered using a Kalman Filter, then a local map containing all cones within the new set of observations is generated. Finally, a global map is produced in real-time using a GraphSLAM algorithm [43] with cones as the landmarks. Ceres-solver [2] is used to estimate the vehicle and cone poses together using non-linear least squares. Experiments show that this pipeline makes it possible to reach a maximum driving speed of 12 m/s in unknown environments. The corresponding map obtained has an RMSE of 0.29m. However, this solution is still not adaptable to real-life use due to its cost, but mostly due to the complexity of scenarios where more agents and factors need to be taken into account.

5.2 Tightly-Coupled Strategies

Tightly-coupled approaches combine the two SLAMs into one optimal framework. Most tightly-coupled optimization methods use graphs to represent landmarks obtained from both modalities, but filtering methods can also be applied. The common strategy is the use of a multi-view-state formulation that includes both sensors' measurements in the state vector. The practical implementation of this approach is usually more challenging, as it involves the fusion of raw data or multi-modal associated features. Nevertheless, it is generally more accurate.

In 2015, inspired by LOAM [134] and DEMO [132], the same authors extended these techniques to propose a tightly-coupled framework in V-LOAM [135]. This method takes the camera-based odometry output as a motion prior followed by LiDAR odometry; the two modes are tightly combined to represent the surroundings metrically and spatially and to estimate motion with accuracy. The proposed method can handle the temporary absence of texture, as seen in total blackout or whiteout images. Moreover, it can cope with aggressive translation and rotation movements. Real-time performance is achieved by performing camera-based odometry at a high frequency (60Hz) and LiDAR odometry at a low frequency (1Hz). This sequential/parallel design allows the system to maintain a good pose estimation under aggressive motions by bypassing the faulty module. An ICP-based scan-to-scan refinement step matches point clouds between successive scans to refine motion estimates
and eliminate point cloud distortion induced by camera-based odometry drift. Then, to generate maps continuously, distortion-free point clouds are matched and registered. Even if the authors intentionally chose not to implement loop closure detection, to push the boundaries of the precise odometry estimate, it achieves a relative position drift of 0.75%.

Inspired by this strategy, VLO-IMU [7] proposed integrating reduced IMU data with the Camera-LiDAR odometry output in a Kalman Filter to provide a reliable navigation solution. By contrast, this method uses stereo camera odometry with window processing, bucketing and Bundle Adjustment to reduce the growth rate of accumulated errors. Moreover, the locally optimized output of camera-based odometry is used as the initial transformation for LiDAR odometry. The authors then used the Point Cloud Library (PCL) and the ICP algorithm for scan matching, after eliminating points that are either far (more than 20m), on the ground, or outside a 180° field-of-view. The relevant IMU data that were used are the 2D horizontal acceleration (X and Y axes) and one 1-D vertical gyroscope (Z-axis). Finally, the roll, pitch, azimuth, velocity and position are refined using a Kalman Filter to merge the results of the different modules.

As an alternative to using camera-based odometry as an initial motion guess for LiDAR odometry or using a Kalman Filter to merge data, another approach was proposed in TVLO [106] to combine both measurements. In this framework, two maps in different modalities are built and both of them are used to compute the residuals for odometry. Such a tightly-coupled method estimates motion more accurately by adding more geometric constraints. The implementation uses a LOAM pipeline to generate a voxel map and an ORB-SLAM2 pipeline to obtain a visual map. The two pipelines are run separately by extracting features individually from each modality and matching them in their respective domains. In the laser domain, the residuals are geometric functions of point-to-plane and point-to-line distances. In the 2D visual domain, the monocular and stereo projections of the 3D map points form the residuals. At the tracking step, the pose is estimated by minimizing residuals from both modalities using a nonlinear optimization method like Levenberg-Marquardt. Afterwards, the pose tracking results are used to update the LiDAR voxel map, and a local Bundle Adjustment is used to update the visual map based on the map points. To make the LiDAR voxel map more uniform throughout the surrounding environment, inspired by Google Cartographer, this method uses Ceres for loop detection and closure before updating the final map using pose graph optimization. Even if experimental results show that this strategy outperforms state-of-the-art algorithms in accuracy, efforts still need to be made to adapt it for real-time applications.

In a related approach, LC-LVF [55] proposed using a new cost function that considers both scan and image data as constraints for pose graph optimization. First, scan-to-scan or scan-to-map methods are used to estimate the current pose, while ORB features are also used for motion estimation and to generate a Bag-of-Words. Depending on the scene conditions (featureless, colorful, large scale, etc.), a weighting is performed to balance the two estimated quantities. Pose graph optimization is further carried out using g²o. Finally, a Bag-of-Words-based approach is applied to detect revisited places. Another contribution of this work is the generated 2.5D map that combines and aligns both laser planes and image features. This map is also used for fast relocalization, but could be generalized to dynamic object detection in the future.

Shao et al. proposed VIL-SLAM [110], which is based on a tightly-coupled stereo visual inertial LiDAR odometry. The Kanade-Lucas-Tomasi (KLT) feature tracker is used in the camera-based front-end to track all feature points in previous stereo matches. If the number of detected stereo matches falls below a certain level, the Shi-Tomasi corner detector is applied for feature extraction. After computing ORB descriptors on the final features, a brute-force stereo matching is performed to produce more stereo matches. This output is then combined with IMU measurements to perform a tightly-coupled fixed-lag smoothing over a pose graph. A LiDAR mapping module is used to align all scanned points to the end-of-scan time before carrying out scan-to-map registration. In the loop closure module, point cloud ICP alignment refines the initial loop constraint estimations that follow visual place recognition. To get a globally corrected trajectory and a LiDAR pose correction in real-time, a global pose graph constraining all LiDAR poses is incrementally optimized. These corrections are sent back to the LiDAR mapping module for map update and relocalization. Another novelty in this approach is the enhanced loop closure that adds ICP refinement to the visual Bag-of-Words.

LIC-Fusion, introduced by Zuo et al. [138], tightly combines IMU measurements, extracted LiDAR edge features, as well as sparse FAST visual features, using the Multi-State Constraint Kalman Filter (MSCKF) fusion framework. This work has recently been improved by the same authors in LIC-Fusion 2.0 [139] through the use of a sliding-window-based plane-feature tracking approach to efficiently process 3D LiDAR point clouds. An outlier rejection criterion is presented for the proposed plane tracking, enabling robust matching by accounting for the transformation uncertainty between LiDAR frames. The system can accurately represent the uncertainties of LiDAR data, which avoids the discrepancy caused when matching LiDAR scans with ICP.
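The tightly-coupled estimation used by TVLO and related systems boils down to stacking residuals from both modalities into one nonlinear least-squares problem over the pose. The sketch below does this for a single pose with pre-established correspondences: visual reprojection residuals plus LiDAR point-to-plane residuals, solved with SciPy's least-squares routine (a Levenberg-Marquardt solver is available as method='lm'). The weights, the correspondence sets and the frame conventions are assumptions of the sketch, not details of any surveyed system.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pose_residuals(x, landmarks, pixels, K, scan_pts, plane_pts, plane_normals,
                   w_cam=1.0, w_lidar=1.0):
    """Stack camera and LiDAR residuals for one sensor pose.

    x = [rx, ry, rz, tx, ty, tz]: axis-angle rotation and translation of the
    transform that maps world/map-frame quantities into the sensor frame.
    """
    R = Rotation.from_rotvec(x[:3]).as_matrix()
    t = x[3:]

    # Visual residuals: reproject known 3D landmarks (world frame) and compare
    # with the tracked pixel observations.
    cam_pts = landmarks @ R.T + t
    proj = cam_pts @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    r_cam = (uv - pixels).ravel() * w_cam

    # LiDAR residuals: point-to-plane distance between raw scan points (sensor
    # frame) and matched map planes brought into the sensor frame by the same pose.
    q_s = plane_pts @ R.T + t
    n_s = plane_normals @ R.T
    r_lidar = np.einsum('ij,ij->i', scan_pts - q_s, n_s) * w_lidar

    return np.concatenate([r_cam, r_lidar])

# Example call (all arrays supplied by the front-end data association):
# result = least_squares(pose_residuals, x0, method='lm',
#                        args=(landmarks, pixels, K, scan_pts, plane_pts, plane_normals))
```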
R²LIVE [70] is the most recent tightly-coupled LiDAR-Inertial-Visual state estimation and mapping framework. The proposed framework uses a high-rate filter-based odometry and a low-rate factor graph optimization. This multi-sensor system fuses the measurements of camera, LiDAR and inertial sensors within a Kalman filter to achieve real-time performance. The factor graph optimization refines a local map of keyframe poses and visual landmark positions. This approach was tested on several indoor-outdoor datasets using a desktop PC (with Intel i7-9700K CPU and 32GB RAM) and an onboard computer (with Intel i7-8550u CPU and 8GB RAM).

6 Discussion

6.1 Open Challenges

Our work shows that a clear tendency to integrate multiple sensors for a more accurate and robust SLAM framework has emerged in the last years. However, the more different the sensors are, the more challenges are faced in processing the raw data and respecting exact time alignment. Moreover, spatial transformations can become more complex, which can only increase the algorithmic complexity and limit the optimization possibilities.

In our opinion, merging both modalities and taking full advantage of all the available information by unifying the data representation will allow more robust and accurate SLAM systems to emerge. However, this possible solution will have to be more computationally efficient on limited-memory hardware. In order to overcome these challenges, we believe that taking hardware specifications into account while designing a tightly-coupled multi-modal SLAM algorithm is a viable research orientation and will significantly improve the overall performance of future SLAM systems. Table 5 summarizes and compares the previously presented coupling strategies while emphasizing each method's benefits and drawbacks.

6.1.1 Map Representation

As discussed in Section 2.4, mapping is an essential part of SLAM, and using a dynamic high-precision map is critical for autonomous vehicles. However, using either camera or LiDAR, we can represent the environment through multiple map types. There are three main map categories: metric, semantic and hybrid. Each has its own advantages and disadvantages, and choosing the right map for the right scenario can affect the global performance [9]. For large-scale environments, combining a local detailed map and a global summarized map can solve some real-time-related problems. However, combining different types of maps in order to establish precise maps of outdoor, large and dynamic environments is still a challenge for upcoming endeavors [9]. On another note, for long-term mapping and autonomy, some open questions remain: at what rate should the map be updated? Indeed, because of the quantity and quality of data necessary, the expense of maintaining data-collection systems, and the processing time, map creation remains a complicated operation [10]. Additionally, there is no standardization of the 3D representation of the map or of the detectable patterns that should be used in map building. Other than periodically updating the map, using a cross-modal map representation can be a possible solution to compress heavyweight large-scale maps [82].

6.1.2 Adaptability and Robustness

No SLAM system can currently manage all scenarios. In order to perform successfully in a particular circumstance, most systems require substantial parameter adjustment. The data association module in the front-end block suffers the most from this.
Table 5 Summary and comparison between camera-LiDAR coupling strategies

| Strategy | Principle | Advantages | Disadvantages |
|---|---|---|---|
| Tightly-Coupled [7, 55, 70, 106, 110, 139] | – Combine the two modalities in one optimal framework – Fusion at data representation or optimization level – Multi-level fusion possibilities | – More robust and accurate in challenging scenarios – Takes full advantage of all available data | – More challenging integration of both modalities – Reduced reliability and increased development cost – Greedy in memory need and computational power |
| Loosely-Coupled [3, 42, 69, 74, 98, 113, 128, 135] | – Each modality runs separately – Exchanging relevant information in an optimal fashion – Fusion at front-end level | – Easier data association and representation with efficient computation – Can work in dynamic scenes and large scale environments | – Accuracy problems due to matching issues – Minimal data usage and data association dependency – Necessitates robust estimation and optimization to overcome outliers |
These parameters can also include thresholds that control feature tracking, criteria to insert a new keyframe and criteria to close the loop. In order to design a SLAM system adaptable to varying scenarios, additional modules for automatic tuning of parameters need to be incorporated [16]. Additionally, in order to enhance an autonomous vehicle's perception, appearance-based methods are preferred over feature-based ones, because this will significantly improve long-term place recognition and allow closing the loop even between different day and night periods or different seasons. Moreover, mapping with one modality (e.g., LiDAR) and localizing in the same map with a different sensor (e.g., camera) can be interesting, especially if additional techniques such as visual vocabularies [89] or Bag-of-Words [38] are used. These techniques, however, are not immune to failing (e.g., in case of severe illumination variations) and can be upgraded by using sequence or trajectory matching.

6.1.3 Online and Life-Long Learning

Life-long and online learning is a significant problem that any future long-term SLAM system will be required to solve. SLAM systems are often used in an open environment with uninterrupted observation, where new objects and situations can be met at any time. On the other hand, deep networks are often trained on a limited number of scenarios with a smaller set of object classes compared to the ones encountered in the real world [97]. A significant challenge is to push the limits of deep networks' capabilities in a one-shot or zero-shot scenario (i.e., a new class may have one or even zero training examples) to enable life-long learning for a SLAM system that is constantly moving and observing.

Wofk et al. [127] have successfully estimated depth from a single color image with low latency, destined for an embedded system. Additionally, prior information about the environment can provide a significant boost to SLAM's accuracy and reliability. This prior information can be geometrical or based on objects and their relationships [99, 117]. Correctly extracting and using this information is a very important challenge, since fusing it with geometrical information can bootstrap the SLAM system. In fact, characterizing the uncertainty of estimates derived from deep architectures and designing an end-to-end Deep Neural Network inspired by [20, 76, 77] may solve many problems in feature modeling, data associations, etc. However, in an open world, the autonomous vehicle will encounter an ever-changing environment. This will require the SLAM system of the future to learn, adapt and extend its knowledge while operating.

6.1.4 Risk Management

A perfect SLAM system must be both fail-safe and fail-aware. It is not only a matter of relocalization or loop closure in this case: the SLAM system must be capable of responding to risk or failure. Its ability should extend to, but not be limited to, detecting data corruption and outliers, respecting operational timing, and resolving modality heterogeneity and data redundancy. The design of a SLAM system should take into account a risk management module that will assess the risk associated with each sensor due to circumstantial hazards. More frameworks such as [136] should be introduced to enable the system to detect faults in estimation, tracking or data associations. At the same time, an ideal SLAM solution should be able to run on a variety of platforms, regardless of their computing limitations. A difficult task is balancing accuracy, resilience, and limited resources. Adding this challenge to risk assessment, enabling a SLAM system to detect its computational limitations and remove redundant or unnecessary data is an open problem [107].

6.2 Future Orientations

It has become evident that fusion is the future, but the question remains: what should we fuse? Data, based on the multiple sensors used, or components, such as recently emerged off-the-shelf Neural Networks?

6.2.1 Deep Learning and Classical Hybrid Models

The number of researchers focusing on the application of deep learning to SLAM is continuously increasing. Either an end-to-end model [59, 78, 129] or a hybrid model [20, 32, 42] is adopted in the literature. Fundamentally, the classic physical and geometrical approaches that govern SLAM will outperform data-driven approaches in untrained scenarios. Moreover, it is difficult to integrate heterogeneous data in an end-to-end network structure. By using a partially concurrent network structure to realize the fusion of various sensors, accepting multiple sensor data as input and processing them simultaneously in the intermediate layers of the network, deep learning capabilities may be leveraged while keeping a theoretically established model. The network can learn the data properties of the various sensors and conduct fusion in the intermediate feature layer through a large number of training sessions. At the same time, the network's depth must account for the quantity of data, the channel number, as well as the sampling rate and the acquisition frequency of the various sensor data.
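The intermediate-layer fusion described above can be pictured as a two-branch network whose camera and LiDAR encoders are concatenated in feature space before a shared head. The PyTorch sketch below is purely illustrative: the input encodings (a stacked image pair and a range-image pair), the channel counts and the 6-DoF output head are assumptions, not an architecture from the surveyed literature.

```python
import torch
import torch.nn as nn

class MidFusionOdometry(nn.Module):
    """Toy two-branch network fusing camera and LiDAR features in an intermediate layer."""

    def __init__(self):
        super().__init__()
        self.cam_enc = nn.Sequential(            # consecutive image pair, stacked on channels
            nn.Conv2d(6, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lidar_enc = nn.Sequential(          # consecutive range-image pair
            nn.Conv2d(2, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fusion_head = nn.Sequential(        # fusion happens here, in feature space
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 6))                    # relative pose (translation + rotation)

    def forward(self, images, range_images):
        fused = torch.cat([self.cam_enc(images), self.lidar_enc(range_images)], dim=1)
        return self.fusion_head(fused)
```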
concept of supervised learning. One sensor can supervise block with combined scan matching and feature matching
the data of another sensor to enhance it or correct it. abilities.
For example, Generative Adversarial Networks (GANs) Figure 10 shows a suggested pipeline for a SLAM
may be used to track the LiDAR point cloud in order to framework that tightly combines both Camera and LiDAR
generate more accurate data and remove distortions [103]. data in the front-end block.
Moreover, deep learning based techniques are generally
computationally greedy [32, 66, 78]. Therefore, for ground 6.2.3 Tightly-Coupled Back-End
autonomous vehicle applications such models should scale
down its computational cost while maintaining acceptable An other way of taking advantage of the multi-modality
performance. In order to achieve this while balancing the in SLAM framework would be to implement a back-end
speed and the accuracy of these models, a possibility is to block capable of minimizing both residuals while arbitrating
reduce the amount of inputs (either pixels, points or voxels) between both modalities. We propose to investigate the
and the depth of the network [53]. possibility of sharing constraints obtained from Camera or
6.2.2 Tightly-Coupled Front-End

A tight-by-design strategy that takes advantage of both sensor modalities does not exist. We propose using one of the following fusion strategies to include both sensors and both corresponding algorithms in the final SLAM framework. Such possible fusion solutions can be:
– Data in, data out [110]: raw sensor data is fed into the fusion network, while the output is processed (typically enhanced) raw data that takes advantage of both modalities, each one compensating for the disadvantages of the other.
– Data in, feature out [106]: raw data is combined to create a set of output features independently of the entered data type (a minimal projection sketch is given after this list).
– Feature in, feature out [30]: both the input and the output are feature vectors. This can also be seen as a translation of features into a cross-modal feature.
– Feature in, decision out [70, 139]: tightly coupling the motion estimation module so that it is capable of processing either direct data or features from both modalities, without distinction, in order to estimate the vehicle's motion. An associated challenge would be to enrich the corresponding motion-estimation block with combined scan matching and feature matching abilities.

Figure 10 shows a suggested pipeline for a SLAM framework that tightly combines both Camera and LiDAR data in the front-end block.
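To make the cross-modal coupling concrete, here is a small, self-contained sketch (our own illustration; the pinhole intrinsics and the LiDAR-to-camera extrinsic transform are made-up example values, not calibration from any system discussed in this survey) that projects LiDAR points into the image and attaches a metric depth to visual keypoints — the kind of enhanced feature a tightly-coupled front-end could consume:

```python
# Hedged illustration: project LiDAR points into the camera image and
# attach a metric depth to 2D keypoints. K (intrinsics) and T_cam_lidar
# (extrinsics) are invented example values.
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_lidar = np.eye(4)
T_cam_lidar[:3, 3] = [0.1, -0.05, -0.2]   # lever arm between sensors

def project_lidar(points_lidar):
    """Return pixel coordinates and depths of points in front of the camera."""
    pts_h = np.c_[points_lidar, np.ones(len(points_lidar))]
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.5]      # keep points in front
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv, pts_cam[:, 2]

def depth_for_keypoints(keypoints_uv, lidar_points, radius=8.0):
    """Attach the depth of the nearest projected LiDAR point to each keypoint."""
    uv, depth = project_lidar(lidar_points)
    enhanced = []
    for kp in keypoints_uv:
        d2 = np.sum((uv - kp) ** 2, axis=1)
        j = np.argmin(d2)
        if d2[j] <= radius ** 2:
            enhanced.append((kp[0], kp[1], depth[j]))
    return np.array(enhanced).reshape(-1, 3)

scan = np.random.uniform([-20, -10, 0.5], [20, 10, 30], size=(20_000, 3))
kps = np.random.uniform([0, 0], [640, 480], size=(200, 2))
print(depth_for_keypoints(kps, scan).shape)  # (M, 3): u, v, depth
```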
6.2.3 Tightly-Coupled Back-End

Another way of taking advantage of multi-modality in a SLAM framework would be to implement a back-end block capable of minimizing both residuals while arbitrating between both modalities. We propose to investigate the possibility of sharing constraints obtained from the Camera or the LiDAR. For instance, rotational errors could be minimized by visually detecting a turn and adding this information as a semantic constraint to the global back-end. Another example would be measuring the distance to an object with the LiDAR and constraining the range estimation to reduce the drift. Such a SLAM system would be able to build on the work of Shao et al. [110] to perform a global pose graph optimization that incorporates constraints and relative transformation information from both LiDAR and camera data in order to reduce drift.
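The following toy example (our own 1-D illustration, not the formulation of [110]) shows how residuals from two modalities can be minimized jointly in a single least-squares back-end: visual odometry supplies relative-motion constraints, while LiDAR ranges to a landmark with known position anchor the absolute estimate.

```python
# Toy 1-D "back-end" fusing two kinds of residuals in one least-squares
# problem: camera odometry gives relative-motion constraints, LiDAR gives
# ranges to a landmark with known position. Purely illustrative; real
# back-ends optimize 6-DoF pose graphs with robust kernels.
import numpy as np

landmark = 10.0                       # known landmark position (m)
cam_rel = [2.1, 1.9, 2.05]            # camera odometry: x[i+1] - x[i]
lidar_rng = [10.0, 7.9, 6.1, 3.9]     # LiDAR range: landmark - x[i]
w_cam, w_lidar = 1.0, 2.0             # confidence weights per modality

n = 4                                  # poses x0..x3, x0 anchored at 0
rows, rhs = [[1.0, 0.0, 0.0, 0.0]], [0.0]        # prior on x0
for i, d in enumerate(cam_rel):        # odometry residuals
    r = [0.0] * n; r[i + 1], r[i] = w_cam, -w_cam
    rows.append(r); rhs.append(w_cam * d)
for i, rng in enumerate(lidar_rng):    # range residuals
    r = [0.0] * n; r[i] = -w_lidar
    rows.append(r); rhs.append(w_lidar * (rng - landmark))

x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
print(np.round(x, 2))  # fused trajectory estimate close to both sensors
```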
6.2.4 Data Fusion at Sensor Level

Handling large maps in order to achieve long-term tracking and relocalization requires large memory resources and computational power. Therefore, in order to achieve real-time performance on an embedded SLAM system for autonomous vehicles, the processed data must be kept to the strict minimum of relevant information. This approach will inevitably lead to designing new hardware-based solutions that ensure a better management of parallel processing, memory usage and energy efficiency. In fact, developing an efficient autonomous ground vehicle will require designing SLAM methods whose memory and computing complexity requirements do not exceed a certain threshold [16].

Using embedded platforms (such as FPGA boards) would also solve another critical multi-sensor fusion problem, namely data synchronization and pre-processing [124]. In contrast with software-based synchronization techniques, which demand additional computing power, hardware-based solutions warrant a unified timing system and a precise time-stamping of the triggered sensors.
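As a small software-side illustration of the synchronization problem that hardware time-stamping removes (our own sketch, not a mechanism from [124]), pairing each camera frame with the nearest LiDAR scan by timestamp looks like this; the residual offset is exactly the error that a unified hardware clock and triggered acquisition avoid at the source.

```python
# Nearest-timestamp pairing of camera frames (~30 Hz) with LiDAR scans
# (~10 Hz). With independent, unsynchronized clocks the best software
# pairing still leaves a time offset between the paired measurements.
import numpy as np

cam_stamps = 0.005 + np.arange(0, 3.0, 1 / 30)   # camera clock (s)
lidar_stamps = np.arange(0, 3.0, 1 / 10)         # LiDAR clock (s)

idx = np.searchsorted(lidar_stamps, cam_stamps)
idx = np.clip(idx, 1, len(lidar_stamps) - 1)
prev_closer = (np.abs(cam_stamps - lidar_stamps[idx - 1])
               < np.abs(cam_stamps - lidar_stamps[idx]))
nearest = np.where(prev_closer, idx - 1, idx)

offsets = np.abs(cam_stamps - lidar_stamps[nearest])
print(f"worst-case pairing offset: {offsets.max() * 1000:.1f} ms")
```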
6.2.5 Hardware/software Co-Design

Table 4 summarizes the discussed coupled Camera-LiDAR SLAM systems and the hardware architecture that each one of them used. Unfortunately, none of them takes into account the hardware constraints associated with embedding such algorithms. In fact, an efficient embedded SLAM must provide high processing capabilities and low energy consumption. The work of Abouzahir et al. [1] proved that embedded platforms using multi-core and heterogeneous architectures may be used to attain real-time mono-modal SLAM. Following the same reasoning, future research must adopt a hardware/software co-design methodology when designing a multi-modal SLAM system. Consequently, the frequency requirements and the latency restrictions of the whole processing chain must be defined in line with the targeted platform that carries out the computation [127]. In this sense, CPUs and GPUs allow a more straightforward implementation of complex algorithms than FPGAs. However, FPGAs generally lead to better real-time processing thanks to their data-flow processing model, and they also offer better power consumption performance. Hence, to achieve an efficient trade-off when addressing the implementation of a real-time multi-modal embedded SLAM, the design should rely on heterogeneous hardware architectures combining a CPU and FPGAs. This would reduce the algorithmic complexity by processing small data volumes on the CPU at a low frequency, while larger volumes are efficiently and massively parallelized and processed on a multi-FPGA architecture.
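To illustrate what defining the frequency requirements and the latency restrictions of the whole processing chain can mean in practice, here is a small budget-checking sketch; the stage names and timings are invented for the example and are not measurements of any system listed in Table 4.

```python
# Hypothetical latency budget for a multi-modal SLAM chain. Stage
# timings are made-up numbers; the point is only to check whether a
# candidate CPU/FPGA partition meets the sensor-driven frame period.
SENSOR_RATE_HZ = 10.0                     # LiDAR-driven frame rate
FRAME_BUDGET_MS = 1000.0 / SENSOR_RATE_HZ

stages_ms = {                              # (cpu_ms, fpga_ms) per stage
    "pre-processing":     (18.0, 4.0),
    "feature extraction": (35.0, 9.0),
    "scan matching":      (35.0, 12.0),
    "pose-graph update":  (15.0, 15.0),    # kept on CPU either way
}

def chain_latency(offload_to_fpga):
    return sum(fpga if name in offload_to_fpga else cpu
               for name, (cpu, fpga) in stages_ms.items())

for partition in [set(), {"feature extraction"},
                  {"feature extraction", "scan matching"}]:
    total = chain_latency(partition)
    verdict = "meets" if total <= FRAME_BUDGET_MS else "misses"
    print(f"FPGA offload {sorted(partition) or ['none']}: "
          f"{total:.0f} ms -> {verdict} the {FRAME_BUDGET_MS:.0f} ms budget")
```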
7 Conclusion

In this work we presented Simultaneous Localization and Mapping (SLAM) based on Camera and LiDAR solutions. We first presented the multiple possible sensors that can be used in the SLAM framework and answered the question of why Camera and LiDAR are the best choice for SLAM. A brief presentation was formulated to introduce the theory behind the SLAM problem, its building blocks and the techniques used to solve it. We then reviewed the most famous SLAM algorithms that use either Camera data or Laser data and compared the presented algorithms in each modality. Later, we highlighted the coupled strategies using both Camera and LiDAR data to perform SLAM and classified them, drawing special attention to the strengths and weaknesses of each one of them. We finally discussed the future challenges and open problems that still need to be addressed and presented some future research directions for multi-modal SLAM.

Encouraging results have been obtained in recent research due to the algorithmic maturity that Camera-LiDAR SLAM algorithms have reached. However, designing an end-to-end system embedding these algorithms for autonomous vehicles is far from being an achieved objective, given the complexity of such algorithms and the constraints of the hardware architectures. Hence, a hardware/software co-design approach remains essential and largely unexplored. Building on this, in our future work we will study the design of a sensor-fusion-based, tightly-coupled implementation of SLAM. By combining camera frames with LiDAR point clouds while taking into account algorithmic and hardware constraints, we expect to build a robust, low-cost and low-drift end-to-end real-time embedded SLAM system.

Author Contributions All authors contributed to the study conception and methodology. Abdelhafid El Ouardi and Sergio Rodriguez had the idea for the article and supervised the process. Sergio Rodriguez was in charge of providing useful insights into the algorithmic aspects of the discussed SLAM methods. Abdelhafid El Ouardi was in charge of providing useful insights into the hardware aspects of the discussed SLAM systems. Mohammed Chghaf was in charge of synthesizing recent mono-modal and multi-modal SLAM strategies and prepared the first draft of the manuscript. He was in charge of investigation, literature search and data analysis. Sergio Rodriguez and Abdelhafid El Ouardi commented on previous versions of the manuscript and critically revised the work. All authors read and approved the final manuscript.

Funding This work was supported by the French Ministry of Higher Education, Research and Innovation.

Code Availability Not applicable.

Declarations

Ethics approval and consent to participate Not applicable.

Conflicts of interest/Competing interests The authors declare that they have no known conflicts or competing interests that could have appeared to influence the work reported in this paper.
References

1. Abouzahir, M., Elouardi, A., Latif, R., Bouaziz, S., Tajer, A.: Embedding SLAM algorithms: Has it come of age? Robot. Auton. Syst. 100, 14–26 (2018)
2. Agarwal, S., Mierle, K., et al.: Ceres solver. http://ceres-solver.org
3. Andresen, L., Brandemuehl, A., Hönger, A., Kuan, B., Vödisch, N., Blum, H., Reijgwart, V., Bernreiter, L., Schaupp, L., Chung, J.J., et al.: Accurate mapping and planning for autonomous racing. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4743–4749 (2020)
4. Andrew, A.M.: Multiple view geometry in computer vision. Kybernetes (2001)
5. Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307 (2016)
6. Arandjelovic, R., Zisserman, A.: All about VLAD. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1578–1585 (2013)
7. Sarvrood, Y.B., Hosseinyalamdary, S., Gao, Y.: Visual-LiDAR odometry aided by reduced IMU. ISPRS Int. J. Geo-Inform. 5(1), 3 (2016)
8. Behley, J., Stachniss, C.: Efficient surfel-based SLAM using 3D laser range data in urban environments. In: Robotics: Science and Systems 2018 (2018)
9. Bernuy, F., Ruiz-del Solar, J.: Topological semantic mapping and localization in urban road scenarios. J. Intell. Robot. Syst. 92(1), 19–32 (2018)
10. Berrio, J.S., Worrall, S., Shan, M., Nebot, E.: Long-term map maintenance pipeline for autonomous vehicles. arXiv:2008.12449 (2020)
11. Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611, pp. 586–606. International Society for Optics and Photonics (1992)
12. Biber, P., Straßer, W.: The normal distributions transform: A new approach to laser scan matching. In: Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No. 03CH37453), vol. 3, pp. 2743–2748. IEEE (2003)
13. Blanco, J.-L., Fernández-Madrigal, J.-A., Gonzalez, J.: Toward a unified Bayesian approach to hybrid metric–topological SLAM. IEEE Trans. Robot. 24(2), 259–270 (2008)
14. Borenstein, J., Everett, H.R., Feng, L., Wehe, D.: Mobile robot positioning: Sensors and techniques. J. Robot. Syst. 14(4), 231–249 (1997)
15. Bowman, S.L., Atanasov, N., Daniilidis, K., Pappas, G.J.: Probabilistic data association for semantic SLAM. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1722–1729. IEEE (2017)
16. Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 32(6), 1309–1332 (2016)
17. Campos, C., Elvira, R., Gómez Rodríguez, J.J., Montiel, J., Tardós, J.D.: ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. arXiv:2007.11898 (2020)
18. Cao, F., Zhuang, Y., Zhang, H., Wang, W.: Robust place recognition and loop closing in laser-based SLAM for UGVs in urban environments. IEEE Sensors J. 18(10), 4242–4252 (2018)
19. Censi, A.: An ICP variant using a point-to-line metric. In: 2008 IEEE International Conference on Robotics and Automation, pp. 19–25. IEEE (2008)
20. Chen, X., Milioto, A., Palazzolo, E., Giguere, P., Behley, J., Stachniss, C.: SuMa++: Efficient LiDAR-based semantic SLAM. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4530–4537. IEEE (2019)
21. Concha, A., Civera, J.: DPPTAM: Dense piecewise planar tracking and mapping from a monocular sequence. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5686–5693. IEEE (2015)
22. Cvišić, I., Ćesić, J., Marković, I., Petrović, I.: SOFT-SLAM: Computationally efficient stereo visual simultaneous localization and mapping for autonomous unmanned aerial vehicles. J. Field Robot. 35(4), 578–595 (2018)
23. Das, A., Waslander, S.L.: Scan registration with multi-scale k-means normal distributions transform. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2705–2710. IEEE (2012)
24. Davison, A.J.: Real-time simultaneous localisation and mapping with a single camera. In: IEEE International Conference on Computer Vision, vol. 3, pp. 1403–1403. IEEE Computer Society (2003)
25. Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: MonoSLAM: Real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1052–1067 (2007)
26. De Croce, M., Pire, T., Bergero, F.: DS-PTAM: Distributed stereo parallel tracking and mapping SLAM system. J. Intell. Robot. Syst. 95(2), 365–377 (2019)
27. Debeunne, C., Vivet, D.: A review of visual-LiDAR fusion based simultaneous localization and mapping. Sensors 20(7), 2068 (2020)
28. Dellaert, F.: Factor graphs and GTSAM: A hands-on introduction. Technical report, Georgia Institute of Technology (2012)
29. Deschaud, J.-E.: IMLS-SLAM: Scan-to-model matching based on 3D data. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2480–2485. IEEE (2018)
30. Ding, X., Wang, Y., Xiong, R., Li, D., Li, T., Yin, H., Zhao, L.: Persistent stereo visual localization on cross-modal invariant map. IEEE Trans. Intell. Transp. Syst. 21(11), 4646–4658 (2019)
31. Dissanayake, G.M.W.M., Newman, P., Clark, S., Durrant-Whyte, H.F., Csorba, M.: A solution to the simultaneous localization and map building (SLAM) problem. IEEE Trans. Robot. Autom. 17(3), 229–241 (2001)
32. Dubé, R., Cramariuc, A., Dugas, D., Nieto, J., Siegwart, R., Cadena, C.: SegMap: 3D segment mapping using data-driven descriptors. arXiv:1804.09557 (2018)
33. Einhorn, E., Gross, H.-M.: Generic NDT mapping in dynamic environments and its application for lifelong SLAM. Robot. Auton. Syst. 69, 28–39 (2015)
34. Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 611–625 (2017)
35. Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular SLAM. In: European Conference on Computer Vision, pp. 834–849. Springer (2014)
36. Forster, C., Zhang, Z., Gassner, M., Werlberger, M., Scaramuzza, D.: SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 33(2), 249–265 (2016)
37. Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rendón-Mancha, J.M.: Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev. 43(1), 55–81 (2015)
38. Gálvez-López, D., Tardos, J.D.: Bags of binary words for fast place recognition in image sequences. IEEE Trans. Robot. 28(5), 1188–1197 (2012)
39. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
40. Geiger, A., Ziegler, J., Stiller, C.: StereoScan: Dense 3D reconstruction in real-time. In: Intelligent Vehicles Symposium (IV) (2011)
41. Gong, Z., Ying, R., Wen, F., Qian, J., Liu, P.: Tightly coupled integration of GNSS and vision SLAM using 10-DoF optimization on manifold. IEEE Sensors J. 19(24), 12105–12117 (2019)
42. Graeter, J., Wilczynski, A., Lauer, M.: LIMO: Lidar-monocular visual odometry. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7872–7879. IEEE (2018)
43. Grisetti, G., Kümmerle, R., Stachniss, C., Burgard, W.: A tutorial on graph-based SLAM. IEEE Intell. Transp. Syst. Mag. 2(4), 31–43 (2010)
44. Grisetti, G., Stachniss, C., Burgard, W.: Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Trans. Robot. 23(1), 34–46 (2007)
45. Guo, Y., Sohel, F., Bennamoun, M., Lu, M., Wan, J.: Rotational projection statistics for 3D local surface description and object recognition. Int. J. Comput. Vis. 105(1), 63–86 (2013)
46. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
47. Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In: Experimental Robotics, pp. 477–491. Springer (2014)
48. Hess, W., Kohler, D., Rapp, H., Andor, D.: Real-time loop closure in 2D LiDAR SLAM. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1271–1278. IEEE (2016)
49. Hong, Z., Petillot, Y., Wang, S.: RadarSLAM: Radar based large-scale SLAM in all weathers. arXiv:2005.02198 (2020)
50. Hornung, A., Wurm, K.M., Bennewitz, M., Stachniss, C., Burgard, W.: OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Auton. Robot. 34(3), 189–206 (2013)
51. Houseago, C., Bloesch, M., Leutenegger, S.: KO-Fusion: Dense visual SLAM with tightly-coupled kinematic and odometric tracking. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 4054–4060. IEEE (2019)
52. Hyun, E., Jin, Y.-S., Lee, J.-H.: Moving and stationary target detection scheme using coherent integration and subtraction for automotive FMCW radar systems. In: 2017 IEEE Radar Conference (RadarConf), pp. 0476–0481. IEEE (2017)
53. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360 (2016)
54. Ji, K., Chen, H., Di, H., Gong, J., Xiong, G., Qi, J., Yi, T.: CPFG-SLAM: A robust simultaneous localization and mapping based on LIDAR in off-road environment. In: 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 650–655. IEEE (2018)
55. Jiang, G., Yin, L., Jin, S., Tian, C., Ma, X., Ou, Y.: A simultaneous localization and mapping (SLAM) framework for 2.5D map building based on low-cost LiDAR and vision fusion. Appl. Sci. 9(10), 2105 (2019)
56. Kaess, M., Ranganathan, A., Dellaert, F.: iSAM: Incremental smoothing and mapping. IEEE Trans. Robot. 24(6), 1365–1378 (2008)
57. Kim, G., Kim, A.: Scan Context: Egocentric spatial descriptor for place recognition within 3D point cloud map. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid (2018)
58. Kim, H., Leutenegger, S., Davison, A.J.: Real-time 3D reconstruction and 6-DoF tracking with an event camera. In: European Conference on Computer Vision, pp. 349–364. Springer (2016)
59. Kim, U.-H., Kim, S., Kim, J.-H.: SimVODIS: Simultaneous visual odometry, object detection, and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)
60. Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 225–234. IEEE (2007)
61. Klein, G., Murray, D.: Parallel tracking and mapping on a camera phone. In: 2009 8th IEEE International Symposium on Mixed and Augmented Reality, pp. 83–86. IEEE (2009)
62. Kohlbrecher, S., Meyer, J., von Stryk, O., Klingauf, U.: A flexible and scalable SLAM system with full 3D motion estimation. In: Proc. IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR). IEEE (2011)
63. Konolige, K., Grisetti, G., Kümmerle, R., Burgard, W., Limketkai, B., Vincent, R.: Efficient sparse pose adjustment for 2D mapping. In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 22–29. IEEE (2010)
64. Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g2o: A general framework for graph optimization. In: 2011 IEEE International Conference on Robotics and Automation, pp. 3607–3613. IEEE (2011)
65. Laidlow, T., Bloesch, M., Li, W., Leutenegger, S.: Dense RGB-D-Inertial SLAM with map deformations. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6741–6748. IEEE (2017)
66. Li, Q., Chen, S., Wang, C., Li, X., Wen, C., Cheng, M., Li, J.: LO-Net: Deep real-time LiDAR odometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8473–8482 (2019)
67. Li, R., Wang, S., Gu, D.: DeepSLAM: A robust monocular SLAM system with unsupervised deep learning. IEEE Trans. Ind. Electron. 68(4), 3577–3587 (2020)
68. Li, Y., Ushiku, Y., Harada, T.: Pose graph optimization for unsupervised monocular visual odometry. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 5439–5445. IEEE (2019)
69. Liang, X., Chen, H., Li, Y., Liu, Y.: Visual laser-SLAM in large-scale indoor environments. In: 2016 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 19–24. IEEE (2016)
70. Lin, J., Zheng, C., Xu, W., Fu, Z.: R2LIVE: A robust, real-time, LiDAR-inertial-visual tightly-coupled state estimator and mapping. arXiv:2102.12400 (2021)
71. Liu, Q., Duan, F.: Fast and consistent matching for landmark-based place recognition. Journal of Intelligent & Robotic Systems, 1–14 (2020)
72. Liu, Y., Yang, D., Li, J., Gu, Y., Pi, J., Zhang, X.: Stereo visual-inertial SLAM with points and lines. IEEE Access 6, 69381–69392 (2018)
73. Liu, Y., Thrun, S.: Results for outdoor-SLAM using sparse extended information filters. In: 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), vol. 1, pp. 1227–1233. IEEE (2003)
74. López, E., García, S., Barea, R., Bergasa, L.M., Molinos, E.J., Arroyo, R., Romera, E., Pardo, S.: A multi-sensorial simultaneous localization and mapping (SLAM) system for low-cost micro aerial vehicles in GPS-denied environments. Sensors 17(4), 802 (2017)
75. Low, K.-L.: Linear least-squares optimization for point-to-plane ICP surface registration. Chapel Hill, University of North Carolina 4(10), 1–3 (2004)
76. Lu, W., Wan, G., Zhou, Y., Fu, X., Yuan, P., Song, S.: DeepICP: An end-to-end deep neural network for 3D point cloud registration. arXiv:1905.04153 (2019)
77. Lu, W., Wan, G., Zhou, Y., Fu, X., Yuan, P., Song, S.: DeepVCP: An end-to-end deep neural network for point cloud registration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12–21 (2019)
78. Lu, W., Zhou, Y., Wan, G., Hou, S., Song, S.: L3-Net: Towards learning based LiDAR localization for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6389–6398 (2019)
79. Magnusson, M., Lilienthal, A., Duckett, T.: Scan registration for autonomous mining vehicles using 3D-NDT. J. Field Robot. 24(10), 803–827 (2007)
80. McCormac, J., Clark, R., Bloesch, M., Davison, A., Leutenegger, S.: Fusion++: Volumetric object-level SLAM. In: 2018 International Conference on 3D Vision (3DV), pp. 32–41. IEEE (2018)
81. Mendes, E., Koch, P., Lacroix, S.: ICP-based pose-graph SLAM. In: 2016 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 195–200. IEEE (2016)
82. Mithun, N.C., Sikka, K., Chiu, H.-P., Samarasekera, S., Kumar, R.: RGB2LIDAR: Towards solving large-scale cross-modal visual localization. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 934–954 (2020)
83. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP (1), pp. 331–340 (2009)
84. Munoz-Salinas, R., Medina-Carnicer, R.: UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers. Pattern Recogn. 101, 107193 (2020)
85. Mur-Artal, R., Martinez Montiel, J.M., Tardos, J.D.: ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31(5), 1147–1163 (2015)
86. Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 33(5), 1255–1262 (2017)
87. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A.: KinectFusion: Real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136 (2011)
88. Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: DTAM: Dense tracking and mapping in real-time. In: 2011 International Conference on Computer Vision, pp. 2320–2327. IEEE (2011)
89. Newman, P., Cole, D., Ho, K.: Outdoor SLAM using visual appearance and laser ranging. In: Proceedings 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), pp. 1180–1187. IEEE (2006)
90. Bowden, R., Kaygusuz, N., Mendez, O.: MDN-VO: Estimating visual odometry with confidence. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)
91. Pascoe, G., Maddern, W., Tanner, M., Piniés, P., Newman, P.: NID-SLAM: Robust monocular SLAM using normalised information distance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1435–1444 (2017)
92. Pfrommer, B., Daniilidis, K.: TagSLAM: Robust SLAM with fiducial markers. arXiv:1910.00679 (2019)
93. Pire, T., Fischer, T., Civera, J., De Cristóforis, P., Berlles, J.J.: Stereo parallel tracking and mapping for robot localization. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1373–1378. IEEE (2015)
94. Polok, L., Ila, V., Solony, M., Smrz, P., Zemcik, P.: Incremental block Cholesky factorization for nonlinear least squares in robotics. In: Robotics: Science and Systems, pp. 328–336 (2013)
95. Prophet, R., Li, G., Sturm, C., Vossiek, M.: Semantic segmentation on automotive radar maps. In: 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 756–763. IEEE (2019)
96. Qayyum, U., Ahsan, Q., Mahmood, Z.: IMU aided RGB-D SLAM. In: 2017 14th International Bhurban Conference on Applied Sciences and Technology (IBCAST), pp. 337–341. IEEE (2017)
97. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
98. Radmanesh, R., Wang, Z., Chipade, V.S., Tsechpenakis, G., Panagou, D.: LIV-LAM: LiDAR and visual localization and mapping. In: 2020 American Control Conference (ACC), pp. 659–664. IEEE (2020)
99. Rebecq, H., Horstschäfer, T., Gallego, G., Scaramuzza, D.: EVO: A geometric approach to event-based 6-DOF parallel tracking and mapping in real-time. IEEE Robot. Autom. Lett. 2(2), 593–600 (2016)
100. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv:1804.02767 (2018)
101. Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (FPFH) for 3D registration. In: 2009 IEEE International Conference on Robotics and Automation, pp. 3212–3217. IEEE (2009)
102. Salas-Moreno, R.F., Glocker, B., Kelly, P.H.J., Davison, A.J.: Dense planar SLAM. In: 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 157–164. IEEE (2014)
103. Sallab, A.E., Sobh, I., Zahran, M., Essam, N.: LiDAR sensor modeling and data augmentation with GANs for autonomous driving. arXiv:1905.07290 (2019)
104. Schuster, F., Keller, C.G., Rapp, M., Haueis, M., Curio, C.: Landmark based radar SLAM using graph optimization. In: 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 2559–2564. IEEE (2016)
105. Segal, A., Haehnel, D., Thrun, S.: Generalized-ICP. In: Robotics: Science and Systems, vol. 2, p. 435, Seattle (2009)
106. Seo, Y., Chou, C.-C.: A tight coupling of vision-LiDAR measurements for an effective odometry. IEEE (2019)
107. Servières, M., Renaudin, V., Dupuis, A., Antigny, N.: Visual and visual-inertial SLAM: State of the art, classification, and experimental benchmarking. Journal of Sensors 2021 (2021)
108. Shan, T., Englot, B.: LeGO-LOAM: Lightweight and ground-optimized LiDAR odometry and mapping on variable terrain. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4758–4765. IEEE (2018)
109. Shan, T., Englot, B., Meyers, D., Wang, W., Ratti, C., Rus, D.: LIO-SAM: Tightly-coupled LiDAR inertial odometry via smoothing and mapping. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5135–5142. IEEE (2020)
110. Shao, W., Vijayarangan, S., Li, C., Kantor, G.: Stereo visual inertial LiDAR simultaneous localization and mapping. arXiv:1902.10741 (2019)
111. Shin, H., Kim, D., Kwon, Y., Kim, Y.: Illusion and dazzle: Adversarial optical channel exploits against LiDARs for automotive applications. In: International Conference on Cryptographic Hardware and Embedded Systems, pp. 445–467. Springer (2017)
112. Shin, Y.-S., Park, Y.S., Kim, A.: Direct visual SLAM using sparse depth for camera-LiDAR system. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5144–5151. IEEE (2018)
113. Shin, Y.-S., Park, Y.S., Kim, A.: DVL-SLAM: Sparse depth enhanced direct visual-LiDAR SLAM. Auton. Robot. 44(2), 115–130 (2020)
114. Song, H., Shin, H.-C.: Classification and spectral mapping of stationary and moving objects in road environments using FMCW radar. IEEE Access 8, 22955–22963 (2020)
115. Steder, B., Rusu, R.B., Konolige, K., Burgard, W.: NARF: 3D range image features for object recognition. In: Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics at the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), vol. 44 (2010)
116. Strasdat, H., Montiel, J., Davison, A.J.: Visual SLAM: Why filter? Image Vis. Comput. 30(2), 65–77 (2012)
117. Sünderhauf, N., Pham, T.T., Latif, Y., Milford, M., Reid, I.: Meaningful maps with object-oriented semantic mapping. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5079–5085. IEEE (2017)
118. Szeliski, R.: Computer Vision: Algorithms and Applications. Springer Science & Business Media (2010)
119. Taketomi, T., Uchiyama, H., Ikeda, S.: Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Trans. Comput. Vis. Applic. 9(1), 1–11 (2017)
120. Tam, G.K.L., Cheng, Z.-Q., Lai, Y.-K., Langbein, F.C., Liu, Y., Marshall, D., Martin, R.R., Sun, X.-F., Rosin, P.L.: Registration of 3D point clouds and meshes: A survey from rigid to nonrigid. IEEE Trans. Visual. Comput. Graph. 19(7), 1199–1217 (2012)
121. Tian, Y., Suwoyo, H., Wang, W., Mbemba, D., Li, L.: An AEKF-SLAM algorithm with recursive noise statistic based on MLE and EM. J. Intell. Robot. Syst. 97(2), 339–355 (2020)
122. Uy, M.A., Lee, G.H.: PointNetVLAD: Deep point cloud based retrieval for large-scale place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4470–4479 (2018)
123. Vidal, A.R., Rebecq, H., Horstschaefer, T., Scaramuzza, D.: Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robot. Autom. Lett. 3(2), 994–1001 (2018)
124. Wan, Z., Yu, B., Li, T.Y., Tang, J., Zhu, Y., Yu, W., Raychowdhury, A., Liu, S.: A survey of FPGA-based robotic computing. IEEE Circ. Syst. Mag. 21(2), 48–74 (2021)
125. Wang, R., Schworer, M., Cremers, D.: Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3903–3911 (2017)
126. Wang, Y., Shi, T., Yun, P., Tai, L., Liu, M.: PointSeg: Real-time semantic segmentation based on 3D LiDAR point cloud. arXiv:1807.06288 (2018)
127. Wofk, D., Ma, F., Yang, T.-J., Karaman, S., Sze, V.: FastDepth: Fast monocular depth estimation on embedded systems. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 6101–6108. IEEE (2019)
128. Yan, M., Wang, J., Li, J., Zhang, C.: Loose coupling visual-LiDAR odometry by combining VISO2 and LOAM. In: 2017 36th Chinese Control Conference (CCC), pp. 6841–6846. IEEE (2017)
129. Yang, N., von Stumberg, L., Wang, R., Cremers, D.: D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1281–1292 (2020)
130. Yu, W., Amigoni, F.: Standard for robot map data representation for navigation. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3–4 (2014)
131. Zhang, G., Liu, H., Dong, Z., Jia, J., Wong, T.-T., Bao, H.: Efficient non-consecutive feature tracking for robust structure-from-motion. IEEE Trans. Image Process. 25(12), 5957–5970 (2016)
132. Zhang, J., Kaess, M., Singh, S.: Real-time depth enhanced monocular odometry. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE (2014)
133. Zhang, J., Kaess, M., Singh, S.: A real-time method for depth enhanced visual odometry. Auton. Robot. 41(1), 31–43 (2017)
134. Zhang, J., Singh, S.: LOAM: Lidar odometry and mapping in real-time. In: Robotics: Science and Systems, vol. 2 (2014)
135. Zhang, J., Singh, S.: Visual-LiDAR odometry and mapping: Low-drift, robust, and fast. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 2174–2181. IEEE (2015)
136. Zheng, X., Huang, B., Ni, D., Xu, Q.: A novel intelligent vehicle risk assessment method combined with multi-sensor fusion in dense traffic environment. Journal of Intelligent and Connected Vehicles (2018)
137. Zhou, Y., Gallego, G., Shen, S.: Event-based stereo visual odometry. arXiv:2007.15548 (2020)
138. Zuo, X., Geneva, P., Lee, W., Liu, Y., Huang, G.: LIC-Fusion: LiDAR-Inertial-Camera odometry. arXiv:1909.04102 (2019)
139. Zuo, X., Yang, Y., Geneva, P., Lv, J., Liu, Y., Huang, G., Pollefeys, M.: LIC-Fusion 2.0: LiDAR-inertial-camera odometry with sliding-window plane-feature tracking. arXiv:2008.07196 (2020)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Mohammed Chghaf received the M.S. degree in electrical engineering in 2015 (specializing in automation and industrial computing) from Mohammadia School of Engineers, Rabat, Morocco. He received the M.S. degree in embedded systems and data processing in 2019 from Paris-Saclay University, Orsay, France. He is currently pursuing the Ph.D. degree with the SATIE laboratory, Paris-Saclay University. His research activities are focused on multi-modal embedded SLAM systems for autonomous vehicles.

Sergio Rodriguez received his M.S. and Ph.D. degrees from the University of Technology of Compiègne, France, in 2007 and 2009, respectively. Since 2011, he has been an Associate Professor at Paris-Sud University. His research activities are focused on dynamic scene analysis through multimodal perception intended to enhance Intelligent Transportation Systems applications.

Abdelhafid El Ouardi received the M.S. degree from Pierre and Marie Curie University in 2001 and the Ph.D. degree in Electronics from Paris-Sud University in 2005. He worked at Henri Poincaré University, Nancy, as a researcher in 2005–2006. He is currently an Associate Professor at Paris-Sud University, France. In the Embedded Systems team of the SATIE lab, his research interests include hardware-software co-design, evaluation and instrumentation of embedded systems, design of smart architectures for image and signal processing, and SLAM applications.