Scalable Autonomous Drone Flight in the Forest with Visual-Inertial
SLAM and Dense Submaps Built without LiDAR
Sebastián Barbas Laina¹,³, Simon Boche¹, Sotiris Papatheodorou¹,²,³, Dimos Tzoumanikas², Simon Schaefer¹, Hanzhi Chen¹ and Stefan Leutenegger¹,²,³
¹ Smart Robotics Lab, School of Computation, Information and Technology, Technical University of Munich. E-mail addresses: {sebastian.barbas, simon.boche, sotiris.papatheodorou, simon.k.schaefer, hanzhi.chen, stefan.leutenegger}@tum.de
² Smart Robotics Lab, Department of Computing, Imperial College London. E-mail addresses: {s.papatheodorou18, dimosthenis.tzoumanikas14, s.leutenegger}@ic.ac.uk
³ Munich Institute of Robotics and Machine Intelligence (MIRMI).
This work was supported by the Technical University of Munich, MIRMI, the TUM Innovation Network CoConstruct, Leica Geosystems AG, and the EU Horizon project DigiForest.

Abstract— Forestry constitutes a key element for a sustainable future, while it is supremely challenging to introduce digital processes to improve efficiency. The main limitation is the difficulty of obtaining accurate maps at high temporal and spatial resolution as a basis for informed forestry decision-making, due to the vast area forests extend over and the sheer number of trees. To address this challenge, we present an autonomous Micro Aerial Vehicle (MAV) system which relies purely on cost-effective and light-weight passive visual and inertial sensors to perform under-canopy autonomous navigation. We leverage visual-inertial simultaneous localization and mapping (VI-SLAM) for accurate MAV state estimates and couple it with a volumetric occupancy submapping system to achieve a scalable mapping framework which can be directly used for path planning. As opposed to a monolithic map, submaps inherently deal with the inevitable drift and corrections from VI-SLAM, since they move with the pose estimates as these are updated. To ensure the safety of the MAV during navigation, we also propose a novel reference trajectory anchoring scheme that moves and deforms the reference trajectory the MAV is tracking upon state updates from the VI-SLAM system in a consistent way, even upon large changes in the state estimates due to loop-closures. We thoroughly validate our system in both real and simulated forest environments with high tree densities in excess of 400 trees per hectare and at speeds up to 3 m/s – while not encountering a single collision or system failure. To the best of our knowledge, this is the first system which achieves this level of performance in such unstructured environments using low-cost passive visual sensors and fully on-board computation, including VI-SLAM.

Fig. 1: Mesh of a real-world environment generated using our proposed approach. The executed mission consists of a back and forth trajectory with a small loop at the end. The red line shows the drone trajectory flying out, and the white one is the returning trajectory. The red sphere represents the start of the mission and the white one the end of it. Right: the drone used in these real-world experiments.

I. INTRODUCTION

With 31% of Earth covered by forests or woodlands, their well-being is vital for all inhabitants of Earth due to their essential role of absorbing carbon dioxide and producing oxygen. To move towards more sustainable, well-informed forest management, accurate and large-scale under-canopy tree-level data is required for more precise and data-driven decision-making. This challenging objective can only be achieved by leveraging recent advances in robotics, environment perception and data analysis.

For precise data acquisition, aerial robots prove to be an ideal solution, as they are not affected by the heterogeneous ground conditions present in a forest; furthermore, they are able to acquire data from all perspectives due to their available degrees of freedom. Nonetheless, forests present vast challenges: they must to a large extent be considered GPS-denied environments when performing under-canopy missions; and these environments are cluttered and unstructured, demanding precise mapping and perhaps even consideration of moving branches for safe autonomous navigation.

To fly mapping missions autonomously, aerial vehicles require robust capabilities of state estimation, dense mapping and a flight controller which can track the commanded paths – in order to safely navigate the environment without collisions. At the same time, during a mission, odometry drift is unavoidable, but it needs to be corrected continuously, and especially upon loop closure, to keep a consistent map representation for effective navigation therein and to communicate it to end-users.

State-of-the-art autonomous drone flying in forests typically relies on LiDAR, e.g. [1], [2], for precise and far-reaching range measurements. In this work, we aim to overcome such constraints employing only passive vision and an Inertial Measurement Unit (IMU), in a quest towards potentially cheaper, lighter, and thus more scalable drone systems which are flexibly deployable in cluttered environments such as forests; able to pass through small gaps; and ideally without sacrificing flight velocity and endurance.
In short, the main contributions of this paper are:

• We present an autonomous drone system which performs under-canopy autonomous navigation with only (passive) visual and inertial sensors. It performs visual-inertial SLAM (VI-SLAM), deep-learning-based depth estimation from its stereo camera driving volumetric occupancy mapping in submaps, path planning and trajectory tracking – with all computations on-board.
• We introduce a novel approach to trajectory deformation at odometry rate, to continue performing trajectory tracking upon state estimation updates, most pronounced when loop-closures occur. This allows the MAV to seamlessly track planned paths without needing to stop upon odometry shifts.
• We demonstrate our approach in simulated and real-world forest environments of up to 467 trees per hectare and achieve flight velocities of up to 3 m/s without any collisions at all. To the best of our knowledge, our system is the first of its kind, i.e. not using LiDAR and flying in such dense forests while performing SLAM and mapping in the loop, allowing for missions which require precise mapping.
• We perform a mapping completeness and accuracy analysis to demonstrate the benefits of online maps generated via SLAM poses instead of visual-inertial odometry. We demonstrate that with more loop-closures, the environment reconstruction quality is improved.

II. RELATED WORK

In this section we focus on providing a comprehensive overview of complete frameworks for autonomous navigation, regardless of the specific sensors employed. Our goal is to create a vision-based system for autonomous navigation of MAVs that can operate in large, unstructured and cluttered environments, while building a map of the environment for downstream tasks, and flying as fast as possible. Additionally, in large-scale autonomous navigation, a consistent communication network cannot be assumed, so a core capability of our system is that the algorithm must run on the MAV's onboard computer.

Several systems which leverage vision sensors for autonomous navigation have been proposed in the literature. One of the earliest versions of vision-based navigation was presented in [3]. Based on a monocular camera and IMU sensors, the system was capable of performing state estimation used for trajectory tracking. Simultaneously, data was streamed via WiFi to an off-board laptop which performed a dense 3D reconstruction of the environment via multi-view stereo techniques. This method highlighted the potential of autonomous navigation based on vision systems, but with certain limitations: the system relies on WiFi for the dense 3D reconstruction, and there was no feedback between the perceived 3D reconstructed environment and the system – trajectories were given by an operator without any collision avoidance mechanism onboard, leading to potentially risky missions.

Lin et al. present a seminal work in the area of vision-based autonomous navigation in [4]. The system leverages visual-inertial sensors to compute MAV odometry. The key distinction with our approach is their usage of monolithic 3D maps based on OctoMap [5], which is suboptimal for outdoor autonomous navigation for several reasons. Primarily, a larger octree needs to be allocated as the mission area increases, leading to slower data allocation in the octree. Secondly, the requirement to predefine the area of exploration poses a constraint. Moreover, as odometry drift accumulates during navigation, repeated geometric structures will appear within the map. The lack of a corrective mechanism for this issue can lead to an artificially more cluttered map, hindering its use for path planning.

A system closely aligned with our work is [6], where an MAV based on visual-inertial sensors is capable of autonomous navigation while generating dense maps for path planning. While [6] relies on monolithic maps based on [7] and [8], we present a submapping approach based on [9], which can account for free, occupied and unknown space. The main differences between the systems are: [6] performs visual-inertial odometry (VIO) [10] while our proposed solution performs VI-SLAM and handles loop-closures, which is a challenge for autonomous systems; in [6] mapping is based on ESDFs, whilst our proposed system is based on occupancy values, allowing for faster depth integration into submaps; [6] performs both global and local planning, while our system relies on global planning; [6] relies on depth data from RGB-D sensors whereas our system leverages deep learning for stereo depth estimation; finally, the proposed system performs trajectory deformation to counteract odometry drift and loop-closure updates.

Another work which also relies on visual sensors for autonomous navigation in forestry environments was presented by Zhou et al. [11]. This work extends [12] by enabling swarms of MAVs to navigate in cluttered environments. Based on an ESDF mapping system, smooth trajectories are obtained to navigate around the present obstacles. The differences with our approach are: the use of VIO, based on [13], instead of SLAM; the system is designed to work with swarms of robots while ours is designed for a single one; and in [11], real-world experiments were performed with a maximum navigation velocity of 1.5 m/s, lower than our maximum velocity of 3 m/s, a critical parameter for large-scale missions.

A work which tackles the issue of high-speed navigation for MAVs in cluttered and unstructured environments is [14]. They propose an end-to-end learning approach based on privileged learning which allows the MAV to perform high-speed autonomous flights in environments outside the training set. Although their approach obtains higher velocities, it does not build a map of the environment, a component we consider crucial for downstream tasks and for collaborative robotics, as it serves as a common environment representation. The map of the environment also allows our approach to compute collision-free reference trajectories, whereas theirs assumes that a trajectory is provided, either by a higher-level planner or by the user. Also, this approach does not allow performing missions which require knowledge of mapping completeness or mapped area.
There are more works which rely on visual sensors for autonomous navigation [15], [16]. These rely on VIO instead of SLAM, which is not able to correct the state estimation drift that occurs over extended missions. As we will demonstrate, drift is the main source of error in map reconstruction, making drift correction techniques, such as loop-closure optimization, necessary for these types of applications. Accurate submaps and their poses are critical for safe navigation, as accurate representations of free, occupied or unknown space are critical for safe path planning. Downstream tasks which require accurate online perception, such as exploration, where the goal is to completely explore a volume, also benefit from loop-closures and drift compensation.

Another prominent direction in autonomous navigation are systems which rely on LiDAR sensors. These systems have demonstrated improved performance compared with vision-based systems due to the unmatched depth perception capabilities of these sensors. Some of the most important work regarding autonomous navigation with LiDAR sensors was presented in [2], by the winners of the DARPA Subterranean Challenge. This team included MAVs to perform subterranean exploration in conjunction with legged robots. To perform state estimation, they relied on multi-sensor fusion of visual and thermal imagery, LiDAR depth data, inertial cues, and kinematic pose estimates based on [17]. This sensor redundancy provided robustness to sensor degradation or malfunctioning.

Another example of these systems was presented by Liu et al. [1]. This system loosely couples semantic LiDAR odometry and visual-inertial states to estimate the odometry of the MAV. They base their work on [18], [19] and improve SLOAM [20] by incorporating visual-inertial odometry and a lightweight semantic segmentation network designed for LiDAR data. They also showcase their system in under-canopy forest navigation and, to the best of our knowledge, it is the only such system which leverages SLAM instead of plain odometry for state estimation. Being able to handle loop-closures, which produce odometry discontinuities, is a major issue for autonomous navigation with a SLAM framework. In their approach, they perform path planning in a visual odometry reference frame. On each update, the drift between the SLAM reference frame and the visual odometry reference frame is computed and the planned trajectory is transformed with this estimated drift. In our approach, we instead anchor the trajectory to prior poses, as we know they were in free space, and then deform the trajectory based on the updates to these pose anchors. Thanks to the anchoring technique, we do not need to run a VIO and a SLAM system in parallel, but can work with a single state estimator, reducing the overall computational cost of the system.

Compared with these systems, the main difference of the proposed method is that it does not rely on LiDAR sensors but only on passive visual sensors. This has two main advantages: it reduces the mass of the system, which reduces battery consumption, and it also reduces the cost of the system.

III. PROPOSED APPROACH

In this section we present the different modules which build up our system. A schematic overview is presented in Fig. 2.

Fig. 2: Overall approach overview: incoming stereo frames and IMU measurements are consumed by the OKVIS2 VI-SLAM system. In parallel, the stereo images are processed by a CNN to output a depth map, which is then integrated into occupancy submaps. These in turn are used by a planner, generating a reference trajectory to be followed by the MPC. Upon state updates, especially pronounced after loop-closures, the trajectory is kept anchored to state estimates, i.e. moved and deformed in-line with the estimates and submaps.

A. MAV platform

We use a custom-built quadrotor as depicted in Fig. 1, consisting of a Holybro S500 V2 ARF-Kit airframe¹, a PX4 flight controller², an Intel RealSense D455³ (of which we only use the IR stereo images and the IMU), and an NVIDIA Jetson Orin NX (16GB)⁴ for the processing (CPU and GPU), which is carried out entirely on-board. The total mass including battery amounts to 1.3 kg, allowing flight times of around 10 minutes. We also replicated the setup in simulation using Ignition Gazebo⁵.

¹ https://holybro.com/products/s500-v2-kit
² https://px4.io
³ https://www.intelrealsense.com/depth-camera-d455
⁴ https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin
⁵ https://gazebosim.org

B. Notation and Definitions

The used VI-SLAM [21] tracks a moving body with a mounted IMU and several cameras relative to a static World coordinate frame $\mathcal{F}_W$. The IMU coordinate frame is denoted as $\mathcal{F}_S$ and the camera frames as $\mathcal{F}_{C_i}$, whereby we use a stereo setup and therefore $i \in \{1, 2\}$. Left-hand indices denote the coordinate representation. Homogeneous position vectors (denoted in italic font) can be transformed with $T_{AB}$, meaning ${}_A r_P = T_{AB}\,{}_B r_P$.
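To make this notation concrete, here is a minimal sketch of the frame convention (ours, not from the paper), using plain numpy with 4×4 homogeneous matrices and identity placeholder values:

```python
import numpy as np

def transform_point(T_AB: np.ndarray, r_P_B: np.ndarray) -> np.ndarray:
    """Re-express homogeneous position vector B_r_P (shape (4,)) in frame A:
    A_r_P = T_AB * B_r_P."""
    return T_AB @ r_P_B

# Example: a point 1 m in front of camera 1, expressed in the World frame.
T_WS = np.eye(4)                          # IMU pose in World (from the estimator)
T_SC1 = np.eye(4)                         # camera-1 extrinsics (IMU-to-camera)
T_WC1 = T_WS @ T_SC1                      # composition: T_WC1 = T_WS * T_SC1
r_P_C1 = np.array([0.0, 0.0, 1.0, 1.0])   # homogeneous point in the camera frame
r_P_W = transform_point(T_WC1, r_P_C1)    # the same point in World coordinates
```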
C. State Estimation

Our state estimation module is based on the existing VI-SLAM system OKVIS2; we provide a brief recap of its key components here and refer to [21] for details. OKVIS2 requires IMU measurements and stereo images as inputs to estimate the states and a sparse map of the environment, based on the visual landmarks extracted from the images. The state representation in this work is

$x_l = \left[{}_W r_S^T,\; q_{WS}^T,\; {}_W v^T,\; b_g^T,\; b_a^T\right]^T$, (1)

i.e. the position ${}_W r_S$ (of the IMU frame origin) expressed in World coordinates, its orientation quaternion $q_{WS}$, velocity ${}_W v$, and gyroscope and accelerometer biases $b_g$ and $b_a$, respectively. A new state $x_l$ is estimated whenever a pair of stereo images is received. In the estimator, two non-linear least squares optimizations are present: one executed in a real-time manner, in which the optimization runs over the last T frames and M keyframes, considering IMU measurements in-between, as well as reprojection errors and pose-graph errors formulated from marginalised co-observations; and a second optimization problem, using the exact same factor graph, which is triggered on loop-closure in the background and also optimises older states around the loop. This means that when processing a new frame, not only its state $x_l$ is estimated, but possibly many prior states $x_{l-k}$ are also updated.
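For illustration, the state of Eq. (1) could be held in a container like the following minimal sketch (field names are ours, not OKVIS2's API):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class EstimatedState:
    """One estimated state x_l of Eq. (1); all quantities in SI units."""
    timestamp: float
    r_WS: np.ndarray  # IMU-frame origin position in World coordinates, shape (3,)
    q_WS: np.ndarray  # orientation quaternion of the IMU w.r.t. World, shape (4,)
    v_W: np.ndarray   # velocity expressed in World coordinates, shape (3,)
    b_g: np.ndarray   # gyroscope bias, shape (3,)
    b_a: np.ndarray   # accelerometer bias, shape (3,)
```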
D. Model Predictive Controller

The (linear) model predictive controller (MPC) is based on our prior work presented in [22], although modifications were made so that it can accept the trajectory updates produced by the trajectory anchoring described below in Sec. III-F. Previously, the system would receive the odometry states asynchronously from the trajectory states to be tracked. For correct trajectory tracking in the presence of major odometry changes, as happens with loop closures, the odometry update has to be processed synchronously with the deformed trajectory to ensure smooth and safe navigation.

The MPC obtains odometry updates at a minimum rate of 40 Hz to compute the attitude commands sent to the MAV's internal PX4 attitude controller via MAVROS⁶. To achieve a rate higher than the camera framerate, IMU integration is performed from the most recent state estimate to the most recent IMU measurement received.

⁶ http://wiki.ros.org/mavros
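As a rough illustration of such IMU forward-propagation, here is a simplified, bias-corrected Euler integration step (our own sketch, not the authors' implementation; the z-up gravity vector and names are assumptions):

```python
import numpy as np

GRAVITY_W = np.array([0.0, 0.0, -9.81])  # gravity in the World frame (assumed z-up)

def propagate_state(r_WS, v_W, R_WS, b_g, b_a, gyro, accel, dt):
    """One Euler step from the latest estimate towards the newest IMU sample.

    r_WS, v_W: position / velocity in World; R_WS: 3x3 rotation (World <- IMU);
    gyro, accel: raw IMU measurements in the IMU frame; b_g, b_a: estimated biases.
    """
    omega = gyro - b_g                        # bias-corrected angular rate
    a_W = R_WS @ (accel - b_a) + GRAVITY_W    # bias-corrected acceleration in World
    r_WS = r_WS + v_W * dt + 0.5 * a_W * dt**2
    v_W = v_W + a_W * dt
    # Rotation update R <- R * exp([omega * dt]_x) via Rodrigues' formula.
    theta = omega * dt
    angle = np.linalg.norm(theta)
    if angle > 1e-12:
        k = theta / angle
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R_WS = R_WS @ (np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K))
    return r_WS, v_W, R_WS
```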
E. Submap Interface

To perform tasks such as path planning and navigation, a mapping system needs to store information about the perceived environment as a basis for planning and executing actions. As the sparse point clouds produced by OKVIS2 are not sufficient to this end, we adopt a dense mapping framework based on supereight2 [9], an octree-based volumetric mapping approach which can represent occupied, free and unobserved space. At the same time, it leverages multi-resolution, enabling faster integration of depth or range measurements into each map, as well as faster collision checks during path planning.

To account for inaccuracies and drift corrections of the underlying SLAM system, it is quintessential to refrain from the concept of a monolithic volumetric map. Otherwise, maps inconsistent with trajectory estimates are constructed, which are artificially cluttered with duplicated structure after corrections – in essence entirely useless for any form of safe navigation therein.

We therefore adopt a submapping approach, in which submaps are associated with keyframes from OKVIS2 and rigidly move with them. This is based on the assumption that locally, pose drift is negligible, allowing several depth images to be fused into a local submap.

As classical stereo-matching algorithms could fail in challenging outdoor scenarios and only yield sparse output, we developed a deep stereo network able to infer dense depth based on [23], and fine-tuned it with large-scale synthetic forestry data from [24]. To generate the map, given a set of depth images $D_t$ obtained via stereo depth estimation at a time step t, we aim to generate an occupancy-based volumetric map. OKVIS2 computes a state $x_t$ from the same stereo images, thus $D_t$ is inherently associated with the pose estimates in $x_t$, and we can directly use the pose of camera 1, in which $D_t$ is expressed, i.e. $T_{WC_1,t} = T_{WS_t} T_{SC_1}$, where $T_{SC_1}$ denotes the extrinsics calibration of camera 1.

Fig. 3: Submap generation example every two OKVIS2 keyframes: note the differentiation between unknown, free, and occupied space per submap. Submaps are aligned with the respective keyframes and move with them when OKVIS2 re-estimates them – which is pronounced upon loop-closure.

Fig. 4: Reference trajectory anchoring: when estimated states change, e.g. upon loop closure, the planned reference trajectory moves in-line with the states and submaps associated with some of the keyframe states. The previous state trajectory, keyframe, and submap poses are drawn in gray, whereas the updated state trajectory is black. The previous trajectory (light pink) is moved and deformed (pink) in line with the submaps (illustrative example with exaggerated change).
The submap generation policy is based on the keyframe selection done by OKVIS2. Every n keyframes, a new submap with pose $T_{WS_k}$ is generated, associated with the k-th keyframe of the trajectory, into which the upcoming depth images are integrated as they are processed. An example of the submap generation policy can be found in Fig. 3. As both $T_{WS_k}$ and $T_{WC_1,t}$ are known, the relative transform $T_{S_kC_1,t}$ can be computed, i.e. the pose of the depth frame at time step t relative to the submap coordinate frame $\mathcal{F}_{S_k}$ corresponding to the pose at time step k, allowing depth frame integration. On keyframe state updates, the poses of submaps move accordingly.
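A minimal sketch of this relative-transform computation (4×4 numpy matrices; the `integrate_depth` call is a hypothetical stand-in for the supereight2 interface, not its actual API):

```python
import numpy as np

def depth_pose_in_submap(T_WSk: np.ndarray, T_WSt: np.ndarray,
                         T_SC1: np.ndarray) -> np.ndarray:
    """Pose of the camera-1 depth frame at time t in the submap frame of keyframe k.

    With T_WC1_t = T_WSt * T_SC1, it follows that
    T_SkC1_t = inv(T_WSk) * T_WSt * T_SC1 (all 4x4 homogeneous matrices).
    """
    return np.linalg.inv(T_WSk) @ T_WSt @ T_SC1

# Hypothetical usage: integrate depth image D_t into submap k.
# T_SkC1_t = depth_pose_in_submap(T_WSk, T_WSt, T_SC1)
# submap_k.integrate_depth(D_t, T_SkC1_t)   # stand-in for the supereight2 call
```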
F. Path Planning and Reference Trajectory Anchoring

Our path planning is based on OMPL [25], specifically using the Informed RRT* [26] algorithm. To extend the path planner to the submap structure leveraged in our work, we assume that if a path segment (comprised of a cylinder and a half-sphere at the end of the segment, both of the same radius) is free in any of the last S submaps, it is a valid segment to navigate through. As our occupancy maps distinguish between free, occupied and unknown space, we consider a segment free if the log-odds occupancy value is below a threshold $\alpha < 0$, which only occurs in free space.
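A sketch of how such a segment-freeness test might look (the `occupancy_log_odds` accessor, the threshold value and the sampling resolution are our assumptions, not the actual supereight2 API):

```python
import numpy as np

ALPHA = -5.0  # log-odds free-space threshold (alpha < 0); the value is illustrative

def segment_is_free(submaps, start, end, radius, step=0.05):
    """Approximate check of a cylinder from `start` to `end` with an end half-sphere.

    A segment is valid if it is free in ANY of the last S submaps. Each submap is
    assumed to expose occupancy_log_odds(point) -> float (hypothetical interface);
    here only the centre-line is sampled, extended by `radius` to cover the
    half-sphere cap; a full check would also query the volume around each sample.
    """
    length = np.linalg.norm(end - start)
    direction = (end - start) / length               # assumes length > 0
    n = max(2, int(np.ceil((length + radius) / step)))
    samples = [start + t * direction for t in np.linspace(0.0, length + radius, n)]
    return any(
        all(submap.occupancy_log_odds(p) < ALPHA for p in samples)
        for submap in submaps  # the last S submaps
    )
```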
Once a safe path is returned, it is converted into a trajectory by accounting for the maximum velocity $v_\text{max}$ and acceleration $a_\text{max}$ that we consider to decelerate and accelerate on sharp turns. This reference trajectory then undergoes the process of what we call trajectory anchoring, effectively moving the reference in-line with updates from the state estimator. "Jumps" in state estimates would otherwise lead to undesired erratic drone motion and ultimately to a crash, as the drone would aim to navigate to a previous trajectory state which now represents a different environment location. Furthermore, we need to ensure that the planned reference trajectory remains in free space after the submap poses are updated, as well as that it remains continuous, i.e. the change must not break the reference trajectory apart.

Our approach, as visualized in Fig. 4, anchors each reference state to a set S of states $x_s$, $s \in \{1, \dots, S\}$, estimated by OKVIS2. The reference trajectory states will be updated as the anchor states $x_s$ update their corresponding poses. A reference trajectory contains references (superscript "ref") $x^\text{ref}_j = ({}_W r^\text{ref}_{WS_j}, q^\text{ref}_{WS_j}, {}_W v^\text{ref}_j)$, with $j \in \{1, \dots, J\}$ (and corresponding timestamps), which are anchored to the estimated OKVIS2 states in S via the k-nearest-neighbours algorithm, using the distance metric

$d_{j,s} = \| {}_W r^\text{ref}_{WS_j} - {}_W r_{S_s} \|$. (2)

We associate every pair j, s with the relative transformation $T_{S_sS_j}$ and a weight $w_{j,s}$:

$w_{j,s} = \dfrac{1/d_{j,s}}{\sum_{s=1}^{S} 1/d_{j,s}}$. (3)

Upon a state update, trajectory state positions are updated as follows:

${}_W r^\text{ref'}_{WS_j} = \sum_{s=1}^{S} w_{j,s}\, T_{WS_s}\, {}_{S_s} r_{S_j}$, (4)

where $T_{WS_s}$ is the latest anchor pose as estimated by OKVIS2. The orientation $q^\text{ref'}_{WS_j}$ is updated to the weighted average of $q_{S_sS_j}$, $s \in \{1, \dots, S\}$, using the weights already computed above, following the method from [27]. We also apply the respective orientation changes to the velocities.
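The following sketch implements Eqs. (2)-(4) and the quaternion averaging of [27] (a minimal illustration with our own function names, assuming numpy and quaternions as 4-vectors):

```python
import numpy as np

def anchor_weights(r_ref_W, anchor_positions_W):
    """Eqs. (2)-(3): inverse-distance weights of one reference state over S anchors."""
    d = np.linalg.norm(anchor_positions_W - r_ref_W, axis=1)   # d_{j,s}, Eq. (2)
    inv_d = 1.0 / np.maximum(d, 1e-9)                          # avoid division by zero
    return inv_d / inv_d.sum()                                 # w_{j,s}, Eq. (3)

def updated_reference_position(weights, T_WSs_list, r_Sj_in_Ss_list):
    """Eq. (4): blend the reference position through the latest anchor poses."""
    r = np.zeros(3)
    for w, T_WSs, r_Sj_Ss in zip(weights, T_WSs_list, r_Sj_in_Ss_list):
        r += w * (T_WSs @ np.append(r_Sj_Ss, 1.0))[:3]   # homogeneous transform
    return r

def average_quaternions(weights, quaternions):
    """Weighted quaternion average following Markley et al. [27]: the eigenvector
    of M = sum_s w_s q_s q_s^T associated with the largest eigenvalue."""
    M = sum(w * np.outer(q, q) for w, q in zip(weights, quaternions))
    _, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvecs[:, -1]            # unit quaternion (sign ambiguity remains)
```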
IV. EVALUATION

To validate our method, we present two sets of experiments. The first is done in simulation, where quantitative results can be obtained for the estimator and the map quality. The second set of results is obtained from real-world experiments, where qualitative results are presented.

The parameters in both experiments were kept constant. The voxel resolution of the volumetric mapping was set to 10 cm, the planning time to 0.5 s, and the target maximum velocity to 3 m/s. The OKVIS2 hyperparameters were kept constant for both the simulator and the real-world experiments for a fair comparison, matching the computational resources of the Jetson Orin NX. In both cases, loop-closures were present, necessitating trajectory anchoring.

A. Simulation Experiments

Our simulator is based on the Ignition Gazebo simulation environment. In this simulation, we have created an artificial forest which contains 620 trees on a square surface area of 16'384 m², i.e. 378 trees per hectare. According to [28], Swiss forests come with an average density of 690 trees on the same surface area, validating the realism of our simulation. To add more realism to the simulation, we use PX4's simulation-in-the-loop capabilities to interact with the drone, equivalently to the real-world setup – including MAVROS for communication between the MPC and the PX4, instead of assuming perfect tracking.

In order to demonstrate the capability of our system to perform long missions in the presence of loop closures, and the superior quality of maps generated by a system leveraging SLAM rather than just VIO, we evaluate two mission types: the first using a classic lawnmower-pattern reference path and the second a slight modification of the former which triggers more loop-closures by revisiting previously mapped areas. Both patterns result in large loop closures so as to validate the performance and safety of the system and particularly the trajectory anchoring module. A total of 12 missions were executed without a single MAV collision. Trajectories following the lawnmower path had an average length of 1869.78 m, while those following the modified version had an average length of 2363.64 m. An example of each mission type and the resulting trajectories is shown in Fig. 5, while an example of trajectory anchoring upon loop-closure is presented in Fig. 6.

Fig. 5: Overhead views of two simulation experiments, one using a lawnmower pattern on the top and one using the proposed modified lawnmower pattern which triggers more loop-closures on the bottom. The images show the MAV reference positions, the live trajectory estimated by the OKVIS2 SLAM system and the corresponding ground-truth trajectory. Tree positions are shown as dots. The reference positions can't always be accurately tracked due to the presence of trees.
Fig. 7 shows the velocity profile of a mission using the modified lawnmower pattern, obtained from the ground-truth odometry provided by the simulator. The average velocity is approximately 1.3 m/s even though the target velocity is 3 m/s. This is due to the large number of aggressive maneuvers the MAV had to perform to navigate between the trees, requiring a lower velocity. The target velocity was achieved on the longer and straighter segments.

To demonstrate the superior mapping quality obtained by leveraging SLAM for state estimation, we evaluate the accuracy and completeness of the mesh reconstructions obtained via SLAM and VIO. We also evaluate the reconstructions using the ground-truth poses and depth maps corresponding to the SLAM and VIO missions as a best-case scenario. The final reconstruction is the combination of the submap meshes. Accuracy is computed as the root-mean-square error, and completeness is computed as the percentage of the ground-truth mesh vertices for which there is a reconstructed mesh vertex within 50 cm.
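Under these definitions, the two metrics could be computed as in the following sketch (assuming scipy for nearest-neighbour queries, meshes given as vertex arrays, and one plausible reading of the accuracy direction):

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy_rmse(recon_vertices, gt_vertices):
    """RMSE of reconstructed vertices against their nearest ground-truth vertices."""
    d, _ = cKDTree(gt_vertices).query(recon_vertices)
    return float(np.sqrt(np.mean(d ** 2)))

def completeness(gt_vertices, recon_vertices, threshold=0.5):
    """Percentage of ground-truth vertices with a reconstructed vertex within 0.5 m."""
    d, _ = cKDTree(recon_vertices).query(gt_vertices)
    return 100.0 * float(np.mean(d <= threshold))
```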
The quantitative evaluation results are presented in Table I. The ground-truth mesh reconstructions do not achieve 100% completeness because there are regions of the environment that are unobservable from the followed trajectories, e.g. the higher parts of trees. It can be seen that the drift correction due to loop-closures improves both the reconstructed mesh quality and the trajectory accuracy in both mission types. The modified lawnmower pattern results in further improvement in mesh accuracy and completeness in the SLAM case due to the higher amount of loop closures triggered homogeneously throughout the mission.

Mission             | Method   | Acc. (m)↓ | Compl. (%)↑
--------------------|----------|-----------|------------
Lawnmower           | SLAM     | 0.280     | 69.62
                    | VIO      | 0.240     | 48.71
                    | GT SLAM  | 0.103     | 71.86
                    | GT VIO   | 0.104     | 69.85
Modified Lawnmower  | SLAM     | 0.190     | 71.34
                    | VIO      | 0.230     | 28.08
                    | GT SLAM  | 0.105     | 72.14
                    | GT VIO   | 0.105     | 74.32

TABLE I: Mesh reconstruction accuracy and completeness for both mission types using SLAM and VIO, as well as using ground-truth poses and depth images (GT SLAM/VIO rows).

Fig. 6: Trajectory anchoring when performing a mission in the simulator. Left: MAV pose (red arrow), estimated past trajectory (red line), current MPC reference (violet sphere) and reference trajectory (green line) before a loop-closure was detected. Right: situation after the loop-closure with deformed estimates and anchored trajectory. The blue circle and the yellow line represent an estimate of the MPC reference and reference trajectory if trajectory anchoring were not operating – highlighting possible catastrophic failures if not handled.

Fig. 7: MAV velocity distribution in m/s for a simulated mission with a reference velocity of 3 m/s and a trajectory length of 2363.58 m.
B. Real World Experiments

For the real-world experiments, two people were required: a safety-pilot in case of unexpected malfunctioning, and an operator. The latter would manually set the 3D poses towards which the drone should fly. The overall mission consisted of flying a path forward, then turning around and returning in the opposite direction to prevent short loop-closures, as shown in Fig. 1. During the return phase, the planned paths were longer, because prior submaps could be used for path planning. Near home, an extra detour was made to change the orientation again and reach the same start pose, finally triggering a long loop-closure. OKVIS2 processed stereo images at 15 Hz and depth images were generated at 5 Hz on average.

For these experiments, we roughly determined the position drift between the start and end positions to be less than 1 m before loop-closure – since the drone was commanded back to the start. According to the OKVIS2 odometry, the total distance travelled during the mission was 226.71 m. Thus, the estimator position drift amounts to < 0.5% of distance travelled, which is substantially more accurate than the numbers reported in [1] (> 1.0%).

The maximum velocity achieved was a peak of 4 m/s according to OKVIS2 odometry, although the average velocity was 1.2 m/s, with the top and bottom quartiles at 1.8 m/s and 0.6 m/s, respectively. No collisions occurred during the mission, and no interventions had to be made, as all planned segments were assessed to be safe, and in fact flown safely by the controller.

The approximate density of the forest in the area of the real-world experiments amounted to 467 trees per hectare, which is even denser than the simulated forest or a comparable forest area in Switzerland.

V. CONCLUSION

In this paper, we presented an MAV system which relies solely on passive visual sensors and an IMU to perform autonomous under-canopy navigation in forests. Onboard visual-inertial SLAM is in charge of MAV state estimation. We propose a volumetric occupancy submapping system to achieve a scalable mapping representation of the environment and reduce the reconstruction errors which arise in monolithic maps under drifting and abruptly changing state estimates. We also introduce a trajectory anchoring approach to ensure safe navigation upon abrupt changes in odometry due to loop closures. To demonstrate the method, we validated the system both in simulation and in a real-world environment – demonstrating safe flight at up to 3 m/s without a single collision or unsafe planning instance.
REFERENCES

[1] X. Liu, G. V. Nardari, F. C. Ojeda, Y. Tao, A. Zhou, T. Donnelly, C. Qu, S. W. Chen, R. A. Romero, C. J. Taylor et al., "Large-scale autonomous flight with real-time semantic SLAM under dense forest canopy," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5512–5519, 2022.
[2] M. Tranzatto, T. Miki, M. Dharmadhikari, L. Bernreiter, M. Kulkarni, F. Mascarich, O. Andersson, S. Khattak, M. Hutter, R. Siegwart et al., "CERBERUS in the DARPA Subterranean Challenge," Science Robotics, vol. 7, no. 66, p. eabp9742, 2022.
[3] M. Faessler, F. Fontana, C. Forster, E. Mueggler, M. Pizzoli, and D. Scaramuzza, "Autonomous, vision-based flight and live dense 3D mapping with a quadrotor micro aerial vehicle," Journal of Field Robotics, vol. 33, no. 4, pp. 431–450, 2016.
[4] Y. Lin, F. Gao, T. Qin, W. Gao, T. Liu, W. Wu, Z. Yang, and S. Shen, "Autonomous aerial navigation using monocular visual-inertial fusion," Journal of Field Robotics, vol. 35, no. 1, pp. 23–51, 2018.
[5] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Autonomous Robots, vol. 34, pp. 189–206, 2013.
[6] H. Oleynikova, C. Lanegger, Z. Taylor, M. Pantic, A. Millane, R. Siegwart, and J. Nieto, "An open-source system for vision-based micro-aerial vehicle mapping, planning, and flight in cluttered environments," Journal of Field Robotics, vol. 37, no. 4, pp. 642–666, 2020.
[7] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, "Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1366–1373.
[8] A. Millane, Z. Taylor, H. Oleynikova, J. Nieto, R. Siegwart, and C. Cadena, "C-blox: A scalable and consistent TSDF-based dense mapping approach," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
[9] N. Funk, J. Tarrio, S. Papatheodorou, M. Popović, P. F. Alcantarilla, and S. Leutenegger, "Multi-resolution 3D mapping with explicit free space representation for fast and accurate mobile robot motion planning," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3553–3560, 2021.
[10] T. Schneider, M. Dymczyk, M. Fehr, K. Egger, S. Lynen, I. Gilitschenski, and R. Siegwart, "maplab: An open framework for research in visual-inertial mapping and localization," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1418–1425, 2018.
[11] X. Zhou, J. Zhu, H. Zhou, C. Xu, and F. Gao, "Ego-swarm: A fully autonomous and decentralized quadrotor swarm system in cluttered environments," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 4101–4107.
[12] X. Zhou, Z. Wang, H. Ye, C. Xu, and F. Gao, "Ego-planner: An ESDF-free gradient-based local planner for quadrotors," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 478–485, 2020.
[13] H. Xu, L. Wang, Y. Zhang, K. Qiu, and S. Shen, "Decentralized visual-inertial-UWB fusion for relative state estimation of aerial swarm," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 8776–8782.
[14] A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, "Learning high-speed flight in the wild," Science Robotics, vol. 6, no. 59, p. eabg5810, 2021.
[15] T. Cieslewski, E. Kaufmann, and D. Scaramuzza, "Rapid exploration with multi-rotors: A frontier selection method for high speed flight," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 2135–2142.
[16] Z. Zhang and D. Scaramuzza, "Perception-aware receding horizon navigation for MAVs," in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2534–2541.
[17] S. Khattak, H. Nguyen, F. Mascarich, T. Dang, and K. Alexis, "Complementary multi-modal sensor fusion for resilient robot pose estimation in subterranean environments," in 2020 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2020, pp. 1024–1029.
[18] K. Mohta, K. Sun, S. Liu, M. Watterson, B. Pfrommer, J. Svacha, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, "Experiments in fast, autonomous, GPS-denied quadrotor flight," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7832–7839.
[19] K. Mohta, M. Watterson, Y. Mulgaonkar, S. Liu, C. Qu, A. Makineni, K. Saulnier, K. Sun, A. Zhu, J. Delmerico et al., "Fast, autonomous flight in GPS-denied and cluttered environments," Journal of Field Robotics, vol. 35, no. 1, pp. 101–120, 2018.
[20] S. W. Chen, G. V. Nardari, E. S. Lee, C. Qu, X. Liu, R. A. F. Romero, and V. Kumar, "SLOAM: Semantic lidar odometry and mapping for forest inventory," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 612–619, 2020.
[21] S. Leutenegger, "OKVIS2: Realtime scalable visual-inertial SLAM with loop closure," arXiv preprint arXiv:2202.09199, 2022.
[22] D. Tzoumanikas, W. Li, M. Grimm, K. Zhang, M. Kovac, and S. Leutenegger, "Fully autonomous micro air vehicle flight and landing on a moving target using visual–inertial estimation and model-predictive control," Journal of Field Robotics, vol. 36, no. 1, pp. 49–77, 2019.
[23] G. Xu, J. Cheng, P. Guo, and X. Yang, "Attention concatenation volume for accurate and efficient stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12981–12990.
[24] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
[25] I. A. Sucan, M. Moll, and L. E. Kavraki, "The Open Motion Planning Library," IEEE Robotics & Automation Magazine, vol. 19, no. 4, pp. 72–82, 2012.
[26] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, "Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, September 2014, pp. 2997–3004.
[27] F. L. Markley, Y. Cheng, J. L. Crassidis, and Y. Oshman, "Averaging quaternions," Journal of Guidance, Control, and Dynamics, vol. 30, no. 4, pp. 1193–1197, 2007.
[28] Swiss National Forest Inventory, https://lfi.ch/publikationen/publ/LFI_Flyer-en.pdf, 2021.