
The Journal of Engineering
The 3rd Asian Conference on Artificial Intelligence Technology (ACAIT 2019)

Three-dimensional obstacle avoidance for UAV based on reinforcement learning and RealSense

eISSN 2051-3305
Received on 11th October 2019
Accepted on 19th November 2019
E-First on 22nd May 2020
doi: 10.1049/joe.2019.1167
www.ietdl.org

Deqiang Han1, Qishan Yang1, Rui Wang1

1Faculty of Information Technology, Beijing University of Technology, Beijing 100124, People's Republic of China
E-mail: [email protected]

Abstract: With the increasingly widespread application of unmanned aerial vehicles (UAV), safety issues such as the effectiveness of obstacle avoidance have received growing attention. Classical obstacle avoidance algorithms are mostly suited to mobile robots and are not ideal for UAVs operating in three-dimensional space, while most of the more effective three-dimensional obstacle avoidance algorithms use RGB images as input, so a large amount of image data is involved in a complex computing process. This study proposes an effective obstacle avoidance algorithm for UAVs that needs less input data and fewer sensors, based on RealSense and reinforcement learning. It combines the feature map of the RealSense depth image, used as the input of reinforcement learning, with the current flight direction of the UAV to calculate the direction and angle of avoidance. The proposed algorithm, which implements real-time obstacle avoidance for the UAV, has been verified in simulation and tested in a three-dimensional space scenario.

1 Introduction

With the development of science and technology, unmanned aerial vehicle (UAV) technologies are evolving as well. Nowadays, UAVs are used not only in the military but also in commercial products, consumer electronics and even toys. At the same time, safety issues attract increasing concern, and obstacle avoidance is a key point among them. Whether the UAV flies autonomously or is controlled by a user, failing to avoid obstacles in time can damage the UAV or even injure people. Therefore, an effective obstacle avoidance function is very important for a UAV.

Obstacle avoidance algorithms for UAVs can be divided into two classes. One comprises traditional obstacle avoidance methods, and the other is based on machine learning. Among traditional methods, the artificial potential field and the vector field histogram have been used extensively in mobile robots [1]. This class of algorithms uses obstacle ranging data as input and computes the avoidance manoeuvre with a specific model. Their advantages are small input data and low computational complexity; however, once a situation arises that the model does not anticipate, obstacle avoidance is likely to fail. The other class of algorithms is based on reinforcement learning. These algorithms combine convolutional neural networks with reinforcement learning to extract features from the input data, which are usually images of obstacles. The features are then matched to the original avoidance actions [2], and finally a better way of avoiding obstacles is obtained. However, the large number of images used as input leads to higher computational complexity.

Considering the advantages of both, this paper proposes an obstacle avoidance algorithm for UAVs based on RealSense and reinforcement learning. The algorithm uses the feature map of raw RealSense depth data as input. Compared with RGB images, the raw depth data contain the ranging information necessary for avoiding obstacles, which reduces the amount of input and the computational complexity. Furthermore, the algorithm implements obstacle avoidance using only the RealSense device, unlike common solutions that rely on several types of sensors.

Intel RealSense technologies, formerly known as Intel Perceptual Computing, are well suited to computer vision and depth sensing. For obtaining 3D depth information, technologies commonly used in two-dimensional (2D) environments, such as ultrasonic and infrared ranging, are no longer the best choice. The RealSense device is smaller than the Kinect and easier to use than a dual-vision camera. Taking the RealSense depth data as the input for calculating avoidance actions improves the effectiveness of obstacle avoidance and the safety of the UAV.

Machine learning is a new paradigm of problem solving [3]. Reinforcement learning is an important machine learning method applied in many fields of intelligent control. The Agent does not need any prior knowledge of the environment and learns the rules gradually from each action and its feedback. Traditional reinforcement learning can be divided into policy-based and value-based methods. The purpose of a policy-based method is to find the optimal policy: by analysing the current environment it obtains the probabilities of the possible next actions and then performs the next action according to those probabilities. The purpose of a value-based method is to find the optimal sum of rewards: it obtains the value of each action and then chooses the action with the highest value. Actor–Critic combines the advantages of the two. The Actor performs the next action based on the probability distribution of the policy, and the Critic gives the value of that action. In other words, the Actor selects an action by probability and the Critic gives the Q-value of the action taken by the Actor; the Actor then updates its action-selection probabilities based on the Q-value given by the Critic. This is equivalent to speeding up learning on the basis of the original policy gradients. The structure of Actor–Critic is shown in Fig. 1.

Actor–Critic also has disadvantages, one of which is an inherently slow rate of convergence. To address this problem, the deep deterministic policy gradient (DDPG) is adopted in this paper. However, a model-free reinforcement learning algorithm requires the Agent to update its knowledge by trial and error, so training the neural network in a simulation environment avoids damaging the UAV before training is completed. In addition, because the feature map of raw RealSense depth data is used as input, the difference between the simulation environment and real-world testing has little effect on the algorithm, which is better than using RGB images as input. Finally, real-world testing verifies that the proposed algorithm implements an effective obstacle avoidance function for the UAV.
Fig. 1 Structure of Actor–Critic

2 Proposed method

2.1 Problem statement

Actor–Critic has been proved to achieve very good results in the game The Open Racing Car Simulator, and we can make appropriate modifications to this reinforcement learning structure. After the depth image used as input data is processed by the neural network, we obtain the moving speed of the UAV in the forward, up–down and left–right directions, respectively, for avoiding obstacles in 3D space. Since the controlled variable of the UAV is a continuous action, which may cause convergence problems for Actor–Critic, we adopt a reinforcement learning method called DDPG. DDPG is based on Actor–Critic and combines the advantages of the deep Q-network to improve stability and convergence. The problem can be summarised as the function

action = f(state) (1)

The state in (1) is the feature map of the depth data. Since the only input variable is state, the action, i.e. the obstacle avoidance action of the UAV, depends solely on the value of state. That is to say, there is a mapping relationship between state and action. With a proper policy, the Agent gradually maps state to action and eventually achieves effective obstacle avoidance. Every time an action is executed, it should be ensured that the value of state, which represents the condition of the current obstacles, is up to date. In this way, the real-time behaviour of the algorithm can be guaranteed while the UAV is flying.
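As a concrete illustration of the mapping in (1), the short Python sketch below shows one control step: the newest depth frame is reduced to a feature map (the state) and passed through the trained policy to obtain the avoidance action. The names get_depth_frame, feature_map, policy and execute are placeholders for illustration only, not the paper's actual interfaces.

def control_step(get_depth_frame, feature_map, policy, execute):
    # Keep the state up to date: use the most recent depth frame at every step.
    state = feature_map(get_depth_frame())
    # action = f(state), as in (1): forward, horizontal and vertical speeds.
    action = policy(state)
    execute(action)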
Fig. 2 Feature map partition

2.2 Design of training

According to the region corresponding to the direction of flight in the feature map, as shown in Fig. 2, the state of the UAV can be determined. The black area in Fig. 2 is the flight area of the UAV; by comparing the size of this area with the size of the UAV, it can be determined whether the UAV is able to pass safely. The white area is the direction area of the UAV. The sum of the black and white areas is counted as the checking area.

The flight can be divided into two states, Safe and Dangerous, which depend on the analysis of the checking area in the feature map [4]. In the state of Safe, the UAV flies in the direction of the target. In the state of Dangerous, the UAV follows the action output by the algorithm to avoid obstacles. To avoid some problems of a Partially Observable Markov Decision Process, the Agent keeps training in both states.

In order to improve the training effect, DDPG adopts the mechanism of experience replay. Some pre-processed data are collected and stored in the Replay Buffer before training. The state, action and reward of each step are stored in the Replay Buffer, while the data in a Mini-Batch are randomly taken out during training. The size of the Replay Buffer is 500,000 and the size of the Mini-Batch is 300. Training starts after 30,000 sets of data have been stored.
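The experience replay mechanism described above can be summarised by the Python sketch below, assuming the buffer capacity (500,000), mini-batch size (300) and warm-up threshold (30,000) quoted in the text; treating each stored set as a (state, action, reward, next state, done) transition is an assumption.

import random
from collections import deque

REPLAY_BUFFER_SIZE = 500_000   # buffer capacity from the paper
MINI_BATCH_SIZE = 300          # mini-batch size from the paper
WARM_UP_SETS = 30_000          # training starts after this many stored sets

class ReplayBuffer:
    def __init__(self, capacity=REPLAY_BUFFER_SIZE):
        # deque drops the oldest transitions once the capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def ready_to_train(self):
        return len(self.buffer) >= WARM_UP_SETS

    def sample(self, batch_size=MINI_BATCH_SIZE):
        # transitions are drawn uniformly at random for each training step
        return random.sample(self.buffer, batch_size)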
Fig. 3 Raw data processing

2.3 Data pre-processing

There are many parameters that can be collected during UAV flight, including the Euler angles, the relative position, etc. If these parameters were used as input for reinforcement learning, the accuracy of obstacle avoidance could be further improved. However, some of them are difficult to collect in real time in a practical application, and a general flight control system does not expose all of these parameters to developers. Therefore, only the feature map of the depth data is used as the input for state. The data processing flow is shown in Fig. 3.

The depth data resolution of RealSense is 640*480. A 20*20 feature vector is taken out and used in convolution processing with a stride of 20*20. The output is then 32*24, representing the average valid distance of each 20*20 block of pixels. Since the detectable range of RealSense is between 0.7 and 4 m [5], data outside this range are set to 0.
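A minimal Python/NumPy sketch of this pre-processing step is given below: the 640*480 depth frame is reduced to a 32*24 feature map by averaging each 20*20 block, with pixels outside the 0.7–4 m detectable range treated as invalid. Whether the masking happens before or after the averaging is not stated in the paper, so the ordering here is an assumption.

import numpy as np

REALSENSE_MIN, REALSENSE_MAX = 0.7, 4.0   # detectable range in metres [5]
BLOCK = 20                                # 20*20 block, stride 20

def depth_to_feature_map(depth):
    # depth: 480x640 array of distances in metres from the RealSense
    depth = np.asarray(depth, dtype=np.float32)
    assert depth.shape == (480, 640)
    valid = (depth >= REALSENSE_MIN) & (depth <= REALSENSE_MAX)
    masked = np.where(valid, depth, 0.0)            # out-of-range pixels set to 0
    # Average the valid distances over every 20x20 block -> 24x32 array (32*24 cells)
    sums = masked.reshape(24, BLOCK, 32, BLOCK).sum(axis=(1, 3))
    counts = valid.reshape(24, BLOCK, 32, BLOCK).sum(axis=(1, 3))
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)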
2.4 Design of network structure

There are two groups of neural networks, the Actor network and the Critic network. The parameter TAU of the target networks is 0.0003. The Actor is a policy-based network which calculates the action based on the input state; it is updated depending on the Q-value output by the Critic network. The structure of the Actor network is shown in Fig. 4.

In the Actor network, the input is the feature map of the depth data, and the outputs are the forward speed and the avoidance speeds in the horizontal and vertical directions. The input layer is connected to a fully connected (FC) layer. After activation by the Swish function and handling with Dropout, these data are used as the input of three other FC layers, respectively [6]. One branch is connected to an FC layer and mapped to (0, 1) by a Sigmoid activation function; it produces the forward speed. The other two branches are connected to an FC layer and handled by the Swish activation function, then connected to another FC layer and mapped to (−1, 1) by a Tanh activation function; they produce the horizontal and vertical avoidance speeds, respectively. The learning rate of the Actor network is 0.00003.

The Critic is a value-based network which evaluates the Actor network based on state and action. The network is updated by the reward and the output of the target network for each action. The structure of the Critic network is shown in Fig. 5.

In the Critic network, the input is the feature map of the depth data (state) together with the action of the Actor network, and the output is the Q-value used to evaluate the Actor. The state is connected to an FC layer. After activation by the Swish function and handling with Dropout, these data and the action are connected to another FC layer together. Similarly, this set of data is connected to an FC layer and handled by the Swish activation function with Dropout. A last FC layer is connected before the Q-value is given. The Q-value is used to evaluate each action and train the Actor network. The learning rate of the Critic network is 0.0003.
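The description above can be translated into the following sketch, written here in PyTorch purely for illustration (the paper does not name a framework); the hidden-layer width and the Dropout rate are not given in the text and are assumptions, and nn.SiLU is used as the Swish activation. The soft target-network update with TAU = 0.0003 is also sketched.

import torch
import torch.nn as nn

STATE_DIM = 32 * 24   # flattened depth feature map
HIDDEN = 256          # hidden width: not specified in the paper, assumed
TAU = 0.0003          # soft-update coefficient for the target networks

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.SiLU(), nn.Dropout(0.2))
        # Branch 1: forward speed, mapped to (0, 1) by Sigmoid
        self.forward_speed = nn.Sequential(nn.Linear(HIDDEN, 1), nn.Sigmoid())
        # Branches 2 and 3: horizontal / vertical avoidance speed, mapped to (-1, 1) by Tanh
        self.horizontal = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.SiLU(),
                                        nn.Linear(HIDDEN, 1), nn.Tanh())
        self.vertical = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.SiLU(),
                                      nn.Linear(HIDDEN, 1), nn.Tanh())

    def forward(self, state):
        h = self.trunk(state)
        return torch.cat([self.forward_speed(h), self.horizontal(h), self.vertical(h)], dim=-1)

class Critic(nn.Module):
    def __init__(self, action_dim=3):
        super().__init__()
        self.state_fc = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.SiLU(), nn.Dropout(0.2))
        self.joint = nn.Sequential(nn.Linear(HIDDEN + action_dim, HIDDEN),
                                   nn.Linear(HIDDEN, HIDDEN), nn.SiLU(), nn.Dropout(0.2),
                                   nn.Linear(HIDDEN, 1))   # last FC layer gives the Q-value

    def forward(self, state, action):
        return self.joint(torch.cat([self.state_fc(state), action], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-5)     # 0.00003, as in the text
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)   # 0.0003, as in the text

def soft_update(target, online):
    # target <- (1 - TAU) * target + TAU * online, applied after each training step
    for t, o in zip(target.parameters(), online.parameters()):
        t.data.mul_(1.0 - TAU).add_(TAU * o.data)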

Fig. 4 Structure of Actor

Fig. 5 Structure of Critic

2.5 Design of reward

When evaluating the effect of the Actor, the method of calculating the reward needs to be considered. The design of the reward has a great influence on the results; if the design is unreasonable, the neural network will easily fall into a local optimum. The reward calculation in this paper consists of two parts: the change of the feature map and the flight state of the UAV.

As mentioned earlier, the flight of the UAV is divided into two states, and the reward corresponding to a change of state is shown in Table 1.

Table 1 Correspondence between reward and status
Status of the UAV      Reward
S→D                    −3
D→S                    +3
D→D, L decreasing      −2
D→D, L increasing      +2

In Table 1, S means the state of Safe, D means the state of Dangerous, and L means the minimum non-zero distance from obstacles in the Dangerous state. The initial value of the reward is 0. The reward is reduced by 3 when the state changes from Safe to Dangerous. After avoiding obstacles, the state changes from Dangerous back to Safe, and the reward is increased by 3. Moreover, two situations need to be considered when the state remains Dangerous: an increasing L indicates that the UAV is moving away from obstacles, and the reward is increased by 2; conversely, L decreases if the UAV is approaching obstacles, and the reward is reduced by 2. As expected, the reward does not change while the state of Safe remains unchanged.

Collision, reaching the destination and flying forward are the three flight states to consider. When a collision occurs, the reward is reduced by 5. If the UAV arrives at the destination, the reward is increased by 5. The reward is also increased by 1 if the current forward speed is above 0.9 m/s. Either a collision or reaching the destination marks the end of a round of training. The reward corresponding to the flight states is shown in Table 2.

Table 2 Correspondence between reward and flight states
Flight state                 Reward
collision                    −5
reaching the destination     +5
v_forward > 0.9 m/s          +1

It should be noted that a collision covers two cases: one is an actual collision between the UAV and an obstacle, and the other is an obstacle in front that is closer than the minimum detectable range of RealSense. The significance of the latter is to avoid problems caused by the limited detectable range of RealSense, such as the state still being Safe when the UAV is about to collide.

After each action is executed, the reward corresponding to the action is obtained by summing both parts above.
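The reward for one step is the sum of the state-transition component (Table 1) and the flight-state component (Table 2). A small Python sketch under these rules follows; how a tie in L (no change of the minimum obstacle distance while Dangerous) is scored is not specified in the paper, so its handling below is an assumption.

def transition_reward(prev_state, new_state, prev_L, new_L):
    # prev_state / new_state: 'safe' or 'dangerous'; L: minimum non-zero obstacle distance
    if prev_state == 'safe' and new_state == 'dangerous':
        return -3
    if prev_state == 'dangerous' and new_state == 'safe':
        return +3
    if prev_state == 'dangerous' and new_state == 'dangerous':
        return +2 if new_L > prev_L else -2   # tie treated as approaching (assumption)
    return 0   # Safe -> Safe: no change

def flight_state_reward(collided, reached_destination, forward_speed):
    # 'collided' also covers obstacles closer than RealSense's minimum range
    reward, episode_done = 0, False
    if collided:
        reward -= 5
        episode_done = True
    if reached_destination:
        reward += 5
        episode_done = True
    if forward_speed > 0.9:   # m/s
        reward += 1
    return reward, episode_done

def step_reward(prev_state, new_state, prev_L, new_L, collided, reached_destination, v_fwd):
    r_flight, done = flight_state_reward(collided, reached_destination, v_fwd)
    return transition_reward(prev_state, new_state, prev_L, new_L) + r_flight, done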
2.6 Additional design of action

In order to explore enough continuous behaviour in the simulation environment, an Ornstein–Uhlenbeck (OU) process is used to generate noise, which increases the randomness of the action. Combining this noise with the action output by the Actor improves adaptability to unknown environments during training. It is better to remove the OU process in practical application, however.
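A minimal Python sketch of such OU exploration noise is shown below. The process parameters (theta, sigma, mu) are not given in the paper; typical DDPG values are assumed here.

import numpy as np

class OUNoise:
    def __init__(self, size=3, mu=0.0, theta=0.15, sigma=0.2):
        # theta pulls the noise back towards mu; sigma scales the random step
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(size, mu, dtype=np.float32)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

# During simulation training the noise is added to the Actor output and the result is
# clipped back to the valid ranges; in real flight the noise would be removed:
# noisy_action = np.clip(actor_output + ou.sample(), [0.0, -1.0, -1.0], [1.0, 1.0, 1.0])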
3 Experiments

3.1 Simulation experiments

The simulation environment in this paper consists of Gazebo and the Robot Operating System (ROS). Gazebo helps users to rapidly test algorithms, perform regression testing and train AI systems using realistic scenarios; in other words, it offers the ability to accurately and efficiently simulate populations of robots in complex indoor and outdoor environments [7]. ROS is a flexible framework for writing robot software. It is a collection of tools, libraries and conventions that aim to simplify the task of creating complex and robust robot behaviour across a wide variety of robotic platforms [8]. In ROS, we can write control functions for the UAV and run the algorithm.
Fig. 6 The UAV modeling in design
(a) The model of the UAV, (b) The structure of the UAV

Fig. 7 Map of simulation environment

In the simulation environment, the model and structure of the UAV are shown in Fig. 6. Fig. 6a is the model of the UAV, with the same size as the actual UAV. Fig. 6b shows the mounting location of the RealSense device on the UAV.

The relationship between the speed of the UAV and the output of the Actor is shown in (2); the maximum speed change is 1 m/s

v_x = action_x m/s
v_y = action_y m/s (2)
v_z = action_z m/s
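The mapping in (2) is straightforward to implement. The Python sketch below scales the three Actor outputs by the 1 m/s maximum and packs them into a ROS velocity message; the topic used to send this command to the simulated UAV or to the Pixhawk is not specified in the paper, so the publisher lines are only indicative.

import rospy
from geometry_msgs.msg import Twist

MAX_SPEED = 1.0   # maximum speed change of 1 m/s, as in (2)

def action_to_twist(action):
    # action = [forward in (0, 1), horizontal in (-1, 1), vertical in (-1, 1)]
    cmd = Twist()
    cmd.linear.x = MAX_SPEED * action[0]   # forward speed
    cmd.linear.y = MAX_SPEED * action[1]   # horizontal avoidance speed
    cmd.linear.z = MAX_SPEED * action[2]   # vertical avoidance speed
    return cmd

# Indicative only; the actual command topic depends on the flight stack:
# pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
# pub.publish(action_to_twist(action))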

Fig. 7 shows the simulation environment for testing. The map contains several groups of obstacles in a three-dimensional environment. For each obstacle, some necessary data are collected from all directions and locations before training.

Fig. 8 Results of training
(a) Distance, (b) Rewards, (c) Q-values

The training results are shown in Fig. 8. The first 4000 episodes represent the data collection process mentioned above, and their results are not used to discuss the training effect. Fig. 8a shows the completed flying distance; the total length of the simulation environment is 40 m, and after some thousands of training episodes the UAV can pass through most of the simulation testing areas. Fig. 8b shows the rewards; the relationship between distance and reward is linear to some extent. Fig. 8c shows the Q-values, which generally exhibit an upward trend but continue to fluctuate.

The process of obstacle avoidance is shown in Fig. 9. This figure, which can be divided into six parts, represents a complete process of the UAV successfully avoiding obstacles.

3.2 Real-world experiments

The hardware control system includes an Intel Joule developer kit, an Intel RealSense R200 and a Pixhawk. The Intel Joule platform is a system on module and is available in multiple configurations with support for Intel RealSense technologies [9]. Pixhawk is an open-hardware flight controller and supports multiple flight stacks [10]. The RealSense R200 as the input device and the Pixhawk as the output device are connected to and controlled by the Joule as the control unit.

The real-world testing environment is an underground car park, as shown in Fig. 10a. A wall stud in the car park is used as the obstacle for testing the effect of the obstacle avoidance algorithm in a practical application. Fig. 10b is the fixed depth image of the obstacle. Fig. 10c is the feature map of the depth image after the convolution; it can be seen that part of the shape of the obstacle is retained. At this point, the avoidance action output by the algorithm is shown in Fig. 10d. According to (2), the UAV will move to the lower left to avoid the obstacle.

The policies adopted by the UAV for obstacle avoidance are the result of training in the simulation environment. When the UAV detects obstacles, it first reduces the forward speed, then avoids the obstacle, and finally increases the speed and continues to fly forward after successfully avoiding the obstacle. The process of the UAV avoiding obstacles in the car park is shown in Fig. 11, which can also be divided into six parts.

J. Eng., 2020, Vol. 2020 Iss. 13, pp. 540-544 543


This is an open access article published by the IET under the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/3.0/)
20513305, 2020, 13, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/joe.2019.1167 by Cochrane Russian Federation, Wiley Online Library on [03/09/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Fig. 9 Process of obstacle avoidance in simulation experiments

Fig. 11 Process of obstacle avoidance in real-world experiments

Fig. 10 Results of real-world experiments
(a) The real-world testing environment, (b) The fixed depth image, (c) The feature map, (d) The result of actions

4 Conclusion

This paper proposed an obstacle avoidance algorithm for UAVs based on reinforcement learning. The feature map of the depth image from RealSense was used as the input for reinforcement learning, and after training and processing by the neural network, the action for avoiding obstacles was output to the UAV. The UAV can successfully avoid obstacles using this algorithm, which was verified in the simulation environment and tested in an indoor scenario.

There are many details to be further improved. The feature information extracted from the depth images is not complete; more analysis and processing of the depth images could mine more valuable information, which would also help to improve the efficiency of training. Besides, both the simulation and the real-world experiments designed in this paper are less complex than many application scenarios. Furthermore, using the latest RealSense devices would allow more useful information to be provided. Thus, there are several ways to improve the robustness of the algorithm in the future.

5 References

[1] Vanneste, S., Bellekens, B., Weyn, M.: '3DVFH+: real-time three-dimensional obstacle avoidance using an octomap'. The 1st Int. Workshop on Model-Driven Robot Software Engineering (MORSE 2014), York, Great Britain, 2014, pp. 91–102
[2] Fereshteh, S., Sergey, L.: 'CAD2RL: real single-image flight without a single real image'. Robotics: Science and Systems XIII, Massachusetts, USA, 2017, pp. 1–5
[3] De La Rosa, M., Chen, Y.: 'A machine learning platform for multirotor activity training and recognition'. Proc. of IEEE 14th Int. Symp. on Autonomous Decentralized Systems, Utrecht, Netherlands, April 2019, pp. 15–22
[4] Chen, X., Kamel, A.E.: 'A reinforcement learning method of obstacle avoidance for industrial mobile vehicles in unknown environments using neural network'. The 21st Int. Conf. on Industrial Engineering and Engineering Management 2014, Paris, France, 2014, pp. 671–675
[5] Introducing the Intel® RealSense™ R200 Camera, https://software.intel.com/en-us/articles/realsense-r200-camera, accessed 15 May 2015
[6] Ramachandran, P., Zoph, B., Quoc, V.L.: 'Swish: a self-gated activation function', 2017, arXiv:1710.05941
[7] Why Gazebo, http://gazebosim.org/, accessed 10 December 2018
[8] About ROS, http://www.ros.org/about-ros/, accessed 15 December 2018
[9] Online Guide for the Intel® Joule™ Module, https://software.intel.com/en-us/node/721455, accessed 14 July 2017
[10] What is Pixhawk, http://pixhawk.org/, accessed 4 August 2018

