3D Obstacle Avoidance For UAV Based On RL and RealSense
Abstract: With the increasingly widespread application of unmanned aerial vehicles (UAVs), safety issues such as the effectiveness of obstacle avoidance have received more attention. Classical obstacle avoidance algorithms are mostly suited to mobile ground robots and are not ideal for UAVs operating in three-dimensional space. Most of the more effective three-dimensional obstacle avoidance algorithms use RGB image data as input, so a large amount of image data is involved in a complex computing process. This study proposes an effective obstacle avoidance algorithm for UAVs that needs less input data and fewer sensors, based on RealSense and reinforcement learning. It combines the feature map of the RealSense depth image, used as the input of reinforcement learning, with the current flight direction of the UAV to calculate the direction and angle of avoidance. The proposed algorithm implements real-time obstacle avoidance for UAVs and has been verified in simulation and tested in a three-dimensional scenario.
1 Introduction

With the development of science and technology, the technologies of unmanned aerial vehicles (UAVs) are evolving as well. Nowadays, UAVs are used not only in the military, but also in commercial applications, consumer electronics and even toys. At the same time, safety issues are of increasing concern, and obstacle avoidance is a key point of UAV safety. Whether the UAV is flying autonomously or being controlled by a user, it can damage the UAV or even injure people if obstacles are not avoided in time. Therefore, an effective obstacle avoidance function is very important for a UAV.

Obstacle avoidance algorithms for UAVs can be divided into two classes. One is traditional obstacle avoidance methods, and the other is obstacle avoidance based on machine learning. As traditional methods, the artificial potential field and the vector field histogram have been used extensively in mobile robots [1]. This class of algorithms uses obstacle ranging data as input and calculates the avoidance manoeuvre with a specific model. Their advantages are less input data and low computational complexity. However, once a situation that the model does not anticipate occurs, the obstacle avoidance method is likely to fail.

The other class of algorithms is based on reinforcement learning. These algorithms combine convolutional neural networks with reinforcement learning to extract features from the input data, which are usually images of obstacles. The features are then matched to primitive avoidance actions [2], and a better way to avoid obstacles is finally given. However, the large number of images used as input leads to higher computational complexity.

Considering the advantages of both, this paper proposes an obstacle avoidance algorithm for UAVs based on RealSense and reinforcement learning. The algorithm uses the feature map of the raw RealSense depth data as input. Compared with RGB images, the raw depth data contain the ranging information necessary for avoiding obstacles, which reduces the amount of input and the computational complexity. Furthermore, the algorithm implements the obstacle avoidance function using only RealSense, which is different from common solutions that rely on various types of sensors.

Intel RealSense technologies, formerly known as Intel Perceptual Computing, are well suited to computer vision and depth solutions. For obtaining 3D depth information, technologies commonly used in two-dimensional (2D) environments, such as ultrasonic and infrared ranging, are no longer the best choice. A RealSense device is smaller than a Kinect and easier to use than a dual-vision camera. By taking the depth data of RealSense as the input for calculating avoidance actions, the effectiveness of obstacle avoidance and the safety of the UAV can be improved.

Machine learning is a new paradigm of problem solving [3]. Reinforcement learning is an important machine learning method applied in many fields of intelligent control. The Agent does not need any prior knowledge of the environment and learns the rules gradually from each action and its feedback. Traditional reinforcement learning can be divided into policy-based and value-based methods. The purpose of a policy-based method is to find the optimal policy: it obtains the probabilities of the possible next actions by analysing the current environment and then performs the next action according to those probabilities. The purpose of a value-based method is to find the optimal sum of rewards: it obtains the value of each action and then chooses the action with the highest value. Actor–Critic combines the advantages of the two methods. The Actor performs the next action based on the probability distribution of the policy, and the Critic gives the value of that action. In other words, the Actor selects the action by probability and the Critic gives the Q-value of the action taken by the Actor. The Actor then updates the probability of selecting actions based on the Q-value given by the Critic. This is equivalent to speeding up the learning process on top of the original policy gradients. The structure of Actor–Critic is shown in Fig. 1.

Actor–Critic also has some disadvantages, one of which is an inherently slow rate of convergence. To solve this problem, the deep deterministic policy gradient (DDPG) is adopted in this paper. However, a model-free reinforcement learning algorithm requires the Agent to update its knowledge by trial and error, so training the neural network in a simulation environment avoids damaging the UAV before training is completed. In addition, because the feature map of the raw RealSense depth data is used as input, the difference between the simulation environment and real-world testing has little effect on the algorithm, which is better than using RGB images as input. Finally, real-world testing verifies that the algorithm proposed in this paper implements an effective obstacle avoidance function for the UAV.
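To illustrate the Actor–Critic scheme of Fig. 1 as trained by DDPG, a minimal one-step update is sketched below in Python, assuming PyTorch. The networks, the replay batch, the discount factor and the soft-update rate are placeholders rather than the paper's implementation.

# Minimal sketch of one DDPG update step. Assumes PyTorch and that the
# Actor, Critic, their target copies and a replay batch already exist.
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.001   # illustrative values, not taken from the paper

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch):
    state, action, reward, next_state, done = batch

    # Critic: regress Q(s, a) towards r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        target_q = reward + GAMMA * (1 - done) * target_critic(
            next_state, target_actor(next_state))
    critic_loss = F.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximise Q(s, mu(s)) by minimising its negative
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p.data)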
Fig. 1 Structure of Actor–Critic
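Since both networks described below take the feature map of the RealSense depth data as their state input, the following sketch shows one way such a feature map could be obtained, assuming the pyrealsense2 Python package. The block-wise minimum pooling and the grid size are illustrative assumptions; the paper's exact feature-map construction is not reproduced in this excerpt.

# Illustrative sketch: read one RealSense depth frame and reduce it to a
# small grid of nearest obstacle distances.
import numpy as np
import pyrealsense2 as rs

def depth_feature_map(depth_m, grid=(4, 4)):
    # Reduce an HxW depth image (in metres) to a coarse grid of nearest distances.
    h, w = depth_m.shape
    gh, gw = grid
    fmap = np.zeros(grid, dtype=np.float32)
    for i in range(gh):
        for j in range(gw):
            block = depth_m[i * h // gh:(i + 1) * h // gh,
                            j * w // gw:(j + 1) * w // gw]
            valid = block[block > 0]          # zero depth means no reading
            fmap[i, j] = valid.min() if valid.size else 0.0
    return fmap

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
profile = pipeline.start(config)
scale = profile.get_device().first_depth_sensor().get_depth_scale()
try:
    frames = pipeline.wait_for_frames()
    depth_m = np.asanyarray(frames.get_depth_frame().get_data()) * scale
    state = depth_feature_map(depth_m)        # fed to the RL agent as the state
finally:
    pipeline.stop()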
respectively [6]. One set of the data is connected to an FC layer and mapped to (0, 1) by a Sigmoid activation function; it produces the forward speed. The other two sets of the data are connected to an FC layer, handled by a Swish activation function, and then connected to another FC layer and mapped to (−1, 1) by a Tanh activation function; they produce the horizontal and vertical speeds for obstacle avoidance, respectively. The learning rate of the Actor network is 0.00003.

The Critic is a value-based network that evaluates the Actor network based on the state and the action. The network is updated by the reward and the output of the target network for each action. The structure of the Critic network is shown in Fig. 5.

In the Critic network, the input is the feature map of the depth data (the state) and the action of the Actor network, while the output is the Q-value for evaluating the Actor. The state is connected to an FC layer. After Swish activation and Dropout, these data and the action are connected to another FC layer together. Similarly, this set of data is connected to an FC layer and handled by a Swish activation function with Dropout. A last FC layer is connected before the Q-value is produced. The Q-value is used to evaluate each action and train the Actor network. The learning rate of the Critic network is 0.0003.
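The sketch below mirrors the Actor and Critic heads described above in PyTorch. The layer widths, the dropout rate, the exact number of FC layers and the way the shared feature map is split into branches are not fully specified in this excerpt, so those values are placeholders; only the activations, output ranges and learning rates follow the text.

# Approximate Actor/Critic heads per the description above (placeholder sizes).
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.forward_head = nn.Sequential(        # forward speed in (0, 1)
            nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.avoid_head = nn.Sequential(          # horizontal/vertical speeds in (-1, 1)
            nn.Linear(feat_dim, hidden), nn.SiLU(),  # SiLU is the Swish activation
            nn.Linear(hidden, 2), nn.Tanh())

    def forward(self, feat):
        return torch.cat([self.forward_head(feat), self.avoid_head(feat)], dim=-1)

class Critic(nn.Module):
    def __init__(self, feat_dim=128, act_dim=3, hidden=64, p=0.2):
        super().__init__()
        self.state_fc = nn.Sequential(nn.Linear(feat_dim, hidden),
                                      nn.SiLU(), nn.Dropout(p))
        self.joint_fc = nn.Sequential(nn.Linear(hidden + act_dim, hidden),
                                      nn.SiLU(), nn.Dropout(p),
                                      nn.Linear(hidden, 1))   # Q-value

    def forward(self, feat, action):
        return self.joint_fc(torch.cat([self.state_fc(feat), action], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-5)    # 0.00003
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)  # 0.0003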
2.5 Design of reward

When evaluating the effect of the Actor, the method of calculating the reward needs to be considered. The design of the reward has a great influence on the results; if it is unreasonable, the neural network will easily fall into a local optimum. The reward calculation in this paper consists of two parts: the change of the feature map and the flight state of the UAV.

As mentioned earlier, the flight state of the UAV is divided into two states, and the reward corresponding to the change of state is shown in Table 1.

Table 1 Correspondence between reward and status
Status of the UAV    Reward
S-D                  −3
D-S                  +3
D-D, L↓              −2
D-D, L↑              +2

In Table 1, S means the Safe state, D means the Dangerous state, and L means the minimum non-zero distance from obstacles while in Dangerous. The initial value of the reward is 0. The reward is reduced by 3 when the state changes from Safe to Dangerous. After the obstacle is avoided, the state changes from Dangerous back to Safe and the reward is increased by 3. Moreover, two situations need to be considered when the state remains Dangerous: an increasing L indicates that the UAV is moving away from the obstacles, so the reward is increased by 2; conversely, L decreases if the UAV is approaching the obstacles, and the reward is reduced by 2. As expected, the reward does not change while the state remains Safe.

Collision, reaching the destination and flying forward are the three flight states to consider. When a collision occurs, the reward is reduced by 5. If the UAV arrives at the destination, the reward is increased by 5. The reward is also increased by 1 if the current forward speed exceeds 0.9 m/s. Either a collision or reaching the destination marks the end of a round of training. The reward corresponding to the flight states is shown in Table 2.

Table 2 Correspondence between reward and flight states
Flight state                 Reward
collision                    −5
reaching the destination     +5
v_forward > 0.9 m/s          +1

It should be noted that a collision covers two cases: one is an actual collision between the UAV and an obstacle, and the other is an obstacle in front being closer than the minimum detectable range of RealSense. The significance of the latter is to avoid problems caused by the detectable range of RealSense, such as the state still being Safe when the UAV is about to collide.

After each action is executed, the reward corresponding to that action is obtained by summing the two parts above, as in the sketch below.
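The following sketch combines the state-change term of Table 1 and the flight-state term of Table 2 into one reward. The function and variable names are illustrative, and the behaviour when L is unchanged in the Dangerous state is not specified in the text.

# Two-part reward of Section 2.5: Table 1 term plus Table 2 term.
def state_change_reward(prev_safe, now_safe, prev_L, now_L):
    if prev_safe and not now_safe:
        return -3                        # S-D
    if not prev_safe and now_safe:
        return +3                        # D-S
    if not prev_safe and not now_safe:   # still Dangerous
        return +2 if now_L > prev_L else -2
    return 0                             # still Safe

def flight_state_reward(collided, reached_goal, v_forward):
    r = 0
    if collided:
        r -= 5
    if reached_goal:
        r += 5
    if v_forward > 0.9:                  # m/s
        r += 1
    return r

def total_reward(prev_safe, now_safe, prev_L, now_L,
                 collided, reached_goal, v_forward):
    return (state_change_reward(prev_safe, now_safe, prev_L, now_L)
            + flight_state_reward(collided, reached_goal, v_forward))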
2.6 Additional design of action

In order to explore enough continuous behaviour in the simulation environment, an Ornstein–Uhlenbeck (OU) process is used to generate noise, which increases the randomness of the action. Combining the noise with the action as the output of the Actor improves adaptability to unknown environments during training. It is better, however, to remove the OU process in practical application.
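A minimal sketch of OU exploration noise added to the Actor output during training is given below; the parameters theta, sigma and dt are common illustrative choices, not values from the paper.

# Ornstein-Uhlenbeck exploration noise (training only).
import numpy as np

class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(size, mu, dtype=np.float32)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

noise = OUNoise(size=3)
# During training: action = actor_output + noise.sample()
# In practical application the noise term is removed, as noted above.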
3 Experiments

3.1 Simulation experiments

The simulation environment in this paper is Gazebo with the robot operating system (ROS). Gazebo can help users to rapidly test
Fig. 6 The UAV model used in the design
(a) The model of the UAV, (b) The structure of the UAV
v_x = action.x m/s
v_y = action.y m/s    (2)
v_z = action.z m/s
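The velocity command of (2) can be issued through ROS as in the sketch below, assuming the action components are already scaled to m/s and that the flight stack listens on a Twist topic; the topic name /cmd_vel is an assumption, not taken from the paper.

# Publish the velocity command of (2) as a ROS Twist message.
import rospy
from geometry_msgs.msg import Twist

rospy.init_node('rl_obstacle_avoidance')
cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)

def publish_velocity(action_x, action_y, action_z):
    cmd = Twist()
    cmd.linear.x = action_x   # v_x in m/s
    cmd.linear.y = action_y   # v_y in m/s
    cmd.linear.z = action_z   # v_z in m/s
    cmd_pub.publish(cmd)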