Optimal Path Planning Approach Based On Q-Learning Algorithm For Mobile Robots
PII: S1568-4946(20)30734-1
DOI: https://doi.org/10.1016/j.asoc.2020.106796
Reference: ASOC 106796
Please cite this article as: A. Maoudj and A. Hentout, Optimal path planning approach based on
Q-learning algorithm for mobile robots, Applied Soft Computing Journal (2020), doi:
https://doi.org/10.1016/j.asoc.2020.106796.
Centre de Développement des Technologies Avancées (CDTA)
Division Productique et Robotique (DPR)
BP 17, Baba Hassen, Algiers 16303, Algeria.
[email protected], [email protected]
Abstract: Optimizing paths within a short computation time remains a major challenge for mobile robotics applications. In path planning and obstacle avoidance, the Q-Learning (QL) algorithm has been widely used as a computational method of learning through environment interaction. However, less emphasis is placed on path optimization using QL because of its slow and weak convergence towards optimal solutions. Therefore, this paper proposes an Efficient Q-Learning (EQL) algorithm to overcome these limitations and ensure an optimal collision-free path in the least possible time. In the QL algorithm, successful learning depends closely on the design of an effective reward function and an efficient selection strategy for the optimal action that ensures both exploration and exploitation. In this regard, a new reward function is proposed to initialize the Q-table and provide the robot with prior knowledge of the environment, followed by a new efficient selection strategy that accelerates the learning process through search-space reduction while ensuring rapid convergence towards an optimized solution. The main idea is to intensify, at each learning stage, the search around the straight-line segment linking the current position of the robot to Target (the optimal path in terms of length). During the learning process, the proposed strategy favors promising actions that not only lead to an optimized path but also accelerate the convergence of the learning process. The proposed EQL algorithm is first validated using benchmarks from the literature, then compared with other existing QL-based algorithms. The achieved results show that the proposed EQL gains good learning proficiency and that the training performance is significantly improved compared to the state-of-the-art. In conclusion, EQL improves the quality of the paths in terms of length, computation time and robot safety, and outperforms other optimization algorithms.
Keywords: Path optimization, Efficient Q-Learning, Efficient selection strategy, Convergence speed, Training performance.
1. Introduction
Nowadays, a growing trend in intelligent mobile robot applications highlights the importance of planning and improving paths in many areas of robotics, especially in automated industrial environments where path optimization plays a major role in product scheduling [1]. Therefore, research on path optimization for mobile robots has been an active area gaining considerable attention [2] [3] [4] [5]. Despite the extensive research carried out in this field, generating optimal paths in complex environments still remains a challenge for the community [6].
Path optimization aims to generate an optimal collision-free path that links the starting position (Source) to the target position (Target) while respecting certain performance metrics. The authors in [7] stated that global optimal motion planning is an NP-hard problem. During the last decade, several works have been carried out in the literature and many algorithms have been proposed for the path optimization problem, for instance the Genetic Algorithm (GA) [5] [8], Particle Swarm Optimization (PSO) [9], the membrane Evolutionary Artificial Potential Field (memEAPF) [10], Rapidly-exploring Random Trees (RRT) [6] and the Ant Colony Optimization (ACO) algorithm [11] [12]. Although these algorithms are powerful, each has pros and cons, such as high computational complexity, long search time and low affinity accuracy [13]. In addition, these optimization algorithms present implementation issues in dynamic environments because of the difficulty of setting their parameters in real time. Thus, solving these difficult issues, especially avoiding dynamic obstacles, requires an intelligent learning process using Machine Learning (ML) techniques.
To support autonomous mobile robots learning in unknown environments, ML algorithms are used as preferred alternatives that go beyond conventional methods. In this situation, ML appears through Reinforcement Learning (RL) as a solution that allows agents to adjust their actions in a specific environment to obtain maximum results [14]. RL was originally used in the disciplines of game theory, information theory, control theory and operational research [15], and over time it was adopted in the navigation of autonomous mobile robots [15] [16]. Recently, many RL algorithms have emerged that focus on path planning and obstacle avoidance. However, less emphasis is placed on improving the path, and even less on how to improve the path while reducing the huge computation time needed to attain convergence. In this context, the Q-Learning (QL) method [17] is widely used to implement RL and has recently been applied in various fields of robotics. QL is the most widely used RL algorithm due to its simplicity and its proof of convergence [18] [19]. Using QL algorithms in path optimization allows intelligent mobile robots to adapt their behavior in their workspaces based on previous observations. However, one of the major limitations when using QL to optimize the path of mobile robots is the difficulty of visiting all State-Action pairs, due to the enormous time and computation burden and the number of states necessary to represent the workspace [20].
The optimality and convergence speed have always been a significant subject of interest in the QL context. Consequently, numerous studies have been carried out in the literature to deal with such problems. Motivated by the fact
that the slowness of convergence can be moderated by an appropriate initialization of the Q-values, Nakashima and Ishibuchi [21] demonstrated that QL can be accelerated by adequately setting the initial Q-values using fuzzy rules. Low et al. [15] introduced the partially-guided QL concept, in which the flower pollination algorithm is used to improve the QL initialization. Ganapathy and colleagues [20] proposed an approach to reduce both the time and space complexity of the classical QL algorithm. Additionally, Jiang and Xin [22] proposed an improved QL algorithm for path planning in a free-space environment by generating fuzzified state variables for continuous space division, avoiding the curse of dimensionality. To accelerate the learning process, the authors proposed an integrated learning strategy based on space allocation. Wang et al. [23] exploited the short-term convergence of the State-Action-Reward-State-Action (SARSA) algorithm and combined it with the QL algorithm; the results showed an improvement in learning time. Das and colleagues [24] developed an alternative method to QL to deal with convergence speed. To reduce both space and time complexity by economically updating the Q-table, Goswami et al. [25] proposed an extension of the classical QL algorithm. The authors created a single field for the action of each state; the Q-table stores Q-values only for the best action in each state, which requires a short time to retrieve Q-values and thus saves significant time complexity.
Although many QL algorithms have been proposed to solve path optimization problems for mobile robots, optimal paths still remain difficult to obtain; likewise, the convergence speed of such algorithms continues to be a challenge, especially when dealing with complex environments. Thus, this work represents an additional effort to address these issues, and its main contributions can be outlined as follows. First, a new definition of the states and actions spaces is introduced, which reduces the size of the Q-table and leads to the optimal path through rapid learning. Second, motivated by the fact that the QL algorithm exploits cumulative rewards received in the future, and to give the robot a priori knowledge of its environment, a reward assignment algorithm is proposed for initializing the Q-table. The algorithm is also used in the selection policy to evaluate a set of actions in each learning step and determine the best action. Third, a new effective selection strategy is proposed to address the slow convergence of the QL algorithm, actively directing the search toward promising states that lead to the optimal path without having to explore the entire environment. Finally, comparative results are provided to validate the proposed algorithms for path optimization.
The rest of the paper is organized as follows. Section 2 presents the environment approximation and the mathematical modeling of the problem. Section 3 provides some properties of the classical QL and the proposed Efficient Q-Learning (EQL) algorithms. Section 4 shows simulation results and comparisons between EQL and other works of the literature. Finally, Section 5 gives conclusions and future works.
2. Environment approximation and mathematical modeling of the problem
To approximate obstacles using geometric shapes, rectangles are considered the ideal geometric shape, allowing any obstacle geometry to be easily approximated as indicated in Fig. 1(a). This geometry makes it possible to cover the obstacle while planning in 2D and avoiding 3D obstacles. Fig. 1(b) illustrates a real indoor environment with 3D obstacles of different shapes (chairs, desks, etc.) captured by Microsoft Kinect sensors placed on the ceiling of a room. Figs. 1(c), (d) and (e) describe, in sequence, the different steps for wrapping these obstacles in rectangles. More details on these steps can be found in a previous work [5].
Fig. 1. Obstacles approximation for path planning: (b) real encumbered environment in 3D [5]; (c) binary map of the environment; (d) approximation of the obstacles by rectangles; (e) 2D binary safe map of the environment.
The straight-line segment Seg(P_1, P_2) linking two points P_1(x_1, y_1) and P_2(x_2, y_2) is defined as:

Seg(P_1, P_2):  (y − y_1)/(y_2 − y_1) = (x − x_1)/(x_2 − x_1),  with MIN(x_1, x_2) ≤ x ≤ MAX(x_1, x_2)   (1)
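As a small illustration of equation (1), the hedged Python sketch below represents such a segment, evaluates the supporting line at a given x, and checks the MIN/MAX domain condition; the class and method names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Segment:
    """Straight-line segment between two points, in the sense of equation (1)."""
    p1: Point
    p2: Point

    def y_at(self, x: float) -> float:
        """Evaluate y on the supporting line (assumes the segment is not vertical)."""
        slope = (self.p2.y - self.p1.y) / (self.p2.x - self.p1.x)
        return self.p1.y + slope * (x - self.p1.x)

    def contains_x(self, x: float) -> bool:
        """Domain condition of (1): MIN(x1, x2) <= x <= MAX(x1, x2)."""
        return min(self.p1.x, self.p2.x) <= x <= max(self.p1.x, self.p2.x)

if __name__ == "__main__":
    seg = Segment(Point(0.0, 0.0), Point(4.0, 2.0))
    print(seg.contains_x(2.0), seg.y_at(2.0))  # True 1.0
```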
The following notations describe the indices and parameters used in the mathematical model:
• n: number of nodes of the generated path,
• m: number of obstacles in the environment,
• i (i ∈ {1 … n}): index of the path segment defined by the i-th and (i+1)-th nodes of the path,
• j (j ∈ {1 … m}): index of the j-th obstacle in the navigation environment,
• B_{i,t,j} = 1 if there exists a point P(x_P, y_P) that belongs both to the path segment Path_Seg_i and to the side Obs_Seg_{t,j} of the j-th obstacle, and 0 otherwise, ∀ i ∈ {1 … n}, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m},
• A_{t,r,j} = 1 if there exists a point P(x_P, y_P) that belongs both to the robot segment Rob_Seg_r and to the side Obs_Seg_{t,j} of the j-th obstacle, and 0 otherwise, ∀ t, r ∈ {1 … 4}, ∀ j ∈ {1 … m}.
The objective function (2) minimizes the total length of the path:

min L = Σ_{i=1}^{n−1} √((x_{i+1} − x_i)² + (y_{i+1} − y_i)²)   (2)

Subject to:
(x_i ≠ x_{i+1}) ∨ (y_i ≠ y_{i+1}),  ∀ i ∈ {1 … n − 1}   (3)
Σ B_{i,t,j} = 0,  ∀ i ∈ {1 … n}, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}   (4)
A_{t,r,j} = 0,  ∀ r, t ∈ {1 … 4}, ∀ j ∈ {1 … m}   (5)
B_{i,t,j} ∈ {0, 1},  A_{t,r,j} ∈ {0, 1},  ∀ i ∈ {1 … n}, ∀ j ∈ {1 … m}   (6)
The objective function, given by (2), consists of minimizing the length of the path. The constraint given by (3)
requires that all nodes are different. Constraints in (4) ensure that the path segments do not overlap environment obstacles.
The constraint (5) prevents the robot from crossing obstacles when moving through the path nodes. Finally, the constraints
(6) require all 𝐴 and 𝐵 variables to be binary.
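To make the model concrete, here is a hedged Python sketch (an illustration, not the paper's implementation) that evaluates a candidate path: it computes the length objective of (2) and a collision flag in the spirit of the B variables in (4), with obstacles modelled as axis-aligned rectangles. The Liang-Barsky clipping test below is a standard geometric check chosen for illustration, not necessarily the test used by the authors.

```python
import math

def path_length(nodes):
    """Objective (2): total Euclidean length of the polyline nodes = [(x, y), ...]."""
    return sum(math.dist(a, b) for a, b in zip(nodes, nodes[1:]))

def segment_hits_rect(p, q, rect):
    """Liang-Barsky clipping test: True if segment p-q crosses (or lies inside)
    the axis-aligned rectangle rect = (xmin, ymin, xmax, ymax)."""
    (x1, y1), (x2, y2) = p, q
    xmin, ymin, xmax, ymax = rect
    dx, dy = x2 - x1, y2 - y1
    t0, t1 = 0.0, 1.0
    for pk, qk in ((-dx, x1 - xmin), (dx, xmax - x1),
                   (-dy, y1 - ymin), (dy, ymax - y1)):
        if pk == 0:
            if qk < 0:          # segment parallel to this boundary and outside it
                return False
            continue
        t = qk / pk
        if pk < 0:
            t0 = max(t0, t)
        else:
            t1 = min(t1, t)
        if t0 > t1:
            return False
    return True

def path_is_collision_free(nodes, obstacles):
    """Constraint (4): no path segment may overlap an obstacle rectangle."""
    return not any(segment_hits_rect(a, b, r)
                   for a, b in zip(nodes, nodes[1:]) for r in obstacles)

if __name__ == "__main__":
    obstacles = [(2.0, 2.0, 4.0, 4.0)]
    path = [(0.0, 0.0), (1.0, 5.0), (6.0, 5.0)]
    print(path_length(path), path_is_collision_free(path, obstacles))
```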
3. Classical QL and the proposed EQL algorithms
To achieve this objective, the robot adjusts its action strategy through the reward r provided by environmental feedback, which assesses the quality of its actions. For an action a_t taken at state s_t at time step t, the Q-value is updated using the Bellman equation as in (7):
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r(s_t, a_t) + γ · max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t) ]   (7)

where α is the learning rate, γ is the discount factor, r(s_t, a_t) denotes the immediate reward/penalty acquired by the agent after executing action a_t at state s_t, and max_{a∈A} Q(s_{t+1}, a) is the maximum Q-value among all possible actions in the new state s_{t+1}.
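For reference, a minimal Python rendering of the update rule (7); it assumes the Q-table is stored as a NumPy array indexed by (state, action), which is an illustrative choice rather than the paper's data structure.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Bellman update of equation (7):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

if __name__ == "__main__":
    Q = np.zeros((5, 3))                     # 5 states, 3 actions
    q_update(Q, s=0, a=1, reward=-2.0, s_next=3)
    print(Q[0, 1])                           # -0.2
```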
The classical QL procedure is summarized below:

begin
  Initialize Q(s_t, a_t) ← 0 for all State-Action pairs (m states and n actions, i.e. a Q-table of dimension m × n)
  for (each episode):
    (1) set s_t ← a random state from the states set S;
    while s_t ≠ Goal state
      (2) Choose a_t in s_t by using an adequate policy (ε-greedy, etc.);
      (3) Perform action a_t, and receive reward/penalty and s_{t+1};
      (4) Update Q(s_t, a_t) using equation (7);
      s_t ← s_{t+1};
    end-while
  end-for
end
In an environment with m states and n actions, the dimension of the Q-table is m × n. Choosing the action with the highest Q-value amongst n possible actions requires n − 1 comparisons. Thus, the number of comparisons necessary to update the Q-table over the m states is m(n − 1). As the number of states representing the workspace increases, the time required by the QL algorithm to complete the path optimization increases with the search space, which prevents the algorithm from providing good solutions. Therefore, one of the main problems when applying the QL algorithm to path optimization is the difficulty of visiting all State-Action pairs during the training process due to the huge computation time required [11]; this, in turn, negatively affects convergence to the optimal path.
Exploiting the straight-line segment linking the current robot position to Target, as shown in Fig. 3(c), leads to finding the shortest path. In this regard, this work proposes a new effective definition of the states and actions. To allow the robot to safely avoid an obstacle, four points called EscapingPoints are defined for each obstacle in the workspace, as seen in Figs. 3(a) and (b). Each EscapingPoint is the node at a corner of the rectangle surrounding the obstacle. The EscapingPoint coordinates are computed as a function of the four points forming the rectangle while respecting the safety distance SafetyDist, which makes it possible to define the coordinates (X, Y) of the i-th EscapingPoint and to ensure that the robot does not collide with the j-th obstacle.
In this work, the states are defined as the EscapingPoints locations in the environment, plus Target (s_T). The actions space is discretized into 4 × m + 1 movements; this set consists of all moves to the EscapingPoints assigned to each obstacle, plus the MoveToTarget action.
Considering an environment with m obstacles (Obs_j, j ∈ {1 … m}), the actions space A and the states space S are defined as follows:

A = {MoveToEscapingPoint_lj, …, MoveToTarget}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
S = {s_lj(x_lj, y_lj), …, s_T(x_T, y_T)}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
In this manner, the search space is reduced while preserving the states that lead to the optimal path. Additionally, this discretization makes it possible to assess and monitor the reward/penalty without the need for a robot simulation environment.
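The sketch below (an illustration under assumptions, not the authors' code) builds such a state set from the obstacle rectangles: each obstacle contributes four EscapingPoints, taken here as the rectangle corners pushed outwards by SafetyDist, plus the Target state, giving the 4 × m + 1 entries mentioned above; the corner numbering is assumed.

```python
def escaping_points(rect, safety_dist):
    """Four EscapingPoints of one obstacle: corners of the rectangle
    rect = (xmin, ymin, xmax, ymax) pushed outwards by safety_dist.
    (The exact corner numbering used in the paper is assumed here.)"""
    xmin, ymin, xmax, ymax = rect
    d = safety_dist
    return [(xmin - d, ymin - d),   # EscapingPoint01
            (xmax + d, ymin - d),   # EscapingPoint02
            (xmax + d, ymax + d),   # EscapingPoint03
            (xmin - d, ymax + d)]   # EscapingPoint04

def build_state_space(obstacles, target, safety_dist=10.0):
    """States = all EscapingPoints of the m obstacles + Target, i.e. 4*m + 1 entries."""
    states = [p for rect in obstacles for p in escaping_points(rect, safety_dist)]
    states.append(target)
    return states

if __name__ == "__main__":
    obstacles = [(100, 100, 200, 150), (300, 70, 350, 170)]
    print(len(build_state_space(obstacles, target=(500, 400))))  # 9 = 4*2 + 1
```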
Algorithm 2 assigns the rewards used to fill the Q-table: when the evaluated move collides with an obstacle, detected with CollisionCheck(s_kr(x_kr, y_kr), s_lj(x_lj, y_lj)), where s_kr(x_kr, y_kr) represents the current state, a penalty is assigned; otherwise the reward is set to Reward = −‖s_s − s_T‖,
where:
• ‖s_s − s_T‖ denotes the Euclidean distance calculated between the two robot locations defined by the two states.
The reward function is also used to judge the quality of an action in the selection strategy. The reward assignment process always favors the EscapingPoints around this part of the line segment, which helps reinforce the selection of the best action.
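Based on the description above and on the surviving fragment of Algorithm 2, a hedged sketch of the reward assignment: a move that collides with an obstacle receives a large penalty, otherwise the reward is the negative Euclidean distance to Target, so states closer to Target are favored. The penalty value and the collision-test interface are assumptions made for illustration.

```python
import math

COLLISION_PENALTY = -1_000.0   # assumed value; the paper does not report the exact penalty

def assign_reward(current, candidate, target, collides):
    """Reward used both to initialize the Q-table and to score actions in the
    selection strategy: penalty on collision, otherwise -||candidate - target||."""
    if collides(current, candidate):
        return COLLISION_PENALTY
    return -math.dist(candidate, target)

if __name__ == "__main__":
    free = lambda p, q: False            # toy collision checker: never collides
    print(assign_reward((0, 0), (3, 4), (0, 0), free))   # -5.0
```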
In the QL algorithm, successful learning is highly dependent on the design of the strategy that selects the optimal action at each learning step. In addition, the choice of an action must ensure a balance between exploration and exploitation of the learning already acquired. In fact, pure exploitation quickly leads to locally optimized policies, whereas excessive exploration reduces the performance of the learning process even if it can avoid locally optimal policies [15]. In this regard, the proposed EQL uses a new strategy to ensure a good balance between exploration and exploitation. For this purpose, the reward assignment algorithm is used to effectively initialize the Q-table. This gives the robot prior experience of the workspace and broadens exploration. Therefore, EQL handles exploration using the reward assignment algorithm during the initialization of the Q-table and then in the selection strategy. On the other hand, a new selection strategy is proposed on the basis of the reward assignment: it exploits the effective states around the line segment connecting the current state of the robot (s_t) to the goal state (s_T), and then explores the specific states imposed by the reward assignment algorithm. The purpose is to teach the robot, at each learning step, to move toward the best EscapingPoint around the line segment that connects s_t with s_T, leading to the optimal path.
Algorithm 3 describes the selection strategy for choosing the best action (i.e., the move to the best EscapingPoint) at each learning step. Assume that s_t(x_t, y_t), s_{t+1}(x_{t+1}, y_{t+1}) and s_T(x_T, y_T) denote the current state, the next state and the Target state, respectively. This algorithm contains three main steps. In the first step, the linear segment Seg(s_t, s_T) is calculated using equation (1). Then, the set of obstacles Obs[] = Seg(s_t, s_T) ∩ Obs_Seg_tj, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}, is defined. If Seg(s_t, s_T) ∩ Obs_Seg_tj = ∅, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}, the algorithm immediately returns the best action MoveToTarget; otherwise, it proceeds with the other steps. In the second step, two vectors are built for the obstacles contained in the Obs[] set: Allowable_Moves[], which stores the authorized actions associated with the EscapingPoints of all obstacles in this set; and Assessment[], which stores the values calculated to assess the quality of each action in Allowable_Moves[]. It should be noted that at each state s_t, the components of the Allowable_Moves[] vector represent the search space at this learning step, which is therefore minimized. Next, the best EscapingPoint, which can lead to choosing the best action, is determined by selecting the obstacle Nearest_Obs closest to the current state s_t from Obs[]. Then, the points of intersection with Nearest_Obs are calculated as {P_1, P_2} = Seg(s_t, s_T) ∩ Obs_Seg_ti, ∀ t ∈ {1 … 4}, ∀ i ∈ {0 … length(Obs[])}. The best EscapingPoint is selected according to the locations of the intersection points P_1 and P_2 on the segments of Nearest_Obs, as described in Algorithm 3 and Fig. 4. In the last step, an assessment value is assigned to each action in the vector Allowable_Moves[] and stored in the vector Assessment[] using Algorithm 2. This value represents the importance of taking a specific action in the current state s_t.
Algorithm 3: Selection strategy
Output:
- Best action
begin
Initialization:
 - Calculate the mathematical equation of the line segment Seg(s_t, s_T) using equation (1)
 - Initialize the vector Allowable_Moves[]
 - Initialize the vector Assessment[]
Obs[] = Seg(s_t, s_T) ∩ Obs_Seg_tj, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}
if (Obs[] = ∅)   // no obstacle collides with the line segment linking the current state s_t with s_T
   return Best action = MoveToTarget
else
   Define the subset of actions A related to the EscapingPoints associated with the set of obstacles in Obs[]
   Allowable_Moves[] = A
   Nearest_Obs = Arg(MinDistance(obs_i, s_t)), ∀ i ∈ {1 … length(Obs[])}
   {P_1, P_2} = Seg(s_t, s_T) ∩ Nearest_Obs
   if {P_1, P_2} ∈ {Seg_1, Seg_2}:
      best EscapingPoint = EscapingPoint02
   else if {P_1, P_2} ∈ {Seg_2, Seg_3}:
      best EscapingPoint = EscapingPoint03
   else if {P_1, P_2} ∈ {Seg_3, Seg_4}:
      best EscapingPoint = EscapingPoint04
   else if {P_1, P_2} ∈ {Seg_1, Seg_4}:
      best EscapingPoint = EscapingPoint01
   else if {P_1, P_2} ∈ {Seg_1, Seg_3}:
      Select the best EscapingPoint according to the {P_1, P_2} coordinates as described in Fig. 4(a)
   else if {P_1, P_2} ∈ {Seg_2, Seg_4}:
      Select the best EscapingPoint according to the {P_1, P_2} coordinates as described in Fig. 4(b)
   Assessment[] = Evaluate the rewards of each action in Allowable_Moves[] using Algorithm 2
   index = Arg(max Assessment[])
   return Best action = Allowable_Moves[index]
end
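For readers who prefer running code to pseudocode, here is a condensed Python sketch of the selection strategy above. It assumes the helpers segment_hits_rect(), escaping_points() and assign_reward() from the earlier sketches are available in the same module, and it replaces the side-based choice of a single best EscapingPoint (Fig. 4) by simply scoring all EscapingPoints of the blocking obstacles with the reward function; it is therefore an approximation of Algorithm 3, not a faithful reimplementation.

```python
# Assumes segment_hits_rect(), escaping_points() and assign_reward() from the
# earlier sketches are defined in the same module.

def select_action(s_t, s_T, obstacles, safety_dist):
    """Condensed selection strategy: if the straight segment s_t -> s_T is
    obstacle-free, return MoveToTarget; otherwise score the EscapingPoints of
    the blocking obstacles with the reward function and return the best move."""
    blocking = [rect for rect in obstacles if segment_hits_rect(s_t, s_T, rect)]
    if not blocking:
        return ("MoveToTarget", s_T)

    allowable_moves = []                                   # Allowable_Moves[]
    for rect in blocking:
        for k, p in enumerate(escaping_points(rect, safety_dist), start=1):
            allowable_moves.append((f"MoveToEscapingPoint{k:02d}", p))

    def collides(p, q):
        return any(segment_hits_rect(p, q, rect) for rect in obstacles)

    assessment = [assign_reward(s_t, point, s_T, collides)  # Assessment[]
                  for _, point in allowable_moves]
    best_index = assessment.index(max(assessment))          # Arg(max Assessment[])
    return allowable_moves[best_index]
```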
The overall EQL procedure combines the reward-based initialization of the Q-table with the proposed selection strategy. EQL constantly learns the policy by maintaining a numeric value for each State-Action pair in the Q-table. This numerical value is updated as the robot explores its environment. Note that initializing the Q-table using Algorithm 2 (without considering the best EscapingPoint) provides the robot with prior information about its navigation workspace. The robot starts from the starting position Source (learning takes place only from s_S to s_T) and chooses the next state using Algorithm 3 (which refers to Algorithm 2). Since all the possible next states have different Q-values, choosing the next state using Algorithm 3 is not done randomly as in classical QL. After moving to the next state, the Q-value of the State-Action pair is updated using equation (7). The learning process continues until Target is reached.
Once the optimization process is completed and the resulting optimal policy is stored in the Q-table, the path is planned. During this process, the robot starts from its initial state s_S; then, the next best state s_{t+1} is determined as the one whose Q-value is greater than the Q-values of all other states adjacent to s_t.
The overall EQL procedure is summarized below:

1. Learning process:
Initialization:
 - Define the states space S and the actions space A:
   A = {MoveToEscapingPoint_lj, …, MoveToTarget}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
   S = {s_lj(x_lj, y_lj), …, s_T(x_T, y_T)}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
 - Initialize Q-Table[][] using Algorithm 2 without considering the best EscapingPoint
for each episode
   set s_t = s_S
   repeat
      Choose an action a_t in s_t by using the policy developed in Algorithm 3
      Perform action a_t, then receive r(s_t, a_t) as described in Algorithm 2 and s_{t+1}
      Find the largest Q-value for all actions in s_{t+1}: max_{a∈A} Q(s_{t+1}, a)
      Update Q-Table[][] using equation (7)
      s_t = s_{t+1}
   until s_t = s_T
end-for
2. Planning process:
Input:
 - Optimal Q-Table
Output:
 - Optimal sequence of states
 - MAP
Set s_t = s_S
repeat
   Get the best next state s_{t+1}, where the Q-value of s_{t+1} is greater than the Q-values of the other states adjacent to s_t: s_{t+1} = Arg(max_{a∈A} Q(s_t, a))
   Go to s_{t+1}
   s_t = s_{t+1}
until s_t = s_T
end
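A small Python sketch of the planning step that follows training: starting from the source state, repeatedly move to the adjacent state with the largest Q-value until Target is reached. The Q-table layout (a dict keyed by state with per-successor values) and the neighbors function are illustrative assumptions, not the paper's data structures.

```python
def extract_path(q_table, s_source, s_target, neighbors, max_steps=1_000):
    """Greedy path extraction: from each state choose the successor with the
    largest Q-value, as in the planning process described above.
    q_table[s] is a dict mapping successor states to Q-values."""
    path = [s_source]
    s = s_source
    for _ in range(max_steps):                     # guard against a badly trained table
        if s == s_target:
            return path
        candidates = {n: q_table[s].get(n, float("-inf")) for n in neighbors(s)}
        s = max(candidates, key=candidates.get)    # next best state
        path.append(s)
    raise RuntimeError("Target not reached; the Q-table may not have converged")

if __name__ == "__main__":
    # Toy 3-state example: A -> B -> T
    q = {"A": {"B": 1.0, "T": -5.0}, "B": {"T": 2.0}}
    nbrs = {"A": ["B", "T"], "B": ["T"], "T": []}
    print(extract_path(q, "A", "T", neighbors=lambda s: nbrs[s]))  # ['A', 'B', 'T']
```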
4. Computational results and performance analysis
This section presents simulation results and evaluates the effectiveness of the proposed EQL approach (implemented
using C#.Net). All simulations are performed on a PC with Intel Core i3 3.3GHz, 8GB RAM and 64-bit Windows operating
system. First, EQL feasibility is tested via five designed complex maps (MAP1 to MAP5) with different arrangements of
static obstacles. Second, the effectiveness and performances of EQL are compared with existing algorithms through
benchmarks in the literature. To this end, two types of comparisons are conducted:
• The first type of comparison is performed against other QL-based algorithms to provide better support for the proposed EQL. This comparison clearly shows the improvement brought by the proposed EQL algorithm, demonstrates its efficiency, and shows how well it performs when compared to the latest related QL-based algorithms.
• The second type is performed against well-known meta-heuristic optimization algorithms from the literature. This comparison further demonstrates the efficiency of the proposed EQL in providing better results even in complex maps.
Path length and computation time metrics are used in both comparisons. It is important to note that, for all the tests considered, SafetyDist belongs to the [5 (unit), 15 (unit)] range; this shows the effect of this parameter on the robot safety while it is moving. Moreover, the reported computation time of EQL includes all the required calculations (computation of the lines/segments at each step, reward values, etc.), except the calculation of the rectangles around the obstacles, which is negligible.
Regarding the comparison with meta-heuristic optimization algorithms, GA and memEAPF are used to demonstrate the robust performance of EQL on different maps, denoted M1 to M12 [10] and Env01 to Env04 [8], as shown in Figs. 7 and 8, respectively.
Tables 3, 4, 5 and 6 (Appendix) provide details on the verification environments, together with Source and Target, adopted from [15], [10] and [8], respectively. The obstacles in environments M1 to M12 are circular; they are enveloped in rectangles as shown in Fig. 7. Unfortunately, the environments discussed in [8] are only provided as pictures, without any information about the obstacles. To exploit them, a program has been developed to extract the coordinates and characteristics of each obstacle, as shown in Table 6. Environments M1 to M12 are compared with the memEAPF approach [10]; Env01 to Env04 are compared with the GA approach [8].
Fig. 5 shows the paths obtained by the proposed EQL on five maps cluttered with obstacles in different arrangements (MAP1 to MAP5). The generated paths are clearly highly optimized and do not collide with the obstacles. Moreover, the computation times for the five cases are 0.20 s, 0.20 s, 0.28 s, 0.37 s and 0.20 s, respectively. Thus, with the proposed EQL, the robot can learn fast and obtain optimal solutions in a very short time. Additionally, when an obstacle on a map is rotated (e.g., the blue obstacle in MAP2), the final path is modified without losing optimality (compared to MAP1).
Fig. 5. Obtained paths by the proposed EQL for MAP1 to MAP5 (S and T represent Source and Target, respectively); panels (a) to (e) correspond to MAP1 to MAP5.
Table 1 compares the results of the proposed EQL with those obtained by the IQ-FPA, classical QL and IDQ algorithms. The table also reports the results obtained over 30 independent runs (non-deterministic algorithm) for the IQ-FPA algorithm, tested on a PC with an AMD A10 quad-core 2.5 GHz CPU. Note that the computation time of the IQ-FPA algorithm is the average value over the 30 independent runs.
Fig. 6. Obtained paths in the benchmark environments: (a) environment with 8 obstacles; (b) environment with 9 obstacles (the remaining panels show the environments with 10 to 15 obstacles).
Looking at the computation times in Table 1, the processing time of IQ-FPA and IDQ increases with the number of obstacles, whereas classical QL behaves in the opposite way. Moreover, a common characteristic of these three algorithms is their high computation time in all tested environments. It is satisfying to note that the proposed EQL converges in a very short time for all maps. Despite the high hardware performance used in [15], the EQL
computation time is significantly better for all the considered maps, with a strong improvement ranging between 88.85% and 95.17%. This is mainly due to the reduction of the search space at every stage of the learning process, since only the effective states around the line segment linking the current state of the robot to the target state are exploited. For instance, in the navigation map with 8 obstacles, only two obstacles intersect the straight-line segment between s_S and s_T. The same applies to the other maps, where only a few obstacles are taken into account in the learning process. Thus, with the proposed EQL, the robot can learn to move towards its imposed Target along the optimal path in a very short computation time.
Table 1: Comparison of the performances of the proposed EQL approach with the QL, IDQ and IQ-FPA algorithms.

                    Path length (Unit)                              Computation time (s)
Environment         QL      IDQ     IQ-FPA  EQL     Improv. (%)     QL     IDQ     IQ-FPA  EQL    Improv. (%)
T1 (8 obstacles)    28.93   26.27   29.33   19.85   24.43           4.06   9.26    3.52    0.17   95.17
T2 (9 obstacles)    30.67   26.37   31.27   19.69   26.25           4.04   10.05   3.62    0.18   95.02
T3 (10 obstacles)   30.00   26.00   29.60   20.20   22.30           3.89   10.28   3.88    0.23   94.07
T4 (11 obstacles)   27.67   26.27   28.73   20.20   33.10           3.64   27.10   3.98    0.23   94.22
T5 (12 obstacles)   27.67   26.53   29.00   20.20   23.85           3.71   23.39   4.00    0.34   91.5
T6 (15 obstacles)   30.07   26.00   26.80   18.93   27.19           3.23   2.87    3.66    0.32   88.85
More satisfactory results in terms of path length can also be observed in Table 1, where the proposed EQL reduces the path length with a significant improvement (a decrease of at least 22.30%). In fact, the paths created by EQL are shorter for all maps; this is due to the efficiency of the proposed selection strategy, which determines the action leading to the next best state on the optimal path. Moreover, the IQ-FPA, classical QL and IDQ algorithms produce zigzag paths, which is not the case for EQL.
With regard to the maps in Figs. 6(b), (c), (d) and (e), the same paths are obtained by the proposed EQL: even adding an obstacle away from the straight-line segment does not affect the optimization process. For the other algorithms, in contrast, additional obstacles far from the previously obtained paths negatively affect both the new paths and the computation time. The 15-obstacle navigation map in Fig. 6(f) clearly shows the effectiveness of EQL, where the proposed selection strategy is able to identify the states that lead to the optimal path in only two learning steps. On the other hand, even though wrapping the obstacles in rectangles reduces the narrow passages between obstacles, the proposed EQL is able to find optimal paths for all environments. This can be justified by the good performance of the learning process due to the high efficiency of the selection strategy proposed to select the best action. To summarize, the results of this comparison show that the proposed EQL approach is effective, that its performance is on par with or better than state-of-the-art QL-based algorithms, and that it is more efficient than the other algorithms in the literature.
Table 2: Performances of EQL on the environments M1 to M12 compared to memEAPF [10] with m = 16 (m is the best proportional gains). Bold numbers refer to best solutions.

               Path length (Unit)                        Computation time (s)
Environment    memEAPF   EQL    Improv. (%)              memEAPF   EQL     Improv. (%)
M1             5.46      5.85   -7.14                    3.50      0.046   98.68
M2             8.55      8.17   4.44                     3.38      0.046   98.63
M3             8.93      8.78   1.67                     3.27      0.016   99.51
M4             9.31      6.24   32.97                    3.13      0.031   99.00
M5             6.37      6.50   -2.04                    3.16      0.156   95.06
Fig. 7 shows the paths generated by the EQL approach, which is executed only once in each of the environments M1 to M12. It is important to note that the non-deterministic nature of the memEAPF algorithm requires many runs in the same environment to obtain meaningful results. Consequently, the results presented for this latter algorithm are the best values obtained over 30 independent runs.
Fig. 7. Obtained paths by the proposed EQL on the environments M1 to M12.
Fig. 8(a) shows the results generated by the EQL approach, which is executed only once for each workspace, whereas Fig. 8(b) illustrates the comparison with GA [8]. Since GA is a non-deterministic algorithm, its results represent the best value over 100 executions.
Fig. 8. Simulation comparison of EQL with GA for environment maps Env01 to Env04: (a) obtained paths by the proposed EQL approach; (b) obtained paths for GA as given in [8].
Figs. 9(a) and (b) show comparisons of the computation time obtained using the proposed EQL algorithm with those reported in [10] and [8], respectively. As shown in Table 2, the proposed EQL algorithm greatly improves the distance traveled in 7 of the tests compared to memEAPF, with a 6.93% improvement, whereas the best improvement in the 5 tests where memEAPF performed better was 5.43%. In addition, more interesting results can be extracted from the table, such as: (i) the significant improvement in computation time, by approximately 96%, using EQL compared to memEAPF; (ii) the ability of the EQL approach to generate satisfactory paths even in complex workspaces, as confirmed and illustrated in Fig. 7; (iii) the effectiveness of the proposed method for finding high-quality paths in terms of length and safety; and finally, (iv) in terms of computation time, EQL performs better than this powerful algorithm (i.e., memEAPF).
More interesting results are observed in Fig. 8, such as: (i) the EQL approach considerably improves the quality of the solutions in terms of path length, computation time and robot safety for environments Env02 to Env04; (ii) all paths generated by both the GA and EQL approaches are of high quality; (iii) GA clearly gives the best performance for Env01, whereas for the other environments (Env02, Env03 and Env04) good-quality paths are provided by both approaches. Although the GA algorithm [8] is effective for global path planning, the safety of its paths is clearly poor. In fact, obstacle vertices are nodes of the planned paths, which may cause collisions with the robot (i.e., less safety). Generally, the safety metric is handled in meta-heuristic algorithms by adding a penalty to the length of the path, which makes it difficult to guarantee robot safety and also affects the optimization speed. In contrast, the proposed EQL simply enforces robot safety by estimating the SafetyDist value according to the robot dimensions.
Fig. 9. Comparison of computation time between the proposed EQL, memEAPF and GA: (a) computation time on M01 to M12 (EQL vs. memEAPF); (b) computation time on Env01 to Env04 (EQL vs. GA).
According to Figs. 9(a) and (b), the execution time of the proposed EQL approach is considerably better than that of the other meta-heuristic algorithms for all the maps considered. In fact, meta-heuristic algorithms spend more time generating high-quality solutions; in addition, when the number of obstacles increases, their execution-time variability increases as well. In contrast, the EQL approach always generates good solutions in a very short time.
To summarize the obtained simulation results: (i) the proposed EQL is able to generate optimal/near-optimal (short) paths quickly and safely; (ii) the EQL approach is deterministic and requires a few milliseconds to compute a good solution in terms of length and safety (controlled using the SafetyDist parameter); (iii) meta-heuristic-based algorithms are non-deterministic and lack a good trade-off between convergence speed and path length; (iv) the performance of the proposed EQL is significantly improved compared to that of the latest related work; and finally, (v) the proposed EQL enhances path quality in terms of length, computation time and robot safety.
Acknowledgements
The authors would like to thank the reviewers for their insightful recommendations and comments on the paper, as they led to a considerable improvement in its quality. The authors would also like to express their deepest appreciation to Pr.
References
[1] Maoudj, A., Bouzouia, B., Hentout, A., Kouider, A., & Toumi, R. (2016, October). Distributed multi-agent approach
based on priority rules and genetic algorithm for tasks scheduling in multi-robot cells. In IECON 2016-42nd Annual
Conference of the IEEE Industrial Electronics Society (pp. 692-697). IEEE.
[2] Ayawli, B. B. K., Mei, X., Shen, M., Appiah, A. Y., & Kyeremeh, F. (2019). Optimized RRT-A* Path Planning
Method for Mobile Robots in Partially Known Environment. Information Technology and Control, 48(2), 179-194.
[3] Hentout, A., Maoudj, A., & Bouzouia, B. (2016, November). A survey of development frameworks for robotics. In
2016 8th International Conference on Modelling, Identification and Control (ICMIC) (pp. 67-72). IEEE.
[4] Ajeil, F. H., Ibraheem, I. K., Sahib, M. A., & Humaidi, A. J. (2020). Multi-objective path planning of an autonomous
mobile robot using hybrid PSO-MFB optimization algorithm. Applied Soft Computing, 89, 106076.
[5] Bakdi, A., Hentout, A., Boutamai, H., Maoudj, A., Hachour, O., & Bouzouia, B. (2017). Optimal path planning and
execution for mobile robots using genetic algorithm and adaptive fuzzy-logic control. Robotics and Autonomous
Systems, 89, 95-109.
[6] Hentout, A., Maoudj, A., Guir, D., Saighi, S., Aures Harkat, M., Hammouche, M. Z., & Bakdi, A. (2019). Collision-
free path planning for indoor mobile robots based on Rapidly-exploring Random Trees and Piecewise Cubic Hermite
Interpolating Polynomial. International Journal of Imaging and Robotics, 19(3), 74-94.
[7] Daniel, K., Nash, A., Koenig, S., & Felner, A. (2010). Theta*: Any-angle path planning on grids. Journal of Artificial
Intelligence Research, 39, 533-579.
[8] Karami, A. H., & Hasanzadeh, M. (2015). An adaptive genetic algorithm for robot motion planning in 2D complex
environments. Computers & Electrical Engineering, 43, 317-329.
[9] Qin, Y. Q., Sun, D. B., Li, N., & Cen, Y. G. (2004, August). Path planning for mobile robot using the particle swarm
optimization with mutation operator. In Proceedings of 2004 international conference on machine learning and
cybernetics (IEEE Cat. No. 04EX826) (Vol. 4, pp. 2473-2478). IEEE.
[10] Orozco-Rosas, U., Montiel, O., & Sepúlveda, R. (2019). Mobile robot path planning using membrane evolutionary
artificial potential field. Applied Soft Computing, 77, 236-251.
[11] Dorigo, M., & Stützle, T. (2019). Ant colony optimization: overview and recent advances. In Handbook of
metaheuristics (pp. 311-351). Springer, Cham.
[12] Brand, M., Masuda, M., Wehner, N., & Yu, X. H. (2010, June). Ant colony optimization algorithm for robot path
planning. In 2010 International Conference On Computer Design and Applications (Vol. 3, pp. V3-436). IEEE.
[13] Cheng, Y. H., Chao, P. J., & Kuo, C. N. (2019). Mobile Robot Path Planning using a Teaching-Learning-Interactive
Learning-Based Optimization. IAENG International Journal of Computer Science, 46(2).
[14] Liu, J., Wang, Q., He, C., Jaffrès-Runser, K., Xu, Y., Li, Z., & Xu, Y. (2019). QMR: Q-learning based Multi-
objective optimization Routing protocol for Flying Ad Hoc Networks. Computer Communications.
[15] Low, E. S., Ong, P., & Cheah, K. C. (2019). Solving the optimal path planning of a mobile robot using improved Q-
learning. Robotics and Autonomous Systems, 115, 143-161.
[16] Luviano, D., & Yu, W. (2017). Continuous-time path planning for multi-agents with fuzzy reinforcement learning.
Journal of Intelligent & Fuzzy Systems, 33(1), 491-501.
[17] Qu, C., Gai, W., Zhong, M., & Zhang, J. (2020). A novel reinforcement learning based grey wolf optimizer algorithm
for unmanned aerial vehicles (UAVs) path planning. Applied Soft Computing, 89, 106099.
[18] Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3-4), 279-292.
[19] Jaradat, M. A. K., Al-Rousan, M., & Quadan, L. (2011). Reinforcement based mobile robot navigation in dynamic
environment. Robotics and Computer-Integrated Manufacturing, 27(1), 135-149.
[20] Ganapathy, V., Yun, S. C., & Joe, H. K. (2009, July). Neural Q-learning controller for mobile robot. In 2009
IEEE/ASME International Conference on Advanced Intelligent Mechatronics (pp. 863-868).
[21] Oh, C. H., Nakashima, T., & Ishibuchi, H. (1998, May). Initialization of Q-values by fuzzy rules for accelerating Q-
learning. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on
Computational Intelligence (Cat. No. 98CH36227) (Vol. 3, pp. 2051-2056). IEEE.
[22] Jiang, J., & Xin, J. (2019). Path planning of a mobile robot in a free-space environment using Q-learning. Progress
in Artificial Intelligence, 8(1), 133-142.
[23] Wang, Y. H., Li, T. H. S., & Lin, C. J. (2013). Backward Q-learning: The combination of Sarsa algorithm and Q-
learning. Engineering Applications of Artificial Intelligence, 26(9), 2184-2193.
[24] Das, P. K., Mandhata, S. C., Behera, H. S., & Patro, S. N. (2012). An improved Q-learning algorithm for path-
planning of a mobile robot. International Journal of Computer Applications, 51(9).
[25] Goswami, I., Das, P. K., Konar, A., & Janarthanan, R. (2010, December). Extended Q-learning algorithm for path-
planning of a mobile robot. In Asia-Pacific Conference on Simulated Evolution and Learning (pp. 379-383).
Springer, Berlin, Heidelberg.
[26] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788.
[27] Redmon, J., & Farhadi, A. (2017). YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7263-7271).
Appendix
Table 3: Proposed test environments MAP1 to MAP5. (x_j, y_j) are the coordinates of the lower-left corner of the j-th rectangle; Width and Height are the width and height of the j-th rectangle, respectively. Obstacles are given as Obs_j(x_j, y_j, Width, Height).

MAP1: S(10, 50), T(500, 400); obstacles: (0, 240, 60, 60), (0, 380, 80, 60), (60, 500, 80, 80), (120, 220, 140, 80), (120, 400, 80, 40), (160, 320, 20, 20), (180, 80, 80, 60), (300, 40, 60, 180), (360, 260, 60, 180), (420, 40, 180, 80), (480, 200, 120, 120), (380, 480, 120, 100).

MAP2: S(170, 50), T(400, 430); obstacles: (380, 95, 100, 40), (300, 70, 50, 100), (150, 100, 100, 50), (120, 50, 40, 100), (405, 120, 50, 100), (150, 180, 50, 100), (230, 200, 100, 50), (380, 250, 100, 50), (435, 320, 40, 100) (rotated by 45°), (360, 380, 100, 40), (290, 370, 45, 90), (100, 370, 140, 50), (100, 300, 40, 100), (260, 280, 50, 50).

MAP3: S(180, 430), T(435, 50); obstacles: (300, 70, 50, 100), (150, 100, 100, 50), (120, 50, 40, 100), (405, 120, 50, 100), (150, 180, 50, 100), (230, 200, 100, 50), (380, 250, 100, 50), (435, 320, 40, 100), (360, 380, 100, 40), (290, 390, 45, 90), (100, 370, 140, 50), (100, 300, 40, 100), (260, 280, 50, 50).

MAP4: S(170, 50), T(400, 430); obstacles: (380, 95, 100, 40), (300, 70, 50, 100), (150, 100, 100, 50), (120, 50, 40, 100), (405, 120, 50, 100), (150, 180, 50, 100), (230, 200, 100, 50), (380, 250, 100, 50), (435, 320, 40, 100), (360, 380, 100, 40), (290, 370, 45, 90), (100, 370, 140, 50), (100, 300, 40, 100), (260, 280, 50, 50).

MAP5: S(50, 100), T(550, 620); obstacles: (0, 0, 150, 50), (400, 0, 300, 100), (150, 100, 150, 250), (450, 200, 250, 160), (0, 400, 250, 100), (300, 450, 250, 100), (105, 600, 100, 200), (300, 650, 250, 100), (650, 500, 100, 250), (350, 600, 100, 100), (350, 400, 100, 100), (0, 200, 100, 100).
Table 4: Test environments T1 to T6 [15]. Note that Obstacle7 is formed by three rectangles: (10, 8, 2, 2), (9, 4, 2, 2) and (10, 6, 1, 2). For all environments, S(2, 16) and T(19, 11); obstacles are given as Obs_j(x_j, y_j, Width, Height).

T1 (8 obstacles): (2, 9, 2, 3), (6, 12, 4, 6), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4).

T2 (9 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4).

T3 (10 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4), (6, 8, 2, 2).

T4 (11 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (6, 8, 2, 2), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4), (18, 6, 2, 2).

T5 (12 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (6, 8, 2, 2), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4), (18, 6, 2, 2), (12, 14, 2, 6).

T6 (15 obstacles): (6, 18, 1, 1), (5, 15, 1, 1), (6, 12, 1, 1), (6, 9, 1, 1), (6, 6, 1, 1), (10, 16, 1, 1), (10, 13, 1, 1), (10, 10, 1, 1), (10, 7, 1, 1), (10, 4, 1, 1), (14, 14, 1, 1), (14, 11, 1, 1), (14, 8, 1, 1), (14, 7, 1, 1), (14, 2, 1, 1).
Table 5 (excerpt): environment M12: S(1.5, 7.5), T(8.5, 3.0); obstacles: (5.0, 5.0, 0.3), (3.5, 3.5, 0.5), (3.5, 6.5, 0.5), (6.5, 3.5, 0.5), (6.5, 6.5, 0.5).
Highlights:
• In this work, the mobile robot path optimization problem is handled and modeled.
• To this end, an Efficient Q-Learning (EQL) algorithm is proposed that assures a collision-free optimal path in minimum computation time.
• In the proposed EQL:
   o A new definition of the states space and actions space is introduced.
   o A new reward function is proposed to initialize the Q-table and give the robot a priori knowledge about its environment.
   o The learning process is sped up by a new efficient selection strategy that reduces the search space and assures fast convergence to an optimized solution.
• Computational results on benchmarks from the state-of-the-art and comparisons with well-known planners of the literature prove that:
   o EQL has a good learning efficiency.
   o Training performance is significantly improved compared with that of the state-of-the-art.
   o EQL outperforms the other optimization algorithms while improving the quality of the paths in terms of length, computation time and robot safety.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests: