Optimal Path Planning Approach Based On Q-Learning Algorithm For Mobile Robots
PII: S1568-4946(20)30734-1
DOI: https://doi.org/10.1016/j.asoc.2020.106796
Reference: ASOC 106796
Please cite this article as: A. Maoudj and A. Hentout, Optimal path planning approach based on
Q-learning algorithm for mobile robots, Applied Soft Computing Journal (2020), doi:
https://doi.org/10.1016/j.asoc.2020.106796.
Centre de Développement des Technologies Avancées (CDTA)
Division Productique et Robotique (DPR)
BP 17, Baba Hassen, Algiers 16303, Algeria.
[email protected], [email protected]
Abstract: Optimizing paths within a short computation time remains a major challenge for mobile robotics applications. In path planning and obstacle avoidance, the Q-Learning (QL) algorithm has been widely used as a computational method of learning through environment interaction. However, less emphasis is placed on path optimization using QL because of its slow and weak convergence towards optimal solutions. Therefore, this paper proposes an Efficient Q-Learning (EQL) algorithm to overcome these limitations and ensure an optimal collision-free path in the least possible time. In the QL algorithm, successful learning depends closely on the design of an effective reward function and an efficient selection strategy for the optimal action that ensures both exploration and exploitation. In this regard, a new reward function is proposed to initialize the Q-table and provide the robot with prior knowledge of the environment, followed by a new efficient selection strategy that accelerates the learning process through search-space reduction while ensuring rapid convergence towards an optimized solution. The main idea is to intensify, at each learning stage, the search around the straight-line segment linking the current position of the robot to Target (the optimal path in terms of length). During the learning process, the proposed strategy favors promising actions that not only lead to an optimized path but also accelerate the convergence of the learning process. The proposed EQL algorithm is first validated using benchmarks from the literature, then compared with other existing QL-based algorithms. The achieved results show that the proposed EQL gains good learning proficiency and that the training performance is significantly improved compared to the state-of-the-art. In conclusion, EQL improves the quality of the paths in terms of length, computation time and robot safety, and outperforms other optimization algorithms.
Keywords: Path optimization, Efficient Q-Learning, Efficient selection strategy, Convergence speed, Training performance.
1. Introduction
Nowadays, a growing trend in intelligent mobile robot applications highlights the importance of planning and improving paths in many areas of robotics, especially in automated industrial environments where path optimization plays a major role in product scheduling [1]. Therefore, research on path optimization for mobile robots has been an active area gaining considerable attention [2] [3] [4] [5]. Despite the extensive research carried out in this field, generating optimal paths in complex environments still remains a challenge for the community [6].
Path optimization aims to generate an optimal collision-free path that links the starting position (Source) to the target position (Target) while respecting certain performance metrics. The authors in [7] stated that global optimal motion planning is an NP-hard problem. During the last decade, several works have been carried out in the literature and many algorithms have been proposed for the path optimization problem, for instance the Genetic Algorithm (GA) [5] [8], Particle Swarm Optimization (PSO) [9], the membrane Evolutionary Artificial Potential Field (memEAPF) [10], Rapidly-exploring Random Trees (RRT) [6] and the Ant Colony Optimization (ACO) algorithm [11] [12]. Although these algorithms are powerful, each has pros and cons, such as high computational complexity, long search time and low affinity accuracy [13]. In addition, these optimization algorithms present implementation issues in dynamic environments because of the difficulty of setting their parameters in real time. Thus, solving these difficult issues, especially avoiding dynamic obstacles, requires an intelligent learning process using Machine Learning (ML) techniques.
To support autonomous mobile robots learning in unknown environments, ML algorithms are used as preferred alternatives that go beyond conventional methods. In this situation, ML appears through Reinforcement Learning (RL) as a solution that allows agents to adjust their actions in a specific environment to obtain maximum results [14]. RL was originally used in the disciplines of game theory, information theory, control theory and operational research [15], and over time it was adopted in the navigation of autonomous mobile robots [15] [16]. Recently, many RL algorithms have emerged that focus on path planning and obstacle avoidance. However, less emphasis is placed on improving the path, and even less on how to improve the path while reducing the huge computation time needed to attain convergence. In this context, the Q-Learning (QL) method [17] is widely used to implement RL and has recently been applied in various fields of robotics. QL is the most widely used RL algorithm due to its simplicity and its proof of convergence [18] [19]. Using QL algorithms in path optimization allows intelligent mobile robots to adapt their behavior in their workspaces based on previous observations. However, one of the major limitations when using QL to optimize the path of mobile robots is the difficulty of visiting all State-Action pairs, due to the enormous time and computation burden and the number of states necessary to represent the workspace [20].
The optimality and convergence speed have always been a significant subject of interest in the QL context. Consequently, numerous studies have been carried out in the literature to deal with such problems. Motivated by the fact
that the slowness of convergence can be moderated by an appropriate initialization of the Q-values, Nakashima and Ishibuchi [21] demonstrated that QL can be accelerated by adequately setting the initial Q-values using fuzzy rules. Low et al. [15] introduced the partially-guided QL concept, in which the flower pollination algorithm is used to improve the QL initialization. Ganapathy and colleagues [20] proposed an approach to reduce both the time and space complexity of the classical QL algorithm. Additionally, Jiang and Xin [22] proposed an improved QL algorithm for path planning in a free-space environment by generating fuzzified state variables for continuous space division, avoiding the curse of dimensionality. To accelerate the learning process, the authors proposed an integrated learning strategy based on space allocation. Wang et al. [23] exploited the short-term convergence of the State-Action-Reward-State-Action (SARSA) algorithm and combined it with the QL algorithm; the results showed an improvement in learning time. Das and colleagues [24] developed an alternative method to QL to deal with convergence speed. To reduce both space and time complexity by economically updating the Q-table, Goswami et al. [25] proposed an extension of the classical QL algorithm. The authors created a single field for the action of each state; the Q-table stores Q-values only for the best action in each state, which requires a short time to retrieve Q-values and thus saves significant time complexity.
Although many QL algorithms have been proposed to solve path optimization problems for mobile robots, optimal paths still remain difficult to obtain; likewise, the convergence speed of such algorithms continues to be a challenge, especially when dealing with complex environments. Thus, this work represents an additional effort to address these issues, and its main contributions can be outlined as follows. First, a new definition of the states and actions spaces is introduced, which reduces the size of the Q-table and leads to the optimal path through rapid learning. Second, motivated by the fact that the QL algorithm exploits cumulative rewards received in the future, and to give the robot a priori knowledge of its environment, a reward assignment algorithm is proposed for initializing the Q-table. The algorithm is also used in the selection policy to evaluate a set of actions in each learning step and determine the best action. Third, a new effective selection strategy is proposed to address the slow convergence of the QL algorithm, actively directing the search toward promising states that lead to the optimal path without having to explore the entire environment. Finally, comparative results are provided to validate the proposed algorithms for path optimization.
The rest of the paper is organized as follows. Section 2 presents the environment approximation and the mathematical modeling of the problem. Section 3 provides some properties of the classical QL and the proposed Efficient Q-Learning (EQL) algorithms. Section 4 shows simulation results and comparisons between EQL and other works of the literature. Finally, Section 5 gives conclusions and future works.
2. Environment approximation and mathematical modeling of the problem
To approximate obstacles using geometric shapes, rectangles are considered the ideal geometric shape, allowing any obstacle geometry to be easily approximated as indicated in Fig. 1(a). This geometry makes it possible to cover the obstacle while planning in 2D and avoiding 3D obstacles. Fig. 1(b) illustrates a real indoor environment with 3D obstacles of different shapes (chairs, desks, etc.) captured by Microsoft Kinect sensors placed on the ceiling of a room. Figs. 1(c), (d) and (e) describe, in sequence, the different steps for wrapping these obstacles in rectangles. More details on these steps can be found in a previous work [5].
Fig. 1. Obstacles approximation for path planning: (b) real encumbered environment in 3D [5]; (c) binary map of the environment; (d) approximation of the obstacles by rectangles; (e) 2D binary safe map of the environment.
The straight-line segment Seg(P_1, P_2) linking two points P_1(x_1, y_1) and P_2(x_2, y_2) is defined as:

Seg(P_1, P_2):  (y − y_1)/(y_2 − y_1) = (x − x_1)/(x_2 − x_1),  with MIN(x_1, x_2) ≤ x ≤ MAX(x_1, x_2)   (1)
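As a small illustration of equation (1), the hedged Python sketch below represents such a segment, evaluates the supporting line at a given x, and checks the MIN/MAX domain condition; the class and method names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Segment:
    """Straight-line segment between two points, in the sense of equation (1)."""
    p1: Point
    p2: Point

    def y_at(self, x: float) -> float:
        """Evaluate y on the supporting line (assumes the segment is not vertical)."""
        slope = (self.p2.y - self.p1.y) / (self.p2.x - self.p1.x)
        return self.p1.y + slope * (x - self.p1.x)

    def contains_x(self, x: float) -> bool:
        """Domain condition of (1): MIN(x1, x2) <= x <= MAX(x1, x2)."""
        return min(self.p1.x, self.p2.x) <= x <= max(self.p1.x, self.p2.x)

if __name__ == "__main__":
    seg = Segment(Point(0.0, 0.0), Point(4.0, 2.0))
    print(seg.contains_x(2.0), seg.y_at(2.0))  # True 1.0
```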
The following notations describe the indices and parameters used in the mathematical model:
• n: number of nodes of the generated path,
• m: number of obstacles in the environment,
• i (i ∈ {1 … n}): index of the path segment defined by the i-th and (i+1)-th nodes of the path,
• j (j ∈ {1 … m}): index of the j-th obstacle in the navigation environment,
• B_{i,t,j} = 1 if there exists a point P(x_P, y_P) that belongs both to the path segment Path_Seg_i and to the side Obs_Seg_{t,j} of the j-th obstacle, and 0 otherwise, ∀ i ∈ {1 … n}, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m},
• A_{t,r,j} = 1 if there exists a point P(x_P, y_P) that belongs both to the robot segment Rob_Seg_r and to the side Obs_Seg_{t,j} of the j-th obstacle, and 0 otherwise, ∀ t, r ∈ {1 … 4}, ∀ j ∈ {1 … m}.
The objective function (2) minimizes the total length of the path:

min L = Σ_{i=1}^{n−1} √((x_{i+1} − x_i)² + (y_{i+1} − y_i)²)   (2)

Subject to:
(x_i ≠ x_{i+1}) ∨ (y_i ≠ y_{i+1}),  ∀ i ∈ {1 … n − 1}   (3)
Σ B_{i,t,j} = 0,  ∀ i ∈ {1 … n}, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}   (4)
A_{t,r,j} = 0,  ∀ r, t ∈ {1 … 4}, ∀ j ∈ {1 … m}   (5)
B_{i,t,j} ∈ {0, 1},  A_{t,r,j} ∈ {0, 1},  ∀ i ∈ {1 … n}, ∀ j ∈ {1 … m}   (6)
The objective function, given by (2), consists of minimizing the length of the path. The constraint given by (3)
requires that all nodes are different. Constraints in (4) ensure that the path segments do not overlap environment obstacles.
The constraint (5) prevents the robot from crossing obstacles when moving through the path nodes. Finally, the constraints
(6) require all 𝐴 and 𝐵 variables to be binary.
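To make the model concrete, here is a hedged Python sketch (an illustration, not the paper's implementation) that evaluates a candidate path: it computes the length objective of (2) and a collision flag in the spirit of the B variables in (4), with obstacles modelled as axis-aligned rectangles. The Liang-Barsky clipping test below is a standard geometric check chosen for illustration, not necessarily the test used by the authors.

```python
import math

def path_length(nodes):
    """Objective (2): total Euclidean length of the polyline nodes = [(x, y), ...]."""
    return sum(math.dist(a, b) for a, b in zip(nodes, nodes[1:]))

def segment_hits_rect(p, q, rect):
    """Liang-Barsky clipping test: True if segment p-q crosses (or lies inside)
    the axis-aligned rectangle rect = (xmin, ymin, xmax, ymax)."""
    (x1, y1), (x2, y2) = p, q
    xmin, ymin, xmax, ymax = rect
    dx, dy = x2 - x1, y2 - y1
    t0, t1 = 0.0, 1.0
    for pk, qk in ((-dx, x1 - xmin), (dx, xmax - x1),
                   (-dy, y1 - ymin), (dy, ymax - y1)):
        if pk == 0:
            if qk < 0:          # segment parallel to this boundary and outside it
                return False
            continue
        t = qk / pk
        if pk < 0:
            t0 = max(t0, t)
        else:
            t1 = min(t1, t)
        if t0 > t1:
            return False
    return True

def path_is_collision_free(nodes, obstacles):
    """Constraint (4): no path segment may overlap an obstacle rectangle."""
    return not any(segment_hits_rect(a, b, r)
                   for a, b in zip(nodes, nodes[1:]) for r in obstacles)

if __name__ == "__main__":
    obstacles = [(2.0, 2.0, 4.0, 4.0)]
    path = [(0.0, 0.0), (1.0, 5.0), (6.0, 5.0)]
    print(path_length(path), path_is_collision_free(path, obstacles))
```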
3. Classical QL and the proposed EQL algorithms
To achieve this objective, the robot adjusts its action strategy through the reward r provided by environmental feedback, which assesses the quality of its actions. For an action a_t taken at state s_t at time step t, the Q-value is updated using the Bellman equation as in (7):
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r(s_t, a_t) + γ · max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t) ]   (7)

where α is the learning rate, γ is the discount factor, r(s_t, a_t) denotes the immediate reward/penalty acquired by the agent after executing action a_t at state s_t, and max_{a∈A} Q(s_{t+1}, a) is the maximum Q-value among all possible actions in the new state s_{t+1}.
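For reference, a minimal Python rendering of the update rule (7); it assumes the Q-table is stored as a NumPy array indexed by (state, action), which is an illustrative choice rather than the paper's data structure.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Bellman update of equation (7):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

if __name__ == "__main__":
    Q = np.zeros((5, 3))                     # 5 states, 3 actions
    q_update(Q, s=0, a=1, reward=-2.0, s_next=3)
    print(Q[0, 1])                           # -0.2
```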
The classical QL procedure is summarized below:

begin
  Initialize Q(s_t, a_t) ← 0 for all State-Action pairs (m states and n actions, i.e. a Q-table of dimension m × n)
  for (each episode):
    (1) set s_t ← a random state from the states set S;
    while s_t ≠ Goal state
      (2) Choose a_t in s_t by using an adequate policy (ε-greedy, etc.);
      (3) Perform action a_t, and receive reward/penalty and s_{t+1};
      (4) Update Q(s_t, a_t) using equation (7);
      s_t ← s_{t+1};
    end-while
  end-for
end
In an environment with m states and n actions, the dimension of the Q-table is m × n. Choosing the action with the highest Q-value amongst n possible actions requires n − 1 comparisons. Thus, the number of comparisons necessary to update the Q-table over the m states is m(n − 1). As the number of states representing the workspace increases, the time required by the QL algorithm to complete the path optimization increases with the search space, which prevents the algorithm from providing good solutions. Therefore, one of the main problems when applying the QL algorithm to path optimization is the difficulty of visiting all State-Action pairs during the training process due to the huge computation time required [11]; this, in turn, negatively affects convergence to the optimal path.
Exploiting the straight-line segment linking the current robot position to Target, as shown in Fig. 3(c), leads to finding the shortest path. In this regard, this work proposes a new effective definition of the states and actions. To allow the robot to safely avoid an obstacle, four points called EscapingPoints are defined for each obstacle in the workspace, as seen in Figs. 3(a) and (b). Each EscapingPoint is the node at a corner of the rectangle surrounding the obstacle. The EscapingPoint coordinates are computed as a function of the four points forming the rectangle while respecting the safety distance SafetyDist, which makes it possible to define the coordinates (X, Y) of the i-th EscapingPoint and to ensure that the robot does not collide with the j-th obstacle.
In this work, the states are defined as the EscapingPoints locations in the environment, plus Target (s_T). The actions space is discretized into 4 × m + 1 movements; this set consists of all moves to the EscapingPoints assigned to each obstacle, plus the MoveToTarget action.
Considering an environment with m obstacles (Obs_j, j ∈ {1 … m}), the actions space A and the states space S are defined as follows:

A = {MoveToEscapingPoint_lj, …, MoveToTarget}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
S = {s_lj(x_lj, y_lj), …, s_T(x_T, y_T)}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
In this manner, the search space is reduced while preserving the states that lead to the optimal path. Additionally, this discretization makes it possible to assess and monitor the reward/penalty without the need for a robot simulation environment.
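The sketch below (an illustration under assumptions, not the authors' code) builds such a state set from the obstacle rectangles: each obstacle contributes four EscapingPoints, taken here as the rectangle corners pushed outwards by SafetyDist, plus the Target state, giving the 4 × m + 1 entries mentioned above; the corner numbering is assumed.

```python
def escaping_points(rect, safety_dist):
    """Four EscapingPoints of one obstacle: corners of the rectangle
    rect = (xmin, ymin, xmax, ymax) pushed outwards by safety_dist.
    (The exact corner numbering used in the paper is assumed here.)"""
    xmin, ymin, xmax, ymax = rect
    d = safety_dist
    return [(xmin - d, ymin - d),   # EscapingPoint01
            (xmax + d, ymin - d),   # EscapingPoint02
            (xmax + d, ymax + d),   # EscapingPoint03
            (xmin - d, ymax + d)]   # EscapingPoint04

def build_state_space(obstacles, target, safety_dist=10.0):
    """States = all EscapingPoints of the m obstacles + Target, i.e. 4*m + 1 entries."""
    states = [p for rect in obstacles for p in escaping_points(rect, safety_dist)]
    states.append(target)
    return states

if __name__ == "__main__":
    obstacles = [(100, 100, 200, 150), (300, 70, 350, 170)]
    print(len(build_state_space(obstacles, target=(500, 400))))  # 9 = 4*2 + 1
```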
Algorithm 2 assigns the rewards used to fill the Q-table: when the evaluated move collides with an obstacle, detected with CollisionCheck(s_kr(x_kr, y_kr), s_lj(x_lj, y_lj)), where s_kr(x_kr, y_kr) represents the current state, a penalty is assigned; otherwise the reward is set to Reward = −‖s_s − s_T‖,
where:
• ‖s_s − s_T‖ denotes the Euclidean distance calculated between the two robot locations defined by the two states.
The reward function is also used to judge the quality of an action in the selection strategy. The reward assignment process always favors the EscapingPoints around this part of the line segment, which helps reinforce the selection of the best action.
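Based on the description above and on the surviving fragment of Algorithm 2, a hedged sketch of the reward assignment: a move that collides with an obstacle receives a large penalty, otherwise the reward is the negative Euclidean distance to Target, so states closer to Target are favored. The penalty value and the collision-test interface are assumptions made for illustration.

```python
import math

COLLISION_PENALTY = -1_000.0   # assumed value; the paper does not report the exact penalty

def assign_reward(current, candidate, target, collides):
    """Reward used both to initialize the Q-table and to score actions in the
    selection strategy: penalty on collision, otherwise -||candidate - target||."""
    if collides(current, candidate):
        return COLLISION_PENALTY
    return -math.dist(candidate, target)

if __name__ == "__main__":
    free = lambda p, q: False            # toy collision checker: never collides
    print(assign_reward((0, 0), (3, 4), (0, 0), free))   # -5.0
```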
In the QL algorithm, successful learning is highly dependent on the design of the strategy that selects the optimal action at each learning step. In addition, the choice of an action must ensure a balance between exploration and exploitation of the learning already acquired. In fact, pure exploitation quickly leads to locally optimized policies, whereas excessive exploration reduces the performance of the learning process even if it can avoid locally optimal policies [15]. In this regard, the proposed EQL uses a new strategy to ensure a good balance between exploration and exploitation. For this purpose, the reward assignment algorithm is used to effectively initialize the Q-table. This gives the robot prior experience of the workspace and broadens exploration. Therefore, EQL handles exploration using the reward assignment algorithm during the initialization of the Q-table and then in the selection strategy. On the other hand, a new selection strategy is proposed on the basis of the reward assignment: it exploits the effective states around the line segment connecting the current state of the robot (s_t) to the goal state (s_T), and then explores the specific states imposed by the reward assignment algorithm. The purpose is to teach the robot, at each learning step, to move toward the best EscapingPoint around the line segment that connects s_t with s_T, leading to the optimal path.
Algorithm 3 describes the selection strategy for choosing the best action (i.e., the move to the best EscapingPoint) at each learning step. Assume that s_t(x_t, y_t), s_{t+1}(x_{t+1}, y_{t+1}) and s_T(x_T, y_T) denote the current state, the next state and the Target state, respectively. This algorithm contains three main steps. In the first step, the linear segment Seg(s_t, s_T) is calculated using equation (1). Then, the set of obstacles Obs[] = Seg(s_t, s_T) ∩ Obs_Seg_tj, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}, is defined. If Seg(s_t, s_T) ∩ Obs_Seg_tj = ∅, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}, the algorithm immediately returns the best action MoveToTarget; otherwise, it proceeds with the other steps. In the second step, two vectors are built for the obstacles contained in the Obs[] set: Allowable_Moves[], which stores the authorized actions associated with the EscapingPoints of all obstacles in this set; and Assessment[], which stores the values calculated to assess the quality of each action in Allowable_Moves[]. It should be noted that at each state s_t, the components of the Allowable_Moves[] vector represent the search space at this learning step, which is therefore minimized. Next, the best EscapingPoint, which can lead to choosing the best action, is determined by selecting the obstacle Nearest_Obs closest to the current state s_t from Obs[]. Then, the points of intersection with Nearest_Obs are calculated as {P_1, P_2} = Seg(s_t, s_T) ∩ Obs_Seg_ti, ∀ t ∈ {1 … 4}, ∀ i ∈ {0 … length(Obs[])}. The best EscapingPoint is selected according to the locations of the intersection points P_1 and P_2 on the segments of Nearest_Obs, as described in Algorithm 3 and Fig. 4. In the last step, an assessment value is assigned to each action in the vector Allowable_Moves[] and stored in the vector Assessment[] using Algorithm 2. This value represents the importance of taking a specific action in the current state s_t.
Algorithm 3: Selection strategy
Output:
- Best action
begin
Initialization:
 - Calculate the mathematical equation of the line segment Seg(s_t, s_T) using equation (1)
 - Initialize the vector Allowable_Moves[]
 - Initialize the vector Assessment[]
Obs[] = Seg(s_t, s_T) ∩ Obs_Seg_tj, ∀ t ∈ {1 … 4}, ∀ j ∈ {1 … m}
if (Obs[] = ∅)   // no obstacle collides with the line segment linking the current state s_t with s_T
   return Best action = MoveToTarget
else
   Define the subset of actions A related to the EscapingPoints associated with the set of obstacles in Obs[]
   Allowable_Moves[] = A
   Nearest_Obs = Arg(MinDistance(obs_i, s_t)), ∀ i ∈ {1 … length(Obs[])}
   {P_1, P_2} = Seg(s_t, s_T) ∩ Nearest_Obs
   if {P_1, P_2} ∈ {Seg_1, Seg_2}:
      best EscapingPoint = EscapingPoint02
   else if {P_1, P_2} ∈ {Seg_2, Seg_3}:
      best EscapingPoint = EscapingPoint03
   else if {P_1, P_2} ∈ {Seg_3, Seg_4}:
      best EscapingPoint = EscapingPoint04
   else if {P_1, P_2} ∈ {Seg_1, Seg_4}:
      best EscapingPoint = EscapingPoint01
   else if {P_1, P_2} ∈ {Seg_1, Seg_3}:
      Select the best EscapingPoint according to the {P_1, P_2} coordinates as described in Fig. 4(a)
   else if {P_1, P_2} ∈ {Seg_2, Seg_4}:
      Select the best EscapingPoint according to the {P_1, P_2} coordinates as described in Fig. 4(b)
   Assessment[] = Evaluate the rewards of each action in Allowable_Moves[] using Algorithm 2
   index = Arg(max Assessment[])
   return Best action = Allowable_Moves[index]
end
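For readers who prefer running code to pseudocode, here is a condensed Python sketch of the selection strategy above. It assumes the helpers segment_hits_rect(), escaping_points() and assign_reward() from the earlier sketches are available in the same module, and it replaces the side-based choice of a single best EscapingPoint (Fig. 4) by simply scoring all EscapingPoints of the blocking obstacles with the reward function; it is therefore an approximation of Algorithm 3, not a faithful reimplementation.

```python
# Assumes segment_hits_rect(), escaping_points() and assign_reward() from the
# earlier sketches are defined in the same module.

def select_action(s_t, s_T, obstacles, safety_dist):
    """Condensed selection strategy: if the straight segment s_t -> s_T is
    obstacle-free, return MoveToTarget; otherwise score the EscapingPoints of
    the blocking obstacles with the reward function and return the best move."""
    blocking = [rect for rect in obstacles if segment_hits_rect(s_t, s_T, rect)]
    if not blocking:
        return ("MoveToTarget", s_T)

    allowable_moves = []                                   # Allowable_Moves[]
    for rect in blocking:
        for k, p in enumerate(escaping_points(rect, safety_dist), start=1):
            allowable_moves.append((f"MoveToEscapingPoint{k:02d}", p))

    def collides(p, q):
        return any(segment_hits_rect(p, q, rect) for rect in obstacles)

    assessment = [assign_reward(s_t, point, s_T, collides)  # Assessment[]
                  for _, point in allowable_moves]
    best_index = assessment.index(max(assessment))          # Arg(max Assessment[])
    return allowable_moves[best_index]
```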
The overall EQL procedure combines the reward-based initialization of the Q-table with the proposed selection strategy. EQL constantly learns the policy by maintaining a numeric value for each State-Action pair in the Q-table. This numerical value is updated as the robot explores its environment. Note that initializing the Q-table using Algorithm 2 (without considering the best EscapingPoint) provides the robot with prior information about its navigation workspace. The robot starts from the starting position Source (learning takes place only from s_S to s_T) and chooses the next state using Algorithm 3 (which refers to Algorithm 2). Since all the possible next states have different Q-values, choosing the next state using Algorithm 3 is not done randomly as in classical QL. After moving to the next state, the Q-value of the State-Action pair is updated using equation (7). The learning process continues until Target is reached.
Once the optimization process is completed and the resulting optimal policy is stored in the Q-table, the path is planned. During this process, the robot starts from its initial state s_S; then, the next best state s_{t+1} is determined as the one whose Q-value is greater than the Q-values of all other states adjacent to s_t.
The overall EQL procedure is summarized below:

1. Learning process:
Initialization:
 - Define the states space S and the actions space A:
   A = {MoveToEscapingPoint_lj, …, MoveToTarget}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
   S = {s_lj(x_lj, y_lj), …, s_T(x_T, y_T)}, ∀ l ∈ {1 … 4}, ∀ j ∈ {1 … m}
 - Initialize Q-Table[][] using Algorithm 2 without considering the best EscapingPoint
for each episode
   set s_t = s_S
   repeat
      Choose an action a_t in s_t by using the policy developed in Algorithm 3
      Perform action a_t, then receive r(s_t, a_t) as described in Algorithm 2 and s_{t+1}
      Find the largest Q-value for all actions in s_{t+1}: max_{a∈A} Q(s_{t+1}, a)
      Update Q-Table[][] using equation (7)
      s_t = s_{t+1}
   until s_t = s_T
end-for
2. Planning process:
Input:
 - Optimal Q-Table
Output:
 - Optimal sequence of states
 - MAP
Set s_t = s_S
repeat
   Get the best next state s_{t+1}, where the Q-value of s_{t+1} is greater than the Q-values of the other states adjacent to s_t: s_{t+1} = Arg(max_{a∈A} Q(s_t, a))
   Go to s_{t+1}
   s_t = s_{t+1}
until s_t = s_T
end
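A small Python sketch of the planning step that follows training: starting from the source state, repeatedly move to the adjacent state with the largest Q-value until Target is reached. The Q-table layout (a dict keyed by state with per-successor values) and the neighbors function are illustrative assumptions, not the paper's data structures.

```python
def extract_path(q_table, s_source, s_target, neighbors, max_steps=1_000):
    """Greedy path extraction: from each state choose the successor with the
    largest Q-value, as in the planning process described above.
    q_table[s] is a dict mapping successor states to Q-values."""
    path = [s_source]
    s = s_source
    for _ in range(max_steps):                     # guard against a badly trained table
        if s == s_target:
            return path
        candidates = {n: q_table[s].get(n, float("-inf")) for n in neighbors(s)}
        s = max(candidates, key=candidates.get)    # next best state
        path.append(s)
    raise RuntimeError("Target not reached; the Q-table may not have converged")

if __name__ == "__main__":
    # Toy 3-state example: A -> B -> T
    q = {"A": {"B": 1.0, "T": -5.0}, "B": {"T": 2.0}}
    nbrs = {"A": ["B", "T"], "B": ["T"], "T": []}
    print(extract_path(q, "A", "T", neighbors=lambda s: nbrs[s]))  # ['A', 'B', 'T']
```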
4. Computational results and performance analysis
This section presents simulation results and evaluates the effectiveness of the proposed EQL approach (implemented
using C#.Net). All simulations are performed on a PC with Intel Core i3 3.3GHz, 8GB RAM and 64-bit Windows operating
system. First, EQL feasibility is tested via five designed complex maps (MAP1 to MAP5) with different arrangements of
static obstacles. Second, the effectiveness and performances of EQL are compared with existing algorithms through
benchmarks in the literature. To this end, two types of comparisons are conducted:
• The first type of comparison is performed against other QL-based algorithms to provide better support for the proposed EQL. This comparison clearly shows the improvement brought by the proposed EQL algorithm, demonstrates its efficiency, and shows how well it performs when compared to the latest related QL-based algorithms.
• The second type is performed against well-known meta-heuristic optimization algorithms from the literature. This comparison further demonstrates the efficiency of the proposed EQL in providing better results even in complex maps.
Path length and computation time metrics are used in both comparisons. It is important to note that, for all the tests considered, SafetyDist belongs to the [5 (unit), 15 (unit)] range; this shows the effect of this parameter on the robot safety while it is moving. Moreover, the reported computation time of EQL includes all the required calculations (computation of the lines/segments at each step, reward values, etc.), except the calculation of the rectangles around the obstacles, which is negligible.
Regarding the comparison with meta-heuristic optimization algorithms, GA and memEAPF are used to demonstrate the robust performance of EQL on different maps, denoted M1 to M12 [10] and Env01 to Env04 [8], as shown in Figs. 7 and 8, respectively.
Tables 3, 4, 5 and 6 (Appendix) provide details on the verification environments, together with Source and Target, adopted from [15], [10] and [8], respectively. The obstacles in environments M1 to M12 are circular; they are enveloped in rectangles as shown in Fig. 7. Unfortunately, the environments discussed in [8] are only provided as pictures, without any information about the obstacles. To exploit them, a program has been developed to extract the coordinates and characteristics of each obstacle, as shown in Table 6. Environments M1 to M12 are compared with the memEAPF approach [10]; Env01 to Env04 are compared with the GA approach [8].
Fig. 5 shows the paths obtained by the proposed EQL on five maps cluttered with obstacles in different arrangements (MAP1 to MAP5). The generated paths are clearly highly optimized and do not collide with the obstacles. Moreover, the computation times for the five cases are 0.20 s, 0.20 s, 0.28 s, 0.37 s and 0.20 s, respectively. Thus, with the proposed EQL, the robot can learn fast and obtain optimal solutions in a very short time. Additionally, when an obstacle on a map is rotated (e.g., the blue obstacle in MAP2), the final path is modified without losing optimality (compared to MAP1).
Fig. 5. Obtained paths by the proposed EQL for MAP1 to MAP5 (S and T represent Source and Target, respectively); panels (a) to (e) correspond to MAP1 to MAP5.
Table 1 compares the results of the proposed EQL with those obtained by the IQ-FPA, classical QL and IDQ algorithms. The table also reports the results obtained over 30 independent runs (non-deterministic algorithm) for the IQ-FPA algorithm, tested on a PC with an AMD A10 quad-core 2.5 GHz CPU. Note that the computation time of the IQ-FPA algorithm is the average value over the 30 independent runs.
Fig. 6. Obtained paths in the benchmark environments: (a) environment with 8 obstacles; (b) environment with 9 obstacles (the remaining panels show the environments with 10 to 15 obstacles).
Looking at the computation times in Table 1, the processing time of IQ-FPA and IDQ increases with the number of obstacles, whereas classical QL behaves in the opposite way. Moreover, a common characteristic of these three algorithms is their high computation time in all tested environments. It is satisfying to note that the proposed EQL converges in a very short time for all maps. Despite the high hardware performance used in [15], the EQL
computation time is significantly better for all the considered maps, with a strong improvement ranging between 88.85% and 95.17%. This is mainly due to the reduction of the search space at every stage of the learning process, since only the effective states around the line segment linking the current state of the robot to the target state are exploited. For instance, in the navigation map with 8 obstacles, only two obstacles intersect the straight-line segment between s_S and s_T. The same applies to the other maps, where only a few obstacles are taken into account in the learning process. Thus, with the proposed EQL, the robot can learn to move towards its imposed Target along the optimal path in a very short computation time.
Table 1: Comparison of the performances of the proposed EQL approach with the QL, IDQ and IQ-FPA algorithms.

                    Path length (Unit)                              Computation time (s)
Environment         QL      IDQ     IQ-FPA  EQL     Improv. (%)     QL     IDQ     IQ-FPA  EQL    Improv. (%)
T1 (8 obstacles)    28.93   26.27   29.33   19.85   24.43           4.06   9.26    3.52    0.17   95.17
T2 (9 obstacles)    30.67   26.37   31.27   19.69   26.25           4.04   10.05   3.62    0.18   95.02
T3 (10 obstacles)   30.00   26.00   29.60   20.20   22.30           3.89   10.28   3.88    0.23   94.07
T4 (11 obstacles)   27.67   26.27   28.73   20.20   33.10           3.64   27.10   3.98    0.23   94.22
T5 (12 obstacles)   27.67   26.53   29.00   20.20   23.85           3.71   23.39   4.00    0.34   91.5
T6 (15 obstacles)   30.07   26.00   26.80   18.93   27.19           3.23   2.87    3.66    0.32   88.85
More satisfactory results in terms of path length can also be observed in Table 1, where the proposed EQL reduces the path length with a significant improvement (a decrease of at least 22.30%). In fact, the paths created by EQL are shorter for all maps; this is due to the efficiency of the proposed selection strategy, which determines the action leading to the next best state on the optimal path. Moreover, the IQ-FPA, classical QL and IDQ algorithms produce zigzag paths, which is not the case for EQL.
With regard to the maps in Figs. 6(b), (c), (d) and (e), the same paths are obtained by the proposed EQL: even adding an obstacle away from the straight-line segment does not affect the optimization process. For the other algorithms, in contrast, additional obstacles far from the previously obtained paths negatively affect both the new paths and the computation time. The 15-obstacle navigation map in Fig. 6(f) clearly shows the effectiveness of EQL, where the proposed selection strategy is able to identify the states that lead to the optimal path in only two learning steps. On the other hand, even though wrapping the obstacles in rectangles reduces the narrow passages between obstacles, the proposed EQL is able to find optimal paths for all environments. This can be justified by the good performance of the learning process due to the high efficiency of the selection strategy proposed to select the best action. To summarize, the results of this comparison show that the proposed EQL approach is effective, that its performance is on par with or better than state-of-the-art QL-based algorithms, and that it is more efficient than the other algorithms in the literature.
Table 2: Performances of EQL on the environments M1 to M12 compared to memEAPF [10] with m = 16 (m is the best proportional gains). Bold numbers refer to best solutions.

               Path length (Unit)                        Computation time (s)
Environment    memEAPF   EQL    Improv. (%)              memEAPF   EQL     Improv. (%)
M1             5.46      5.85   -7.14                    3.50      0.046   98.68
M2             8.55      8.17   4.44                     3.38      0.046   98.63
M3             8.93      8.78   1.67                     3.27      0.016   99.51
M4             9.31      6.24   32.97                    3.13      0.031   99.00
M5             6.37      6.50   -2.04                    3.16      0.156   95.06
Fig. 7 shows the paths generated by the EQL approach, which is executed only once in each of the environments M1 to M12. It is important to note that the non-deterministic nature of the memEAPF algorithm requires many runs in the same environment to obtain meaningful results. Consequently, the results presented for this latter algorithm are the best values obtained over 30 independent runs.
Fig. 7. Obtained paths by the proposed EQL on the environments M1 to M12.
Fig. 8(a) shows the results generated by the EQL approach, which is executed only once for each workspace, whereas Fig. 8(b) illustrates the comparison with GA [8]. Since GA is a non-deterministic algorithm, its results represent the best value over 100 executions.
Fig. 8. Simulation comparison of EQL with GA for environment maps Env01 to Env04: (a) obtained paths by the proposed EQL approach; (b) obtained paths for GA as given in [8].
Figs. 9(a) and (b) show comparisons of the computation time obtained using the proposed EQL algorithm with those reported in [10] and [8], respectively. As shown in Table 2, the proposed EQL algorithm greatly improves the distance traveled in 7 of the tests compared to memEAPF, with a 6.93% improvement, whereas the best improvement in the 5 tests where memEAPF performed better was 5.43%. In addition, more interesting results can be extracted from the table, such as: (i) the significant improvement in computation time, by approximately 96%, using EQL compared to memEAPF; (ii) the ability of the EQL approach to generate satisfactory paths even in complex workspaces, as confirmed and illustrated in Fig. 7; (iii) the effectiveness of the proposed method for finding high-quality paths in terms of length and safety; and finally, (iv) in terms of computation time, EQL performs better than this powerful algorithm (i.e., memEAPF).
More interesting results are observed in Fig. 8, such as: (i) the EQL approach considerably improves the quality of the solutions in terms of path length, computation time and robot safety for environments Env02 to Env04; (ii) all paths generated by both the GA and EQL approaches are of high quality; (iii) GA clearly gives the best performance for Env01, whereas for the other environments (Env02, Env03 and Env04) good-quality paths are provided by both approaches. Although the GA algorithm [8] is effective for global path planning, the safety of its paths is clearly poor. In fact, obstacle vertices are nodes of the planned paths, which may cause collisions with the robot (i.e., less safety). Generally, the safety metric is handled in meta-heuristic algorithms by adding a penalty to the length of the path, which makes it difficult to guarantee robot safety and also affects the optimization speed. In contrast, the proposed EQL simply enforces robot safety by estimating the SafetyDist value according to the robot dimensions.
Fig. 9. Comparison of computation time between the proposed EQL, memEAPF and GA: (a) computation time on M01 to M12 (EQL vs. memEAPF); (b) computation time on Env01 to Env04 (EQL vs. GA).
According to Figs. 9(a) and (b), the execution time of the proposed EQL approach is considerably better than that of the other meta-heuristic algorithms for all the maps considered. In fact, meta-heuristic algorithms spend more time generating high-quality solutions; in addition, when the number of obstacles increases, their execution-time variability increases as well. In contrast, the EQL approach always generates good solutions in a very short time.
To summarize the obtained simulation results: (i) the proposed EQL is able to generate optimal/near-optimal (short) paths quickly and safely; (ii) the EQL approach is deterministic and requires a few milliseconds to compute a good solution in terms of length and safety (controlled using the SafetyDist parameter); (iii) meta-heuristic-based algorithms are non-deterministic and lack a good trade-off between convergence speed and path length; (iv) the performance of the proposed EQL is significantly improved compared to that of the latest related work; and finally, (v) the proposed EQL enhances path quality in terms of length, computation time and robot safety.
Acknowledgements
The authors would like to thank the reviewers for their insightful recommendations and comments on the paper, as they led to a considerable improvement in its quality. The authors would also like to express their deepest appreciation to Pr.
References
[1] Maoudj, A., Bouzouia, B., Hentout, A., Kouider, A., & Toumi, R. (2016, October). Distributed multi-agent approach
based on priority rules and genetic algorithm for tasks scheduling in multi-robot cells. In IECON 2016-42nd Annual
Conference of the IEEE Industrial Electronics Society (pp. 692-697). IEEE.
[2] Ayawli, B. B. K., Mei, X., Shen, M., Appiah, A. Y., & Kyeremeh, F. (2019). Optimized RRT-A* Path Planning
Method for Mobile Robots in Partially Known Environment. Information Technology and Control, 48(2), 179-194.
[3] Hentout, A., Maoudj, A., & Bouzouia, B. (2016, November). A survey of development frameworks for robotics. In
2016 8th International Conference on Modelling, Identification and Control (ICMIC) (pp. 67-72). IEEE.
[4] Ajeil, F. H., Ibraheem, I. K., Sahib, M. A., & Humaidi, A. J. (2020). Multi-objective path planning of an autonomous
mobile robot using hybrid PSO-MFB optimization algorithm. Applied Soft Computing, 89, 106076.
[5] Bakdi, A., Hentout, A., Boutamai, H., Maoudj, A., Hachour, O., & Bouzouia, B. (2017). Optimal path planning and
execution for mobile robots using genetic algorithm and adaptive fuzzy-logic control. Robotics and Autonomous
Systems, 89, 95-109.
[6] Hentout, A., Maoudj, A., Guir, D., Saighi, S., Aures Harkat, M., Hammouche, M. Z., & Bakdi, A. (2019). Collision-
free path planning for indoor mobile robots based on Rapidly-exploring Random Trees and Piecewise Cubic Hermite
Interpolating Polynomial. International Journal of Imaging and Robotics, 19(3), 74-94.
[7] Daniel, K., Nash, A., Koenig, S., & Felner, A. (2010). Theta*: Any-angle path planning on grids. Journal of Artificial
Intelligence Research, 39, 533-579.
[8] Karami, A. H., & Hasanzadeh, M. (2015). An adaptive genetic algorithm for robot motion planning in 2D complex
environments. Computers & Electrical Engineering, 43, 317-329.
[9] Qin, Y. Q., Sun, D. B., Li, N., & Cen, Y. G. (2004, August). Path planning for mobile robot using the particle swarm
optimization with mutation operator. In Proceedings of 2004 international conference on machine learning and
cybernetics (IEEE Cat. No. 04EX826) (Vol. 4, pp. 2473-2478). IEEE.
[10] Orozco-Rosas, U., Montiel, O., & Sepúlveda, R. (2019). Mobile robot path planning using membrane evolutionary
artificial potential field. Applied Soft Computing, 77, 236-251.
[11] Dorigo, M., & Stützle, T. (2019). Ant colony optimization: overview and recent advances. In Handbook of
metaheuristics (pp. 311-351). Springer, Cham.
[12] Brand, M., Masuda, M., Wehner, N., & Yu, X. H. (2010, June). Ant colony optimization algorithm for robot path
planning. In 2010 International Conference On Computer Design and Applications (Vol. 3, pp. V3-436). IEEE.
[13] Cheng, Y. H., Chao, P. J., & Kuo, C. N. (2019). Mobile Robot Path Planning using a Teaching-Learning-Interactive
Learning-Based Optimization. IAENG International Journal of Computer Science, 46(2).
[14] Liu, J., Wang, Q., He, C., Jaffrès-Runser, K., Xu, Y., Li, Z., & Xu, Y. (2019). QMR: Q-learning based Multi-
objective optimization Routing protocol for Flying Ad Hoc Networks. Computer Communications.
[15] Low, E. S., Ong, P., & Cheah, K. C. (2019). Solving the optimal path planning of a mobile robot using improved Q-
learning. Robotics and Autonomous Systems, 115, 143-161.
[16] Luviano, D., & Yu, W. (2017). Continuous-time path planning for multi-agents with fuzzy reinforcement learning.
Journal of Intelligent & Fuzzy Systems, 33(1), 491-501.
[17] Qu, C., Gai, W., Zhong, M., & Zhang, J. (2020). A novel reinforcement learning based grey wolf optimizer algorithm
for unmanned aerial vehicles (UAVs) path planning. Applied Soft Computing, 89, 106099.
[18] Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3-4), 279-292.
[19] Jaradat, M. A. K., Al-Rousan, M., & Quadan, L. (2011). Reinforcement based mobile robot navigation in dynamic
environment. Robotics and Computer-Integrated Manufacturing, 27(1), 135-149.
[20] Ganapathy, V., Yun, S. C., & Joe, H. K. (2009, July). Neural Q-learning controller for mobile robot. In 2009
IEEE/ASME International Conference on Advanced Intelligent Mechatronics (pp. 863-868).
[21] Oh, C. H., Nakashima, T., & Ishibuchi, H. (1998, May). Initialization of Q-values by fuzzy rules for accelerating Q-
learning. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on
Computational Intelligence (Cat. No. 98CH36227) (Vol. 3, pp. 2051-2056). IEEE.
[22] Jiang, J., & Xin, J. (2019). Path planning of a mobile robot in a free-space environment using Q-learning. Progress
in Artificial Intelligence, 8(1), 133-142.
[23] Wang, Y. H., Li, T. H. S., & Lin, C. J. (2013). Backward Q-learning: The combination of Sarsa algorithm and Q-
learning. Engineering Applications of Artificial Intelligence, 26(9), 2184-2193.
[24] Das, P. K., Mandhata, S. C., Behera, H. S., & Patro, S. N. (2012). An improved Q-learning algorithm for path-
planning of a mobile robot. International Journal of Computer Applications, 51(9).
[25] Goswami, I., Das, P. K., Konar, A., & Janarthanan, R. (2010, December). Extended Q-learning algorithm for path-
planning of a mobile robot. In Asia-Pacific Conference on Simulated Evolution and Learning (pp. 379-383).
Springer, Berlin, Heidelberg.
[26] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788.
[27] Redmon, J., & Farhadi, A. (2017). YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7263-7271).
Appendix
Table 3: Proposed test environments MAP1 to MAP5. (x_j, y_j) are the coordinates of the lower-left corner of the j-th rectangle; Width and Height are the width and height of the j-th rectangle, respectively. Obstacles are given as Obs_j(x_j, y_j, Width, Height).

MAP1: S(10, 50), T(500, 400); obstacles: (0, 240, 60, 60), (0, 380, 80, 60), (60, 500, 80, 80), (120, 220, 140, 80), (120, 400, 80, 40), (160, 320, 20, 20), (180, 80, 80, 60), (300, 40, 60, 180), (360, 260, 60, 180), (420, 40, 180, 80), (480, 200, 120, 120), (380, 480, 120, 100).

MAP2: S(170, 50), T(400, 430); obstacles: (380, 95, 100, 40), (300, 70, 50, 100), (150, 100, 100, 50), (120, 50, 40, 100), (405, 120, 50, 100), (150, 180, 50, 100), (230, 200, 100, 50), (380, 250, 100, 50), (435, 320, 40, 100) (rotated by 45°), (360, 380, 100, 40), (290, 370, 45, 90), (100, 370, 140, 50), (100, 300, 40, 100), (260, 280, 50, 50).

MAP3: S(180, 430), T(435, 50); obstacles: (300, 70, 50, 100), (150, 100, 100, 50), (120, 50, 40, 100), (405, 120, 50, 100), (150, 180, 50, 100), (230, 200, 100, 50), (380, 250, 100, 50), (435, 320, 40, 100), (360, 380, 100, 40), (290, 390, 45, 90), (100, 370, 140, 50), (100, 300, 40, 100), (260, 280, 50, 50).

MAP4: S(170, 50), T(400, 430); obstacles: (380, 95, 100, 40), (300, 70, 50, 100), (150, 100, 100, 50), (120, 50, 40, 100), (405, 120, 50, 100), (150, 180, 50, 100), (230, 200, 100, 50), (380, 250, 100, 50), (435, 320, 40, 100), (360, 380, 100, 40), (290, 370, 45, 90), (100, 370, 140, 50), (100, 300, 40, 100), (260, 280, 50, 50).

MAP5: S(50, 100), T(550, 620); obstacles: (0, 0, 150, 50), (400, 0, 300, 100), (150, 100, 150, 250), (450, 200, 250, 160), (0, 400, 250, 100), (300, 450, 250, 100), (105, 600, 100, 200), (300, 650, 250, 100), (650, 500, 100, 250), (350, 600, 100, 100), (350, 400, 100, 100), (0, 200, 100, 100).
Table 4: Test environments T1 to T6 [15]. Note that Obstacle7 is formed by three rectangles: (10, 8, 2, 2), (9, 4, 2, 2) and (10, 6, 1, 2). For all environments, S(2, 16) and T(19, 11); obstacles are given as Obs_j(x_j, y_j, Width, Height).

T1 (8 obstacles): (2, 9, 2, 3), (6, 12, 4, 6), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4).

T2 (9 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4).

T3 (10 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4), (6, 8, 2, 2).

T4 (11 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (6, 8, 2, 2), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4), (18, 6, 2, 2).

T5 (12 obstacles): (3, 14, 2, 4), (2, 9, 2, 3), (6, 12, 4, 6), (6, 8, 2, 2), (3, 5, 2, 2), (3, 2, 4, 1), (10, 8, 2, 2), (9, 4, 2, 2), (10, 6, 1, 2), (15, 10, 3, 1), (13, 2, 5, 3), (16, 14, 4, 4), (18, 6, 2, 2), (12, 14, 2, 6).

T6 (15 obstacles): (6, 18, 1, 1), (5, 15, 1, 1), (6, 12, 1, 1), (6, 9, 1, 1), (6, 6, 1, 1), (10, 16, 1, 1), (10, 13, 1, 1), (10, 10, 1, 1), (10, 7, 1, 1), (10, 4, 1, 1), (14, 14, 1, 1), (14, 11, 1, 1), (14, 8, 1, 1), (14, 7, 1, 1), (14, 2, 1, 1).
Table 5 (excerpt): environment M12: S(1.5, 7.5), T(8.5, 3.0); obstacles: (5.0, 5.0, 0.3), (3.5, 3.5, 0.5), (3.5, 6.5, 0.5), (6.5, 3.5, 0.5), (6.5, 6.5, 0.5).
Highlights:
• In this work, the mobile robot path optimization problem is handled and modeled.
• To this end, an Efficient Q-Learning (EQL) algorithm is proposed that assures a collision-free optimal path in minimum computation time.
• In the proposed EQL:
   o A new definition of the states space and actions space is introduced.
   o A new reward function is proposed to initialize the Q-table and give the robot a priori knowledge about its environment.
   o The learning process is sped up by a new efficient selection strategy that reduces the search space and assures fast convergence to an optimized solution.
• Computational results on benchmarks from the state-of-the-art and comparisons with well-known planners of the literature prove that:
   o EQL has a good learning efficiency.
   o Training performance is significantly improved compared with that of the state-of-the-art.
   o EQL outperforms the other optimization algorithms while improving the quality of the paths in terms of length, computation time and robot safety.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests: