
Journal of Manufacturing Systems 72 (2024) 492–503


Technical paper

Automated guided vehicle dispatching and routing integration via digital twin with deep reinforcement learning

Lixiang Zhang a, Chen Yang b, Yan Yan a, Ze Cai a, Yaoguang Hu a,*

a Lab of Industrial and Intelligent System Engineering, Beijing Institute of Technology, Beijing 100081, China
b School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Keywords: Dispatching; Routing; Digital twin; Reinforcement learning; Automated guided vehicle

Abstract

The manufacturing industry has witnessed a significant shift towards high flexibility and adaptability, driven by personalized demands. However, automated guided vehicle (AGV) dispatching optimization is still challenging when considering AGV routing with the spatial-temporal and kinematics constraints in intelligent production logistics systems, limiting the evolving industry applications. Against this backdrop, this paper presents a digital twin (DT)-enhanced deep reinforcement learning-based optimization framework to integrate AGV dispatching and routing at both horizontal and vertical levels. First, the proposed framework leverages a digital twin model of the shop floor to provide a simulation environment that closely mimics the actual manufacturing process, enabling the AGV dispatching agent to be trained in a realistic setting, thus reducing the risk of finding unrealistic solutions under specific shop-floor settings and preventing time-consuming trial-and-error processes. Then, the AGV dispatching with the routing problem is modeled as a Markov Decision Process to optimize tardiness and energy consumption. An improved dueling double deep Q network algorithm with count-based exploration is developed to learn a better dispatching policy by interacting with the high-fidelity DT model that integrates a static path planning agent using A* and a dynamic collision avoidance agent using a deep deterministic policy gradient to prevent congestion and deadlock. Experimental results show that our method outperforms four state-of-the-art methods with shorter tardiness, lower energy consumption, and better stability. The proposed method provides significant potential to utilize the digital twin and reinforcement learning in the decision-making and optimization of manufacturing processes.

1. Introduction

The current market is experiencing a significant rise in personalized products, which has posed considerable challenges in terms of quality, cost, and delivery for manufacturing enterprises. In response, manufacturing practitioners are exploring flexible and intelligent strategies and devices to address these challenges [1–3]. One such strategy is the application of automated guided vehicles (AGVs) that can transport materials in intelligent production logistics systems (IPLS) [4]. This approach has been widely utilized to reduce logistics costs and improve production flexibility. In this condition, researchers have extensively studied AGV dispatching and routing problems to improve delivery performance in the past decade [5]. However, there remain two critical limitations between theoretical research and industrial application. One limitation is that researchers have typically simplified the kinematic constraints of AGVs in the dispatching and routing model, assuming, for example, that the velocity of AGVs is fixed and constant [6]. This approach produces dispatching solutions that usually fail to satisfy the spatial-temporal constraints and kinematics of AGVs, resulting in gaps between theoretical dispatching and practical application. As a result, local congestion or deadlocks are likely to occur in IPLS.

On the other hand, although traditional methods, such as exact algorithms, heuristic rules, and meta-heuristic algorithms, can derive optimal or feasible solutions in certain specific environments, it becomes challenging to maintain a balance between performance, responsiveness, and adaptability. In such a scenario, researchers have explored reinforcement learning (RL) and deep reinforcement learning (DRL) to solve vehicle routing and service scheduling [7–9], which exhibit excellent generalization ability in complex environments. However, DRL algorithms belong to self-supervised trial-and-error learning methods, which demand a data-rich and safe simulation environment to support the training of DRL-based agents. Therefore, constructing a

* Corresponding author.
E-mail address: [email protected] (Y. Hu).

https://doi.org/10.1016/j.jmsy.2023.12.008
Received 4 March 2023; Received in revised form 16 November 2023; Accepted 21 December 2023
Available online 6 January 2024
0278-6125/© 2023 The Society of Manufacturing Engineers. Published by Elsevier Ltd. All rights reserved.

high-fidelity simulation environment is crucial for AGV dispatching and routing when using DRL-based methods.

In recent years, the domains of the Internet of Things, digital twin (DT), cyber-physical systems, and machine learning have emerged as potential game-changers for the manufacturing industry. These technologies offer new perspectives to overcome critical limitations and foster the integration of manufacturing and logistics decisions in Industry 4.0 [10,11]. The concept of DT, in particular, has revolutionized the organization and management of production processes by effectively synchronizing the physical and virtual spaces, thereby enabling production simulation, optimization, and control [12,13]. Researchers have begun to leverage this technology along with DRL to reduce training costs and improve adaptability [14]. DT provides a new perspective for realizing data-driven decision-making and optimization in dynamic environments, strengthening manufacturing systems' capacities for self-learning, self-optimization, and self-adaptability.

The challenges and opportunities posed by AGV dispatching with routing in current studies have motivated us to propose an adaptive DT-enhanced DRL-based optimization method. Our main contributions in this regard can be summarized as follows:

• A DT-enhanced DRL-based optimization framework is proposed to involve high-fidelity simulation in data-driven algorithms for AGV dispatching while integrating intelligent agents for conflict-free routing in the DT model.
• The Markov Decision Process (MDP) of AGV dispatching is formulated under the DT environment while considering the spatial-temporal and kinematics constraints of AGVs to improve the adaptability of dispatching solutions.
• An improved dueling double deep Q network (ID3QN) with count-based exploration is developed to maximize the long-term reward of the MDP, where a composite reward function is designed to fit tardiness and energy consumption simultaneously.

The remainder of this paper is organized as follows. Section 2 provides a summary of related works in the field of manufacturing. Section 3 describes the DT-enhanced DRL-based optimization framework, the multidimensional DT model, and the optimization formulation. Section 4 presents the details of the developed DRL-based agents. Section 5 verifies performance and adaptability. Finally, the last section summarizes the research and discusses future research.

2. Related work

2.1. AGV dispatching with routing

Researchers have extensively explored the AGV dispatching problem in the past few decades. The first AGV dispatching studies date back to the 1980s [15]. Since then, a growing number of researchers have conducted related studies. Several studies have begun examining AGV dispatching problems while considering conflict-free routing (CFR). Nishi et al. [16] developed a two-level mixed-integer formulation for AGV dispatching and routing separately to minimize the total weighted tardiness using Lagrangian relaxation. Demesure et al. [17] proposed a two-step decentralized control method to perform motion planning and dispatching with CFR according to global information and local priority to improve the pertinence and feasibility. However, due to the complexity of interaction, AGV routing constraints are often simplified or ignored when solving AGV dispatching problems to meet practical requirements.

Many algorithms have been proposed to meet specific requirements for AGV dispatching or conflict-free routing problems in various manufacturing scenarios. For instance, Zhang et al. [18] proposed an improved gene expression programming algorithm to dynamically schedule self-organized AGVs without considering routing. Li et al. [19] developed a particle swarm optimization-based mechanism for multi-robot and task allocation problems with dynamic demands in intelligent warehouse systems. Nishi et al. [20] proposed a Petri net decomposition method to solve dynamic AGV dispatching and CFR in bidirectional AGV systems by converting dynamic requests into static optimal firing sequencing problems. Shiue et al. [21] proposed a Q-learning and off-line learning-based method to utilize multi-attribute scheduling rules to respond to changes in the dynamic shop floor, which performs better than traditional heuristic dispatching rules. Hu et al. [22] proposed a deep Q-network-based AGV real-time scheduling method with mixed rules to minimize the makespan and delay ratio. Lastly, Tang et al. [23] proposed a hierarchical soft actor-critic method trained under different sub-goals to solve the order-picking problem, where the top level learned a policy, and the bottom level achieved the subgoals.

In Table 1, we can find a summary of the previous studies in AGV dispatching with routing. It is evident that numerous researchers have delved into AGV dispatching and routing problems under simplified constraints, and some have even integrated dispatching and routing at the vertical level. Nevertheless, due to the complexity of integrating dispatching and routing, AGV kinematic constraints are often overlooked. It can lead to theoretical dispatching solutions that are impractical in industrial settings.

2.2. Digital twin and reinforcement learning

Professor Grieves first introduced DT as a crucial technology for Industry 4.0, which involves physical space, virtual space, and the connection between them [24]. With its ability to accurately represent digital space, DT has opened up new possibilities for production simulation and optimization with the integration of multiple sub-systems at the vertical and horizontal levels [25]. Fang et al. [26] proposed a DT-based method for updating parameters to facilitate rescheduling in response to dynamic events. Negri et al. [27] developed a robust scheduling framework for flow shop scheduling problems that uses a DT model with an equipment prognostics and health management module to compute failure probability and optimize scheduling. In addition, Zhang et al. [28] presented a twins-learning framework that combines DT's simulation capacity with RL's learning capacity to improve production logistics systems' performance and adaptability. By synchronizing physical and virtual spaces, DT can integrate more complex modules to support top-level scheduling optimization, providing a new perspective for decision-making and optimization.

As DT theories and technologies rapidly develop, more researchers are exploring self-learning methods to enhance production performance in complex environments. Kuhnle et al. [29] studied reinforcement learning performance with different MDP designs to achieve a robust design that satisfies the requirements of dynamic and complex production systems with limited knowledge. Based on the learning capacity of reinforcement learning, many researchers are now exploring the integration of RL and DT. Zhang et al. [30] utilized real-time information from DT to derive near-optimal process plans based on reinforcement learning. Xia et al. [31] simulated system behaviors and predicted process faults using DT while incorporating DRL into the industrial control process to expand system-level DT usage. Park et al. [32] coordinated DT and RL to improve production control in a re-entrant job shop. However, the RL policy network was from creation procedures without synchronizing with the DT model. Yan et al. [33] developed a double-layer Q-learning algorithm to solve dynamic scheduling with preventive maintenance, and the DT model shortened the gap between prescheduling and actual scheduling. Lee et al. [34] tested the potential of a DT-driven DRL learning method in robotic construction environments, where the DT model simulated dynamic conditions and interacted with the DRL agent, demonstrating the adaptability of DRL in dynamic environments. Additionally, the fusion of DRL and DT broke the limitation of the sim-to-real problem, promoting the applications of DRL for the flocking motion of multi-AGV systems under unknown and stochastic environments [35].

In Table 2, we can find a summary of the previous studies in DT and


Table 1
A summary of the previous studies in AGV dispatching with routing.

Reference | Problem | Algorithm | Advantages | Disadvantages | Integrated level | Kinematics
[16] | Dispatching and routing | Lagrangian relaxation | Solving small-scale problems | A small ratio of processing and transport time | Vertical | ×
[17] | Dispatching and routing | Meta-heuristic | Planning feasible trajectories | Without considering dynamic obstacles | Vertical | ×
[18] | Dispatching | Meta-heuristic | Making real-time decisions | Time-consuming evolution | × | ×
[19] | Dispatching | Meta-heuristic | Facing multiple demands | Without considering collision avoidance | × | ×
[20] | Dispatching and routing | Petri nets | Embedding deadlock avoidance | Without considering dynamic events | Vertical | ×
[21] | Dispatching | RL | Making decisions based on real-time data | Low dimensions of state and action space | × | ×
[22] | Dispatching | DRL | Learning a mixed rule-based policy | Without considering energy consumption | × | ×
[23] | Dispatching | DRL | Adapting different scenarios | Without considering collision avoidance | × | ×

Table 2
A summary of the previous studies in digital twin and reinforcement learning.

Reference | Problem | Algorithm | Advantage | Disadvantage | Integrated level | Kinematics
[26] | Job-shop scheduling | Meta-heuristic | Updating state parameters dynamically | Time-consuming evolution | Horizontal | ×
[27] | Flow-shop scheduling | Meta-heuristic | Diagnosing equipment failure in real-time | Time-consuming evolution | Horizontal | ×
[28] | Dispatching and routing | RL | Integrating dispatching and routing in real-time | Without considering AGV movement control | Horizontal/vertical | ×
[30] | Process planning | RL | Generating near-optimal solutions in a limited time | Low dimensions of state and action space | Horizontal | ×
[31] | Manufacturing control | DRL | Modeling high-fidelity virtual environments | Single equipment | Horizontal | √
[32] | Re-entrant job shop dispatching | DRL | Integrating DT and RL | Without defining DT autonomous operation | Horizontal | ×
[33] | Flexible job shop scheduling | DRL | Considering preventive maintenance | Without considering the delivery time | Horizontal | ×
[34] | Task allocation | DRL | Adapting robotic construction | Single scenario | Horizontal | √
[35] | Flocking motion | DRL | Updating the trained model continuously | Without modeling high-fidelity DT | Horizontal | √

RL. It shows that DT opens up new ways to explore data-driven optimization methods for integrating equipment scheduling and control at the horizontal level. The rich data environment enhances the training of DRL-based agents, and the virtual environment reduces the risk of poor solutions during training processes. Nonetheless, there is still a lack of integration of multiple sub-problems at the vertical level.

2.3. Literature summary

After analyzing the previous studies in Table 1, a simplified mathematical model could help integrate AGV dispatching and routing at the vertical level. However, it is noted that this model often ignores the kinematic constraints of AGVs due to their complexity, which makes integration at the horizontal level more challenging. On the other hand, after analyzing the previous studies in Table 2, digital twin technology provides a new perspective to integrate equipment components at the horizontal level, thanks to its representation ability in the digital space. Moreover, high-fidelity simulations ensure safety and feasibility during the training of reinforcement learning algorithms, which are trial-and-error methods.

Therefore, this study proposes an adaptive DT-enhanced DRL-based optimization method that integrates and synchronizes dispatching and routing to improve the adaptability of AGV dispatching solutions. DRL-based agents meet the requirements of performance and adaptability in dynamic environments. Additionally, embedding the AGV movement control model promotes integration at the horizontal level, while the interaction between AGV dispatching and routing facilitates integration at the vertical level. This approach eliminates the gap between theoretical solutions and practical applications while ensuring excellent system performance.

3. Methods

3.1. Problem description

The intelligent production logistics system (IPLS) comprises various equipment such as automated guided vehicles (AGVs), robots, stackers, manufacturing units, a warehouse, and a battery replacement station. AGVs are primarily responsible for transporting materials between warehouses and machines in a horizontal movement and sequentially executing the material handling tasks on their list. They move from the current point to the loading point, pick up the materials, and then transport them to the unloading point. Robots play a vital role in handling materials between AGVs and buffers. On the other hand, stackers are responsible for depositing and retrieving materials in the warehouse. AGVs move to the battery replacement station to replace their batteries in case of low battery.

The present study delves into AGV dispatching and battery replacement management (D&BRM) while considering conflict-free routing (CFR). The centralized AGV D&BRM decisions encompass the AGV assignment and the battery replacement management. Regarding CFR, the A* algorithm devises a global conflict-free path based on task information in the static environment. Building on the preceding research, a deep deterministic policy gradient (DDPG)-based agent avoids dynamic obstacles along the conflict-free path [36]. The notations are defined as follows.
at the vertical level. This approach eliminates the gap between


The definition of notations:

T — Set of tasks, T = {Ti | T1, T2, …, Tn}
G — Set of AGVs, G = {Gj | G1, G2, …, Gq}
M — Set of stations, M = {Mk | M1, M2, …, Mm}
n — The number of tasks
t — The current time
m — The number of loading/unloading stations
q — The number of AGVs in IPLS
Bc — The battery capacity of AGVs
Bl — The threshold of battery state of charge (SoC)
Tr — The time of replacing the insufficient battery
Lt — The loading/unloading time of the part
Ws — The weight of the AGV
Vml — The maximum linear velocity of the AGV
ati — The arrival time of Ti
dti — The due date of Ti
spi — The loading position of Ti
dpi — The unloading position of Ti
wi — The part weight of Ti
xk — The x-axis of Mk
yk — The y-axis of Mk
vltj — The linear velocity of AGV Gj at t
vωtj — The angular velocity of AGV Gj at t
dkk′ — The Manhattan distance between Mk and Mk′
rasti — The AGV real start time of Ti
rslti — The real start loading time of Ti
rsuti — The real start unloading time of Ti
rcti — The real completion time of Ti
ecti — The estimated completion time of Ti
erbcij — The estimated remaining battery SoC of Gj after performing Ti
Ei — The energy consumption for completing Ti
TDi — The real tardiness of Ti
wmtj — The loaded part weight of Gj at t
ewtj — The unit energy consumption of Gj at t
ewa — The average unit energy consumption of an AGV during loaded traveling
patj — The destination point of Gj at t
pxtj — The AGV x-axis position of Gj at t
pytj — The AGV y-axis position of Gj at t
βtj — The AGV angle at t
bstj — The battery SoC of Gj at t
rlttj — The remaining loaded traveling time of Gj at t
ruttj — The remaining unloaded traveling time of Gj at t
rottj — The remaining other time of Gj at t
rnttj — The remaining time of Gj at t without battery replacement
Xij — If Gj is dispatched to transport Ti, Xij = 1; otherwise, Xij = 0
Yij — If Gj is dispatched to transport Ti with battery replacement, Yij = 1; otherwise, Yij = 0

A group of material handling tasks is denoted as T. When any task Ti arrives at time ati, an AGV is dispatched from the group (G) with a simultaneous evaluation to determine if battery replacement is required. Therefore, the optimization objective of AGV D&BRM is to minimize the average tardiness and average energy consumption of material handling tasks, as defined in (1)–(2). The spatial-temporal constraints include the following: Constraint (3) states that only one AGV must be dispatched for each task. Constraints (4)–(5) represent the limitations of the kinematics of AGVs. Constraint (6) dictates that the battery state of charge must not be lower than the threshold. Constraint (7) ensures that AGVs avoid obstacles, where GTj is the trajectory of Gj and OB represents obstacles. Constraints (8)–(11) represent the time constraints in the dynamic DT environment.

$\min f_1 = \frac{1}{n}\sum_{i=1}^{n} TD_i$  (1)

$\min f_2 = \frac{1}{n}\sum_{i=1}^{n} E_i$  (2)

s.t.

$\sum_{j=1}^{q} X_{ij} = 1, \ \forall i \in T$  (3)

$v^{l}_{tj} \in [0, V^{l}_{m}], \ \forall j \in G, \forall t$  (4)

$v^{\omega}_{tj} \in \left[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right], \ \forall j \in G, \forall t$  (5)

$bs_{tj} \ge Bc \times Bl, \ \forall j \in G, \forall t$  (6)

$GT_j \cap OB = \emptyset, \ \forall j \in G$  (7)

$rct_i \ge rsu_{ti} + Lt, \ \forall i \in T$  (8)

$ras_{ti} \ge X_{ij}\left(rlt_{tj} + rut_{tj} + rot_{tj}\right), \ \forall i \in T, \forall j \in G$  (9)

$rsl_{ti} \ge ras_{ti} + X_{ij}\int v^{l}_{tj}\,dt, \ \forall i \in T, \forall j \in G, \forall t$  (10)

$rsu_{ti} \ge rsl_{ti} + Lt + X_{ij}\int v^{l}_{tj}\,dt, \ \forall i \in T, \forall j \in G, \forall t$  (11)
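To make the two objectives concrete, the following is a minimal sketch of how Eqs. (1)–(3) can be evaluated from completed-task records; the record fields and container types are illustrative assumptions, not the authors' data model.

```python
# Sketch of the objectives in Eqs. (1)-(2) and the assignment constraint in Eq. (3).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TaskRecord:
    completion_time: float   # rct_i, obtained from the DT model
    due_date: float          # dt_i
    energy: float            # E_i, from the energy model in Eq. (19)
    assigned_agvs: int       # number of AGVs dispatched to this task

def objectives(records: List[TaskRecord]) -> Tuple[float, float]:
    """Return (f1, f2): average tardiness and average energy consumption."""
    n = len(records)
    f1 = sum(max(0.0, r.completion_time - r.due_date) for r in records) / n
    f2 = sum(r.energy for r in records) / n
    return f1, f2

def assignment_feasible(records: List[TaskRecord]) -> bool:
    """Eq. (3): exactly one AGV must be dispatched for each task."""
    return all(r.assigned_agvs == 1 for r in records)
```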

3.2. DT-enabled DRL-based optimization framework

This study proposes an adaptive optimization framework based on deep reinforcement learning (DRL) and digital twin (DT) enhancements. As depicted in Fig. 1, the framework comprises physical, digital, and service optimization spaces. The physical space is an intelligent teaching manufacturing system that collaborates with flexible equipment to manufacture various parts. In the digital space, virtual entities replicate the characteristics of geometry, movement control, behavior, and rule of physical entities in the DT-enhanced IPLS. The digital space provides a high-fidelity simulation environment for intelligent agents to learn and optimize manufacturing systems by simulating decisions and collecting historical experiences. The service optimization space employs a bi-level integration optimization method to solve the dispatching and routing problems. At the dispatching level, a DRL-based D&BRM agent assigns tasks to AGVs based on real-time data from the digital space and maximizes the long-term reward by optimizing the decision-making policy according to historical experiences. At the routing level, the AGV controller integrates a static path planning (SPP) agent and a dynamic collision avoidance (DCA) agent. The SPP agent plans a global conflict-free path in static environments. In contrast, the DCA agent generates control instructions to avoid stochastic obstacles in dynamic environments according to the guidance of the conflict-free path.

The DT model integrates DRL-based intelligent agents and provides them with a high-fidelity simulation environment to verify the feasibility and performance of solutions. The DT model improves the intelligence and adaptability of physical entities and optimizes the dispatching performance by integrating the bi-level problem.

The proposed DT-enhanced DRL-based optimization method consists of three key aspects: the DT model creation, the formulation of the optimization problem, and the design of intelligent agents.

3.3. DT model creation

3.3.1. Geometry model
The geometry model is crucial in mapping the physical entities' shape, size, structure composition, and assembly relations into the digital space. In the case of AGVs, the geometry model provides specific details, such as the size, shape, and position of wheels and related sensors, which significantly influence collision avoidance. Similarly, in the case of robots, the geometry model provides details such as the size and parent-child relations of arms, which affect the grabbing and placing of parts and the interaction with machines and AGVs. Therefore, it is essential to have an accurate geometry model to ensure the proper functioning and optimal performance of the IPLS.



Fig. 1. The DT-enhanced DRL-based optimization framework.


3.3.2. Movement control model
The AGV movement control model in (12) considers linear velocity, angular velocity, position, and attitude, as shown in Fig. 2.

$\begin{bmatrix} px_{(t+1)j} \\ py_{(t+1)j} \\ \beta_{(t+1)j} \end{bmatrix} = \begin{bmatrix} px_{tj} \\ py_{tj} \\ \beta_{tj} \end{bmatrix} + \begin{bmatrix} \cos\beta_{tj} \times v^{l}_{tj} \\ \sin\beta_{tj} \times v^{l}_{tj} \\ v^{\omega}_{tj} \end{bmatrix} \times \Delta t$  (12)

Fig. 2. The AGV movement control model.

In addition, proportional-integral-derivative (PID) control is built into the DT model as defined in (13)–(16), where $v^{c}$ is the velocity command calculated by the DCA agent and $v^{r}$ is the actual velocity obtained by using PID. $K_p$, $K_i$, and $K_d$ are the proportional, integral, and derivative gains, where $K_p = 0.2$, $K_i = 0.1$, and $K_d = 0.02$.

$v_{tj} = \left\{v^{l}_{tj}, v^{\omega}_{tj}\right\}$  (13)

$e_{tj} = v^{c}_{tj} - v_{tj}$  (14)

$u_{tj} = K_p e_{tj} + K_i \int_{0}^{t} e_{tj}\,dt + K_d \frac{de_{tj}}{dt}$  (15)

$v^{r}_{(t+1)j} = v_{tj} + u_{tj}$  (16)
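The following is a minimal sketch of the movement control model in Eqs. (12)–(16), assuming a unicycle (differential-drive) kinematic model and the PID gains reported above; the class and function names are illustrative, not the authors' implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class AGVState:
    x: float = 0.0      # px_{tj}: x position [m]
    y: float = 0.0      # py_{tj}: y position [m]
    beta: float = 0.0   # heading angle [rad]
    v_lin: float = 0.0  # current linear velocity [m/s]
    v_ang: float = 0.0  # current angular velocity [rad/s]

@dataclass
class PID:
    kp: float = 0.2
    ki: float = 0.1
    kd: float = 0.02
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        """Eq. (15): u = Kp*e + Ki*integral(e) + Kd*de/dt."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def update_pose(s: AGVState, dt: float) -> None:
    """Eq. (12): integrate position and heading over one time step."""
    s.x += math.cos(s.beta) * s.v_lin * dt
    s.y += math.sin(s.beta) * s.v_lin * dt
    s.beta += s.v_ang * dt

def track_velocity(s: AGVState, v_cmd_lin: float, v_cmd_ang: float,
                   pid_lin: PID, pid_ang: PID, dt: float) -> None:
    """Eqs. (13)-(16): drive the actual velocity toward the DCA command."""
    s.v_lin += pid_lin.step(v_cmd_lin - s.v_lin, dt)
    s.v_ang += pid_ang.step(v_cmd_ang - s.v_ang, dt)

# Example: track a constant command for 1 s at 10 Hz.
if __name__ == "__main__":
    agv, pl, pa, dt = AGVState(), PID(), PID(), 0.1
    for _ in range(10):
        track_velocity(agv, 0.5, 0.0, pl, pa, dt)
        update_pose(agv, dt)
    print(round(agv.x, 3), round(agv.y, 3))
```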
3.3.3. Behavior model
The behavior model serves as a representation of material handling processes at both the equipment and system levels. At the system level, the behaviors indicate the interactions occurring among various equipment, as illustrated in Fig. 3. To cite an instance, the robot retrieves the part from the AGV and subsequently positions it on the machining buffer.


At the equipment level, the behavior model comprises several key behaviors, including the movement and energy consumption behaviors of AGVs and the behaviors of robots. The movement behaviors of AGVs encompass unloaded traveling, loaded traveling, collision avoidance, and waiting. During the movement processes, the DCA agent provides the AGV with real-time velocity to avoid dynamic obstacles. Moreover, the AGV energy consumption model, defined in (17)–(19), is established to monitor the battery state and maintain continuous operation. Additionally, the behaviors of robots involve grabbing and placing parts to realize the loading and unloading by interacting with AGVs and machining buffers. The behavior model at the equipment level thus enables a detailed mapping of the material handling processes taking place within a system.

Fig. 3. The behavior model in IPLS.

$ew_{tj} = v^{l}_{tj}\left(\left(W_s + wm_{tj}\right) \times 0.007275 + 1.93645\right) / V^{l}_{m}$  (17)

$ew_{a} = \left(W_s + \sum_{i=1}^{n} w_i / n\right) \times 0.007275 + 1.93645$  (18)

$E_i = \sum_{j}^{q} X_{ij} \int_{ras_{ti}}^{rsu_{ti}} ew_{tj}\,dt$  (19)

3.3.4. Rule model
The rule model imposes restrictions on the operation of equipment and systems based on the spatial-temporal constraints of AGVs, as outlined in Section 3.1. Moreover, the DCA agent must comply with constraints (20)–(21), which dictate that AGVs are required to move in tandem with the global path determined by the SPP agent.

$\vec{P}_{tj} \cdot \vec{PG}_{tj} > 0$  (20)

$\Delta\vec{P}_{tj} \cdot \vec{PG}_{tj} > 0$  (21)

3.4. Dispatching optimization formulation

The AGV D&BRM, considering the CFR problem in the DT-enhanced IPLS, is formulated as an MDP (S, A, P, γ, R, π). S is the state of the current environment. A represents an integrated action, including the dispatching and battery replacement of AGVs. P is the state transition probability. γ represents the discount factor. R is the reward function. π(s, a) is the probability of taking action a at state s when using policy π.

3.4.1. State representation
The real-time state includes the task's and AGVs' data in the dynamic DT environment. At the decision point when task Ti arrives at time ati, the task's data consists of the x-axis and y-axis values of the loading and unloading positions, the due date, and the part weight, as in (22).

$\left\{x_{sp_i}, y_{sp_i}, x_{dp_i}, y_{dp_i}, dt_i, w_i\right\}$  (22)

The AGVs' data includes the destination position, the remaining unloaded traveling time, the remaining loaded traveling time, the remaining other time, the battery SoC, and the remaining time without battery replacement, defined in (23). Thus, the state space is 6 + 7 × q dimensions.

$\left\{xpa_{tj}, ypa_{tj}, rlt_{tj}, rut_{tj}, rot_{tj}, rnt_{tj}, bs_{tj}\right\}, \ \forall i \in T, \forall j \in G$  (23)

3.4.2. Action representation
The action represents the AGV dispatched and the battery replacement strategy, defined as ai = {j, Yij}, j ∈ G. Thus, the action space is 2 × q dimensions.

3.4.3. State transition
When a new task arrives, the state transition occurs at the decision point. Eqs. (24)–(27) represent the state transition of parameters such as the position, the remaining loaded traveling time, the remaining unloaded traveling time, the remaining other time, and the remaining time without battery replacement of AGVs.

$rlt_{(t+1)j} \ge X_{ij}\left(rlt_{tj} + rsu_{ti} - rsl_{ti} - 1\right) + \left(1 - X_{ij}\right)\max\left(0, rlt_{tj} - 1\right), \ \forall j \in G, \forall t$  (24)

$rut_{(t+1)j} \ge X_{ij}\left(rut_{tj} + rsl_{ti} - ras_{ti} + Y_{ij} \times d_{dp_i M_s} / V^{l}_{m} - 1\right) + \left(1 - X_{ij}\right)\max\left(0, rut_{tj} - 1\right), \ \forall j \in G, \forall t$  (25)

$rot_{(t+1)j} \ge X_{ij}\left(rot_{tj} + 2Lt + Y_{ij}Tr - 1\right) + \left(1 - X_{ij}\right)\max\left(0, rot_{tj} - 1\right), \ \forall j \in G, \forall t$  (26)

$rnt_{(t+1)j} \ge X_{ij}\left(1 - Y_{ij}\right)\left(rnt_{tj} + rsu_{ti} - ras_{ti} - Lt - 1\right) + \left(1 - X_{ij}\right)\max\left(0, rnt_{tj} - 1\right), \ \forall j \in G, \forall t$  (27)

3.4.4. Reward function
The paper aims to minimize the average tardiness in (1) and the average energy consumption of material handling tasks in (2). The tardiness of the material handling task is defined in (28), where rcti is generated from the DT model. As a result, the composite reward function is designed in (29), where ecti and erbcij are estimated using the mathematical model in (30)–(32).

$TD_i = \max\left(0, rct_i - dt_i\right)$  (28)

$r^{E}_{i} = \begin{cases} 20, & \text{if } ect_i \le dt_i \text{ and } Y_{ij} = 0 \text{ and } erbc_{ij} \ge Bc \times Bl \\ 10, & \text{if } ect_i \le dt_i \text{ and } Y_{ij} = 1 \\ 0, & \text{if } ect_i > dt_i \text{ and } Y_{ij} = 0 \text{ and } erbc_{ij} \ge Bc \times Bl \\ -10, & \text{if } ect_i > dt_i \text{ and } Y_{ij} = 1 \\ -20, & \text{otherwise} \end{cases}$  (29)

$ect_i = at_i + X_{ij}\left(rlt_{ij} + rut_{ij} + rot_{ij}\right) + \left(d_{pa_j sp_i} + d_{sp_i dp_i}\right) / V^{l}_{m}$  (30)

$d_{kk'} = |x_k - x_{k'}| + |y_k - y_{k'}|$  (31)

$erbc_{ij} = bs_{tj} - \left(rlt_{ij} + rut_{ij}\right) \times ew_{a} - X_{ij}E_i$  (32)

3.4.5. Policy
The action-value function Qπ(s, a) in (33) represents the value of using policy π to decide on action a at state s. The target is to find an optimal policy π* in (34) for dispatching AGVs and managing battery replacement according to the real-time state in the DT-enhanced IPLS.

$Q_{\pi}(s, a) = E_{\pi}\left[G_i \mid S_i = s, A_i = a\right]$  (33)

$Q_{\pi^*}(s, a) = \max_{\pi} E_{\pi}\left[R \mid S_i = s, A_i = a\right] = \max_{\pi} q_{\pi}(s, a)$  (34)
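The sketch below illustrates how the state vector of Eqs. (22)–(23) and the composite reward of Eq. (29) can be assembled; the container classes and field names are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    x_sp: float        # loading position x
    y_sp: float        # loading position y
    x_dp: float        # unloading position x
    y_dp: float        # unloading position y
    due: float         # due date dt_i
    weight: float      # part weight w_i

@dataclass
class AGV:
    x_dest: float      # destination position pa_{tj} (x)
    y_dest: float      # destination position pa_{tj} (y)
    rlt: float         # remaining loaded traveling time
    rut: float         # remaining unloaded traveling time
    rot: float         # remaining other time
    rnt: float         # remaining time without battery replacement
    soc: float         # battery SoC bs_{tj}

def build_state(task: Task, agvs: List[AGV]) -> List[float]:
    """Eqs. (22)-(23): 6 task features plus 7 features per AGV (dimension 6 + 7q)."""
    s = [task.x_sp, task.y_sp, task.x_dp, task.y_dp, task.due, task.weight]
    for g in agvs:
        s += [g.x_dest, g.y_dest, g.rlt, g.rut, g.rot, g.rnt, g.soc]
    return s

def composite_reward(ect: float, due: float, replace_battery: bool,
                     erbc: float, soc_threshold: float) -> float:
    """Eq. (29): reward shaped by the estimated completion time versus the due date
    and by whether a battery replacement (Y_ij = 1) is scheduled."""
    on_time = ect <= due
    if on_time and not replace_battery and erbc >= soc_threshold:
        return 20.0
    if on_time and replace_battery:
        return 10.0
    if not on_time and not replace_battery and erbc >= soc_threshold:
        return 0.0
    if not on_time and replace_battery:
        return -10.0
    return -20.0
```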

4. Intelligent agent

4.1. The integration of AGV dispatching and routing

The study aims to investigate an optimal policy for AGV dispatching in DT-enhanced environments amidst CFR problems. To this end, this study proposes a DRL-based optimization method, as illustrated in Fig. 4, which can provide real-time dispatching and routing decisions while fulfilling the requirements of performance, responsiveness, and adaptability.

The D&BRM agent utilizes the digital twin to gather real-time data of AGVs and ongoing tasks at each decision point when task Ti arrives at time ati. It then provides the SPP agent with AGV dispatching and battery replacement decisions. The SPP agent plans a conflict-free path using the A* algorithm, avoiding static obstacles, and sends the path as a global guideline to the DCA agent. The DDPG-based DCA agent collects data by calling the digital twin, decides the AGV velocity, and uses PID to calculate the actual velocity for the AGV controller. The decisions are sent back to the digital twin, which updates the environment's state in real time and executes the decisions. After the task is completed, the digital twin provides feedback to the agent to generate decision-making experiences for learning a better policy.

Fig. 4. The interaction of DRL-based agents for AGV D&BRM with CFR.

4.2. Dynamic collision avoidance agent

The DCA agent employs real-time data from relevant sensors to measure dynamic obstacles during the movement processes, guided by the SPP agent with the A* algorithm. Based on static conflict-free path planning and dynamic collision avoidance, AGVs can avoid all types of obstacles for CFR while preventing the occurrence of congestion or deadlock in complex IPLS. The DDPG-based actor-network, as presented in Fig. 5, outputs the AGV linear velocity and angular velocity based on the direction of the global conflict-free path, velocity, radar data, and position [36].

Fig. 5. The actor-network of the DDPG-based DCA agent.

Regarding the training of the DCA agent, stochastic obstacles exist in the DT environment for each episode, and four AGVs move in the IPLS as the dynamic obstacles to gather historical experiences and learn a better policy for collision avoidance. After training, the DCA agent is deployed to the AGV's controller to support the training of the D&BRM agent. Additionally, the DCA agent can update the DCA policy as the system runs.
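Since the SPP agent relies on A* for global conflict-free paths, the compact sketch below shows a grid-based A* search that uses the Manhattan distance of Eq. (31) as its heuristic. The grid/obstacle representation and 4-connected moves are assumptions for illustration; the paper does not detail the SPP agent's map model.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]

def manhattan(a: Cell, b: Cell) -> int:
    """Eq. (31): Manhattan distance, used here as the admissible heuristic."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(start: Cell, goal: Cell, obstacles: Set[Cell],
           width: int, height: int) -> Optional[List[Cell]]:
    """Return a shortest 4-connected path from start to goal, or None if blocked."""
    open_heap = [(manhattan(start, goal), 0, start)]
    g_cost: Dict[Cell, int] = {start: 0}
    parent: Dict[Cell, Cell] = {}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = [cur]
            while cur in parent:
                cur = parent[cur]
                path.append(cur)
            return path[::-1]
        if g > g_cost.get(cur, float("inf")):
            continue  # stale heap entry
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in obstacles:
                continue
            new_g = g + 1
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                parent[nxt] = cur
                heapq.heappush(open_heap, (new_g + manhattan(nxt, goal), new_g, nxt))
    return None

if __name__ == "__main__":
    print(a_star((0, 0), (3, 3), obstacles={(1, 1), (2, 1)}, width=4, height=4))
```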

4.3. Dispatching agent design

The D&BRM agent provides efficient decisions under dynamic environments. This paper proposes an improved dueling double deep Q network for the D&BRM agent.

First, the dueling network addresses conditions where different actions have the same influence on the dynamic environment [37]. Second, the experience replay enhances the utilization of historical experiences [38]. Besides, a count-based exploration discovers more states to strengthen adaptability [39]. The main Q and target Q networks, as depicted in Fig. 6, learn the mapping between the state and the action. Two identical networks are employed, where the activation functions of the hidden layers are ReLU.

Fig. 6. The neural network of the ID3QN-based D&BRM agent.
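The following is a minimal sketch of a dueling Q-network head in TensorFlow/Keras (Section 5 states that the D&BRM agent is trained with TensorFlow 2.2). Only the ReLU hidden layers and the value/advantage split follow the description above; the layer widths are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_q_network(state_dim: int, n_actions: int) -> Model:
    state = layers.Input(shape=(state_dim,))
    h = layers.Dense(128, activation="relu")(state)
    h = layers.Dense(128, activation="relu")(h)
    # Dueling heads: state value V(s) and action advantages A(s, a).
    value = layers.Dense(1)(h)
    advantage = layers.Dense(n_actions)(h)
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q = value + advantage - tf.reduce_mean(advantage, axis=1, keepdims=True)
    return Model(inputs=state, outputs=q)

# Example for q = 5 AGVs: state dimension 6 + 7q = 41, action dimension 2q = 10.
if __name__ == "__main__":
    main_q = build_dueling_q_network(state_dim=41, n_actions=10)
    target_q = build_dueling_q_network(state_dim=41, n_actions=10)
    target_q.set_weights(main_q.get_weights())  # synchronize the target network
    print(main_q.output_shape)
```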


4.4. Dispatching agent training

In optimizing the cumulative reward, the D&BRM agent interacts with the DT environment to minimize the loss value between the main and target Q networks. Algorithm 1 presents the pseudocode of the training process.

Algorithm 1. The pseudocode of training the D&BRM agent.

Here, the ε-greedy strategy balances exploration and exploitation during training, referring to [8]. SimHash, in (35)–(36), calculates the internal reward, which helps to explore more states.

$r^{I}_{i} = \beta / \sqrt{n(\phi(s))}$  (35)

$\phi(s) = \mathrm{sgn}\left(A\,g(s)\right)$  (36)

The loss value is calculated in (37), which is used to update the parameters of the main Q network with the gradient descent algorithm in (38) [40]. Table 3 describes the relevant parameters of the D&BRM agent.

$L(\theta_i) = \left(\left(r^{E}_{i} + r^{I}_{i} + \gamma \max Q\left(s_{i+1}; \theta_i'\right)\right) - Q\left(s_i, a_i; \theta_i\right)\right)^2$  (37)

$\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i} L(\theta_i)$  (38)

Table 3
The parameters of the proposed ID3QN.

Name | Description | Value
BS | The number of samples in each batch | 64
TE | The number of training epochs | 5
UF | The updating frequency of the target network | 50
RBS | The maximum records of replay buffer D | 10,000
β | The bonus coefficient | 0.01
γ | The discount factor of future rewards | 0.9
α | The learning rate of the main Q network | 1.0 × 10-4
DS | The dimension of SimHash | 20
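The sketch below illustrates one update step combining the count-based SimHash bonus of Eqs. (35)–(36) with the target of Eq. (37). It is a simplified schematic reconstruction, not the authors' Algorithm 1; network evaluation is abstracted as plain arrays, and the hashing matrix and toy values are assumptions.

```python
import numpy as np

class SimHashCounter:
    """phi(s) = sgn(A g(s)); visit counts are kept per hash code (Eq. 36)."""
    def __init__(self, state_dim: int, k: int = 20, beta: float = 0.01, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))
        self.beta = beta
        self.counts = {}

    def bonus(self, state: np.ndarray) -> float:
        code = tuple((self.A @ state > 0).astype(int))   # sign hash of the state
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.beta / np.sqrt(self.counts[code])    # Eq. (35): r^I = beta / sqrt(n(phi(s)))

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng) -> int:
    """epsilon-greedy action selection over the Q-values of the main network."""
    return int(rng.integers(len(q_values))) if rng.random() < epsilon else int(np.argmax(q_values))

def td_target(r_ext: float, r_int: float, q_next_target: np.ndarray, gamma: float = 0.9) -> float:
    """Eq. (37): y = r^E + r^I + gamma * max_a Q(s', a; theta')."""
    return r_ext + r_int + gamma * float(np.max(q_next_target))

# Toy usage with random arrays standing in for the main/target network outputs.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    hasher = SimHashCounter(state_dim=41)
    state = rng.standard_normal(41)
    q_main = rng.standard_normal(10)           # Q(s, .; theta) for 2q = 10 actions
    action = epsilon_greedy(q_main, epsilon=0.1, rng=rng)
    r_int = hasher.bonus(state)
    y = td_target(r_ext=20.0, r_int=r_int, q_next_target=rng.standard_normal(10))
    td_error = y - q_main[action]              # gradient descent on (y - Q)^2 realizes Eq. (38)
    print(action, round(r_int, 4), round(td_error, 3))
```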

5. Result

The programming environment utilized in this study is Python 3.8, which is supported by TensorFlow 2.2 for the D&BRM agent and PyTorch 1.9 for the DCA agent. The DT model is constructed under the Unity 2019.2.4f1 development platform using the C# language. The communication between agents is facilitated through the TCP/IP protocol. The experimental setup runs on a workstation equipped with an Intel® Xeon Gold 6230R CPU, 32 GB of RAM, and an NVIDIA RTX 3090 graphics card.

Fig. 7. The digital space of the manufacturing system.

5.1. Experimental setting

In the case of AGV D&BRM with CFR problems, the manufacturing system comprises ten machines, two warehouses, and one battery replacement station, as illustrated in Fig. 7. The numerical experiments conducted in this study are generated based on the method outlined in [41], which assumes that the arrival of dynamic tasks follows a Poisson process with rate λ. Thus, the arrival time of dynamic tasks is defined using an exponential distribution with an interval time Ii, as expressed in (39)–(40). The loading and unloading positions are selected using a uniform distribution U[1,12], where each position is unique. The weight of parts is generated randomly using a uniform distribution defined as {200, 400, 600, 800, 1000}. As such, the due date is calculated using (41), which involves the arrival time, the unloading and loading positions, and the delay factor df.

$I_i \sim \mathrm{Exp}(\lambda)$  (39)

$at_{i+1} = at_i + I_i$  (40)

$dt_i = at_i + df\left(d_{sp_i dp_i} / V^{l}_{m} + Lt\right)$  (41)

The AGVs utilized in this study are developed based on the Robot Operating System (ROS) and are connected through TCP/IP with the DT model. Table 4 provides a detailed description of the parameters of the AGVs used in this study.

Table 4
The parameters of AGVs.

Parameter | Value
Ws | 1310 kg
Vml | 0.5 m/s
Bc | 6 Ah
Bl | 0.05
Tr | 120 s
Lt | 20 s
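The following sketch shows how the dynamic task stream of Eqs. (39)–(41) can be generated: exponential inter-arrival times, distinct uniformly chosen loading/unloading stations, random part weights, and due dates from Eq. (41). The station layout, the delay factor value, and the interpretation of λ as the mean inter-arrival time are illustrative assumptions.

```python
import random

V_ML = 0.5      # maximum linear velocity [m/s], Table 4
LT = 20.0       # loading/unloading time [s], Table 4

def generate_tasks(n_tasks: int, lam: float, df: float, station_xy: dict, seed: int = 0) -> list:
    """station_xy maps station id (1..12) to (x, y); lam is the mean inter-arrival time."""
    rng = random.Random(seed)
    tasks, arrival = [], 0.0
    for i in range(n_tasks):
        arrival += rng.expovariate(1.0 / lam)            # Eqs. (39)-(40): I_i ~ Exp, at_{i+1} = at_i + I_i
        sp, dp = rng.sample(sorted(station_xy), 2)       # distinct loading/unloading stations, U[1,12]
        weight = rng.choice([200, 400, 600, 800, 1000])  # random part weight
        dist = (abs(station_xy[sp][0] - station_xy[dp][0])
                + abs(station_xy[sp][1] - station_xy[dp][1]))   # Manhattan distance, Eq. (31)
        due = arrival + df * (dist / V_ML + LT)           # Eq. (41)
        tasks.append({"id": i, "arrival": arrival, "sp": sp, "dp": dp,
                      "weight": weight, "due": due})
    return tasks

if __name__ == "__main__":
    stations = {k: (3.0 * k, 2.0 * (k % 4)) for k in range(1, 13)}  # placeholder layout
    for task in generate_tasks(n_tasks=3, lam=100.0, df=2.0, station_xy=stations):
        print(task)
```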
5.2. Parameter sensitivity analysis

Multiple hyper-parameters affect the convergence of DRL algorithms, the most important being the learning rate and the discount factor. Learning rates impact the learning speed and accuracy of neural networks, while discount factors influence the estimation of future rewards. This study investigates the performance of the proposed algorithm under nine hyper-parameter combinations, including three levels of learning rates (1.0 × 10-3, 1.0 × 10-4, 1.0 × 10-5) and three levels of discount factors (0.8, 0.9, 1.0). We generate 500 tasks with λ = 100 while running five AGVs to evaluate performance. The ID3QN-based agent is trained for 100 episodes and becomes stable after around 80 episodes. Table 5 displays the average reward and standard deviation when the algorithm reaches stability. The results indicate that the neural network is well designed, given that the performance remains stable and efficient across the different hyper-parameter combinations. We found that the proposed algorithm obtains the best average reward when the discount factor is 0.9 and the learning rate is 1.0 × 10-4. We also observed that lower learning rates result in lower standard deviation. However, lower learning rates come at a higher computational cost, which may require more episodes to achieve convergence. As a result, we set the learning rate as 1.0 × 10-4 and the discount factor as 0.9.

Table 5
The reward with different hyper-parameters.

α | γ | Average reward | Standard deviation
1.0 × 10-3 | 0.8 | 11.178 | 0.825
1.0 × 10-3 | 0.9 | 11.584 | 0.808
1.0 × 10-3 | 1.0 | 11.022 | 0.792
1.0 × 10-4 | 0.8 | 11.217 | 0.601
1.0 × 10-4 | 0.9 | 11.776 | 0.573
1.0 × 10-4 | 1.0 | 11.355 | 0.507
1.0 × 10-5 | 0.8 | 11.359 | 0.526
1.0 × 10-5 | 0.9 | 11.588 | 0.574
1.0 × 10-5 | 1.0 | 11.224 | 0.545


In addition, Fig. 8(a), (b), and (c) present the learning curves of rewards, tardiness, and energy consumption, respectively. The results show that the learning process is stable, and the reward function designed in this study fits the optimization objectives well.

Fig. 8. (a) The learning curve of rewards, (b) the learning curve of tardiness, (c) the learning curve of energy consumption.

5.3. Performance comparison

Four baselines are selected to evaluate the performance. Minimum waiting time (MIWT) dispatches AGVs as defined in (42). This heuristic rule helps to evaluate the improvement for long-term decision-making problems.

$\mathop{\mathrm{argmin}}_{j \in G}\left\{rlt_{ij} + rut_{ij} + rot_{ij}\right\}$  (42)

Deep Q network (DQN) was developed by [22]. In this paper, we also consider battery management for suitable applications. Double deep Q network (DDQN) and dueling double deep Q network (D3QN) [8,40] are built the same as in Section 4. They help to evaluate the generalizability of the proposed method.
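For reference, the MIWT baseline of Eq. (42) reduces to a single argmin over the remaining times of the AGV fleet, as in the small sketch below; the AGV record fields are illustrative assumptions.

```python
from typing import Dict, List

def miwt_dispatch(agvs: List[Dict[str, float]]) -> int:
    """Eq. (42): return index j of argmin_j (rlt_j + rut_j + rot_j)."""
    waits = [a["rlt"] + a["rut"] + a["rot"] for a in agvs]
    return waits.index(min(waits))

# Example: AGV 1 has the smallest remaining workload and is dispatched.
print(miwt_dispatch([{"rlt": 30, "rut": 10, "rot": 5},
                     {"rlt": 0, "rut": 12, "rot": 0},
                     {"rlt": 50, "rut": 0, "rot": 20}]))
```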
We investigate nine typical scenarios with different job arrival rates and varying numbers of AGVs [8], represented as (q, λ). Each scenario includes ten instances. This study evaluates the performance of the proposed ID3QN algorithm and compares it to other state-of-the-art methods on the two optimization objectives, tardiness and energy consumption. The results are impressive. Table 6 shows that the proposed ID3QN algorithm achieves shorter tardiness than the other methods. It is a remarkable achievement, considering that the improvement rates for the proposed algorithm are 76.51%, 83.83%, 76.71%, and 27.70%, respectively. Moreover, the proposed method obtained excellent stability in all scenarios, as indicated by the standard deviation and the box plot in Fig. 9. It indicates that the ID3QN algorithm efficiently satisfies dynamic, personalized demands.

Energy consumption is an essential objective to reduce production costs in the realm of production optimization. Because the loaded traveling energy consumption of each method is similar, we only compare the unloaded traveling energy consumption. Upon analyzing the data in Table 7, the four DRL-based methods obtain lower energy consumption in all scenarios. The proposed ID3QN method, in particular, delivers the best performance. The improvement rates are 17.96%, 10.35%, 10.55%, and 0.05%. These numbers confirm that the DRL-based methods are significantly better at reducing energy consumption than the other methods. Additionally, Fig. 10 presents the box plot of energy consumption, further reinforcing the data shown in Table 7. The D3QN-based methods have proven to be highly efficient in reducing energy consumption, which is crucial in optimizing production costs.

Overall, the study provides valuable insights into the development of efficient algorithms for managing AGVs in complex and dynamic environments. The findings of this study can help organizations improve their AGV systems and enhance their productivity and efficiency.


Table 6
The tardiness with different methods.
Scenario MIWT DQN DDQN D3QN ID3QN

(q, λ) Average Std Average Std Average Std Average Std Average Std
(3, 80) 5770.26 888.08 102.76 55.00 76.82 78.06 46.71 83.19 14.88 6.13
(3, 100) 1151.66 404.83 40.32 7.84 26.25 4.60 8.17 5.34 6.75 0.83
(3, 120) 172.21 97.51 24.27 5.94 17.12 4.38 4.37 2.22 4.10 0.72
(4, 80) 471.76 366.37 61.63 24.84 54.19 20.31 10.56 4.22 9.87 1.93
(4, 100) 46.96 19.60 37.10 11.96 22.35 3.17 7.12 3.68 5.81 1.40
(4, 120) 15.70 7.08 20.22 2.78 13.92 1.91 3.31 1.04 3.03 0.48
(5, 80) 45.39 39.25 36.41 7.78 47.51 49.17 9.90 3.75 6.67 0.87
(5, 100) 7.14 2.24 22.63 5.63 13.49 2.46 7.01 1.09 4.56 0.84
(5, 120) 2.50 0.90 19.48 8.50 8.94 1.63 5.51 1.03 2.39 0.18

Fig. 9. The tardiness in all scenarios.

Table 7
The energy consumption with different methods.
Scenario MIWT DQN DDQN D3QN ID3QN

(q, λ) Average Std Average Std Average Std Average Std Average Std
(3, 80) 594.11 14.19 524.35 12.26 522.65 13.47 484.96 9.84 477.63 3.24
(3, 100) 599.49 11.14 529.59 16.40 526.40 14.25 491.26 10.10 486.82 13.78
(3, 120) 596.29 8.19 533.74 8.11 528.15 11.44 492.48 14.71 492.19 11.74
(4, 80) 592.97 13.43 544.27 17.81 547.88 15.69 480.71 10.05 494.67 5.04
(4, 100) 598.83 10.86 548.07 13.79 551.84 12.65 491.59 13.06 487.76 12.67
(4, 120) 595.99 8.79 545.93 13.49 556.77 11.58 490.90 11.47 495.00 9.91
(5, 80) 592.74 12.74 560.33 16.57 554.14 14.21 485.62 9.34 486.56 12.68
(5, 100) 598.87 11.25 562.27 12.97 569.07 11.91 493.58 12.15 487.37 12.27
(5, 120) 596.07 8.34 560.91 5.87 563.65 7.75 492.76 13.76 493.52 11.26

Fig. 10. The energy consumption in all scenarios.

6. Discussion

To delve into the performance and adaptability of the DT-enhanced IPLS, we analyze AGV utilization and tardiness. The methods are trained by integrating CFR and without considering CFR in the DT environment. As depicted in Fig. 11, the AGV utilization becomes stable as the system runs, and the average utilization is similar. It suggests that the DRL-based agents possess excellent adaptability to tackle complex environments.

Fig. 11. The AGV utilization in DT-enhanced IPLS.

Moreover, Fig. 12 illustrates the trend of tardiness. The DT-enhanced agent integrated with CFR yields lower and more stable tardiness compared to the one without considering CFR. It implies that the DT-enhanced agent that integrates dispatching and routing has better adaptability than the one without, thanks to the precise representation of AGV movement control in the digital space. These results indicate that the high-fidelity DT environment can enhance the accuracy of optimization solutions, which offers a new perspective to explore DT-enhanced integration optimization methods to deal with dynamic changes in complex manufacturing systems.

Fig. 12. The real tardiness in DT-enhanced IPLS.

Besides, by training the agents in the DT environment, the poor solutions that DRL-based agents generate during training processes can be prevented from impacting the real-world IPLS, ensuring the system's safety.

7. Conclusion

This study proposes a digital twin-enhanced deep reinforcement learning-based optimization method to address automated guided vehicle dispatching and battery replacement management with conflict-free routing problems, shortening tardiness and efficiently reducing energy consumption in dynamic environments. It derives from two innovative advantages. First, a well-developed digital twin model embeds intelligent agents at the routing level, which provides an accurate and intelligent training environment. It helps to represent the complex and dynamic constraints of AGV movement in intelligent production logistics systems. By integrating dispatching and routing, this method improves the adaptability of dispatching solutions in industrial applications. Secondly, this study develops an improved dueling double deep Q network algorithm-based dispatching and battery replacement management agent, simultaneously satisfying the critical requirements of performance and adaptability. Overall, this method fills the gaps between theoretical research and industrial applications and is a promising solution for addressing the challenges of AGV dispatching and routing problems.

From a managerial perspective, the proposed approach significantly enhances the practicality and dependability of theoretical solutions in real-world settings by leveraging the high-fidelity simulation environment in the digital twin. Moreover, the proposed method employs deep reinforcement learning-based strategies that deliver efficient optimization performance by integrating several sub-problems in the digital twin at both horizontal and vertical levels, offering more rational resource allocation settings for manufacturing systems. It helps to increase equipment utilization rates, leading to a reduction in production costs.

Based on the significant results found in this study, we plan to expand the proposed method to tackle multi-resource production scheduling problems in the future. Our future research considers fault diagnosis for critical equipment in flexible job-shop scheduling problems based on the DT-enhanced DRL-based optimization method. It will help ensure the synchronization of production, logistics, and maintenance activities to enhance the operation sustainability of manufacturing systems.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We want to thank the support of the National Key R&D Program of China (Project No. 2021YFB1715700) and the National Natural Science Foundation of China (Project No. 52175451).


References

[1] Zhong RY, Xu X, Klotz E, Newman ST. Intelligent manufacturing in the context of industry 4.0: a review. Engineering 2017;3:616–30. https://doi.org/10.1016/J.ENG.2017.05.015.
[2] Sisinni E, Saifullah A, Han S, Jennehag U, Gidlund M. Industrial internet of things: challenges, opportunities, and directions. IEEE Trans Ind Inform 2018;14:4724–34. https://doi.org/10.1109/TII.2018.2852491.
[3] Oztemel E, Gursev S. Literature review of industry 4.0 and related technologies. J Intell Manuf 2020;31:127–82. https://doi.org/10.1007/s10845-018-1433-8.
[4] Hofmann E, Rüsch M. Industry 4.0 and the current status as well as future prospects on logistics. Comput Ind 2017;89:23–34. https://doi.org/10.1016/j.compind.2017.04.002.
[5] Fragapane G, de Koster R, Sgarbossa F, Strandhagen JO. Planning and control of autonomous mobile robots for intralogistics: literature review and research agenda. Eur J Oper Res 2021;294:405–26. https://doi.org/10.1016/j.ejor.2021.01.019.
[6] Li ZK, Sang HY, Li JQ, Han YY, Gao KZ, Zheng ZX, et al. Invasive weed optimization for multi-AGVs dispatching problem in a matrix manufacturing workshop. Swarm Evol Comput 2023;77:101227. https://doi.org/10.1016/j.swevo.2023.101227.
[7] Nazari M, Oroojlooy A, Takáč M, Snyder LV. Reinforcement learning for solving the vehicle routing problem. NIPS 2018:9861–71.
[8] Zhang L, Yang C, Yan Y, Hu Y. Distributed real-time scheduling in cloud manufacturing by deep reinforcement learning. IEEE Trans Ind Inform 2022;3203. https://doi.org/10.1109/TII.2022.3178410.
[9] Usuga Cadavid JP, Lamouri S, Grabot B, Pellerin R, Fortin A. Machine learning applied in production planning and control: a state-of-the-art in the era of industry 4.0. J Intell Manuf 2020;31:1531–58. https://doi.org/10.1007/s10845-019-01531-7.
[10] Guo D, Zhong RY, Rong Y, Huang GGQ. Synchronization of shop-floor logistics and manufacturing under IIoT and digital twin-enabled graduation intelligent manufacturing system. IEEE Trans Cyber 2021:1–12. https://doi.org/10.1109/TCYB.2021.3108546.
[11] Yang C, Peng T, Lan S, Shen W, Wang L. Towards IoT-enabled dynamic service optimal selection in multiple manufacturing clouds. J Manuf Syst 2020;56:213–26. https://doi.org/10.1016/j.jmsy.2020.06.004.
[12] Tao F, Zhang H, Liu A, Nee AYC. Digital twin in industry: state-of-the-art. IEEE Trans Ind Inform 2019;15:2405–15. https://doi.org/10.1109/TII.2018.2873186.
[13] Schluse M, Priggemeyer M, Atorf L, Rossmann J. Experimentable digital twins—streamlining simulation-based systems engineering for industry 4.0. IEEE Trans Ind Inform 2018;14:1722–31. https://doi.org/10.1109/TII.2018.2804917.
[14] Wang Z, Liu C, Gombolay M. Heterogeneous graph attention networks for scalable multi-robot scheduling with temporospatial constraints. Auton Robots 2022;46:249–68. https://doi.org/10.1007/s10514-021-09997-2.
[15] Egbelu PJ, Tanchoco JMA. Characterization of automatic guided vehicle dispatching rules. Int J Prod Res 1984;22:359–74. https://doi.org/10.1080/00207548408942459.
[16] Nishi T, Hiranaka Y, Grossmann IE. A bilevel decomposition algorithm for simultaneous production scheduling and conflict-free routing for automated guided vehicles. Comput Oper Res 2011;38:876–88. https://doi.org/10.1016/j.cor.2010.08.012.
[17] Demesure G, Defoort M, Bekrar A, Trentesaux D, Djemai M. Decentralized motion planning and scheduling of AGVs in an FMS. IEEE Trans Ind Inform 2018;14:1744–52. https://doi.org/10.1109/TII.2017.2749520.
[18] Zhang L, Yan Y, Hu Y, Ren W. A dynamic scheduling method for self-organized AGVs in production logistics systems. Procedia CIRP 2021;104:381–6. https://doi.org/10.1016/j.procir.2021.11.064.
[19] Li Z, Barenji AV, Jiang J, Zhong RY, Xu G. A mechanism for scheduling multi robot intelligent warehouse system face with dynamic demand. J Intell Manuf 2020;31:469–80. https://doi.org/10.1007/s10845-018-1459-y.
[20] Nishi T, Tanaka Y. Petri net decomposition approach for dispatching and conflict-free routing of bidirectional automated guided vehicle systems. IEEE Trans Syst Man Cyber A Syst Humans 2012;42:1230–43. https://doi.org/10.1109/TSMCA.2012.2183353.
[21] Shiue YR, Lee KC, Su CT. Real-time scheduling for a smart factory using a reinforcement learning approach. Comput Ind Eng 2018;125:604–14. https://doi.org/10.1016/j.cie.2018.03.039.
[22] Hu H, Jia X, He Q, Fu S, Liu K. Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in industry 4.0. Comput Ind Eng 2020;149:106749. https://doi.org/10.1016/j.cie.2020.106749.
[23] Tang H, Wang A, Xue F, Yang J, Cao Y. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation. IEEE Access 2021;9:42568–82. https://doi.org/10.1109/ACCESS.2021.3062457.
[24] Grieves M. Digital twin: manufacturing excellence through virtual factory replication. 2014.
[25] Tao F, Qi Q, Wang L, Nee AYC. Digital twins and cyber–physical systems toward smart manufacturing and industry 4.0: correlation and comparison. Engineering 2019;5:653–61. https://doi.org/10.1016/j.eng.2019.01.014.
[26] Fang Y, Peng C, Lou P, Zhou Z, Hu J, Yan J. Digital-twin-based job shop scheduling toward smart manufacturing. IEEE Trans Ind Inform 2019;15:6425–35. https://doi.org/10.1109/TII.2019.2938572.
[27] Negri E, Pandhare V, Cattaneo L, Singh J, Macchi M, Lee J. Field-synchronized digital twin framework for production scheduling with uncertainty. J Intell Manuf 2021;32:1207–28. https://doi.org/10.1007/s10845-020-01685-9.


[28] Zhang L, Yan Y, Hu Y, Ren W. Reinforcement learning and digital twin-based real-time scheduling method in intelligent manufacturing systems. IFAC-Pap 2022;55:359–64. https://doi.org/10.1016/j.ifacol.2022.09.413.
[29] Kuhnle A, Kaiser JP, Theiß F, Stricker N, Lanza G. Designing an adaptive production control system using reinforcement learning. J Intell Manuf 2021;32:855–76. https://doi.org/10.1007/s10845-020-01612-y.
[30] Müller-Zhang Z, Antonino PO, Kuhn T. Dynamic process planning using digital twins and reinforcement learning. IEEE Int Conf Emerg Technol Fact Autom (ETFA) 2020:1757–64. https://doi.org/10.1109/ETFA46521.2020.9211946.
[31] Xia K, Sacco C, Kirkpatrick M, Saidy C, Nguyen L, Kircaliali A, et al. A digital twin to train deep reinforcement learning agent for smart manufacturing plants: environment, interfaces and intelligence. J Manuf Syst 2021;58:210–30. https://doi.org/10.1016/j.jmsy.2020.06.012.
[32] Park KT, Jeon SW, Noh SD. Digital twin application with horizontal coordination for reinforcement-learning-based production control in a re-entrant job shop. Int J Prod Res 2021:1–17. https://doi.org/10.1080/00207543.2021.1884309.
[33] Yan Q, Wang H, Wu F. Digital twin-enabled dynamic scheduling with preventive maintenance using a double-layer Q-learning algorithm. Comput Oper Res 2022;144:105823. https://doi.org/10.1016/j.cor.2022.105823.
[34] Lee D, Lee SH, Masoud N, Krishnan MS, Li VC. Digital twin-driven deep reinforcement learning for adaptive task allocation in robotic construction. Adv Eng Inform 2022;53:101710. https://doi.org/10.1016/j.aei.2022.101710.
[35] Shen G, Lei L, Li Z, Cai S, Zhang L, Cao P, et al. Deep reinforcement learning for flocking motion of multi-UAV systems: learn from a digital twin. IEEE Internet Things J 2022;9:11141–53. https://doi.org/10.1109/JIOT.2021.3127873.
[36] Cai Z, Hu Y, Wen J, Zhang L. Collision avoidance for AGV based on deep reinforcement learning in complex dynamic environment. Comput Integr Manuf Syst 2023;29:236–45. https://doi.org/10.13196/j.cims.2023.01.020.
[37] Wang Z, Schaul T, Hessel M, Van Hasselt H, Lanctot M, De Freitas N. Dueling network architectures for deep reinforcement learning. 33rd Int Conf Mach Learn (ICML) 2016;48:1995–2003.
[38] Luo B, Yang Y, Liu D. Adaptive Q-learning for data-based optimal output regulation with experience replay. IEEE Trans Cyber 2018;48:3337–48. https://doi.org/10.1109/TCYB.2018.2821369.
[39] Tang H, Houthooft R, Foote D, Stooke A, Chen X, Duan Y, et al. #Exploration: a study of count-based exploration for deep reinforcement learning. Adv Neural Inf Process Syst 2017:2754–63.
[40] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature 2015;518:529–33. https://doi.org/10.1038/nature14236.
[41] Zhang L, Yan Y, Hu Y. Deep reinforcement learning for dynamic scheduling of energy-efficient automated guided vehicles. J Intell Manuf 2023. https://doi.org/10.1007/s10845-023-02208-y.

