Automated Guided Vehicle Dispatching and Routing Integration Via Digital Twin With Deep Reinforcement Learning
Technical paper
A R T I C L E  I N F O

Keywords: Dispatching; Routing; Digital twin; Reinforcement learning; Automated guided vehicle

A B S T R A C T

The manufacturing industry has witnessed a significant shift towards high flexibility and adaptability, driven by personalized demands. However, automated guided vehicle (AGV) dispatching optimization is still challenging when considering AGV routing with the spatial-temporal and kinematics constraints in intelligent production logistics systems, limiting the evolving industry applications. Against this backdrop, this paper presents a digital twin (DT)-enhanced deep reinforcement learning-based optimization framework to integrate AGV dispatching and routing at both horizontal and vertical levels. First, the proposed framework leverages a digital twin model of the shop floor to provide a simulation environment that closely mimics the actual manufacturing process, enabling the AGV dispatching agent to be trained in a realistic setting, thus reducing the risk of finding unrealistic solutions under specific shop-floor settings and preventing time-consuming trial-and-error processes. Then, AGV dispatching with routing is modeled as a Markov Decision Process to optimize tardiness and energy consumption. An improved dueling double deep Q network algorithm with count-based exploration is developed to learn a better dispatching policy by interacting with the high-fidelity DT model, which integrates a static path planning agent using A* and a dynamic collision avoidance agent using a deep deterministic policy gradient to prevent congestion and deadlock. Experimental results show that our method outperforms four state-of-the-art methods with shorter tardiness, lower energy consumption, and better stability. The proposed method provides significant potential to utilize the digital twin and reinforcement learning in the decision-making and optimization of manufacturing processes.
1. Introduction

The current market is experiencing a significant rise in personalized products, which has posed considerable challenges in terms of quality, cost, and delivery for manufacturing enterprises. In response, manufacturing practitioners are exploring flexible and intelligent strategies and devices to address these challenges [1–3]. One such strategy is the application of automated guided vehicles (AGVs) that can transport materials in intelligent production logistics systems (IPLS) [4]. This approach has been widely utilized to reduce logistics costs and improve production flexibility. Under this condition, researchers have extensively studied AGV dispatching and routing problems to improve delivery performance in the past decade [5]. However, there remain two critical limitations between theoretical research and industrial application. One limitation is that researchers have typically simplified the kinematic constraints of AGVs in the dispatching and routing model, assuming, for example, that the velocity of AGVs is fixed and constant [6]. Such models yield dispatching solutions that usually fail to satisfy the spatial-temporal constraints and kinematics of AGVs, resulting in gaps between theoretical dispatching and practical application. As a result, local congestion or deadlocks are likely to occur in IPLS.

On the other hand, although traditional methods, such as exact algorithms, heuristic rules, and meta-heuristic algorithms, can derive optimal or feasible solutions in certain specific environments, it is challenging for them to maintain a balance between performance, responsiveness, and adaptability. In such a scenario, researchers have explored reinforcement learning (RL) and deep reinforcement learning (DRL) to solve vehicle routing and service scheduling [7–9], as these methods exhibit excellent generalization ability in complex environments. However, DRL algorithms belong to self-supervised trial-and-error learning methods, which demand a data-rich and safe simulation environment to support the training of DRL-based agents. Therefore, constructing a
high-fidelity simulation environment is crucial for AGV dispatching and routing when using DRL-based methods.

In recent years, the domains of the Internet of Things, digital twin (DT), cyber-physical systems, and machine learning have emerged as potential game-changers for the manufacturing industry. These technologies offer new perspectives to overcome critical limitations and foster the integration of manufacturing and logistics decisions in Industry 4.0 [10,11]. The concept of DT, in particular, has revolutionized the organization and management of production processes by effectively synchronizing the physical and virtual spaces, thereby enabling production simulation, optimization, and control [12,13]. Researchers have begun to leverage this technology along with DRL to reduce training costs and improve adaptability [14]. DT provides a new perspective for realizing data-driven decision-making and optimization in dynamic environments, strengthening manufacturing systems' capacities for self-learning, self-optimization, and self-adaptability.

The challenges and opportunities posed by AGV dispatching with routing in current studies have motivated us to propose an adaptive DT-enhanced DRL-based optimization method. Our main contributions can be summarized as follows:

- A DT-enhanced DRL-based optimization framework is proposed to involve high-fidelity simulation into data-driven algorithms for AGV dispatching while integrating intelligent agents for conflict-free routing in the DT model.
- The Markov Decision Process (MDP) of AGV dispatching is formulated under the DT environment while considering the spatial-temporal and kinematics constraints of AGVs to improve the adaptability of dispatching solutions.
- An improved dueling double deep Q network (ID3QN) with count-based exploration is developed to maximize the long-term reward of the MDP, where a composite reward function is designed to fit tardiness and energy consumption simultaneously.

The remainder of this paper is organized as follows. Section 2 provides a summary of related works in the field of manufacturing. Section 3 describes the DT-enhanced DRL-based optimization framework, the multidimensional DT model, and the optimization formulation. Section 4 presents the details of the developed DRL-based agents. Section 5 verifies performance and adaptability. Finally, the last section summarizes the research and discusses future work.

2. Related work

2.1. AGV dispatching with routing

Researchers have extensively explored the AGV dispatching problem in the past few decades. The first AGV dispatching studies date back to the 1980s [15]. Since then, a growing number of researchers have conducted related studies. Several studies have begun examining AGV dispatching problems while considering conflict-free routing (CFR). Nishi et al. [16] developed a two-level mixed-integer formulation for AGV dispatching and routing separately to minimize the total weighted tardiness using Lagrangian relaxation. Demesure et al. [17] proposed a two-step decentralized control method to perform motion planning and dispatching with CFR according to global information and local priority to improve the pertinence and feasibility. However, due to the complexity of interaction, AGV routing constraints are often simplified or ignored when solving AGV dispatching problems to meet practical requirements.

Many algorithms have been proposed to meet specific requirements for AGV dispatching or conflict-free routing problems in various manufacturing scenarios. For instance, Zhang et al. [18] proposed an improved gene expression programming algorithm to dynamically schedule self-organized AGVs without considering routing. Li et al. [19] developed a particle swarm optimization-based mechanism for multi-robot and task allocation problems with dynamic demands in intelligent warehouse systems. Nishi et al. [20] proposed a Petri net decomposition method to solve dynamic AGV dispatching and CFR in bidirectional AGV systems by converting dynamic requests into static optimal firing sequencing problems. Shiue et al. [21] proposed a Q-learning and off-line learning-based method to utilize multi-attribute scheduling rules to respond to changes in the dynamic shop floor, which performs better than traditional heuristic dispatching rules. Hu et al. [22] proposed a deep Q-network-based AGV real-time scheduling method with mixed rules to minimize the makespan and delay ratio. Lastly, Tang et al. [23] proposed a hierarchical soft actor-critic method trained under different sub-goals to solve the order-picking problem, where the top level learned a policy and the bottom level achieved the sub-goals.

Table 1 summarizes the previous studies on AGV dispatching with routing. It is evident that numerous researchers have delved into AGV dispatching and routing problems under simplified constraints, and some have even integrated dispatching and routing at the vertical level. Nevertheless, due to the complexity of integrating dispatching and routing, AGV kinematic constraints are often overlooked, which can lead to theoretical dispatching solutions that are impractical in industrial settings.

2.2. Digital twin and reinforcement learning

Professor Grieves first introduced DT as a crucial technology for Industry 4.0, which involves physical space, virtual space, and the connection between them [24]. With its ability to accurately represent digital space, DT has opened up new possibilities for production simulation and optimization with the integration of multiple sub-systems at the vertical and horizontal levels [25]. Fang et al. [26] proposed a DT-based method for updating parameters to facilitate rescheduling in response to dynamic events. Negri et al. [27] developed a robust scheduling framework for flow shop scheduling problems that uses a DT model with an equipment prognostics and health management module to compute failure probability and optimize scheduling. In addition, Zhang et al. [28] presented a twins-learning framework that combines DT's simulation capacity with RL's learning capacity to improve the performance and adaptability of production logistics systems. By synchronizing physical and virtual spaces, DT can integrate more complex modules to support top-level scheduling optimization, providing a new perspective for decision-making and optimization.

As DT theories and technologies rapidly develop, more researchers are exploring self-learning methods to enhance production performance in complex environments. Kuhnle et al. [29] studied reinforcement learning performance with different MDP designs to achieve a robust design that satisfies the requirements of dynamic and complex production systems with limited knowledge. Building on the learning capacity of reinforcement learning, many researchers are now exploring the integration of RL and DT. Zhang et al. [30] utilized real-time information from DT to derive near-optimal process plans based on reinforcement learning. Xia et al. [31] simulated system behaviors and predicted process faults using DT while incorporating DRL into the industrial control process to expand system-level DT usage. Park et al. [32] coordinated DT and RL to improve production control in a re-entrant job shop; however, the RL policy network was built during the creation procedure without being synchronized with the DT model. Yan et al. [33] developed a double-layer Q-learning algorithm to solve dynamic scheduling with preventive maintenance, and the DT model shortened the gap between prescheduling and actual scheduling. Lee et al. [34] tested the potential of a DT-driven DRL method in robotic construction environments, where the DT model simulated dynamic conditions and interacted with the DRL agent, demonstrating the adaptability of DRL in dynamic environments. Additionally, the fusion of DRL and DT overcame the limitation of the sim-to-real problem, promoting the application of DRL for the flocking motion of multi-AGV systems under unknown and stochastic environments [35].
Table 1
A summary of the previous studies in AGV dispatching with routing.
Reference | Problem | Algorithm | Advantages | Disadvantages | Integrated level | Kinematics

Table 2
A summary of the previous studies in digital twin and reinforcement learning.
Reference | Problem | Algorithm | Advantage | Disadvantage | Integrated level | Kinematics
[26] | Job-shop scheduling | Meta-heuristic | Updating state parameters dynamically | Time-consuming evolution | Horizontal | ×
[27] | Flow-shop scheduling | Meta-heuristic | Diagnosing equipment failure in real time | Time-consuming evolution | Horizontal | ×
[28] | Dispatching and routing | RL | Integrating dispatching and routing in real time | Without considering AGV movement control | Horizontal/vertical | ×
[30] | Process planning | RL | Generating near-optimal solutions in a limited time | Low dimensions of state and action space | Horizontal | ×
[31] | Manufacturing control | DRL | Modeling high-fidelity virtual environments | Single equipment | Horizontal | √
[32] | Re-entrant job shop dispatching | DRL | Integrating DT and RL | Without defining DT autonomous operation | Horizontal | ×
[33] | Flexible job shop scheduling | DRL | Considering preventive maintenance | Without considering the delivery time | Horizontal | ×
[34] | Task allocation | DRL | Adapting robotic construction | Single scenario | Horizontal | √
[35] | Flocking motion | DRL | Updating the trained model continuously | Without modeling high-fidelity DT | Horizontal | √
Table 2 summarizes the previous studies on DT and RL. It shows that DT opens up new ways to explore data-driven optimization methods for integrating equipment scheduling and control at the horizontal level. The rich data environment enhances the training of DRL-based agents, and the virtual environment reduces the risk of poor solutions during training processes. Nonetheless, there is still a lack of integration of multiple sub-problems at the vertical level.

… theoretical solutions and practical applications while ensuring excellent system performance.

3. Methods

3.1. Problem description
3.3.2. Movement control model

The AGV movement control model in (12) considers linear velocity, angular velocity, position, and attitude, as shown in Fig. 2. The velocity tracking error e_tj in (14) is the difference between the desired velocity vc_tj calculated by the DCA agent and the current velocity v_tj of AGV j. A PID controller computes the control output u_tj in (15) from the current error (proportional term), the accumulated error (integral term), and the rate of change of the error (derivative term), and the commanded velocity is updated by (16).

e_tj = vc_tj − v_tj   (14)

u_tj = K_p e_tj + K_i ∫_0^t e_tj dt + K_d de_tj/dt   (15)

v_r(t+1)j = v_tj + u_tj   (16)

Fig. 2. The AGV movement control model.

3.3.3. Behavior model

The behavior model serves as a representation of material handling processes at both the equipment and system levels. At the system level, the behaviors indicate the interactions occurring among various equipment, as illustrated in Fig. 3. To cite an instance, the robot retrieves the part from the AGV and subsequently positions it on the …
E_i = Σ_{j=1}^{q} X_ij ∫_{rast_i}^{rsut_i} ew_tj dt   (19)

where X_ij = 1 if material handling task i is assigned to AGV j and ew_tj denotes the instantaneous energy consumption of AGV j.

3.3.4. Rule model

The rule model imposes restrictions on the operation of equipment and systems based on the spatial-temporal constraints of AGVs, as outlined in Section 3.1. Moreover, the DCA agent must comply with constraints (20)–(21), which dictate that AGVs are required to move in tandem with the global path determined by the SPP agent: a positive dot product between the AGV position vector P_tj (and its displacement ΔP_tj) and the goal direction vector PG_tj ensures that each AGV keeps heading in the same general direction as the planned global path.

P_tj · PG_tj > 0   (20)

ΔP_tj · PG_tj > 0   (21)

3.4. Dispatching optimization formulation

The AGV D&BRM, considering the CFR problem in DT-enhanced IPLS, is formulated as an MDP (S, A, P, γ, R, π). S is the state of the current environment. A represents an integrated action, including the dispatching and battery replacement of AGVs. P is the state transition probability. γ represents the discount factor. R is the reward function. π(s, a) is the probability of taking action a at state s when using policy π.

The paper aims to minimize the average tardiness in (1) and the average energy consumption of material handling tasks in (2). The tardiness of a material handling task is defined in (28), where the real completion time rct_i is generated from the DT model. As a result, the composite reward function is designed in (29), where ect_i and erbc_ij are estimated using the mathematical model in (30)–(32).

TD_i = max(0, rct_i − dt_i)   (28)

r_i^E = 20, if ect_i ≤ dt_i and Y_ij = 0 and erbc_ij ≥ B_c × B_l;
       10, if ect_i ≤ dt_i and Y_ij = 1;
        0, if ect_i > dt_i and Y_ij = 0 and erbc_ij ≥ B_c × B_l;
      −10, if ect_i > dt_i and Y_ij = 1;
      −20, otherwise.   (29)

ect_i = at_i + X_ij (rlt_ij + rut_ij + rot_ij + (d_{pa_j sp_i} + d_{sp_i dp_i}) / V_ml)   (30)

d_kk′ = |x_k − x_k′| + |y_k − y_k′|   (31)

erbc_ij = bst_j − (rlt_ij + rut_ij) × ewa − X_ij E_i   (32)

Here, ect_i is the estimated completion time of task i, at_i its arrival time, sp_i and dp_i its pickup and delivery positions, and d_kk′ in (31) the Manhattan distance between positions k and k′. erbc_ij is the estimated remaining battery capacity of AGV j, where bst_j is its current battery state and ewa the average energy consumption rate, and B_c × B_l is the low-battery threshold defined by the battery capacity B_c and the threshold ratio B_l.

3.4.5. Policy

The action-value function Q_π(s, a) in (33) represents the value of using policy π to take action a at state s. The target is to find an optimal policy π* in (34) for dispatching AGVs and managing battery replacement according to the real-time state in the DT-enhanced IPLS.
Q_π(s, a) = E_π[G_i | S_i = s, A_i = a]   (33)

Q_π*(s, a) = max_π E_π[R | S_i = s, A_i = a] = max_π q_π(s, a)   (34)

where G_i denotes the return, i.e., the expected cumulative reward obtained from state s when action a is taken under policy π.
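To make the reward design concrete, the composite reward in (29) can be read as a simple check on the estimated completion time and the estimated remaining battery capacity. The sketch below uses hypothetical function and argument names; only the thresholds and reward values follow (29), with the default battery capacity (6 Ah) and low-battery ratio (0.05) taken from Table 4.

```python
def composite_reward(est_completion, due_time, battery_swap,
                     est_remaining_battery, battery_capacity=6.0,
                     low_battery_ratio=0.05):
    """Piecewise reward of Eq. (29); names are illustrative placeholders.

    battery_swap: 1 if the action includes a battery replacement (Y_ij),
    est_remaining_battery: erbc_ij estimated by Eq. (32).
    """
    threshold = battery_capacity * low_battery_ratio   # B_c * B_l
    on_time = est_completion <= due_time               # ect_i <= dt_i
    if on_time and battery_swap == 0 and est_remaining_battery >= threshold:
        return 20.0
    if on_time and battery_swap == 1:
        return 10.0
    if not on_time and battery_swap == 0 and est_remaining_battery >= threshold:
        return 0.0
    if not on_time and battery_swap == 1:
        return -10.0
    return -20.0
```

The graded values reward on-time completion with a healthy battery the most, tolerate on-time completion that requires a battery swap, and penalize late completion progressively, which is what lets a single scalar signal trade off tardiness against energy and battery management.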
4. Intelligent agent
… time at_i. It then provides the SPP agent with AGV dispatching and battery replacement decisions. The SPP agent plans a conflict-free path using the A* algorithm, avoiding static obstacles, and sends the path as a global guideline to the DCA agent. The DDPG-based DCA agent collects data by calling the digital twin, decides the AGV velocity, and uses a PID controller to calculate the actual velocity for the AGV controller. The decisions are sent back to the digital twin, which updates the environment's state in real time and executes the decisions. After a task is completed, the digital twin provides feedback to the agent to generate decision-making experiences for learning a better policy.

Fig. 4. The interaction of DRL-based agents for AGV D&BRM with CFR.

4.2. Dynamic collision avoidance agent

The DCA agent employs real-time data from relevant sensors to measure dynamic obstacles during the movement processes, guided by the SPP agent with the A* algorithm. Based on static conflict-free path planning and dynamic collision avoidance, AGVs can avoid all types of obstacles for CFR while preventing the occurrence of congestion or deadlock in complex IPLS. The DDPG-based actor network, as presented in Fig. 5, outputs the AGV linear velocity and angular velocity based on the direction of the global conflict-free path, the velocity, radar data, and the position [36].

Regarding the training of the DCA agent, stochastic obstacles exist in the DT environment in each episode, and four AGVs move in the IPLS as dynamic obstacles to gather historical experiences and learn a better policy for collision avoidance. After training, the DCA agent is deployed to the AGV's controller to support the training of the D&BRM agent. Additionally, the DCA agent can update the DCA policy as the system runs.

… locations, task times (like release, unload, operation, etc.), and the battery status. … experiences [38]. Besides, a count-based exploration discovers more states to strengthen adaptability [39]. The main Q and target Q networks, as depicted in Fig. 6, learn the mapping between the state and the action. Two identical networks are employed, where the activation functions of the hidden layers are ReLU.

Fig. 6. The neural network of the ID3QN-based D&BRM agent.

4.4. Dispatching agent training

In optimizing the cumulative reward, the D&BRM agent interacts with the DT environment to minimize the loss value between the main and target Q networks. Algorithm 1 presents the pseudocode of the training process.

Algorithm 1. The pseudocode of training the D&BRM agent.
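As a rough, hypothetical sketch of the kind of update step described above (a dueling double deep Q network update with a count-based exploration bonus), the following PyTorch fragment can serve as orientation. The network architecture, replay-buffer interface, and state-visit counting are assumptions rather than the authors' implementation; the discount factor γ = 0.9 and bonus coefficient β = 0.01 follow Table 3.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantages A(s, a)

    def forward(self, s):
        h = self.feature(s)
        a = self.advantage(h)
        return self.value(h) + a - a.mean(dim=1, keepdim=True)

def id3qn_update(main_q, target_q, optimizer, batch, counts, beta=0.01, gamma=0.9):
    # batch: tensors sampled from the replay buffer; actions must be int64
    s, a, r, s_next, done = batch
    # Count-based bonus beta / sqrt(N(s)) added to the sampled rewards
    r = r + beta / torch.sqrt(counts.clamp(min=1.0))
    with torch.no_grad():
        # Double DQN target: main network selects the action,
        # target network evaluates it
        next_a = main_q(s_next).argmax(dim=1, keepdim=True)
        target = r + gamma * (1 - done) * target_q(s_next).gather(1, next_a).squeeze(1)
    q_sa = main_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Selecting the greedy action with the main network while evaluating it with the periodically updated target network is what distinguishes the double update from plain DQN and reduces Q-value overestimation; the count-based bonus keeps rarely visited states attractive during training.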
5. Results

In the case of AGV D&BRM with CFR problems, the manufacturing system comprises ten machines, two warehouses, and one battery replacement station, as illustrated in Fig. 7. The numerical experiments conducted in this study are generated based on the method outlined in [41], which assumes that the arrival of dynamic tasks follows a Poisson process with rate λ. Thus, the arrival time of dynamic tasks is defined using an exponential distribution with an interval time I_i, as expressed in (39)–(40). The loading and unloading positions are selected using a uniform distribution, U[1, 12], where each position is unique. The weight of parts is generated randomly using a uniform distribution over {200, 400, 600, 800, 1000}. As such, the due date is calculated using (41), which involves the arrival time, the unloading and loading positions, and the delay factor df.

I_i ~ Exp(λ)   (39)

at_{i+1} = at_i + I_i   (40)

dt_i = at_i + df (d_{sp_i dp_i} / V_ml) + L_t   (41)

The AGVs utilized in this study are developed based on the Robot Operating System (ROS) and are connected through TCP/IP with the DT model. Table 4 provides a detailed description of the parameters of the AGVs used in this study.

Table 4
The parameters of the AGVs.
ws: 1310 kg; Vml: 0.5 m/s; Bc: 6 Ah; Bl: 0.05; Tr: 120 s; Lt: 20 s
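For illustration, the dynamic task stream described above can be generated in a few lines of Python. The grid coordinates and field names below are hypothetical stand-ins (the actual layout is given by Fig. 7), while the sampling steps mirror Eqs. (39)–(41) and the parameter values quoted in this section.

```python
import random

# Hypothetical grid coordinates for the 12 loading/unloading positions;
# the real shop-floor layout is defined by Fig. 7 and is not reproduced here.
POSITIONS = {k: (k % 4, k // 4) for k in range(1, 13)}

def manhattan(p, q):
    # Eq. (31): Manhattan distance between two grid positions
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def generate_tasks(n_tasks, lam, df, v_ml=0.5, load_time=20.0):
    """Sketch of the dynamic task stream implied by Eqs. (39)-(41)."""
    tasks, arrival = [], 0.0
    for i in range(n_tasks):
        arrival += random.expovariate(lam)                   # Eqs. (39)-(40): I_i ~ Exp(lambda)
        start, end = random.sample(range(1, 13), 2)          # unique positions from U[1, 12]
        weight = random.choice([200, 400, 600, 800, 1000])   # part weight
        dist = manhattan(POSITIONS[start], POSITIONS[end])
        due = arrival + df * (dist / v_ml) + load_time       # Eq. (41): due date with delay factor df
        tasks.append({"id": i, "arrival": arrival, "from": start,
                      "to": end, "weight": weight, "due": due})
    return tasks
```

The delay factor df controls how tight the due dates are relative to the minimum travel and loading time, which is what makes tardiness a meaningful objective under different arrival rates.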
5.2. Parameter sensitivity analysis

Table 3
The parameters of the proposed ID3QN.
Name | Description | Value
BS | The number of samples in each batch | 64
TE | The number of training epochs | 5
UF | The updating frequency of the target network | 50
RBS | The maximum records of replay buffer D | 10,000
β | The bonus coefficient | 0.01
γ | The discount factor of future rewards | 0.9
α | The learning rate of the main Q network | 1.0 × 10-4
DS | The dimension of SimHash | 20

Table 5
The reward with different hyper-parameters.
α | γ | Average reward | Standard deviation
1.0 × 10-3 | 0.8 | 11.178 | 0.825
1.0 × 10-3 | 0.9 | 11.584 | 0.808
1.0 × 10-3 | 1.0 | 11.022 | 0.792
1.0 × 10-4 | 0.8 | 11.217 | 0.601
1.0 × 10-4 | 0.9 | 11.776 | 0.573
1.0 × 10-4 | 1.0 | 11.355 | 0.507
1.0 × 10-5 | 0.8 | 11.359 | 0.526
1.0 × 10-5 | 0.9 | 11.588 | 0.574
1.0 × 10-5 | 1.0 | 11.224 | 0.545

Multiple hyper-parameters affect the convergence of DRL algorithms, the most important being the learning rate and the discount factor. Learning rates impact the learning speed and accuracy of neural networks, while discount factors influence the estimation of future rewards. This study investigates the performance of the proposed algorithm under nine hyper-parameter combinations, including three levels of learning rates (1.0 × 10-3, 1.0 × 10-4, 1.0 × 10-5) and three levels of discount factors (0.8, 0.9, 1.0). We generate 500 tasks with λ = 100 while running five AGVs to evaluate performance. The ID3QN-based agent is trained for 100 episodes and becomes stable after around 80 episodes. Table 5 displays the average reward and standard deviation once the algorithm reaches stability. The results indicate that the neural network is well designed, given that the performance remains stable and efficient across the different hyper-parameter combinations. We found that the proposed algorithm obtains the best average reward when the discount factor is 0.9 and the learning rate is 1.0 × 10-4.
We also observed that lower learning rates result in lower standard deviations. However, lower learning rates come at a higher computational cost, which may require more episodes to achieve convergence. As a result, we set the learning rate as 1.0 × 10-4 and the discount factor as 0.9.

In addition, Fig. 8(a), (b), and (c) present the learning curves of rewards, tardiness, and energy consumption, respectively. The results show that the learning process is stable and that the reward function designed in this study fits the optimization objectives well.

5.3. Performance comparison

Four baselines are selected to evaluate the performance.

Minimum waiting time (MIWT) dispatches AGVs as defined in (42). This heuristic rule helps to evaluate the improvement for long-term decision-making problems.

argmin_{j∈G} {rlt_ij + rut_ij + rot_ij}   (42)

Deep Q network (DQN) was developed by [22]. In this paper, we also consider battery management for suitable applications.

Double deep Q network (DDQN) and dueling double deep Q network (D3QN) [8,40] are built the same as in Section 4. They help to evaluate the generalizability of the proposed method.

We investigate nine typical scenarios with different job arrival rates and varying numbers of AGVs [8], represented as (q, λ). Each scenario includes ten instances. This study evaluates the performance of the proposed ID3QN algorithm and compares it to the other state-of-the-art methods on the two optimization objectives, tardiness and energy consumption. Table 6 shows that the proposed ID3QN algorithm achieves shorter tardiness than the other methods; the improvement rates for the proposed algorithm are 76.51%, 83.83%, 76.71%, and 27.70%, respectively. Moreover, the proposed method obtained excellent stability in all scenarios, as indicated by the standard deviations and the box plot in Fig. 9. This indicates that the ID3QN algorithm efficiently satisfies dynamic, personalized demands.

Energy consumption is an essential objective for reducing production costs. Because the loaded-traveling energy consumption is similar for all methods, we only compare the unloaded-traveling energy consumption. As the data in Table 7 show, the four DRL-based methods obtain lower energy consumption in all scenarios, and the proposed ID3QN method delivers the best performance; the improvement rates are 17.96%, 10.35%, 10.55%, and 0.05%, respectively. These numbers confirm that the DRL-based methods are significantly better at reducing energy consumption than the other methods. Additionally, Fig. 10 presents the box plot of energy consumption, further reinforcing the data shown in Table 7. The D3QN-based methods have proven to be highly efficient in reducing energy consumption, which is crucial in optimizing production costs.

Overall, the study provides valuable insights into the development of efficient algorithms for managing AGVs in complex and dynamic environments. The findings of this study can help organizations improve their AGV systems and enhance their productivity and efficiency.
Table 6
The tardiness with different methods.
Scenario MIWT DQN DDQN D3QN ID3QN
(q, λ) Average Std Average Std Average Std Average Std Average Std
(3, 80) 5770.26 888.08 102.76 55.00 76.82 78.06 46.71 83.19 14.88 6.13
(3, 100) 1151.66 404.83 40.32 7.84 26.25 4.60 8.17 5.34 6.75 0.83
(3, 120) 172.21 97.51 24.27 5.94 17.12 4.38 4.37 2.22 4.10 0.72
(4, 80) 471.76 366.37 61.63 24.84 54.19 20.31 10.56 4.22 9.87 1.93
(4, 100) 46.96 19.60 37.10 11.96 22.35 3.17 7.12 3.68 5.81 1.40
(4, 120) 15.70 7.08 20.22 2.78 13.92 1.91 3.31 1.04 3.03 0.48
(5, 80) 45.39 39.25 36.41 7.78 47.51 49.17 9.90 3.75 6.67 0.87
(5, 100) 7.14 2.24 22.63 5.63 13.49 2.46 7.01 1.09 4.56 0.84
(5, 120) 2.50 0.90 19.48 8.50 8.94 1.63 5.51 1.03 2.39 0.18
Table 7
The energy consumption with different methods.
Scenario MIWT DQN DDQN D3QN ID3QN
(q, λ) Average Std Average Std Average Std Average Std Average Std
(3, 80) 594.11 14.19 524.35 12.26 522.65 13.47 484.96 9.84 477.63 3.24
(3, 100) 599.49 11.14 529.59 16.40 526.40 14.25 491.26 10.10 486.82 13.78
(3, 120) 596.29 8.19 533.74 8.11 528.15 11.44 492.48 14.71 492.19 11.74
(4, 80) 592.97 13.43 544.27 17.81 547.88 15.69 480.71 10.05 494.67 5.04
(4, 100) 598.83 10.86 548.07 13.79 551.84 12.65 491.59 13.06 487.76 12.67
(4, 120) 595.99 8.79 545.93 13.49 556.77 11.58 490.90 11.47 495.00 9.91
(5, 80) 592.74 12.74 560.33 16.57 554.14 14.21 485.62 9.34 486.56 12.68
(5, 100) 598.87 11.25 562.27 12.97 569.07 11.91 493.58 12.15 487.37 12.27
(5, 120) 596.07 8.34 560.91 5.87 563.65 7.75 492.76 13.76 493.52 11.26
6. Discussion

Fig. 11. The AGV utilization in DT-enhanced IPLS.

Fig. 12. The real tardiness in DT-enhanced IPLS.

7. Conclusion

… intelligent training environment. It helps to represent the complex and dynamic constraints of AGV movement in intelligent production logistics systems. By integrating dispatching and routing, this method improves the adaptability of dispatching solutions in industrial applications. Secondly, this study develops an improved dueling double deep Q network algorithm-based dispatching and battery replacement management agent, simultaneously satisfying the critical requirements of performance and adaptability. Overall, this method fills the gaps between theoretical research and industrial applications and is a promising solution for addressing the challenges of AGV dispatching and routing problems.

From a managerial perspective, the proposed approach significantly enhances the practicality and dependability of theoretical solutions in real-world settings by leveraging the high-fidelity simulation environment in the digital twin. Moreover, the proposed method employs deep reinforcement learning-based strategies that deliver efficient optimization performance by integrating several sub-problems in the digital twin at both horizontal and vertical levels, offering more rational resource allocation settings for manufacturing systems. This helps to increase equipment utilization rates, leading to a reduction in production costs.

Based on the significant results found in this study, we plan to expand the proposed method to tackle multi-resource production scheduling problems in the future. Our future research will consider fault diagnosis for critical equipment in flexible job-shop scheduling problems based on the DT-enhanced DRL-based optimization method. This will help ensure the synchronization of production, logistics, and maintenance activities to enhance the operational sustainability of manufacturing systems.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We want to thank the support of the National Key R&D Program of China (Project No. 2021YFB1715700) and the National Natural Science Foundation of China (Project No. 52175451).
References
[1] Zhong RY, Xu X, Klotz E, Newman ST. Intelligent manufacturing in the context of industry 4.0: a review. Engineering 2017;3:616–30. https://doi.org/10.1016/J.ENG.2017.05.015.
[2] Sisinni E, Saifullah A, Han S, Jennehag U, Gidlund M. Industrial internet of things: challenges, opportunities, and directions. IEEE Trans Ind Inform 2018;14:4724–34. https://doi.org/10.1109/TII.2018.2852491.
[3] Oztemel E, Gursev S. Literature review of industry 4.0 and related technologies. J Intell Manuf 2020;31:127–82. https://doi.org/10.1007/s10845-018-1433-8.
[4] Hofmann E, Rüsch M. Industry 4.0 and the current status as well as future prospects on logistics. Comput Ind 2017;89:23–34. https://doi.org/10.1016/j.compind.2017.04.002.
[5] Fragapane G, de Koster R, Sgarbossa F, Strandhagen JO. Planning and control of autonomous mobile robots for intralogistics: literature review and research agenda. Eur J Oper Res 2021;294:405–26. https://doi.org/10.1016/j.ejor.2021.01.019.
[6] Li ZK, Sang HY, Li JQ, Han YY, Gao KZ, Zheng ZX, et al. Invasive weed optimization for multi-AGVs dispatching problem in a matrix manufacturing workshop. Swarm Evol Comput 2023;77:101227. https://doi.org/10.1016/j.swevo.2023.101227.
[7] Nazari M, Oroojlooy A, Takáč M, Snyder LV. Reinforcement learning for solving the vehicle routing problem. Adv Neural Inf Process Syst 2018:9861–71.
[8] Zhang L, Yang C, Yan Y, Hu Y. Distributed real-time scheduling in cloud manufacturing by deep reinforcement learning. IEEE Trans Ind Inform 2022. https://doi.org/10.1109/TII.2022.3178410.
[9] Usuga Cadavid JP, Lamouri S, Grabot B, Pellerin R, Fortin A. Machine learning applied in production planning and control: a state-of-the-art in the era of industry 4.0. J Intell Manuf 2020;31:1531–58. https://doi.org/10.1007/s10845-019-01531-7.
[10] Guo D, Zhong RY, Rong Y, Huang GGQ. Synchronization of shop-floor logistics and manufacturing under IIoT and digital twin-enabled graduation intelligent manufacturing system. IEEE Trans Cyber 2021:1–12. https://doi.org/10.1109/TCYB.2021.3108546.
[11] Yang C, Peng T, Lan S, Shen W, Wang L. Towards IoT-enabled dynamic service optimal selection in multiple manufacturing clouds. J Manuf Syst 2020;56:213–26. https://doi.org/10.1016/j.jmsy.2020.06.004.
[12] Tao F, Zhang H, Liu A, Nee AYC. Digital twin in industry: state-of-the-art. IEEE Trans Ind Inform 2019;15:2405–15. https://doi.org/10.1109/TII.2018.2873186.
[13] Schluse M, Priggemeyer M, Atorf L, Rossmann J. Experimentable digital twins-streamlining simulation-based systems engineering for industry 4.0. IEEE Trans Ind Inform 2018;14:1722–31. https://doi.org/10.1109/TII.2018.2804917.
[14] Wang Z, Liu C, Gombolay M. Heterogeneous graph attention networks for scalable multi-robot scheduling with temporospatial constraints. Auton Robots 2022;46:249–68. https://doi.org/10.1007/s10514-021-09997-2.
[15] Egbelu PJ, Tanchoco JMA. Characterization of automatic guided vehicle dispatching rules. Int J Prod Res 1984;22:359–74. https://doi.org/10.1080/00207548408942459.
[16] Nishi T, Hiranaka Y, Grossmann IE. A bilevel decomposition algorithm for simultaneous production scheduling and conflict-free routing for automated guided vehicles. Comput Oper Res 2011;38:876–88. https://doi.org/10.1016/j.cor.2010.08.012.
[17] Demesure G, Defoort M, Bekrar A, Trentesaux D, Djemai M. Decentralized motion planning and scheduling of AGVs in an FMS. IEEE Trans Ind Inform 2018;14:1744–52. https://doi.org/10.1109/TII.2017.2749520.
[18] Zhang L, Yan Y, Hu Y, Ren W. A dynamic scheduling method for self-organized AGVs in production logistics systems. Procedia CIRP 2021;104:381–6. https://doi.org/10.1016/j.procir.2021.11.064.
[19] Li Z, Barenji AV, Jiang J, Zhong RY, Xu G. A mechanism for scheduling multi robot intelligent warehouse system face with dynamic demand. J Intell Manuf 2020;31:469–80. https://doi.org/10.1007/s10845-018-1459-y.
[20] Nishi T, Tanaka Y. Petri net decomposition approach for dispatching and conflict-free routing of bidirectional automated guided vehicle systems. IEEE Trans Syst Man Cybern A Syst Humans 2012;42:1230–43. https://doi.org/10.1109/TSMCA.2012.2183353.
[21] Shiue YR, Lee KC, Su CT. Real-time scheduling for a smart factory using a reinforcement learning approach. Comput Ind Eng 2018;125:604–14. https://doi.org/10.1016/j.cie.2018.03.039.
[22] Hu H, Jia X, He Q, Fu S, Liu K. Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in industry 4.0. Comput Ind Eng 2020;149:106749. https://doi.org/10.1016/j.cie.2020.106749.
[23] Tang H, Wang A, Xue F, Yang J, Cao Y. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation. IEEE Access 2021;9:42568–82. https://doi.org/10.1109/ACCESS.2021.3062457.
[24] Grieves M. Digital twin: manufacturing excellence through virtual factory replication. White paper; 2014.
[25] Tao F, Qi Q, Wang L, Nee AYC. Digital twins and cyber-physical systems toward smart manufacturing and industry 4.0: correlation and comparison. Engineering 2019;5:653–61. https://doi.org/10.1016/j.eng.2019.01.014.
[26] Fang Y, Peng C, Lou P, Zhou Z, Hu J, Yan J. Digital-twin-based job shop scheduling toward smart manufacturing. IEEE Trans Ind Inform 2019;15:6425–35. https://doi.org/10.1109/TII.2019.2938572.
[27] Negri E, Pandhare V, Cattaneo L, Singh J, Macchi M, Lee J. Field-synchronized digital twin framework for production scheduling with uncertainty. J Intell Manuf 2021;32:1207–28. https://doi.org/10.1007/s10845-020-01685-9.
[28] Zhang L, Yan Y, Hu Y, Ren W. Reinforcement learning and digital twin-based real-time scheduling method in intelligent manufacturing systems. IFAC-PapersOnLine 2022;55:359–64. https://doi.org/10.1016/j.ifacol.2022.09.413.
[29] Kuhnle A, Kaiser JP, Theiß F, Stricker N, Lanza G. Designing an adaptive production control system using reinforcement learning. J Intell Manuf 2021;32:855–76. https://doi.org/10.1007/s10845-020-01612-y.
[30] Müller-Zhang Z, Antonino PO, Kuhn T. Dynamic process planning using digital twins and reinforcement learning. In: Proc IEEE Int Conf Emerg Technol Fact Autom (ETFA); 2020. p. 1757–64. https://doi.org/10.1109/ETFA46521.2020.9211946.
[31] Xia K, Sacco C, Kirkpatrick M, Saidy C, Nguyen L, Kircaliali A, et al. A digital twin to train deep reinforcement learning agent for smart manufacturing plants: environment, interfaces and intelligence. J Manuf Syst 2021;58:210–30. https://doi.org/10.1016/j.jmsy.2020.06.012.
[32] Park KT, Jeon SW, Noh SD. Digital twin application with horizontal coordination for reinforcement-learning-based production control in a re-entrant job shop. Int J Prod Res 2021:1–17. https://doi.org/10.1080/00207543.2021.1884309.
[33] Yan Q, Wang H, Wu F. Digital twin-enabled dynamic scheduling with preventive maintenance using a double-layer Q-learning algorithm. Comput Oper Res 2022;144:105823. https://doi.org/10.1016/j.cor.2022.105823.
[34] Lee D, Lee SH, Masoud N, Krishnan MS, Li VC. Digital twin-driven deep reinforcement learning for adaptive task allocation in robotic construction. Adv Eng Inform 2022;53:101710. https://doi.org/10.1016/j.aei.2022.101710.
[35] Shen G, Lei L, Li Z, Cai S, Zhang L, Cao P, et al. Deep reinforcement learning for flocking motion of multi-UAV systems: learn from a digital twin. IEEE Internet Things J 2022;9:11141–53. https://doi.org/10.1109/JIOT.2021.3127873.
[36] Cai Z, Hu Y, Wen J, Zhang L. Collision avoidance for AGV based on deep reinforcement learning in complex dynamic environment. Comput Integr Manuf Syst 2023;29:236–45. https://doi.org/10.13196/j.cims.2023.01.020.
[37] Wang Z, Schaul T, Hessel M, Van Hasselt H, Lanctot M, De Freitas N. Dueling network architectures for deep reinforcement learning. 33rd Int Conf Mach Learn (ICML 2016) 2016;48:1995–2003.
[38] Luo B, Yang Y, Liu D. Adaptive Q-learning for data-based optimal output regulation with experience replay. IEEE Trans Cyber 2018;48:3337–48. https://doi.org/10.1109/TCYB.2018.2821369.
[39] Tang H, Houthooft R, Foote D, Stooke A, Chen X, Duan Y, et al. Exploration: a study of count-based exploration for deep reinforcement learning. Adv Neural Inf Process Syst 2017:2754–63.
[40] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature 2015;518:529–33. https://doi.org/10.1038/nature14236.
[41] Zhang L, Yan Y, Hu Y. Deep reinforcement learning for dynamic scheduling of energy-efficient automated guided vehicles. J Intell Manuf 2023. https://doi.org/10.1007/s10845-023-02208-y.