Keywords: Edge computing; Fog computing; Machine learning; Deep reinforcement learning; Internet of Things

Edge/fog computing, as a distributed computing paradigm, satisfies the low-latency requirements of an ever-increasing number of IoT applications and has become the mainstream computing paradigm behind IoT applications. However, because a large number of IoT applications require execution on the edge/fog resources, the servers may become overloaded, which can disrupt the edge/fog servers and also negatively affect the response time of IoT applications. Moreover, many IoT applications are composed of dependent components, incurring extra constraints on their execution. Besides, edge/fog computing environments and IoT applications are inherently dynamic and stochastic. Thus, efficient and adaptive scheduling of IoT applications in heterogeneous edge/fog computing environments is of paramount importance. However, the limited computational resources on edge/fog servers impose an extra burden for applying optimal but computationally demanding techniques. To overcome these challenges, we propose a Deep Reinforcement Learning-based IoT application Scheduling algorithm, called DRLIS, to adaptively and efficiently optimize the response time of heterogeneous IoT applications and balance the load of the edge/fog servers. We implemented DRLIS as a practical scheduler in the FogBus2 function-as-a-service framework for creating an edge–fog–cloud integrated serverless computing environment. Results obtained from extensive experiments show that DRLIS significantly reduces the execution cost of IoT applications by up to 55%, 37%, and 50% in terms of load balancing, response time, and weighted cost, respectively, compared with metaheuristic algorithms and other reinforcement learning techniques.
1. Introduction

The past few years have witnessed the rapid rise of the Internet of Things (IoT) industry, enabling the connection of people to things and things to things, and facilitating the digitization of the physical world [1]. Meanwhile, with the explosive growth of IoT devices and various applications, the expectation for stability and low latency is higher than ever [2]. As the main enabler of IoT, cloud computing stores and processes the data and information generated by IoT devices. Leveraging powerful computing capabilities and advanced storage technologies, cloud computing ensures the security and reliability of stored information. However, servers in the cloud computing paradigm are usually located at a long physical distance from IoT devices, and the high latency caused by long distances cannot efficiently satisfy real-time IoT applications. Prompted by these issues, edge and fog computing have emerged as popular computing paradigms in the IoT context.

Although some researchers use the terms edge computing and fog computing interchangeably, we clearly define them in this paper. We consider the case that uses only edge resources for real-time IoT applications as edge computing, and the case that uses edge resources and, whenever necessary, also cloud resources (along with edge resources in a seamless manner) as fog computing. Edge computing as a decentralized computing architecture brings processing, storage, and intelligent control to the vicinity of IoT devices [3]. This flexible architecture extends cloud computing services to the edge of the network. In contrast, the fog computing paradigm inherits the advantages of both cloud and edge computing [4]: it not only provides powerful computational capabilities but also reduces the need to transfer data to the cloud for processing, analysis, and storage, thus reducing the inter-network distance. In the real world, edge and fog computing provide strong support for innovation and development in various fields. For example, in the field of smart healthcare, deploying edge computing nodes on wearable and medical devices makes it possible to monitor patients' physiological parameters in real time and transmit the data to the cloud for analysis and diagnosis, realizing telemedicine and personalized medicine [5]; in the
field of autonomous driving, deploying edge computing nodes on self-driving vehicles can perform real-time sensing and decision processing, enabling shorter response times and improving driving safety [6].

However, the massive growth in the number of IoT applications and servers in fog computing environments also creates new challenges. Firstly, the execution time is expected to be minimized [7], which means that the applications should be processed by the best (i.e., the most powerful and physically closest) server. Besides, the load should ideally be balanced and distributed to run on multiple operating units. For example, by distributing requests across multiple servers in a seamless manner (as in serverless computing environments), load balancing can avoid overloading individual servers and ensure that each server handles a moderate load. This improves response times, overall system performance, and throughput, and also helps servers run more consistently. Therefore, improving the load balancing level of servers (i.e., lowering the variance of server resource utilization) while reducing the response time becomes an important but challenging problem for scheduling IoT applications on servers in edge/fog computing environments. Since this is an NP-hard problem, metaheuristic and rule-based solutions can be considered [8,9]. However, these approaches often rely on omniscient knowledge of global information and require the solution proponent to have control over the changes. In the fog computing environment, there is often no regularity in server performance, utilization, and downtime. The number of IoT applications and the corresponding resource requirements are even more nearly random. Besides, in reality, Directed Acyclic Graphs (DAGs) are often used to model IoT applications [10], where nodes represent tasks and edges represent data communication between dependent tasks. The dependency among tasks introduces higher complexity in scheduling applications. Therefore, metaheuristic and rule-based solutions cannot efficiently cope with the IoT application scheduling problem in fog computing environments.

Deep Reinforcement Learning (DRL) is the product of combining deep learning with reinforcement learning, integrating the powerful understanding of deep learning on perceptual problems with the decision-making capabilities of reinforcement learning. In deep reinforcement learning, the agent continuously interacts with the environment, recording a large number of empirical trajectories (i.e., sequences of states, actions, and rewards), which are used in the training phase to learn optimal policies. In contrast to metaheuristic algorithms, agents in deep reinforcement learning are able to autonomously sense and respond to changes in the environment, which allows deep reinforcement learning to solve complex problems in realistic scenarios. However, due to the limited computational resources of devices in fog computing environments [11], the computational requirements of complex Deep Neural Networks (DNNs) are often not supported [12]. Therefore, how to balance implementation simplicity, sample complexity, and solution performance becomes a key research problem in applying deep reinforcement learning to fog computing environments to cope with complex situations.

To address the above challenges, we propose a Deep Reinforcement Learning-based IoT application Scheduling algorithm (DRLIS), which employs the Proximal Policy Optimization (PPO) [13] technique for solving the IoT application scheduling problem in fog computing environments. DRLIS can effectively optimize the load balancing cost of the servers, the response time cost of the IoT applications, and their weighted cost. Besides, by using a clipped surrogate objective to limit the magnitude of policy updates in each iteration and being able to perform multiple iterations of updates on the sampled data, the convergence speed of the algorithm is improved. Moreover, considering the limited computational resources and the optimization objective under study, we design efficient reward functions. The main contributions of this paper are:

• We propose a weighted cost model regarding DAG-based IoT applications' scheduling in fog computing environments to improve the load balancing level of the servers while minimizing the response time of the application. In addition, we adapt this weighted cost model to make it applicable to DRL algorithms.
• We propose a DRL-based algorithm (DRLIS) to solve the defined weighted cost optimization problem in dynamic and stochastic fog computing environments. When the computing environment changes (e.g., requests from different IoT applications, server computing resources, the number of servers), it can adaptively update the scheduling policy with a fast convergence speed.
• Based on DRLIS, we implement a practical scheduler in the FogBus2 function-as-a-service framework¹ [14] for handling scheduling requests of IoT applications in heterogeneous fog and edge computing environments. We also extend the functionality of the FogBus2 framework to make different DRL techniques applicable to it.
• We conduct practical experiments and use real IoT applications with heterogeneous tasks and resource demands to evaluate the performance of DRLIS in a real system setup. By comparing with common metaheuristics (Non-dominated Sorting Genetic Algorithm 2 (NSGA2) [16], Non-dominated Sorting Genetic Algorithm 3 (NSGA3) [17]) and other reinforcement learning algorithms (Q-Learning [18]), we demonstrate the superiority of DRLIS in terms of convergence speed, optimization cost, and scheduling time.

The rest of the paper is organized as follows. Section 2 discusses related work and Section 3 presents the system model and problem formulation. The Deep Reinforcement Learning model for IoT applications in edge and fog computing environments is presented in Section 4. DRLIS is discussed in Section 5. Section 6 evaluates the performance of DRLIS and compares it with other counterparts. Finally, Section 7 concludes the paper and states future work.

2. Related work

In this section, we review the literature on scheduling IoT applications in edge and fog computing environments. The related works are divided into metaheuristic and reinforcement learning categories.

2.1. Metaheuristic

In the independent category, Liu et al. [19] adopted a Markov Decision Process (MDP) approach to achieve shorter average task execution latency in edge computing environments. They proposed an efficient one-dimensional search algorithm to find the optimal task scheduling policy. However, this work cannot adapt to changes in the computing environment and is difficult to extend to solve complex weighted cost optimization problems in heterogeneous fog computing environments. Wu et al. [20] modeled the task scheduling problem in edge and fog computing environments as a DAG and used an estimation of distribution algorithm (EDA) and a partitioning operator to partition the graph in order to queue tasks and assign appropriate servers. However, they did not practically implement and test their work. Sun et al. [21] improved the NSGA2 algorithm and designed a resource scheduling scheme among fog nodes in the same fog cluster, taking into account the diversity of different devices. This work aims to reduce the service latency and improve the stability of task execution. Although capable of handling weighted cost optimization problems, this work only considers scheduling problems in the same computing environment. Hoseiny et al. [22] proposed a Genetic Algorithm (GA)-based technique for minimizing the total computation time and energy consumption of task scheduling in a heterogeneous fog-cloud computing environment. By introducing features for tasks, the technique can find a more suitable computing environment for each task. However, it does not consider the dependencies of different tasks in the application, and due to the

¹ Please refer to [14,15] for a detailed description of the FogBus2 framework.
Table 1
A qualitative comparison of related works with ours.

Works | Task number | Dependency | Request type | Computing environments | Heterogeneity | Multi-Cloud layer | Main technique | Time | Load balancing | Weighted cost | Evaluation

Metaheuristic algorithms:
[19] | Single | Independent | Homogeneous | Edge | Homogeneous | × | MDP | ✓ | × | × | Simulation
[21] | Multiple | Independent | Homogeneous | Edge and Fog | Heterogeneous | × | NSGA2 | ✓ | × | ✓ | Simulation
[22] | Single | Independent | Homogeneous | Edge and Fog | Heterogeneous | × | GA | ✓ | × | × | Simulation
[23] | Single | Independent | Homogeneous | Edge and Fog | Heterogeneous | × | NSGA2 | ✓ | × | ✓ | Simulation
[20] | Multiple | Dependent | Homogeneous | Edge and Fog | Heterogeneous | × | EDA | ✓ | × | ✓ | Simulation

Reinforcement learning techniques:
[25] | Single | Independent | Homogeneous | Edge and Fog | Heterogeneous | × | Q-Learning | × | ✓ | × | Simulation
[24] | Single | Independent | Homogeneous | Edge and Fog | Homogeneous | × | Q-Learning | ✓ | ✓ | ✓ | Simulation
[26] | Single | Independent | Homogeneous | Edge | Homogeneous | × | DQN | ✓ | × | × | Simulation
[27] | Multiple | Independent | Homogeneous | Edge | Homogeneous | × | DQN | ✓ | × | × | Simulation
[29] | Multiple | Independent | Heterogeneous | Edge | Homogeneous | × | DQN | ✓ | × | ✓ | Simulation
[28] | Single | Independent | Homogeneous | Edge | Homogeneous | × | DQN | ✓ | ✓ | ✓ | Simulation
[30] | Single | Independent | Heterogeneous | Edge | Homogeneous | × | Double DQN | ✓ | × | ✓ | Simulation
[31] | Single | Independent | Homogeneous | Edge | Homogeneous | × | SAC | ✓ | × | × | Simulation
[32] | Single | Independent | Homogeneous | Edge and Fog | Homogeneous | × | TD3 | ✓ | × | ✓ | Simulation
[35] | Single | Independent | Homogeneous | Edge | Homogeneous | × | DQN | × | ✓ | × | Simulation
[33] | Single | Independent | Homogeneous | Edge | Homogeneous | × | DDPG and DQN | ✓ | × | ✓ | Simulation
[36] | Single | Independent | Homogeneous | Edge | Homogeneous | × | DDPG | ✓ | × | ✓ | Simulation
[34] | Single | Independent | Homogeneous | Edge and Fog | Homogeneous | × | DQN | × | ✓ | ✓ | Simulation
[37] | Multiple | Dependent | Heterogeneous | Edge | Heterogeneous | × | GA and DQN | ✓ | × | × | Simulation

This work:
DRLIS | Multiple | Dependent | Heterogeneous | Edge and Fog | Heterogeneous | ✓ | PPO | ✓ | ✓ | ✓ | Practical
Table 2
List of key notations.

Variable | Description
S | The application set
S_l | One application (one task set)
S_{l_i} | One task
N | The server set
n_k^{cpu_ut} | The CPU utilization (%) of server n_k
n_k^{freq} | The CPU frequency (MHz) of server n_k
n_k^{ram_ut} | The RAM utilization (%) of server n_k
n_k^{ram_size} | The RAM size (GB) of server n_k
N^{cpu_uti} | The CPU utilization (%) of each server in server set N, denoted as a set
N^{ram_uti} | The RAM utilization (%) of each server in server set N, denoted as a set
S_{l_i}^{ram} | The minimum RAM required for executing task S_{l_i}
ψ_{x_{S_{l_i}}} | The load balancing model after the scheduling configuration x_{S_{l_i}}
ψ^{cpu}_{x_{S_{l_i}}} | The variance of CPU utilization of the server set after the scheduling configuration x_{S_{l_i}}
ψ^{ram}_{x_{S_{l_i}}} | The variance of RAM utilization of the server set after the scheduling configuration x_{S_{l_i}}
Ψ(χ_l) | The load balancing model after the scheduling configuration χ_l
Ψ(χ) | The load balancing model after the scheduling configuration χ
ω_{x_{S_{l_i}}} | The total execution time (ms) for task S_{l_i} based on the scheduling configuration x_{S_{l_i}}
ω^{proc}_{x_{S_{l_i}}} | The processing time (ms) for task S_{l_i} based on the scheduling configuration x_{S_{l_i}}
PS(S_{l_i}) | The server set to which the parent (dependency) tasks of task S_{l_i} are assigned
ω^{trans}_{n_j,n_k} | The transmission time (ms) between server n_j and server n_k
ω^{prop}_{n_j,n_k} | The propagation time (ms) between server n_j and server n_k
p_{n_j,n_k} | The packet size (MB) from server n_j to server n_k for task S_{l_i}
b_{n_j,n_k} | The data rate (bit/s) between server n_j and server n_k
CP(S_{l_i}) | Equals 1 if S_{l_i} is on the critical path of application S_l, otherwise 0
Ω(χ_l) | The total execution time (ms) for application S_l based on the scheduling configuration χ_l
Ω(χ) | The total execution time (ms) for the application set S based on the scheduling configuration χ
utilization (%), and n_k^{ram_size} represents its RAM size (GB). Moreover, PS(S_{l_i}) represents the server set to which the parent tasks of task S_{l_i} are assigned, and ω^{trans}_{n_j,n_k}, ω^{prop}_{n_j,n_k}, p_{n_j,n_k}, and b_{n_j,n_k} denote the transmission time (ms), the propagation time (ms), the packet size (MB), and the data rate (bit/s) between server n_j and server n_k, respectively.

3.2. Problem formulation

Since an application contains one/multiple tasks, it may be executed on different servers. With a set of servers N, the scheduling configuration x_{S_{l_i}} of a task S_{l_i} is defined as:

x_{S_{l_i}} = {n_k},   (1)

where k shows the server's index. Accordingly, the scheduling configuration χ_l of an application S_l is equal to the set of the scheduling configurations of the tasks it contains, defined as:

χ_l = {x_{S_{l_i}} | S_{l_i} ∈ S_l, 1 ≤ i ≤ |S_l|}.   (2)
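To make the notation in Eqs. (1) and (2) concrete, the following minimal Python sketch represents a task placement x_{S_{l_i}} as a single server index and an application configuration χ_l as a mapping from task IDs to those indices. The class and helper names are illustrative only and are not part of the FogBus2 framework.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    @dataclass
    class Task:
        task_id: str
        parents: List[str] = field(default_factory=list)  # parent (dependency) tasks

    @dataclass
    class Application:
        app_id: str
        tasks: List[Task]

    # x_{S_{l_i}} = {n_k}: each task is assigned to exactly one server index k.
    TaskPlacement = int

    # chi_l = {x_{S_{l_i}} | S_{l_i} in S_l}: the per-application scheduling configuration.
    AppConfiguration = Dict[str, TaskPlacement]

    def schedule_application(app: Application,
                             choose_server: Callable[[Task], int]) -> AppConfiguration:
        """Builds chi_l by assigning every task of the application to one server."""
        return {task.task_id: choose_server(task) for task in app.tasks}

    if __name__ == "__main__":
        app = Application("S_1", [Task("t0"), Task("t1", parents=["t0"])])
        chi_1 = schedule_application(app, choose_server=lambda task: 0)  # all tasks on n_0
        print(chi_1)  # {'t0': 0, 't1': 0}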
Correspondingly, for application S_l, the load balancing model Ψ(χ_l) is defined as the sum of the load balancing models for each task processed by server set N:

Ψ(χ_l) = Σ_{i=1}^{|S_l|} ψ_{x_{S_{l_i}}}.   (8)

Our main goal is to find the best-possible scheduling configuration for the application set S such that the variance of the overall CPU and RAM utilization of the server set N during the processing of the application set can be minimized.

ω_{x_{S_{l_i}}} = ω^{trt}_{x_{S_{l_i}}} + ω^{proc}_{x_{S_{l_i}}},   (10)

where

x_{S_{l_i}} = {n_k},   (1)

and where CP(S_{l_i}) equals 1 if task S_{l_i} is on the critical path of application S_l, otherwise 0.

The main goal for the response time model Ω(χ) is to find the best-possible scheduling configuration for the application set S such that the total time for the server set N processing them can be minimized. Therefore, for the application set S, the response time model Ω(χ) is defined as:

Ω(χ) = Σ_{l=1}^{|S|} Ω(χ_l) = Σ_{l=1}^{|S|} Σ_{i=1}^{|S_l|} (ω_{x_{S_{l_i}}} × CP(S_{l_i})).   (16)

3.2.3. Weighted cost model

The weighted cost model is defined as the weighted sum of the normalized load balancing and normalized response time models.
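The two cost components can be computed directly from these definitions. The sketch below assumes, following the description in Section 3.2.1, that the load balancing model is the variance of server CPU and RAM utilization, and it evaluates the response time model of Eq. (16) by summing task execution times on the critical path; the exact per-task weighting used in the paper (Eqs. (3)–(7), not fully visible here) may differ.

    from statistics import pvariance
    from typing import Dict, List

    def load_balancing_cost(cpu_util: List[float], ram_util: List[float],
                            w_cpu: float = 0.5, w_ram: float = 0.5) -> float:
        """Assumed psi: weighted variance of CPU and RAM utilization across the server set N."""
        return w_cpu * pvariance(cpu_util) + w_ram * pvariance(ram_util)

    def response_time_cost(exec_time_ms: Dict[str, float],
                           on_critical_path: Dict[str, bool]) -> float:
        """Omega(chi_l): sum of omega_{x_{S_{l_i}}} over tasks with CP(S_{l_i}) = 1, as in Eq. (16)."""
        return sum(t for task, t in exec_time_ms.items() if on_critical_path[task])

    if __name__ == "__main__":
        # Three servers after placing a task; an imbalanced placement raises the variance.
        print(load_balancing_cost([0.2, 0.9, 0.4], [0.3, 0.7, 0.5]))
        # Only tasks on the critical path contribute to the application response time.
        print(response_time_cost({"t0": 120.0, "t1": 80.0, "t2": 60.0},
                                 {"t0": True, "t1": False, "t2": True}))  # 180.0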
C2: 0 ≤ n_k^{ram_ut}, n_k^{cpu_ut} ≤ 1, ∀n_k ∈ N   (22)
C3: n_k^{freq}, n_k^{ram_size} ≥ 0, ∀n_k ∈ N   (23)
C4: S_{l_i}^{ram} < n_k^{ram_size}, ∀S_{l_i} ∈ S_l, ∀n_k ∈ N   (24)
C5: Φ(x_{S_{l_j}}) ≤ Φ(x_{S_{l_j}} + x_{S_{l_i}}), ∀S_{l_j} ∈ P(S_{l_i})   (25)
C6: w_1 + w_2 = 1, 0 ≤ w_1, w_2 ≤ 1   (26)

where C1 states that any task can only be assigned to one server for processing. C2 states that for any server, the CPU utilization and RAM utilization are between 0 and 1. Besides, C3 states that the CPU frequency and the RAM size of any server are larger than 0. Moreover, C4 denotes that any server should have sufficient RAM resources to process any task. Also, C5 denotes that any task can only be processed after its parent tasks have been processed, and thus the cumulative cost is always larger than or equal to that of the parent task. In addition, C6 denotes that the control parameters of the weighted cost model can only take values from 0 to 1, and their sum should be equal to 1.

The formulated problem is a non-convex optimization problem, because there may be an infinite number of local optima in the set of feasible domains, and usually, the complexity of an algorithm that finds the global optimum is exponential (NP-hard) [40]. To cope with such non-convex optimization problems, most works decompose them into several convex sub-problems and then solve these sub-problems iteratively until the algorithm converges [41]. This type of approach reduces the complexity of the original problem at the expense of accuracy [42]. In addition, such approaches are highly dependent on the current environment and cannot be applied in dynamic environments with complex and continuously changeable parameters and computational resources [42]. To deal with this problem, we propose DRLIS to efficiently handle uncertainties in dynamic environments by learning from interaction with the environment.

The state space S, action space A, and reward function R for the MDP are defined as follows:

• State space S: Since the optimization problem is related to tasks and servers, the state of the problem consists of the feature space of the task currently being processed and the state space of the current server set N. Based on the discussion in Section 3, at the time step t, the feature space of the task S_{l_i} includes the task ID, the task's predecessors and successors, the application ID to which the task belongs, the number of tasks in the current application, the estimate of the occupied CPU resources for the execution of the task, the task's RAM requirements, the estimate of the task's response time, etc. Formally, the feature space F for task S_{l_i} at the time step t is defined as follows:

F_t(S_{l_i}) = {f_t^y(S_{l_i}) | S_{l_i} ∈ S_l, 0 ≤ y ≤ |F|},   (27)

where y represents the index of the feature in the task feature space F, and |F| represents the number of features. Moreover, at the time step t, the state space of the current server set N includes the number of servers, each server's CPU utilization, CPU frequency, RAM utilization, and RAM size, and the propagation time and bandwidth between different servers, etc. Formally, the state space G for the server set N at the time step t is defined as:

G_t(N) = {|N|, g_t^z(n_k), h_t^q(n_j, n_k) | n_j, n_k ∈ N, 0 ≤ z ≤ |g|, 0 ≤ q ≤ |h|},   (28)

where g represents the state type that is related to only one server (i.e., CPU utilization), z represents its index, and |g| represents the length of this type of state; besides, h denotes the state type that is related to two servers (i.e., propagation time), and similarly, q represents its index and |h| represents the length of this type of state. Therefore, the state space S is defined as:

S = {S_t = (F_t(S_{l_i}), G_t(N)) | S_{l_i} ∈ S_l, t ∈ T}.   (29)
• Action space A: The goal is to find the best-possible scheduling configuration for the application set S to minimize the objective function in Eq. (20). Therefore, at the time step t, the action can be defined as the assignment of a server to the task S_{l_i}:

A_t = x_{S_{l_i}} = n_k.   (30)

Accordingly, the action space A can be defined as the server set N:

A = N.   (31)

• Reward function R: Since this is a weighted cost optimization problem, we need to define the reward function for each sub-problem. First, as the penalty, a very large negative value is introduced if the task cannot be processed on the assigned server for any reason. Also, for the load balancing problem, based on the discussion in Section 3.2.1, the reward function r_t^{lb} is defined as:

r_t^{lb} = ψ_{x_{S_{l_{i−1}}}} − ψ_{x_{S_{l_i}}} if succeed; penalty if fail,   (32)

where ψ_{x_{S_{l_i}}} is obtained from Eq. (4). The value output by the reward function r_t^{lb} is the difference between the load balancing models of the server set after scheduling the current task and the previous one. If the value of the load balancing model of the server set is reduced after scheduling the current task, the output reward is positive, otherwise it is negative. Besides, for the response time problem, based on the discussion in Section 3.2.2, the reward function r_t^{rt} is defined as:

r_t^{rt} = ω^{mean}_{x_{S_{l_i}}} − ω_{x_{S_{l_i}}} if succeed; penalty if fail,   (33)

where ω_{x_{S_{l_i}}} is obtained from Eq. (10), and ω^{mean}_{x_{S_{l_i}}} represents the average response time for task S_{l_i}. The value output by the reward function r_t^{rt} is the difference between the average response time (the current response time is also considered) and the current response time for task S_{l_i}. If the current response time is lower than the average one, the output reward is positive, otherwise it is negative. The reward function r_t for the weighted cost optimization problem is defined as:

r_t = w_1 × Norm(r_t^{lb}) + w_2 × Norm(r_t^{rt}) if succeed; penalty if fail,   (34)

where w_1 and w_2 are the control parameters, and Norm represents the normalization process.
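The reward design of Eqs. (32)–(34) can be sketched in Python as follows. Since the normalization operator Norm and the penalty magnitude are not fully specified in the pages available here, min–max normalization and a fixed large negative penalty are assumed.

    PENALTY = -1000.0  # assumed "very large negative value" for failed placements

    def normalize(value: float, low: float, high: float) -> float:
        """Assumed min-max normalization into [0, 1]; the paper only names Norm()."""
        return 0.0 if high == low else (value - low) / (high - low)

    def reward_load_balancing(psi_prev: float, psi_curr: float, succeeded: bool) -> float:
        """r_t^{lb} (Eq. (32)): positive if scheduling the current task reduced the load balancing cost."""
        return psi_prev - psi_curr if succeeded else PENALTY

    def reward_response_time(omega_mean: float, omega_curr: float, succeeded: bool) -> float:
        """r_t^{rt} (Eq. (33)): positive if the task finished faster than its average response time."""
        return omega_mean - omega_curr if succeeded else PENALTY

    def reward_weighted(r_lb: float, r_rt: float,
                        lb_bounds: tuple, rt_bounds: tuple,
                        w1: float = 0.5, w2: float = 0.5) -> float:
        """r_t (Eq. (34)): weighted sum of the normalized sub-rewards, or the penalty on failure."""
        if r_lb == PENALTY or r_rt == PENALTY:
            return PENALTY
        return w1 * normalize(r_lb, *lb_bounds) + w2 * normalize(r_rt, *rt_bounds)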
Currently, many advanced deep reinforcement learning algorithms (e.g., PPO, TD3, SAC) have been proposed by different researchers, and they show excellent performance in different fields. PPO improves convergence and sampling efficiency by adopting importance sampling and proportional clipping [13]. TD3 (Twin Delayed DDPG) introduces a dual Q network and a delayed update strategy to effectively solve the overestimation problem in continuous action spaces [43]. SAC (Soft Actor–Critic) combines policy optimization and learning of Q-value functions, providing more robust and exploratory policy learning through maximum entropy theory [44]. These algorithms have achieved remarkable results in different tasks and environments. In our research problem, the agent's action and state space is discrete, which hinders the application of TD3, because it is designed for continuous control [45]. In addition, the original SAC only considers continuous spaces [44]; although some works discuss how to apply SAC to discrete spaces, they usually need to adopt special tricks and extensions, such as soft-max or sample-prune techniques, to accommodate discrete actions [46]. Besides, Wang et al. [47] show that SAC requires more computation time and convergence time than PPO. In contrast, our study focuses on edge and fog computing environments, where handling latency sensitivity and variation are important considerations for choosing the appropriate DRL algorithm. We choose PPO as the basis of DRLIS, because PPO is designed to be more easily adaptable to discrete action spaces [48] and we aim for the algorithm to converge quickly and perform well in diverse environments.

5. DRL-based optimization algorithm

Based on the above-mentioned MDP model, we propose DRLIS to achieve weighted cost optimization of IoT applications in edge and fog computing environments. In this section, we introduce the mathematical principle of the PPO algorithm and discuss the proposed DRLIS.

5.1. Preliminaries

The PPO algorithm belongs to the Policy Gradient (PG) family of algorithms, which consider the impact of actions on rewards and adjust the probability of actions [49]. We use the same notations as in Section 3 to describe the algorithm. We consider that the time horizon T is divided into multiple time steps t, and the agent has a policy π_θ for determining its actions and interactions with the environment. The objective can be expressed as adjusting the parameter θ to maximize the expected cumulative discounted reward E_{π_θ}[Σ_{t∈T} γ_t r_t] [13], expressed by the formula:

J(θ) = E_{π_θ}[Σ_{t∈T} γ_t r_t].   (35)

Since this is a maximization problem, the gradient ascent algorithm can be used to find the maximum value:

θ' = θ + α∇_θ J(θ).   (36)

The key is to obtain the gradient of the reward function J(θ) with respect to θ, which is called the policy gradient. The algorithm for solving reinforcement learning problems by optimizing the policy gradient is called the policy gradient algorithm. The policy gradient can be presented as

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a_t|s_t) A^θ(a_t|s_t)],   (37)

where A^θ(a_t|s_t) is the advantage function at time step t, used to evaluate the action a_t at the state s_t. Here, the policy gradient indicates the expectation of ∇_θ log π_θ(a_t|s_t) A^θ(a_t|s_t), which can be estimated using the empirical average obtained by sampling. However, the PG algorithm is very sensitive to the update step size, and choosing a suitable step size is challenging [50]. Moreover, practice shows that the difference between old and new policies in training is usually large [13].

To address this problem, Trust Region Policy Optimization (TRPO) [51] was proposed. This algorithm introduces importance sampling to evaluate the difference between the old and new policies and restricts the new policy if the importance sampling ratio grows large. Importance sampling refers to replacing the original sampling distribution with a new one to make sampling easier or more efficient. Specifically, TRPO maintains two policies: the first policy π_{θ_old} is used to collect the samples, and the second policy π_θ is the current policy to be refined. The optimization problem is defined as follows:

maximize_θ  E_t[(π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)) A_t]   (38)
subject to  E_t[KL[π_{θ_old}(·|s_t), π_θ(·|s_t)]] ≤ δ,   (39)

where KL represents the Kullback–Leibler divergence, used to quantify the difference between two probability distributions [52], and δ represents the restriction of the update between the old policy π_{θ_old} and the new policy π_θ.
After linear approximation of the objective and quadratic approximation of the constraint, the problem can be efficiently approximated using the conjugate gradient algorithm. However, the computation of the conjugate gradient makes the implementation of TRPO more complex and inflexible in practice [53,54].

To make this algorithm well applicable in practice, the KL-PPO algorithm [13] was proposed. Rather than using the constraint function E_t[KL[π_{θ_old}(·|s_t), π_θ(·|s_t)]] ≤ δ, the KL divergence is added as a penalty in the objective function:

L^{KLPEN}(θ) = E_t[r_t(θ)A_t − βKL[π_{θ_old}(·|s_t), π_θ(·|s_t)]],   (40)

where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the ratio of the new policy and the old policy, obtained in Eq. (38), and the parameter β can be dynamically adjusted during the iterative process according to the KL divergence. If the current KL divergence is larger than the predefined maximum value, this indicates that the penalty is not strong enough and the parameter β needs to be increased. Conversely, if the current KL divergence is smaller than the predefined minimum value, the parameter β needs to be reduced.

Moreover, another idea to restrict the difference between the old policy π_{θ_old} and the new policy π_θ is to use the clipped surrogate function clip. The PPO algorithm using the clip function (CLIP-PPO) removes the KL penalty and the need for adaptive updates to simplify the algorithm. Practice shows that CLIP-PPO usually performs better than KL-PPO [13]. Formally, the objective function of CLIP-PPO is defined as follows:

L^{CLIP}(θ) = E_t[min(r_t(θ)A_t, clip(r_t(θ), 1 − ε, 1 + ε)A_t)].   (41)

And clip(r_t(θ), 1 − ε, 1 + ε) restricts the ratio r_t(θ) into (1 − ε, 1 + ε), defined as:

clip(r_t(θ), 1 − ε, 1 + ε) = 1 − ε if r_t(θ) < 1 − ε; r_t(θ) if 1 − ε ≤ r_t(θ) ≤ 1 + ε; 1 + ε if r_t(θ) > 1 + ε.   (42)

By removing the constraint function discussed in TRPO, both PPO algorithms significantly reduce the computational complexity, while ensuring that the updated policy does not deviate too much from the previous one.

5.2. DRLIS: DRL-based IoT application scheduling

Since CLIP-PPO usually outperforms KL-PPO in practice, we choose it as the basis for the optimization algorithm. DRLIS is based on the actor–critic framework, which is a reinforcement learning method combining Policy Gradient and Temporal Difference (TD) learning. As the name implies, this framework consists of two parts, the actor and the critic, and in implementation, they are usually presented as Deep Neural Networks (DNNs). The actor network is used to learn a policy function π_θ(a|s) to maximize the expected cumulative discounted reward E_π[Σ_{t∈T} γ_t r_t], while the critic network is used to evaluate the current policy and to guide the next stage of the actor's action. In the learning process, at the time step t, the reinforcement learning agent inputs the current state s_t into the actor network, and the actor network outputs the action a_t to be performed by the agent in the MDP. The agent performs the action a_t, receives the reward r_t from the environment, and moves to the next state s_{t+1}. The critic network receives the states s_t and s_{t+1} as input and estimates their value functions V_{π_θ}(s_t) and V_{π_θ}(s_{t+1}). The agent then computes the TD error δ_t for the time step t:

δ_t = r_t + γV_{π_θ}(s_{t+1}) − V_{π_θ}(s_t),   (43)

where γ denotes the discount factor, as discussed in Section 3, and the actor network and critic network update their parameters using the TD error δ_t. DRLIS continues this process over multiple steps to form an estimate Â_t of the advantage function A_t, which can be written as:

Â_t = −V_{π_θ}(s_t) + r_t + γr_{t+1} + ⋯ + γ^{T−t+1}r_{T−1} + γ^{T−t}V_{π_θ}(s_T).   (44)

DRLIS maintains three networks, one critic network and two actor networks (i.e., the old actor and the new actor), representing the old policy function π_{θ_old} and the new policy function π_θ, as discussed in Section 5.1. Algorithm 1 describes DRLIS for the weighted cost optimization problem in edge and fog computing environments.

Algorithm 1: DRLIS for weighted cost optimization
Input: new actor network Π_θ with parameter θ; old actor network Π_{θ_old} with parameter θ_old, where θ_old = θ; critic network V_μ with parameter μ; max time step T; update epoch K; policy objective function coefficient a_c; value function loss coefficient a_v; entropy bonus coefficient a_e; clipping ratio ε
1:  while True do
2:      servers ← GetServers();
3:      task ← GetTask();
4:      if servers ≠ servers_old then
5:          agent ← InitializeAgent(servers);
6:          servers_old ← servers;
7:      end if
8:      s_1 ← GeneralizeState(servers, task);
9:      ℬ ← InitializeBuffer();
10:     for t ← 1 to T do
11:         a_t ← Π_θ(s_t);
12:         Schedule(task, a_t);
13:         r_t ← GetReward();
14:         servers ← GetServers();
15:         if servers ≠ servers_old then
16:             break;
17:         end if
18:         task ← GetTask();
19:         s_{t+1} ← GeneralizeState(servers, task);
20:         u_t ← (s_t, a_t, r_t);
21:         ℬ.Append(u_t);
22:     end for
23:     Â_t ← −V_μ(s_t) + r_t + γr_{t+1} + ⋯ + γ^{T−t+1}r_{T−1} + γ^{T−t}V_μ(s_T);
24:     for k ← 1 to K do
25:         L^{CLIP}(θ) = (1/t) Σ_t min((Π_θ(a_t|s_t)/Π_{θ_old}(a_t|s_t)) Â_t, clip(Π_θ(a_t|s_t)/Π_{θ_old}(a_t|s_t), 1 − ε, 1 + ε) Â_t);
26:         L^{VF}(μ) = (1/t) Σ_t (V_μ(s_t) − Â_t)²;
27:         L^{ET}(θ) = (1/t) Σ_t Entropy(Π_θ(a_t|s_t));
28:         L(θ, μ) = −a_c L^{CLIP}(θ) + a_v L^{VF}(μ) − a_e L^{ET}(θ);
29:         update θ and μ with L(θ, μ) by the Adam optimizer;
30:     end for
31:     θ_old ← θ;
32: end while

We consider a scheduler that is implemented based on DRLIS. When this scheduler receives a scheduling request from an IoT application, it obtains information about the set of servers currently available and initializes a DRL agent based on this information. This agent contains three deep neural networks: a new actor network Π_θ with parameter θ, an old actor network Π_{θ_old} with parameter θ_old, where θ_old = θ, and a critic network V_μ with parameter μ. After that, the scheduler obtains the information about the currently submitted task and generates the current state s_t based on the information regarding the task and servers. Inputting the state s_t to the new actor network Π_θ will output an action a_t, representing the target server to which the current task is to be assigned. The scheduler then assigns the task to the target server and receives the corresponding reward r_t, which is calculated based on Eqs. (32), (33), (34). The reward r_t is essential for indicating the positive or negative impact of the agent's current scheduling policy on the optimization objectives (e.g., IoT application response time and server load balancing level). Also, a tuple u_t with three values (s_t, a_t, r_t) will be stored in buffer ℬ. The scheduler repeats the process T times until sufficient information is collected to update the neural networks. When updating the neural networks, the estimate of the advantage function is first computed based on Eq. (44). Then the neural networks are optimized K times. Both the actor networks and the critic network use the Adam optimizer, and the loss function is computed as:

L(θ, μ) = −a_c L^{CLIP}(θ) + a_v L^{VF}(μ) − a_e L^{ET}(θ).   (45)

Fig. 4. Reinforcement learning scheduling module in FogBus2 framework.
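As a concrete illustration of this update, the following PyTorch sketch computes the clipped surrogate term of Eq. (41), a squared-error value term, and an entropy bonus, and combines them as in line 28 of Algorithm 1. It is only a minimal sketch under stated assumptions: the network architectures, coefficient values, and the use of the discounted return as the value-regression target are choices made here, not the authors' implementation.

    import torch
    import torch.nn as nn

    def ppo_update(new_actor: nn.Module, old_actor: nn.Module, critic: nn.Module,
                   optimizer: torch.optim.Optimizer,
                   states: torch.Tensor,    # [batch, state_dim]
                   actions: torch.Tensor,   # [batch], chosen server indices
                   returns: torch.Tensor,   # [batch], discounted returns (Eq. (44) style)
                   eps: float = 0.2, a_c: float = 1.0,
                   a_v: float = 0.5, a_e: float = 0.01) -> float:
        values = critic(states).squeeze(-1)
        advantages = returns - values.detach()              # estimate of A_hat_t

        new_dist = torch.distributions.Categorical(logits=new_actor(states))
        with torch.no_grad():
            old_dist = torch.distributions.Categorical(logits=old_actor(states))
        ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))

        l_clip = torch.min(ratio * advantages,
                           torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
        l_vf = (values - returns).pow(2).mean()             # value-function loss L^{VF}
        l_et = new_dist.entropy().mean()                    # entropy bonus L^{ET}

        loss = -a_c * l_clip + a_v * l_vf - a_e * l_et      # combined loss, as in line 28
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In this sketch the optimizer is assumed to hold the parameters of both the new actor and the critic; after the K update epochs, the old actor would be synchronized with the new one (line 31 of Algorithm 1), e.g., via old_actor.load_state_dict(new_actor.state_dict()).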
the corresponding models. The Action Selector is responsible for outputting the target server index for the currently scheduled task. The Model Optimizer optimizes the running reinforcement learning scheduling policy based on the reward values returned from the Reward Function Models sub-module. The State Converter is responsible for converting the parameters of the server and IoT application into state vectors that can be recognized by the reinforcement learning scheduling model. The Scheduling Policy Runner is the running program of the reinforcement learning scheduling Agent and is responsible for receiving submitted tasks, saving or loading the trained policies, and requesting and accessing parameters from other FogBus2 components (e.g., FogBus2 Actor, FogBus2 User) for the computation of the reward functions.

• Model Warehouse: This sub-module saves the hyperparameters of the trained scheduling policy to the database and loads the hyperparameters to initialize a well-trained scheduling Agent.

Algorithm 2: Reinforcement learning scheduler in the FogBus2 framework based on the proposed weighted cost optimization algorithm
Input: master component M; registered actor component set A; user component U; tasks to be processed T
1:  Scheduler ← InitializeScheduler(DRLIS);
2:  ℬ_A ← InitializeActorBuffer();
3:  ℬ_U ← InitializeUserBuffer();
4:  while True do
5:      U.SubmitTasks(T);
6:      AvailableActors ← M.CheckResources(T);
7:      if AvailableActors is empty then
8:          M.Message(U, Fail);
9:          break;
10:     end if
11:     foreach t_i ∈ T do
12:         Scheduler.TaskPlacement(t_i, A);
13:         A.Message(M, I_A);
14:         ℬ_A.Append(I_A);
15:         U.Message(M, I_U);
16:         ℬ_U.Append(I_U);
17:         if UpdateScheduler is True then
18:             Rewards ← ComputeRewards(ℬ_A, ℬ_U);
19:             Scheduler.Update();
20:         end if
21:     end foreach
22: end while

Algorithm 2 summarizes the scheduling mechanism based on DRLIS. The framework first initializes a scheduler based on Algorithm 1. In addition, two buffers ℬ_A and ℬ_U for storing information from the Actor component and the User component are also initialized. After the User component submits the IoT application to be processed, the Master component first checks whether the Actor components that have been registered to the framework have the corresponding resources to process the application. If true, the IoT application, which contains one or multiple tasks, will be scheduled; otherwise, the Master component will inform the User component that the current application cannot be processed. For each task of an IoT application, the scheduler will place it on the target Actor component for execution based on Algorithm 1. After that, the Actor component sends the relevant information (i.e., CPU utilization, RAM utilization, etc.) to the Master component, which is stored in the buffer ℬ_A. The User component also sends relevant information (i.e., response time, the result of task execution, etc.) to the Master component, which is stored in the buffer ℬ_U. When the Master collects sufficient information, it updates the scheduler, where the data in ℬ_A and ℬ_U are used to compute the reward for each step, as discussed in Algorithm 1 and Eqs. (32), (33), (34).
The framework first initializes a scheduler, based on Algorithm 1. In are described as follows:
addition, two buffers 𝐴 and 𝑈 for storing information from the 𝐴𝑐𝑡𝑜𝑟
component and the 𝑈 𝑠𝑒𝑟 component are also initialized. After the 𝑈 𝑠𝑒𝑟 • Face Detection [15]: Detects and captures human faces. The human
component submits the IoT application to be processed, the 𝑀𝑎𝑠𝑡𝑒𝑟 faces in the video are marked by squares. This application is
component first checks whether the 𝐴𝑐𝑡𝑜𝑟 components that have been implemented based on OpenCV.3
registered to the framework have the corresponding resources to pro- • Color Tracking [15]: Tracks colors from video. The user can
cess the application. If true, the IoT application which contains one dynamically configure the target colors through the GUI provided
or multiple tasks will be scheduled; otherwise, the 𝑀𝑎𝑠𝑡𝑒𝑟 component by the application. This application is implemented based on
will inform the 𝑈 𝑠𝑒𝑟 component that the current application cannot OpenCV.3
be processed. For each task of an IoT application, the scheduler will • Face And Eye Detection [15]: In addition to detecting and captur-
ing human faces, the application also detects and captures human
place it to the target 𝐴𝑐𝑡𝑜𝑟 component for execution based on Algorithm
eyes. This application is implemented based on OpenCV.3
1. After that, the 𝐴𝑐𝑡𝑜𝑟 component sends the relevant information
• Video OCR [14]: Recognizes and extracts text information from
(i.e., CPU utilization, RAM utilization, etc.) to the 𝑀𝑎𝑠𝑡𝑒𝑟 component,
the video and transmits it back to the user. The application
which is stored in the buffer 𝐴 . The 𝑈 𝑠𝑒𝑟 component also sends
will automatically filter out keyframes. This application is imple-
relevant information (i.e., response time, the result of task execution,
mented based on Google’s Tesseract-OCR Engine.4
etc.) to the 𝑀𝑎𝑠𝑡𝑒𝑟 component, which is stored in the buffer 𝑈 . When
the 𝑀𝑎𝑠𝑡𝑒𝑟 collects sufficient information, it will update the scheduler,
where the data in 𝐴 and 𝐸 are used to compute the reward for each 3
https://github.com/opencv/opencv.
4
step, as discussed in Algorithm 1 and Eqs. (32), (33), (34). https://github.com/tesseract-ocr/tesseract.
References

[1] G.S.S. Chalapathi, V. Chamola, A. Vaish, R. Buyya, Industrial internet of things (iiot) applications of edge and fog computing: A review and future directions, in: Fog/Edge Computing for Security, Privacy, and Applications, Springer, 2021, pp. 293–325.
[2] S. Azizi, M. Shojafar, J. Abawajy, R. Buyya, Deadline-aware and energy-efficient IoT task scheduling in fog computing systems: A semi-greedy approach, J. Netw. Comput. Appl. 201 (2022) 103333.
[3] A.J. Ferrer, J.M. Marquès, J. Jorba, Towards the decentralised cloud: Survey on approaches and challenges for mobile, ad hoc, and edge computing, ACM Comput. Surv. 51 (6) (2019) 1–36.
[4] M. Goudarzi, H. Wu, M. Palaniswami, R. Buyya, An application placement technique for concurrent IoT applications in edge and fog computing environments, IEEE Trans. Mob. Comput. 20 (4) (2020) 1298–1311.
[5] L. Catarinucci, D. de Donno, L. Mainetti, L. Palano, L. Patrono, M.L. Stefanizzi, L. Tarricone, An IoT-aware architecture for smart healthcare systems, IEEE Internet Things J. 2 (6) (2015) 515–526.
[6] L. Liu, S. Lu, R. Zhong, B. Wu, Y. Yao, Q. Zhang, W. Shi, Computing systems for autonomous driving: State of the art and challenges, IEEE Internet Things J. 8 (8) (2021) 6469–6486.
[7] M. Goudarzi, M.S. Palaniswami, R. Buyya, A distributed deep reinforcement learning technique for application placement in edge and fog computing environments, IEEE Trans. Mob. Comput. 22 (5) (2023) 2491–2505.
[8] A. Brogi, S. Forti, QoS-aware deployment of IoT applications through the fog, IEEE Internet Things J. 4 (5) (2017) 1185–1192.
[9] M. Goudarzi, M. Palaniswami, R. Buyya, Scheduling IoT applications in edge and fog computing environments: a taxonomy and future directions, ACM Comput. Surv. 55 (7) (2022) 1–41.
[10] X. Ma, H. Gao, H. Xu, M. Bian, An IoT-based task scheduling optimization scheme considering the deadline and cost-aware scientific workflow for cloud computing, EURASIP J. Wireless Commun. Networking 2019 (1) (2019) 1–19.
[11] Z. Wang, M. Goudarzi, J. Aryal, R. Buyya, Container orchestration in edge and fog computing environments for real-time iot applications, in: Proceedings of the Computational Intelligence and Data Analytics (ICCIDA), Springer, 2022, pp. 1–21.
[12] E. Li, Z. Zhou, X. Chen, Edge intelligence: On-demand deep learning model co-inference with device-edge synergy, in: Proceedings of the 2018 Workshop on Mobile Edge Communications, 2018, pp. 31–36.
[13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017, arXiv preprint arXiv:1707.06347.
[14] Q. Deng, M. Goudarzi, R. Buyya, Fogbus2: a lightweight and distributed container-based framework for integration of iot-enabled systems with edge and cloud computing, in: Proceedings of the International Workshop on Big Data in Emergent Distributed Environments, 2021, pp. 1–8.
[15] M. Goudarzi, Q. Deng, R. Buyya, Resource management in edge and fog computing using FogBus2 framework, 2021, arXiv preprint arXiv:2108.00591.
[16] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, in: Proceedings of the International Conference on Parallel Problem Solving from Nature, Springer, 2000, pp. 849–858.
[17] K. Deb, H. Jain, An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints, IEEE Trans. Evol. Comput. 18 (4) (2013) 577–601.
[18] C.J. Watkins, P. Dayan, Q-learning, Mach. Learn. 8 (1992) 279–292.
[19] J. Liu, Y. Mao, J. Zhang, K.B. Letaief, Delay-optimal computation task scheduling for mobile-edge computing systems, in: Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2016, pp. 1451–1455.
[20] C.-g. Wu, W. Li, L. Wang, A.Y. Zomaya, Hybrid evolutionary scheduling for energy-efficient fog-enhanced internet of things, IEEE Trans. Cloud Comput. 9 (2) (2021) 641–653.
[21] Y. Sun, F. Lin, H. Xu, Multi-objective optimization of resource scheduling in fog computing using an improved NSGA-II, Wirel. Pers. Commun. 102 (2) (2018) 1369–1385.
[22] F. Hoseiny, S. Azizi, M. Shojafar, F. Ahmadiazar, R. Tafazolli, PGA: A priority-aware genetic algorithm for task scheduling in heterogeneous fog-cloud computing, in: Proceedings of the IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2021, pp. 1–6.
[23] I.M. Ali, K.M. Sallam, N. Moustafa, R. Chakraborty, M. Ryan, K.-K.R. Choo, An automated task scheduling model using non-dominated sorting genetic algorithm II for fog-cloud systems, IEEE Trans. Cloud Comput. 10 (4) (2022) 2294–2308.
[24] F. Ramezani Shahidani, A. Ghasemi, A. Toroghi Haghighat, A. Keshavarzi, Task scheduling in edge-fog-cloud architecture: a multi-objective load balancing approach using reinforcement learning algorithm, Computing (2023) 1–23.
[25] J.-y. Baek, G. Kaddoum, S. Garg, K. Kaur, V. Gravel, Managing fog networks using reinforcement learning based load balancing algorithm, in: Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), IEEE, 2019, pp. 1–7.
[26] X. Jie, T. Liu, H. Gao, C. Cao, P. Wang, W. Tong, A DQN-based approach for online service placement in mobile edge computing, in: Proceedings of the 16th EAI International Conference on Collaborative Computing: Networking, Applications and Worksharing, Springer, 2021, pp. 169–183.
[27] X. Xiong, K. Zheng, L. Lei, L. Hou, Resource allocation based on deep reinforcement learning in IoT edge computing, IEEE J. Sel. Areas Commun. 38 (6) (2020) 1133–1146.
[28] J. Wang, L. Zhao, J. Liu, N. Kato, Smart resource allocation for mobile edge computing: A deep reinforcement learning approach, IEEE Trans. Emerg. Top. Comput. 9 (3) (2021) 1529–1541.
[29] L. Huang, X. Feng, C. Zhang, L. Qian, Y. Wu, Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing, Digit. Commun. Netw. 5 (1) (2019) 10–17.
[30] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, M. Bennis, Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning, IEEE Internet Things J. 6 (3) (2019) 4005–4018.
[31] Y. Zheng, H. Zhou, R. Chen, K. Jiang, Y. Cao, SAC-based computation offloading and resource allocation in vehicular edge computing, in: Proceedings of the IEEE INFOCOM - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2022, pp. 1–6.
[32] T. Zhao, F. Li, L. He, Secure video offloading in MEC-enabled IIoT networks: A multi-cell federated deep reinforcement learning approach, IEEE Trans. Ind. Inform. (2023) 1–12.
[33] L. Liao, Y. Lai, F. Yang, W. Zeng, Online computation offloading with double reinforcement learning algorithm in mobile edge computing, J. Parallel Distrib. Comput. 171 (2023) 28–39.
[34] V. Sethi, S. Pal, FedDOVe: A Federated Deep Q-learning-based Offloading for Vehicular fog computing, Future Gener. Comput. Syst. 141 (2023) 96–105.
[35] P. Li, W. Xie, Y. Yuan, C. Chen, S. Wan, Deep reinforcement learning for load balancing of edge servers in iov, Mob. Netw. Appl. 27 (4) (2022) 1461–1474.
[36] X. Chu, M. Zhu, H. Mao, Y. Qiu, Task offloading for multi-gateway-assisted mobile edge computing based on deep reinforcement learning, in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2022, pp. 3234–3241.
[37] F. Xue, Q. Hai, T. Dong, Z. Cui, Y. Gong, A deep reinforcement learning based hybrid algorithm for efficient resource scheduling in edge computing environment, Inform. Sci. 608 (2022) 362–374.
[38] S. Pallewatta, V. Kostakos, R. Buyya, Placement of microservices-based IoT applications in fog computing: A taxonomy and future directions, ACM Comput. Surv. 55 (2023) 1–43.
[39] W. Zhu, M. Goudarzi, R. Buyya, Flight: A lightweight federated learning framework in edge and fog computing, 2023, arXiv preprint arXiv:2308.02834.
[40] X. Qiu, W. Zhang, W. Chen, Z. Zheng, Distributed and collective deep reinforcement learning for computation offloading: A practical perspective, IEEE Trans. Parallel Distrib. Syst. 32 (5) (2020) 1085–1101.
[41] N.H. Tran, W. Bao, A. Zomaya, M.N. Nguyen, C.S. Hong, Federated learning over wireless networks: Optimization model design and analysis, in: Proceedings of the IEEE INFOCOM, IEEE, 2019, pp. 1387–1395.
[42] J. Ji, K. Zhu, L. Cai, Trajectory and communication design for cache-enabled UAVs in cellular networks: A deep reinforcement learning approach, IEEE Trans. Mob. Comput. (2022).
[43] S. Fujimoto, H. van Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, in: Proceedings of Machine Learning Research, 80, PMLR, 2018, pp. 1587–1596.
[44] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, in: Proceedings of Machine Learning Research, 80, PMLR, 2018, pp. 1861–1870.
[45] S. Mysore, B.E. Mabsout, R. Mancuso, K. Saenko, Honey, I shrunk the actor: A case study on preserving performance with smaller actors in actor-critic RL, in: 2021 IEEE Conference on Games (CoG), 2021, pp. 01–08.
[46] P. Christodoulou, Soft actor-critic for discrete action settings, 2019, arXiv preprint arXiv:1910.07207.
[47] H. Wang, Y. Ye, J. Zhang, B. Xu, A comparative study of 13 deep reinforcement learning based energy management methods for a hybrid electric vehicle, Energy 266 (2023) 126497.
[48] J. Zhu, F. Wu, J. Zhao, An overview of the action space for deep reinforcement learning, in: Proceedings of the 4th International Conference on Algorithms, Computing and Artificial Intelligence, 2021, pp. 1–10.
[49] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Adv. Neural Inf. Process. Syst. 12 (1999).
[50] R. Huang, T. Yu, Z. Ding, S. Zhang, Policy gradient, in: Deep Reinforcement Learning: Fundamentals, Research and Applications, Springer, 2020, pp. 161–212.
[51] J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, Trust region policy optimization, in: Proceedings of the International Conference on Machine Learning, PMLR, 2015, pp. 1889–1897.
[52] T. Van Erven, P. Harremos, Rényi divergence and Kullback-Leibler divergence, IEEE Trans. Inform. Theory 60 (7) (2014) 3797–3820.
[53] S. Shao, W. Luk, Customised pearlmutter propagation: A hardware architecture for trust region policy optimisation, in: Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2017, pp. 1–6.
[54] S. Li, R. Wang, M. Tang, C. Zhang, Hierarchical reinforcement learning with advantage-based auxiliary rewards, Adv. Neural Inf. Process. Syst. 32 (2019).
[55] S. Aljanabi, A. Chalechale, Improving IoT services using a hybrid fog-cloud offloading, IEEE Access 9 (2021) 13775–13788.
[56] L. Yliniemi, K. Tumer, Multi-objective multiagent credit assignment in reinforcement learning and nsga-ii, Soft Comput. 20 (10) (2016) 3869–3887.
[57] J. Blank, K. Deb, Pymoo: Multi-objective optimization in python, IEEE Access 8 (2020) 89497–89509.
[58] X. Li, X. Li, K. Wang, S. Yang, Y. Li, Achievement scalarizing function sorting for strength Pareto evolutionary algorithm in many-objective optimization, Neural Comput. Appl. 33 (2021) 6369–6388.
[59] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep reinforcement learning that matters, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
[60] D. Wei, N. Xi, X. Ma, M. Shojafar, S. Kumari, J. Ma, Personalized privacy-aware task offloading for edge-cloud-assisted industrial internet of things in automated manufacturing, IEEE Trans. Ind. Inform. 18 (11) (2022) 7935–7945.
[61] N. Bjorck, C.P. Gomes, B. Selman, K.Q. Weinberger, Understanding batch normalization, Adv. Neural Inf. Process. Syst. 31 (2018).
[62] C. Huang, R. Mo, C. Yuen, Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning, IEEE J. Sel. Areas Commun. 38 (8) (2020) 1839–1850.
[63] F. Fu, Y. Kang, Z. Zhang, F.R. Yu, Transcoding for live streaming-based on vehicular fog computing: An actor-critic DRL approach, in: Proceedings of the IEEE INFOCOM - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2020, pp. 1015–1020.
[64] R. Islam, P. Henderson, M. Gomrokchi, D. Precup, Reproducibility of benchmarked deep reinforcement learning tasks for continuous control, 2017, arXiv preprint arXiv:1708.04133.

Zhiyu Wang is working towards the Ph.D. degree at the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, the University of Melbourne. His research interests include Edge/Fog Computing, Internet of Things (IoT), Distributed Systems, and Artificial Intelligence. He is currently working on leveraging AI techniques to enhance resource management in edge/fog and cloud computing environments.

Mohammad Goudarzi is working as a senior research associate at the University of New South Wales (UNSW), Sydney, Australia. He completed his Ph.D. at the Department of Computing and Information Systems, the University of Melbourne, Australia. His research interests include Internet of Things (IoT), Cloud/Fog/Edge Computing, Cyber Security, Distributed Systems, and Machine Learning. He received Oracle's Cloud Architect of the Year Award 2022, the IEEE TCCLD Outstanding Ph.D. Thesis Award 2022, the IEEE TCSC Outstanding Ph.D. Dissertation Award 2022, and the IEEE Outstanding Service Award 2021. Media outlets such as the Australian Financial Review (AFR) and Oracle Blogs have covered his research findings.

Mingming Gong is currently a Senior Lecturer in data science with the School of Mathematics and Statistics, The University of Melbourne, Melbourne, VIC, Australia. He has authored/coauthored more than 30 research papers at top venues, such as the International Conference on Machine Learning (ICML), Neural Information Processing Systems (NeurIPS), the IEEE Transactions on Neural Networks and Learning Systems, and the IEEE Transactions on Image Processing, with more than ten oral/spotlight presentations. His research interests include causal reasoning, machine learning, and computer vision. Dr. Gong received the Discovery Early Career Researcher Award from the Australian Research Council in 2020.

Dr. Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He has authored over 625 publications and seven text books, including "Mastering Cloud Computing" published by McGraw Hill, China Machine Press, and Morgan Kaufmann for the Indian, Chinese, and international markets, respectively. He is one of the highly cited authors in computer science and software engineering worldwide (h-index=160, g-index=322, 137,900+ citations).