
Future Generation Computer Systems 152 (2024) 55–69


Deep Reinforcement Learning-based scheduling for optimizing system load and response time in edge and fog computing environments
Zhiyu Wang a,∗, Mohammad Goudarzi b, Mingming Gong c, Rajkumar Buyya a

a Cloud Computing and Distributed Systems (CLOUDS) Laboratory, The University of Melbourne, Melbourne, Australia
b School of Computer Science and Engineering, The University of New South Wales (UNSW), Sydney, Australia
c School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia

ARTICLE INFO

Keywords: Edge computing; Fog computing; Machine learning; Deep reinforcement learning; Internet of Things

ABSTRACT

Edge/fog computing, as a distributed computing paradigm, satisfies the low-latency requirements of an ever-increasing number of IoT applications and has become the mainstream computing paradigm behind IoT applications. However, because a large number of IoT applications require execution on edge/fog resources, the servers may become overloaded. This can disrupt the edge/fog servers and also negatively affect IoT applications' response time. Moreover, many IoT applications are composed of dependent components, incurring extra constraints on their execution. Besides, edge/fog computing environments and IoT applications are inherently dynamic and stochastic. Thus, efficient and adaptive scheduling of IoT applications in heterogeneous edge/fog computing environments is of paramount importance. However, the limited computational resources on edge/fog servers impose an extra burden for applying optimal but computationally demanding techniques. To overcome these challenges, we propose a Deep Reinforcement Learning-based IoT application Scheduling algorithm, called DRLIS, to adaptively and efficiently optimize the response time of heterogeneous IoT applications and balance the load of the edge/fog servers. We implemented DRLIS as a practical scheduler in the FogBus2 function-as-a-service framework for creating an edge–fog–cloud integrated serverless computing environment. Results obtained from extensive experiments show that DRLIS significantly reduces the execution cost of IoT applications by up to 55%, 37%, and 50% in terms of load balancing, response time, and weighted cost, respectively, compared with metaheuristic algorithms and other reinforcement learning techniques.

1. Introduction

The past few years have witnessed the rapid rise of the Internet of Things (IoT) industry, enabling the connection of people to things and things to things, and facilitating the digitization of the physical world [1]. Meanwhile, with the explosive growth of IoT devices and various applications, the expectation for stability and low latency is higher than ever [2]. As the main enabler of IoT, cloud computing stores and processes data and information generated by IoT devices. Leveraging powerful computing capabilities and advanced storage technologies, cloud computing ensures the security and reliability of stored information. However, servers in the cloud computing paradigm are usually located at a long physical distance from IoT devices, and the high latency caused by long distances cannot efficiently satisfy real-time IoT applications. Prompted by these issues, edge and fog computing have emerged as popular computing paradigms in the IoT context.

Although some researchers use the terms edge computing and fog computing interchangeably, we clearly define them in this paper. We consider the case that uses ''only'' edge resources for real-time IoT applications as edge computing, and the case that uses edge and, whenever necessary, also cloud resources (along with edge resources in a seamless manner) as fog computing. Edge computing, as a decentralized computing architecture, brings processing, storage, and intelligent control to the vicinity of IoT devices [3]. This flexible architecture extends cloud computing services to the edge of the network. In contrast, the fog computing paradigm inherits the advantages of both cloud and edge computing [4]: it not only provides powerful computational capabilities but also reduces the need to transfer data to the cloud for processing, analysis, and storage, thus reducing the inter-network distance. In the real world, edge and fog computing provide strong support for innovation and development in various fields. For example, in the field of smart healthcare, deploying edge computing nodes on wearable devices and medical devices can monitor patients' physiological parameters in real time and transmit the data to the cloud for analysis and diagnosis, realizing telemedicine and personalized medicine [5].

∗ Corresponding author.
E-mail addresses: [email protected] (Z. Wang), [email protected] (M. Goudarzi), [email protected] (M. Gong),
[email protected] (R. Buyya).

https://doi.org/10.1016/j.future.2023.10.012
Received 28 June 2023; Received in revised form 14 August 2023; Accepted 20 October 2023
Available online 28 October 2023
0167-739X/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

In the field of autonomous driving, deploying edge computing nodes on self-driving vehicles can perform real-time sensing and decision processing, enabling shorter response times and improving driving safety [6].

However, the massive growth in the number of IoT applications and servers in fog computing environments also creates new challenges. Firstly, the execution time is expected to be minimized [7], which means that the applications should be processed by the best (i.e., the most powerful and physically closest) server. Besides, the load should ideally be balanced and distributed to run on multiple operating units. For example, by distributing requests across multiple servers in a seamless manner (as in serverless computing environments), load balancing can avoid overloading individual servers and ensure that each server handles a moderate load. This improves response times, overall system performance, and throughput, and also helps servers run more consistently. Therefore, improving the load balancing level of servers (i.e., lowering the variance of server resource utilization) while reducing the response time becomes an important but challenging problem for scheduling IoT applications on servers in edge/fog computing environments. Since this is an NP-hard problem, metaheuristic and rule-based solutions can be considered [8,9]. However, these approaches often rely on omniscient knowledge of global information and require the solution proponent to have control over the changes. In the fog computing environment, there is often no regularity in server performance, utilization, and downtime. The number of IoT applications and the corresponding resource requirements are even more nearly random. Besides, in reality, Directed Acyclic Graphs (DAGs) are often used to model IoT applications [10], where nodes represent tasks and edges represent data communication between dependent tasks. The dependency among tasks introduces higher complexity in scheduling applications. Therefore, metaheuristic and rule-based solutions cannot efficiently cope with the IoT application scheduling problem in fog computing environments.

Deep Reinforcement Learning (DRL) is the product of combining deep learning with reinforcement learning, integrating the powerful understanding of deep learning on perceptual problems with the decision-making capabilities of reinforcement learning. In deep reinforcement learning, the agent continuously interacts with the environment, recording a large number of empirical trajectories (i.e., sequences of states, actions, and rewards), which are used in the training phase to learn optimal policies. In contrast to metaheuristic algorithms, agents in deep reinforcement learning are able to autonomously sense and respond to changes in the environment, which allows deep reinforcement learning to solve complex problems in realistic scenarios. However, due to the limited computational resources of devices in fog computing environments [11], the computational requirements of complex Deep Neural Networks (DNNs) are often not supported [12]. Therefore, how to balance implementation simplicity, sample complexity, and solution performance becomes a key research problem in applying deep reinforcement learning to fog computing environments to cope with complex situations.

To address the above challenges, we propose a Deep Reinforcement Learning-based IoT application Scheduling algorithm (DRLIS), which employs the Proximal Policy Optimization (PPO) [13] technique for solving the IoT application scheduling problem in fog computing environments. DRLIS can effectively optimize the load balancing cost of the servers, the response time cost of the IoT applications, and their weighted cost. Besides, by using a clipped surrogate objective to limit the magnitude of policy updates in each iteration and by performing multiple iterations of updates on the sampled data, the convergence speed of the algorithm is improved. Moreover, considering the limited computational resources and the optimization objective under study, we design efficient reward functions. The main contributions of this paper are:

• We propose a weighted cost model regarding DAG-based IoT applications' scheduling in fog computing environments to improve the load balancing level of the servers while minimizing the response time of the application. In addition, we adapt this weighted cost model to make it applicable to DRL algorithms.
• We propose a DRL-based algorithm (DRLIS) to solve the defined weighted cost optimization problem in dynamic and stochastic fog computing environments. When the computing environment changes (e.g., requests from different IoT applications, server computing resources, the number of servers), it can adaptively update the scheduling policy with a fast convergence speed.
• Based on DRLIS, we implement a practical scheduler in the FogBus2 function-as-a-service framework¹ [14] for handling scheduling requests of IoT applications in heterogeneous fog and edge computing environments. We also extend the functionality of the FogBus2 framework to make different DRL techniques applicable to it.
• We conduct practical experiments and use real IoT applications with heterogeneous tasks and resource demands to evaluate the performance of DRLIS in a real system setup. By comparing with common metaheuristics (Non-dominated Sorting Genetic Algorithm 2 (NSGA2) [16], Non-dominated Sorting Genetic Algorithm 3 (NSGA3) [17]) and other reinforcement learning algorithms (Q-Learning [18]), we demonstrate the superiority of DRLIS in terms of convergence speed, optimization cost, and scheduling time.

The rest of the paper is organized as follows. Section 2 discusses related work and Section 3 presents the system model and problem formulation. The Deep Reinforcement Learning model for IoT applications in edge and fog computing environments is presented in Section 4. DRLIS is discussed in Section 5. Section 6 evaluates the performance of DRLIS and compares it with other counterparts. Finally, Section 7 concludes the paper and states future work.

¹ Please refer to [14,15] for a detailed description of the FogBus2 framework.

2. Related work

In this section, we review the literature on scheduling IoT applications in edge and fog computing environments. The related works are divided into metaheuristic and reinforcement learning categories.

2.1. Metaheuristic

In the dependent category, Liu et al. [19] adopted a Markov Decision Process (MDP) approach to achieving shorter average task execution latency in edge computing environments. They proposed an efficient one-dimensional search algorithm to find the optimal task scheduling policy. However, this work cannot adapt to changes in the computing environment and is difficult to extend to solve complex weighted cost optimization problems in heterogeneous fog computing environments. Wu et al. [20] modeled the task scheduling problem in edge and fog computing environments as a DAG and used an estimation of distribution algorithm (EDA) and a partitioning operator to partition the graph in order to queue tasks and assign appropriate servers. However, they did not practically implement and test their work. Sun et al. [21] improved the NSGA2 algorithm and designed a resource scheduling scheme among fog nodes in the same fog cluster, taking into account the diversity of different devices. This work aims to reduce the service latency and improve the stability of task execution. Although capable of handling weighted cost optimization problems, this work only considers scheduling problems in the same computing environment. Hoseiny et al. [22] proposed a Genetic Algorithm (GA)-based technique for minimizing the total computation time and energy consumption of task scheduling in a heterogeneous fog cloud computing environment. By introducing features for tasks, the technique can find a more suitable computing environment for each task. However, it does not consider the dependencies of different tasks in the application, and due to the use of metaheuristic algorithms, scheduling rules need to be manually set, which cannot adapt to changing computing environments.


Ali et al. [23] proposed an NSGA2-based technique for minimizing the total computation time and system cost of task scheduling in heterogeneous fog cloud computing environments. Their work formulates the task scheduling problem as an optimization problem in order to dynamically allocate appropriate resources for predefined tasks. Similarly, due to the limitations of metaheuristic algorithms, this work requires the assumption that the technique has some knowledge of the submitted tasks to develop the scheduling policy and thus cannot cope with dynamic and complex scenarios.

2.2. Reinforcement learning

In the dependent category, Shahidani et al. [24] proposed a Q-learning-based algorithm to reduce task execution latency and balance the load in a fog cloud computing environment. However, this work does not consider the inter-task dependencies and the heterogeneity of fog and cloud computing environments. Baek et al. [25] adapted the Q-learning algorithm and proposed an approach that aims at improving load balancing in fog computing environments. This work considers the heterogeneity of nodes in fog computing environments but still assumes that the tasks within the application are independent of each other. Jie et al. [26] proposed a Deep Q-Network (DQN)-based approach to minimize the total latency of task processing in edge computing environments. This work formulates task scheduling as a Markov Decision Process while considering the heterogeneity of IoT applications. However, this work only considers the scheduling problem in edge computing environments and investigates only one optimization objective. Xiong et al. [27] adapted the DQN algorithm and proposed a resource allocation strategy for IoT edge computing systems. This work aims at minimizing the average job completion time but does not take into account more complex functions with multiple optimization objectives. Wang et al. [28] focus on edge computing environments and propose a deep reinforcement learning-based resource allocation (DRLRA) scheme based on DQN. This work targets to reduce the average service time and balance the resource usage within the edge computing environment. However, the work does not consider the resources in the fog computing environment, and the technique is not practically implemented and tested. Huang et al. [29] adopted a DQN-based approach to address the resource allocation problem in the edge computing environment. This work investigated minimizing the weighted cost, including the total energy consumption and the latency to complete the task. However, it does not consider the heterogeneity of servers in fog computing environments and assumes that the tasks are independent. Chen et al. [30] proposed an approach based on double DQN to balance task execution time and energy consumption in edge computing environments. Similarly, this work is only applicable to the edge environment and does not consider the dependencies between tasks. Zheng et al. [31] proposed a Soft Actor–Critic (SAC)-based algorithm to minimize the task completion time in an edge computing environment. This work focuses on the latency problem and the experiments are simulation-based. Zhao et al. [32] proposed a Twin Delayed DDPG (TD3)-based DRL algorithm. The goal of this work is to minimize the latency and energy consumption, but inter-task dependencies are not considered and the results are also simulation-based. Liao et al. [33] used Deep Deterministic Policy Gradient (DDPG) and Double Deep Q-Network (DQN) algorithms to model computation in an edge environment. This work aims to reduce energy consumption and latency but does not consider the fog environment and the heterogeneity of devices. Sethi et al. [34] proposed a DQN-based algorithm to optimize energy consumption and load balancing of fog servers. Similarly, this work is simulation-based and does not consider the dependencies between tasks.

Table 1 presents the comparison of the related work with our proposed algorithm, in terms of application properties, architectural properties, algorithm properties, and evaluation. In the application properties section, the number of tasks included in the IoT application and the dependencies between tasks are studied. In the architectural properties section, three aspects are studied, including the IoT device layer, the edge/fog layer, and the multi-cloud layer. For the IoT device layer, the application type and request type are identified. The real application section indicates that the work either deploys actual IoT applications, adopts simulated applications, or uses random data. The heterogeneous request type represents work considering that different IoT devices have different numbers of requests and different requirements. For the edge/fog layer, the computing environment and the heterogeneity of deployed servers are investigated. Besides, the multi-cloud layer studies whether the work considers the scenario of different cloud service providers with heterogeneity. In the algorithm properties section, we investigate the main technique on which each work is based and the corresponding optimization objectives. The evaluation section identifies whether the work is based on simulation or practical experiments. Recent works that we reviewed (e.g., [31–37]) have often used reinforcement learning approaches to deal with workload scheduling problems. This is because reinforcement learning can learn by interacting with the environment and continuously optimizing the policy through feedback signals (e.g., reward or penalty). This learning ability gives reinforcement learning an advantage when facing complex, dynamic environments [38], whereas metaheuristic techniques require manual adaptation and guidance.

3. System model and problem formulation

In this section, we first introduce the topology of the IoT systems in the edge and fog computing environment. Then, we discuss the problem formulation. The key notations are listed in Table 2.

3.1. System model

Fig. 1. A view of the IoT system in fog computing.

Fig. 1 represents a layered view of the IoT systems in the fog computing environment. Consider $S = \{S_l \mid 1 \le l \le |S|\}$ as a collection of $|S|$ applications, where each application contains one or more tasks, denoted as $S_l = \{S_{l_i} \mid 1 \le i \le |S_l|\}$. The DAG $G = (V, E)$ is used to model an IoT application, as depicted in Fig. 2. A vertex $v_i = S_{l_i}$ denotes a certain task of the application, and an edge $e_{i,j}$ denotes the data flow between tasks $v_i$ and $v_j$, so some tasks must be executed after their predecessor tasks are completed. $CP(S_l)$ represents the critical path (i.e., the path with the highest cost) of the DAG, marked in red in the figure.

A set containing $|N|$ servers is used to process the application set $S$, denoted as $N = \{n_k \mid 1 \le k \le |N|\}$. To reflect the heterogeneity of the servers, for each server $n_k$, $n_k^{cpu\_ut}$ represents its CPU utilization (%), $n_k^{freq}$ represents its CPU frequency (MHz), $n_k^{ram\_ut}$ represents its RAM utilization (%), and $n_k^{ram\_size}$ represents its RAM size (GB).

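To keep the notation of Section 3.1 and Table 2 concrete, the sketch below shows one possible in-memory representation of a server $n_k$ and a task $S_{l_i}$. It is only an illustration under our own (hypothetical) class and field names, not part of the FogBus2 code base, and it assumes utilization values are stored as fractions in [0, 1].

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Server:
    """One server n_k of the server set N (Section 3.1)."""
    server_id: int
    cpu_utilization: float    # n_k^{cpu_ut}, fraction of CPU in use (0..1)
    cpu_frequency_mhz: float  # n_k^{freq}
    ram_utilization: float    # n_k^{ram_ut}, fraction of RAM in use (0..1)
    ram_size_gb: float        # n_k^{ram_size}

@dataclass
class Task:
    """One task S_{l_i} of a DAG-modelled application S_l."""
    task_id: int
    app_id: int
    required_cpu_cycles: float                          # S_{l_i}^{size}, used by Eq. (14)
    required_ram_gb: float                              # S_{l_i}^{ram}, needed for the RAM constraint
    parents: List[int] = field(default_factory=list)    # P(S_{l_i}), parent task ids
    on_critical_path: bool = False                      # CP(S_{l_i})
```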

Table 1
A qualitative comparison of related works with ours.

| Work | Task number | Dependency | Real applications | Request type | Computing environment | Server heterogeneity | Multi-Cloud layer | Main technique | Time | Load balancing | Weighted | Evaluation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [19] | Single | Independent | Random | Homogeneous | Edge | Homogeneous | × | MDP | ✓ | × | × | Simulation |
| [21] | Multiple | Independent | Simulated | Homogeneous | Edge and Fog | Heterogeneous | × | NSGA2 | ✓ | × | ✓ | Simulation |
| [22] | Single | Independent | Random | Homogeneous | Edge and Fog | Heterogeneous | × | GA | ✓ | × | × | Simulation |
| [23] | Single | Independent | Random | Homogeneous | Edge and Fog | Heterogeneous | × | NSGA2 | ✓ | × | ✓ | Simulation |
| [20] | Multiple | Dependent | Simulated | Homogeneous | Edge and Fog | Heterogeneous | × | EDA | ✓ | × | ✓ | Simulation |
| [25] | Single | Independent | Random | Homogeneous | Edge and Fog | Heterogeneous | × | Q-Learning | × | ✓ | × | Simulation |
| [24] | Single | Independent | Random | Homogeneous | Edge and Fog | Homogeneous | × | Q-Learning | ✓ | ✓ | ✓ | Simulation |
| [26] | Single | Independent | Simulated | Homogeneous | Edge | Homogeneous | × | DQN | ✓ | × | × | Simulation |
| [27] | Multiple | Independent | Simulated | Homogeneous | Edge | Homogeneous | × | DQN | ✓ | × | × | Simulation |
| [29] | Multiple | Independent | Simulated | Heterogeneous | Edge | Homogeneous | × | DQN | ✓ | × | ✓ | Simulation |
| [28] | Single | Independent | Simulated | Homogeneous | Edge | Homogeneous | × | DQN | ✓ | ✓ | ✓ | Simulation |
| [30] | Single | Independent | Simulated | Heterogeneous | Edge | Homogeneous | × | Double DQN | ✓ | × | ✓ | Simulation |
| [31] | Single | Independent | Simulated | Homogeneous | Edge | Homogeneous | × | SAC | ✓ | × | × | Simulation |
| [32] | Single | Independent | Simulated | Homogeneous | Edge and Fog | Homogeneous | × | TD3 | ✓ | × | ✓ | Simulation |
| [35] | Single | Independent | Random | Homogeneous | Edge | Homogeneous | × | DQN | × | ✓ | × | Simulation |
| [33] | Single | Independent | Simulated | Homogeneous | Edge | Homogeneous | × | DDPG and DQN | ✓ | × | ✓ | Simulation |
| [36] | Single | Independent | Random | Homogeneous | Edge | Homogeneous | × | DDPG | ✓ | × | ✓ | Simulation |
| [34] | Single | Independent | Simulated | Homogeneous | Edge and Fog | Homogeneous | × | DQN | × | ✓ | ✓ | Simulation |
| [37] | Multiple | Dependent | Simulated | Heterogeneous | Edge | Heterogeneous | × | GA and DQN | ✓ | × | × | Simulation |
| DRLIS | Multiple | Dependent | Real IoT application and deployment | Heterogeneous | Edge and Fog | Heterogeneous | ✓ | PPO | ✓ | ✓ | ✓ | Practical |

Works [19]–[23] and [20] are based on metaheuristic algorithms; the remaining related works are based on reinforcement learning techniques. "Time", "Load balancing", and "Weighted" denote the optimization objectives. The "Real applications" column indicates whether the work deploys real IoT applications, uses simulated applications, or uses random data.

Table 2
List of key notations.

| Variable | Description |
|---|---|
| $S$ | The application set |
| $S_l$ | One application (one task set) |
| $S_{l_i}$ | One task |
| $N$ | The server set |
| $x_{S_{l_i}}$ | The scheduling configuration of task $S_{l_i}$ |
| $\chi_l$ | The scheduling configuration of application $S_l$ |
| $\chi$ | The scheduling configuration of the application set $S$ |
| $n_k^{cpu\_ut}$ | The CPU utilization (%) of server $n_k$ |
| $n_k^{freq}$ | The CPU frequency (MHz) of server $n_k$ |
| $n_k^{ram\_ut}$ | The RAM utilization (%) of server $n_k$ |
| $n_k^{ram\_size}$ | The RAM size (GB) of server $n_k$ |
| $N^{cpu\_uti}$ | The CPU utilization (%) of each server in server set $N$, denoted as a set |
| $N^{ram\_uti}$ | The RAM utilization (%) of each server in server set $N$, denoted as a set |
| $S_{l_i}^{ram}$ | The minimum RAM required for executing task $S_{l_i}$ |
| $\psi_{x_{S_{l_i}}}$ | The load balancing model after the scheduling configuration $x_{S_{l_i}}$ |
| $\psi^{cpu}_{x_{S_{l_i}}}$ | The variance of CPU utilization of the server set after the scheduling configuration $x_{S_{l_i}}$ |
| $\psi^{ram}_{x_{S_{l_i}}}$ | The variance of RAM utilization of the server set after the scheduling configuration $x_{S_{l_i}}$ |
| $\Psi(\chi_l)$ | The load balancing model after the scheduling configuration $\chi_l$ |
| $\Psi(\chi)$ | The load balancing model after the scheduling configuration $\chi$ |
| $\omega_{x_{S_{l_i}}}$ | The total execution time (ms) for task $S_{l_i}$ based on the scheduling configuration $x_{S_{l_i}}$ |
| $\omega^{trt}_{x_{S_{l_i}}}$ | The ready time (ms) for task $S_{l_i}$ based on the scheduling configuration $x_{S_{l_i}}$ |
| $\omega^{trt}_{n_j,n_k}$ | The time (ms) consumed for the data required by task $S_{l_i}$ to be sent from server $n_j$ to server $n_k$ |
| $P(S_{l_i})$ | The parent task set of task $S_{l_i}$ |
| $PS(S_{l_i})$ | The server set to which the dependency (parent) tasks of task $S_{l_i}$ are assigned |
| $\omega^{trans}_{n_j,n_k}$ | The transmission time (ms) between server $n_j$ and server $n_k$ |
| $\omega^{prop}_{n_j,n_k}$ | The propagation time (ms) between server $n_j$ and server $n_k$ |
| $p_{n_j,n_k}$ | The packet size (MB) from server $n_j$ to server $n_k$ for task $S_{l_i}$ |
| $b_{n_j,n_k}$ | The data rate (bit/s) between server $n_j$ and server $n_k$ |
| $CP(S_{l_i})$ | Equals 1 if $S_{l_i}$ is on the critical path of application $S_l$, otherwise 0 |
| $\omega^{proc}_{x_{S_{l_i}}}$ | The processing time (ms) for task $S_{l_i}$ based on the scheduling configuration $x_{S_{l_i}}$ |
| $\Omega(\chi_l)$ | The total execution time (ms) for application $S_l$ based on the scheduling configuration $\chi_l$ |
| $\Omega(\chi)$ | The total execution time (ms) for the application set $S$ based on the scheduling configuration $\chi$ |

Moreover, $PS(S_{l_i})$ represents the server set to which the parent tasks of task $S_{l_i}$ are assigned, and $\omega^{trans}_{n_j,n_k}$, $\omega^{prop}_{n_j,n_k}$, $p_{n_j,n_k}$, and $b_{n_j,n_k}$ denote the transmission time (ms), the propagation time (ms), the packet size (MB), and the data rate (bit/s) between server $n_j$ and server $n_k$, respectively.

3.2. Problem formulation

Since an application contains one or multiple tasks, it may be executed on different servers. With a set of servers $N$, the scheduling configuration $x_{S_{l_i}}$ of a task $S_{l_i}$ is defined as:

$x_{S_{l_i}} = \{n_k\},$  (1)

where $k$ shows the server's index. Accordingly, the scheduling configuration $\chi_l$ of an application $S_l$ is equal to the set of the scheduling configurations of the tasks it contains, defined as:

$\chi_l = \{x_{S_{l_i}} \mid S_{l_i} \in S_l,\ 1 \le i \le |S_l|\}.$  (2)

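As a minimal illustration of Eqs. (1)–(2), a scheduling configuration can be held as a plain mapping from task identifier to server index; the structure and helper name below are our own and only serve to make the notation tangible.

```python
# chi_l: scheduling configuration of one application S_l (Eq. (2));
# each task id is mapped to exactly one server index k (Eq. (1)).
chi_l = {"task_0": 2, "task_1": 0, "task_2": 2}

# chi: scheduling configuration of the whole application set S.
chi = {"app_0": chi_l, "app_1": {"task_0": 1}}

def server_of(chi, app_id, task_id):
    """Return x_{S_{l_i}}, the server index assigned to one task."""
    return chi[app_id][task_id]
```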

Fig. 2. Sample IoT application with the critical path in red color.

The scheduling configuration $\chi$ of the application set $S$ is equal to the set of scheduling configurations per application:

$\chi = \{\chi_l \mid 1 \le l \le |S|\}.$  (3)

In addition, we consider that for a given application, the execution model of tasks can be hybrid (i.e., sequential and/or parallel). That is, children tasks have dependencies on their parent tasks and need to be executed after their completion, and we use $P(S_{l_i})$ to represent the parent task set of task $S_{l_i}$ [39]. Tasks that do not depend on each other can be executed in parallel, and we use $CP(S_{l_i})$ to indicate whether a task $S_{l_i}$ is located on a critical path of application $S_l$.

3.2.1. Load balancing model

The load balancing model is used to measure the resource balancing level of the server set $N$ during the processing of the application set $S$. Regarding the server resources, both CPU and RAM are considered. For task $S_{l_i}$, the load balancing model $\psi_{x_{S_{l_i}}}$ is defined as:

$\psi_{x_{S_{l_i}}} = a_1 \psi^{cpu}_{x_{S_{l_i}}} + a_2 \psi^{ram}_{x_{S_{l_i}}},$  (4)

where $\psi^{cpu}_{x_{S_{l_i}}}$ and $\psi^{ram}_{x_{S_{l_i}}}$ represent the CPU and RAM models, and $a_1$ and $a_2$ are the control parameters by which the weighted load balancing model can be tuned. They satisfy:

$a_1 + a_2 = 1, \quad 0 \le a_1, a_2 \le 1.$  (5)

The CPU model $\psi^{cpu}_{x_{S_{l_i}}}$ and RAM model $\psi^{ram}_{x_{S_{l_i}}}$ are defined as the variance of CPU and RAM utilization of the server set $N$ after the scheduling configuration $x_{S_{l_i}}$:

$\psi^{cpu}_{x_{S_{l_i}}} = \mathrm{Var}[N^{cpu\_uti}],$  (6)

$\psi^{ram}_{x_{S_{l_i}}} = \mathrm{Var}[N^{ram\_uti}],$  (7)

where $x_{S_{l_i}} = \{n_k\}$, as given in Eq. (1).

Correspondingly, for application $S_l$, the load balancing model $\Psi(\chi_l)$ is defined as the sum of the load balancing models for each task processed by server set $N$:

$\Psi(\chi_l) = \sum_{i=1}^{|S_l|} \psi_{x_{S_{l_i}}}.$  (8)

Our main goal is to find the best-possible scheduling configuration for the application set $S$ such that the variance of the overall CPU and RAM utilization of the server set $N$ during the processing of the application set $S$ can be minimized. Therefore, for the application set $S$, the load balancing model $\Psi(\chi)$ is defined as:

$\Psi(\chi) = \sum_{l=1}^{|S|} \Psi(\chi_l) = \sum_{l=1}^{|S|} \sum_{i=1}^{|S_l|} \psi_{x_{S_{l_i}}}.$  (9)

3.2.2. Response time model

We consider the response time model $\omega_{x_{S_{l_i}}}$ for the task $S_{l_i}$ consisting of two components, the task ready time model $\omega^{trt}_{x_{S_{l_i}}}$ and the processing model $\omega^{proc}_{x_{S_{l_i}}}$:

$\omega_{x_{S_{l_i}}} = \omega^{trt}_{x_{S_{l_i}}} + \omega^{proc}_{x_{S_{l_i}}}.$  (10)

The task ready time model $\omega^{trt}_{x_{S_{l_i}}}$ represents the maximum time for the data required by the task $S_{l_i}$ to arrive at the server to which it is assigned, defined as:

$\omega^{trt}_{x_{S_{l_i}}} = \max\ \omega^{trt}_{n_j,n_k}, \quad \forall n_j \in PS(S_{l_i}),$  (11)

where $\omega^{trt}_{n_j,n_k}$ denotes the time consumed for the data required by task $S_{l_i}$ to be sent from server $n_j$ to server $n_k$, $n_k$ is the server where the task $S_{l_i}$ will be executed based on scheduling configuration $x_{S_{l_i}}$, and $n_j$ represents a server where a parent task of task $S_{l_i}$ is executed. Therefore, $\omega^{trt}_{n_j,n_k}$ depends on the transmission time $\omega^{trans}_{n_j,n_k}$ and the propagation time $\omega^{prop}_{n_j,n_k}$ for task $S_{l_i}$ between server $n_j$ and server $n_k$:

$\omega^{trt}_{n_j,n_k} = \begin{cases} \omega^{trans}_{n_j,n_k} + \omega^{prop}_{n_j,n_k} & n_j \ne n_k, \\ 0 & n_j = n_k. \end{cases}$  (12)

And the transmission time $\omega^{trans}_{n_j,n_k}$ can be calculated as:

$\omega^{trans}_{n_j,n_k} = \dfrac{p_{n_j,n_k}}{b_{n_j,n_k}},$  (13)

where $p_{n_j,n_k}$ represents the packet size from server $n_j$ to server $n_k$ for task $S_{l_i}$, and $b_{n_j,n_k}$ represents the current bandwidth between server $n_j$ and server $n_k$ when the data for task $S_{l_i}$ is transmitted.

The processing model $\omega^{proc}_{x_{S_{l_i}}}$ is defined as the time it takes for the assigned server $n_k$ to process the task $S_{l_i}$ based on scheduling configuration $x_{S_{l_i}}$, and can be calculated as:

$\omega^{proc}_{x_{S_{l_i}}} = \dfrac{S_{l_i}^{size}}{n_k^{freq}},$  (14)

where $S_{l_i}^{size}$ represents the required CPU cycles for task $S_{l_i}$ and $n_k^{freq}$ represents the CPU frequency of server $n_k$ (for multi-core CPUs, the average frequency is considered).

Accordingly, the response time model $\Omega(\chi_l)$ for application $S_l$ is defined as:

$\Omega(\chi_l) = \sum_{i=1}^{|S_l|} \big(\omega_{x_{S_{l_i}}} \times CP(S_{l_i})\big),$  (15)

where $CP(S_{l_i})$ equals 1 if task $S_{l_i}$ is on the critical path of application $S_l$, otherwise 0.

The main goal for the response time model $\Omega(\chi)$ is to find the best-possible scheduling configuration for the application set $S$ such that the total time for the server set $N$ to process them can be minimized. Therefore, for the application set $S$, the response time model $\Omega(\chi)$ is defined as:

$\Omega(\chi) = \sum_{l=1}^{|S|} \Omega(\chi_l) = \sum_{l=1}^{|S|} \sum_{i=1}^{|S_l|} \big(\omega_{x_{S_{l_i}}} \times CP(S_{l_i})\big).$  (16)

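The load balancing and response time models above reduce to a few elementary computations per scheduling decision. The following sketch shows one way they could be evaluated; it is a simplified illustration under our own function names and unit assumptions (bandwidth in MB/s, propagation delay in ms), not the authors' implementation.

```python
import statistics

def load_balancing_cost(cpu_utils, ram_utils, a1=0.5, a2=0.5):
    """psi_{x_{S_li}} (Eq. (4)): weighted variance of CPU/RAM utilization
    across the server set after the task has been placed (Eqs. (6)-(7))."""
    return a1 * statistics.pvariance(cpu_utils) + a2 * statistics.pvariance(ram_utils)

def transmission_time_ms(packet_size_mb, bandwidth_mb_per_s):
    """omega^{trans} (Eq. (13)): packet size divided by the data rate."""
    return packet_size_mb / bandwidth_mb_per_s * 1000.0

def ready_time_ms(parent_servers, target_server, packet_size_mb, link_profile):
    """omega^{trt} (Eqs. (11)-(12)): slowest arrival of inputs from parent tasks.
    link_profile maps (src, dst) -> (bandwidth in MB/s, propagation delay in ms)."""
    times = [0.0]
    for src in parent_servers:
        if src != target_server:
            bw_mb_per_s, prop_ms = link_profile[(src, target_server)]
            times.append(transmission_time_ms(packet_size_mb, bw_mb_per_s) + prop_ms)
    return max(times)

def processing_time_ms(cpu_cycles, cpu_frequency_mhz):
    """omega^{proc} (Eq. (14)); 1 MHz = 10^6 cycles per second."""
    return cpu_cycles / (cpu_frequency_mhz * 1e6) * 1000.0

def response_time_ms(parent_servers, target_server, packet_size_mb,
                     link_profile, cpu_cycles, cpu_frequency_mhz):
    """omega_{x_{S_li}} (Eq. (10)): ready time plus processing time."""
    return (ready_time_ms(parent_servers, target_server, packet_size_mb, link_profile)
            + processing_time_ms(cpu_cycles, cpu_frequency_mhz))
```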

3.2.3. Weighted cost model

The weighted cost model is defined as the weighted sum of the normalized load balancing and normalized response time models. For task $S_{l_i}$:

$\phi_{x_{S_{l_i}}} = w_1 \dfrac{\psi_{x_{S_{l_i}}} - \psi^{min}}{\psi^{max} - \psi^{min}} + w_2 \dfrac{\omega_{x_{S_{l_i}}} - \omega^{min}}{\omega^{max} - \omega^{min}},$  (17)

where $\psi_{x_{S_{l_i}}}$ and $\omega_{x_{S_{l_i}}}$ are the load balancing model and response time model of task $S_{l_i}$, and $\psi^{min}$, $\psi^{max}$, $\omega^{min}$, and $\omega^{max}$ represent the minimum and maximum values of the load balancing model and response time model, respectively. Moreover, $w_1$ and $w_2$ are the control parameters by which the weighted cost model can be tuned. The reason we use the normalized models instead of the original models is that the values of the two models may be in different ranges. For example, the load balancing model may have a value from 0 to 1, while the response time model may have a value from 0 to 100. We need to normalize them so that the model values are in the same range.

Accordingly, the weighted cost model for application $S_l$ is defined as:

$\Phi(\chi_l) = w_1 \times Norm(\Psi(\chi_l)) + w_2 \times Norm(\Omega(\chi_l)),$  (18)

where $\Psi(\chi_l)$ and $\Omega(\chi_l)$ are obtained from Eqs. (8) and (15), and $Norm$ represents the normalization. The weighted cost model for the application set $S$ is defined as:

$\Phi(\chi) = w_1 \times Norm(\Psi(\chi)) + w_2 \times Norm(\Omega(\chi)),$  (19)

where $\Psi(\chi)$ and $\Omega(\chi)$ are obtained from Eqs. (9) and (16).

Therefore, the weighted cost optimization problem of IoT applications can be formulated as:

$\min\ \Phi(\chi)$  (20)

subject to:

$C1:\ Size(x_{S_{l_i}}) = 1, \quad \forall x_{S_{l_i}} \in \chi_l$  (21)
$C2:\ 0 \le n_k^{ram\_ut}, n_k^{cpu\_ut} \le 1, \quad \forall n_k \in N$  (22)
$C3:\ n_k^{freq}, n_k^{ram\_size} \ge 0, \quad \forall n_k \in N$  (23)
$C4:\ S_{l_i}^{ram} < n_k^{ram\_size}, \quad \forall S_{l_i} \in S_l, \forall n_k \in N$  (24)
$C5:\ \Phi(x_{S_{l_j}}) \le \Phi(x_{S_{l_j}} + x_{S_{l_i}}), \quad \forall S_{l_j} \in P(S_{l_i})$  (25)
$C6:\ w_1 + w_2 = 1, \quad 0 \le w_1, w_2 \le 1$  (26)

where $C1$ states that any task can only be assigned to one server for processing. $C2$ states that for any server, the CPU utilization and RAM utilization are between 0 and 1. Besides, $C3$ states that the CPU frequency and the RAM size of any server are larger than 0. Moreover, $C4$ denotes that any server should have sufficient RAM resources to process any task. Also, $C5$ denotes that any task can only be processed after its parent tasks have been processed, and thus the cumulative cost is always larger than or equal to that of the parent task. In addition, $C6$ denotes that the control parameters of the weighted cost model can only take values from 0 to 1, and their sum should be equal to 1.

The formulated problem is a non-convex optimization problem, because there may be an infinite number of local optima in the set of feasible domains, and usually, the complexity of an algorithm to find the global optimum is exponential (NP-hard) [40]. To cope with such non-convex optimization problems, most works decompose them into several convex sub-problems and then solve these sub-problems iteratively until the algorithm converges [41]. This type of approach reduces the complexity of the original problem at the expense of accuracy [42]. In addition, such approaches are highly dependent on the current environment and cannot be applied in dynamic environments with complex and continuously changeable parameters and computational resources [42]. To deal with this problem, we propose DRLIS to efficiently handle uncertainties in dynamic environments by learning from interaction with the environment.

4. Deep reinforcement learning model

In reinforcement learning, the autonomous agent first interacts with the surrounding environment through an action. Under the action and the environment, the agent generates a new state, while the environment gives an immediate reward. In this cycle, the agent interacts with the environment continuously and thus generates sufficient data. The reinforcement learning algorithm uses the generated data to modify its own action policy, then interacts with the environment to generate new data, and uses the new data to further improve its behavior. Formally, we use a Markov Decision Process (MDP) to model the reinforcement learning problem. Specifically, the learning problem can be described by the tuple ⟨S, A, P, R, $\gamma$⟩, where S denotes a finite set of states; A denotes a finite set of actions; P denotes the state transition probability; R denotes the reward function; and $\gamma \in [0, 1]$ is the discount factor, used to compute the cumulative rewards.

We assume that the time T of the learning process is divided into multiple time steps $t$, and the agent interacts with the environment at each time step and has multiple states $S_t$. At a particular time step $t$, the agent possesses the environment state $S_t = s$, where $s \in$ S. The agent chooses an action $A_t = a$ according to the policy $\pi(a|s)$, where $a \in$ A, and $\pi(a|s) = Pr[A_t = a \mid S_t = s]$ is the policy function, which denotes the probability of choosing the action $a$ in state $s$. After choosing action $a$, the agent receives a reward $r = \mathrm{R}[S_t = s, A_t = a]$ from the environment based on the reward function R, and it moves to the next state $S_{t+1} = s'$ based on the state transition function $P^a_{ss'} = \mathrm{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$. The goal of the reinforcement learning agent is to learn a policy $\pi$ that maximizes the expectation of the cumulative discounted reward $\mathbb{E}_\pi[\sum_{t \in T} \gamma_t r_t]$.

Based on the weighted cost optimization problem of IoT applications in edge and fog computing environments, the state space S, action space A, and reward function R for the MDP are defined as follows:

• State space S: Since the optimization problem is related to tasks and servers, the state of the problem consists of the feature space of the task currently being processed and the state space of the current server set $N$. Based on the discussion in Section 3, at the time step $t$, the feature space of the task $S_{l_i}$ includes the task ID, the task's predecessors and successors, the application ID to which the task belongs, the number of tasks in the current application, the estimate of the occupied CPU resources for the execution of the task, the task's RAM requirements, the estimate of the task's response time, etc. Formally, the feature space F for task $S_{l_i}$ at the time step $t$ is defined as follows:

$\mathrm{F}_t(S_{l_i}) = \{f_t^y(S_{l_i}) \mid S_{l_i} \in S_l,\ 0 \le y \le |\mathrm{F}|\},$  (27)

where $y$ represents the index of the feature in the task feature space F, and $|\mathrm{F}|$ represents the number of features. Moreover, at the time step $t$, the state space of the current server set $N$ includes the number of servers, each server's CPU utilization, CPU frequency, RAM utilization, and RAM size, and the propagation time and bandwidth between different servers, etc. Formally, the state space G for the server set $N$ at the time step $t$ is defined as:

$\mathrm{G}_t(N) = \{|N|,\ g_t^z(n_k),\ h_t^q(n_j, n_k) \mid n_j, n_k \in N,\ 0 \le z \le |g|,\ 0 \le q \le |h|\},$  (28)

where $g$ represents the state type that is related to only one server (e.g., CPU utilization), $z$ represents its index, and $|g|$ represents the length of this type of state; besides, $h$ denotes the state type that is related to two servers (e.g., propagation time), and similarly, $q$ represents its index and $|h|$ represents the length of this type of state. Therefore, the state space S is defined as:

$\mathrm{S} = \{S_t = (\mathrm{F}_t(S_{l_i}), \mathrm{G}_t(N)) \mid S_{l_i} \in S_l,\ t \in \mathrm{T}\}.$  (29)


• Action space A: The goal is to find the best-possible scheduling configuration for the application set $S$ to minimize the objective function in Eq. (20). Therefore, at the time step $t$, the action can be defined as the assignment of a server to the task $S_{l_i}$:

$A_t = x_{S_{l_i}} = n_k.$  (30)

Accordingly, the action space A can be defined as the server set $N$:

$\mathrm{A} = N.$  (31)

• Reward function R: Since this is a weighted cost optimization problem, we need to define the reward function for each sub-problem. First, as the $penalty$, a very large negative value is introduced if the task cannot be processed on the assigned server for any reason. Also, for the load balancing problem, based on the discussion in Section 3.2.1, the reward function $r_t^{lb}$ is defined as:

$r_t^{lb} = \begin{cases} \psi_{x_{S_{l_{i-1}}}} - \psi_{x_{S_{l_i}}} & succeed \\ penalty & fail, \end{cases}$  (32)

where $\psi_{x_{S_{l_i}}}$ is obtained from Eq. (4). The value output by reward function $r_t^{lb}$ is the difference between the load balancing models of the server set after scheduling the current task and the previous one. If the value of the load balancing model of the server set is reduced after scheduling the current task, the output reward is positive, otherwise it is negative. Besides, for the response time problem, based on the discussion in Section 3.2.2, the reward function $r_t^{rt}$ is defined as:

$r_t^{rt} = \begin{cases} \omega^{mean}_{x_{S_{l_i}}} - \omega_{x_{S_{l_i}}} & succeed \\ penalty & fail, \end{cases}$  (33)

where $\omega_{x_{S_{l_i}}}$ is obtained from Eq. (10), and $\omega^{mean}_{x_{S_{l_i}}}$ represents the average response time for task $S_{l_i}$. The value output by reward function $r_t^{rt}$ is the difference between the average response time (the current response time is also considered) and the current response time for task $S_{l_i}$. If the current response time is lower than the average one, the output reward is positive, otherwise it is negative. The reward function $r_t$ for the weighted cost optimization problem is defined as:

$r_t = \begin{cases} w_1 \times Norm(r_t^{lb}) + w_2 \times Norm(r_t^{rt}) & succeed \\ penalty & fail, \end{cases}$  (34)

where $w_1$ and $w_2$ are the control parameters, and $Norm$ represents the normalization process.
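To make the MDP concrete, the following sketch shows how a state vector (Eqs. (27)–(29)) and the weighted reward (Eqs. (32)–(34)) could be assembled for one scheduling step. It reuses the illustrative Server fields from the sketch in Section 3.1, applies min–max normalization for $Norm$, and uses our own hypothetical names; it is not the exact FogBus2 implementation.

```python
import numpy as np

def build_state(task_features, servers, link_features):
    """S_t = (F_t(S_li), G_t(N)) (Eq. (29)): concatenate the task feature
    vector, per-server features g, and pairwise link features h."""
    server_part = [[s.cpu_utilization, s.cpu_frequency_mhz,
                    s.ram_utilization, s.ram_size_gb] for s in servers]
    return np.concatenate([np.asarray(task_features, dtype=np.float32),
                           np.asarray(server_part, dtype=np.float32).ravel(),
                           np.asarray(link_features, dtype=np.float32).ravel()])

def minmax_norm(value, lo, hi):
    """Norm(.) as used in Eqs. (17) and (34); guards a degenerate range."""
    return 0.0 if hi <= lo else (value - lo) / (hi - lo)

def weighted_reward(prev_lb, curr_lb, mean_rt, curr_rt,
                    lb_range, rt_range, w1=0.5, w2=0.5,
                    succeed=True, penalty=-1e3):
    """r_t (Eq. (34)) combining r_t^{lb} (Eq. (32)) and r_t^{rt} (Eq. (33))."""
    if not succeed:
        return penalty
    r_lb = prev_lb - curr_lb   # positive if load balancing improved
    r_rt = mean_rt - curr_rt   # positive if faster than the running average
    return w1 * minmax_norm(r_lb, *lb_range) + w2 * minmax_norm(r_rt, *rt_range)
```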
Currently, many advanced deep reinforcement learning algorithms (e.g., PPO, TD3, SAC) have been proposed by different researchers. They show excellent performance in different fields. PPO improves convergence and sampling efficiency by adopting importance sampling and proportional clipping [13]. TD3 (Twin Delayed DDPG) introduces a dual Q network and a delayed update strategy to effectively solve the overestimation problem in continuous action spaces [43]. SAC (Soft Actor–Critic) combines policy optimization and the learning of Q-value functions, providing more robust and exploratory policy learning through maximum entropy theory [44]. These algorithms have achieved remarkable results in different tasks and environments. In our research problem, the agent's action and state space is discrete, which hinders the application of TD3, because it is designed for continuous control [45]. In addition, the original SAC only considers the problem of continuous space [44]; although there are some works discussing how to apply SAC to discrete space, they usually need to adopt special tricks and extensions, such as using soft-max or sample-prune techniques to accommodate discrete actions [46]. Besides, Wang et al. [47] show that SAC requires more computation time and convergence time than PPO. Our study focuses on edge and fog computing environments, where handling latency sensitivity and variation are important considerations for choosing the appropriate DRL algorithm. We choose PPO as the basis of DRLIS, because PPO is designed to be more easily adaptable to discrete action spaces [48] and we aim for the algorithm to converge quickly and perform well in diverse environments.

5. DRL-based optimization algorithm

Based on the above-mentioned MDP model, we propose DRLIS to achieve weighted cost optimization of IoT applications in edge and fog computing environments. In this section, we introduce the mathematical principle of the PPO algorithm and discuss the proposed DRLIS.

5.1. Preliminaries

The PPO algorithm belongs to the Policy Gradient (PG) family of algorithms, which consider the impact of actions on rewards and adjust the probability of actions [49]. We use the same notations as in Section 3 to describe the algorithm. We consider that the time horizon T is divided into multiple time steps $t$, and the agent has a policy $\pi_\theta$ for determining its actions and interactions with the environment. The objective can be expressed as adjusting the parameter $\theta$ to maximize the expected cumulative discounted rewards $\mathbb{E}_{\pi_\theta}[\sum_{t \in T} \gamma_t r_t]$ [13], expressed by the formula:

$J(\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_{t \in T} \gamma_t r_t\big].$  (35)

Since this is a maximization problem, the gradient ascent algorithm can be used to find the maximum value:

$\theta' = \theta + \alpha \nabla_\theta J(\theta).$  (36)

The key is to obtain the gradient of the reward function $J(\theta)$ with respect to $\theta$, which is called the policy gradient. The algorithm for solving reinforcement learning problems by optimizing the policy gradient is called the policy gradient algorithm. The policy gradient can be presented as

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a_t|s_t) A_\theta(a_t|s_t)],$  (37)

where $A_\theta(a_t|s_t)$ is the advantage function at time step $t$, used to evaluate the action $a_t$ at the state $s_t$. Here, the policy gradient indicates the expectation of $\nabla_\theta \log \pi_\theta(a_t|s_t) A_\theta(a_t|s_t)$, which can be estimated using the empirical average obtained by sampling. However, the PG algorithm is very sensitive to the update step size, and choosing a suitable step size is challenging [50]. Moreover, practice shows that the difference between old and new policies in training is usually large [13].

To address this problem, Trust Region Policy Optimization (TRPO) [51] was proposed. This algorithm introduces importance sampling to evaluate the difference between the old and new policies and restricts the new policy if the importance sampling ratio grows large. Importance sampling refers to replacing the original sampling distribution with a new one to make sampling easier or more efficient. Specifically, TRPO maintains two policies: the first policy $\pi_{\theta_{old}}$ is the current policy to be refined, and the second policy $\pi_\theta$ is used to collect the samples. The optimization problem is defined as follows:

$\underset{\theta}{\text{maximize}}\ \ \mathbb{E}_t\Big[\dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} A_t\Big]$  (38)

$\text{subject to}\ \ \mathbb{E}_t[KL[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]] \le \delta,$  (39)

where $KL$ represents the Kullback–Leibler divergence, used to quantify the difference between two probability distributions [52], and $\delta$ represents the restriction of the update between the old policy $\pi_{\theta_{old}}$ and the new policy $\pi_\theta$.

After linear approximation of the objective and quadratic approximation of the constraints, the problem can be efficiently approximated using the conjugate gradient algorithm. However, the computation of the conjugate gradient makes the implementation of TRPO more complex and inflexible in practice [53,54].

To make this algorithm well applicable in practice, the KL-PPO algorithm [13] was proposed. Rather than using the constraint function $\mathbb{E}_t[KL[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]] \le \delta$, the KL divergence is added as a penalty in the objective function:

$L^{KLPEN}(\theta) = \mathbb{E}_t\big[r_t(\theta) A_t - \beta\, KL[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\big],$  (40)

where $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the ratio of the new policy and the old policy, obtained in Eq. (38), and the parameter $\beta$ can be dynamically adjusted during the iterative process according to the KL divergence. If the current KL divergence is larger than the predefined maximum value, the penalty is not strong enough and the parameter $\beta$ needs to be increased. Conversely, if the current KL divergence is smaller than the predefined minimum value, the parameter $\beta$ needs to be reduced.

Moreover, another idea to restrict the difference between the old policy $\pi_{\theta_{old}}$ and the new policy $\pi_\theta$ is to use the clipped surrogate function $clip$. The PPO algorithm using the clip function (CLIP-PPO) removes the KL penalty and the need for adaptive updates to simplify the algorithm. Practice shows CLIP-PPO usually performs better than KL-PPO [13]. Formally, the objective function of CLIP-PPO is defined as follows:

$L^{CLIP}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta) A_t,\ clip(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\big].$  (41)

And $clip(r_t(\theta), 1-\epsilon, 1+\epsilon)$ restricts the ratio $r_t(\theta)$ into $(1-\epsilon, 1+\epsilon)$, defined as:

$clip(r_t(\theta), 1-\epsilon, 1+\epsilon) = \begin{cases} 1-\epsilon & r_t(\theta) < 1-\epsilon \\ r_t(\theta) & 1-\epsilon \le r_t(\theta) \le 1+\epsilon \\ 1+\epsilon & r_t(\theta) > 1+\epsilon. \end{cases}$  (42)

By removing the constraint function as discussed in TRPO, both PPO algorithms significantly reduce the computational complexity, while ensuring that the updated policy does not deviate too far from the previous one.

5.2. DRLIS: DRL-based IoT application scheduling

Since CLIP-PPO usually outperforms KL-PPO in practice, we choose it as the basis for the optimization algorithm. DRLIS is based on the actor–critic framework, which is a reinforcement learning method combining Policy Gradient and Temporal Difference (TD) learning. As the name implies, this framework consists of two parts, the actor and the critic, and in implementation, they are usually presented as Deep Neural Networks (DNNs). The actor network is used to learn a policy function $\pi_\theta(a|s)$ to maximize the expected cumulative discounted reward $\mathbb{E}_\pi[\sum_{t \in T} \gamma_t r_t]$, while the critic network is used to evaluate the current policy and to guide the next stage of the actor's action. In the learning process, at the time step $t$, the reinforcement learning agent inputs the current state $s_t$ into the actor network, and the actor network outputs the action $a_t$ to be performed by the agent in the MDP. The agent performs the action $a_t$, receives the reward $r_t$ from the environment, and moves to the next state $s_{t+1}$. The critic network receives the states $s_t$ and $s_{t+1}$ as input and estimates their value functions $V_{\pi_\theta}(s_t)$ and $V_{\pi_\theta}(s_{t+1})$. The agent then computes the TD error $\delta_t$ for the time step $t$:

$\delta_t = r_t + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t),$  (43)

where $\gamma$ denotes the discount factor, as discussed in Section 3, and the actor network and critic network update their parameters using the TD error $\delta_t$. DRLIS continues this process over multiple steps to form an estimate $\hat{A}_t$ of the advantage function $A_t$, which can be written as:

$\hat{A}_t = -V_{\pi_\theta}(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V_{\pi_\theta}(s_T).$  (44)

DRLIS maintains three networks, one critic network and two actor networks (i.e., the old actor and the new actor), representing the old policy function $\pi_{\theta_{old}}$ and the new policy function $\pi_\theta$, as discussed in Section 5.1. Algorithm 1 describes DRLIS for the weighted cost optimization problem in edge and fog computing environments.

Algorithm 1: DRLIS for weighted cost optimization
Input: new actor network $\Pi_\theta$ with parameter $\theta$; old actor network $\Pi_{\theta_{old}}$ with parameter $\theta_{old}$, where $\theta_{old} = \theta$; critic network $V_\mu$ with parameter $\mu$; max time step $T$; update epoch $K$; policy objective function coefficient $a_c$; value function loss coefficient $a_v$; entropy bonus coefficient $a_e$; clipping ratio $\epsilon$
1  while True do
2      servers ← GetServers();
3      task ← GetTask();
4      if servers ≠ servers_old then
5          agent ← InitializeAgent(servers);
6          servers_old ← servers;
7      end if
8      s_1 ← GeneralizeState(servers, task);
9      B ← InitializeBuffer();
10     for t ← 1 to T do
11         a_t ← $\Pi_\theta$(s_t);
12         Schedule(task, a_t);
13         r_t ← GetReward();
14         servers ← GetServers();
15         if servers ≠ servers_old then
16             break;
17         end if
18         task ← GetTask();
19         s_{t+1} ← GeneralizeState(servers, task);
20         u_t = (s_t, a_t, r_t);
21         B.Append(u_t);
22     end for
23     $\hat{A}_t \leftarrow -V_\mu(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V_\mu(s_T)$;
24     for k ← 1 to K do
25         $L^{CLIP}(\theta) = \frac{1}{t}\sum_t \min\big(\frac{\Pi_\theta(a_t|s_t)}{\Pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t,\ clip(\frac{\Pi_\theta(a_t|s_t)}{\Pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon) \hat{A}_t\big)$;
26         $L^{VF}(\mu) = \frac{1}{t}\sum_t (V_\mu(s_t) - \hat{A}_t)^2$;
27         $L^{ET}(\theta) = \frac{1}{t}\sum_t Entropy(\Pi_\theta(a_t|s_t))$;
28         $L(\theta, \mu) = -a_c L^{CLIP}(\theta) + a_v L^{VF}(\mu) - a_e L^{ET}(\theta)$;
29         update $\theta$ and $\mu$ with $L(\theta, \mu)$ by Adam optimizer;
30     end for
31     $\theta_{old} \leftarrow \theta$;
32 end while

We consider a scheduler that is implemented based on DRLIS. When this scheduler receives a scheduling request from an IoT application, it obtains information about the set of servers currently available and initializes a DRL agent based on this information. This agent contains three deep neural networks: a new actor network $\Pi_\theta$ with parameter $\theta$, an old actor network $\Pi_{\theta_{old}}$ with parameter $\theta_{old}$, where $\theta_{old} = \theta$, and a critic network $V_\mu$ with parameter $\mu$. After that, the scheduler obtains the information about the currently submitted task and generates the current state $s_t$ based on the information regarding the task and servers. Inputting the state $s_t$ to the new actor network $\Pi_\theta$ outputs an action $a_t$, representing the target server to which the current task is to be assigned. The scheduler then assigns the task to the target server and receives the corresponding reward $r_t$, which is calculated based on Eqs. (32), (33), and (34). The reward $r_t$ is essential for indicating the positive or negative impact of the agent's current scheduling policy on the optimization objectives (e.g., IoT application response time and server load balancing level). Also, a tuple $u_t$ with three values $(s_t, a_t, r_t)$ is stored in buffer B. The scheduler repeats this process $T$ times until sufficient information is collected to update the neural networks.

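The collection-and-update cycle of Algorithm 1 (lines 24–29) maps naturally onto an automatic-differentiation library. The snippet below is a hedged PyTorch-style sketch of one update epoch using the clipped surrogate of Eq. (41) and the combined loss of Eqs. (45)–(47) introduced next; the function name, the single Adam optimizer over both networks, and the tensor layout are our own assumptions, not the authors' exact code.

```python
import torch

def ppo_update(new_actor, old_actor, critic, optimizer, states, actions,
               advantages, targets, eps=0.2, a_c=1.0, a_v=0.5, a_e=0.01):
    """One epoch of the CLIP-PPO style update sketched from Algorithm 1."""
    dist_new = torch.distributions.Categorical(logits=new_actor(states))
    with torch.no_grad():
        dist_old = torch.distributions.Categorical(logits=old_actor(states))
    # Probability ratio r_t(theta) between the new and old policies.
    ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
    # Clipped surrogate objective L^{CLIP} (Eq. (41)), averaged over the batch.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    # Value-function loss L^{VF} (Eq. (46)) against the advantage-based targets of Eq. (44).
    value_loss = (critic(states).squeeze(-1) - targets).pow(2).mean()
    # Entropy bonus L^{ET} (Eq. (47)) to keep the policy exploratory.
    entropy = dist_new.entropy().mean()
    # Combined loss L(theta, mu) (Eq. (45)), minimized with Adam.
    loss = -a_c * surrogate + a_v * value_loss - a_e * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With the tuples collected in buffer B converted to tensors, calling such a function $K$ times and then copying the new actor's parameters into the old actor mirrors lines 24–31 of Algorithm 1.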

Fig. 3. Updated sub-components for reinforcement learning in FogBus2 framework.
Fig. 4. Reinforcement learning scheduling module in FogBus2 framework.

When updating the neural networks, the estimate of the advantage function is first computed based on Eq. (44). Then the neural networks are optimized $K$ times. Both the actor network and the critic network use the Adam optimizer, and the loss function is computed as:

$L(\theta, \mu) = -a_c L^{CLIP}(\theta) + a_v L^{VF}(\mu) - a_e L^{ET}(\theta),$  (45)

where $L^{CLIP}(\theta)$ is the policy objective function from Eq. (41), and $L^{VF}(\mu)$ is the loss function for the state value function:

$L^{VF}(\mu) = \sum_{t=1}^{T} (V_\mu(s_t) - \hat{A}_t)^2.$  (46)

And $L^{ET}(\theta)$ is the entropy bonus for the current policy:

$L^{ET}(\theta) = \sum_{t=1}^{T} Entropy(\Pi_\theta(a_t|s_t)).$  (47)

In addition, $a_c$, $a_v$, and $a_e$ are the coefficients. After updating the neural networks, the parameter $\theta$ of the new actor network $\Pi_\theta$ is copied to the old actor network $\Pi_{\theta_{old}}$. Assuming that there are $N$ tasks, from Algorithm 1, the agent updates the policy $K$ times after scheduling every $T$ tasks, so the complexity of the algorithm is $O(N + \frac{N}{T} K)$. In practical applications, both $T$ and $K$ are hyperparameters that can be customized to suit different computational environments. Thus, the computational complexity of the algorithm actually depends on the number of tasks $N$ and can be written as $O(N)$. For the edge/fog environment with limited computational resources, we consider this computational complexity to be acceptable.

5.3. Practical implementation in the FogBus2 framework

We extend the scheduling module of the FogBus2 framework² [14] to design and develop DRLIS in practice for processing placement requests from different IoT applications in edge and fog computing environments.

FogBus2 is a lightweight container-based distributed/serverless framework (realized using Docker microservices software) for integrating edge and fog/cloud computing environments. A scheduling module is implemented to decide the deployment of heterogeneous IoT applications, enabling the management of distributed resources in the hybrid computing environment. There are five main components within the FogBus2 framework, namely Master, Actor, RemoteLogger, TaskExecutor, and User. Fig. 3 shows the relationship between the different components in the FogBus2 framework and the updated sub-components used to implement the reinforcement learning function.

• Remote Logger: It is designed for collecting and storing logs from other components, whether periodic or event-driven.
• Master: It contains the scheduling module of FogBus2, responsible for the registration and scheduling of IoT applications. It can also discover resources and self-scale based on the input load. We implement a reinforcement learning scheduling module in the Scheduler & Scaler sub-component. Besides, we extend the functionality of the Profiler and the Message Handler components to allow Master components to receive and handle information from other components for reinforcement learning scheduling.
• Actor: It informs the Remote Logger and Master components of the computing resources of the corresponding node to coordinate the resource scheduling of the framework. Furthermore, it is responsible for launching the appropriate Task Executor components to process the submitted IoT application. We extend the functionality of the Profiler and the Message Handler components to allow system characteristics regarding servers to be passed to the reinforcement learning scheduling module in Master components.
• Task Executor: It is responsible for executing the corresponding tasks of the submitted application. The results are passed to the Master component.
• User: It runs on IoT devices and is responsible for processing raw data from sensors and users. It sends the processed data to the Master component and submits the execution request. We extend the functionality of the Actuator and the Message Handler components to allow information related to IoT applications to be passed to the reinforcement learning scheduling module in Master components.

² https://github.com/Cloudslab/FogBus2.

Fig. 4 shows our implementation of the reinforcement learning scheduling module in the FogBus2 framework. The module can be divided into four sub-modules: (1) Reinforcement Learning Models, (2) Rewards Models, (3) Reinforcement Learning Agent, and (4) Model Warehouse.

• Reinforcement Learning Models: This sub-module contains the reinforcement learning models. According to Algorithm 1, we implement a DRLIS-based model. In addition, to evaluate the performance of DRLIS, we also implement DQN and Q-Learning-based models.
• Rewards Models: This sub-module contains the models associated with the reward functions. According to Sections 3.2 and 4, we implemented the Load Balancing Model, Response Time Model, and Weighted Cost Model. This sub-module is responsible for calculating the reward values based on the information (e.g., CPU and RAM utilization) and transferring them to the Agent sub-module.
• Reinforcement Learning Agent: This sub-module implements the functions of the reinforcement learning agent. The Agent Initiator calls the Reinforcement Learning Models sub-module and initializes the corresponding models.


• Reinforcement Learning Agent: This sub-module implements the functions of the reinforcement learning agent. The Agent Initiator calls the Reinforcement Learning Models sub-module and initializes the corresponding models. The Action Selector is responsible for outputting the target server index for the currently scheduled task. The Model Optimizer optimizes the running reinforcement learning scheduling policy based on the reward values returned from the Reward Function Models sub-module. The State Converter is responsible for converting the parameters of the server and IoT application into state vectors that can be recognized by the reinforcement learning scheduling model (a minimal sketch of this conversion is given after this list). The Scheduling Policy Runner is the running program of the reinforcement learning scheduling Agent and is responsible for receiving submitted tasks, saving or loading the trained policies, and requesting and accessing parameters from other FogBus2 components (e.g., FogBus2 Actor, FogBus2 User) for the computation of reward functions.
• Model Warehouse: This sub-module can save the hyperparameters of the trained scheduling policy to the database and load the hyperparameters to initialize a well-trained scheduling Agent.
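To make the role of the State Converter concrete, the following is a minimal Python sketch of how server and task parameters could be flattened into a state vector for the scheduling policy. The class and field names (ServerInfo, TaskInfo, cpu_util, and so on) and the normalization constants are illustrative assumptions, not the FogBus2 framework's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ServerInfo:
    cpu_util: float        # CPU utilization in [0, 1]
    ram_util: float        # RAM utilization in [0, 1]
    bandwidth_mbps: float  # profiled bandwidth to the IoT device
    latency_ms: float      # profiled latency to the IoT device

@dataclass
class TaskInfo:
    cpu_demand: float      # normalized CPU demand of the current task
    ram_demand: float      # normalized RAM demand of the current task

def to_state_vector(servers: List[ServerInfo], task: TaskInfo) -> List[float]:
    """Flatten per-server utilization/network figures and the current task's
    demands into one fixed-length state vector for the scheduling policy."""
    state: List[float] = []
    for s in servers:
        state.extend([s.cpu_util, s.ram_util,
                      s.bandwidth_mbps / 100.0,   # coarse normalization (assumed)
                      s.latency_ms / 100.0])
    state.extend([task.cpu_demand, task.ram_demand])
    return state

# Example: three candidate servers and one task
servers = [ServerInfo(0.35, 0.50, 25.0, 3.0),
           ServerInfo(0.80, 0.65, 6.0, 15.0),
           ServerInfo(0.10, 0.20, 25.0, 3.0)]
print(to_state_vector(servers, TaskInfo(0.4, 0.2)))
```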
Algorithm 2: Reinforcement learning scheduler in the FogBus2 framework based on the proposed weighted cost optimization algorithm
Input: master component M; registered actor component set A; user component U; tasks to be processed T
1  Scheduler ← InitializeScheduler(DRLIS);
2  B_A ← InitializeActorBuffer();
3  B_U ← InitializeUserBuffer();
4  while True do
5      U.SubmitTasks(T);
6      AvailableActors ← M.CheckResources(T);
7      if AvailableActors is empty then
8          M.Message(U, Fail);
9          break;
10     end if
11     foreach t_i ∈ T do
12         Scheduler.TaskPlacement(t_i, A);
13         A.Message(M, I_A);
14         B_A.Append(I_A);
15         U.Message(M, I_U);
16         B_U.Append(I_U);
17         if UpdateScheduler is True then
18             Rewards ← ComputeRewards(B_A, B_U);
19             Scheduler.Update();
20         end if
21     end foreach
22 end while
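To make the control flow of Algorithm 2 concrete, a compact Python rendering is sketched below. The component methods (check_resources, task_placement, report_metrics, and so on) are hypothetical stand-ins that mirror the pseudocode rather than the framework's actual method names, and the update cadence is modeled with a simple counter.

```python
def run_scheduler(master, actors, user, tasks, scheduler, update_every: int = 16):
    """Sketch of Algorithm 2: place each task, buffer feedback from the
    Actor/User components, and periodically update the DRL policy."""
    actor_buffer, user_buffer = [], []          # B_A and B_U
    user.submit_tasks(tasks)
    available = master.check_resources(tasks)
    if not available:
        master.notify(user, "fail")             # no registered actor can host the application
        return
    for step, task in enumerate(tasks, start=1):
        target = scheduler.task_placement(task, available)   # Algorithm 1 policy decision
        actor_buffer.append(target.report_metrics())         # CPU/RAM utilization, etc.
        user_buffer.append(user.report_metrics())            # response time, task result, etc.
        if step % update_every == 0:                          # enough feedback collected
            rewards = scheduler.compute_rewards(actor_buffer, user_buffer)
            scheduler.update(rewards)                         # PPO-style policy update
```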
Algorithm 2 summarizes the scheduling mechanism based on DRLIS. The framework first initializes a scheduler, based on Algorithm 1. In addition, two buffers B_A and B_U, for storing information from the Actor component and the User component, are also initialized. After the User component submits the IoT application to be processed, the Master component first checks whether the Actor components that have been registered to the framework have the corresponding resources to process the application. If true, the IoT application, which contains one or multiple tasks, will be scheduled; otherwise, the Master component will inform the User component that the current application cannot be processed. For each task of an IoT application, the scheduler will place it on the target Actor component for execution based on Algorithm 1. After that, the Actor component sends the relevant information (i.e., CPU utilization, RAM utilization, etc.) to the Master component, which is stored in the buffer B_A. The User component also sends relevant information (i.e., response time, the result of task execution, etc.) to the Master component, which is stored in the buffer B_U. When the Master collects sufficient information, it updates the scheduler, where the data in B_A and B_U are used to compute the reward for each step, as discussed in Algorithm 1 and Eqs. (32), (33), (34).

6. Performance evaluation

In this section, we first describe the experimental setup and sample applications used in the evaluation. Then, we investigate the hyperparameters of DRLIS. Finally, we discuss the performance of DRLIS by comparing it with its counterparts.

6.1. Experiment setup

We first give a short introduction about the experimental environment and describe the IoT applications used in the experiment. Next, the baseline algorithms used to compare with DRLIS are presented.

6.1.1. Experiment environment
As discussed in Section 5.3, we implemented a scheduler based on DRLIS in the FogBus2 framework, and we use this scheduler for evaluation. We consider a heterogeneous experimental environment consisting of IoT devices, resource-limited fog servers, and resource-rich cloud servers. To simulate the heterogeneous multi-cloud computing environment, we used two instances of Nectar Cloud infrastructure (Intel Xeon 2 cores @2.0 GHz, 9 GB RAM, and Intel Xeon 16 cores @2.0 GHz, 64 GB RAM) and one instance of AWS Cloud (AMD EPYC 2 cores @2.2 GHz, 4 GB RAM). In the fog computing environment, to reflect the heterogeneity of the servers, we used a Raspberry Pi 3B (Broadcom BCM2837 4 cores @1.2 GHz, 1 GB RAM), a MacBook Pro (Apple M1 Pro 8 cores, 16 GB RAM), and a Linux virtual machine (Intel Core i5 2 cores @3.1 GHz, 4 GB RAM). In addition, the IoT devices are configured with 2 cores @3.2 GHz and 4 GB RAM. Furthermore, we profiled the average bandwidth (i.e., data rate) and latency between servers as follows: the latency between the IoT device and the cloud server is around 15 ms, and the bandwidth is around 6 MB/s, while the latency between the IoT device and the fog server is around 3 ms, and the bandwidth is around 25 MB/s. Also, both 𝑤1 and 𝑤2 are set to 0.5 in Eq. (19), meaning that the importance of load balancing and response time is equal.
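As a concrete illustration of how the two objectives can be combined under these settings, the sketch below computes a weighted cost from a load-balancing term (variance of CPU and RAM utilization across servers, with equal resource weights) and a normalized response-time term, with w1 = w2 = 0.5. The normalization of the response time shown here is an assumption for illustration only; Eq. (19) defines the exact form used in the paper.

```python
import statistics

def load_balancing_cost(cpu_utils, ram_utils, a1=0.5, a2=0.5):
    """Load imbalance as the weighted variance of CPU and RAM utilization
    across servers (a1 and a2 weight the two resources equally here)."""
    return a1 * statistics.pvariance(cpu_utils) + a2 * statistics.pvariance(ram_utils)

def weighted_cost(cpu_utils, ram_utils, response_time, max_response_time,
                  w1=0.5, w2=0.5):
    """Combine load balancing and (normalized) response time with equal weights."""
    lb = load_balancing_cost(cpu_utils, ram_utils)
    rt = response_time / max_response_time      # assumed normalization to [0, 1]
    return w1 * lb + w2 * rt

# Example: three servers, a 120 ms response, 500 ms taken as the normalization bound
print(weighted_cost([0.35, 0.80, 0.10], [0.50, 0.65, 0.20], 120.0, 500.0))
```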
6.1.2. Sample IoT applications
We used four IoT applications for evaluating the performance of the scheduler based on DRLIS. All applications implement both real-time and non-real-time features. Real-time means that the application can receive live streams, and non-real-time means that the application can receive pre-recorded video files. Specifically, applications follow a sensor–actuator architecture, with each application operating as a single data stream. Sensors (e.g., cameras) capture environmental information and process it into data patterns (e.g., image frames) that will be forwarded to surrogate servers for processing, while actuators receive the processed data and represent the final outcome to the user. In addition, all applications provide a parameter called application label, which can be used to set the frame size in the video. These applications are described as follows:

• Face Detection [15]: Detects and captures human faces. The human faces in the video are marked by squares. This application is implemented based on OpenCV³ (a minimal OpenCV sketch follows this list).
• Color Tracking [15]: Tracks colors from video. The user can dynamically configure the target colors through the GUI provided by the application. This application is implemented based on OpenCV.³
• Face And Eye Detection [15]: In addition to detecting and capturing human faces, the application also detects and captures human eyes. This application is implemented based on OpenCV.³
• Video OCR [14]: Recognizes and extracts text information from the video and transmits it back to the user. The application will automatically filter out keyframes. This application is implemented based on Google's Tesseract-OCR Engine.⁴

³ https://github.com/opencv/opencv.
⁴ https://github.com/tesseract-ocr/tesseract.
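For instance, the core per-frame operation of a face detection workload of this kind can be reproduced with a few lines of OpenCV. The snippet below is a generic Haar-cascade sketch rather than the application's actual code; passing a video file path instead of a camera index gives the non-real-time mode.

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)          # 0 = live camera; a file path gives the non-real-time mode
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:     # mark each detected face with a square
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```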


Fig. 5. Hyperparameter tuning results.

6.1.3. Baseline algorithms
To evaluate the performance of DRLIS, four other schedulers based on metaheuristic algorithms and reinforcement learning techniques are implemented, as follows:

• DQN: It is one of the most adopted techniques in deep reinforcement learning, which constructs an end-to-end architecture from perception to decision. This algorithm has been used by many works in the current literature, such as [26–28] and [29]. To compare with our proposed algorithm, we implement a DQN-based scheduler and integrate it into the FogBus2 framework. This scheduler can minimize the weighted load balancing and response time cost.
• Q-Learning: This technique belongs to the value-based reinforcement learning techniques that combine the Monte Carlo method and the TD method. Its ultimate goal is to learn a table (Q-Table). Works including [25,55] adopt this technique. To integrate it into the FogBus2 framework, we implemented a scheduling policy (a minimal sketch of the tabular update is given after this list). Furthermore, as a comparison, the scheduler can be used in the weighted cost problem to minimize the weighted load balancing and response time cost.
• NSGA2: It is a weighted cost genetic algorithm. It adopts the strategy of fast non-dominated sorting and crowding distance to reduce the complexity of the non-dominated sorting genetic algorithm. The algorithm has high efficiency and a fast convergence rate [56]. This algorithm is implemented using Pymoo [57].
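A minimal sketch of the tabular Q-Learning update used by this kind of baseline is shown below, with the discount factor and learning rate taken from Table 4. The coarse state encoding and the epsilon-greedy exploration value are illustrative assumptions, not the implementation used in the experiments.

```python
import random
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1          # discount factor and learning rate from Table 4
EPSILON = 0.1                    # exploration rate (assumed for this sketch)

q_table = defaultdict(float)     # Q[(state, action)] -> estimated value

def choose_action(state, actions):
    """Epsilon-greedy selection of a target server index."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state, actions):
    """One-step Q-Learning (temporal-difference) update."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])

# Toy usage: states are coarse load levels, actions are server indices
servers = [0, 1, 2]
a = choose_action("low_load", servers)
update("low_load", a, reward=-0.3, next_state="medium_load", actions=servers)
```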
• NSGA3: The framework of NSGA3 is basically the same as NSGA2, using fast non-dominated sorting to classify population individuals into different non-dominated fronts; the difference mainly lies in the change of selection mechanism. Compared with NSGA2, which uses crowding distance to select individuals of the same non-dominated level, NSGA3 introduces well-distributed reference points to maintain population diversity under high-dimensional goals [58]. This algorithm is implemented using Pymoo [57].

6.2. Hyperparameter tuning

The scheduler based on DRLIS is implemented via PyTorch. Considering the limited computational resources of some devices in the fog computing environment, both the actor network and the critic network consist of an input layer, a hidden layer, and an output layer. Henderson et al. [59] investigate the effect of hyperparameter settings on the performance of reinforcement learning models. They survey the literature on different reinforcement learning techniques, list the hyperparameter settings used in the literature, and compare the actual performance of the models under different hyperparameter settings. They compare the performance of the PPO algorithm under different network architectures, and the result shows that the model performs best under the network architecture where the hidden layer contains 64 hidden units and the hyperbolic tangent (TanH) function is used as the activation function. Therefore, we used the same network architecture for our experiments. In addition, we performed a grid search to tune the four main hyperparameters (i.e., clipping range, discount factor, learning rate for the actor network, and learning rate for the critic network), and the results are shown in Fig. 5. The load balancing model control parameters 𝑎1 and 𝑎2 are both set to 0.5 to show the equal importance of CPU and RAM; however, these values can be tuned by users based on the objectives.

All the experiments regarding hyperparameter tuning are conducted in order to solve the weighted cost problem, as discussed in Section 3.2.3. We describe the process of hyperparameter tuning of our reinforcement learning model. For tuning the clipping range 𝜖, we followed Schulman et al. [13], who proposed PPO and described that the model performs best with settings of the clipping range 𝜖 among 0.1, 0.2, and 0.3. Fig. 5(a) shows that our model performs best when the clipping range 𝜖 is set to 0.3. For the discount factor 𝛾, we reviewed related work on DRL in order to understand the common range for 𝛾. According to [13,60], the best setting for 𝛾 sits somewhere among {0.9–0.999}. Accordingly, to keep the search area for tuning 𝛾 in a viable range, we used the nominated values in these works and found that our model converges faster when 𝛾 is set to 0.9.


Fig. 5(b) shows the tuning process of 𝛾. Based on the similar approach for tuning 𝜖 and 𝛾, for tuning the actor network learning rate 𝑙𝑟𝑎, we referred to [13,59,61] for designing our tuning range. Accordingly, we used 0.003, 0.0003, and 0.00003 to tune 𝑙𝑟𝑎. Fig. 5(c) shows that our model performs best when 𝑙𝑟𝑎 is set to 0.0003. Considering the same approach for tuning, we followed [62–64] and set our tuning range among {0.01, 0.001, 0.0001}, and found that our model works best when 𝑙𝑟𝑐 is 0.001. Fig. 5(d) shows the performance of our model under different settings for 𝑙𝑟𝑐. Overall, the deep neural network and training hyperparameter settings are presented in Table 3 (a code sketch consistent with these settings is given after Table 4). Besides, we also tune the hyperparameters for the baseline techniques to fairly study their performance. The corresponding results are shown in Table 4.

Table 3
The hyperparameter settings for DRLIS.

DRLIS hyperparameter                              Value
Neural network layers                             3
Hidden layer units                                64
Optimization method                               Adam
Activation function                               TanH
Clipping range 𝜖                                  0.3
Discount factor 𝛾                                 0.9
Actor learning rate 𝑙𝑟𝑎                           0.0003
Critic learning rate 𝑙𝑟𝑐                          0.001
Policy objective function coefficient 𝑎𝑐          1
Value function loss function coefficient 𝑎𝑣       0.5
Entropy bonus coefficient 𝑎𝑒                      0.01
Load balancing model CPU control parameter 𝑎1     0.5
Load balancing model RAM control parameter 𝑎2     0.5

Table 4
The hyperparameter settings for baseline techniques.

DQN hyperparameter                                Value
Neural network layers                             3
Hidden layer units                                64
Optimization method                               Adam
Activation function                               ReLU
Discount factor                                   0.99
Learning rate                                     0.0001
Exploration rate                                  1
Exploration decay                                 0.9
Minimum exploration                               0.05

Q-Learning hyperparameter                         Value
Discount factor                                   0.9
Learning rate                                     0.1

NSGA2 and NSGA3 hyperparameter                    Value
Population size                                   200
Generation numbers                                100
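For reference, the following is a minimal PyTorch sketch consistent with the settings in Table 3 (one hidden layer of 64 TanH units, Adam optimizers with the listed learning rates, clipping range 0.3, and the objective/value/entropy coefficients). The state dimension, number of servers, and the toy batch are illustrative assumptions, and the computation of returns and advantages (which uses 𝛾 = 0.9) is omitted.

```python
import torch
import torch.nn as nn

STATE_DIM, N_SERVERS = 14, 3          # illustrative sizes, not taken from the paper
CLIP_EPS, GAMMA = 0.3, 0.9            # clipping range and discount factor (Table 3)
A_C, A_V, A_E = 1.0, 0.5, 0.01        # objective/value/entropy coefficients (Table 3)

def mlp(out_dim):
    # input layer -> one hidden layer with 64 TanH units -> output layer (Table 3)
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, out_dim))

actor, critic = mlp(N_SERVERS), mlp(1)
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)    # actor learning rate
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)  # critic learning rate

def ppo_update(states, actions, old_log_probs, returns, advantages):
    """One clipped-surrogate (PPO) update over a batch of scheduling decisions."""
    dist = torch.distributions.Categorical(logits=actor(states))
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)              # pi_theta / pi_theta_old
    surr = torch.min(ratio * advantages,
                     torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * advantages)
    value_loss = (critic(states).squeeze(-1) - returns).pow(2).mean()
    loss = -A_C * surr.mean() + A_V * value_loss - A_E * dist.entropy().mean()
    opt_actor.zero_grad(); opt_critic.zero_grad()
    loss.backward()
    opt_actor.step(); opt_critic.step()

# Toy batch: 4 scheduling decisions with random tensors of the right shapes
s = torch.randn(4, STATE_DIM)
a = torch.randint(0, N_SERVERS, (4,))
with torch.no_grad():
    old_lp = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
ppo_update(s, a, old_lp, returns=torch.randn(4), advantages=torch.randn(4))
```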
6.3. Performance study

We performed two experiments to evaluate DRLIS compared to its counterparts, regarding the load balancing of the servers, the response time of the IoT applications, and the weighted cost.

6.3.1. Cost vs. policy update analysis
In this experiment, we investigate the algorithm performance in different iterations when the policy is updated. We used the four applications mentioned in Section 6.1.2 for training, with the resolution parameter set to 480, and the maximum number of iterations set to 100. The training results of the algorithms with the three optimization objectives are shown in Fig. 6.

As shown in Fig. 6(a), when optimizing the load-balancing problem of the servers, the average computational resource variance of the servers is lower for the Q-Learning-based, DQN-based, and DRLIS-based schedulers than for the NSGA2-based and NSGA3-based schedulers. Moreover, only the reinforcement learning-based schedulers can achieve a stable convergence state. However, the Q-Learning-based scheduler requires more than 60 iterations before reaching a converged state, and the DQN-based scheduler requires more than 80 iterations, while the DRLIS-based scheduler only requires about 20 updates to converge to a similar stable state. The NSGA2-based and NSGA3-based schedulers are unable to reach the convergence state. As shown in Fig. 6(b), when optimizing the response time problem of the application, unlike the former problem, all schedulers can converge. However, the average response time of the Q-Learning-based, DQN-based, and DRLIS-based schedulers is still lower than that of the NSGA2-based and NSGA3-based schedulers. In addition, the DRLIS-based scheduler still outperforms the Q-Learning-based and the DQN-based schedulers in terms of convergence speed. Finally, as Fig. 6(c) shows, when optimizing the weighted cost problem, similar to the load balancing problem, the average cost is lower for the Q-Learning-based, DQN-based, and DRLIS-based schedulers than for the NSGA2-based and NSGA3-based schedulers, and only the first three can reach a stable convergence state. Moreover, although the Q-Learning-based, DQN-based, and DRLIS-based schedulers have similar final convergence levels, the DRLIS-based scheduler converges much faster than the Q-Learning-based and DQN-based schedulers. This proves that the DRLIS-based scheduler outperforms the other techniques in terms of average cost, convergence, and convergence speed during the training phase.

Fig. 6. Cost vs. policy update analysis — train phase.

In the evaluation phase, we set the resolution to 240, which makes the demand for computational resources and the response time of the IoT applications different from the training phase. The evaluation phase results of the different algorithms regarding the three optimization objectives are shown in Fig. 7. It can be observed that, when the optimization objective is server load balancing, IoT application response time, and weighted cost, respectively, the schedulers based on different algorithms have similar performances as in the training phase. Specifically, only the cost of the Q-Learning-based, DQN-based, and DRLIS-based schedulers converges, and the cost of the NSGA2-based and NSGA3-based schedulers fluctuates up and down in a higher range. Moreover, the average and final costs of the Q-Learning-based, DQN-based, and DRLIS-based schedulers are significantly lower than those of the NSGA2-based and NSGA3-based schedulers during the evaluation phase. In addition, in the weighted cost scenario, the DRLIS-based scheduler can converge the cost to a stable level after about 30 policy updates, while the Q-Learning-based scheduler usually takes about 60 updates to converge to a slightly higher level, and the DQN-based scheduler needs more than 80 updates to converge to the same level. Overall, compared with the Q-Learning-based scheduler, which can converge stably and with the fastest convergence speed among the baseline algorithms, the average performance of the DRLIS-based scheduler improves by 55%, 37%, and 50%, in terms of server load balancing, IoT application response time, and weighted cost, respectively.

Fig. 7. Cost vs. policy update analysis — evaluation phase.

6.3.2. Scheduling overhead analysis
In this section, we investigate the scheduling overhead of schedulers based on different techniques when handling IoT applications. The environment settings are the same as in Section 6.1.1, and the resolution of the IoT applications is set to 480. For each scheduler, we repeat the experiment for 100 rounds, feeding four IoT applications to the scheduler in each round. Besides, we define the average scheduling overhead as 𝑇𝑎𝑣𝑒 = 𝑇𝑡𝑜𝑡𝑎𝑙∕100, where 𝑇𝑡𝑜𝑡𝑎𝑙 represents the total overhead spent by the scheduler to handle the applications in 100 rounds.

Fig. 8 depicts the average scheduling overhead 𝑇𝑎𝑣𝑒 with a 95% Confidence Interval (CNFI) of schedulers based on different technologies when handling IoT applications. It is obvious that the scheduling overheads of the reinforcement learning techniques (i.e., DRLIS, DQN, Q-Learning) are usually lower than those of the metaheuristic techniques (i.e., NSGA2, NSGA3).


In addition, the 95% CNFI of the scheduling overhead of the reinforcement learning techniques is also much shorter than that of the metaheuristic techniques. Specifically, the scheduling overhead of DRLIS is more than 50% lower than NSGA2 and NSGA3, and more than 33% lower than DQN, but it is about 2 ms more than Q-Learning. However, considering that the convergence speed of DRLIS is much faster than that of Q-Learning, as discussed in Section 6.3.1, the increased overhead cost of DRLIS over Q-Learning can be considered negligible. Therefore, in the heterogeneous edge and fog computing environment, our proposed DRLIS-based algorithm can handle the weighted cost optimization problem of IoT applications more efficiently than the other techniques.

Fig. 8. Average scheduling overhead with a 95% CNFI.
heuristic techniques. Specifically, the scheduling overhead of DRLIS CRediT authorship contribution statement
is more than 50% lower than NSGA2 and NSGA3, and more than
33% lower than DQN, but it is about 2 ms more than Q-Learning. Zhiyu Wang: Problem formulation, Implementation of the work,
However, considering that the convergence speed of DRLIS is much Writing the paper. Mohammad Goudarzi: Problem formulation, De-
faster than that of Q-Learning, as discussed in Section 6.3.1, the in- sign of the DRL-based framework, Consultation for implementation
creased overhead cost of DRLIS over Q-Learning can be negligible. of DRL, Editing the paper. Mingming Gong: Supervisor, Providing
Therefore, in the heterogeneous edge and fog computing environment, technical advises. Rajkumar Buyya: Supervisor, Providing technical
our proposed DRLIS-based algorithm can handle the weighted cost advises.
optimization problem of IoT applications more efficiently than other
techniques. Declaration of competing interest

The authors declare that they have no known competing finan-


7. Conclusions and future work
cial interests or personal relationships that could have appeared to
influence the work reported in this paper.
In this paper, we proposed DRLIS, a DRL-based algorithm to solve
the weighted cost optimization problem for IoT applications scheduling Data availability
in heterogeneous edge and fog computing environments. First, we
proposed corresponding cost models for optimizing load balancing Data will be made available on request.


References

[1] G.S.S. Chalapathi, V. Chamola, A. Vaish, R. Buyya, Industrial internet of things (iiot) applications of edge and fog computing: A review and future directions, in: Fog/Edge Computing for Security, Privacy, and Applications, Springer, 2021, pp. 293–325.
[2] S. Azizi, M. Shojafar, J. Abawajy, R. Buyya, Deadline-aware and energy-efficient IoT task scheduling in fog computing systems: A semi-greedy approach, J. Netw. Comput. Appl. 201 (2022) 103333.
[3] A.J. Ferrer, J.M. Marquès, J. Jorba, Towards the decentralised cloud: Survey on approaches and challenges for mobile, ad hoc, and edge computing, ACM Comput. Surv. 51 (6) (2019) 1–36.
[4] M. Goudarzi, H. Wu, M. Palaniswami, R. Buyya, An application placement technique for concurrent IoT applications in edge and fog computing environments, IEEE Trans. Mob. Comput. 20 (4) (2020) 1298–1311.
[5] L. Catarinucci, D. de Donno, L. Mainetti, L. Palano, L. Patrono, M.L. Stefanizzi, L. Tarricone, An IoT-aware architecture for smart healthcare systems, IEEE Internet Things J. 2 (6) (2015) 515–526.
[6] L. Liu, S. Lu, R. Zhong, B. Wu, Y. Yao, Q. Zhang, W. Shi, Computing systems for autonomous driving: State of the art and challenges, IEEE Internet Things J. 8 (8) (2021) 6469–6486.
[7] M. Goudarzi, M.S. Palaniswami, R. Buyya, A distributed deep reinforcement learning technique for application placement in edge and fog computing environments, IEEE Trans. Mob. Comput. 22 (5) (2023) 2491–2505.
[8] A. Brogi, S. Forti, QoS-aware deployment of IoT applications through the fog, IEEE Internet Things J. 4 (5) (2017) 1185–1192.
[9] M. Goudarzi, M. Palaniswami, R. Buyya, Scheduling IoT applications in edge and fog computing environments: a taxonomy and future directions, ACM Comput. Surv. 55 (7) (2022) 1–41.
[10] X. Ma, H. Gao, H. Xu, M. Bian, An IoT-based task scheduling optimization scheme considering the deadline and cost-aware scientific workflow for cloud computing, EURASIP J. Wireless Commun. Networking 2019 (1) (2019) 1–19.
[11] Z. Wang, M. Goudarzi, J. Aryal, R. Buyya, Container orchestration in edge and fog computing environments for real-time iot applications, in: Proceedings of the Computational Intelligence and Data Analytics (ICCIDA), Springer, 2022, pp. 1–21.
[12] E. Li, Z. Zhou, X. Chen, Edge intelligence: On-demand deep learning model co-inference with device-edge synergy, in: Proceedings of the 2018 Workshop on Mobile Edge Communications, 2018, pp. 31–36.
[13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017, arXiv preprint arXiv:1707.06347.
[14] Q. Deng, M. Goudarzi, R. Buyya, Fogbus2: a lightweight and distributed container-based framework for integration of iot-enabled systems with edge and cloud computing, in: Proceedings of the International Workshop on Big Data in Emergent Distributed Environments, 2021, pp. 1–8.
[15] M. Goudarzi, Q. Deng, R. Buyya, Resource management in edge and fog computing using FogBus2 framework, 2021, arXiv preprint arXiv:2108.00591.
[16] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, in: Proceedings of the International Conference on Parallel Problem Solving from Nature, Springer, 2000, pp. 849–858.
[17] K. Deb, H. Jain, An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints, IEEE Trans. Evol. Comput. 18 (4) (2013) 577–601.
[18] C.J. Watkins, P. Dayan, Q-learning, Mach. Learn. 8 (1992) 279–292.
[19] J. Liu, Y. Mao, J. Zhang, K.B. Letaief, Delay-optimal computation task scheduling for mobile-edge computing systems, in: Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2016, pp. 1451–1455.
[20] C.-g. Wu, W. Li, L. Wang, A.Y. Zomaya, Hybrid evolutionary scheduling for energy-efficient fog-enhanced internet of things, IEEE Trans. Cloud Comput. 9 (2) (2021) 641–653.
[21] Y. Sun, F. Lin, H. Xu, Multi-objective optimization of resource scheduling in fog computing using an improved NSGA-II, Wirel. Pers. Commun. 102 (2) (2018) 1369–1385.
[22] F. Hoseiny, S. Azizi, M. Shojafar, F. Ahmadiazar, R. Tafazolli, PGA: A priority-aware genetic algorithm for task scheduling in heterogeneous fog-cloud computing, in: Proceedings of the IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2021, pp. 1–6.
[23] I.M. Ali, K.M. Sallam, N. Moustafa, R. Chakraborty, M. Ryan, K.-K.R. Choo, An automated task scheduling model using non-dominated sorting genetic algorithm II for fog-cloud systems, IEEE Trans. Cloud Comput. 10 (4) (2022) 2294–2308.
[24] F. Ramezani Shahidani, A. Ghasemi, A. Toroghi Haghighat, A. Keshavarzi, Task scheduling in edge-fog-cloud architecture: a multi-objective load balancing approach using reinforcement learning algorithm, Computing (2023) 1–23.
[25] J.-y. Baek, G. Kaddoum, S. Garg, K. Kaur, V. Gravel, Managing fog networks using reinforcement learning based load balancing algorithm, in: Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), IEEE, 2019, pp. 1–7.
[26] X. Jie, T. Liu, H. Gao, C. Cao, P. Wang, W. Tong, A DQN-based approach for online service placement in mobile edge computing, in: Proceedings of the 16th EAI International Conference on Collaborative Computing: Networking, Applications and Worksharing, Springer, 2021, pp. 169–183.
[27] X. Xiong, K. Zheng, L. Lei, L. Hou, Resource allocation based on deep reinforcement learning in IoT edge computing, IEEE J. Sel. Areas Commun. 38 (6) (2020) 1133–1146.
[28] J. Wang, L. Zhao, J. Liu, N. Kato, Smart resource allocation for mobile edge computing: A deep reinforcement learning approach, IEEE Trans. Emerg. Top. Comput. 9 (3) (2021) 1529–1541.
[29] L. Huang, X. Feng, C. Zhang, L. Qian, Y. Wu, Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing, Digit. Commun. Netw. 5 (1) (2019) 10–17.
[30] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, M. Bennis, Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning, IEEE Internet Things J. 6 (3) (2019) 4005–4018.
[31] Y. Zheng, H. Zhou, R. Chen, K. Jiang, Y. Cao, SAC-based computation offloading and resource allocation in vehicular edge computing, in: Proceedings of the IEEE INFOCOM - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2022, pp. 1–6.
[32] T. Zhao, F. Li, L. He, Secure video offloading in MEC-enabled IIoT networks: A multi-cell federated deep reinforcement learning approach, IEEE Trans. Ind. Inform. (2023) 1–12.
[33] L. Liao, Y. Lai, F. Yang, W. Zeng, Online computation offloading with double reinforcement learning algorithm in mobile edge computing, J. Parallel Distrib. Comput. 171 (2023) 28–39.
[34] V. Sethi, S. Pal, FedDOVe: A Federated Deep Q-learning-based Offloading for Vehicular fog computing, Future Gener. Comput. Syst. 141 (2023) 96–105.
[35] P. Li, W. Xie, Y. Yuan, C. Chen, S. Wan, Deep reinforcement learning for load balancing of edge servers in iov, Mob. Netw. Appl. 27 (4) (2022) 1461–1474.
[36] X. Chu, M. Zhu, H. Mao, Y. Qiu, Task offloading for multi-gateway-assisted mobile edge computing based on deep reinforcement learning, in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2022, pp. 3234–3241.
[37] F. Xue, Q. Hai, T. Dong, Z. Cui, Y. Gong, A deep reinforcement learning based hybrid algorithm for efficient resource scheduling in edge computing environment, Inform. Sci. 608 (2022) 362–374.
[38] S. Pallewatta, V. Kostakos, R. Buyya, Placement of microservices-based IoT applications in fog computing: A taxonomy and future directions, ACM Comput. Surv. 55 (2023) 1–43.
[39] W. Zhu, M. Goudarzi, R. Buyya, Flight: A lightweight federated learning framework in edge and fog computing, 2023, arXiv preprint arXiv:2308.02834.
[40] X. Qiu, W. Zhang, W. Chen, Z. Zheng, Distributed and collective deep reinforcement learning for computation offloading: A practical perspective, IEEE Trans. Parallel Distrib. Syst. 32 (5) (2020) 1085–1101.
[41] N.H. Tran, W. Bao, A. Zomaya, M.N. Nguyen, C.S. Hong, Federated learning over wireless networks: Optimization model design and analysis, in: Proceedings of the IEEE INFOCOM, IEEE, 2019, pp. 1387–1395.
[42] J. Ji, K. Zhu, L. Cai, Trajectory and communication design for cache-enabled UAVs in cellular networks: A deep reinforcement learning approach, IEEE Trans. Mob. Comput. (2022).
[43] S. Fujimoto, H. van Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, in: Proceedings of Machine Learning Research, 80, PMLR, 2018, pp. 1587–1596.
[44] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, in: Proceedings of Machine Learning Research, 80, PMLR, 2018, pp. 1861–1870.
[45] S. Mysore, B.E. Mabsout, R. Mancuso, K. Saenko, Honey. I shrunk the actor: A case study on preserving performance with smaller actors in actor-critic RL, in: 2021 IEEE Conference on Games (CoG), 2021, pp. 01–08.
[46] P. Christodoulou, Soft actor-critic for discrete action settings, 2019, arXiv preprint arXiv:1910.07207.
[47] H. Wang, Y. Ye, J. Zhang, B. Xu, A comparative study of 13 deep reinforcement learning based energy management methods for a hybrid electric vehicle, Energy 266 (2023) 126497.
[48] J. Zhu, F. Wu, J. Zhao, An overview of the action space for deep reinforcement learning, in: Proceedings of the 4th International Conference on Algorithms, Computing and Artificial Intelligence, 2021, pp. 1–10.
[49] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Adv. Neural Inf. Process. Syst. 12 (1999).
[50] R. Huang, T. Yu, Z. Ding, S. Zhang, Policy gradient, in: Deep Reinforcement Learning: Fundamentals, Research and Applications, Springer, 2020, pp. 161–212.
[51] J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, Trust region policy optimization, in: Proceedings of the International Conference on Machine Learning, PMLR, 2015, pp. 1889–1897.


[52] T. Van Erven, P. Harremos, Rényi divergence and Kullback-Leibler divergence, IEEE Trans. Inform. Theory 60 (7) (2014) 3797–3820.
[53] S. Shao, W. Luk, Customised pearlmutter propagation: A hardware architecture for trust region policy optimisation, in: Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2017, pp. 1–6.
[54] S. Li, R. Wang, M. Tang, C. Zhang, Hierarchical reinforcement learning with advantage-based auxiliary rewards, Adv. Neural Inf. Process. Syst. 32 (2019).
[55] S. Aljanabi, A. Chalechale, Improving IoT services using a hybrid fog-cloud offloading, IEEE Access 9 (2021) 13775–13788.
[56] L. Yliniemi, K. Tumer, Multi-objective multiagent credit assignment in reinforcement learning and nsga-ii, Soft Comput. 20 (10) (2016) 3869–3887.
[57] J. Blank, K. Deb, Pymoo: Multi-objective optimization in python, IEEE Access 8 (2020) 89497–89509.
[58] X. Li, X. Li, K. Wang, S. Yang, Y. Li, Achievement scalarizing function sorting for strength Pareto evolutionary algorithm in many-objective optimization, Neural Comput. Appl. 33 (2021) 6369–6388.
[59] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep reinforcement learning that matters, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
[60] D. Wei, N. Xi, X. Ma, M. Shojafar, S. Kumari, J. Ma, Personalized privacy-aware task offloading for edge-cloud-assisted industrial internet of things in automated manufacturing, IEEE Trans. Ind. Inform. 18 (11) (2022) 7935–7945.
[61] N. Bjorck, C.P. Gomes, B. Selman, K.Q. Weinberger, Understanding batch normalization, Adv. Neural Inf. Process. Syst. 31 (2018).
[62] C. Huang, R. Mo, C. Yuen, Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning, IEEE J. Sel. Areas Commun. 38 (8) (2020) 1839–1850.
[63] F. Fu, Y. Kang, Z. Zhang, F.R. Yu, Transcoding for live streaming-based on vehicular fog computing: An actor-critic DRL approach, in: Proceedings of the IEEE INFOCOM - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2020, pp. 1015–1020.
[64] R. Islam, P. Henderson, M. Gomrokchi, D. Precup, Reproducibility of benchmarked deep reinforcement learning tasks for continuous control, 2017, arXiv preprint arXiv:1708.04133.

Zhiyu Wang is working towards the Ph.D. degree at the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, the University of Melbourne. His research interests include Edge/Fog Computing, Internet of Things (IoT), Distributed Systems, and Artificial Intelligence. He is currently working on leveraging AI techniques to enhance resource management in edge/fog and cloud computing environments.

Mohammad Goudarzi is working as a senior research associate at the University of New South Wales (UNSW), Sydney, Australia. He completed his Ph.D. at the Department of Computing and Information Systems, the University of Melbourne, Australia. His research interests include Internet of Things (IoT), Cloud/Fog/Edge Computing, Cyber Security, Distributed Systems, and Machine Learning. He received Oracle's Cloud Architect of the Year Award 2022, IEEE TCCLD Outstanding PhD Thesis Award 2022, IEEE TCSC Outstanding PhD Dissertation Award 2022, and IEEE Outstanding Service Award 2021. Media outlets such as the Australian Financial Review (AFR) and Oracle Blogs have covered his research findings.

Mingming Gong is currently a Senior Lecturer in data science with the School of Mathematics and Statistics, The University of Melbourne, Melbourne, VIC, Australia. He has authored/coauthored more than 30 research papers at top venues, such as the International Conference on Machine Learning (ICML), Neural Information Processing Systems (NeurIPS), the IEEE Transactions on Neural Networks and Learning Systems, and the IEEE Transactions on Image Processing, with more than ten oral/spotlight presentations. His research interests include causal reasoning, machine learning, and computer vision. Dr. Gong received the Discovery Early Career Researcher Award from the Australian Research Council in 2020.

Dr. Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He has authored over 625 publications and seven text books including "Mastering Cloud Computing" published by McGraw Hill, China Machine Press, and Morgan Kaufmann for Indian, Chinese and international markets respectively. He is one of the highly cited authors in computer science and software engineering worldwide (h-index=160, g-index=322, 137900+ citations).

