
Computers & Security 135 (2023) 103502


A soft actor-critic reinforcement learning algorithm for network intrusion detection

Zhengfa Li a,b, Chuanhe Huang a,b,∗, Shuhua Deng c, Wanyu Qiu a,b, Xieping Gao d
a School of Computer Science, Wuhan University, Wuhan, China
b Hubei LuoJia Laboratory, Wuhan, China
c Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, China
d College of Information Science and Engineering, Hunan Normal University, Changsha, China

A R T I C L E  I N F O

Keywords:
Network security
Anomaly detection
Network intrusion detection
Deep reinforcement learning
Soft actor-critic

A B S T R A C T

Network intrusion detection plays a very important role in network security. Although current deep learning-based intrusion detection algorithms have achieved good detection performance, there are still limitations in dealing with unbalanced datasets and identifying minority attacks and unknown attacks. In this paper, we propose an intrusion detection model, AE-SAC, based on adversarial environment learning and the soft actor-critic reinforcement learning algorithm. First, this paper introduces an environmental agent for training data resampling to solve the imbalance problem of the original data. Second, rewards are redefined in reinforcement learning. In order to improve the recognition rate of minority categories of network attacks, we set different reward values for different categories of attacks. The environment agent and classifier agent are trained adversarially around maximizing their respective reward values. Finally, multi-classification experiments are conducted on the NSL-KDD and AWID datasets to compare with existing excellent intrusion detection algorithms. AE-SAC achieves excellent classification performance with an accuracy of 84.15% and an f1-score of 83.97% on the NSL-KDD dataset, and an accuracy and an f1-score over 98.9% on the AWID dataset.

1. Introduction

Intrusion detection technology employs an active defense strategy that enables timely and accurate warning before intrusions have a bad impact on computer systems, and builds a resilient defense system in real time to avoid, transfer, and reduce the risks faced by information systems. Intruders are detected and responded to in a timely manner through continuous monitoring and analysis of network activity. Without a doubt, intrusion detection plays a critical role in network security. Based on intrusive behaviors, intrusion detection systems (IDS) can be divided into two types: host-based intrusion detection systems (HIDS) and network-based intrusion detection systems (NIDS) (Mishra et al., 2018). HIDS monitor individual hosts or devices and send alerts to users when suspicious activity is detected. Unlike HIDS, NIDS are usually placed at network points such as gateways or routers to check for intrusions in network traffic. In this paper, we will focus on network intrusion detection algorithms.

Machine learning methods for intrusion detection have been studied by researchers for over 20 years. The large amount of network telemetry and other types of security data makes the intrusion detection problem solvable by machine learning methods. These methods can be broadly classified into two categories: shallow learning or classical models and deep learning models (Gamage and Samarabandu, 2020). The structure of a shallow learning model can basically be viewed as having one layer of hidden layer nodes (e.g., SVM, Boosting) or no hidden layer nodes (e.g., Logistic Regression). Intrusion detection models based on shallow machine learning algorithms have problems such as difficulty in processing large-scale network intrusion data (Shone et al., 2018; Zhang et al., 2020a), poor ability to identify various new attacks, a high false positive rate, and over-reliance on researchers for feature design and feature selection (Thaseen and Kumar, 2017). Compared to shallow learning, currently used deep learning models are neural network models with a large number of hidden layers. These models are capable of learning highly complex nonlinear functions, and the hierarchical arrangement of layers enables them to learn useful feature representations from the input data (Dong and Wang, 2016; Pingale and Sutar, 2022; Hou et al., 2022), which can overcome the limitations of classical models. In fact, deep learning has yielded good results in the field of intrusion detection.

* Corresponding author at: School of Computer Science, Wuhan University, Wuhan, China.
E-mail addresses: zhengfali@[Link] (Z. Li), huangch@[Link] (C. Huang), shuhuadeng@[Link] (S. Deng), wanyu3135@[Link] (W. Qiu),
xpgao@[Link] (X. Gao).

[Link]
Received 25 November 2022; Received in revised form 15 March 2023; Accepted 20 September 2023
Available online 26 September 2023
0167-4048/© 2023 Elsevier Ltd. All rights reserved.

For example, Cil et al. (2021) proposed a deep neural network (DNN) model that has an attack detection success rate (the attack detection success rate can be understood as the precision of the intrusion detection model) of 99.99% on the CICDDoS2019 (Sharafaldin et al., 2019) dataset for network traffic and an accuracy of 94.57% for classification of attack types.

In addition to DNN models, Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), Deep Belief Networks (DBN), Stacked Autoencoders (SAE) and hybrid neural network models are also widely used for intrusion detection. Although the application of deep learning in the field of intrusion detection has achieved the expected research results, there are still many problems to be solved. On the one hand, deep learning models are very sensitive to the dataset used for training. However, the network intrusion detection data obtained in real network environments often contain a large amount of normal behavior data and a small amount of attack behavior data, resulting in an extremely unbalanced dataset. This leads to poor recognition of minority intrusions by deep learning models. On the other hand, building good models is extremely difficult. As we all know, the number of parameters in deep learning models is huge and tuning them is time consuming.

The emergence of deep reinforcement learning (DRL) has provided a new approach to solving the intrusion detection problem. In fact, the intrusion detection problem can be viewed as an optimal decision problem; that is, given any network traffic, the IDS must determine whether it is a normal or anomalous flow, or whether it falls into the category of attack traffic. Lopez-Martin et al. (2020) used four deep reinforcement learning algorithms, Deep Q-Network (DQN), Double Deep Q-Network (DDQN), Policy Gradient (PG), and Actor-Critic (AC), for intrusion detection. It is very noteworthy that, in contrast to deep learning models, deep reinforcement learning models do not require complex neural network models [1] and often require only a few simple layers of neural networks. Although these reinforcement learning algorithms have achieved good results in intrusion detection, there are still challenges in dealing with data imbalance and multi-classification of cyber attacks. In order to solve the network intrusion detection problem using reinforcement learning, it is necessary to map the label space of intrusion detection (the number of categories of network attacks) to the action space of reinforcement learning. This implies that if there are more categories of network attacks, then reinforcement learning algorithms are required to have stronger action-space exploration capabilities.

In this paper, we propose a deep reinforcement learning model based on adversarial learning for network intrusion detection. The Soft Actor-Critic (SAC) (Haarnoja et al., 2018a,b; Christodoulou, 2019) model, which has the best exploration ability, is taken as the base model of this paper. First, we modify the original reinforcement learning framework by adding an additional environment agent. The role of the environment agent is to resample the original training dataset and change its imbalance. The resampled dataset is used to train the main agent (classifier agent). Second, the reward function is redefined so that the rewards of the classifier agent and the environment agent form an adversarial relationship. Different reward values are set for different classes of network attacks, giving larger rewards for the minority attacks and smaller rewards for the majority attacks. Then, the classifier agent and the environment agent maximize their respective reward values through an adversarial learning model. Finally, the policy learned by the classifier is used to detect anomalous traffic. The model of this paper was evaluated on the public datasets NSL-KDD (Tavallaee et al., 2009) and Aegean WiFi Intrusion Dataset (AWID) (Kolias et al., 2015), and our algorithm achieved excellent results in terms of accuracy, precision, recall and f1-score when compared with the current state-of-the-art algorithms in the multi-classification case.

The main contributions of the work are as follows:

• We formalize the intrusion detection problem as a Markov decision problem and propose an intrusion detection algorithm based on SAC deep reinforcement learning. On top of the original reinforcement learning framework, an additional environmental agent is introduced for resampling the original training dataset.
• We redefine the reward function in reinforcement learning to better handle the imbalance of the original dataset. Different reward values are set for different classes of network attacks. For minority categories of attacks, the classifier agent receives a higher reward value if it identifies them successfully; otherwise the environment agent receives a higher reward value. The classifier agent and the environment agent are trained in parallel in an adversarial manner.
• We choose excellent deep learning models and reinforcement learning models as comparison algorithms and conduct multi-classification experiments on the publicly available datasets NSL-KDD [2] and AWID [3]. The code of AE-SAC can be found in the GitHub repository [4].

The rest of this paper is organized as follows. Section 2 presents the current state of network intrusion detection research. Section 3 describes the proposed intrusion detection framework. Section 4 discusses the experimental results, and Section 5 summarizes the study's findings and looks ahead to future research.

2. Related work

In this section, we summarize the research progress related to intrusion detection from the perspectives of deep learning and reinforcement learning.

2.1. Deep learning-based intrusion detection

Shone et al. (2018) proposed an intrusion detection model based on the unsupervised feature learning method of the nonsymmetric deep autoencoder; evaluations on the KDDCup 99 [5] and NSL-KDD datasets have yielded promising results.

Muna et al. (2018) proposed an intrusion detection model for Internet industrial control systems based on deep autoencoders and deep feedforward neural networks. According to the experimental results, the model has a higher detection rate and a lower false positive rate.

Zhang et al. (2020b) proposed a hybrid intrusion detection model based on multiscale CNN (MSCNN) and LSTM. The model first analyzes the spatial features of the dataset with a MSCNN, and then processes the temporal features with a LSTM. Finally, the model classifies using spatiotemporal features.

Popoola et al. (2020) proposed a hybrid LSTM-Autoencoder and Bidirectional LSTM (LAE-BLSTM) intrusion detection method. LAE reduces the dimensionality of network traffic features, freeing up memory space and increasing computational speed. BLSTM, on the other hand, learns the long-term temporal relationships between low-dimensional features in order to correctly distinguish benign traffic from various types of botnet attack traffic.

To detect network intrusions, Hassan et al. (2020) proposed a hybrid deep learning model based on CNN and weight-dropped long short-term

[1] The complexity of neural networks consists of spatial complexity and temporal complexity. The spatial complexity is represented by the number of layers of the neural network and the number of parameters to be optimized in the neural network. The time complexity can be measured by the number of floating point operations. Undoubtedly, both spatial and temporal complexity are closely related to the number of hidden layers and the number of neurons per layer.
[2] [Link]
[3] [Link]
[4] [Link]
[5] [Link]

memory networks (WDLSTM). To avoid overfitting, the model employs CNN to extract meaningful features from IDS big data, while WDLSTM preserves long-term dependencies among the extracted features.

Wang and Li (2021) created a hybrid neural network architecture that combines transformers and CNN to detect DDoS attacks (DDosTC) on software-defined networking, and tested it on the most recent dataset CICDDoS2019 (Sharafaldin et al., 2019). The experimental results show that DDosTC outperforms the current optimal model.

Han et al. (2021) proposed a network attack detection model combining sparse autoencoder and kernel methods, and used an iterative method of adaptive genetic algorithm to optimize the objective function of the combined kernel with sparse autoencoder. The model was also trained and evaluated using a dataset based on IoT botnet attacks.

Imran et al. (2022) proposed an intelligent and efficient network intrusion detection system based on deep learning. This paper uses a novel stacked asymmetric deep auto-encoder for unsupervised feature learning combined with a support vector machine classifier for network intrusion detection. The experimental evaluation was performed on KDDCup 99 with an accuracy of 99.65%.

Lan et al. (2022) proposed a multi-task learning model with hybrid deep features. Based on a CNN with embedded space and a channel attention mechanism, two auxiliary tasks (an auto-encoder enhanced with a memory module and a distance-based prototype network) are introduced in an innovative way to improve the model's generalization ability and mitigate performance degradation in an unbalanced network environment.

2.2. Reinforcement learning-based intrusion detection

DRL has sparked a flurry of research and applications since AlphaGo defeated human Go masters (Demis, 2016). DRL is currently producing promising research results in the field of intrusion detection.

Sethi et al. (2020b) created a context-adaptive intrusion detection model that detects and classifies new and complex attacks using multiple independent deep reinforcement learning agents distributed throughout the network. The model was extensively tested on the NSL-KDD, UNSW-NB15 (Moustafa and Slay, 2015) [6], and AWID datasets and demonstrated superior accuracy and a lower false positive rate when compared to state-of-the-art systems.

Caminero et al. (2019) presented the first application of adversarial reinforcement learning in intrusion detection, as well as a new technique for incorporating environmental behavior into the learning process of improved reinforcement learning algorithms. This model incorporates a reinforcement and supervisory framework to generate environments that interact with pre-recorded sample data sets formed by network features and relevant intrusion labels, and it chooses samples with optimal policies to achieve the best classification results.

Ma and Shi (2020) proposed an intrusion detection model combining reinforcement learning and class imbalance techniques based on the literature (Caminero et al., 2019). This model addressed the class imbalance issue by introducing an adaptive SMOTE (Synthetic Minority Over-Sampling Technique) [7] and re-modeling the behavior of the environment agents to improve performance.

Sethi et al. (2020a) investigated intrusion detection in cloud platforms, taking into account their vulnerability to novel attacks and the inability of existing detection models to strike a balance between high accuracy and a low false positive rate. The authors proposed an adaptive cloud IDS architecture based on DRL that addresses the limitations mentioned above while also providing accurate detection and fine-grained classification of novel and complex attacks.

Lopez-Martin et al. (2020) improved the classical DRL model to allow it to be used for intrusion detection, replacing the environment in the traditional DRL with a sampling function for training intrusion data that generates rewards based on error detection during the training phase. The literature presents the experimental results of four deep reinforcement learning algorithms, DQN, DDQN, PG, and AC, on the NSL-KDD and AWID datasets, with DDQN outperforming the others.

Dong et al. (2021) proposed a semi-supervised double-deep Q-network-based network anomaly traffic detection method (SSDDQN). SSDDQN employs an auto-encoder to reconstruct network traffic features, while the K-means algorithm is used to improve the model's ability to detect unknown attacks. The model is tested on the NSL-KDD and AWID datasets and performs well in several metrics.

Zhou et al. (2021) designed an adaptable asynchronous advantage actor-critic reinforcement learning model for intrusion detection. For both sequence and image anomalies, the model employs an attention mechanism neural network and a convolutional neural network. The model was tested on the NSL-KDD, AWID, and CICIDS 2017 [8] datasets (Jazi et al., 2017), and it outperformed or achieved comparable results to other anomaly detection models.

Sethi et al. (2021) developed a distributed attack detection method based on DRL that employs DQN across multiple distributed agents while efficiently detecting and classifying advanced network attacks through the use of attention mechanisms. The model was evaluated on the NSL-KDD and CICIDS 2017 datasets and achieved excellent results in terms of accuracy, precision, and recall compared to the state-of-the-art methods.

Alavizadeh et al. (2022) proposed a network intrusion detection method based on Q-Learning (QL) and deep feed-forward neural networks. The method uses DQL to provide continuous automatic learning for the network environment, while detecting various types of network intrusions using automatic trial-and-error methods and continuously improving its detection capability.

In conclusion, researchers in the field of intrusion detection, whether in deep learning or reinforcement learning, have focused on the design of detection models while ignoring the imbalance of the original dataset. Caminero et al. (2019) employ adversarial learning to sample balanced training data with environmental agents, while Ma and Shi (2020) introduce the SMOTE technique to address training data imbalance. Despite the fact that these two models produced good results, there is still much room for improvement. We address the imbalance of the original training data in this paper from two perspectives: the selection of the reinforcement learning model and the design of the reward function. On the other hand, existing intrusion detection models (Cil et al., 2021; Imran et al., 2022), although achieving excellent classification performance, focus more on binary classification and neglect multi-categorization of anomalous traffic. In fact, in the field of network intrusion, classifying attack categories in more detail plays a crucial role in subsequent defense. The focus of this paper is on the multi-categorization of network intrusions.

3. Proposed method for network intrusion detection

We propose the AE-SAC method for network intrusion detection, which is based on adversarial learning and the SAC reinforcement learning model. As illustrated in Fig. 1, the AE-SAC model includes an environmental agent for data sampling. In fact, both the environment agent and the classifier agent are implemented with the SAC model. Therefore, we introduce the SAC model in detail before introducing the AE-SAC model proposed in this paper. The meanings of the symbols used in the paper are listed in Table 1.

3.1. Soft actor-critic

Unlike general reinforcement learning, which maximizes the cumulative reward expectation by learning a policy, the SAC algorithm

[6] [Link]
[7] [Link]
[8] [Link]


Fig. 1. AE-SAC Framework. Environment Agent: resampling of the training set; Classifier Agent: learning strategies for network traffic identification.

requires that the action entropy of each policy output be maximized in addition to the basic goal stated above. This means that SAC must learn an optimal policy 𝜋⋆ that maximizes the sum of the cumulative reward value and the policy's entropy value, as shown in equation (1):

𝜋⋆ = arg max𝜋 ∑𝑡 𝔼 (𝑅(𝑎𝑡, 𝑠𝑡) + 𝛼𝐻(𝜋(⋅|𝑠𝑡))),        (1)

where 𝛼 determines the relative importance of the entropy term with respect to the reward and is referred to as the temperature parameter, while 𝐻(𝜋(⋅|𝑠𝑡)) is the entropy of the policy 𝜋 at the state 𝑠𝑡. It is calculated as:

𝐻(𝜋(⋅|𝑠𝑡)) = − log 𝜋(⋅|𝑠𝑡).        (2)

For more theoretical derivations and details, please refer to references (Haarnoja et al., 2018a,b; Christodoulou, 2019). In this paper, we focus on applying the discrete SAC algorithm to solve the intrusion detection problem.

Fig. 2. SAC Framework. Actor Network: action probability distribution function; V Critic Network: estimate the state value; Q Critic Network: estimate the state action values; Replay Memory: store experience data.

Table 1
Summary of symbols and their meanings.

𝑠𝑡: the state at moment 𝑡
𝑎𝑡: the action at moment 𝑡
𝑟𝑡: the reward value obtained at moment 𝑡
𝑅(𝑎, 𝑠): the reward value obtained by performing action 𝑎 in state 𝑠
𝜋: policy function; it represents the decision component, that is, what action 𝑎 should be taken in state 𝑠
𝜋(𝑎|𝑠): a conditional probability distribution function, that is, the probability of taking action 𝑎 in state 𝑠
𝐻(𝜋(⋅|𝑠𝑡)): the entropy of the policy 𝜋 at the state 𝑠𝑡
𝑄(𝑎, 𝑠): the action state value function, that is, the value obtained by performing action 𝑎 in state 𝑠
𝑉(𝑠): the value function of the state 𝑠
𝛾: the discount rate of reward
𝛼: the temperature parameter that determines the importance of entropy
𝑀: a batch of historical data taken from the experience replay memory

3.1.1. Overview of SAC
As shown in Fig. 2, the SAC architecture contains five neural networks: an Actor network, two V Critic networks (an evaluation network and a target network), and two Q Critic networks (the 𝑄0 and 𝑄1 networks). The Actor network is used to obtain the probability distribution of the actions, the V networks are used to estimate the state values, and the Q networks are used to estimate the state action values.

In the SAC algorithm, we introduce an experience replay (Mnih et al., 2013), similar to reinforcement learning algorithms such as DDQN. Assume the state at time 𝑡 is 𝑠𝑡. The probability 𝜋(𝑎|𝑠𝑡) of all actions is obtained via the Actor network, the action 𝑎𝑡 is sampled based on the probability 𝜋(𝑎|𝑠𝑡), and then 𝑎𝑡 is input to the environment to obtain 𝑟𝑡+1 and 𝑠𝑡+1, resulting in an experience (𝑠𝑡, 𝑎𝑡, 𝑟𝑡+1, 𝑠𝑡+1), which is then added to the experience replay memory. In reinforcement learning, the current state 𝑠𝑡 of the environment is frequently transferred from the previous state 𝑠𝑡−1, and the future state 𝑠𝑡+1 of the environment is transferred from the current state. Therefore, the experience data has a temporal dependency (Mnih et al., 2013). The purpose of the experience replay is to break up the correlation of experiences (Schaul et al., 2015; Zhang and Sutton, 2017): a batch of experiences is randomly selected from the experience replay memory when training the neural network, so that the neural network can be trained better.

Table 2 lists the inputs, outputs, and loss functions of the Actor, Q Critic, and V Critic networks. 𝑎′𝑡 is not the action 𝑎𝑡 in an experience drawn from the experience replay memory, but rather all possible actions predicted by reusing the Actor network. The action state value function 𝑄0 can be replaced by 𝑄1. 𝑀 represents a batch of historical data taken from the experience replay memory. 𝔼 denotes expectation. The two Q networks are trained separately using the same update method and MSE (Mean Square Error) as the loss function. One of the two V networks is the evaluation network and the other is the target network. The target network is updated using soft update.

3.1.2. RL paradigm for intrusion detection
In fact, the intrusion detection problem can be viewed as a reinforcement learning problem with a discrete action space. The current DRL input has two parts: state and action. Furthermore, the dataset we use includes two components: network traffic features and feature labels. To reconcile network traffic elements with the DRL concept, we consider network traffic features to be states and feature labels to be actions. As shown in the upper part of Fig. 3, 𝑆 = {𝑠0, 𝑠1, ..., 𝑠𝑛, 𝑠𝑛+1} is the network traffic feature set and 𝐴 = {𝑎0, 𝑎1, ..., 𝑎𝑛, 𝑎𝑛+1} is the feature label set.

General reinforcement learning uses random sampling to read the original dataset and begins sampling at time 𝑡. Each sample is made up of three parts: (1) the network traffic characteristics 𝑠𝑡 at time 𝑡; (2) the real network traffic label 𝑎𝑡 at time 𝑡; (3) the network traffic characteristics 𝑠𝑡+1 at time 𝑡 + 1. Before random sampling, the original data is divided into 𝑁 + 1 random samples and randomly arranged. In this paper, we use an environment agent for random sampling; please refer to subsection 3.2.3 for details.

3.2. AE-SAC intrusion detection method

In the previous section, we introduced the network structure and update method of the SAC model. In this section, we describe in detail how to use the SAC model for intrusion detection.
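The state-to-feature and action-to-label mapping described in Section 3.1.2 can be illustrated with a small sketch (synthetic arrays and all names here are ours, not the paper's; the real datasets would be NSL-KDD or AWID feature matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a labeled intrusion dataset:
# X holds per-flow feature vectors (the states), y holds class labels (the actions).
X = rng.normal(size=(100, 6))        # 100 flows, 6 features each
y = rng.integers(0, 5, size=100)     # 5 traffic classes, as in 5-class NSL-KDD

def sample_transitions(X, y, n):
    """Randomly draw n (s_t, a_t, s_{t+1}) triples: the feature vector is the
    state, the true label is the action, and the next randomly drawn flow
    provides s_{t+1} (plain random sampling, before any environment agent)."""
    idx = rng.integers(0, len(X), size=n + 1)   # random arrangement of samples
    return [(X[idx[t]], int(y[idx[t]]), X[idx[t + 1]]) for t in range(n)]

transitions = sample_transitions(X, y, 10)
s_t, a_t, s_next = transitions[0]
```

This is the "general reinforcement learning" baseline sampling; AE-SAC replaces the uniform `idx` draw with the environment agent's policy, as described in subsection 3.2.3.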


Table 2
Inputs, outputs and loss functions of each network (Actor, Q Critic and V Critic) in the SAC network architecture.

Actor. Input: state 𝑠𝑡. Output: action probability 𝜋(𝑎|𝑠).
  Loss: 𝐿𝐴 = −(1/|𝑀|) ∑𝑒∈𝑀 𝔼𝑎′𝑡∼𝜋(⋅|𝑠𝑡)[𝑄0(𝑎′𝑡, 𝑠𝑡) − 𝛼 log(𝜋(𝑎′𝑡|𝑠𝑡))],  𝑒 = (𝑠𝑡, 𝑎𝑡, 𝑟𝑡+1, 𝑠𝑡+1)

Q Critic. Input: state 𝑠𝑡. Output: action state value 𝑄(𝑎𝑡, 𝑠𝑡).
  Loss: 𝐿𝑄 = (1/|𝑀|) ∑𝑒∈𝑀 [𝑄(𝑠𝑡, 𝑎𝑡) − 𝑈𝑞𝑡]²,  𝑈𝑞𝑡 = 𝑟𝑡+1 + 𝛾𝑉(𝑠𝑡+1)

V Critic. Input: state 𝑠𝑡. Output: state value 𝑉(𝑠𝑡).
  Loss: 𝐿𝑉 = (1/|𝑀|) ∑𝑒∈𝑀 [𝑉(𝑠𝑡) − 𝑈𝑣𝑡]²,  𝑈𝑣𝑡 = 𝔼𝑎′𝑡∼𝜋(⋅|𝑠𝑡)[min𝑖=0,1 𝑄𝑖(𝑎′𝑡, 𝑠𝑡) − 𝛼 log(𝜋(𝑎′𝑡|𝑠𝑡))]
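To make the losses in Table 2 concrete, the following is a minimal numpy sketch of the three loss computations for a discrete action space. The network outputs here are random stand-in arrays, not a trained model, and the shapes and variable names are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
B, A = 4, 5            # batch size |M| and number of discrete actions
alpha, gamma = 0.2, 0.99

# Stand-ins for network outputs on a batch of states s_t:
pi = rng.dirichlet(np.ones(A), size=B)   # actor: pi(a|s_t), each row sums to 1
q0 = rng.normal(size=(B, A))             # Q0 critic: Q0(a, s_t) for every action
q1 = rng.normal(size=(B, A))             # Q1 critic
v = rng.normal(size=B)                   # V evaluation network: V(s_t)
v_target_next = rng.normal(size=B)       # V target network: V(s_{t+1})
r = rng.normal(size=B)                   # rewards r_{t+1} from the replay memory
a = rng.integers(0, A, size=B)           # stored actions a_t

# Actor loss: -E_{a'~pi}[Q0(a', s) - alpha * log pi(a'|s)]; with a discrete
# action space the expectation over a' is computed exactly as a weighted sum.
L_A = -np.mean(np.sum(pi * (q0 - alpha * np.log(pi)), axis=1))

# Q loss (same form for both critics): MSE against r_{t+1} + gamma * V_target(s_{t+1}).
U_q = r + gamma * v_target_next
L_Q0 = np.mean((q0[np.arange(B), a] - U_q) ** 2)

# V loss: MSE against E_{a'~pi}[min_i Q_i(a', s) - alpha * log pi(a'|s)].
U_v = np.sum(pi * (np.minimum(q0, q1) - alpha * np.log(pi)), axis=1)
L_V = np.mean((v - U_v) ** 2)
```

In a full implementation each loss would be minimized by gradient descent on the corresponding network's parameters, and the V target network would track the evaluation network via soft updates.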

receive a larger reward by recognizing or correctly classifying more


data. As a result, the design of the reward function is critical.
Unlike the simple 0/1 reward function taken in the literature (Lopez-
Martin et al., 2020; Caminero et al., 2019), this paper modifies the 0/1
reward function on this basis. The goal of introducing environmental
agents in this paper is to sample more balanced data; however, because
the proportion of normal traffic in the training dataset is much larger,
there is a different proportion of traffic in different categories even in
abnormal traffic. In order to enable the classifier to identify anomalous
traffic in a few categories, this paper assigns a higher reward value to
the anomalous traffic in a few categories. Equations (3) and (4) describe
the settings of the classifier agent and environment agent reward values,
respectively.

⎧ 0, 𝑎𝑐𝑡 ! = 𝑎𝑡 ,

𝑟𝑐𝑡+1 = ⎨ 1, 𝑎𝑐𝑡 = 𝑎𝑡 , 𝑎𝑡 ∈ 𝐴𝐿 , (3)
⎪ 2, 𝑎𝑐𝑡 = 𝑎𝑡 , 𝑎𝑡 ∈ 𝐴𝑀 ,

⎧ 0, 𝑎𝑐𝑡 = 𝑎𝑡 ,

𝑟𝑒𝑡+1 = ⎨ 1, 𝑎𝑐𝑡 ! = 𝑎𝑡 , 𝑎𝑡 ∈ 𝐴𝐿 , (4)
⎪ 2, 𝑎𝑐𝑡 ! = 𝑎𝑡 , 𝑎𝑡 ∈ 𝐴𝑀 ,

where 𝑎𝑐𝑡 is the label chosen by the classifier agent for the current flow
(𝑠𝑡 represents the network traffic characteristics of the current data
flow). 𝑎𝑡 is the actual label for that flow, 𝐴𝐿 indicates that the flow
belongs to the category with a high percentage and 𝐴𝑀 indicates that
Fig. 3. Random sampling from the original dataset, network traffic features are
mapped to the state space and traffic labels are mapped to the action space. the flow belongs to the category with a low percentage. The design of
𝑟𝑐𝑡+1 and 𝑟𝑒𝑡+1 is consistent with the pattern of environment agents and
classifier agents confronting each other around reward values. In fact,
3.2.1. Overview of AE-SAC
this method of designing reward values may not be optimal, but it has
Network intrusion data sets are typically uneven, with a high pro-
been proven to be more effective through extensive experiments.
portion of normal traffic and very little abnormal traffic. Even if the
random sampling method (Lopez-Martin et al., 2020) is used, the sam-
pled data still contains a significant amount of normal traffic, making it 3.2.3. Sampling of training data
impossible to solve the unbalanced data problem. Therefore, as shown As shown in Fig. 4, this paper uses an environmental agent to sam-
in Fig. 1, we modified the original reinforcement learning framework to include an environment agent that provides intelligent behavior for the environment beyond random dataset sampling. This modification enables us to address dataset imbalance by having the environment agent perform dynamic, intelligent resampling of the dataset during training.

The environment agent and the classifier agent learn adversarially and iteratively, with the ultimate goal of teaching the classifier agent a near-optimal policy. The environment agent is responsible for sampling from the training set to provide training data for the classifier, and the classifier agent trains and learns on the data provided by the environment agent. If the classifier correctly classifies the batch of training data collected by the environment agent, the classifier is rewarded; otherwise, the environment agent is rewarded.

In the following, we describe the design of the reward function, the sampling performed by the environment agent, and the training of the two agents.

3.2.2. Design of the reward function

The environment agent and the classifier agent perform adversarial learning, each seeking to obtain the maximum reward. The environment agent earns the maximum reward by sampling more data that the classifier cannot recognize or classifies incorrectly. Conversely, classifiers

ple the training dataset. The environment agent uses the policy 𝜋𝑒 to select an action label 𝑎𝑒𝑡 for state 𝑠𝑡. This action label 𝑎𝑒𝑡 is then used to select the next state 𝑠𝑡+1 in the training set (the actual action label 𝑎𝑡 is also obtained at this point). Next, the action label 𝑎𝑐𝑡 of state 𝑠𝑡+1 is obtained using the classifier agent's policy 𝜋𝑐, and the reward 𝑟𝑒𝑡+1 for the environment agent and the reward 𝑟𝑐𝑡+1 for the classifier agent are calculated. At this point we obtain a pair of experience tuples, (𝑠𝑡, 𝑎𝑒𝑡, 𝑟𝑒𝑡+1, 𝑠𝑡+1) and (𝑠𝑡, 𝑎𝑐𝑡, 𝑟𝑐𝑡+1, 𝑠𝑡+1), and store them in the respective experience replay memories. Subsequently, once a certain amount of experience data has been collected, we update the two agents to obtain new policies using the update method of the SAC model introduced above. Finally, the environment agent continues the data sampling using the new policy.

It is worth noting that the initial state 𝑠0 is randomly selected from the training set. The policies 𝜋𝑒 and 𝜋𝑐 output a probability value for each action, and 𝑎𝑒𝑡 and 𝑎𝑐𝑡 are selected according to these probabilities. Consistent with the literature (Caminero et al., 2019), the action spaces 𝐴𝑒 and 𝐴𝑐 can be different. On the NSL-KDD dataset, the classifier agent has a set of actions (𝐴𝑐 ∈ [0, 4]) corresponding to the number of classes of the classifier, while the environment agent has a set of actions (𝐴𝑒 ∈ [0, 22]) corresponding to each of the possible attacks in the training data set.

Fig. 5 depicts the distribution of the training data after sampling through the environment agent. Initially (epoch 0), the environment
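The sampling step described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in, not the paper's implementation: the toy uniform policies replace the trained SAC actors 𝜋𝑒 and 𝜋𝑐, the label-to-class mapping is invented, and the simple 0/1 rewards stand in for the paper's formulas (3) and (4).

```python
import random
from collections import deque

import numpy as np

# Hypothetical stand-ins for the trained SAC actor networks pi_e and pi_c:
# here they just return uniform action-probability vectors.
N_ATTACK_LABELS, N_CLASSES, FEATURES = 23, 5, 122
rng = np.random.default_rng(0)
pi_e = lambda s: np.full(N_ATTACK_LABELS, 1.0 / N_ATTACK_LABELS)
pi_c = lambda s: np.full(N_CLASSES, 1.0 / N_CLASSES)

# Toy training set: feature vectors grouped by attack label.
train_by_label = {a: [rng.random(FEATURES) for _ in range(4)]
                  for a in range(N_ATTACK_LABELS)}

em_e, em_c = deque(maxlen=1000), deque(maxlen=1000)  # replay memories

def to_class(attack_label):
    # Hypothetical attack-label -> class mapping (the real mapping from
    # the 23 NSL-KDD labels to 5 classes follows Table 3).
    return attack_label % N_CLASSES

def step(state, true_class):
    """One adversarial sampling step: classify s_t, compute both rewards,
    then let the environment agent choose the label of s_{t+1}."""
    a_c = rng.choice(N_CLASSES, p=pi_c(state))        # classifier action
    r_c = 1.0 if a_c == true_class else 0.0           # simplified 0/1 reward
    r_e = 1.0 - r_c                                   # env wins when classifier fails
    a_e = rng.choice(N_ATTACK_LABELS, p=pi_e(state))  # label of the next state
    next_state = random.choice(train_by_label[a_e])   # random sampler
    em_e.append((state, a_e, r_e, next_state))
    em_c.append((state, a_c, r_c, next_state))
    return next_state, to_class(a_e)

s, c = rng.random(FEATURES), 0
for _ in range(50):
    s, c = step(s, c)
```

In the full model, both replay memories would then feed the SAC updates of the two agents.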

Z. Li, C. Huang, S. Deng et al. Computers & Security 135 (2023) 103502

Algorithm 1: AE-SAC.
Input: train data set 𝐷𝑡𝑟𝑎𝑖𝑛 , test data set 𝐷𝑡𝑒𝑠𝑡
Output: detection results on the test data set
1 Initialize experience replay Memory: 𝐸𝑀𝑒 ⟵ 0, 𝐸𝑀𝑐 ⟵ 0 ;
2 Initialize 𝜋𝑐 (𝑠𝑡 ) arbitrarily;
3 Initialize 𝜋𝑒 (𝑠𝑡 ) arbitrarily;
4 for 𝑖 = 1 to epochs do
5 Initialize the state: 𝑠0 ⟵ random sampling from 𝐷𝑡𝑟𝑎𝑖𝑛 ;
6 Choose initial action: 𝑎𝑒𝑡 ⟵ 𝜋𝑒 (𝑠0 );
7 𝑠𝑡 ⟵ random sampling from 𝑠(𝑎𝑒𝑡 ), where 𝑠(𝑎𝑒𝑡 ) are all samples whose label is 𝑎𝑒𝑡 ;
8 for 𝑡 = 0 to 𝑁 do
9 𝑎𝑐𝑡 ⟵ 𝜋𝑐 (𝑠𝑡 );
10 𝑟𝑐𝑡+1 , 𝑟𝑒𝑡+1 ⟵ Reward(𝑎𝑐𝑡 , 𝑎𝑡 );
11 𝑎𝑒𝑡+1 ⟵ 𝜋𝑒 (𝑠𝑡 );
12 𝑠𝑡+1 ⟵ random sampling from 𝑠(𝑎𝑒𝑡+1 ), where 𝑠(𝑎𝑒𝑡+1 ) are all samples whose
label is 𝑎𝑒𝑡+1 ;
13 𝐸𝑀𝑒 .add((𝑠𝑡 , 𝑎𝑒𝑡 , 𝑟𝑒𝑡+1 , 𝑠𝑡+1 ));
14 𝐸𝑀𝑐 .add((𝑠𝑡 , 𝑎𝑐𝑡 , 𝑟𝑐𝑡+1 , 𝑠𝑡+1 ));
15 end
16 Update environment agent by randomly sampling M experience data from 𝐸𝑀𝑒 ;
17 Update classifier agent by randomly sampling M experience data from 𝐸𝑀𝑐 ;
18 end
19 Result ⟵ 𝜋𝑐 (𝐷𝑡𝑒𝑠𝑡 );
20 return Result;

Fig. 4. Sampling with the environment agent. Environment Agent: outputs the action label for the next sampled state 𝑠𝑡+1 ; Classifier Agent: predicts the action label for state 𝑠𝑡 ; Random Sampler: randomly selects the next state from the data with the same label; Reward Calculation: computes the reward values obtained by the classifier agent and the environment agent.

both agents, including data sampling and network updates. 𝑁 denotes the number of sampling steps the environment agent performs on the training set in each epoch, and 𝑀 denotes the number of experience tuples drawn for each agent's update. After training is completed, we predict the test set 𝐷𝑡𝑒𝑠𝑡 (line 19 of Algorithm 1) using the policy 𝜋𝑐 learned by the classifier and return the classification results.
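The update phase of Algorithm 1 (lines 16-17) can be sketched as follows; this is a minimal illustration, assuming uniform mini-batch sampling and omitting the SAC gradient step itself. M = 180 matches the sample_size in Table 8.

```python
import random
from collections import deque

M = 180  # number of experience tuples drawn per update (Table 8)

def sample_batch(memory, m):
    """Uniform mini-batch from a replay memory; samples with replacement
    while the memory is still smaller than m."""
    pool = list(memory)
    if len(pool) >= m:
        return random.sample(pool, m)
    return [random.choice(pool) for _ in range(m)]

memory = deque(maxlen=1000)
for t in range(300):                      # fake experience tuples (s, a, r, s')
    memory.append((t, t % 23, 0.0, t + 1))
batch = sample_batch(memory, M)           # fed to the agent's SAC update
```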

4. Experiments and results

In this section, we introduce the dataset, data pre-processing, evaluation metrics, experimental environment, and experimental results, respectively.

4.1. Dataset

The dataset used in this paper must satisfy the following requirements: (1) a public and well-known dataset that is used by many researchers for experiments; (2) a highly unbalanced dataset, with varying degrees of imbalance, closer to the real network environment; (3) a dataset containing sufficient samples for training and testing.

Therefore, the NSL-KDD (Tavallaee et al., 2009) dataset and the AWID (Kolias et al., 2015) dataset are selected as the primary datasets for the experiments. The CICIDS2017 and CICDDoS2019 datasets were obtained by parsing Packet Capture (PCAP) files with the CICFlowMeter9 tool. However, there are many bugs in the CICFlowMeter tool (Engelen et al., 2021; Liu et al., 2022; Lanvin et al., 2023), which makes us doubt the authenticity and reliability of these two datasets. Nevertheless, many researchers still use the CICIDS2017 dataset, and this paper also performs a simple algorithm validation on the CICIDS2017 dataset.

Fig. 5. The distribution of attacks and normal data generated by the environment agent during training (NSL-KDD dataset).

agent is randomly provided with training data. As training proceeds, the environment agent learns to sample the training data that maximizes its reward. We clearly observe that the data volume of the R2L attack category is much higher than that of the other categories, mainly because we set its reward value high. The U2R category is not as high, most likely because the number of U2R samples in the training set is very small and the classifier agent classifies them well, giving less reward to the environment agent. Nevertheless, the data volume of U2R is still slightly higher than that of the other categories. The unevenness of the dataset is largely reduced by the resampling of the environment agent.
3.2.4. Training process of the AE-SAC model

The environment agent and the classifier agent can be trained in parallel. During training, a certain amount of experience data is taken from the respective experience replay memory for network training. Algorithm 1 describes the training process for both agents.

In Algorithm 1, the Reward(𝑎𝑐𝑡 , 𝑎𝑡 ) function calculates the reward values 𝑟𝑒𝑡+1 and 𝑟𝑐𝑡+1 for the environment and classifier agents using formulas (3) and (4). 𝑎𝑐𝑡 is the action label obtained by the classifier using its policy to predict state 𝑠𝑡 , and 𝑎𝑡 is the true action label of state 𝑠𝑡 . Lines 4 to 18 of Algorithm 1 describe the training process for

4.1.1. NSL-KDD

NSL-KDD is an improved version of KDDCup 99 that removes redundant records from the original data set and produces more realistic simulation results. The NSL-KDD dataset includes 41 features and a label. Furthermore, NSL-KDD contains a training set (KDDTrain+) and two test sets (KDDTest+ and KDDTest21-). The 41 features comprise 38 continuous and 3 categorical variables, all of which were transformed by a max-min normalization method to narrow the range to [0, 1] for continuous variables and one-hot coding for categor-

9 [Link]


Fig. 6. Distribution frequency of each class of NSL-KDD.

Fig. 7. Distribution frequency of each class of AWID.


Table 3
Attack categories, including 4 types of attacks: DoS, Probe, R2L and U2R.

DoS: back, land, neptune, pod, smurf, teardrop, mailbomb, apache2, processtable, udpstorm
PROBE: ipsweep, nmap, portsweep, satan, mscan, saint
R2L: ftp write, guess passwd, imap, multihop, phf, spy, warezclient, warezmaster, sendmail, named, snmpgetattack, snmpguess, xlock, xsnoop, worm
U2R: buffer overflow, loadmodule, perl, rootkit, httptunnel, ps, sqlattack, xterm

Table 4
Data distribution of the 4 types of network attacks and normal data after resampling by the environment agent, compared with the original distribution (NSL-KDD).

Normal DoS Probe R2L U2R
Resampling 4689 2789 2187 5795 3971
Original 67343 45927 11656 995 52

Table 5
Data distribution of the 3 types of network attacks and normal data after resampling by the environment agent, compared with the original distribution (AWID).

Normal Flooding Injection Impersonation
Resampling 46891 13946 12770 26393
Original 1633190 48484 65379 48522

dataset, some features have null values, constant values, timestamps and network addresses. After these were removed, the 154-dimensional feature set was reduced to 46 dimensions. In particular, the timestamp, which might affect the performance of the algorithm, was removed in the data preprocessing phase (Chatzoglou et al., 2022). Because the range of values varies greatly across features, continuous variables are also scaled to the interval [0, 1] to eliminate the effect of this variation on the model. Furthermore, one-hot coding was used to convert categorical variables into dummy variables.
ical variables. After data transformation, the dataset consisted of 122 features.

Furthermore, the attack type distribution in the dataset is severely unbalanced. The training set contains one normal type and 22 attack types, while the test set contains one normal type and 37 attack types. The training and test sets share 21 common types, with 2 unique to the training set and 17 unique to the test set. The presence of unknown attack types in the test set makes learning difficult for the DRL model. According to attack type, the NSL-KDD dataset can be divided into five categories: NORMAL, DoS, PROBE, R2L and U2R. The proportion of each traffic class is shown in Fig. 6. Without a doubt, the numbers of U2R and R2L attacks are extremely low. Table 3 lists the classification in detail.

The dataset includes one normal traffic category and three abnormal traffic categories. The three types of abnormal flows are injection, impersonation, and flooding. Fig. 7 depicts the sample distribution frequency for each traffic class. Undoubtedly, the AWID dataset, like the NSL-KDD dataset, is extremely unbalanced and well suited for the study presented in this paper. Table 5 shows the distribution of the AWID dataset after resampling by the environment agent; we can observe that the resampled dataset is more balanced, which alleviates the class imbalance of the original training dataset to some extent. Compared with the original data distribution, the proportion of normal data is significantly reduced and the proportion of data from the other three types of network attacks is increased.
The purpose of introducing environment agent resampling in this paper is to solve the problem of the unbalanced original training set. The data distribution of the four types of network attacks and normal data on the NSL-KDD dataset after environment agent sampling is shown in Table 4. Without a doubt, the sampled data distribution is more balanced than the original one. In particular, the two minority attack classes, R2L and U2R, have been resampled by the environment agent to data volumes of over 5000 and 3000, respectively.

4.1.2. AWID

The AWID (Kolias et al., 2015) dataset is the largest and most comprehensive dataset collected in a real-world WiFi network environment. The dataset can be divided into two subsets based on attack class level: the ATK dataset with 16 sub-attack classes and the CLS dataset with four large attack classes. The AWID-CLS-R dataset, which contains 154 features divided into continuous and categorical features, is preferred by the majority of researchers. Furthermore, the training and test sets contain 1,795,474 and 675,642 samples, respectively. In the AWID

4.2. Data pre-processing

Network intrusion detection datasets contain a wide range of data types, including character, boolean, numeric, and even null values. Therefore, before training the intrusion detection model, we must pre-process the dataset. The pre-processing of the dataset includes data cleaning, data conversion, and data normalization.

• Data cleaning. Infinity values, which would mislead the model's training, are replaced with -1, and data rows containing NaN or NULL values are removed.
• Data conversion. The NSL-KDD dataset contains three types of characteristics: nominal, binary, and numeric. Nominal data cannot be handled directly by machine learning or deep learning algorithms. As a result, the dataset's non-numeric features (notably the protocol type, service, and flag features) must be converted to numeric data. One-hot encoding (Potdar et al., 2017) is used to map non-numeric features to numeric features in this paper.


• Data normalization. The numerical data values vary greatly. In the NSL-KDD dataset, for example, the feature 'duration' (connection duration) ranges from 0 to 58329. Such a large range of values can cause issues such as slow network convergence and neuron output saturation. Therefore, data normalization is essential. In this paper the data are restricted to [0, 1] using the max-min normalization method shown in equation (5).

𝑓𝑛𝑒𝑤 = (𝑓𝑜𝑙𝑑 − 𝑓𝑚𝑖𝑛 ) / (𝑓𝑚𝑎𝑥 − 𝑓𝑚𝑖𝑛 ), (5)

where 𝑓𝑜𝑙𝑑 is a network traffic feature vector and 𝑓𝑚𝑖𝑛 and 𝑓𝑚𝑎𝑥 are the minimum and maximum values of 𝑓𝑜𝑙𝑑 . The original distribution of network traffic characteristics is preserved by this method.

Table 6
Network structure of the environment agent.

Name Network Struct
Actor 122(46)-100-100-100-23(4)
Q Critic 122(46)-100-100-100-23(4)
V Critic 122(46)-100-100-100-1

Table 7
Network structure of the classifier agent.

Name Network Struct
Actor 122(46)-100-100-100-5(4)
Q Critic 122(46)-100-100-100-5(4)
V Critic 122(46)-100-100-100-1
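The cleaning, conversion, and normalization steps of Section 4.2 can be sketched on a toy record set; the records and column names below are invented for illustration, not taken from the datasets.

```python
import numpy as np

# Toy records standing in for parsed network flows (hypothetical columns).
raw = [
    {"duration": 0.0,     "protocol": "tcp"},
    {"duration": 58329.0, "protocol": "udp"},
    {"duration": 12.0,    "protocol": "icmp"},
]

# Data conversion: one-hot encode the categorical 'protocol' feature.
protocols = sorted({r["protocol"] for r in raw})

def one_hot(value, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

# Data normalization: max-min scaling to [0, 1], as in equation (5).
durations = np.array([r["duration"] for r in raw])
f_min, f_max = durations.min(), durations.max()
scaled = (durations - f_min) / (f_max - f_min)

# Final feature matrix: scaled numeric column + one-hot dummies.
features = np.stack([
    np.concatenate(([s], one_hot(r["protocol"], protocols)))
    for s, r in zip(scaled, raw)
])
```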

The DRL input is divided into two parts: state and action. Further-
more, the dataset we use includes two additional components: network
traffic features and feature labels. To incorporate network traffic elements into the DRL concept, we treat network traffic features as states and feature labels as actions.

Table 8
Other training parameters of the AE-SAC model.

Parameter Value
epoch 100
batch_size 180
sample_size 180
learning rate 0.1
𝛾 0.001
𝛼 0.2
experience replay memory 1000

4.3. Evaluation metrics

The generic ML-based approach evaluates its performance using four metrics: accuracy, precision, detection rate (DR, also known as recall), and f1-score. Accuracy is used to assess the method's overall effectiveness. Precision and DR are used to evaluate each category's efficiency.
F1-score is a comprehensive indicator that takes both precision and recall into account: it is the harmonic mean of precision and recall. As a result, when evaluating models, the f1-score is more representative. The calculation formula of each evaluation metric is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN), (6)

Precision = TP / (TP + FP), (7)

DR = Recall = TP / (TP + FN), (8)

F1-score = (2 × Precision × Recall) / (Precision + Recall). (9)
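Formulas (6)-(9) can be sketched for a single target class as follows; the labels are toy data, and multi-class (macro) averaging is omitted.

```python
import numpy as np

def metrics(y_true, y_pred, target):
    """Compute accuracy, precision, recall (DR) and f1-score for one
    class treated as the 'target' (formulas (6)-(9))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == target) & (y_pred == target))  # correct target hits
    fp = np.sum((y_true != target) & (y_pred == target))  # false alarms
    fn = np.sum((y_true == target) & (y_pred != target))  # missed targets
    tn = np.sum((y_true != target) & (y_pred != target))  # correct rejections
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                               # detection rate (DR)
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], target=1)
```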
True positives (TP) are samples of the target class that are correctly labeled. False positives (FP) are non-target samples that are incorrectly labeled as the target class. True negatives (TN) are non-target samples that are correctly labeled as non-target. False negatives (FN) are samples of the target class that are incorrectly labeled as other categories.

Fig. 8. Trend of accuracy with the number of hidden layers.

4.4. Experiment environment

The experiments were run on a PC with Windows 10, an Intel(R) Core(TM) i7-9750H CPU at 2.60 GHz, and 16 GB of RAM. The experiments used the Scikit-learn library, the TensorFlow 2.0 library, and Keras.

To implement the environment agent and the classifier agent, the SAC algorithm is used. The Actor network, Q Critic network, and V Critic network in the SAC model all use simple neural network structures in this paper. The network structures of the environment agent and the classifier agent are shown in Table 6 and Table 7. The numbers in parentheses give the numbers of neurons in the input and output layers on the AWID dataset. All the networks in the tables are fully connected. All hidden layers in the Actor and Critic networks use ReLU as the activation function. The output layer of the Actor network uses Softmax as the activation function, while the output layer of the Critic network uses no activation function. The Adam algorithm is used to optimize the parameters of all networks. Table 8 lists the other parameters in the training process of the AE-SAC model, where 𝛾 is the discount rate and 𝛼 is the temperature parameter that determines the importance of entropy.

Grid search was used to find the best values for the SAC model's hyperparameters. It performs an exhaustive search over specified hyperparameter values automatically, saving time and resources. The number of hidden layers and the number of neurons in each hidden layer were determined by grid search; the optimal value is the one with the highest accuracy across all settings. Figs. 8 and 9 illustrate the results of the grid search. We can see that the SAC model needs only 3 hidden layers with 100 neurons each to achieve good classification results. Of course, more hidden layers and more neurons may yield better classification results, but they require more training time, because the SAC model is more complex than other reinforcement learning models. As described in section 3.1.1, the three networks (Actor, Q Critic, and V Critic) in the SAC model are only used to approximate the policy probability distribution, the state-action value function, and the state value function. Deep learning networks usually learn a classifier directly, which undoubtedly requires more hidden layers and more neurons.
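As a sketch, the classifier's Actor for NSL-KDD from Table 7 (122 inputs, three ReLU hidden layers of 100 neurons, softmax output over 5 classes) is a plain fully connected network. The weights below are random and untrained; this only illustrates the forward pass, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(42)
sizes = [122, 100, 100, 100, 5]          # layout from Table 7 (NSL-KDD Actor)
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def actor_forward(x):
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ w + b, 0.0)    # ReLU hidden layers
    logits = x @ weights[-1] + biases[-1]  # linear output layer
    e = np.exp(logits - logits.max())      # Softmax activation
    return e / e.sum()

probs = actor_forward(rng.random(122))    # action probabilities over 5 classes
```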


Fig. 9. Trend of accuracy with the number of neurons.


Fig. 10. Performance scores of AE-SAC compared to other ML algorithms (KDDTest+).
4.5. Results

Fig. 11. Accuracy comparison of the AE-SAC method and other ML methods on the KDDTest21 dataset.

Early intrusion detection mostly used misuse-based techniques to detect intrusions. However, misuse-based intrusion detection systems are highly dependent on an existing signature knowledge base, which makes it difficult to detect unknown attacks and to adapt to new intelligent attack behaviors. Therefore, anomaly-based intrusion detection is the focus of current research and development. From the perspective of model learning, anomaly-based intrusion detection includes traditional machine learning-based, deep learning-based, and reinforcement learning-based intrusion detection models. Machine learning algorithms and deep learning networks belong to supervised learning and fully exploit the data labels for model learning. Although reinforcement learning is not supervised learning, it also uses the real labels of the data to obtain reward values and thus perform policy learning (see subsection 3.2.3 for the design of the reward function). Therefore, from the data point of view, supervised learning and the reinforcement learning designed in this paper use the same data, with the difference that the AE-SAC model resamples the data set to change the imbalance of the original training
set.

In order to verify the effectiveness of the method in this paper, we selected existing network intrusion detection models (Caminero et al., 2019; Ma and Shi, 2020; Dong et al., 2021; Vinayakumar et al., 2019) as comparison algorithms for multi-classification experiments on NSL-KDD and AWID. The methods involved in the comparison can be divided into three main categories: ML methods, DL methods, and DRL methods. In the remainder of this section, we describe in detail the performance comparison of the AE-SAC algorithm with the other algorithms, as well as the shortcomings of the AE-SAC algorithm.

4.5.1. Compared to ML methods

For the machine learning algorithms involved in the comparison, we choose the Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Machine (GBM) and AdaBoost (AB) algorithms.

Fig. 10 compares the experimental results of the AE-SAC algorithm with the machine learning algorithms in terms of accuracy, precision, recall, and f1-score in the multi-classification case. Only RBF-SVM achieves an accuracy of 80% among the machine learning algorithms in the comparison, while the other machine learning algorithms reach at most 76.06% accuracy. The accuracy of our AE-SAC algorithm exceeds 84%, higher than that of all machine learning algorithms used for the comparison. Another notable observation is that the accuracy of RBF-SVM is 5.05% higher than that of Linear-SVM, which is expected: the kernel function of an SVM is selectable, and linear kernels are often less effective than Gaussian kernels. In terms of precision, recall, and f1-score, the AE-SAC algorithm still has a clear advantage. The AE-SAC algorithm has a precision and recall of more than 84%. In fact, precision and recall affect each other; while we would like both to be high, they are mutually constraining in practice. As a result, we use the f1-score to assess precision and recall simultaneously. There is no doubt that the f1-score of the AE-SAC algorithm is still higher than those of the other machine learning algorithms: it is 3.41% higher than the best algorithm, RBF-SVM, and 23.31% higher than the worst, LR.

The most likely reason for the poor classification performance of the ML algorithms on NSL-KDD is their heavier dependence on feature engineering. In general, as the number of features increases, the performance of machine learning classifiers tends to rise and then fall. This means that having too many or too few features can seriously reduce a classifier's effectiveness. When there are not enough features, the data easily overlap; when the feature dimension is too high, samples of the same category become more distant and sparse in space, causing many classification algorithms to fail. The feature dimension of NSL-KDD is as high as 122 after processing, which seriously affects the performance of the machine learning algorithms.

Fig. 11 shows the accuracy of the AE-SAC algorithm compared with other machine learning algorithms on the KDDTest21 dataset. The recognition difficulty of KDDTest21 is higher than that of KDDTest+, but the accuracy of the AE-SAC algorithm reaches 70.33%, far higher than that of the commonly used machine learning algorithms.

On the AWID dataset, Fig. 12 compares the performance of the AE-SAC algorithm with that of other machine learning algorithms. In addition to those used on the NSL-KDD dataset, we compare experimental results of algorithms such as Hyperpipes (HP), J48, Naive Bayes (NB), OneR, Random Tree (RT) and ZeroR provided in (Kolias et al., 2015). The experimental re-


Fig. 12. Performance scores of AE-SAC compared to other ML algorithms (AWID).

Fig. 13. Accuracy comparison of the AE-SAC method and other ML algorithms on the CICIDS2017 dataset.

Fig. 14. Performance scores of AE-SAC compared to other DL algorithms (KDDTest+).

sults are similar to those on the NSL-KDD dataset: the AE-SAC algorithm still outperformed the other algorithms, with all metrics exceeding 98.9%, and only J48 reached accuracy and f1-score above 96%. Although the processed AWID dataset only has 46 feature dimensions, feature selection is still an important factor affecting the performance of machine learning algorithms.

Fig. 15. Performance scores of AE-SAC compared to other DL algorithms (AWID).
Fig. 13 depicts the accuracy of AE-SAC compared with other machine learning algorithms on the CICIDS2017 dataset. The accuracy of AE-SAC reached 96.65%, higher than that of the other algorithms.

4.5.2. Compared to DL methods


Compared with machine learning algorithms, deep learning net-
works can automatically extract complex network features without
relying on feature engineering. In this paper, the commonly used DL
methods MLP, CNN, GRU, DNN are selected as comparison algorithms.
Fig. 14 shows the experimental results of the AE-SAC algorithm
compared with the deep learning algorithm on NSL-KDD in the multi-
classification case. Combining Fig. 10 and Fig. 14, we can see that, except for the RBF-SVM algorithm, the classification performance of the DL algorithms is better than that of the ML algorithms.

Fig. 16. Accuracy comparison of the AE-SAC method and other DL methods on the KDDTest21 dataset.
Among these deep learning algorithms, the accuracy of CNN, DNN,
and MLP all exceeded 77%, with the highest accuracy of CNN reach- a significant advantage over the deep learning algorithm in terms of
ing 78.75%. The performance of GRU is the worst, with an accuracy of precision, recall and f1-score.
only 75.39%, which may be mainly due to the fact that GRU is better at The Fig. 16 shows the accuracy of the AE-SAC algorithm compared
processing data with sequence characteristics. AE-SAC algorithm is ac- with other deep learning models on the KDDTest21 dataset, where the
tually also composed of several simple DNN networks, but its accuracy algorithms BAT (Bidirectional Long Short-term memory and ATtention
is better than DNN. The experimental results of the DNN series of algo- mechanism) and BAT-MC (BAT and Multiple Convolutional layers) are
rithms illustrate that the performance of classification does not improve proposed by Su et al. (2020). The BAT-MC algorithm processes the input
with the increase of the number of DNN layers. Among the deep learn- data using multiple convolutional layers, and the attention mechanism
ing algorithms, the MLP algorithm has the highest precision of 82.62%, is used to filter the network flow vectors composed of the grouping
and the DNN 5 layer has the highest recall and f1-score of 78.5% and vectors generated by the BLSTM model to obtain the key features for
76.5%, respectively. There is no doubt that the AE-SAC algorithm has network traffic classification. It achieves 69.42% accuracy, second only


Fig. 17. Accuracy comparison of the AE-SAC method and other DL networks on the CICIDS2017 dataset.

Fig. 18. Performance scores of AE-SAC compared to other DRL algorithms (KDDTest+).
to the AE-SAC algorithm designed in this paper. Other deep learning
models do not perform well for classification on KDDTest21.
Consistent with the experimental results on the NSL-KDD dataset, the AE-SAC algorithm achieves a very clear advantage on the AWID dataset, as shown in Fig. 15. The accuracy of AE-SAC is higher than the 95.37% of the CNN algorithm and the 95.27% of the GRU algorithm, while none of the other algorithms reaches 95% accuracy. In terms of f1-score, the deep learning algorithms involved in the comparison do not reach 94%, while the AE-SAC algorithm has an f1-score close to 99%.

Fig. 17 shows the accuracy of AE-SAC compared with DNN networks of different hidden-layer depths on the CICIDS2017 dataset. The accuracy of the AE-SAC algorithm is slightly higher than that of the DNN algorithm. At the same time, we can also see that the classification performance of DNN does not improve as layers are added: a DNN network with one hidden layer is more accurate than a DNN network with 4 hidden layers, although the accuracy of a DNN with 3 hidden layers is higher than that of a DNN with 1 hidden layer.

Fig. 19. Performance scores of AE-SAC compared to other DRL algorithms (AWID).
The main reason for the significant advantage of the AE-SAC algorithm over the deep learning algorithms on NSL-KDD and AWID may be that AE-SAC changes the imbalance of the original dataset by introducing an environment agent for training data resampling.

4.5.3. Compared to DRL methods

DRL algorithms have also achieved good research results in intrusion detection. In this paper, the DQN, DDQN, Dueling DQN, SSDDQN, A3C, AE-RL and AESMOTE algorithms are selected as comparison algorithms.

The experimental results are shown in Fig. 18. The DQN algorithm has the worst classification performance, while the AESMOTE algorithm obtains better classification results, second only to the AE-SAC algorithm proposed in this paper. The accuracy of DQN is only 68.55%, which is worse than that of the machine learning and deep learning algorithms. The main reason may be that the DQN algorithm suffers from a serious overestimation problem. DDQN introduces the target network, and Dueling DQN changes the network structure of the original DDQN. The SSDDQN (Dong et al., 2021) algorithm introduces the K-means algorithm to improve the model's ability to detect unknown attacks: K-means is used to predict the action label of the next state 𝑠𝑡+1 , and the obtained action label and the next state 𝑠𝑡+1 are input to the target network to obtain the Q value. Therefore, all three algorithms have higher accuracy than the DQN algorithm.

Precision, recall, and f1-score behave similarly to accuracy, and the AE-SAC algorithm is in the overall leading position. The f1-score of the AE-SAC algorithm is 18.61% higher than that of DQN and 1.54% higher than that of AESMOTE.

Fig. 19 depicts the experimental results of AE-SAC on the AWID dataset. There is no doubt that the overall performance of the AE-SAC algorithm is still higher than that of the DDQN, AE-RL, and SSDDQN algorithms. It is worth noting that the performance of SSDDQN is very close to that of the AE-SAC algorithm, with only 0.79% lower accuracy and 0.7% lower f1-score.

In fact, the AE-RL, AESMOTE and AE-SAC algorithms have similarities. The difference between AE-RL and AE-SAC is that the environment agent and the classifier agent use different reinforcement learning algorithms: AE-RL uses the DDQN algorithm, while this paper uses the SAC algorithm, which has better action-space exploration capability. Moreover, SAC uses fewer hyperparameters than the DDQN algorithm. AESMOTE is more complex than the other two algorithms, introducing SMOTE (Synthetic Minority Over-sampling TEchnique) to solve the class imbalance problem by generating additional data. Experiments show that the AE-SAC algorithm has better classification performance.

4.5.4. Time performance analysis

The AE-SAC algorithm achieves a significant advantage over the commonly used machine learning, deep learning, and deep reinforcement learning algorithms. In practice, we may be more interested in the time performance of the AE-SAC algorithm when it is actually trained and deployed. Table 9 lists the training times of the various algorithms, but it is not sufficiently informative, as the parameter settings differ for each algorithm. Compared to other algorithms, the AE-SAC training time is longer, which is to be expected: the AE-SAC model samples more data in each training round and has to update multiple networks. In fact, the real training time of an algorithm can be affected by numerous factors, such as sample size,


Table 9
Training time of various algorithms and models on the NSL-KDD dataset.

Algorithm LR Linear SVM RBF SVM RF GBM AdaBoost MLP
Train time (sec) 97.37 65.06 1696.16 97.31 2242.14 201.40 314.74

Algorithm 1D-CNN DDQN Dueling DQN A3C AE-RL AESMOTE AE-SAC
Train time (sec) 590.58 228.391 454.477 218.14 1090.13 2000 9099

Table 10
Number of network parameters on the AWID dataset.

Name Classifier Parameters Environment Parameters
Actor 25304 25304
Q Critic 25304 25304
V Critic 25001 25001

Table 11
Number of network parameters on the NSL-KDD dataset.

Name Classifier Parameters Environment Parameters
Actor 33005 34823
Q Critic 33005 34823
V Critic 32601 32601

Table 12
FLOPs on the AWID dataset.

Name Classifier FLOPs Environment FLOPs
Actor 50324 50324
Q Critic 50304 50304
V Critic 49701 49701

Table 13
FLOPs on the NSL-KDD dataset.

Name Classifier FLOPs Environment FLOPs
Actor 65730 69438
Q Critic 65705 69323
V Critic 64901 64901

Fig. 20. Performance scores of the AE-SAC algorithm with different reward functions.
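The parameter counts in Tables 10 and 11 follow directly from the fully connected layouts in Tables 6 and 7: each layer contributes in×out weights plus out biases. A quick check for the NSL-KDD networks:

```python
def mlp_params(sizes):
    """Total trainable parameters of a fully connected network:
    in*out weights plus out biases per layer."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

classifier_actor = mlp_params([122, 100, 100, 100, 5])    # 33005, as in Table 11
environment_actor = mlp_params([122, 100, 100, 100, 23])  # 34823
v_critic = mlp_params([122, 100, 100, 100, 1])            # 32601
```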

GPU, etc., which is difficult to evaluate accurately. Therefore, we evaluate the performance of AE-SAC by the number of network parameter updates and floating-point operations (FLOPs) during training.

The number of parameters included in each network of the AE-SAC algorithm on the AWID and NSL-KDD datasets is shown in Tables 10 and 11, respectively. We can see that the number of parameters that must be trained for AE-SAC is small. On the NSL-KDD dataset, the number of parameters for the Actor and Q Critic networks is 33005 and the number of parameters for the V Critic network is 32601. FLOPs are the number of multiplicative and additive operations in the model and are used to measure its computational complexity. Tables 12 and 13 show how many floating-point operations are performed in the AE-SAC algorithm when each network is updated once. Both the Actor and Critic networks require only tens of thousands of floating-point operations per update. This means that no high-performance machine is needed to complete the training. Although the AE-SAC model's overall structure is complex, it does not necessitate a complex neural network: we achieved excellent classification performance in this paper using only a fully connected network with three hidden layers.

Tables 14 and 15 show the total number of parameters and FLOPs for the other models on the NSL-KDD dataset. Although the AE-SAC model has a complex structure, in a real deployment we only need to deploy the Actor network, whose number of parameters and FLOPs are very small compared to the other models.

For predictive classification of network traffic, we only need to use the Actor network, a fully connected network with three hidden layers and a very simple structure. According to our experimental results, the prediction time on NSL-KDD is only 0.57 s for a sample size of 22544, implying that the prediction of a single sample takes only about 25 microseconds (Table 16). Similarly, the prediction time on the AWID dataset is 6.91 s for a sample size of 675642, which means that the prediction of a single sample takes only about 10 microseconds. According to the conclusions obtained in Lopez-Martin et al. (2020); Caminero et al. (2019); Dong et al. (2021); Paleyes et al. (2022), a model with this prediction time can be deployed in a real environment. In a real environment, we can also consider distributed deployment of detection models to improve detection efficiency even if the network flow generated at a certain moment is large.

4.5.5. Impact of reward function on the performance of AE-SAC

For reinforcement learning, the design of the reward function is crucial. However, there is no direct way to obtain the optimal reward function. In this paper, seven reward functions are designed, and a relatively optimal one is obtained through a grid search.

The reward functions involved in the comparison can be divided into two categories: 0/N and 0/M/N. The former indicates that if the classifier agent correctly identifies the class to which the current network traffic belongs, it receives a reward value of N; otherwise, the reward is 0. The latter gives different reward values according to the class of the traffic when the classifier identifies it successfully: the reward value is M when the network traffic is correctly classified as a majority network attack, and N when it is correctly classified as a minority network attack.

Fig. 20 depicts the performance scores of the AE-SAC algorithm on the KDDTest+ test set with different reward functions. As shown in Fig. 20, the best classification performance of AE-SAC is obtained when the reward function is set to 0/1/2. Compared to the other reward functions, this result is to be expected. When the reward function is set to 0/N, the reward value is the same for all network traffic identified successfully; however, this setting is unfair for


Table 14
Number of parameters and FLOPs for other DL models on the NSL-KDD dataset.

Model       MLP    1D-CNN   GRU    DNN 1 layer  DNN 2 layer  DNN 3 layer  DNN 4 layer  DNN 5 layer
Parameters  12805  90373    28837  52229        841221       1235717      1366789      1399557
FLOPs       25530  6886280  19678  103454       1680670      2469150      2731038      2796446

Table 15
Number of parameters and FLOPs for other DRL models on the NSL-KDD dataset.

Model       AE-RL   AESMOTE                               SSDDQN
                    Environment Agent  Classifier Agent   Environment Agent  Classifier Agent
Parameters  14623   33005              34823              53205              61307
FLOPs       29238   65730              69438              105930             121952
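The 0/N and 0/M/N reward schemes compared in Section 4.5.5 and Fig. 20 can be sketched as follows. The majority/minority partition used here is illustrative only (minority NSL-KDD classes are assumed to be U2R and R2L); the paper's own split is defined by its experimental setup:

```python
# Illustrative minority classes; the paper's exact partition may differ.
MINORITY = {"U2R", "R2L"}

def reward_0_n(pred, label, n=1):
    # 0/N scheme: the same reward N for every correct classification
    return n if pred == label else 0

def reward_0_m_n(pred, label, m=1, n=2):
    # 0/M/N scheme: a correct majority-class prediction earns M, a correct
    # minority-class prediction earns the larger N, and any mistake earns 0
    if pred != label:
        return 0
    return n if label in MINORITY else m

print(reward_0_m_n("U2R", "U2R"))  # 2 -- minority class rewarded more
print(reward_0_m_n("DoS", "DoS"))  # 1
print(reward_0_m_n("DoS", "U2R"))  # 0 -- misclassification
```

With m=1 and n=2 this reproduces the 0/1/2 setting that performed best in Fig. 20; raising n further (0/1/4, 0/2/4) biases the learned strategy toward the minority classes, as the text discusses.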

Table 16
Prediction time of AE-SAC on different datasets.

Dataset  Number of Samples  Predict Time
NSL-KDD  22544              0.57 s
AWID     675642             6.91 s
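The per-sample latencies cited in the text follow directly from Table 16 by dividing the total prediction time by the sample count:

```python
# Per-sample prediction latency implied by Table 16 (total time / sample count).
for dataset, n_samples, total_seconds in [("NSL-KDD", 22544, 0.57),
                                          ("AWID", 675642, 6.91)]:
    per_sample_us = total_seconds / n_samples * 1e6
    print(f"{dataset}: {per_sample_us:.1f} us per sample")
# -> NSL-KDD: 25.3 us per sample
# -> AWID: 10.2 us per sample
```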

a minority network attack, and the classifier learning strategy is more biased to identify majority network attacks. When the reward function is set to 0/M/N, if N is set to a value greater than M, the classifier agent learns a strategy that is more biased towards identifying the minority classes of network attacks. The reward functions 0/1/2, 0/1/4 and 0/2/4 outperform 0/N. However, the value of N should not be too large; if it is set too large, the learned strategy is completely biased towards the minority classes of network attacks and unfair to the other classes. How to set the optimal M and N is still a difficult problem left for future research.

4.5.6. Discussion of AE-SAC limitations

Although the AE-SAC algorithm achieves excellent classification results compared to other algorithms, it still has shortcomings. The confusion matrix of NSL-KDD in the AE-SAC algorithm is presented in Fig. 21. As shown in Fig. 21, the precision and recall of U2R are very low, only 18.32% and 17.5%. The main reason is that the sample size of the U2R category in the training set of NSL-KDD is only 52, and both DL and DRL require a large number of samples for training. We may introduce the SMOTE technique to generate samples of the U2R category, as AESMOTE does, to solve the problem of insufficient samples.

Fig. 21. Confusion matrix of AE-SAC model (KDDTest+).

Fig. 22. Confusion matrix of AE-SAC model (AWID).

Fig. 22 validates our conclusion: although the AWID dataset also has a class imbalance, flooding and injection attacks still have 484,778 samples, which is good enough for deep learning and deep reinforcement learning. The precision of flooding is 95.47%, the recall is 61.37%, and the f1-score is 74.46%. There is no doubt that this result still has much room for improvement. In fact, the AE-SAC algorithm can only mitigate the imbalance of the dataset to some extent.

Another significant issue is the model's complex structure, which results in an excessively long training time. The environment agent and the classifier agent must update at least three networks during each training round.

5. Conclusion

In this paper, the original reinforcement learning framework is modified by introducing an environmental agent to propose the AE-SAC model for solving the intrusion detection problem. In the AE-SAC algorithm, the role of the environment agent is to resample the training data to solve the imbalance problem of the original training set. The classifier agent uses the sampled data for training. The two agents are trained against each other in order to maximize their respective rewards. Finally, we performed a multiclassification evaluation on the NSL-KDD and AWID test sets using the learned classifier strategy. On the NSL-KDD dataset, AE-SAC achieved an accuracy of 84.15% and an f1-score of 83.97%, while on the AWID dataset, all performance metrics of AE-SAC exceeded 98.9%.

However, the sample size of the U2R category in the original training set is too small, resulting in relatively low precision and recall of AE-SAC on the U2R category. In the future, we will introduce the SMOTE technique or a generative adversarial network to generate additional samples of the U2R category to solve this problem.


Another critical issue is the vulnerability of neural network models to adversarial attacks (Merzouk et al., 2022), which is not currently addressed in this paper. In the future, we will consider using adversarial training to improve the robustness of the model.

CRediT authorship contribution statement

Zhengfa Li: Conceptualization, Formal analysis, Methodology, Validation, Writing – original draft, Writing – review & editing. Chuanhe Huang: Funding acquisition, Project administration, Supervision, Writing – review & editing. Shuhua Deng: Conceptualization, Formal analysis, Methodology. Wanyu Qiu: Formal analysis, Resources. Xieping Gao: Conceptualization, Formal analysis, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61772385).

References

Alavizadeh, Hooman, Alavizadeh, Hootan, Jang-Jaccard, Julian, 2022. Deep q-learning based reinforcement learning approach for network intrusion detection. Computers 11 (3), 41.
Caminero, Guillermo, Lopez-Martin, Manuel, Carro, Belen, 2019. Adversarial environment reinforcement learning algorithm for intrusion detection. Comput. Netw. 159, 96–109.
Chatzoglou, Efstratios, Kambourakis, Georgios, Kolias, Constantinos, Smiliotopoulos, Christos, 2022. Pick quality over quantity: expert feature selection and data preprocessing for 802.11 intrusion detection systems. IEEE Access 10, 64761–64784.
Christodoulou, Petros, 2019. Soft actor-critic for discrete action settings. arXiv preprint. arXiv:1910.07207.
Cil, Abdullah Emir, Yildiz, Kazim, Buldu, Ali, 2021. Detection of ddos attacks with feed forward based deep neural network model. Expert Syst. Appl. 169, 114520.
Demis, Hassabis, 2016. AlphaGo: using machine learning to master the ancient game of Go. Google Blog 27.
Dong, Bo, Wang, Xue, 2016. Comparison deep learning method to traditional methods using for network intrusion detection. In: 2016 8th IEEE International Conference on Communication Software and Networks. ICCSN. IEEE, pp. 581–585.
Dong, Shi, Xia, Yuanjun, Peng, Tao, 2021. Network abnormal traffic detection model based on semi-supervised deep reinforcement learning. IEEE Trans. Netw. Serv. Manag. 18 (4), 4197–4212.
Engelen, Gints, Rimmer, Vera, Joosen, Wouter, 2021. Troubleshooting an intrusion detection dataset: the CICIDS2017 case study. In: 2021 IEEE Security and Privacy Workshops. SPW. IEEE, pp. 7–12.
Gamage, Sunanda, Samarabandu, Jagath, 2020. Deep learning methods in network intrusion detection: a survey and an objective comparison. J. Netw. Comput. Appl. 169, 102767.
Haarnoja, Tuomas, Zhou, Aurick, Abbeel, Pieter, Levine, Sergey, 2018a. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning. PMLR, pp. 1861–1870.
Haarnoja, Tuomas, Zhou, Aurick, Hartikainen, Kristian, Tucker, George, Ha, Sehoon, Tan, Jie, Kumar, Vikash, Zhu, Henry, Gupta, Abhishek, Abbeel, Pieter, et al., 2018b. Soft actor-critic algorithms and applications. arXiv preprint. arXiv:1812.05905.
Han, Xiaolu, Liu, Yun, Zhang, Zhenjiang, Lü, Xin, Li, Yang, 2021. Sparse auto-encoder combined with kernel for network attack detection. Comput. Commun. 173, 14–20.
Hassan, Mohammad Mehedi, Gumaei, Abdu, Alsanad, Ahmed, Alrubaian, Majed, Fortino, Giancarlo, 2020. A hybrid deep learning model for efficient intrusion detection in big data environment. Inf. Sci. 513, 386–396.
Hou, Tianhao, Xing, Hongyan, Liang, Xinyi, Su, Xin, Wang, Zenghui, 2022. Network intrusion detection based on DNA spatial information. Comput. Netw. 217, 109318.
Imran, Muhammad, Haider, Noman, Shoaib, Muhammad, Razzak, Imran, et al., 2022. An intelligent and efficient network intrusion detection system using deep learning. Comput. Electr. Eng. 99, 107764.
Jazi, Hossein Hadian, Gonzalez, Hugo, Stakhanova, Natalia, Ghorbani, Ali A., 2017. Detecting HTTP-based application layer DoS attacks on web servers in the presence of sampling. Comput. Netw. 121, 25–36.
Kolias, Constantinos, Kambourakis, Georgios, Stavrou, Angelos, Gritzalis, Stefanos, 2015. Intrusion detection in 802.11 networks: empirical evaluation of threats and a public dataset. IEEE Commun. Surv. Tutor. 18 (1), 184–208.
Lan, Jinghong, Liu, Xudong, Li, Bo, Sun, Jie, Li, Beibei, Zhao, Jun, 2022. Member: a multi-task learning model with hybrid deep features for network intrusion detection. Comput. Secur. 123, 102919.
Lanvin, Maxime, Gimenez, Pierre-François, Han, Yufei, Majorczyk, Frédéric, Mé, Ludovic, Totel, Eric, 2023. Errors in the CICIDS2017 dataset and the significant differences in detection performances it makes. In: Risks and Security of Internet and Systems.
Liu, Lisa, Engelen, Gints, Lynar, Timothy, Essam, Daryl, Joosen, Wouter, 2022. Error prevalence in NIDS datasets: a case study on CIC-IDS-2017 and CSE-CIC-IDS-2018. In: 2022 IEEE Conference on Communications and Network Security. CNS. IEEE, pp. 254–262.
Lopez-Martin, Manuel, Carro, Belen, Sanchez-Esguevillas, Antonio, 2020. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Syst. Appl. 141, 112963.
Ma, Xiangyu, Shi, Wei, 2020. AESMOTE: adversarial reinforcement learning with SMOTE for anomaly detection. IEEE Trans. Netw. Sci. Eng. 8 (2), 943–956.
Merzouk, Mohamed Amine, Delas, Joséphine, Neal, Christopher, Cuppens, Frédéric, Boulahia-Cuppens, Nora, Yaich, Reda, 2022. Evading deep reinforcement learning-based network intrusion detection with adversarial attacks. In: Proceedings of the 17th International Conference on Availability, Reliability and Security, pp. 1–6.
Mishra, Preeti, Varadharajan, Vijay, Tupakula, Uday, Pilli, Emmanuel S., 2018. A detailed investigation and analysis of using machine learning techniques for intrusion detection. IEEE Commun. Surv. Tutor. 21 (1), 686–728.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, Riedmiller, Martin, 2013. Playing Atari with deep reinforcement learning. arXiv preprint. arXiv:1312.5602.
Moustafa, Nour, Slay, Jill, 2015. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In: 2015 Military Communications and Information Systems Conference. MilCIS. IEEE, pp. 1–6.
Muna, AL-Hawawreh, Moustafa, Nour, Sitnikova, Elena, 2018. Identification of malicious activities in industrial internet of things based on deep learning models. J. Inf. Secur. Appl. 41, 1–11.
Paleyes, Andrei, Urma, Raoul-Gabriel, Lawrence, Neil D., 2022. Challenges in deploying machine learning: a survey of case studies. ACM Comput. Surv. 55 (6), 1–29.
Pingale, Subhash V., Sutar, Sanjay R., 2022. Remora whale optimization-based hybrid deep learning for network intrusion detection using CNN features. Expert Syst. Appl. 210, 118476.
Popoola, Segun I., Adebisi, Bamidele, Hammoudeh, Mohammad, Gui, Guan, Gacanin, Haris, 2020. Hybrid deep learning for botnet attack detection in the internet-of-things networks. IEEE Int. Things J. 8 (6), 4944–4956.
Potdar, Kedar, Pardawala, Taher S., Pai, Chinmay D., 2017. A comparative study of categorical variable encoding techniques for neural network classifiers. Int. J. Comput. Appl. 175 (4), 7–9.
Schaul, Tom, Quan, John, Antonoglou, Ioannis, Silver, David, 2015. Prioritized experience replay. arXiv preprint. arXiv:1511.05952.
Sethi, Kamalakanta, Kumar, Rahul, Prajapati, Nishant, Bera, Padmalochan, 2020a. Deep reinforcement learning based intrusion detection system for cloud infrastructure. In: 2020 International Conference on COMmunication Systems & NETworkS. COMSNETS. IEEE, pp. 1–6.
Sethi, Kamalakanta, Rupesh, E. Sai, Kumar, Rahul, Bera, Padmalochan, Madhav, Y. Venu, 2020b. A context-aware robust intrusion detection system: a reinforcement learning-based approach. Int. J. Inf. Secur. 19 (6), 657–678.
Sethi, Kamalakanta, Madhav, Y. Venu, Kumar, Rahul, Bera, Padmalochan, 2021. Attention based multi-agent intrusion detection systems using reinforcement learning. J. Inf. Secur. Appl. 61, 102923.
Sharafaldin, Iman, Lashkari, Arash Habibi, Hakak, Saqib, Ghorbani, Ali A., 2019. Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy. In: 2019 International Carnahan Conference on Security Technology. ICCST. IEEE, pp. 1–8.
Shone, Nathan, Ngoc, Tran Nguyen, Phai, Vu Dinh, Shi, Qi, 2018. A deep learning approach to network intrusion detection. IEEE Trans. Emerg. Top. Comput. Intell. 2 (1), 41–50.
Su, Tongtong, Sun, Huazhi, Zhu, Jinqi, Wang, Sheng, Li, Yabo, 2020. BAT: deep learning methods on network intrusion detection using NSL-KDD dataset. IEEE Access 8, 29575–29585.
Tavallaee, Mahbod, Bagheri, Ebrahim, Lu, Wei, Ghorbani, Ali A., 2009. A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications. IEEE, pp. 1–6.
Thaseen, Ikram Sumaiya, Kumar, Cherukuri Aswani, 2017. Intrusion detection model using fusion of chi-square feature selection and multi class SVM. J. King Saud Univ. Comput. Inf. Sci. 29 (4), 462–472.
Vinayakumar, Ravi, Alazab, Mamoun, Soman, K.P., Poornachandran, Prabaharan, Al-Nemrat, Ameer, Venkatraman, Sitalakshmi, 2019. Deep learning approach for intelligent intrusion detection system. IEEE Access 7, 41525–41550.


Wang, Haomin, Li, Wei, 2021. DDosTC: a transformer-based network attack detection hybrid mechanism in SDN. Sensors 21 (15), 5047.
Zhang, Guoling, Wang, Xiaodan, Li, Rui, Song, Yafei, He, Jiaxing, Lai, Jie, 2020a. Network intrusion detection based on conditional Wasserstein generative adversarial network and cost-sensitive stacked autoencoder. IEEE Access 8, 190431–190447.
Zhang, Jianwu, Ling, Yu, Fu, Xingbing, Yang, Xiongkun, Xiong, Gang, Zhang, Rui, 2020b. Model of the intrusion detection system based on the integration of spatial-temporal features. Comput. Secur. 89, 101681.
Zhang, Shangtong, Sutton, Richard S., 2017. A deeper look at experience replay. arXiv preprint. arXiv:1712.01275.
Zhou, Kun, Wang, Wenyong, Hu, Teng, Deng, Kai, 2021. Application of improved asynchronous advantage actor critic reinforcement learning model on anomaly detection. Entropy 23 (3), 274.

Zhengfa Li received the B.S. and M.S. degrees in computer science and technology from Xiangtan University, China, in 2015 and 2018, respectively. He is currently pursuing the Ph.D. degree in computer science and technology at Wuhan University, Wuhan, China. His research interests are in the area of computer networks and network security.

Chuanhe Huang received the B.S., M.S., and Ph.D. degrees in computer science from Wuhan University, Wuhan, China, in 1985, 1988, and 2002, respectively. He is currently a Professor with the School of Computer Science, Wuhan University. His research interests include computer networks, VANETs, the Internet of Things, and distributed computing.

Shuhua Deng received the B.S. degree in computer science and the Ph.D. degree in computational mathematics from Xiangtan University, Hunan, China, in 2013 and 2018, respectively. His current research interests include software-defined networks, network security, and system security.

Wanyu Qiu received her B.E. degree in the School of Information Engineering from Hubei University of Economics, China in 2017 and M.E. degree in the School of Computer Science from Wuhan University, China in 2019. She is currently a Ph.D. student in the School of Computer Science, Wuhan University, China. Her research interests are computer networks and game theory.

Xieping Gao was born in 1965. He received the B.S. and M.S. degrees from Xiangtan University, China, in 1985 and 1988, respectively, and the Ph.D. degree from Hunan University, China, in 2003. He is a Professor in the Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China. He is also with the Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, China. He was a visiting scholar at the National Key Laboratory of Intelligent Technology and Systems, Tsinghua University, China, from 1995 to 1996, and at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, from 2002 to 2003. He is a regular reviewer for several journals and has been a member of the technical committees of several scientific conferences. He has authored and co-authored over 110 journal papers, conference papers, and book chapters. His current research interests are in the areas of wavelet analysis, neural networks, bioinformatics, image processing, and computer networks.
in the areas of wavelet analysis, neural network, bioinformatics, image processing, and
interests include computer networks, VANETs, the Internet of Things, and distributed
computer network.
computing.
