ORIGINAL RESEARCH PAPER

DOI: 10.1049/cps2.12013
Revised: 14 March 2021    Accepted: 11 April 2021

Network intrusion detection using machine learning approaches: Addressing data imbalance

1 School of Computer Science, Carleton University, Ottawa, Canada
2 School of Information Technology, Carleton University, Ottawa, Canada

Correspondence
Rahbar Ahsan, School of Computer Science, Carleton University, Ottawa, ON K1S 5B6, Canada.
Email: [email protected] and [email protected]

Funding information
Natural Sciences and Engineering Research Council of Canada, Grant/Award Number: RGPIN-2020-06482

Abstract
Cybersecurity has become a significant issue. Machine learning algorithms are known to help identify cyberattacks such as network intrusion. However, common network intrusion datasets are negatively affected by class imbalance: the normal traffic behaviour constitutes most of the dataset, whereas intrusion traffic behaviour forms a significantly smaller portion. A comparative evaluation of the performance of several classical machine learning algorithms, as well as deep learning algorithms, is conducted on the well-known National Security Lab Knowledge Discovery and Data Mining dataset for intrusion detection. More specifically, two variants of a fully connected neural network, one with an autoencoder and one without, have been implemented to compare their performance against seven classical machine learning algorithms. A voting classifier is also proposed to combine the decisions of these nine machine learning algorithms. All of the models are tested in combination with three different resampling techniques: oversampling, undersampling, and hybrid sampling. The details of the experiments conducted and an analysis of their results are then discussed.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2021 The Authors. IET Cyber-Physical Systems: Theory & Applications published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
Various supervised machine learning algorithms were tested with different preprocessing steps on a benchmark network dataset named National Security Lab Knowledge Discovery and Data Mining (NSL-KDD) [6]. A comparative evaluation has been conducted on the effectiveness of various machine learning algorithms along with deep learning approaches. Furthermore, a hybrid voting classifier is used to improve the results. The task is posed as a multiclassification problem over five different classes. All selected classifiers have been tested with different types of resampling techniques. The Synthetic Minority Oversampling Technique (SMOTE) is selected from the oversampling family. Among the undersampling techniques, NearMiss has been chosen. Finally, from the balance sampling techniques, a combination of SMOTE and Edited Nearest Neighbour (ENN), named SMOTEENN, is considered in this study.

2 | RELATED WORKS

In Vinayakumar et al. [7], the authors developed a deep neural network (DNN) by considering neural network performances against intrusion detection systems. They showed a detailed evaluation of the DNN with various kinds of hidden layers on benchmark network traffic datasets. Their proposed model performed better than other existing traditional machine learning algorithms. In another study [8], a recurrent neural network has been used on the NSL-KDD dataset; in the multiclass scenario, it attained roughly 81% accuracy. To improve the classification results by using neural networks such as Convolutional Neural Networks (CNNs), the NSL-KDD dataset has been converted into image format and fed into a CNN model. A slight improvement in classification results was reported in Li et al. [9].

It was also shown that autoencoders perform well as a dimensionality reduction system, outperforming the popular dimensionality reduction method called principal component analysis [10]. Chen et al. [11] conducted an interesting study using an autoencoder to extract features from the NSL-KDD dataset, which resulted in a better false positive rate and detection accuracy than using classifiers such as K-nearest neighbour and support vector machine (SVM). In Lopez-Martin et al. [12], a conditional variational autoencoder was proposed for the network intrusion detection system, which obtained an accuracy of 80%.

In Lopez-Martin et al. [13], the authors discussed the performance of various oversampling techniques such as SMOTE and adaptive synthetic sampling on the NSL-KDD dataset. Their proposed generative variational autoencoder oversampling technique was proven to be a better oversampling technique compared with other oversampling methods with classifiers such as Random Forest (RF), linear SVM, logistic regression, and multilayer perceptron (MLP). In that study, they obtained the best result using MLP, with an accuracy of 79.26% and an F1 score of 76.45%.

Caminero et al. [14] applied Adversarial Environment Reinforcement Learning (AE-RL) as an alternative technique to replace classical oversampling methods. In their proposed model, they trained two agents adversarially: the attacker agent and the environment agent. The attacker agent's job is to predict the attack labels provided in the batch; based on its prediction, it receives a reward point. The environment agent's job is to provide the attacker agent with difficult samples to increase the misclassification rate of the attacker agent, and a reward point is likewise associated with the environment agent's performance. Thus, both agents are trained adversarially. After training, the attacker agent is used to predict the class labels of the testing dataset. Because the attacker agent is trained adversarially, it can identify all of the individual classes correctly. This proposed AE-RL model was able to outperform other machine learning models with the best oversampling techniques; the acquired F1 score was 0.7940. Another research group extended this work [15] and combined SMOTE with AE-RL to improve the existing results. The result they obtained was the best among all other state-of-the-art methods using reinforcement learning models, with an F1 score of 0.8243.

Shrivas and Dewangan [16] combined an artificial neural network and a Bayesian net to form an ensemble classifier. They tested their results on the KDD Cup 99 and NSL-KDD datasets. They considered only binary classification, and performed their studies only on the training dataset provided (using different partitions). Zhou et al. [17] proposed a framework with ensemble classifiers (C4.5, RF, and Forest PA) and applied the Correlation-based Feature Selection - Bat algorithm as a feature selection technique. Their proposed frameworks were able to obtain 87.37% accuracy on KDDTest+. Another interesting voting classifier was proposed by Gao et al. [18], who built an adaptive voting system and combined a DNN with other basic classifiers on selected features, which enabled them to obtain 85.2% accuracy. That study applied 10-fold cross-validation on the training dataset to calculate the weights of the voting classifier, which were then used for prediction.

3 | WORKFLOW

The main workflow of this study is shown in Figure 1. First, data from the training dataset is preprocessed and fed into the classifiers. After the classifiers are trained, the test dataset is used to determine the performance of the classifiers. A total of four types of preprocessed data were used to train the machine learning algorithms: raw, oversampled, undersampled, and balance sampled. A final comparison of all of the experimental results is provided in the Results and Discussion section.

3.1 | Dataset description

The NSL-KDD dataset originated from KDD-Cup99 and is seen as a refined version of it. NSL-KDD has predefined training and testing partitions. The training partition is referred to as KDD-Train+ and the testing dataset is referred to as KDD-Test+. Figure 2 shows the training data distribution of NSL-KDD. The first subfigure shows the data
AHSAN ET AL.
X_MinMax = (x − x_minimum) / (x_maximum − x_minimum)    (1)
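Equation (1) rescales each feature independently to the [0, 1] range. A minimal numpy sketch of the same computation (scikit-learn's `MinMaxScaler`, from the toolkit cited in [23], produces equivalent results; the toy matrix below is illustrative):

```python
import numpy as np

def min_max_scale(X):
    """Equation (1): X_MinMax = (x - x_minimum) / (x_maximum - x_minimum),
    applied per feature (column)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 50.0]])
print(min_max_scale(X))  # each column now spans exactly [0, 1]
```

Note that a constant feature column would cause a division by zero here; library implementations guard against that case.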
choose to use one hidden layer in both the encoder and decoder networks, and the node sizes are set to 256 and 64, respectively. We use 30 epochs and a batch size of 256 to train the autoencoder. The activation function of the hidden layers is ReLU, the Adam optimizer is used, and MSE is used as the loss function in the autoencoder model, based on a trial-and-error experiment. The autoencoder combined with the FCN uses the same architecture as the FCN architecture mentioned earlier.

FIGURE 4  Model loss in training and validation datasets in a fully connected neural network

3.3.3 | Proposed voting classifier

In Zhou et al. [17] and Gao et al. [18], the authors proposed voting classifiers using a set of machine learning algorithms. A similar type of voting strategy was used on a different set of classifiers here. Furthermore, a hard-voting system was used in this study. The proposed voting classifier's architecture is shown in Figure 5. At first, all of the classifiers are trained using either the raw or resampled data. Then, the trained models predict the class labels of the testing dataset. The final results are based on a majority vote system that decides the final predicted class labels in the testing dataset. In Gao et al. [18], the authors applied a weighted voting system instead. Furthermore, Zhou et al. [17] did not combine different classifiers such as the autoencoder with an FCN and decision tree together. Although some classifiers do not produce good prediction results individually, when they are included in the decision-making process of the proposed voting system, the final results are further improved. The following classifiers were included in the voting system: decision tree, Naïve Bayes, SVM, RF, extra tree, FCN, autoencoder with a fully connected network, as well as logistic regression with L1 and L2 penalties.

4 | RESULTS AND DISCUSSION

4.1 | Datasets and evaluation metrics

The experimental results of the study are explained and discussed in detail next. All seven classical machine learning and two deep learning algorithms detailed in the previous section were trained on the original NSL-KDD dataset as well as three variants obtained from applying resampling techniques. To evaluate intrusion detection models, four evaluation metrics are commonly used: accuracy, precision, F1 score, and recall. These evaluation metrics offer a complete view of how different machine learning algorithms perform. Evaluation based solely on accuracy is not recommended because of the imbalanced nature of the investigated dataset: accuracy tends to be biased towards the normal class. Furthermore, a primary purpose was to build a model that avoided bias towards the majority class. On the other hand, when a resampling technique is applied, a model often becomes biased towards the minority classes, which results in a lower precision value. It is important to have both high precision and recall in the network intrusion detection domain. Because the F1 score is calculated based on precision and recall, using the F1 score is the most appropriate way to evaluate those intrusion detection models when the dataset is imbalanced. The formula of the F1 score is:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (2)
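For the binary case, Equation (2) coincides with scikit-learn's `f1_score`; a quick numeric check (the toy labels below are illustrative only, and the averaging scheme used for the five-class task is not stated in this excerpt):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary labels: 1 = intrusion, 0 = normal (illustrative data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP=2, FP=1 -> 2/3
recall = recall_score(y_true, y_pred)        # TP=2, FN=2 -> 1/2
f1 = 2 * (precision * recall) / (precision + recall)  # Equation (2)

assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(round(f1, 4))  # 0.5714
```

The F1 score is the harmonic mean of precision and recall, so it is dragged down by whichever of the two is worse; this is exactly why it penalises both the majority-biased and the minority-biased failure modes discussed above.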
TABLE 1  Comparison of classifiers on original dataset

                                                  Accuracy  Precision  F1-score  Recall
Decision tree                                     0.78      0.78       0.76      0.78
Autoencoder with fully connected neural networks  0.82      0.83       0.80      0.82
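The hard-voting scheme described in Section 3.3.3 maps directly onto scikit-learn's `VotingClassifier` [23]. A reduced sketch with three of the nine base learners (the synthetic data and the particular hyperparameters here are illustrative, not the study's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# voting='hard' takes a majority vote over the predicted class labels,
# matching the majority-vote system of the proposed classifier.
voter = VotingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='hard',
)

X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
voter.fit(X[:300], y[:300])
print(voter.score(X[300:], y[300:]))  # held-out accuracy of the ensemble
```

With `voting='hard'`, each base learner contributes one vote per test sample regardless of its confidence; the weighted scheme of Gao et al. [18] would instead pass per-estimator `weights` to the same class.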
4.2 | Results and analysis

Table 1 lists the results obtained from all nine classifiers as well as the voting classifier. The results show that Naïve Bayes yields the worst performance. Naïve Bayes is a probabilistic model, and its predictions are based on a probability density estimation of a dataset. In the testing dataset, there are new classes that are not present in the training dataset. Therefore, the probability density estimation does not reflect the presence of the new classes, which leads to degraded performance. Almost all other classifiers produce a similar level of results. Logistic regression with the L2 penalty demonstrated a slightly better result than the other classical machine learning algorithms. Furthermore, the FCN produces an excellent result in terms of accuracy and F1 score, which is higher than the results of the proposed DNN reported in Vinayakumar et al. [7] in terms of all four evaluation metrics. Figure 6 shows the F1 score of all 10 classifiers trained on the original dataset. Among all 10 classifiers, the autoencoder with the FCN provides the best results. The poor result of the voting classifier obtained from running it on the original dataset follows from the fact that most individual classifiers we investigated performed poorly owing to the imbalance present in the original data.

Table 2 lists the results of combining each of the 10 classifiers with SMOTE resampling on the NSL-KDD training dataset. In almost all cases, the results are improved compared with using classifiers with no resampling. Similar to our previous explanation, Naïve Bayes shows no improvement because the model fails to capture the probabilistic density
FIGURE 7  F1 scores of classifiers using synthetic minority oversampling technique resampling
FIGURE 8  Confusion matrices of decision tree classifier trained on oversampled and undersampled training sets
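SMOTE [20], used for the runs reported in Figure 7, creates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours. A numpy sketch of that core idea (the study presumably relied on a library implementation such as imbalanced-learn, so the function name, parameters, and toy data here are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority-class samples X_min."""
    rng = np.random.default_rng(seed)
    # k+1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    seeds = rng.integers(0, len(X_min), size=n_new)          # random seed points
    partners = idx[seeds, rng.integers(0, k, size=n_new)]    # random neighbour each
    gaps = rng.random((n_new, 1))                            # interpolation factor in [0, 1)
    return X_min[seeds] + gaps * (X_min[partners] - X_min[seeds])

X_min = np.random.default_rng(42).normal(size=(20, 4))  # toy minority class
X_syn = smote_sample(X_min, n_new=80)
print(X_syn.shape)  # (80, 4)
```

Because each synthetic point lies on a segment between two real minority samples, the generated data never leaves the convex hull of the minority class; this is also why sample quality degrades when the minority class is noisy, as discussed for the autoencoder results below.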
TABLE 3  Comparison of classifiers using NearMiss resampling

               Accuracy  Precision  F1-score  Recall
Decision tree  0.82      0.82       0.80      0.82
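NearMiss (version 1) [21], the undersampling method behind Table 3, keeps the majority samples whose average distance to their nearest minority samples is smallest. A numpy sketch of that selection rule (the function name and toy data are illustrative; the study presumably used a library implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def near_miss_1(X_maj, X_min, n_keep, k=3):
    """Keep the n_keep majority samples closest (on average) to their
    k nearest minority-class samples (NearMiss version 1)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_min)
    dist, _ = nn.kneighbors(X_maj)           # distances to k nearest minority points
    order = np.argsort(dist.mean(axis=1))    # closest to the minority class first
    return X_maj[order[:n_keep]]

rng = np.random.default_rng(0)
X_maj = rng.normal(loc=0.0, size=(100, 4))  # toy majority class
X_min = rng.normal(loc=2.0, size=(10, 4))   # toy minority class
X_kept = near_miss_1(X_maj, X_min, n_keep=10)
print(X_kept.shape)  # (10, 4)
```

Note that this discards most of the majority class outright, which is consistent with the observation below that the undersampled dataset offers too few samples for the deep learning models to train well.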
proposed voting classifier and decision tree show the highest accuracy score of 0.82 and an F1 score of 0.80. These results show that all classifiers perform well with the undersampling techniques compared with the original data.

Figure 8 shows the confusion matrices of a decision tree classifier trained on oversampled and undersampled datasets. The true positive rate (TPR) of the normal class increased from 0.96 to 0.97 and the TPR of DoS, Probe, and R2L decreased by 1%. This implies that a classical machine learning algorithm such as the decision tree has a similar type of performance when used with undersampled data compared with oversampled data.

Figure 9 demonstrates the F1 score obtained from training all 10 classifiers on the dataset after applying NearMiss resampling. All classifiers had a similar F1 score except Naïve Bayes, FCN, and the autoencoder with FCN. The FCN and the autoencoder with FCN have a low score when combined with NearMiss compared with other resampling techniques. This is because deep learning models need enough data to learn properly, and the undersampled dataset offers the fewest samples compared with the other resampled datasets. Despite discarding some majority class samples, the dataset produced after NearMiss is still highly imbalanced. Furthermore, our neural network architecture has three hidden layers with a fairly large number of nodes, which prevents the model loss from stabilising and overfits the model. This results in incorrect predictions on the minority classes.

FIGURE 9  F1 scores of the classifiers using NearMiss resampling

Table 4 lists the results obtained from training the classifiers combined with SMOTEENN. In terms of the F1 score, SVM performs the best, with an F1 score reaching 0.81. This is the best F1 score that SVM has acquired so far. This is possibly because SMOTEENN is a hybrid resampling technique combining SMOTE and ENN. These two techniques are based on a distance measure, and they successfully created good synthetic samples suitable for distance-based classifiers such as SVM. We observe that the autoencoder with FCN performs well with SMOTEENN compared with SMOTE. As mentioned, the poor results of the autoencoder with SMOTE were caused by the poor quality of the synthetic samples generated by SMOTE. When ENN is added to SMOTE, it helps to discard some samples from the oversampled data and keep samples related to the original distribution. Therefore, the performance of the autoencoder with FCN is improved.

Figure 10 compares all 10 classifiers trained on a dataset generated from applying SMOTEENN resampling. All of the classifiers perform relatively similarly in terms of the F1 score. Apart from the FCN and the voting classifier, all other classifiers perform slightly better or remain the same compared with the results using SMOTE.

Figure 11 is a bar chart showing differences between the F1 scores of all seven classical machine learning algorithms trained with each of the three resampling techniques, as well as F1 scores from training the seven classifiers on the original dataset. After applying resampling techniques, the F1 scores increased for all classifiers except Naïve Bayes. This confirms that resampling techniques have a positive impact on the performance of most classical machine learning algorithms.

Figure 12 compares the F1 scores of the two deep learning algorithms when the resampling techniques are used and run on the original dataset. The FCN performs well with SMOTE;
Note: The bold values indicate the best result for each evaluation metric.
Abbreviations: SMOTEENN, synthetic minority oversampling technique and edited nearest neighbour; SVM, support
vector machine.
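The ENN cleaning step that distinguishes SMOTEENN [22] from plain SMOTE can be sketched as a nearest-neighbour edit: any sample whose class disagrees with the majority vote of its k nearest neighbours is discarded. A numpy sketch of this rule (the function name and toy data are illustrative, not the study's implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_clean(X, y, k=3):
    """Edited Nearest Neighbour: drop samples whose class disagrees with
    the majority vote of their k nearest neighbours."""
    # k+1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]
    votes = y[idx]  # neighbour labels, one row per sample
    keep = np.array([np.bincount(row).argmax() == label
                     for row, label in zip(votes, y)])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_clean, y_clean = enn_clean(X, y)
print(len(y_clean))  # at most 100: overlapping boundary samples are removed
```

Applied after SMOTE, this edit removes synthetic points that landed inside the opposite class's region, which is consistent with the explanation above of why the autoencoder with FCN improves under SMOTEENN.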
techniques in the preprocessing step before executing the voting classifier, in the hope of further improving the detection rate of various types of network intrusions.

ACKNOWLEDGEMENT
We gratefully acknowledge the financial support from the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant No. RGPIN-2020-06482.

ORCID
Rahbar Ahsan https://orcid.org/0000-0001-6624-1462
Wei Shi https://orcid.org/0000-0002-3071-8350

REFERENCES
1. Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K.: Towards generating real-life datasets for network intrusion detection. IJ Network Security. 17(6), 683–701 (2015)
2. Jang-Jaccard, J., Nepal, S.: A survey of emerging threats in cybersecurity. J Comput Syst Sci. 80(5), 973–993 (2014)
3. Uppal, H.A.M., Javed, M., Arshad, M.: An overview of intrusion detection system (IDS) along with its commonly used techniques and classifications. Int J Comput Sci Telecommun. 5(2), 20–24 (2014)
4. Sun, N., et al.: Data-driven cybersecurity incident prediction: a survey. IEEE Commun. Surv. Tutorials. 21(2), 1744–1772 (2018)
5. Mishra, P., et al.: A detailed investigation and analysis of using machine learning techniques for intrusion detection. IEEE Commun. Surv. Tutorials. 21(1), 686–728 (2018)
6. Tavallaee, M., et al.: A detailed analysis of the KDD CUP 99 data set. In: IEEE Symposium on Computational Intelligence for Security and Defence Applications, pp. 1–6. IEEE, Ottawa, ON (2009)
7. Vinayakumar, R., et al.: Deep learning approach for intelligent intrusion detection system. IEEE Access. 7, 41525–41550 (2019)
8. Yin, C., et al.: A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. 5, 21954–21961 (2017)
9. Li, Z., et al.: Intrusion detection using convolutional neural networks for representation learning. In: International Conference on Neural Information Processing, pp. 858–866. Springer, Cham (2017)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science. 313(5786), 504–507 (2006)
11. Chen, Z., et al.: Autoencoder-based network anomaly detection. In: 2018 Wireless Telecommunications Symposium (WTS), Phoenix, AZ, USA. IEEE (2018)
12. Lopez-Martin, M., et al.: Conditional variational autoencoder for prediction and feature recovery applied to intrusion detection in IoT. Sensors. 17(9), 1967 (2017)
13. Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A.: Variational data generative model for intrusion detection. Knowl Inf Syst. 60(1), 569–590 (2019)
14. Caminero, G., Lopez-Martin, M., Carro, B.: Adversarial environment reinforcement learning algorithm for intrusion detection. Comput Network. 159, 96–109 (2019)
15. Ma, X., Shi, W.: AESMOTE: Adversarial reinforcement learning with SMOTE for anomaly detection. IEEE Trans. Netw. Sci. Eng. 1–1 (2020). http://dx.doi.org/10.1109/tnse.2020.3004312
16. Shrivas, A.K., Dewangan, A.K.: An ensemble model for classification of attacks with feature selection based on KDD99 and NSL-KDD data set. Int. J. Comput. Appl. 99(15), 8–13 (2014)
17. Zhou, Y., et al.: Building an efficient intrusion detection system based on feature selection and ensemble classifier. Comput. Netw. 174, 107247 (2020). http://dx.doi.org/10.1016/j.comnet.2020.107247
18. Gao, X., et al.: An adaptive ensemble machine learning model for intrusion detection. IEEE Access. 7, 82512–82521 (2019)
19. Gnanaprasanambikai, L., Munusamy, N.: Data pre-processing and classification for traffic anomaly intrusion detection using NSLKDD dataset. Cybern Inf Technol. 18(3), 111–119 (2018)
20. Chawla, N.V., et al.: SMOTE: synthetic minority over-sampling technique. JAIR. 16, 321–357 (2002)
21. Mani, I., Zhang, I.: kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of Workshop on Learning from Imbalanced Datasets, vol. 126. ICML, USA (2003)
22. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter. 6(1), 20–29 (2004). https://doi.org/10.1145/1007730.1007735
23. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
24. Berman, D., et al.: A survey of deep learning methods for cyber security. Information. 10(4), 122 (2019)
25. Chollet, F.: Keras (2015). https://keras.io/getting_started/faq/#how-should-i-cite-keras

How to cite this article: Ahsan R, Shi W, Corriveau JP. Network intrusion detection using machine learning approaches: Addressing data imbalance. IET Cyber-Phys. Syst., Theory Appl. 2021;1–10. https://doi.org/10.1049/cps2.12013