
Received: 18 January 2021 | Revised: 14 March 2021 | Accepted: 11 April 2021

DOI: 10.1049/cps2.12013

ORIGINAL RESEARCH PAPER

IET Cyber‐Physical Systems: Theory & Applications

Network intrusion detection using machine learning approaches: Addressing data imbalance

Rahbar Ahsan1 | Wei Shi2 | Jean‐Pierre Corriveau1

1 School of Computer Science, Carleton University, Ottawa, Canada
2 School of Information Technology, Carleton University, Ottawa, Canada

Correspondence: Rahbar Ahsan, School of Computer Science, Carleton University, Ottawa, ON K1S5B6, Canada. Email: [email protected] and [email protected]

Funding information: Natural Sciences and Engineering Research Council of Canada, Grant/Award Number: RGPIN‐2020‐06482

Abstract
Cybersecurity has become a significant issue. Machine learning algorithms are known to help identify cyberattacks such as network intrusion. However, common network intrusion datasets are negatively affected by class imbalance: normal traffic behaviour constitutes most of the dataset, whereas intrusion traffic behaviour forms a significantly smaller portion. A comparative evaluation of the performance of several classical machine learning algorithms, as well as deep learning algorithms, is conducted on the well‐known National Security Lab Knowledge Discovery and Data Mining dataset for intrusion detection. More specifically, two variants of a fully connected neural network, one with an autoencoder and one without, have been implemented to compare their performance against seven classical machine learning algorithms. A voting classifier is also proposed to combine the decisions of these nine machine learning algorithms. All of the models are tested in combination with three different resampling techniques: oversampling, undersampling, and hybrid sampling. The details of the experiments conducted and an analysis of their results are then discussed.

1 | INTRODUCTION

An exponential rise in the number of computing apps and network sizes has drastically increased the potential threat of cyberattacks [1]. It has become essential to ensure network security in light of its adaptive nature [2]. Attackers can intrude on network traffic and control the computer system with powerful adaptive methods. Thus, it becomes essential to anticipate network breaches and identify what kind of intrusion has been attempted in order to thwart potential intruders [3]. By using numerous types of data sources such as reports, network activities, and data gathered from social media and websites, some research institutions and companies are proposing models that can predict cybersecurity incidents [4]. One proactive strategy is to create an autonomous intrusion detection system that helps to identify and categorise the types of cyberattacks promptly at the network and host infrastructure levels [5].

Most cybersecurity datasets are imbalanced because cyberattacks are not a common occurrence. Because of the imbalanced nature of the data, the accurate classification of cyberattacks becomes challenging. Supervised machine learning algorithms require a large amount of data to predict all of the classes accurately. There are techniques such as oversampling, undersampling, and balance sampling to handle this imbalanced dataset problem. Oversampling techniques, which synthesise samples of the minority classes to form a balanced dataset, often excel in achieving a more accurate detection rate for the minority classes. In contrast, in undersampling techniques, samples from the majority class are discarded to create a more balanced distribution. However, undersampling methods are prone to overfitting because they might discard some crucial examples of the majority class, which might be essential to differentiate between the majority and minority classes. Balance sampling is the combination of oversampling and undersampling techniques.

All of these approaches tend to decrease the performance of the majority class. Maintaining the majority class's performance while improving the cyberattack detection rate is the biggest challenge when working with an imbalanced dataset. Furthermore, people start to ignore cyberattack warnings if they occur too frequently. Consequently, a weighted F1 score is a good performance metric to evaluate solutions in intrusion detection systems.

This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
© 2021 The Authors. IET Cyber‐Physical Systems: Theory & Applications published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.



Various supervised machine learning algorithms were tested with different preprocessing steps on a benchmark network dataset named National Security Lab Knowledge Discovery and Data Mining (NSL‐KDD) [6]. A comparative evaluation has been conducted on the effectiveness of various machine learning algorithms along with deep learning approaches. Furthermore, a hybrid voting classifier is used to improve the results. The task is posed as a multiclass classification problem over five different classes. All selected classifiers have been tested with different types of resampling techniques. The Synthetic Minority Oversampling Technique (SMOTE) is selected from the oversampling family. Among the undersampling techniques, NearMiss has been chosen. Finally, from the balance sampling techniques, a combination of SMOTE and Edited Nearest Neighbour (ENN), named SMOTEENN, is considered in this study.

2 | RELATED WORKS

In Vinayakumar et al. [7], the authors developed a deep neural network (DNN) by considering neural network performances against intrusion detection systems. They showed a detailed evaluation of the DNN with various kinds of hidden layers on benchmark network traffic datasets. Their proposed model performed better than other existing traditional machine learning algorithms. In another study [8], a recurrent neural network was used on the NSL‐KDD dataset; in the multiclass scenario, it attained roughly 81% accuracy. To improve the classification results by using neural networks such as Convolutional Neural Networks (CNNs), the NSL‐KDD dataset has been converted into image format and fed into a CNN model; a slight improvement in classification results was reported in Li et al. [9].

It was also shown that autoencoders perform well as a dimensionality reduction system, outperforming the popular dimensionality reduction method called principal component analysis [10]. Chen et al. [11] conducted an interesting study using an autoencoder to extract features from the NSL‐KDD dataset, which resulted in a better false positive rate and detection accuracy than using classifiers such as K‐nearest neighbour and support vector machine (SVM). In Lopez‐Martin et al. [12], a conditional variational autoencoder was proposed for the network intrusion detection system, which obtained an accuracy of 80%.

In Lopez‐Martin et al. [13], the authors discussed the performance of various oversampling techniques, such as SMOTE and adaptive synthetic sampling, on the NSL‐KDD dataset. Their proposed generative variational autoencoder oversampling technique was proven to be a better oversampling technique compared with other oversampling methods when used with classifiers such as Random Forest (RF), linear SVM, logistic regression, and multilayer perceptron (MLP). In that study, they obtained the best result using MLP, with an accuracy of 79.26% and an F1 score of 76.45%.

Caminero et al. [14] applied Adversarial Environment Reinforcement Learning (AE‐RL) as an alternative technique to replace classical oversampling methods. In their proposed model, they trained two agents adversarially: an attacker agent and an environment agent. The attacker agent's job is to predict the attack labels provided in the batch; based on its prediction, it receives a reward. The environment agent's job is to provide the attacker agent with difficult samples to increase the misclassification rate of the attacker agent; a reward is likewise associated with the environment agent's performance. Thus, both agents are trained adversarially. After training, the attacker agent is used to predict the class labels of the testing dataset. Because the attacker agent is trained adversarially, it can identify all of the individual classes correctly. This proposed AE‐RL model was able to outperform other machine learning models with the best oversampling techniques; the acquired F1 score was 0.7940. Another research group extended this work [15] and combined SMOTE with AE‐RL to improve the existing results. The result they obtained was the best among all other state‐of‐the‐art methods using reinforcement learning models, with an F1 score of 0.8243.

Shrivas and Dewangan [16] combined an artificial neural network and a Bayesian net to form an ensemble classifier. They tested their results on the KDD Cup 99 and NSL‐KDD datasets. They considered only binary classification, and performed their studies only on the training dataset provided (using different partitions). Zhou et al. [17] proposed a framework with ensemble classifiers (C4.5, RF, and Forest PA) and applied the Correlation‐based Feature Selection ‐ Bat algorithm as a feature selection technique. Their proposed frameworks were able to obtain 87.37% accuracy on KDDTest+. Another interesting voting classifier was proposed by Gao et al. [18], who built an adaptive voting system and combined a DNN with other basic classifiers on selected features, which enabled them to obtain 85.2% accuracy. That study applied 10‐fold cross‐validation on the training dataset to calculate the weights of the voting classifier, which were then used for prediction.

3 | WORKFLOW

The main workflow of this study is shown in Figure 1. First, data from the training dataset is preprocessed and fed into the classifiers. After the classifiers are trained, the test dataset is used to determine the performance of the classifiers. A total of four types of preprocessed data were used to train the machine learning algorithms: raw, oversampled, undersampled, and balance sampled. A final comparison of all of the experimental results is provided in the Results and Discussion section.

FIGURE 1 Workflow

3.1 | Dataset description

The NSL‐KDD dataset originated from KDD‐Cup99 and is seen as a refined version of it. NSL‐KDD has predefined training and testing partitions. The training partition is referred to as KDD‐Train+ and the testing dataset is referred to as KDD‐Test+. Figure 2 shows the training data distribution of NSL‐KDD. The first subfigure shows the data distribution over various cyberattacks. The training dataset has 67,343 normal instances out of 125,973 total records. The second subfigure shows the data distribution over the four classes into which the cyberattacks are categorised, along with a fifth class for normal instances. These four cyberattack categories are DoS, Probe, R2L, and U2R. Even after categorisation, the dataset is highly imbalanced: in the training dataset, U2R and R2L have only 52 and 995 instances, respectively. Related works [7, 8, 12–18] show that most multiclass classification studies have been conducted on these five categories.

3.2 | Data preprocessing

Preprocessing is an essential task in knowledge learning [19]. Looking at the mean and standard deviation of all of the features, there are considerable differences between the numeric features that might create biased results when using machine learning algorithms. Therefore, Minmax scaling is applied over the numeric attributes. The formula of the Minmax scaler is given in Equation (1):

$$X_{\text{MinMax}} = \frac{x - x_{\text{minimum}}}{x_{\text{maximum}} - x_{\text{minimum}}} \qquad (1)$$
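Equation (1) is exactly what scikit‐learn's MinMaxScaler computes. The following is a minimal sketch of this scaling step (the paper's own preprocessing code is not published, and the feature values below are toy stand‐ins for KDD‐Train+ and KDD‐Test+):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for KDD-Train+ / KDD-Test+; "duration" and "src_bytes"
# are NSL-KDD numeric features, but the values here are made up.
train_df = pd.DataFrame({"duration": [0, 5, 300], "src_bytes": [181, 239, 54540]})
test_df = pd.DataFrame({"duration": [2, 150], "src_bytes": [200, 30000]})

scaler = MinMaxScaler()                        # applies Equation (1) per feature
train_scaled = scaler.fit_transform(train_df)  # learn min/max on training data only
test_scaled = scaler.transform(test_df)        # reuse training min/max on the test set
```

Fitting the scaler on the training partition only, and reusing its learnt minima and maxima on the test partition, keeps test statistics from leaking into training.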

After applying Minmax scaling, all of the feature values fall in the range of 0 to 1. Because no feature selection performs equally well for all classifiers studied, we choose not to perform feature selection in the preprocessing stage. However, in the preprocessing stage, we apply data changes such as data transformation, scaling, and resampling techniques that are equally beneficial for all of the classifiers.

Most machine learning algorithms cannot process categorical data features. In the NSL‐KDD dataset, there are three categorical features: protocol type, service, and flag. These features are important for classification. Therefore, we apply one‐hot encoding to these features as part of the data preprocessing stage. After applying a one‐hot encoder over the categorical features, the total number of features increases from 43 to 123.
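As an illustration of this encoding step (not the authors' code), pandas can expand the three categorical columns into 0/1 indicator columns; the category values below are a small subset of those appearing in NSL‐KDD:

```python
import pandas as pd

# Toy rows carrying the three categorical NSL-KDD features named above.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp"],
    "service": ["http", "domain_u", "ecr_i"],
    "flag": ["SF", "SF", "REJ"],
})

# Each distinct category becomes its own 0/1 column; on the full dataset
# this expansion is what grows the feature count from 43 to 123.
encoded = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])
print(encoded.columns.tolist())
```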
Because the dataset is highly imbalanced, we employ three popular resampling techniques (instances of oversampling, undersampling, and balance sampling) to evaluate their impact on the final classification results. A detailed comparative evaluation of the resampling techniques, along with the results collected on the raw data, is discussed and reported in the Results and Discussion section.
SMOTE [20] is chosen as the oversampling technique. In the training dataset, the normal class contains 67,342 samples. All other classes are resampled to 67,342 instances using SMOTE oversampling. This creates a balanced dataset, preventing normal instances from dominating the classifier's prediction results.
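A minimal sketch of this oversampling step with imbalanced‐learn's SMOTE, using a synthetic three‐class stand‐in for the training matrix (the paper does not list its resampling code):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the encoded training matrix; class 0 plays the
# dominant "normal" role.
X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.90, 0.08, 0.02], random_state=42)

# "not majority" raises every minority class to the majority count,
# mirroring the 67,342-samples-per-class target described above.
X_res, y_res = SMOTE(sampling_strategy="not majority",
                     random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```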
NearMiss [21] is a well‐known undersampling technique. Here, it is applied only to the normal class. The records are resampled to 45,928 instances, the same as the number of samples in the DoS class. A popular approach is to downsample all majority classes and reduce the number of records to that of the smallest minority class. In this case, however, the minority class U2R has only 52 samples; downsampling all classes to 52 instances would cause significant information loss, resulting in inaccurate classification results.
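The same step with imbalanced‐learn's NearMiss, again as a sketch rather than the authors' code. Only the synthetic majority class is shrunk, mirroring the reduction of the normal class to the DoS count (the 450‐sample target here is purely illustrative):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.90, 0.08, 0.02], random_state=42)

# A dict strategy undersamples only the listed classes; class 0 (the
# majority) is cut to 450 samples and the minority classes are untouched.
nm = NearMiss(version=1, sampling_strategy={0: 450})
X_res, y_res = nm.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```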
For the balance sampling technique, SMOTEENN [22] was used. It is the combination of the SMOTE and ENN techniques: it first oversamples all minority classes via SMOTE to create an equal class distribution, and then downsamples based on ENN. After applying SMOTEENN, the final class distribution is normal, 67,181; DoS, 67,308; Probe, 67,320; R2L, 67,325; and U2R, 67,342. The next section will discuss all of the machine learning classifiers and proposed algorithms against these four datasets: raw data, oversampled data, undersampled data, and balance sampled data.
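A sketch of the hybrid step with imbalanced‐learn's SMOTEENN on the same kind of synthetic data. Note that the resampled counts come out close to, but not exactly, equal, which matches the 67,181–67,342 spread reported above:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.90, 0.08, 0.02], random_state=42)

# SMOTE first equalises the class counts, then ENN removes samples whose
# nearest neighbours disagree with their label.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```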

FIGURE 2 Training data distribution of NSL‐KDD dataset

3.3 | Machine learning algorithms

3.3.1 | Classical supervised machine learning algorithms

To compare performance on the task of intrusion detection, the following machine learning algorithms were selected for testing against the NSL‐KDD dataset: from the tree‐based machine learning algorithms, decision tree, RF, and extra tree were selected; the Gaussian Naïve Bayes algorithm was chosen from the probabilistic models; SVM was selected from the distance‐based models; and logistic regression with L1 and L2 penalties was chosen from the regression models. All of these classifiers were implemented using the scikit‐learn library [23], with the parameters selected as follows. For the tree‐based models, the default parameters yielded the best results. Logistic regression (L1 and L2 penalty) with solver = 'saga' and tol = 0.001, SVM with the 'rbf' kernel, and Naïve Bayes with var_smoothing = 1e−1 gave better results than the defaults.
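Based on the settings listed above, the classifiers could be instantiated in scikit‐learn as follows (a sketch: the hyperparameters are those named in the text, and everything else is left at its default):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

classifiers = {
    # Tree-based models: default parameters worked best per the text.
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "extra_trees": ExtraTreesClassifier(),
    # Non-default settings reported in the text.
    "naive_bayes": GaussianNB(var_smoothing=1e-1),
    "svm_rbf": SVC(kernel="rbf"),
    "logreg_l1": LogisticRegression(penalty="l1", solver="saga", tol=0.001),
    "logreg_l2": LogisticRegression(penalty="l2", solver="saga", tol=0.001),
}
```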
tropy’ is used as the loss function along with accuracy metrics.
An autoencoder is a type of neural network with two main
3.3.2 | Deep learning algorithms components: an encoder and a decoder. The encoder com-
pressed the data into a reduced dimensionality. Then, the
It is known that fully connected DNNs and autoencoders decoder decodes it back to the original latent representation.
produce good results in anomaly detection systems [24]. After training, the encoder has learned about the latent repre-
Therefore, Fully Connected Neural (FCN) Network with and sentation of the data. When the encoder is connected to a neural
without an autoencoder were employed in these experiments. network and trained again, the encoder extracts important in-
An FCN network is built using Keras [25]. The architecture formation by encoding it into the latent dimension. In this case,
of this FCN network is shown in Figure 3. Three hidden layers the latent dimension is reduced to 32. In this study, we use an
were used, with the following number of nodes per layer: 1024, autoencoder on an FCN network to produce the output. We
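A sketch of the described FCN in Keras, assuming the 123 one‐hot‐encoded input features; the training batch size is not stated in the text, so the commented call uses a placeholder:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Three hidden layers (1024, 512, 256 ReLU units) and a five-way SoftMax
# output, as described above.
fcn = keras.Sequential([
    layers.Dense(1024, activation="relu", input_shape=(123,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
fcn.compile(optimizer="adam",                        # default Adam parameters
            loss="sparse_categorical_crossentropy",  # integer class labels 0-4
            metrics=["accuracy"])
# fcn.fit(X_train, y_train, epochs=200, batch_size=256, validation_split=0.1)
```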

An autoencoder is a type of neural network with two main components: an encoder and a decoder. The encoder compresses the data into a reduced dimensionality; the decoder then decodes it back to the original representation. After training, the encoder has learned a latent representation of the data. When the encoder is connected to a neural network and trained again, it extracts important information by encoding the input into the latent dimension. In this case, the latent dimension is reduced to 32. In this study, we use an autoencoder with an FCN network to produce the output. We choose to use one hidden layer in both the encoder and decoder networks, with node sizes set to 256 and 64, respectively. We use 30 epochs and a batch size of 256 to train the autoencoder. The activation function of the hidden layers is ReLU, the Adam optimizer is used, and MSE is used as the loss function in the autoencoder model, based on a trial‐and‐error experiment. The autoencoder combined with the FCN network uses the same architecture as the FCN described earlier.
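A sketch of that autoencoder in Keras under the stated sizes (256‐unit encoder hidden layer, 32‐dimensional code, 64‐unit decoder hidden layer). The output activation is our assumption, as the text does not specify it:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(123,))
# Encoder: one 256-unit hidden layer down to a 32-dimensional latent code.
hidden_enc = layers.Dense(256, activation="relu")(inputs)
code = layers.Dense(32, activation="relu")(hidden_enc)
# Decoder: one 64-unit hidden layer back to the input dimensionality; the
# sigmoid output suits the [0, 1]-scaled features (an assumption).
hidden_dec = layers.Dense(64, activation="relu")(code)
outputs = layers.Dense(123, activation="sigmoid")(hidden_dec)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")  # MSE reconstruction loss
# autoencoder.fit(X_train, X_train, epochs=30, batch_size=256)

# After training, the stand-alone encoder feeds the FCN described earlier.
encoder = keras.Model(inputs, code)
```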
3.3.3 | Proposed voting classifier

In Zhou et al. [17] and Gao et al. [18], the authors proposed voting classifiers built from a set of machine learning algorithms. A similar type of voting strategy is used on a different set of classifiers here; furthermore, a hard‐voting system is used in this study. The proposed voting classifier's architecture is shown in Figure 5. At first, all of the classifiers are trained using either the raw or the resampled data. Then, the trained models predict the class labels of the testing dataset. The final results are based on a majority vote that decides the final predicted class labels for the testing dataset. In Gao et al. [18], the authors applied a weighted voting system. Furthermore, Zhou et al. [17] did not combine different classifiers, such as the autoencoder with an FCN and a decision tree, together. Although some classifiers do not produce good prediction results individually, when they are included in the decision‐making process of the proposed voting system, the final results are further improved. The following classifiers were included in the voting system: decision tree, Naïve Bayes, SVM, RF, extra tree, FCN, autoencoder with a fully connected network, as well as logistic regression with L1 and L2 penalties.
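Because the ensemble mixes Keras and scikit‐learn models, the hard vote can be taken directly over each trained model's predicted labels. A minimal sketch of such a majority vote (our illustration, not the authors' implementation):

```python
import numpy as np

def hard_vote(predictions):
    """Majority vote over label predictions from several trained models.

    predictions: shape (n_classifiers, n_samples), integer class labels.
    Ties resolve to the smallest label, one simple convention.
    """
    preds = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in preds.T])

# Three hypothetical classifiers voting on four samples:
votes = [[0, 1, 2, 3],
         [0, 1, 2, 4],
         [0, 2, 2, 4]]
print(hard_vote(votes))  # -> [0 1 2 4]
```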
4 | RESULTS AND DISCUSSION

4.1 | Datasets and evaluation metrics

The experimental results of the study are explained and discussed in detail next. All seven classical machine learning and two deep learning algorithms detailed in the previous section were trained on the original NSL‐KDD dataset as well as on three variants obtained by applying resampling techniques. To evaluate intrusion detection models, four evaluation metrics are commonly used: accuracy, precision, F1 score, and recall. These evaluation metrics offer a complete view of how different machine learning algorithms perform. Evaluation based solely on accuracy is not recommended because of the imbalanced nature of the investigated dataset: accuracy tends to be biased towards the normal class. Furthermore, a primary purpose was to build a model that avoided bias towards the majority class. On the other hand, when a resampling technique is applied, a model often becomes biased towards the minority classes, which results in a lower precision value. It is important to have both high precision and recall in the network intrusion detection domain. Because the F1 score is calculated from precision and recall, using the F1 score is the most appropriate way to evaluate these intrusion detection models when the dataset is imbalanced. The formula of the F1 score is:

$$\text{F1-score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \qquad (2)$$

FIGURE 5 Proposed voting classifier’s architecture



TABLE 1 Comparison of classifiers on original dataset

Classifier | Accuracy | Precision | F1 score | Recall
Decision tree | 0.78 | 0.78 | 0.76 | 0.78
Naïve Bayes | 0.66 | 0.75 | 0.65 | 0.66
SVM | 0.79 | 0.82 | 0.77 | 0.79
Random forest | 0.76 | 0.81 | 0.72 | 0.76
Extra tree | 0.77 | 0.82 | 0.74 | 0.77
Logistic regression (L1 penalty) | 0.78 | 0.80 | 0.75 | 0.78
Logistic regression (L2 penalty) | 0.79 | 0.81 | 0.76 | 0.79
Fully connected neural networks | 0.81 | 0.82 | 0.78 | 0.80
Autoencoder with fully connected neural networks | 0.82 | 0.83 | 0.80 | 0.82
Voting classifier | 0.79 | 0.81 | 0.76 | 0.79

Abbreviation: SVM, support vector machine.

FIGURE 6 F1 scores of classifiers trained on original dataset

Furthermore, we used the weighted F1 score to compare the performance of the proposed voting system with the set of classifiers listed in the previous section. In the weighted F1 score, the weight is calculated based on the samples present in the testing dataset. By doing this, the performance of a minority class no longer has a great effect on the overall F1 score. In real‐world scenarios, minority classes usually remain much smaller than the normal class.

TABLE 2 Comparison of classifiers using SMOTE resampling

Classifier | Accuracy | Precision | F1 score | Recall
Decision tree | 0.82 | 0.81 | 0.80 | 0.82
Naïve Bayes | 0.63 | 0.75 | 0.63 | 0.63
SVM | 0.81 | 0.83 | 0.80 | 0.81
Random forest | 0.78 | 0.82 | 0.75 | 0.78
Extra tree | 0.79 | 0.83 | 0.76 | 0.79
Logistic regression (L1 penalty) | 0.80 | 0.82 | 0.79 | 0.80
Logistic regression (L2 penalty) | 0.81 | 0.82 | 0.80 | 0.81
Fully connected neural networks | 0.82 | 0.82 | 0.80 | 0.82
Autoencoder with fully connected neural networks | 0.79 | 0.80 | 0.76 | 0.78
Voting classifier | 0.83 | 0.85 | 0.82 | 0.83

Abbreviations: SMOTE, synthetic minority oversampling technique; SVM, support vector machine.
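A sketch of this metric with scikit‐learn, where average='weighted' weights each per‐class F1 score by that class's support in the test labels:

```python
from sklearn.metrics import f1_score

# Toy labels over the five NSL-KDD categories (0=Normal ... 4=U2R).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]

# Each class's F1 is weighted by its frequency in y_true, so rare classes
# move the overall score less.
print(f1_score(y_true, y_pred, average="weighted"))
```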

4.2 | Results and Analysis

Table 1 lists the results obtained from all nine classifiers as well as the voting classifier. The results show that Naïve Bayes yields the worst performance. Naïve Bayes is a probabilistic model, and its predictions are based on a probability density estimation of the dataset. The testing dataset contains new classes that are not present in the training dataset; therefore, the probability density estimation does not reflect the presence of the new classes, which leads to degraded performance. Almost all of the other classifiers produce a similar level of results. Logistic regression with the L2 penalty demonstrated a slightly better result than the other classical machine learning algorithms. Furthermore, the FCN network produces an excellent result in terms of accuracy and F1 score, which is higher than the results of the proposed DNN reported in Vinayakumar et al. [7] in terms of all four evaluation metrics. Figure 6 shows the F1 scores of all 10 classifiers trained on the original dataset. Among all 10 classifiers, the autoencoder with the FCN provides the best results. The poor result of the voting classifier on the original dataset follows from the fact that most individual classifiers we investigated performed poorly owing to the imbalance present in the original data.

Table 2 lists the results of combining each of the 10 classifiers with SMOTE resampling on the NSL‐KDD training dataset. In almost all cases, the results are improved compared with using classifiers with no resampling. Similar to our previous explanation, Naïve Bayes shows no improvement because the model fails to capture the probability density

estimation on the new classes in the test dataset. The performance of the autoencoder with the FCN network is slightly decreased. We believe that this is because of the quality of the synthetic data generated: more precisely, the synthetic data changed the distribution of important features and does not reflect the distribution of the original dataset. The autoencoder model failed to capture the latent distribution of the oversampled dataset. To improve this result, the autoencoder structure might need to be changed. Because running an autoencoder on a large dataset such as NSL‐KDD is computationally expensive, parameter tuning and architecture changes for the autoencoder model are left for future work. After applying SMOTE to the dataset, we perform intrusion detection using the 10 classifiers again on the new training dataset. The proposed voting classifier shows a good result compared with the other classifiers in terms of the F1 score (Figure 7). It has 0.83 accuracy, 0.85 precision, and a 0.82 F1 score, which is the highest F1 score achieved so far. This result is higher than all of those obtained elsewhere [7, 8, 12, 13, 14]. The overall performance of the voting classifier improved significantly because all of the classifiers except Naïve Bayes and the autoencoder performed better after applying SMOTE resampling.

FIGURE 7 F1 scores of classifiers using synthetic minority oversampling technique resampling

FIGURE 8 Confusion matrices of decision tree classifier trained on oversampled and undersampled training sets

TABLE 3 Comparison of classifiers using NearMiss resampling

Classifier | Accuracy | Precision | F1 score | Recall
Decision tree | 0.82 | 0.82 | 0.80 | 0.82
Naïve Bayes | 0.66 | 0.78 | 0.66 | 0.66
SVM | 0.80 | 0.83 | 0.79 | 0.80
Random forest | 0.81 | 0.84 | 0.77 | 0.81
Extra tree | 0.80 | 0.84 | 0.76 | 0.80
Logistic regression (L1 penalty) | 0.80 | 0.80 | 0.78 | 0.80
Logistic regression (L2 penalty) | 0.80 | 0.81 | 0.78 | 0.80
Fully connected neural networks | 0.72 | 0.55 | 0.62 | 0.72
Autoencoder with fully connected neural networks | 0.71 | 0.54 | 0.61 | 0.71
Voting classifier | 0.82 | 0.83 | 0.80 | 0.81

Abbreviation: SVM, support vector machine.



Table 3 lists the results of running all 10 classifiers on the test dataset when the undersampling technique is applied. The proposed voting classifier and the decision tree show the highest accuracy score of 0.82 and an F1 score of 0.80. These results show that all classifiers perform well with the undersampling technique compared with the original data.

Figure 8 shows the confusion matrices of a decision tree classifier trained on the oversampled and undersampled datasets. The true positive rate (TPR) of the normal class increased from 0.96 to 0.97, and the TPRs of DoS, Probe, and R2L decreased by 1%. This implies that a classical machine learning algorithm such as a decision tree performs similarly with undersampled data and with oversampled data.
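Per‐class TPRs such as those read off Figure 8 are the diagonal of the confusion matrix divided by the row sums; a sketch with toy labels over the five categories (the real matrices come from the trained models):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy predictions over the five categories (0=Normal, 1=DoS, 2=Probe,
# 3=R2L, 4=U2R).
y_true = [0, 0, 0, 1, 1, 2, 2, 3, 4, 0]
y_pred = [0, 0, 1, 1, 1, 2, 0, 3, 0, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4])
tpr = cm.diagonal() / cm.sum(axis=1)  # recall / true positive rate per class
print(tpr)
```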
Figure 9 demonstrates the F1 scores obtained from training all 10 classifiers on the dataset after applying NearMiss resampling. All classifiers had a similar F1 score except Naïve Bayes, the FCN, and the autoencoder with the FCN. The FCN and the autoencoder with the FCN have a low score when combined with NearMiss compared with the other resampling techniques. This is because deep learning models need enough data to learn properly, and the undersampled dataset offers the fewest samples of the resampled datasets. Despite discarding some majority class samples, the dataset produced by NearMiss is still highly imbalanced. Furthermore, our neural network architecture has three hidden layers with a fairly large number of nodes, which prevents the model loss from stabilising and overfits the model. This results in incorrect predictions on the minority classes.

FIGURE 9 F1 scores of the classifiers using NearMiss resampling

Table 4 lists the results obtained from training the classifiers combined with SMOTEENN. In terms of the F1 score, SVM performs the best, with an F1 score reaching 0.81, the best F1 score that SVM has acquired so far. This is possibly because SMOTEENN is a hybrid resampling technique combining SMOTE and ENN; these two techniques are based on a distance measure, and they successfully created good synthetic samples suitable for distance‐based classifiers such as SVM. We observe that the autoencoder with the FCN performs well with SMOTEENN compared with SMOTE. As mentioned, the poor results of the autoencoder with SMOTE were caused by the poor quality of the synthetic samples generated by SMOTE. When ENN is added to SMOTE, it helps to discard some samples from the oversampled data and keep samples related to the original distribution. Therefore, the performance of the autoencoder with the FCN is improved.

Figure 10 compares all 10 classifiers trained on the dataset generated by applying SMOTEENN resampling. All of the classifiers perform relatively similarly in terms of the F1 score. Apart from the FCN and the voting classifier, all other classifiers perform slightly better or remain the same compared with the results using SMOTE.

Figure 11 is a bar chart showing the differences between the F1 scores of all seven classical machine learning algorithms trained with each of the three resampling techniques, as well as the F1 scores from training the seven classifiers on the original dataset. After applying the resampling techniques, the F1 scores increased for all classifiers except Naïve Bayes. This confirms that resampling techniques have a positive impact on the performance of most classical machine learning algorithms.

Figure 12 compares the F1 scores of the two deep learning algorithms when the resampling techniques are used and when run on the original dataset. The FCN performs well with SMOTE; however, its performance is slightly lessened when ENN is used with SMOTE. On the other hand, the autoencoder cannot capture the original distribution properly with SMOTE, and its performance improves when SMOTE is combined with ENN. Deep learning models are more dependent on training data. Because of the imbalanced distribution of the undersampled dataset and the lack of data, our deep learning models do not perform well with undersampling compared with the other resampling techniques.

TABLE 4 Comparison of the classifiers using Edited Nearest Neighbour plus Synthetic Minority Oversampling Technique resampling

Classifier | Accuracy | Precision | F1 score | Recall
Decision tree | 0.82 | 0.82 | 0.80 | 0.82
Naïve Bayes | 0.63 | 0.75 | 0.63 | 0.63
SVM | 0.82 | 0.83 | 0.81 | 0.81
Random forest | 0.78 | 0.82 | 0.76 | 0.78
Extra tree | 0.80 | 0.84 | 0.77 | 0.80
Logistic regression (L1 penalty) | 0.81 | 0.82 | 0.79 | 0.81
Logistic regression (L2 penalty) | 0.81 | 0.83 | 0.81 | 0.81
Fully connected neural networks | 0.81 | 0.81 | 0.79 | 0.81
Autoencoder with fully connected neural networks | 0.80 | 0.81 | 0.79 | 0.80
Voting classifier | 0.82 | 0.83 | 0.80 | 0.82

Note: In the published table, bold values indicate the best result for each evaluation metric.
Abbreviations: SMOTEENN, synthetic minority oversampling technique and edited nearest neighbour; SVM, support vector machine.

FIGURE 10 F1 scores of classifiers using synthetic minority oversampling technique and edited nearest neighbour resampling

FIGURE 11 F1 scores of classical machine learning algorithms trained on the original dataset and datasets generated using three resampling techniques

FIGURE 12 F1 scores of deep learning algorithms trained on original dataset and datasets generated using three resampling techniques

FIGURE 13 F1 scores of voting classifier trained on original dataset and datasets generated using three resampling techniques

We also investigated the impact of these resampling techniques on the performance of our proposed voting classifier. The results, compared with those obtained from training on the original dataset, are shown in Figure 13. We attained the best F1 score for the voting classifier when it was combined with SMOTE. The results confirm that the use of majority voting improves intrusion detection. Also, the voting classifier performs well with the resampled datasets, which implies that resampling techniques provide sufficient variation to all classifiers to improve classification performance.

5 | CONCLUSION AND FUTURE WORK

We report the results of a comparative evaluation study performed on an extremely imbalanced dataset using 10 machine learning algorithms combined with three resampling techniques to detect network intrusions. There is a positive impact on performance when resampled datasets are used with classical machine learning algorithms. Deep learning models tend to suffer when undersampling techniques are applied to the original dataset. Our intuition is that changing the architecture of the FCN and of the autoencoder with the FCN may lead to better results.

A customised voting classifier is further proposed. The advantage of the proposed voting classifier is evident when all participating classifiers perform well in detecting individual classes. In most cases, it yields better results when run on resampled datasets. In general, bad performance from the weaker learners is absorbed by the voting classifier because it takes a majority vote to make the final decision. However, some classifiers are only good at predicting certain classes; therefore, giving all classifier decisions equal weight might restrict the potential performance of the voting classifier. In the immediate future, we will experiment with a weighted voting system to improve the proposed voting classifier's performance. Although further improvement over the NSL‐KDD dataset appears to be challenging, we plan to investigate optimizations to the classifiers and include additional resampling techniques. Finally, we will add different feature selection

techniques in the preprocessing step before executing the voting classifier, in the hope of further improving the detection rate of various types of network intrusions.

ACKNOWLEDGEMENT
We gratefully acknowledge the financial support from the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant No. RGPIN‐2020‐06482.

ORCID
Rahbar Ahsan https://orcid.org/0000-0001-6624-1462
Wei Shi https://orcid.org/0000-0002-3071-8350

REFERENCES
1. Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K.: Towards generating real‐life datasets for network intrusion detection. IJ Network Security. 17(6), 683–701 (2015)
2. Jang‐Jaccard, J., Nepal, S.: A survey of emerging threats in cybersecurity. J Comput Syst Sci. 80(5), 973–993 (2014)
3. Uppal, H.A.M., Javed, M., Arshad, M.: An overview of intrusion detection system (IDS) along with its commonly used techniques and classifications. Int J Comput Sci Telecommun. 5(2), 20–24 (2014)
4. Sun, N., et al.: Data‐driven cybersecurity incident prediction: a survey. IEEE Commun. Surv. Tutorials. 21(2), 1744–1772 (2018)
5. Mishra, P., et al.: A detailed investigation and analysis of using machine learning techniques for intrusion detection. IEEE Commun. Surv. Tutorials. 21(1), 686–728 (2018)
6. Tavallaee, M., et al.: A detailed analysis of the KDD CUP 99 data set. In: IEEE Symposium on Computational Intelligence for Security and Defence Applications, pp. 1–6. IEEE, Ottawa, ON (2009)
7. Vinayakumar, R., et al.: Deep learning approach for intelligent intrusion detection system. IEEE Access. 7, 41525–41550 (2019)
8. Yin, C., et al.: A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. 5, 21954–21961 (2017)
9. Li, Z., et al.: Intrusion detection using convolutional neural networks for representation learning. In: International Conference on Neural Information Processing, pp. 858–866. Springer, Cham (2017)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science. 313(5786), 504–507 (2006)
11. Chen, Z., et al.: Autoencoder‐based network anomaly detection. In: 2018 Wireless Telecommunications Symposium (WTS), Phoenix, AZ, USA. IEEE (2018)
12. Lopez‐Martin, M., et al.: Conditional variational autoencoder for prediction and feature recovery applied to intrusion detection in IoT. Sensors. 17(9), 1967 (2017)
13. Lopez‐Martin, M., Carro, B., Sanchez‐Esguevillas, A.: Variational data generative model for intrusion detection. Knowl Inf Syst. 60(1), 569–590 (2019)
14. Caminero, G., Lopez‐Martin, M., Carro, B.: Adversarial environment reinforcement learning algorithm for intrusion detection. Comput Network. 159, 96–109 (2019)
15. Ma, X., Shi, W.: AESMOTE: adversarial reinforcement learning with SMOTE for anomaly detection. IEEE Trans. Netw. Sci. Eng. (2020). http://dx.doi.org/10.1109/tnse.2020.3004312
16. Shrivas, A.K., Dewangan, A.K.: An ensemble model for classification of attacks with feature selection based on KDD99 and NSL‐KDD data set. Int. J. Comput. Appl. 99(15), 8–13 (2014)
17. Zhou, Y., et al.: Building an efficient intrusion detection system based on feature selection and ensemble classifier. Comput. Netw. 174, 107247 (2020). http://dx.doi.org/10.1016/j.comnet.2020.107247
18. Gao, X., et al.: An adaptive ensemble machine learning model for intrusion detection. IEEE Access. 7, 82512–82521 (2019)
19. Gnanaprasanambikai, L., Munusamy, N.: Data pre‐processing and classification for traffic anomaly intrusion detection using NSLKDD dataset. Cybern Inf Technol. 18(3), 111–119 (2018)
20. Chawla, N.V., et al.: SMOTE: synthetic minority over‐sampling technique. JAIR. 16, 321–357 (2002)
21. Mani, I., Zhang, I.: kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of Workshop on Learning from Imbalanced Datasets, vol. 126. ICML, US (2003)
22. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter. 6(1), 20–29 (2004). https://doi.org/10.1145/1007730.1007735
23. Pedregosa, F., et al.: Scikit‐learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
24. Berman, D., et al.: A survey of deep learning methods for cyber security. Information. 10(4), 122 (2019)
25. Chollet, F.: Keras (2015). https://keras.io/getting_started/faq/#how‐should‐i‐cite‐keras

How to cite this article: Ahsan, R., Shi, W., Corriveau, J.P.: Network intrusion detection using machine learning approaches: Addressing data imbalance. IET Cyber‐Phys. Syst., Theory Appl. 2021;1–10. https://doi.org/10.1049/cps2.12013
