Over the past decade, more and more businesses and organizations have been digitizing their confidential data. This has
increased the volume of network traffic, with data being created at a very large scale. Computer networks have
expanded tremendously over the last decade, especially with the emergence of new devices and services such as cloud
computing and the Internet of Things (IoT). Securing this data is a major challenge. Attacks on networks have also
increased significantly, and network intrusion is acknowledged to be one of the greatest dangers to security [1],[2].
Attacks such as Denial of Service (DoS), zero-day attacks and Advanced Persistent Threats (APT) have become significant
problems for today's global information technology community. This is where the Intrusion Detection System (IDS)
comes in handy. Intrusion Detection Systems are hardware and software systems that can identify such harmful
behaviors. The main objective of an IDS is to observe the behavior of the system, identify
attacks and generate alarms so that appropriate actions can be taken to prevent any harmful consequences [2].
Intrusion can be detected using two classification techniques, i.e., signature-based and anomaly-based detection. In signature-based
(also known as pattern-based) detection, observed activity is matched against a list of patterns already stored in a database. Signature-based
intrusion detection comes with a drawback: it is unable to learn, by itself, any new anomalous patterns and intrusions from
raw data.
Anomaly-based intrusion detection establishes a profile of normal or benign activity and looks for anything that deviates from it. It
can learn abnormal patterns using Machine Learning and Deep Learning techniques. The inputs of an IDS can be
traffic logs, application logs, file system changes, packets, etc. that are monitored, and the output is a label for each input
[2].
Numerous research studies have been conducted in the fields of Machine Learning (ML) and Deep Learning (DL) because
these techniques can learn trends of malicious behavior while reducing false alarms [9]. Several authors have attempted
comprehensive surveys of Machine Learning and Deep Learning techniques for anomaly detection [2],[4],[5].
It turns out that much of the research in this area is based on shallow learning techniques, which require a lot of time,
effort and resources, and whose effectiveness depends on the expertise and extent of knowledge of the researchers in the
field [10].
Network Intrusion Detection using Machine Learning (ML) and Deep Learning (DL) is one of the most significant
developments in the field of information security. There is a competition among researchers, leading companies, and
economies to advance Deep Learning and Artificial Intelligence. In some applications, such as modern mobile applications,
stock prediction and movie-rating prediction, Artificial Intelligence has exceeded human performance.
Although DL and ML have accomplished a lot in detecting network attacks, there are still areas where effectiveness is
lacking. The precision, accuracy and overall performance of the algorithms that classify these attacks, so that they can be
prevented, can still be improved.
As noted above, network traffic volumes have grown enormously and computer networks have expanded over the last
decade, especially with the emergence of new devices and services like cloud computing and the Internet of Things (IoT);
as a result, attacks on networks have increased significantly worldwide [34]. Malware, spear-phishing and
ransomware top the list of cybersecurity threats. Beyond those, many other network intrusion attacks such as denial of
service, zero-day attacks and advanced persistent threats (APT) have been reported as significant problems for today's
global information technology community. APTs can be particularly dangerous and costly, as they are powerful attacks launched by
malicious actors against government and private organizations with the intent of causing great damage.
Objective of this project
The main objective of this study is to evaluate the effectiveness of machine learning models using various
performance metrics. The performance metrics used in this study are accuracy, precision, recall and F1-score. The
goal of this study is to test the performance of various machine learning algorithms on the various category-specific subsets
of the realistic evaluation dataset CICIDS-2017.
It was expected that our machine learning model, comprising feature selection using Pearson's correlation coefficient
coupled with these algorithms, would increase the accuracy on the CICIDS2017 dataset. This is the contribution of
this study to the field of applying machine learning to anomaly detection.
Machine Learning and Deep Learning Algorithms
In recent years, Machine Learning and Deep Learning algorithms in anomaly detection
have garnered huge interest [4],[23]. Anomaly-based intrusion detection is essentially a
classification problem and Machine Learning and Deep Learning algorithms have proven to be
useful in Network Intrusion Detection [5],[6].
Machine Learning is a branch of Artificial Intelligence that gives computers the
ability to learn without being explicitly programmed [23]. Deep Learning is an advanced field within
Machine Learning research that simulates the way the human brain analyzes and interprets data.
Deep Learning is essentially an advancement of the Machine Learning process, derived
and formulated from the Artificial Neural Network. It is believed that Deep Learning algorithms
are the most significant breakthrough of the century, which significantly drives applications
towards Artificial Intelligence [11].
Traditional Machine Learning methods used for intrusion detection, such as Support
Vector Machine (SVM), Decision Tree, Linear Regression and the Hidden Markov Model, have
shallow architectures and are not capable of handling intrusion detection in modern data
environments [24]. The idea of Deep Learning was proposed by Hinton [25]; it is a Machine
Learning method based on learning characterizations (representations) of data. Some examples of Deep Learning
algorithms include the Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and the
Deep Boltzmann Machine (DBM).
Logistic Regression: Logistic regression is a predictive analysis algorithm based on the
concept of probability. It is used for classification problems, typically binary classification,
and uses a logistic function called the sigmoid function for prediction. Although its name makes
it sound like a regression algorithm, logistic regression is a classification algorithm.
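As an illustration of how the sigmoid links a linear score to a class probability, the following minimal sketch fits scikit-learn's LogisticRegression on a small synthetic dataset (a stand-in for the flow features used in this study, not the study's own code) and checks that its predicted probabilities are the sigmoid of the linear score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Small synthetic binary-classification set standing in for flow features/labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# The predicted probability of class 1 is the sigmoid of the linear score w.x + b.
scores = X @ clf.coef_.ravel() + clf.intercept_
print(np.allclose(sigmoid(scores), clf.predict_proba(X)[:, 1]))  # True
```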
Kernelized Support Vector Machine (SVM): The Support Vector Machine comprises a set of
supervised learning methods. It is one of the simplest and most common ML algorithms used to
categorize different types of data. It is a non-probabilistic method. It creates one or more
hyperplanes in a high-dimensional space to classify the instances.
It is a powerful model and performs well on a variety of datasets. It has been used to identify
network intrusions quickly and accurately [41]. However, it requires very meticulous and careful
pre-processing of the data and tuning of parameters.
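A minimal sketch of a kernelized SVM in scikit-learn, again on a synthetic stand-in dataset; the StandardScaler step reflects the careful pre-processing mentioned above, and C and gamma are the parameters that typically need tuning.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the flow features; the real study uses CICIDS2017 subsets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling the inputs is part of the careful pre-processing an RBF-kernel SVM needs.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```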
K-Nearest Neighbor: KNN is a classification algorithm based on the Standard Euclidean
Distance (SED) between two points in the same space [8]. It is a very simple and easy-to-implement
algorithm, and there is no need to build a model or optimize parameters. However,
the algorithm becomes very slow as the number of examples or variables increases.
The two important parameters in the KNN algorithm are the number of neighbors and the way
the distance between data points is measured. The default is the Euclidean distance,
which generally works well.
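A short KNN sketch in scikit-learn on a synthetic stand-in dataset, highlighting the two parameters mentioned above (number of neighbors and the distance metric).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The two key parameters: number of neighbours and the distance metric
# (Euclidean distance is the default behaviour).
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)          # "training" only stores the examples
print(knn.score(X_test, y_test))   # prediction searches the stored examples
```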
Naive Bayes: The Naive Bayes algorithm is a supervised learning algorithm based on Bayes'
theorem, which assumes conditional independence between every pair of features given the value
of the class variable. It is an easy algorithm to implement, but it requires the predictors to be
independent. Since in most realistic cases the predictors are dependent, the performance of
the classifier is affected negatively. Naive Bayes classifiers are efficient because they learn
parameters by looking at each feature individually and collecting per-feature statistics.
There are three classes of Naive Bayes classifiers implemented in scikit-learn:
BernoulliNB, MultinomialNB and GaussianNB. For this study, GaussianNB was used because it
can be applied to any continuous data [31].
The dataset used in this study is comparatively high-dimensional, and GaussianNB is mostly used
on very high-dimensional data. The GaussianNB model requires very little training time and
makes fast predictions.
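A brief GaussianNB sketch on a synthetic stand-in dataset; the per-class, per-feature mean and variance it estimates are what make training fast and suitable for continuous data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# GaussianNB fits a per-class mean and variance for every feature independently,
# which is why training is fast and works with continuous-valued features.
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(gnb.score(X_test, y_test))
```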
Decision tree: A decision tree is a supervised ML algorithm used to classify data. The
architecture of a decision tree comprises leaf (category) nodes, internal nodes and a root node.
Decision trees are the building blocks of Random Forest. They are simple and easy to
implement and can handle high-dimensional data. One advantage of running Random
Forest (RF) is that fewer parameters have to be specified compared with other machine learning
methods such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) [28].
Decision trees make their decisions (classifications) by learning a hierarchy of if/else questions. In
the language of Machine Learning, these if/else questions are known as 'tests'. To build a tree,
the algorithm searches over all possible tests and finds the one that is most informative about the target variable
[31]. The main downside of decision trees is that they tend to overfit the training data and
generalize poorly.
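A small sketch, using a synthetic stand-in dataset, of how limiting the depth of a scikit-learn DecisionTreeClassifier is one common way to curb the overfitting described above; the depth value is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unrestricted tree keeps asking if/else questions until its leaves are pure,
# which is where overfitting comes from; limiting the depth is one way to curb it.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))
```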
Random Forest (RF)
The random forest classifier was proposed by Breiman [29]. It is essentially an extension of the decision
tree concept, constructed from many decision trees. It takes thousands of input
variables without deleting variables and classifies them based on their importance [28]. It is an
ensemble of classification trees.
In a random forest, the collection of individual tree-structured classifiers can be expressed
mathematically as
{ h(x, Θ_k), k = 1, 2, … } [30]
where h represents the RF classifier, the {Θ_k} are independent, identically distributed random
vectors, and each tree casts a unit vote for the most popular class at input variable x.
Figure 1. Structure of a decision tree.
As discussed above, a random forest is basically a collection of decision trees, where the trees are slightly different from one
another. Random forests address the tendency of individual decision trees to overfit the training data, making the random
forest a strong classifier.
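A minimal RandomForestClassifier sketch on a synthetic stand-in dataset; n_estimators sets how many decision trees make up the ensemble, and the forest's prediction is the majority vote of its trees, matching the formulation above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators is the number of decision trees in the ensemble; each tree sees a
# bootstrap sample of the data, and the forest votes for the most popular class.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```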
Methodology
Introduction
The data for this study is secondary data, i.e., collected by other researchers. It was
generated by researchers from the Canadian Institute for Cybersecurity [1] and is designed to be
realistic.
Figure 2. Flow Chart of the method used in the calculation.
In this study, a small subset of data from CICIDS2017 was taken to optimize a
Machine Learning model that can help detect the attacks listed in Table 1. The dataset
comprises attacks captured using CICFlowMeter [16], with timestamp, source and destination
IPs, source and destination ports, protocols and type of attack.
Hardware and Software Environment
Operating System : Windows 10 Home
Processor : AMD Ryzen 5 3600 6-Core Processor, 3.6 GHz
Installed RAM: 16.0 GB
Startup Disk : Macintosh HD
Software Environment: Python 3.9.4 64-bit
Design of the Study
The study is computational in nature. Our model uses the Pearson correlation
coefficient as the feature elimination technique and various supervised Machine Learning
classifiers for performing classification. The Python libraries used in the study
are scikit-learn, NumPy, pandas, Keras, matplotlib, TensorFlow, and PyTorch.
The calculation was performed in a Jupyter Notebook using Python. In order to perform
the calculation, the required Python libraries were first imported. Then the dataset was imported
and analyzed. As with every dataset, missing data must be handled and
appropriate features selected. The scikit-learn library comes in very handy for these tasks
in Python.
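As a rough sketch of this first step, the snippet below loads one of the CICIDS2017 CSV files with pandas and removes missing and non-finite values; the file name is illustrative, and the column-name stripping reflects a common quirk of these files rather than a prescribed procedure.

```python
import numpy as np
import pandas as pd

# Illustrative file name; the actual CSV names come from the CICIDS2017 download.
df = pd.read_csv("Wednesday-workingHours.pcap_ISCX.csv")
df.columns = df.columns.str.strip()          # CICIDS2017 headers often carry stray spaces

# Handle missing / non-finite values before feature selection and training.
df = df.replace([np.inf, -np.inf], np.nan).dropna()

print(df.shape)
print(df["Label"].value_counts())            # benign vs. attack classes in this file
```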
Feature selection is a very important task, as it helps reduce computational complexity
and eliminates unnecessary and irrelevant features while enhancing the performance of the IDS [34],
[35],[38]. Correlation-based feature selection has been found to improve classification accuracy
and reduce the dimensionality of the dataset [36],[37]. The correlation function available in the
Python libraries is used to obtain the correlation matrix between features. A correlation coefficient is a
measure of the degree to which variation in one variable is related to variation in one or more other
variables [32],[34].
The value of the correlation coefficient can range from -1 to 1. A value close to +1 indicates a very
strong positive relationship between the variables, and a value close to -1 indicates a very strong
negative relationship. The sign of the correlation thus shows the direction of the relationship between
variables [33], and its magnitude tells us how strong that relationship is. In the case of continuous
variables, if two features are highly correlated, they contribute the same information to the target
result, so one of them can be dropped and appropriate selection of features can be done.
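The following sketch, assuming the DataFrame df from the previous step, computes the Pearson correlation matrix with pandas and drops one feature from every highly correlated pair; the 0.9 threshold is an illustrative choice, not the exact rule used in this study.

```python
# Assumes `df` holds the cleaned flow records with a "Label" column (as loaded above).
features = df.drop(columns=["Label"]).select_dtypes(include="number")

# Pearson correlation matrix between all pairs of numeric features.
corr = features.corr(method="pearson")

# Illustrative rule: drop one feature of every pair whose |correlation| exceeds 0.9,
# since highly correlated features contribute the same information to the target.
threshold = 0.9
to_drop = set()
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) > threshold:
            to_drop.add(cols[j])

selected = [c for c in cols if c not in to_drop]
print(len(selected), "features retained")
```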
Figure 3. Scatter plot showing how the value of correlation coefficient defines the relationship
between attributes.
Figure 4. Setup for Pearson's correlation coefficient in a Jupyter Notebook.
Figure 5. Pearson’s correlation plot showing features considered for this study.
After the analysis of the Pearson’s correlation plot, the final 14 features that were selected were:
1. Total Flow Duration
2. Total Forward Packets
3. Total Length of Forward Packets
4. Forward Packet’s maximum length
5. Forward Packet’s minimum length
6. Forward Packet’s mean length
7. Backward Packet maximum length
8. Backward Packet minimum length
9. Flow Bytes per second
10. Flow Packets per second
11. Backward Packets per second
12. Minimum Packet Length
13. Initial Window Bytes (Forward)
14. Initial Window Bytes (Backward)
Description of each feature selected
Total Flow Duration: The total duration of the flow in microseconds.
Total Forward Packets: The total number of packets in the forward direction.
Total Length of Forward Packets: The total size of packets in the forward direction.
Forward Packet's maximum length: The maximum size of a packet in the forward direction.
Forward Packet's minimum length: The minimum size of a packet in the forward direction.
Forward Packet's mean length: The mean size of a packet in the forward direction.
Backward Packet maximum length: The maximum size of a packet in the backward direction.
Backward Packet minimum length: The minimum size of a packet in the backward direction.
Flow Bytes per second: The number of flow bytes per second.
Flow Packets per second: The number of flow packets per second.
Backward Packets per second: The number of backward packets per second.
Minimum Packet Length: The minimum length of a packet.
Initial Window Bytes (Forward): The total count of bytes sent in the initial window in the
forward direction.
Initial Window Bytes (Backward): The total count of bytes sent in the initial window in the
backward direction.
This process helped in selecting appropriate features for the study. Usually, cluster
analysis is done to serve this purpose in unsupervised studies [43],[45], but this study
examines the performance of supervised algorithms.
After feature selection, the datasets were imported and, using scikit-learn's train_test_split
function, the data was split into an 80 % training set and a 20 % test set. After this, the classifier's
(i.e., the machine learning model's) parameters were defined, and the model was trained on the
training set and then tested on the test set. The predictions were inspected through a confusion matrix. A
classification report was generated for each dataset and algorithm, showing the traffic
classified into 'BENIGN' and the attack type or types, along with various other metrics such as precision,
recall, F1-score and support for further analysis and conclusions.
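A condensed sketch of this pipeline, assuming df and the selected feature list from the previous steps; the random forest here stands in for whichever classifier is being evaluated.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumes `df` and `selected` come from the previous steps.
X = df[selected]
y = df["Label"]

# 80 % / 20 % split as used in the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)  # any of the classifiers above
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1-score, support per class
```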
Performance Metrics
As discussed above, in order to measure the performance of the machine learning
algorithms, we use metrics such as accuracy, precision, recall, and F1-score. The
performance indicators used for classification problems are based on the following four
possibilities:
True Positive (TP): correct classification of attack packets as attacks.
True Negative (TN): correct classification of normal packets as normal.
False Positive (FP): normal activity that is wrongly labeled as intrusive by the IDS.
False Negative (FN): intrusive activity that is classified as normal.
Accuracy, precision, recall and F1-score are defined as follows:
Accuracy: The accuracy rate is the main prediction indicator for the various machine and deep
learning classifiers. It is simply the measure of how correctly the model classifies.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative.
Precision: It is the ratio of correctly identified positive observations to all the predicted positive
observations. In other words, Precision measures the number of correct instances retrieved
divided by all retrieved instances [39].
Precision = TP / (TP + FP)
The precision is intuitively the ability of the classifier not to label as positive a sample that
is negative [40].
Recall: Recall is the ratio of correctly identified positive cases to all the observed cases. In
other words, recall measures the number of correct instances retrieved divided by all correct
instances [39].
Recall = TP / (TP + FN), where TP is the number of true positives and FN the number of
false negatives. Recall is intuitively the ability of the classifier to find all the positive samples
[40].
F1-score: It is the harmonic mean of precision and recall. It is needed when we want to find a
balance between precision and recall.
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
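As a small worked example with illustrative counts (not results from this study), the four metrics can be computed directly from TP, TN, FP and FN:

```python
# Illustrative counts only (not results from this study).
TP, TN, FP, FN = 90, 95, 10, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 185 / 200 = 0.925
precision = TP / (TP + FP)                           # 90 / 100  = 0.90
recall    = TP / (TP + FN)                           # 90 / 95  ~= 0.947
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```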
The CICIDS-2017 Dataset
CIC-IDS2017 contains benign traffic and common attacks, closely resembling true real-world data [1]. It
also includes the results of network traffic analysis with CICFlowMeter, with flows labeled based on
timestamp, source and destination IPs, source and destination ports, protocols and attack
(CSV files) [1].
Table 1. Types of Intrusion in the CICIDS-2017 dataset
No. | Group of intrusion | Type of Intrusion
1 | Normal | Benign
2 | Denial of Service (DoS) | Botnet, DDoS, DoS GoldenEye, DoS Hulk, DoS Slowhttp, DoS Slowloris
3 | Password attack | FTP-Patator, SSH-Patator, Web-Attack-Brute-Force
4 | Probing | Port Scan
5 | Vulnerability | Heartbleed Attack, Infiltration, Web-Attack-SQL-Injection, Web-Attack-XSS
The data was captured between July 3, 2017 and July 7, 2017, for a total of 5 days. The
implemented attacks include Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack,
Infiltration, Botnet and DDoS [1]. CICIDS2017 is a very large dataset containing approximately
3 million network flows spread across different files [1],[27]. CICIDS2017 does not specify
training or test sets to be used in experiments. For this study, only 10 % of the dataset was
selected for training and testing in order to keep training and testing times manageable; otherwise
they would have been very lengthy. In addition, the computer used for this study ran out of memory
when larger portions of the dataset were used for the calculations. The 10 % subset was
selected randomly using sampling without replacement, to ensure the diversity of traffic records and
avoid overfitting.
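A brief pandas sketch of this sampling step, assuming df holds one of the daily CSV files loaded as above; sample with replace=False draws the 10 % subset without replacement.

```python
# Assumes `df` is one of the CICIDS2017 daily CSV files loaded as shown earlier.
# pandas samples without replacement when replace=False (the default).
subset = df.sample(frac=0.10, replace=False, random_state=42)
print(len(subset), "of", len(df), "flows kept")
```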
The CICIDS2017 collection lists datasets under different categories. There are eight different
categories of datasets within the main folder containing the datasets. The objective was to
perform the study on each dataset separately, so instead of combining these different files into
one, the machine learning study was performed on each category of dataset separately. However,
some datasets were excluded from the study: the 'Monday-WorkingHours.pcap' dataset, as it
contained only normal benign traffic, and 'Friday-WorkingHours-Morning.pcap_ISCX', because it
presented a one-class problem. The detailed study and the results obtained from the classification
are presented in the next section of this paper.