
AI in Cyber Security

The document discusses the increasing challenges of network security due to the rise in digitization and network traffic, highlighting the significance of Intrusion Detection Systems (IDS) in identifying malicious activities. It evaluates the effectiveness of various machine learning and deep learning algorithms for anomaly detection, using performance metrics such as accuracy, precision, recall, and F-1 score on the CICIDS-2017 dataset. The study aims to enhance the performance of IDS through feature selection and the application of advanced algorithms to improve detection accuracy.

Uploaded by

onyinye cynthia

Over the past decade, more and more businesses and organizations have been digitizing their confidential data. This has increased the volume of network traffic, with data being created at a very large scale. Computer networks have expanded tremendously over the last decade, especially with the emergence of new devices and services like cloud computing and the Internet of Things (IoT). Securing this data is a major challenge. Attacks on networks have also increased significantly, and network intrusion is acknowledged to be the greatest danger to security [1],[2].

Attacks like Denial of Service (DoS), zero-day attacks and Advanced Persistent Threats (APT) have been significant problems in today’s global information technology community. This is where the Intrusion Detection System (IDS) comes in handy. Intrusion Detection Systems are hardware and software systems that can identify such harmful behaviors. The main objective of an IDS is to observe the behavior of the system, identify attacks and generate alarms so that appropriate actions can be taken to prevent any harmful consequences [2].

Intrusion can be detected using two classification techniques, i.e., signature-based and anomaly-based. In signature-based (also known as pattern-based) detection, observed activity is checked against a list of patterns already stored in a database. Signature-based intrusion detection comes with a drawback: it is unable to learn by itself any anomalous patterns and intrusions within raw data.

Anomaly-based intrusion detection characterizes normal or benign activity and looks for anything that deviates from it. It can learn abnormal patterns using Machine Learning and Deep Learning concepts. The inputs of an IDS could be traffic logs, application logs, file system changes, packets, etc. that are monitored, and the output is a label for each input [2].

Numerous research studies have been conducted in the field of Machine Learning (ML) and Deep Learning (DL) because
they can learn trends of malicious behaviors while reducing false alarms [9]. Several authors have attempted to do a
comprehensive survey on Machine Learning and Deep Learning techniques for anomaly detection [2],[4],[5].

It turns out that much of the research in this area is based on shallow learning techniques, which require a lot of time, effort and resources, and whose effectiveness depends on the expertise and extent of knowledge of the researchers in the field [10].

Network Intrusion Detection using Machine Learning (ML) and Deep Learning (DL) is one of the most significant developments in the field of information security. There is competition among researchers, leading companies, and economies to advance Deep Learning and Artificial Intelligence. In some tasks, such as those behind modern mobile applications, stock prediction and movie rating prediction, Artificial Intelligence has exceeded human performance. Although DL and ML have accomplished a lot in detecting network attacks, there are still areas where effectiveness is lacking. The precision, accuracy and performance of the algorithms that classify these attacks could be improved in order to help prevent them.

As the volume of network traffic has increased and computer networks have expanded over the last decade, especially with the emergence of new devices and services like cloud computing and the Internet of Things (IoT), attacks on networks globally have increased significantly [34]. Malware, spear-phishing and ransomware top the list of cybersecurity threats. Besides those, many other network intrusion attacks like denial of service, zero-day attacks and advanced persistent threats (APT) have been reported as significant problems in today’s global information technology community. APTs can be dangerous and costly, as these are powerful attacks launched by malicious actors against government and private organizations with the intent of causing great damage.

Objective of this project

The main objective of this study is to evaluate the effectiveness of machine learning models using various performance metrics. The performance metrics used for this study are accuracy, precision, recall and F-1 score. The goal is to test the performance of various machine learning algorithms on subsets of the different categories of the realistic evaluation dataset CICIDS-2017.

It was expected that our machine learning model comprising feature selection using Pearson’s correlation coefficient
coupled with these algorithms would increase the accuracy on the CICIDS2017 dataset. This would be the contribution of
this study in the field of application of machine learning on anomaly detection.

Machine Learning and Deep Learning Algorithms

In recent years, Machine Learning and Deep Learning algorithms in anomaly detection have garnered huge interest [4],[23]. Anomaly-based intrusion detection is essentially a classification problem, and Machine Learning and Deep Learning algorithms have proven to be useful in Network Intrusion Detection [5],[6].

Machine Learning is a branch of Artificial Intelligence that gives computers the ability to learn without being explicitly programmed [23]. Deep Learning is an advanced field in Machine Learning research that imitates the way the human brain analyzes and interprets data. Deep Learning is essentially an advancement of the Machine Learning process and is derived from the Artificial Neural Network. Deep Learning algorithms have been described as the most significant breakthrough of the century, significantly driving applications towards Artificial Intelligence [11].


Traditional Machine Learning methods used for intrusion detection, such as Support Vector Machine (SVM), Decision Tree, Linear Regression, Hidden Markov Model, etc., have shallow architectures and are not capable of handling intrusion detection in modern data environments [24]. The idea of Deep Learning was proposed by Hinton [25]; it is a Machine Learning method based on learning characterizations (representations) of data. Some examples of Deep Learning algorithms include Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Deep Boltzmann Machine (DBM), etc.

Logistic Regression: Logistic regression is a predictive analysis algorithm based on the concept of probability. Although its name makes it sound like a regression algorithm, it is a classification algorithm. It is used for binary classification, where a logistic function, also called the sigmoid function, is used for prediction.
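As a minimal sketch of how the sigmoid function turns a weighted sum of features into a binary decision: the weights, bias, feature values and the 0.5 threshold below are hypothetical illustrations, not values fitted in this study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability); label is 1 when the probability reaches the threshold."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability


# Hypothetical weights for two made-up features; not fitted to any data.
label, prob = predict_label([2.0, 1.0], weights=[1.5, -0.5], bias=-1.0)
print(label, round(prob, 3))
```

In practice the weights would be learned from labeled traffic rather than chosen by hand.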

Kernelized Support Vector Machine (SVM): Support Vector Machines comprise a set of supervised learning methods. SVM is one of the simplest and most common ML algorithms used to categorize different types of data. It is a non-probabilistic method. It constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space to classify the instances. It is a powerful model and performs well on a variety of datasets. It has been used to identify network intrusion quickly and accurately [41]. However, it requires very careful pre-processing of the data and tuning of parameters.
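The need for pre-processing and parameter tuning can be sketched with scikit-learn as follows. This is only an illustration on synthetic data: the scaler, the RBF kernel and the grid of `C` and `gamma` values are assumptions, not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for flow features; not CICIDS2017 data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Scale features first: kernel SVMs are sensitive to feature ranges.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical search grid for the regularization strength and kernel width.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refitted on each cross-validation training fold, so the held-out fold does not leak into the scaling.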

K-Nearest Neighbor: KNN is a classification algorithm based on the standard Euclidean distance between two points in the same space [8]. It is very simple and easy to implement, and there is no need to build a model and optimize parameters. However, the algorithm becomes very slow as the number of examples or variables increases. The two important parameters in the KNN algorithm are the number of neighbors and the way the distance between data points is measured. The default distance is the Euclidean distance, which works well in many cases.
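The two parameters named above, the number of neighbors and the distance measure, can be seen in a small from-scratch sketch. The function name `knn_predict`, the toy points and the labels are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature data, made up for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["BENIGN", "BENIGN", "BENIGN", "ATTACK", "ATTACK", "ATTACK"]

predictions = [knn_predict(points, labels, q, k=3) for q in [(0.5, 0.5), (5.5, 5.5)]]
print(predictions)
```

Every prediction scans the whole training set, which is why the method slows down as the number of examples grows.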

Naive Bayes: The Naive Bayes algorithm is a supervised learning algorithm based on Bayes’ theorem, which assumes conditional independence between every pair of features given the value of the class variable. The algorithm is easy to implement, but it requires the predictors to be independent. Since most realistic cases have predictors that are dependent, the performance of the classifier is negatively affected. Naive Bayes classifiers are efficient because they learn parameters by looking at each feature individually and collecting statistics from each feature. There are three Naive Bayes classifiers implemented in scikit-learn: BernoulliNB, MultinomialNB and GaussianNB. For this study, GaussianNB was used because it can be applied to any continuous data [31].

The dataset used in this study is comparatively high-dimensional, and GaussianNB is mostly used on very high-dimensional data. The GaussianNB model requires very little training time and makes predictions quickly.
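A minimal sketch of GaussianNB on continuous features follows. The two Gaussian clusters are synthetic and chosen only to show that the model stores a per-class mean for each feature; they are not taken from the CICIDS2017 data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two well-separated synthetic clusters standing in for two traffic classes.
rng = np.random.default_rng(0)
benign = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
attack = rng.normal(loc=6.0, scale=1.0, size=(100, 2))
X = np.vstack([benign, attack])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB()
model.fit(X, y)

# theta_ holds the mean of each feature for each class (shape: classes x features).
class_means = model.theta_
predictions = model.predict([[0.0, 0.0], [6.0, 6.0]])
print(class_means.round(2), predictions)
```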

Decision tree: A decision tree is a supervised ML algorithm used to classify data. The architecture of a decision tree comprises a root node, internal nodes and category (leaf) nodes. Decision trees are the building blocks of Random Forest. They are simple and easy to implement and can handle high-dimensional data. One advantage of running Random Forest (RF) is that fewer parameters have to be specified compared to other machine learning methods like Support Vector Machines (SVM) and Artificial Neural Networks (ANN) [28].

Decision trees make their decision (classification) by learning a hierarchy of if/else questions. In the language of Machine Learning, these if/else questions are known as ‘tests’. To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable [31]. The main downside of decision trees is that they tend to overfit the training data and generalize poorly.
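The hierarchy of if/else tests can be made visible by printing a fitted tree. The sketch below uses four made-up samples and hypothetical feature names; `max_depth` is set only to illustrate one common way of limiting overfitting.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Four made-up flows: [packets per second, mean packet length]; 0 = benign, 1 = attack.
X = [[10, 0.1], [12, 0.2], [900, 0.1], [950, 0.3]]
y = [0, 0, 1, 1]

# Limiting the depth is one simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the learned tests as nested if/else rules.
rules = export_text(tree, feature_names=["packets_per_s", "mean_len"])
print(rules)
```

Here a single test on the first feature separates the two classes, so the tree stops after one split.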

Random Forest (RF)


The random forest classifier was proposed by Breiman [29]. It is essentially an ensemble of classification trees, that is, a classifier constructed from many decision trees. It can handle thousands of input variables without deleting variables and estimates their importance in the classification [28].

In random forest, a collection of individual tree-structured classifiers can be mathematically expressed as below:

{ h(x, θk), k = 1, 2, … } [30]

where each h(x, θk) is a tree-structured classifier in the forest, the {θk} are independent, identically distributed random vectors, and each tree casts a vote for the most popular class at input x.

Figure 1. Structure of a decision tree.

As discussed above, a random forest is basically a collection of decision trees and the trees are slightly different from one
another. The issue of decision trees suffering from overfitting of training data is solved by random forests. Random forest
is a strong classifier.
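As a rough sketch of a forest as a collection of slightly different trees, the example below fits scikit-learn's `RandomForestClassifier` on synthetic data. The number of trees and the data are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not CICIDS2017.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Each tree is trained on a bootstrap sample with random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

n_trees = len(forest.estimators_)           # the individual decision trees
importances = forest.feature_importances_   # per-feature importance estimates
print(n_trees, importances.round(3))
```

Aggregating many such trees averages out the quirks of any single tree, which is how the ensemble reduces overfitting.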

Methodology

Introduction

The data for this study is secondary data, i.e., data collected by other researchers. It was generated by researchers from the Canadian Institute for Cybersecurity [1]. The dataset closely resembles real-world traffic.

Figure 2. Flow chart of the method used in the calculation.

In this study, a small subset of the CICIDS2017 data was taken to optimize a Machine Learning model that can help detect the attacks listed in Table 1. The dataset comprises traffic flows extracted using CICFlowMeter [16], labeled with timestamp, source and destination IPs, source and destination ports, protocols and type of attack.

Hardware and Software Environment


Operating System : Windows 10 Home

Processor : AMD Ryzen 5 3600 6-Core Processor, 3.6 GHz

Installed RAM: 16.0 GB

Startup Disk : McIntosh HD


Software Environment: Python 3.9.4 64-bit

Design of the Study


The study is computational in nature. Our model uses the Pearson correlation coefficient as the feature elimination technique and various supervised Machine Learning classifiers for performing classification. The Python libraries that are useful in the study are scikit-learn, NumPy, pandas, Keras, Matplotlib, TensorFlow, and PyTorch.

The calculation was performed in a Jupyter Notebook using Python. First, the required Python libraries were imported. Then the dataset was imported and analyzed. As with every dataset, missing data had to be handled and appropriate features selected. The scikit-learn library comes in very handy for these tasks.

Feature selection is a very important task, as it helps reduce computational complexity and eliminate unnecessary and irrelevant features while enhancing the performance of the IDS [34],[35],[38]. Correlation-based feature selection has been found to improve classification accuracy and reduce the dimensionality of the dataset [36],[37]. The correlation function called from the scikit-learn library is used to obtain a correlation matrix. A correlation coefficient is a measure of the degree to which variation in one variable is related to variation in one or more other variables [32],[34].

The value of the correlation coefficient can range from -1 to 1. A value close to +1 indicates a very strong positive relationship between the variables, and a value close to -1 indicates a very strong negative relationship. The sign of the correlation shows the direction of the relationship between the variables [33], so the value of the correlation tells us how the variables are related.

For feature selection with continuous variables, if two features are highly correlated, they contribute essentially the same information to the target result, so one of them can be left out when selecting features.
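A minimal sketch of correlation-based feature elimination with pandas is given below. The helper name `drop_highly_correlated`, the 0.9 threshold and the toy columns are assumptions for illustration; the threshold and exact procedure used in this study are not stated here.

```python
import pandas as pd


def drop_highly_correlated(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep the first feature of every pair whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr(method="pearson").abs()
    columns = list(corr.columns)
    to_drop = set()
    for i, first in enumerate(columns):
        if first in to_drop:
            continue
        for second in columns[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > threshold:
                to_drop.add(second)
    return frame.drop(columns=sorted(to_drop))


# Toy flow features, made up for illustration.
flows = pd.DataFrame({
    "fwd_packets": [1, 2, 3, 4, 5],
    "fwd_bytes": [10, 20, 30, 40, 50],  # perfectly correlated with fwd_packets
    "duration": [5, 1, 4, 2, 3],        # weakly correlated with fwd_packets
})
selected = drop_highly_correlated(flows)
print(list(selected.columns))
```

Which member of a correlated pair to keep is a design choice; this sketch simply keeps the one that appears first.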

Figure 3. Scatter plot showing how the value of correlation coefficient defines the relationship

between attributes.

Figure 4. Set up for Pearson’s correlation coefficient in jupyter notebook


Figure 5. Pearson’s correlation plot showing features considered for this study.

After the analysis of the Pearson’s correlation plot, the final 14 features that were selected were:

1. Total Flow Duration

2. Total Forward Packets

3. Total Length of Forward Packets

4. Forward Packet’s maximum length


5. Forward Packet’s minimum length

6. Forward Packet’s mean length

7. Backward Packet maximum length

8. Backward Packet minimum length

9. Flow Bytes per second

10. Flow Packets per second

11. Backward Packets per second

12. Minimum Packet Length

13. Initial Window Bytes (Forward)

14. Initial Window Bytes (Backward)

Description of each selected feature

Total Flow Duration: The total duration of the flow in microseconds.

Total Forward Packets: Total packets in the forward direction.

Total Length of Forward Packets: Total size of packets in the forward direction.

Forward Packet’s maximum length: The maximum size of a packet in the forward direction.

Forward Packet’s minimum length: The minimum size of a packet in the forward direction.

Forward Packet’s mean length: The mean size of a packet in the forward direction.

Backward Packet maximum length: The maximum size of a packet in the backward direction.

Backward Packet minimum length: The minimum size of a packet in the backward direction.

Flow Bytes per second: The number of flow bytes per second.

Flow Packets per second: The number of flow packets per second.

Backward Packets per second: The number of backward packets per second.

Minimum Packet Length: The minimum length of a packet.

Initial Window Bytes (Forward): The total count of bytes sent in the initial window in the forward direction.

Initial Window Bytes (Backward): The total count of bytes sent in the initial window in the backward direction.

This analysis helped in selecting appropriate features for this study. Usually, cluster analysis is done to serve this purpose in the case of unsupervised studies [43],[45], but we conducted the study to see the performance of supervised algorithms.

After feature selection, the datasets were imported and, using scikit-learn’s train_test_split function, the data was split into an 80% training set and a 20% test set. The parameters of the classifier, i.e., the machine learning model, were then defined, and the model was trained on the training set and tested on the test set. The predictions were examined through a confusion matrix. A classification report was generated for each dataset and algorithm, showing the traffic classified into ‘BENIGN’ and the attack type or types. It also reports metrics such as precision, recall, F-1 score and support for further analysis and conclusion.
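The workflow described above can be sketched as follows. This is an illustration on synthetic data with 14 features; the choice of a decision tree, the random seeds and the stratified split are assumptions and not necessarily the configuration used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with 14 features; class 0 plays the role of BENIGN here.
X, y = make_classification(n_samples=500, n_features=14, random_state=0)

# 80% training set, 20% test set (stratification is an added assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=["BENIGN", "ATTACK"])
print(matrix)
print(report)
```

The printed report lists precision, recall, F-1 score and support per class, matching the metrics discussed in this paper.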

Performance Metrics

As discussed above, the performance of machine learning algorithms is measured using metrics such as accuracy, precision, recall, and F-1 score. The performance indicators used for classification problems are based on the following four possibilities:

True Positive (TP): attack packets correctly classified as attacks.

True Negative (TN): normal packets correctly classified as normal.

False Positive (FP): normal activity that is wrongly labeled as intrusive by the IDS.

False Negative (FN): intrusive activity that is wrongly classified as normal.
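The four outcomes can be counted directly from true and predicted labels. The sketch below uses a hypothetical helper `confusion_counts` and made-up labels, with "ATTACK" treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="ATTACK"):
    """Count (TP, TN, FP, FN), treating `positive` as the attack class."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1  # attack correctly flagged
            else:
                fp += 1  # normal traffic wrongly flagged
        else:
            if truth == positive:
                fn += 1  # attack missed
            else:
                tn += 1  # normal traffic correctly passed
    return tp, tn, fp, fn


# Made-up labels for six flows.
y_true = ["ATTACK", "ATTACK", "BENIGN", "BENIGN", "BENIGN", "ATTACK"]
y_pred = ["ATTACK", "BENIGN", "BENIGN", "ATTACK", "BENIGN", "ATTACK"]

counts = confusion_counts(y_true, y_pred)
print(counts)
```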


Accuracy, precision, recall and F-1 score are defined as follows:

Accuracy: The accuracy rate is the main prediction indicator for machine learning and deep learning classifiers. It is the fraction of all instances that the model classifies correctly.

Accuracy = (Tp + Tn)/(Tp + Fp + Tn + Fn)

where Tp = True Positive, Tn = True Negative, Fp = False Positive, Fn = False Negative.

Precision: It is the ratio of correctly identified positive observations to all predicted positive observations. In other words, precision is the number of correct instances retrieved divided by all retrieved instances [39].

Precision = True Positive / (True Positive + False Positive)

Precision is intuitively the ability of the classifier not to label as positive a sample that is negative [40].

Recall: It is the ratio of correctly identified positive cases to all actual positive cases. In other words, recall is the number of correct instances retrieved divided by all correct instances [39].

Recall = True Positive / (True Positive + False Negative)

Recall is intuitively the ability of the classifier to find all the positive samples [40].

F-1 score: It is the harmonic mean of precision and recall. It is needed when we want to find a balance between precision and recall.

F-1 score = 2 * (Precision * Recall)/(Precision + Recall)
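The four formulas can be checked with a short worked example. The counts below are invented for illustration, and returning 0.0 when a denominator is zero is a convention assumed here rather than something defined in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Invented counts: 1000 flows, of which 110 are attacks.
accuracy, precision, recall, f1 = classification_metrics(tp=90, tn=880, fp=10, fn=20)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

With these counts the accuracy is high even though about 18% of attacks are missed, which is why recall and F-1 score are reported alongside accuracy.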

The CICIDS-2017 Dataset


CIC-IDS2017 contains benign traffic and common attacks and closely resembles real-world data [1]. It also includes the results of network traffic analysis using CICFlowMeter, with flows labeled based on the timestamp, source and destination IPs, source and destination ports, protocols and attack (CSV files) [1].

Table 1. Types of Intrusion in the CICIDS-2017 dataset

No.  Group of intrusion       Type of intrusion
1    Normal                   Benign
2    Denial of Service (DoS)  Botnet, DDoS, DoS GoldenEye, DoS Hulk, DoS Slowhttp, DoS Slowloris
3    Password attack          FTP-Patator, SSH-Patator, Web-Attack-Brute-Force
4    Probing                  Port Scan
5    Vulnerability            Heartbleed Attack, Infiltration, Web-Attack-SQL-Injection, Web-Attack-XSS

The data was captured over 5 days, from July 3, 2017 to July 7, 2017. The implemented attacks include Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet and DDoS [1]. CICIDS2017 is a very large dataset with approximately 3 million network flows spread across different files [1][27]. CICIDS2017 does not specify training or test sets to be used in experiments. For this study, only 10% of the dataset was selected for training and testing in order to reduce training and testing time, which would otherwise be very lengthy. In addition, the computer used for this study ran into memory errors when larger portions of the dataset were used for calculations. The 10% of the dataset was selected randomly, using sampling without replacement, to ensure the diversity of traffic records and to avoid overfitting.
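Random selection of 10% of the records without replacement can be sketched with pandas as below. The toy frame, its column names and the seed are hypothetical; the actual files and seed used in this study are not specified here.

```python
import pandas as pd

# Toy stand-in for one CICIDS2017 CSV file: 1000 flows with a label column.
flows = pd.DataFrame({
    "flow_id": range(1000),
    "label": ["BENIGN"] * 900 + ["DDoS"] * 100,
})

# Draw 10% of the rows at random; replace=False means no row is picked twice.
subset = flows.sample(frac=0.10, replace=False, random_state=42)

print(len(subset))
print(subset["label"].value_counts())
```

Fixing `random_state` makes the draw repeatable, which helps when comparing several classifiers on the same subset.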


The CICIDS2017 data is distributed as separate datasets listed under different categories. There are eight different categories of datasets within the main folder. The objective was to study each dataset separately, so instead of combining the different files into one, the machine learning study was performed on each category of dataset separately. However, some datasets were excluded from the study: the ‘Monday-WorkingHours.pcap’ dataset was excluded because it contained only normal benign traffic, and ‘Friday-WorkingHours-Morning.pcap_ISCX’ was excluded because of a one-class problem. The detailed study and the results obtained from the classification are presented in the next section of this paper.
