MACHINE LEARNING PERFORMANCE EVALUATION REPORT
1.0 Objectives
The aim of this report is to evaluate the performance of different machine learning
algorithms on the prediction of attacks on a system. Three algorithms are compared: the
supervised K-Nearest Neighbour classifier, the multi-layer perceptron, and a deep
artificial neural network. We test these models on binary classification, where traffic
is either normal (0) or abnormal (1), and on multi-class classification into the
different attack categories. For this work we have used three datasets, namely
UNSW-NB15, CICIDS2017, and CICIDS2019. Each dataset is considered and analyzed
separately in this report.
2.0 Intrusion Detection using CICIDS2017 dataset
The CICIDS2017 dataset, acquired from the Kaggle platform, has been used for intrusion
detection in this section. It is one of the most commonly used intrusion detection
datasets and contains recent attacks in PCAP format as well as CSV files for machine
and deep learning purposes. It includes the results of network traffic analysis using
CICFlowMeter, with flows labeled based on the timestamp, source and destination IPs,
source and destination ports, protocols, and attack. The dataset is available as 8 CSV
files, each containing a different attack profile according to the period in which the
system data was recorded. The 8 files were concatenated into a single pandas data frame
with a total of 2,830,743 records. A partition of this dataset has been configured as a
training set. In addition to the normal state, the dataset has seven types of attack
profile, namely: Brute Force Attack, Heartbleed Attack, Botnet, DoS Attack, DDoS Attack,
Web Attack, and Infiltration Attack.
2.1 Data Pre-processing
The dataset contains 78 attributes and 1 class label in 2,830,743 rows. First, the
dataset CSV files were imported and concatenated using the pandas library. Because the
available computational power could not handle a dataset of this size, it was reduced
to 890,353 entries. The reduced dataset was then checked for null entries, which can
affect the performance of the models. The null entries were dropped, leaving 889,842
rows and 78 attributes.
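The import, concatenation and null-removal steps can be sketched as follows. This is a
minimal illustration and not the code used in the study: the two small in-memory frames
stand in for the eight CSV files, and the column names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Two tiny frames stand in for the eight CICIDS2017 CSV files (hypothetical data).
# With real files, each frame would come from pd.read_csv(path).
part_a = pd.DataFrame({
    "Flow Duration": [10, 20],
    "Flow Bytes/s": [1.5, np.nan],
    "Label": ["BENIGN", "DDoS"],
})
part_b = pd.DataFrame({
    "Flow Duration": [30, 40],
    "Flow Bytes/s": [2.5, 3.5],
    "Label": ["BENIGN", "PortScan"],
})

# Stack the parts into one frame and renumber the rows.
data = pd.concat([part_a, part_b], ignore_index=True)

# Count the null entries, then drop every row that contains one.
null_count = int(data.isnull().sum().sum())
clean = data.dropna().reset_index(drop=True)

print(null_count, data.shape, clean.shape)
```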
The column names were found to contain some unknown characters, which were removed. The
dataset is imbalanced, so some attack types, particularly the minority classes, were
merged to form new attack types. The dataset contains attack types such as DoS_Hulk,
PortScan, DDoS, DoS_GoldenEye, FTPPatator, SSHPatator, DoS_Slowloris, DoS_Slowhttptest,
Heartbleed, Bot, Web_Attack_Brute_force, Web_Attack_XSS, Web_Attack_Sql_Injection, and
Infiltration. These attack types were grouped into seven attack profiles which, together
with normal transactions, give eight classes: Normal, DoS, PortScan, DdoS, Brute_force,
Botnet, Web_Attack, and Infiltration.
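A minimal sketch of the column clean-up and label grouping is given here. The report
does not list which raw label maps to which profile, so the mapping below is an
assumption inferred from the label names, and the raw normal label "BENIGN" and the
column names are likewise assumed. Heartbleed is left out because its target profile is
not stated.

```python
import pandas as pd

# Hypothetical raw frame: a stray byte-order mark and leading spaces in the headers.
df = pd.DataFrame({
    "\ufeff Flow Duration": [1, 2, 3, 4],
    " Label": ["DoS_Hulk", "BENIGN", "FTPPatator", "Bot"],
})

# Remove non-printable characters from the column names, then trim whitespace.
df.columns = df.columns.str.replace(r"[^\x20-\x7E]", "", regex=True).str.strip()

# Assumed grouping of raw labels into the merged attack profiles.
group_map = {
    "BENIGN": "Normal",
    "DoS_Hulk": "DoS",
    "DoS_GoldenEye": "DoS",
    "DoS_Slowloris": "DoS",
    "DoS_Slowhttptest": "DoS",
    "PortScan": "PortScan",
    "DDoS": "DdoS",
    "FTPPatator": "Brute_force",
    "SSHPatator": "Brute_force",
    "Bot": "Botnet",
    "Web_Attack_Brute_force": "Web_Attack",
    "Web_Attack_XSS": "Web_Attack",
    "Web_Attack_Sql_Injection": "Web_Attack",
    "Infiltration": "Infiltration",
}

# Labels missing from the mapping would become NaN and need separate handling.
df["attack_cat"] = df["Label"].map(group_map)
print(df)
```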
To classify each entry as either normal or abnormal for binary classification, a binary
label column was created and its values were populated using a NumPy method in Python.
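The report does not name the NumPy function that was used. One common choice is
np.where, which is assumed in the sketch below together with the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-class labels after the attack types have been grouped.
df = pd.DataFrame({"attack_cat": ["Normal", "DoS", "PortScan", "Normal"]})

# Every entry that is not "Normal" is treated as abnormal.
df["label"] = np.where(df["attack_cat"] == "Normal", "normal", "abnormal")
print(df)
```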
The data distribution is visualized below for both binary classification and
multi-class classification. The binary classification categorizes each entry as either
normal or abnormal, while the multi-class classification assigns each entry to one of
eight operational states and attack types: 'Normal', 'PortScan', 'DoS', 'DdoS',
'Bruteforce', 'Botnet', 'Web attack', and 'Infiltration'.
Fig 2.1: Binary Classification Data Distribution
Fig 2.2: Multi-Class Classification Data Distribution
The next step is to normalize the data using the MinMaxScaler, which by default rescales
each column to the range 0 to 1. In total, 79 numeric columns were normalized.
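A minimal sketch of this normalization step, assuming scikit-learn's MinMaxScaler with
its default range and a toy frame with made-up column names:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with two numeric columns and one text column (hypothetical values).
df = pd.DataFrame({
    "duration": [0.0, 5.0, 10.0],
    "packets": [2.0, 4.0, 6.0],
    "label": ["normal", "abnormal", "normal"],
})

# Only numeric columns are scaled; the text column is left untouched.
numeric_cols = df.select_dtypes(include="number").columns
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
print(df)
```

Each numeric column is mapped so that its minimum becomes 0 and its maximum becomes 1.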
Since the task is divided into binary and multi-class classification, two data frames
were created, one for each. For the binary data frame, the “label” attribute, which takes
the values “normal” and “abnormal”, was encoded using the label encoder so that normal is
represented as “0” and abnormal as “1”. For the multi-class data frame, the “attack_cat”
attribute, which holds the eight operational states, was encoded using LabelEncoder() and
also one-hot encoded, which increased the number of columns to 88.
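The two encodings can be sketched as follows, under stated assumptions. scikit-learn's
LabelEncoder assigns integer codes in sorted order of the class names, so the sketch uses
an explicit comparison for the binary label to guarantee the 0 = normal, 1 = abnormal
coding described above. The frame and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "label": ["normal", "abnormal", "abnormal"],
    "attack_cat": ["Normal", "DoS", "PortScan"],
})

# Binary frame: 0 for normal, 1 for abnormal.
bin_data = df.copy()
bin_data["label"] = (bin_data["label"] == "abnormal").astype(int)

# Multi-class frame: integer codes plus one indicator column per class.
multi_data = df.copy()
encoder = LabelEncoder()
multi_data["attack_code"] = encoder.fit_transform(multi_data["attack_cat"])
one_hot = pd.get_dummies(multi_data["attack_cat"], dtype=int)
multi_data = multi_data.join(one_hot)

print(bin_data)
print(multi_data)
```

One-hot encoding adds one column per class, which is why the column count grows after
this step.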
The next step in the data preprocessing is feature extraction (more precisely, feature
selection), for which the Pearson correlation coefficient was used. The binary and
multi-class data frames had 80 and 88 columns respectively, including the class label.
The correlation matrices showing the correlation coefficients between the features of
each data frame are given below in Fig 2.3 and Fig 2.4. A correlation matrix can be used
to select a subset of features (i.e., variables) from the larger set of features in a
dataset in order to improve the performance of a machine learning model. To do so, we
first calculate the correlation coefficient for each pair of features in the dataset. We
then visualize the correlation matrix as a heatmap, which highlights the pairs of
features that are highly correlated with each other. The correlation coefficient ranges
from -1 to 1. A value of -1 indicates a perfect negative correlation: as one variable
increases, the other decreases. A value of 1 indicates a perfect positive correlation: as
one variable increases, the other also increases. A value of 0 indicates no linear
relationship between the two variables.
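The three reference values of the coefficient can be reproduced on toy vectors. The
sketch below uses NumPy's corrcoef; the vectors are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises in step with x: perfect positive correlation.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls as x rises: perfect negative correlation.
r_neg = np.corrcoef(x, -3 * x)[0, 1]

# y is symmetric around the middle of x: no linear relationship.
r_zero = np.corrcoef(x, np.array([1.0, -1.0, 0.0, -1.0, 1.0]))[0, 1]

print(round(r_pos, 6), round(r_neg, 6), round(r_zero, 6))
```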
Fig 2.3: Heatmap showing the correlation matrix for binary classification dataframe
Fig 2.4: Heatmap showing the correlation matrix for multi-class classification dataframe
Only the attributes with a correlation coefficient greater than 0.3 with the target
attribute were selected; the rest were dropped. After feature extraction, the data frame
for binary classification had 20 attributes plus the class label and was saved as
bin_data.csv, while the data frame for multi-class classification had 22 attributes plus
the class label and was saved as multi_data.csv.
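The threshold-based selection can be sketched as below. The report does not state
whether the signed or the absolute coefficient was compared with 0.3; the sketch assumes
the absolute value, and the feature names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "f_strong": [0, 0, 0, 1, 1, 1, 1, 1],
    "f_weak":   [0, 1, 0, 1, 0, 1, 0, 1],
    "f_neg":    [1, 1, 1, 1, 0, 0, 0, 1],
    "label":    [0, 0, 0, 0, 1, 1, 1, 1],
})

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr()["label"].drop("label")

# Keep features whose absolute correlation with the target exceeds 0.3.
selected = corr_with_target[corr_with_target.abs() > 0.3].index.tolist()
reduced = df[selected + ["label"]]

print(corr_with_target.round(3).to_dict())
print(selected)
```

With real data, the reduced frame could then be written out with
reduced.to_csv("bin_data.csv", index=False).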
The next step is to split the dataset into training and testing data. The dataset was
randomly split, with 70% used for training and 30% for testing.
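A minimal sketch of a random 70/30 split, assuming scikit-learn's train_test_split. The
random_state value is an arbitrary choice made here for reproducibility and is not taken
from the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, binary target.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(X_train.shape, X_test.shape)
```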
2.2 Modeling
2.2.1 Binary Classification
Supervised algorithms were trained and tested to predict whether each entry is normal or
abnormal. Three models were used: K-Nearest Neighbour and the multilayer perceptron, both
from the scikit-learn library, and a deep neural network from the Keras library.
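The training and testing of the two scikit-learn models can be sketched as follows. The
data is synthetic, and the hyperparameters (n_neighbors, hidden_layer_sizes, max_iter)
are illustrative assumptions, since the report does not state the values used. The Keras
network is omitted from this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic binary data standing in for bin_data.csv.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "KNeighbors Classifier": KNeighborsClassifier(n_neighbors=5),
    "MLP Classifier": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

# Fit each model and record its (train accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```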
The evaluation metrics used for the model performance comparisons are:
Training Accuracy
Testing Accuracy
Recall
Precision
F1 Score
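These metrics can be computed with scikit-learn as sketched below. The averaging mode is
an assumption: in the result tables of this report the recall equals the test accuracy,
which is what weighted averaging produces, so average="weighted" is used here. The label
vectors are made up.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical actual and predicted labels for eight test entries.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")

print(accuracy, precision, recall, f1)
```

Training accuracy is obtained in the same way by scoring the predictions made on the
training data.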
Each of these models was trained and tested to predict whether an entry is normal or
abnormal. The predictions were compared against the actual values for each entry and
saved in a folder as CSV files. A plot of the actual against the predicted values was
also produced for each algorithm to show how well the model fits the data; these plots
are given below and were also saved into a separate folder.
Fig 2.5: Plot between real and predicted values for KNN binary Classification
Fig 2.6: Plot between real and predicted values for MLP binary Classification
Fig 2.7: Plot between real and predicted values for ANN binary classification
A table comparing the real values with the values predicted by the model was also
generated to show how well each model performs. The table for each model is given
below.
Fig 2.8: Table showing binary predictions using KNN
Fig 2.9: Table showing binary predictions using Multilayer perceptron
Fig 2.10: Table showing binary predictions using ANN
The tables above give the predictions for the first five and last five rows of the test
data. Each row shows the actual value and the predicted value for a given set of
features, which lets us see how the algorithm performed in relation to its test
accuracy. In some cases the model does not predict the actual value, as the tables show.
For binary classification, 1 represents an attack and 0 indicates a normal transaction,
which is why the tables contain only 1s and 0s.
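A table of this kind can be built with pandas as sketched below. The values and column
names are hypothetical; the CSV text is produced in memory here, whereas passing a file
path to to_csv would write it to disk.

```python
import pandas as pd

# Hypothetical actual and predicted labels for twelve test entries.
y_test = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0]

results = pd.DataFrame({"actual": y_test, "predicted": y_pred})

# First five and last five rows, as in the prediction tables.
preview = pd.concat([results.head(5), results.tail(5)])

# Number of entries where the prediction differs from the actual value.
mismatches = int((results["actual"] != results["predicted"]).sum())

# CSV text of the full comparison.
csv_text = results.to_csv(index=False)

print(preview)
print("mismatches:", mismatches)
```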
The performance metrics of the algorithms are compared using bar plots, which are shown
below for each performance metric.
Fig 2.11: Train Accuracy Comparison for the Different Algorithms in binary
Fig 2.12: Test Accuracy Comparison for the Different Algorithms in binary
Fig 2.13: Precision Comparison for the Different Algorithms in binary
Fig 2.14: Recall Comparison for the Different Algorithms in binary
Fig 2.15: F1score Comparison for the Different Algorithms in binary
A table of the metrics for each model, sorted by F1 score in descending order (i.e., from
highest to lowest), is given below. We focus on the F1 score because, unlike accuracy, it
accounts for both precision and recall and is therefore better suited to our imbalanced
labels.
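Sorting the metrics by F1 score can be sketched with pandas as follows. The values are
the binary-classification results reported in Table 2.1; the column names are chosen
here for illustration.

```python
import pandas as pd

metrics = pd.DataFrame({
    "model": ["MLP Classifier", "KNeighbors Classifier", "Deep Neural Network"],
    "test_accuracy": [0.978813, 0.995471, 0.438643],
    "f1_score": [0.978774, 0.9954, 0.267485],
})

# Highest F1 score first.
ranked = metrics.sort_values("f1_score", ascending=False).reset_index(drop=True)
print(ranked)
```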
The deep neural network was built with an input layer, two hidden layers, and an output
layer.
The table below shows the values of the performance metrics considered for each
algorithm.
Table 2.1: Performance metrics for binary classification
Algorithm              Train Accuracy  Test Accuracy  Precision  Recall    F1_score  AUC
MLP Classifier         0.978685        0.978813       0.979081   0.978813  0.978774  0.976853
KNeighbors Classifier  0.996486        0.995471       0.995473   0.995471  0.9954    0.995234
Deep Neural Network    0.438481        0.438643       0.438643   0.438643  0.267485  0.500000
The K-Nearest Neighbour classifier gave the best F1 score (0.9954), which makes it the
best-performing model. The MLP classifier also performed well (0.978774), whereas the
deep neural network lagged far behind (0.267485). Its AUC of exactly 0.5 suggests that it
assigned every sample to a single class rather than learning a useful decision boundary.
Overall, K-Nearest Neighbour is a good algorithm for predicting whether traffic is normal
or abnormal.
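An AUC of exactly 0.5, as reported for the deep neural network in Table 2.1, is the value
obtained when a model outputs the same class for every sample. The sketch below
illustrates this on synthetic labels. The 40/60 class split is made up, and whether the
network in this report behaved this way is an assumption, not something the report
states.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic test labels: 40% class 1, 60% class 0.
y_true = np.array([1] * 40 + [0] * 60)

# A degenerate model that predicts class 1 for every sample.
y_pred = np.ones_like(y_true)

acc = accuracy_score(y_true, y_pred)  # equals the share of class 1
auc = roc_auc_score(y_true, y_pred)   # constant scores cannot rank the samples
f1w = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(acc, auc, round(f1w, 6))
```

In this setting the accuracy equals the share of the predicted class, and the weighted
F1 score falls well below the accuracy.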
2.2.2 Multi-classification
Supervised algorithms were trained and tested to predict the attack type in multi-class
classification. Three models were used: K-Nearest Neighbour and the multilayer
perceptron, both from the scikit-learn library, and a deep neural network from the Keras
library.
The evaluation metrics used for the model performance comparisons are:
Training Accuracy
Testing Accuracy
Recall
Precision
F1 Score
Each of these models was trained and tested to predict the attack category of each
entry. The predictions were compared against the actual values and saved in a folder as
CSV files. A plot of the actual against the predicted values was also produced for each
algorithm to show how well the model fits the data; these plots are shown below and were
also saved into a separate folder.
Fig 2.16: Plot between real and predicted values for KNN Multiclass Classification
Fig 2.17: Plot between real and predicted values for Multi-layer perceptron Multiclass
Classification
Fig 2.18: Plot between real and predicted values for ANN Multiclass Classification
A table comparing the real values with the values predicted by the model was also
generated to show how well each model performs. The table for each model is given
below.
Fig 2.19: Table showing multi-class predictions using KNN
Fig 2.20: Table showing multi-class predictions using MLP
Fig 2.21: Table showing multi-class predictions using ANN
The tables above give the predictions for the first five and last five rows of the test
data. Each row shows the actual value and the predicted value for a given set of
features, which lets us see how the algorithm performed in relation to its test
accuracy. In some cases the model does not predict the actual value, as the tables show.
Since the attack category has been encoded for multi-class classification, each of the
eight classes is now represented by a number from 0 to 7, where 0 represents a normal
transaction and each of the other digits represents an attack category.
The performance metrics of the algorithms are compared using bar plots, which are shown
below for each performance metric.
Fig 2.22: Train Accuracy Comparison for the Different Algorithms in multi-class
Fig 2.23: Test Accuracy Comparison for the Different Algorithms in multi-class
Fig 2.24: Precision Comparison for the Different Algorithms in multi-class
Fig 2.25: Recall Comparison for the Different Algorithms in multi-class
Fig 2.26: F1score Comparison for the Different Algorithms in multi-class
A table of the metrics for each model, sorted by F1 score in descending order (i.e., from
highest to lowest), is given below. We focus on the F1 score because, unlike accuracy, it
accounts for both precision and recall and is therefore better suited to our imbalanced
labels.
Table 2.2: Performance metrics for multiclass classification
Algorithm              Train Accuracy  Test Accuracy  Precision  Recall    F1_score
KNeighbors Classifier  0.999944        0.999914       0.999913   0.999914  0.999913
MLP Classifier         0.999844        0.999839       0.999839   0.999839  0.999837
Neural Network         0.999865        0.999850       0.999852   0.999850  0.999848
Here, the best model for multi-class classification is K-Nearest Neighbour, although all
three models achieved F1 scores above 0.9998. We observed that the deep neural network
performed far better on multi-class classification than on binary classification.
3.0 Intrusion Detection using UNSW-NB15 dataset
The dataset used for intrusion detection in this section is UNSW-NB15, which was acquired
from the Kaggle platform. The dataset contains a total of 2,540,044 records stored in
four CSV files. A partition of the dataset has been configured as a training set and a
testing set. The training set contains 175,341 records, while the testing set contains
82,332 records. The training set has been used for this project. The dataset has nine
classes: the normal state and eight types of attack, namely Fuzzers, Analysis,
Backdoors, Denial of Service (DoS), Exploits, Generic, Reconnaissance, and Worms.
3.1 Data Pre-processing
The dataset contains 45 attributes and 175,341 rows. First, the dataset was imported
using the pandas library and checked for null entries, which can affect the performance
of the models. The null entries were dropped, leaving 81,173 rows and 45 attributes.
The data types of the attributes were converted using the information provided in the
features.CSV file, which lists the features together with an explanation of each.
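A type conversion driven by a features file can be sketched as below. The layout of the
features file is an assumption: the "Name" and "Type" columns, the type names, and the
mapping to pandas dtypes are hypothetical and chosen for illustration only.

```python
import pandas as pd

# Hypothetical excerpt of a features file: one row per attribute.
features = pd.DataFrame({
    "Name": ["dur", "sbytes", "proto"],
    "Type": ["Float", "Integer", "nominal"],
})

# Hypothetical raw data in which every column was read as text.
data = pd.DataFrame({
    "dur": ["0.5", "1.2"],
    "sbytes": ["100", "250"],
    "proto": ["tcp", "udp"],
})

# Assumed mapping from the documented type names to pandas dtypes.
type_map = {"float": "float64", "integer": "int64", "nominal": "object"}

for name, kind in zip(features["Name"], features["Type"]):
    data[name] = data[name].astype(type_map[kind.lower()])

print(data.dtypes)
```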
The data distribution is visualized below for both binary classification and
multi-class classification. The binary classification categorizes each entry as either
normal or abnormal, while the multi-class classification assigns each entry to one of
nine operational states and attack types: 'Analysis', 'Backdoor', 'DoS', 'Exploits',
'Fuzzers', 'Generic', 'Normal', 'Reconnaissance', and 'Worms'.
Fig 3.1: Binary Classification Data Distribution
Fig 3.2: Multi-Class Classification Data Distribution
Since machine learning models work with numeric data rather than text data, the
attributes that hold text need to be converted to numeric form by encoding them. The
‘proto’, ‘service’ and ‘state’ columns were one-hot encoded and the original columns were
removed. The one-hot encoding created 19 new attributes, which were joined to the main
data frame, increasing the number of attributes to 61.
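A minimal sketch of this one-hot encoding step, assuming pandas get_dummies and a toy
frame with made-up values:

```python
import pandas as pd

data = pd.DataFrame({
    "dur": [0.1, 0.4, 0.2],
    "proto": ["tcp", "udp", "tcp"],
    "service": ["http", "dns", "-"],
    "state": ["FIN", "INT", "FIN"],
})

cat_cols = ["proto", "service", "state"]

# One indicator column per distinct value of each text column.
encoded = pd.get_dummies(data[cat_cols], dtype=int)

# Remove the original text columns and join the indicator columns.
data = data.drop(columns=cat_cols).join(encoded)
print(data.columns.tolist())
```

The number of new columns equals the total number of distinct values across the encoded
columns, which is 7 in this toy frame.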
The next step is to normalize the data using the MinMaxScaler, which by default rescales
each column to the range 0 to 1. In total, 58 numeric columns were normalized.
Since the task is divided into binary and multi-class classification, two data frames
were created, one for each. For the binary data frame, the “label” attribute, which takes
the values “normal” and “abnormal”, was encoded using the label encoder so that normal is
represented as “0” and abnormal as “1”. For the multi-class data frame, the “attack_cat”
attribute was encoded using LabelEncoder() and also one-hot encoded, which increased the
number of columns to 69.
The next step in the data preprocessing is feature extraction (more precisely, feature
selection), for which the Pearson correlation coefficient was used. The binary and
multi-class data frames had 61 and 69 attributes respectively. The correlation matrix
discussed in the earlier section was again used to determine the correlation
coefficients between the features of each data frame, as shown in Fig 3.3 and Fig 3.4.
Fig 3.3: Heatmap showing the correlation matrix of the features of dataframe for binary
classification
Fig 3.4: Heatmap showing the correlation matrix of the features of dataframe for
multi-class classification
Only the attributes with a correlation coefficient greater than 0.3 with the target
attribute were selected; the rest were dropped. After feature extraction, the data frame
for binary classification had 15 attributes and was saved as bin_data.csv, while the data
frame for multi-class classification had 16 attributes and was saved as multi_data.csv.
The next step is to split the dataset into training and testing data. The dataset was
randomly split, with 80% used for training and 20% for testing.
3.2 Modeling
3.2.1 Binary Classification
Supervised algorithms were trained and tested to predict whether each entry is normal or
abnormal. Three models were used: K-Nearest Neighbour and the multilayer perceptron, both
from the scikit-learn library, and a deep neural network from the Keras library.
The evaluation metrics used for the model performance comparisons are:
Training Accuracy
Testing Accuracy
Recall
Precision
F1 Score
Each of these models was trained and tested to predict whether an entry is normal or
abnormal. The predictions were compared against the actual values for each entry and
saved in a folder as CSV files. A plot of the actual against the predicted values was
also produced for each algorithm to show how well the model fits the data; these plots
are given below and were also saved into a separate folder.
Fig 3.3: Plot between real and predicted values for KNN binary Classification
Fig 3.4: Plot between real and predicted values for MLP binary Classification
Fig 3.5: Plot between real and predicted values for ANN binary classification
A table comparing the real values with the values predicted by the model was also
generated to show how well each model performs. The table for each model is given
below.
Fig 3.6: Table showing binary predictions using KNN
Fig 3.7: Table showing binary predictions using MLP
Fig 3.8: Table showing binary predictions using ANN
The performance metrics of the algorithms are compared using bar plots, which are shown
below for each performance metric.
Fig 3.9: Train Accuracy Comparison for the Different Algorithms in binary
Fig 3.10: Test Accuracy Comparison for the Different Algorithms in binary
Fig 3.11: Precision Comparison for the Different Algorithms in binary
Fig 3.12: Recall Comparison for the Different Algorithms in binary
Fig 3.13: F1score Comparison for the Different Algorithms in binary
A table of the metrics for each model, sorted by F1 score in descending order (i.e., from
highest to lowest), is given below. We focus on the F1 score because, unlike accuracy, it
accounts for both precision and recall and is therefore better suited to our imbalanced
labels.
The deep neural network was built with an input layer, two hidden layers, and an output
layer.
The table below shows the values of the performance metrics considered for each
algorithm.
Table 3.1: Performance metrics for binary classification
Algorithm              Train Accuracy  Test Accuracy  Precision  Recall    F1_score  AUC
MLP Classifier         0.981475        0.983677       0.983662   0.983677  0.983581  0.971956
KNeighbors Classifier  0.986603        0.983061       0.983006   0.983061  0.983008  0.973734
Deep Neural Network    0.760094        0.759224       0.759224   0.759224  0.655313  0.500000
The MLP classifier gave the best F1 score (0.983581), narrowly ahead of K-Nearest
Neighbour (0.983008), so both models performed well. The deep neural network lagged far
behind (0.655313), and its AUC of exactly 0.5 suggests that it assigned every sample to
a single class. Overall, both the MLP classifier and K-Nearest Neighbour are good
algorithms for predicting whether traffic is normal or abnormal.
3.2.2 Multi-classification
Supervised algorithms were trained and tested to predict the attack type in multi-class
classification. Three models were used: K-Nearest Neighbour and the multilayer
perceptron, both from the scikit-learn library, and a deep neural network from the Keras
library.
The evaluation metrics used for the model performance comparisons are:
Training Accuracy
Testing Accuracy
Recall
Precision
F1 Score
Each of these models was trained and tested to predict the class of each entry as one of
'Analysis', 'Backdoor', 'DoS', 'Exploits', 'Fuzzers', 'Generic', 'Normal',
'Reconnaissance', or 'Worms'. The predictions were compared against the actual values
and saved in a folder as CSV files. A plot of the actual against the predicted values
was also produced for each algorithm to show how well the model fits the data; these
plots are shown below and were also saved into a separate folder.
Fig 3.14: Plot between real and predicted values for KNN Multiclass Classification
Fig 3.15: Plot between real and predicted values for Multi-layer perceptron Multiclass
Classification
Fig 3.16: Plot between real and predicted values for ANN Multiclass Classification
A table comparing the real values with the values predicted by the model was also
generated to show how well each model performs. The table for each model is given
below.
Fig 3.17: Table showing multi-class predictions using KNN
Fig 3.19: Table showing multi-class predictions using MLP
Fig 3.20: Table showing multi-class predictions using ANN
The performance metrics of the algorithms are compared using bar plots, which are shown
below for each performance metric.
Fig 3.21: Train Accuracy Comparison for the Different Algorithms in multi-class
Fig 3.22: Test Accuracy Comparison for the Different Algorithms in multi-class
Fig 3.23: Precision Comparison for the Different Algorithms in multi-class
Fig 3.24: Recall Comparison for the Different Algorithms in multi-class
Fig 3.25: F1score Comparison for the Different Algorithms in multi-class
A table of the metrics for each model, sorted by F1 score in descending order (i.e., from
highest to lowest), is given below. We focus on the F1 score because, unlike accuracy, it
accounts for both precision and recall and is therefore better suited to our imbalanced
labels.
Table 3.2: Performance metrics for Multiclass classification
Algorithm              Train Accuracy  Test Accuracy  Precision  Recall    F1_score
KNeighbors Classifier  0.978969        0.973760       0.972222   0.973760  0.972976
MLP Classifier         0.974411        0.975443       0.974641   0.975443  0.974828
Neural Network         0.974129        0.976265       0.975553   0.976265  0.974859
Here, the best model for multi-class classification is the deep neural network (F1 score
0.974859), which narrowly outperformed the MLP classifier (0.974828) and K-Nearest
Neighbour (0.972976).