Detecting Cybersecurity Attacks Across Different Network Features and Learners

*Correspondence: [email protected]
Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

Abstract
Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we
use the CSE-CIC-IDS2018 dataset to investigate ensemble feature selection on the per-
formance of seven classifiers. CSE-CIC-IDS2018 is big data (about 16,000,000 instances),
publicly available, modern, and covers a wide range of realistic attack types. Our
contribution is centered around answers to three research questions. The first question
is, “Does feature selection impact performance of classifiers in terms of Area Under the
Receiver Operating Characteristic Curve (AUC) and F1-score?” The second question is,
“Does including the Destination_Port categorical feature significantly impact perfor-
mance of LightGBM and Catboost in terms of AUC and F1-score?” The third question
is, “Does the choice of classifier: Decision Tree (DT), Random Forest (RF), Naive Bayes
(NB), Logistic Regression (LR), Catboost, LightGBM, or XGBoost, significantly impact
performance in terms of AUC and F1-score?” These research questions are all answered
in the affirmative and provide valuable, practical information for the development of
an efficient intrusion detection model. To the best of our knowledge, we are the first to
use an ensemble feature selection technique with the CSE-CIC-IDS2018 dataset.
Keywords: Feature selection, Intrusion detection, Catboost, XGBoost, LightGBM,
SlowlorisBig, Big data, CSE-CIC-IDS2018
Introduction
CSE-CIC-IDS2018 [1], also referred to as the 2018 dataset throughout this text, is an
intrusion detection dataset with normal and anomalous instances of network traffic.
Machine learning models efficiently trained on CSE-CIC-IDS2018 can detect network
traffic capable of compromising an information system. This dataset is the most recent
iteration of ISCXIDS2012 [2], a scalable project designed to produce modern, realistic
datasets. CSE-CIC-IDS2018 data originated from an extensive network of victim and
attack machines [3], yielding an aggregate of 16,233,002 instances. Six classes of attack
traffic (percentage distribution shown in Table 1) are represented by about 17% of these
instances.
The 2018 dataset has a binary class imbalance, with non-attack instances far outnumbering attack instances.
Table 1 Percentage distribution of classes in CSE-CIC-IDS2018
Benign 83.070
DDoS 7.786
DoS 4.031
Brute force 2.347
Botnet 1.763
Infiltration 0.997
Web attack 0.006
The dataset is distributed over ten CSV files that are downloadable from the cloud (https://www.unb.ca/cic/datasets/ids-2018.html). Nine files consist of 79 independent variables, and the remaining file consists of 83 independent variables.
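As a point of reference, the class distribution in Table 1 can be reproduced with a few lines of Python. This is only a minimal sketch, not our exact pipeline; the directory path, glob pattern, and "Label" column name are assumptions.

```python
# Minimal sketch: load the ten CSE-CIC-IDS2018 CSV files and compute the
# percentage distribution of the Label column (compare with Table 1).
# The directory path and "Label" column name are assumptions.
import glob
import pandas as pd

frames = [pd.read_csv(path, low_memory=False)
          for path in glob.glob("CSE-CIC-IDS2018/*.csv")]
data = pd.concat(frames, ignore_index=True)

print((data["Label"].value_counts(normalize=True) * 100).round(3))
```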
Machine learning is greatly facilitated by the high number of features in CSE-CIC-
IDS2018. Machine learning algorithms typically outperform traditional statistical meth-
ods in classification tasks [4, 5]. However, the decision thresholds of some learners may not be appropriate for imbalanced data, rendering these algorithms inefficient at distinguishing between the majority and minority classes in a highly imbalanced environment.
The learners will consequently fail to properly model the distribution of the positive
(minority) class and become biased in favor of the negative (majority) class. Therefore,
one must employ metrics that safeguard against this outcome. The two metrics used in
this study, F1-score and Area Under the Receiver Operating Characteristic (ROC) Curve
(AUC), are suitable for evaluating classifier performance on imbalanced datasets [6, 7].
We note that class imbalance is more noticeable in big data as the number of majority
class instances is disproportionately high in that environment [8, 9].
The ensemble feature selection [10, 11] approach in this paper is tailored toward
improving classifier performance by using a relevant subset of variables from CSE-CIC-
IDS2018. It is worth noting that feature selection also provides data clarity and reduces
computation requirements. In our study, we utilize both supervised and filter-based [12]
feature ranking techniques, and the last stage of our ensemble approach is the selection
of common features from these techniques.
The specific properties of big data can make classification more challenging for learn-
ers trained on the 2018 dataset. These properties include volume, variety, velocity, vari-
ability, value, and complexity [8]. Traditional methods may have difficulty handling the
high data volume, the diversity of data formats, the speed of data originating from dif-
ferent sources, data flow inconsistencies, the filtering of important data, and data linking
and transformation.
Classifier performance in our case study is based on the training and testing of the fol-
lowing learners: Decision Tree (DT) [13], Random Forest (RF) [14], Naive Bayes (NB)
[15], Logistic Regression (LR) [16], Catboost [17], LightGBM [18], and XGBoost [19].
These learners are selected for their good coverage of several Machine Learning (ML)
model families and are viewed favorably in terms of performance [20]. The seven classi-
fiers are further discussed in "Classifier development and metrics" section.
To the best of our knowledge, this study is the most comprehensive analysis on CSE-
CIC-IDS2018 to date. Our work uniquely uses the 2018 dataset to investigate ensemble
feature selection on the performance of seven classifiers. Our contribution is defined by
our responses to three research questions: The first question is, “Does feature selection
impact performance of classifiers in terms of AUC and F1-score?” The second question
is, “Does including the Destination_Port categorical feature significantly impact perfor-
mance of LightGBM and Catboost in terms of AUC and F1-score?” And, our third ques-
tion is, “Does the choice of classifier: DT, RF, NB, LR, Catboost, LightGBM, or XGBoost,
significantly impact performance in terms of AUC and F1-score?” The answers to these
research questions provide valuable and practical information for the development of an
efficient intrusion detection model.
The remainder of this paper is organized as follows: "Related work" provides an over-
view of literature that manipulates features of CSE-CIC-IDS2018; "Methodology" sec-
tion describes the cleaning process of the 2018 dataset, our unique ensemble approach
for feature selection, the classifiers and metrics used in the study, and the training and
testing procedure for these classifiers; "Results and discussion" section presents and dis-
cusses our empirical results; "Conclusion" section concludes our paper with a summary
of the work presented and suggestions for related future work.
Related work
In this section, we highlight studies that modify features of CSE-CIC-IDS2018 to
improve classification results. However, to the best of our knowledge, none of these
studies use an ensemble feature selection approach.
To address the high class imbalance of the 2018 dataset, Hua [21] uses an undersam-
pling and embedded feature selection approach with a LightGBM classifier. Undersam-
pling [22] randomly removes majority class instances to alter class distribution. During
the data cleaning stage, missing values and useless features were removed, resulting in
a modified set of 77 features. String labels were converted to integer labels, which were
then one-hot encoded. In addition to LightGBM, six other learners were evaluated in
this research work: Support Vector Machine (SVM) [23], RF, Adaboost [24], Multilayer
Perceptron (MLP) [25], Convolutional Neural Network (CNN) [26], and Naive Bayes.
Learners were implemented with Scikit-learn [27] and TensorFlow [28]. The train to test
data ratio was 70 to 30, and XGBoost was used to perform feature selection. LightGBM
had the best performance of the group, with an optimum accuracy of 98.37% when the
sample size was three million and the top ten features were selected. For this accuracy,
the precision and recall were 98.14% and 98.37%, respectively. LightGBM also had the
second fastest training time among the classifiers.
In another related work [29], five learners were evaluated on two datasets
(CSE-CIC-IDS2018 and ISOT HTTP Botnet [30]) to determine the best botnet classi-
fier. The ISOT HTTP Botnet dataset contains malicious and benign instances of Domain
Name System (DNS) traffic. The learners in the study include RF, DT, k-Nearest Neigh-
bor (k-NN) [31], Naive Bayes, and SVM. Feature selection was performed using various
techniques, including the feature importance method [32] of RF. Subsequent to feature
selection, CSE-CIC-IDS2018 had 19 independent attributes while ISOT HTTP had 20,
with destination port number, source port number, and transport protocol among the
selected features. The models were implemented with Python and Scikit-learn. About
80% of botnet instances were used for training, where five-fold cross-validation was
applied. The remaining botnet instances served as the testing set. For optimization, the
Grid Search algorithm [33] was used. With regard to CSE-CIC-IDS2018, the RF and DT
learners scored an accuracy of 99.99%. Tied to this accuracy, the precision was 100% and
the recall was 99.99% for both learners. The RF and DT learners also had the highest
accuracy for ISOT HTTP (99.94% for RF and 99.90% for DT).
Li et al. [34], in a third related study, apply clustering and feature selection to CSE-
CIC-IDS2018. This unsupervised learning study involves online real-time detection with
an autoencoder classifier. An autoencoder encodes data in a way that usually results
in dimensionality reduction [35]. For preprocessing, “Infinity” and “NaN” values were
replaced by 0, and the data was subsequently divided into sparse and dense matrices,
normalized by L2 regularization. A sparse matrix has a majority of elements with value
0, while a dense matrix has a majority of elements with non-zero values. The model
was built within a Python environment. The best features were selected by RF, and the
train to test data ratio was set as 85 to 15. The Affinity Propagation (AP) clustering [36]
algorithm was subsequently used on 25% of the training dataset to group features into
subsets, which were sent to the autoencoder. Recall rates for all attack types for the pro-
posed model were compared with those of another autoencoder model called Kitnet
[37]. Several attack types for both models had a recall of 100%. Only the proposed model
was evaluated with the AUC metric, with several attack types yielding a score of 1. Based
on detection time results, the authors showed that their model has a faster detection
time than KitNet.
Fitni and Ramli [38] adopt an ensemble model approach to compare seven single learners for integration into a classifier unit. The learners evaluated include RF, Gaussian Naive Bayes [39], DT, Quadratic Discriminant Analysis [40], Gradient Boosting, and Logistic Regression. The models were built with Python and Scikit-learn. During pre-
processing, samples with missing values and infinity were removed. Records that were
actually a repetition of the header rows were also removed. The dataset was then divided
into training and testing validation sets in an 80-20 ratio. Feature selection [41], a tech-
nique for selecting the most important features of a predictive model, was performed
using the Spearman’s rank correlation coefficient [42] and Chi-squared test [43], result-
ing in the selection of 23 features. After the evaluation of the seven learners with these
features, Gradient Boosting, Logistic Regression, and DT emerged as the top performers
for use in the ensemble model. Accuracy, precision, and recall scores for this model were
98.80%, 98.80%, and 97.10%, respectively, along with an AUC of 0.94.
Finally, D’hooge et al. include both CICIDS2017 and CSE-CIC-IDS2018 in a study
investigating how efficiently the results of an intrusion detection dataset can be general-
ized [44]. CICIDS2017 is the predecessor of the 2018 dataset. For performance evalua-
tion, the authors used 12 supervised learning algorithms from various families: DT, RF,
Bag [45], gradient-boosted decision tree (GBDT), Extratree [46], Adaboost, XGBoost,
k-NN, Ncentroid [47], linearSVC [48], RBFSVC [49], and Logistic Regression. The mod-
els were built with the Scikit-learn and XGBoost modules in Python. The authors used
feature scaling, which is different from feature selection. Feature scaling attempts to
normalize the feature space of all attributes. Results show that the tree-based classifiers
yielded the best performance, and among them, XGBoost ranked first with many per-
fect values for F1-score and AUC. D'hooge et al. hinted that overfitting might have been a problem and that "further analysis" was warranted. We note that their source code indicated hyperparameter values of max-depth = 35 for some of their tree-based learners.
Such values are prone to overfitting. For intrusion detection, the authors concluded
that a model trained on one dataset (CICIDS2017) cannot generalize to another dataset
(CSE-CIC-IDS2018).
In summary, the related works exhibit shortcomings, such as nearly perfect classification performance values that are typically associated with overfitting. We discovered additional shortcomings, such as errors in data preparation (e.g., using Destination_Port as a numeric value instead of a categorical value) and in data cleaning. Ambiguous specifications are also an issue with regard to the reproducibility of these studies.
Methodology
Data cleaning
Removing certain fields from CSE-CIC-IDS2018 was our first step in the data cleaning stage. We dropped the Protocol field because it is redundant: each Dst Port (Destination_Port) value mostly corresponds to a single Protocol value. We dropped the Timestamp field because we did not want the learners to discriminate between attack predictions based on time, especially with more stealthy attacks in mind. In other words, the learners should be able to distinguish attacks regardless of whether they are high volume or slow and stealthy. Dropping the Timestamp field also allows us the convenience of combining or dividing the datasets in ways more compatible with our experimental frameworks.
We removed 59 records that were actually repetitions of the header rows. These were easily found and removed by filtering records against a white list of valid label values. The fourth downloaded file, "Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv", differed from the other nine files of the 2018 dataset: it contained four extra columns (Flow ID, Src IP, Src Port, and Dst IP), which we dropped. Certain fields contained negative values that did not make sense, so we dropped the instances with negative values in the Fwd_Header_Length, Flow_Duration, and Flow_IAT_Min fields. In particular, the negative values in the Fwd_Header_Length field co-occur with extreme values in other fields, and these extreme values skew statistics that are sensitive to outliers.
Eight fields contained values of zero for every instance. Prior to the start of machine learning, we filtered out the following list of fields (a code sketch of the full cleaning procedure follows the list):
1. Bwd_PSH_Flags
2. Bwd_URG_Flags
3. Fwd_Avg_Bytes_Bulk
4. Fwd_Avg_Packets_Bulk
5. Fwd_Avg_Bulk_Rate
6. Bwd_Avg_Bytes_Bulk
7. Bwd_Avg_Packets_Bulk
8. Bwd_Avg_Bulk_Rate
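The following Python sketch outlines the cleaning steps described above. It is an illustration rather than our exact implementation, and it assumes the ten CSV files have already been concatenated into a pandas DataFrame df with the underscore-style column names used in this section.

```python
import pandas as pd

# Drop the redundant/unwanted fields, plus the four extras found in one file
drop_cols = ["Protocol", "Timestamp", "Flow ID", "Src IP", "Src Port", "Dst IP"]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

# Remove repeated header rows via a white list of valid label values
valid_labels = set(df["Label"].unique()) - {"Label"}
df = df[df["Label"].isin(valid_labels)]

# Drop instances with nonsensical negative values
for col in ["Fwd_Header_Length", "Flow_Duration", "Flow_IAT_Min"]:
    df = df[pd.to_numeric(df[col], errors="coerce") >= 0]

# Drop the eight fields whose value is zero for every instance
zero_cols = [c for c in df.columns if c != "Label"
             and (pd.to_numeric(df[c], errors="coerce") == 0).all()]
df = df.drop(columns=zero_cols)
```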
Filter‑based techniques
The filter-based feature ranking techniques we use are based on the Information Gain
(IG) (also known as Mutual Information) [50], Gain Ratio (GR) [51], and Chi-Squared
(CS) [52] statistics. We use the value of the statistic calculated for each feature to filter
the list of all features to a reduced list.
To calculate IG and GR statistics, we use the “info_gain” and “info_gain_ratio” func-
tions from the info_gain Python library. To calculate the CS statistic, we use the “chi2”
function that is included as part of the Scikit-learn library. One may employ the same
method to rank features of a dataset with any of these 3 functions. We do not supply
configuration parameters to the IG, GR, or CS functions when we invoke them. Each of
the three functions accepts two arrays of data for input. We employ all 66 usable features
of the 2018 dataset as the source of data for the first input array, and the “label” value
of the 2018 data for the second input array. All three functions also return a list, l, of
numbers where we use the value of the ith element of the list to determine its rank, r,
relative to the values of the other elements in the list. To be concrete, we create a list l ′ of
pairs (ri , i) from l, and then sort l ′ in decreasing order of ri . After sorting l ′ , we truncate it
at the 20th element. If the feature ranking technique assigns an importance of zero to a
feature, we do not include it in the list of ranked features. For instance, we find CatBoost
assigns an importance greater than 0 to fewer than 20 features.
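For illustration, a minimal sketch of one filter-based ranking follows. It assumes X is a DataFrame holding the 66 usable features (non-negative after cleaning, as chi2 requires) and y is the binary label; the info_gain and info_gain_ratio functions are used in exactly the same way.

```python
from sklearn.feature_selection import chi2

scores, _ = chi2(X, y)   # one statistic per feature, in the order of X.columns

# Pair each score with its feature, drop zero-importance features,
# sort in decreasing order of the statistic, and truncate to the top 20
ranked = sorted(zip(scores, X.columns), key=lambda pair: pair[0], reverse=True)
cs_ranking = [feat for score, feat in ranked if score > 0][:20]
```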
Some readers may have reservations about the applicability of IG, GR, or CS to cat-
egorical or numeric features. We apply IG, GR, and CS feature selection techniques to
CSE-CIC-IDS2018 network traffic data in a manner similar to Singh et al. in [53]. In
their study, Singh et al. apply these techniques to the KDD CUP 1999 network traf-
fic dataset. This dataset is similar to the 2018 dataset in that it contains numeric and
categorical features. Therefore, we are comfortable applying these filter-based feature
ranking techniques to the 2018 dataset. Table 2 contains the rankings for the three filter-
based feature ranking techniques.
Through filter-based feature ranking, we obtain three out of the seven lists used to
select features for our models. The remaining four lists of features are obtained with
supervised feature selection techniques that are discussed in the next subsection.
Supervised feature selection techniques
Our supervised feature ranking techniques use the feature importance values of the CatBoost, XGBoost, LightGBM, and RF classifiers, the last of which is implemented in the Scikit-learn Python library. Here, we discuss how we use elements common to the implementations of RF, CatBoost, XGBoost, and LightGBM to employ them as ranking techniques.
All four libraries have classifier objects. These objects have initialization (constructor)
functions. One may pass configuration options to the initialization functions. Please see
Tables 3, 4, 5, and 6 for the configuration options we use for each classifier object. Each
object also has a “fit” function. After the classifiers’ fit function is successfully invoked,
the classifier object has a list attribute “feature_importances_”. We use the feature_
importances_ list in the same way we use the list of values l returned by the functions
for filter-based ranking techniques discussed in the previous subsection. Hereafter, we
refer to the feature selection technique of using the feature importance values from Cat-
Boost, LightGBM, XGBoost, and RF classifiers by the names of the classifiers, where not
ambiguous.
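A minimal sketch of this pattern with the Scikit-learn RF classifier is shown below; X and y are as in the previous subsection, and the configuration options of Tables 3, 4, 5, and 6 are not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)  # illustrative settings
rf.fit(X, y)

# feature_importances_ plays the role of the score list l used by the
# filter-based techniques: pair, drop zeros, sort, keep the top 20
pairs = sorted(zip(rf.feature_importances_, X.columns), reverse=True)
rf_ranking = [feat for imp, feat in pairs if imp > 0][:20]
```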
As discussed in the previous subsection, CSE-CIC-IDS2018 has one categorical fea-
ture: Destination_Port. We found this feature has 53,760 possible values in the 2018
dataset. Hence, we concluded that finding an appropriate encoding technique for this
feature is outside the scope of our study. However, CatBoost and LightGBM have built-
in support for categorical features, so we include Destination_Port as a candidate for
ranking for CatBoost and LightGBM, but not for XGBoost or Random Forest.
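A sketch of how Destination_Port can be offered to CatBoost and LightGBM as a categorical ranking candidate is given below. Here X_port denotes the feature DataFrame that still contains the Destination_Port column, and the hyperparameter settings of Tables 3 and 4 are again omitted.

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# CatBoost takes the categorical column by name via cat_features
cat_model = CatBoostClassifier(verbose=0)
cat_model.fit(X_port, y, cat_features=["Destination_Port"])

# LightGBM detects pandas "category" columns automatically
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_port.astype({"Destination_Port": "category"}), y)

# Both expose feature_importances_, which we rank exactly as above
```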
All supervised ranking techniques yield 20 features, except CatBoost. Due to imple-
mentation details and the hyper-parameter settings we use, CatBoost will not construct
DTs with a number of nodes sufficient to utilize 20 or more features in the training data.
Hence, we find CatBoost provides rankings with fewer than 20 features. Tables 7 and 8
contain the top 20 features in all supervised rankings, except CatBoost, which ranks only
14 features.
We use supervised ranking techniques to generate 4 out of 7 rankings, and filter-based
ranking techniques to generate the remaining 3 out of 7 rankings. We use the 7 rankings
to conduct ensemble feature selection. In the following subsection, we cover the specif-
ics of our feature selection techniques.
Feature selection
After obtaining the 7 rankings, our feature selection techniques are to select features
that appear in k out of 7 rankings, where k has the value 4, 5, 6, or 7. Hence, we have
4 feature selection techniques, based on an ensemble of 7 feature ranking techniques.
We refer to the set of features that appear in 4 out of 7 rankings as “feature group 1.”
This is our first ensemble feature selection technique. Since feature group 1 contains the Destination_Port categorical feature, which some learners that we use cannot consume directly, "feature group 1A" is the set of all features in feature group 1 with Destination_Port excluded.
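The ensemble selection step itself amounts to counting how many of the seven ranked lists contain each feature. The sketch below illustrates this, assuming rankings is the list of the seven ranked-feature lists produced above and interpreting "k out of 7" as at least k.

```python
from collections import Counter

def feature_group(rankings, k):
    """Features that appear in at least k of the seven rankings."""
    counts = Counter(f for ranking in rankings for f in set(ranking))
    return sorted(f for f, c in counts.items() if c >= k)

groups = {k: feature_group(rankings, k) for k in (4, 5, 6, 7)}  # groups 1-4
# The "A" variants exclude the categorical Destination_Port feature
groups_a = {k: [f for f in feats if f != "Destination_Port"]
            for k, feats in groups.items()}
```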
Tables 7 and 8 (top 20 features in the supervised rankings) include the following features: Destination_Port, Packet_Length_Std, min_seg_size_forward, Flow_IAT_Min, Fwd_Packet_Length_Std, Packet_Length_Variance, Fwd_IAT_Min, Idle_Max, Fwd_Packet_Length_Mean, Fwd_Packets_s, Bwd_Packet_Length_Mean, Flow_IAT_Max, Bwd_Packet_Length_Std, Fwd_Packet_Length_Max, Fwd_Header_Length, Fwd_IAT_Total, RST_Flag_Count, Bwd_IAT_Total, Bwd_Packets_s, and Fwd_IAT_Max.
Table 9 Features in feature groups 1 and 1A
Fwd_Packet_Length_Mean, Fwd_Packet_Length_Max, Flow_IAT_Mean, Total_Length_of_Fwd_Packets, Bwd_Packets_s, Fwd_Packets_s, Flow_Bytes_s, Fwd_IAT_Max, Fwd_IAT_Total, Flow_IAT_Std, Flow_IAT_Max, Destination_Port*, min_seg_size_forward, Flow_Packets_s, Fwd_Header_Length
* Indicates Destination_Port not included in feature group 1A
Table 10 Features in feature groups 2 and 2A
Fwd_Packets_s, Fwd_Header_Length, Fwd_Packet_Length_Mean, Fwd_IAT_Total, Flow_IAT_Max, Bwd_Packets_s, Fwd_Packet_Length_Max, Destination_Port*
* Indicates Destination_Port not included in feature group 2A
Table 11 Features in feature groups 3 and 3A
Fwd_Packets_s, Fwd_Header_Length, Flow_IAT_Max, Bwd_Packets_s, Fwd_Packet_Length_Max, Destination_Port*
* Indicates Destination_Port not included in feature group 3A
Table 12 Features in feature groups 4 and 4A
Flow_IAT_Max, Destination_Port*
* Indicates Destination_Port not included in feature group 4A
After selecting features listed in Tables 9, 10, 11, and 12, the datasets are suitable for
training and testing classifiers. The reader should not attach any significance to the
order of features in Tables 9, 10, 11, and 12.
We create a total of 11 datasets. Four are the result of applying the four feature
selection techniques, and another four are the result of adding or removing the Des-
tination_Port categorical feature as needed. In order to assess the impact of feature
selection, we require two more datasets, one that contains all 66 usable features, and
a similar dataset with all features except Destination_Port. We call these datasets “all
features” and “all features A”, respectively. Finally, we have one dataset that contains
only the Destination_Port feature, which we call “Destination_Port only”. In the next
subsection we review classifiers that we train and test with the 2018 dataset.
We do one set of experiments where Destination_Port is the only feature. For these
experiments we use CatBoost, LightGBM, and Scikit-learn’s NB classifier for cate-
gorical data, CategoricalNB [57].
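A minimal sketch of the Destination_Port-only experiment with CategoricalNB follows. Here df is the cleaned data, the ports are ordinal-encoded because CategoricalNB expects integer category codes, and the split ratio, random seed, and min_categories workaround (scikit-learn 0.24 or later) are assumptions rather than our exact procedure.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

X = OrdinalEncoder(dtype=int).fit_transform(df[["Destination_Port"]].astype(str))
y = (df["Label"] != "Benign").astype(int).to_numpy()  # 1 = attack, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# min_categories keeps port codes seen only in the test split inside the
# fitted category range
nb = CategoricalNB(min_categories=int(X.max()) + 1)
nb.fit(X_train, y_train)
attack_scores = nb.predict_proba(X_test)[:, 1]
```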
Before we train and test our classifiers, we initialize them with certain parameters.
The settings of these parameters were selected based on experimentation. We list
these initialization parameters in Tables 13, 14, 15, 16, and 17. We do not provide
tables of initialization parameters for the Naive Bayes or Logistic Regression constructors because we did not set any for those two classifiers.
Classifier metrics
Our work records the confusion matrix (Table 18) for a binary classification problem,
where the class of interest is usually the minority class and the opposite class is the
majority class, i.e. positives and negatives, respectively. A related list of simple perfor-
mance metrics [58] is explained as follows:
• True Positive (TP) is the number of positive samples correctly identified as posi-
tive.
• True Negative (TN) is the number of negative samples correctly identified as neg-
ative.
• False Positive (FP), also known as Type I error, is the number of negative instances
incorrectly identified as positive.
• False Negative (FN), also known as Type II error, is the number of positive
instances incorrectly identified as negative.
In our study, we used more than one performance metric to better understand the challenge of evaluating machine learning models with severely imbalanced data. The metrics are explained below, and a brief computation sketch follows the list:
• F1-score (traditional), also known as the harmonic mean of precision and recall, is
equal to 2 · Precision · Recall/(Precision + Recall).
• AUC is equal to the area under the Receiver Operating Characteristic (ROC)
curve, which graphically shows recall versus (1-specificity) across all classifier
decision thresholds. From this curve, the AUC obtained is a single value that
ranges from 0 to 1, with a perfect classifier having a value of 1.
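The sketch below shows how both metrics can be computed with Scikit-learn from the confusion-matrix counts and the classifier's scores; y_test, y_pred, and y_score are placeholders for the true labels, predicted labels, and positive-class scores.

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = 2 * precision * recall / (precision + recall)  # equals f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)                # threshold-independent
```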
Table 20 Mean performance of CatBoost and LightGBM in terms of AUC and F1-score (with standard deviations) on datasets with features from feature group 1
Table 21 Mean performance of CatBoost and LightGBM in terms of AUC and F1-score (with standard deviations) on datasets with features from feature group 2
Table 22 Mean performance of CatBoost and LightGBM in terms of AUC and F1-score (with standard deviations) on datasets with features from feature group 3
Table 23 Mean performance of CatBoost and LightGBM in terms of AUC and F1-score (with standard deviations) on datasets with features from feature group 4
Table 24 Mean performance of CatBoost and LightGBM in terms of AUC and F1-score (with standard deviations) on datasets with all features
In Tables 21, 22, 23 and 24 we report the results of further experiments involving Cat-
Boost and LightGBM. In these tables, we train and test models on data with features in
feature groups 2, 3, 4, and all features.
In Tables 25, 26, 27, 28 and 29 we report performance of the 7 classifiers CatBoost,
LightGBM, DT, LR, NB, RF, and XGBoost as we train and test them on datasets with
features from feature groups 1A through 4A, and feature group all features A. Since
these datasets do not contain the Destination_Port categorical feature, more classifiers
are available for us to experiment with.
Fig. 1 Box plots of AUC grouped by classifier; here datasets do not include destination port; Tukey's HSD test indicates LightGBM, Random Forest, and XGBoost in group a, CatBoost and Decision Tree in group b, Logistic Regression in group c, and Naive Bayes in group d (factors in the same group are not significantly different)
LightGBM yields an AUC value of 0.96890 and an F1-score of 0.96134. We see analogous patterns of similar or better performance in Tables 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29 for the other classifiers and datasets.
Therefore, we perform two-factor ANOVA tests with classifiers and datasets as the
factors, and AUC or F1-score as the dependent variable. In all cases, with one excep-
tion, the p-values for the ANOVA tests are zero, so we conclude that classifier and data-
set are significant factors affecting the outcome of experiments. The exception is for
experiments involving the dataset with one feature of Destination_Port only. For this
one-feature dataset, the classifier choice is not significant. We report the results of these
experiments with the one-feature Destination_Port only dataset in Table 19, where we
are forced to report results to 8 decimal places instead of the usual 5 to show any differ-
ence in performance when we use different classifiers. Otherwise, Tukey’s HSD tests are
appropriate for both classifier and dataset factors.
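The statistical tests can be reproduced along the lines of the following sketch, which assumes results is a pandas DataFrame with one row per experiment and columns "classifier", "dataset", and "auc" (the same code applies to the F1-score column); the ANOVA here omits an interaction term, which is an assumption on our part.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Two-factor ANOVA with classifier and dataset (feature group) as factors
model = ols("auc ~ C(classifier) + C(dataset)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey's HSD at the 99% confidence level for each factor
print(pairwise_tukeyhsd(results["auc"], results["classifier"], alpha=0.01))
print(pairwise_tukeyhsd(results["auc"], results["dataset"], alpha=0.01))
```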
The dataset factor (feature group 1, 1A, etc.) in an experiment is equivalent to the application of a feature selection technique. To gauge the impact of feature selection and classifier choice, we conduct Tukey's HSD tests at a 99% confidence level for the dataset and classifier factors.
We see in Figs. 1 and 3 that performance in terms of AUC and F1-score is influenced by the classifier.
Fig. 2 Box plots of AUC grouped by feature group/selection technique; here datasets do not include destination port; Tukey's HSD test indicates feature groups/selection techniques 1A and all features A in group a, 2A and 3A in group b, and 4A in group c
Fig. 3 Box plots of F1-score grouped by classifier; here datasets do not include destination port; Tukey's HSD test indicates LightGBM, Random Forest, and XGBoost in group a, CatBoost and Decision Tree in group b, Logistic Regression in group c, and Naive Bayes in group d
Fig. 4 Box plots of F1-score grouped by feature group/selection technique; here datasets do not include destination port; Tukey's HSD test indicates all features A in group a, 1A in group b, 2A and 3A in group c, and 4A in group d
Reflected in Figs. 2 and 4, and according to the groupings the Tukey's HSD test yields, there is no significant difference in performance in terms of AUC for group a, which consists of the feature selection technique where we use features that 4 out of 7 rankings agree on (feature group 1A) and the case where we use all features (feature group all features A). This is an ideal result since it implies we
obtain similar performance with a smaller dataset. However, in terms of F1-score, we do not obtain the ideal result, but one where performance is similar. We see in Fig. 4 that the F1-scores for classifiers trained on the features in feature group 1A are very close to the F1-scores yielded by classifiers trained with data from feature group 1. In fact, the adjusted p-value for the Tukey's HSD test for the difference in F1-score for feature groups 1 and all features is 0.0100414. We cite this adjusted p-value as another reason to claim that performance in terms of F1-score for classifiers trained with feature group 1A is similar to, or better than, the performance of classifiers trained with all features from CSE-CIC-IDS2018. However, results for the feature selection techniques 2, 3, 4, 2A, 3A, and 4A do not support the same conclusion.
Only CatBoost and LightGBM have built-in support for categorical features. There-
fore, we deem it out of scope to address encoding techniques for the Destination_Port
categorical feature in the 2018 dataset. As a result, we perform separate experiments
to assess the impact of feature selection to further answer research question Q1. We
conduct ANOVA to determine if the classifier and feature selection technique have an
impact on the results for AUC and F1-score. Since p-values for the classifier and feature
selection technique factors are nearly zero for the ANOVA tests, we conduct Tukey’s
HSD tests to check the levels for factors that yield the best performance. Box plots of
results, grouped by factors analyzed in ANOVA and HSD tests, are depicted in Figs. 5, 6,
7 and 8.
In Figs. 5 and 7, we see the performance of LightGBM or CatBoost trained on feature group 1 is similar to the performance of LightGBM or CatBoost trained on all features.
Fig. 5 Box plots of AUC values grouped by classifier; here datasets include destination port; Tukey’s HSD test
indicates each factor (classifier) is in its own group
Fig. 6 Box plots of AUC grouped by feature group/selection technique; here datasets include destination
port; Tukey’s HSD test indicates feature selection techniques 1 and 2 are not significantly different, and other
techniques are in groups of their own
Fig. 7 Box plots of F1-score grouped by classifier; here datasets include destination port; Tukey’s HSD test
indicates each factor (classifier) is in its own group
Fig. 8 Box plots of F1-score grouped by feature group/selection technique; here datasets include destination port; Tukey's HSD test indicates feature selection techniques 2 and 3 are not significantly different, and the other techniques are in groups by themselves
Tukey’s HSD test yields a mean AUC of 0.96640 (a difference of 0.00641). Likewise,
the mean F1-scores for CatBoost and LightGBM are similar for models trained on
feature group 1 and all features. In this case the Tukey’s HSD adjusted mean F1-score
Leevy et al. J Big Data (2021) 8:38 Page 24 of 29
Fig. 9 Box plots of AUC grouped by classifier; Tukey’s HSD test indicates classifiers are significantly different
is 0.94591 for models trained with data from feature group 1, and 0.95594 for models
trained with all features (a difference of 0.0103).
Research Question Q1 Answer: Yes, our ensemble feature selection technique yields performance similar to, or better than, using all features. More specifically, the variant of our technique where 4 out of 7 rankings agree on a feature is the criterion for feature selection that yields performance similar to, or better than, using all features.
Research question Q2: Does including the Destination_Port categorical feature
significantly impact performance of LightGBM and CatBoost in terms of AUC and
F1-score?
To answer research question Q2, we use the results of experiments where the classifier (CatBoost or LightGBM) is one factor, and whether or not the dataset contains the Destination_Port feature is the other factor. We perform ANOVA tests on the results of experiments
grouped by these factors. The p-values associated with classifier and dataset factors for
the ANOVA tests are both zero. Therefore, Tukey’s HSD tests are appropriate. We report
the results of those tests in Figs. 9, 10, 11 and 12.
It is interesting to note that the ranges of values of both AUC and F1-score are smaller
when we use a dataset that includes destination port. So, not only do the ANOVA and
HSD tests confirm that including Destination_Port is a significant factor in the perfor-
mance of models for identifying attacks, but our results here also show greater stability
in the values of results. These results enable us to answer our second research question.
Research question Q2 Answer: Yes, including the Destination_Port feature has a sig-
nificant impact on performance in terms of AUC and F1-score.
Research question Q3: Does the choice of classifier: RF, DT, NB, LR, CatBoost, Light-
GBM, or XGBoost, significantly impact performance in terms of AUC and F1-score?
Fig. 10 Box plots of AUC grouped by whether the dataset contains the Destination Port feature; Tukey’s HSD
test indicates including Destination Port produces significantly different results
Fig. 11 Box plots of performance in terms of F1-score grouped by classifier; Tukey’s HSD test indicates
classifiers are significantly different
To answer the third research question, we note that all ANOVA tests we conduct show that classifier is a significant factor in the experiments; the p-values associated with the classifier factor are 0. So, we perform Tukey's HSD tests to determine how classifiers may be grouped in terms of their performance. The groupings enable a comparison of the classifiers' relative performance.
Fig. 12 Box plots of F1-score grouped by whether the dataset contains the Destination Port feature; Tukey’s
HSD test indicates including Destination Port produces significantly different results
Conclusion
The results in Tables 19 through 24, as well as the results from Tukey’s HSD tests
depicted in Figs. 1 through 11, and the answer to research question Q1 show that the
feature selection technique that produces feature group 1A performs similar to or better
than using all features. These results demonstrate that our ensemble feature selection
technique should be used with classifiers to detect anomalies in CSE-CIC-IDS2018, since
training a model with the reduced feature set consumes fewer computing resources.
We may also draw conclusions from the results of the ANOVA and Tukey’s HSD tests
to answer research questions Q2 and Q3. Test results for research question Q2 indicate
that Destination_Port is a useful feature for classifiers. Hence, we conclude one should
encode it for use with a classifier, if the classifier does not handle categorical features
automatically. Test results for research question Q3 reveal that LightGBM performs
similar to, or better than, any other classifier on CSE-CIC-IDS2018, even when we do not
use Destination_Port as a feature for LightGBM.
Since our current study is limited to comparing CatBoost and LightGBM when we
include Destination_Port as a categorical feature, we have an opportunity for future
research to investigate whether another classifier might yield better performance in
conjunction with a technique for encoding Destination_Port. There is also an opportu-
nity to evaluate classifier performance with other network intrusion detection datasets.
Another subject we have not broached here that deserves attention deals with tech-
niques for addressing class imbalance, such as Random Undersampling (RUS) [61].
Abbreviations
ANOVA: ANalysis Of VAriance; AP: Affinity propagation; API: Application programming interface; AUC: Area Under the Receiver Operating Characteristic (ROC) Curve; CNN: Convolutional Neural Network; CS: Chi-Squared; CV: Cross-validation; DNS: Domain Name System; DT: Decision Tree; FAU: Florida Atlantic University; FN: False Negative; FP: False Positive; GBDT: Gradient-boosted decision tree; GPU: Graphics Processing Unit; GR: Gain Ratio; HSD: Honestly Significant Difference; IG: Information Gain; k-NN: k-Nearest Neighbor; LR: Logistic Regression; ML: Machine Learning; MLP: Multilayer Perceptron; NB: Naive Bayes; NSF: National Science Foundation; ODT: Oblivious Decision Tree; RF: Random Forest; ROC: Receiver Operating Characteristic; RUS: Random Undersampling; SVM: Support Vector Machine; TN: True Negative; TNR: True Negative Rate; TP: True Positive; TPR: True Positive Rate; TS: Target Statistic.
Acknowledgements
We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.
Additionally, we acknowledge partial support by the National Science Foundation (NSF) (CNS-1427536). Opinions, find-
ings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.
Authors’ contributions
JLL prepared the manuscript and the primary literary review for this work. RZ performed the data cleaning. JH performed
the statistical analyses. All authors provided feedback to TMK and helped shape the research. TMK introduced this topic
to JLL, and helped to complete and finalize this work. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
References
1. Sharafaldin I, Lashkari AH, Ghorbani AA. Toward generating a new intrusion detection dataset and intrusion traffic
characterization. In: ICISSP; 2018. p. 108–116.
2. Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark
datasets for intrusion detection. Comput Secur. 2012;31(3):357–74.
3. Thakkar A, Lohiya R. A review of the advancement in intrusion detection datasets. Proc Comput Sci.
2020;167:636–45.
4. Wald R, Khoshgoftaar TM, Zuech R, Napolitano A. Network traffic prediction models for near-and long-term predic-
tions. In: 2014 IEEE International Conference on Bioinformatics and Bioengineering. IEEE; 2014. p. 362–68
5. Najafabadi MM, Khoshgoftaar TM, Kemp C, Seliya N, Zuech R. Machine learning for detecting brute force attacks
at the network level. In: 2014 IEEE International Conference on Bioinformatics and Bioengineering. IEEE; 2014. p.
379–85.
6. Bekkar M, Djemaa HK, Alitouche TA. Evaluation measures for models assessment over imbalanced data sets. J Inf
Eng Appl. 2013;3(10).
7. Wald R, Villanustre F, Khoshgoftaar TM, Zuech R, Robinson J, Muharemagic E. Using feature selection and classifica-
tion to build effective and efficient firewalls. In: Proceedings of the 2014 IEEE 15th International Conference on
Information Reuse and Integration (IEEE IRI 2014). IEEE; 2014. p. 850–54.
8. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data.
2018;5(1):42.
9. Leevy JL, Khoshgoftaar TM. A survey and analysis of intrusion detection models based on cse-cic-ids2018 big data. J
Big Data. 2020;7(1):1–19.
10. Wang H, Khoshgoftaar TM, Napolitano A. A comparative study of ensemble feature selection techniques for soft-
ware defect prediction. In: 2010 Ninth International Conference on Machine Learning and Applications. IEEE; 2010.
p. 135–40.
11. Leevy JL, Hancock J, Zuech R, Khoshgoftaar TM. Detecting cybersecurity attacks using different network features
with lightgbm and xgboost learners. In: 2020 IEEE Second International Conference on Cognitive Machine Intel-
ligence (CogMI). IEEE; 2020. p. 184–91.
12. Najafabadi MM, Khoshgoftaar TM, Seliya N. Evaluating feature selection methods for network intrusion detection
with Kyoto data. Int J Reliab Qual Saf Eng. 2016;23(01):1650001.
13. Lee J-S. AUC4.5: AUC-based C4.5 decision tree algorithm for imbalanced data classification. IEEE Access. 2019;7:106034–42.
14. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
15. Saritas MM, Yasar A. Performance analysis of Ann and Naive Bayes classification algorithm for data classification. Int J
Intell Syst Appl Eng. 2019;7(2):88–91.
16. Rymarczyk T, Kozłowski E, Kłosowski G, Niderla K. Logistic regression for machine learning in process tomography.
Sensors. 2019;19(15):3400.
17. Hancock J, Khoshgoftaar TM. Medicare fraud detection using catboost. In: 2020 IEEE 21st International Conference
on Information Reuse and Integration for Data Science (IRI). IEEE Computer Society; 2020. p. 97–103.
18. Hancock JT, Khoshgoftaar TM. Catboost for big data: an interdisciplinary review. J Big Data. 2020;7(1):1–45.
19. Hancock J, Khoshgoftaar TM. Performance of catboost and xgboost in medicare fraud detection. In: 19th IEEE Inter-
national Conference On Machine Learning And Applications (ICMLA). IEEE; 2020.
20. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. Mining data with rare events: a case study. In: 19th IEEE
International Conference on Tools with Artificial Intelligence (ICTAI 2007), vol 2. IEEE; 2007. p. 132–139.
21. Hua Y. An efficient traffic classification scheme using embedded feature selection and lightgbm. In: 2020 Informa-
tion Communication Technologies Conference (ICTC). IEEE; 2020. p. 125–30.
22. Yap BW, Abd Rani K, Abd Rahman HA, Fong S, Khairudin Z, Abdullah NN. An application of oversampling, undersam-
pling, bagging and boosting in handling imbalanced datasets. In: Proceedings of the First International Conference
on Advanced Data and Information Engineering (DaEng-2013). Springer; 2014. p. 13–22.
23. Ahmad I, Basheri M, Iqbal MJ, Rahim A. Performance comparison of support vector machine, random forest, and
extreme learning machine for intrusion detection. IEEE Access. 2018;6:33789–95.
24. Baig MM, Awais MM, El-Alfy E-SM. Adaboost-based artificial neural network learning. Neurocomputing.
2017;248:120–6.
25. Rynkiewicz J. Asymptotic statistics for multilayer perceptron with Relu hidden units. Neurocomputing.
2019;342:16–23.
26. Zhao Y, Li H, Wan S, Sekuboyina A, Hu X, Tetteh G, Piraud M, Menze B. Knowledge-aided convolutional neural net-
work for small organ segmentation. IEEE J Biomed Health Inform. 2019;23(4):1363–73.
27. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V,
Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. J
Mach Learn Res. 2011;12:2825–30.
28. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S,
Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore
S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas
F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: Large-Scale Machine Learning on Hetero-
geneous Systems. Software available from tensorflow.org, 2015. https://www.tensorflow.org/
29. Huancayo Ramos KS, Sotelo Monge MA, Maestre Vidal J. Benchmark-based reference model for evaluating botnet
detection tools driven by traffic-flow analytics. Sensors. 2020;20(16):4501.
30. Alenazi A, Traore I, Ganame K, Woungang I. Holistic model for http botnet detection based on dns traffic analysis. In:
International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments.
Springer; 2017. p. 1–18.
31. Vajda S, Santosh K. A fast k-nearest neighbor classifier using unsupervised clustering. In: International Conference on
Recent Trends in Image Processing and Pattern Recognition. Springer; 2016. p. 185–193.
32. Gupta V, Bhavsar A. Random forest-based feature importance for hep-2 cell image classification. In: Annual Confer-
ence on Medical Image Understanding and Analysis. Springer; 2017. p. 922–934.
33. Yuanyuan S, Yongming W, Lili G, Zhongsong M, Shan J. The comparison of optimizing svm by ga and grid search. In:
2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI). IEEE; 2017. p. 354–60.
34. Li X, Chen W, Zhang Q, Wu L. Building auto-encoder intrusion detection system based on random forest feature
selection. Comput Secur. 2020;95:101851.
35. Chen J, Xie B, Zhang H, Zhai J. Deep autoencoders in pattern recognition: a survey. Bio-inspired Computing Models
And Algorithms. 2019;229.
36. Wei Z, Wang Y, He S, Bao J. A novel intelligent method for bearing fault diagnosis based on affinity propagation
clustering and adaptive feature selection. Knowl Based Syst. 2017;116:1–12.
37. Mirsky Y, Doitshman T, Elovici Y, Shabtai A. Kitsune: an ensemble of autoencoders for online network intrusion detec-
tion 2018. arXiv preprint arXiv:1802.09089
38. Fitni QRS, Ramli K. Implementation of ensemble learning and feature selection for performance improvements in
anomaly-based intrusion detection systems. In: 2020 IEEE International Conference on Industry 4.0, Artificial Intel-
ligence, and Communications Technology (IAICT). IEEE; 2020. p. 118–24.
39. Fadlil A, Riadi I, Aji S. Ddos attacks classification using numeric attribute based Gaussian Naive Bayes. Int J Adv Com-
put Sci Appl (IJACSA). 2017;8(8):42–50.
40. Elkhalil K, Kammoun A, Couillet R, Al-Naffouri TY, Alouini M-S. Asymptotic performance of regularized quadratic
discriminant analysis based classifiers. In: 2017 IEEE 27th International Workshop on Machine Learning for Signal
Processing (MLSP). IEEE; 2017. p. 1–6.
41. Abd Elrahman SM, Abraham A. A review of class imbalance problem. J Netw Innov Compu. 2013;1(2013):332–40.
42. Zhang W-Y, Wei Z-W, Wang B-H, Han X-P. Measuring mixing patterns in complex networks by spearman rank correla-
tion coefficient. Phys A Statist Mech Appl. 2016;451:440–50.
43. Shi D, DiStefano C, McDaniel HL, Jiang Z. Examining chi-square test statistics under conditions of large model size
and ordinal data. Struct Equ Model. 2018;25(6):924–45.
44. D’hooge L, Wauters T, Volckaert B, De Turck FF. Inter-dataset generalization strength of supervised machine learning
methods for intrusion detection. J Inf Secur Appl. 2020;54:102564.
45. Taşer PY, Birant KU, Birant D. Comparison of ensemble-based multiple instance learning approaches. In: 2019 IEEE
International Symposium on INnovations in Intelligent SysTems and Applications (INISTA). IEEE; 2019. p. 1–5.
46. Wang R, Zeng S, Wang X, Ni J. Machine learning for hierarchical prediction of elastic properties in fe-cr-al system.
Comput Mater Sci. 2019;166:119–23.
47. Saikia T, Brox T, Schmid C. Optimized generic feature learning for few-shot classification across domains; 2020. arXiv
preprint arXiv:2001.07926
48. Sulaiman S, Wahid RA, Ariffin AH, Zulkifli CZ. Question classification based on cognitive levels using linear svc. Test
Eng Manag. 2020;83:6463–70.
49. Rahman MA, Hossain MA, Kabir MR, Sani MH, Awal MA, et al. Optimization of sleep stage classification using single-channel EEG signals. In: 2019 4th International Conference on Electrical Information and Communication Technology (EICT). IEEE; 2019. p. 1–6.
50. Zuech R, Khoshgoftaar TM. A survey on feature selection for intrusion detection. In: Proceedings of the 21st ISSAT
International Conference on Reliability and Quality in Design; 2015. p. 150–155.
51. Witten IH, Frank E, Hall MA. Data mining: practical machine learning tools and techniques. 3rd ed. San Francisco:
Morgan Kaufmann Publishers Inc.; 2011.
52. Agresti A. Categorical data analysis. Wiley Series in Probability and Mathematical Statistics. Applied probability and
statistics, applied probability and statistics. Hoboken: Wiley; 1990. p. 42–3.
53. Singh R, Kumar H, Singla R. Analysis of feature selection techniques for network traffic dataset. In: 2013 International
Conference on Machine Intelligence and Research Advancement. IEEE; 2013. p. 42–46.
54. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;1189–1232.
55. Gupta A, Nagarajan V, Ravi R. Approximation algorithms for optimal decision trees and adaptive tsp problems. Math
Oper Res. 2017;42(3):876–96.
56. McKinney W. Data structures for statistical computing in Python. In: van der Walt S, Millman J, editors. Proceedings of the 9th Python in Science Conference; 2010. p. 56–61. https://doi.org/10.25080/Majora-92bf1922-00a.
57. Rish I, et al. An empirical study of the naive bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artifi-
cial Intelligence, vol 3; 2001. p. 41–46.
58. Seliya N, Khoshgoftaar TM, Van Hulse J. A study on the relationships of classifier performance metrics. In: Tools with
Artificial Intelligence, 2009. ICTAI’09. 21st International Conference On. IEEE; 2009. p. 59–66.
59. Iversen GR, Wildt AR, Norpoth H, Norpoth HP. Analysis of variance. Thousand Oak: Sage; 1987.
60. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5:99–114.
61. Liu B, Tsoumakas G. Dealing with class imbalance in classifier chains via random undersampling. Knowl Based Syst.
2020;192:105292.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.