Machine Learning for IDS Enhancement
MURAD MOHAMMED
September 2024
KOMBOLCHA, ETHIOPIA
WOLLO UNIVERSITY
KOMBOLCHA INSTITUTE OF TECHNOLOGY, COLLEGE OF INFORMATICS
Contents
1 Introduction
1.1 Background
1.2 Research Motivation
1.3 Problem Statement
1.4 Objectives of the Study
1.4.1 General Objective
1.4.2 Specific Objectives
1.5 Scope and Limitation of the Study
1.6 Significance of the Study
1.7 Organization of the Thesis
2 Literature Review
2.1 Intrusion Detection Systems - An Overview
2.1.1 Host-based IDS
2.1.2 Network-based IDS
2.2 Signature-based detection
2.3 Anomaly-based detection
2.4 Hybrid Detection
2.5 Components of IDS
2.6 Types of Attack
2.7 Machine learning algorithms
2.7.1 Supervised algorithms
2.7.2 Unsupervised algorithms
2.7.3 Hybrid algorithms
2.8 Classification Algorithms
2.8.1 Linear regression
2.8.2 Logistic regression
2.8.3 Decision trees
2.8.4 Random forests
2.8.5 Neural Networks
2.9 Overfitting and underfitting
2.10 Algorithm Evaluation Metrics
2.11 Confusion matrix
2.12 Accuracy
2.13 Precision
2.14 Recall (Sensitivity)
2.15 Receiver Operating Characteristic (ROC) Curve and Area under the Curve (AUC)
2.16 Related work
2.17 Anomaly- and Signature-Based IDS
2.18 IDS Based on Machine Learning
2.19 Feature Selection Methods in Network Traffic Classification
2.20 Summary
3 Methodology
3.1 System Architecture
3.2 Data Collection and Preparation
3.3 Exploratory Data Analysis
3.4 Preprocessing of the dataset
3.5 Importing and cleaning
3.6 Encoding Categorical Columns
3.7 Correlation
3.8 Observations on Correlations in the Dataset
3.9 Splitting the data into training and testing sets
3.10 Data Scaling using Standard Scaler
3.11 Hyperparameter tuning
3.12 Grid Search
3.13 Model comparison / Model Performance Evaluation
3.14 Summary
Appendices
List of Tables
1 Confusion matrix classification
2 Number of instances in the NSL-KDD training and testing dataset
3 Various programming languages and open-source tools
4 Comparative Performance Table
5 Random Forest vs. Decision Tree
6 Random Forest vs. Logistic Regression
7 Random Forest vs. Neural Network
8 Comparison of AUC-ROC for Different Models
9 Model Performance Analysis and Comparative Study
Abstract
The rise of information technology has led to an increase in hacking and unauthorized activity, making network traffic classification crucial. Current Intrusion Detection Systems (IDS) often struggle with low detection rates, long training times, and high false alarm rates. We propose a system that combines anomaly detection, big data processing, and machine learning to overcome these problems and provide faster, more accurate results. Our system uses four classification models: Random Forest, Neural Network, Logistic Regression, and Decision Tree, with grid search and 5-fold cross-validation for hyperparameter tuning. For evaluation, the NSL-KDD dataset was split into 70% for training and 30% for validation and testing. The findings indicate that the Random Forest model outperformed the others, with 99.90% accuracy, 99.88% precision, 99.93% recall, a 99.90% F1-score, and 99.89% AUC-ROC. While the Neural Network and Decision Tree models also fared well, Logistic Regression was marginally less successful. These results show how effectively our combined strategy improves IDS performance.
Keywords: Intrusion Detection System, Classification, Machine Learning, Random Forest model, Anomaly Detection, Scikit-learn
Acknowledgement
First and foremost, I express my deepest gratitude to Almighty Allah for His compassion
and mercy, which enabled me to complete this work. I would also like to extend my heartfelt
thanks to my thesis advisor, Dr. Almu.J, who has been a remarkable role model and an
excellent teacher throughout my academic journey. Finally, I wish to convey my profound
appreciation to my family and friends for their unwavering encouragement and support,
which were instrumental in the accomplishment of this work.
Declaration
I, Murad Mohammed, hereby declare that this thesis, entitled "Enhancing Network Intrusion Detection Systems through Machine Learning Techniques: Anomaly-Based Approaches,"
submitted to the Kombolcha Institute of Technology, College of Informatics, Department of
Information Technology, in partial fulfillment of the requirements for the Master of Science
degree in Computer Networks and Communications, is my original work. It has not been
presented for any degree or other award at any university or institution, and I have acknowl-
edged all sources used.
Murad Mohammed
Signature
List of Abbreviations
AI Artificial Intelligence
DT Decision Tree
FN False Negative
FP False Positive
LR Logistic Regression
ML Machine Learning
NB Naïve Bayes
NN Neural Network
NSL-KDD Network Security Laboratory - Knowledge Discovery in Databases
RF Random Forest
1 Introduction
1.1 Background
The quick development and progress of internet-based technology has had a profound effect
on many facets of society, including how people communicate, work, live, and access infor-
mation [1]. These developments have significantly raised the bar for science and technology
and brought about significant changes to the social mode of production and the way of life
for the average person. In addition, the internet has had an impact on society in a variety
of industries, including management, science, the arts, business, and health. The broad
access to internet-based technology has also brought about a transformation in education,
with online learning platforms and resources becoming widely available. This has allowed
individuals to access education and gain new skills from the comfort of their own homes,
breaking down barriers of traditional education such as location and cost. The internet
has also revolutionized communication, with social networking platforms becoming a pri-
mary means of connecting and establishing verbal communication. Overall, the impact of
internet-based technology on society has been vast and transformative, reshaping the way
people interact, learn, work, and access information. As technology continues to advance
and become increasingly integrated into our daily lives, the need for effective cyber security
measures becomes imperative [2]. Without proper cyber security measures in place, individuals and organizations are vulnerable to various threats such as unauthorized access, malware
injection, data exfiltration, and network crashes [3]. This is particularly concerning given
the alarming trends of increased breaches in terms of both number and size over the last
decade [4]. These breaches have proven to be costly, with the global average cost of a data
breach reaching $3.9 million [5]. To address the challenges of cyber security, it is important
to bridge the gaps that exist in our understanding and readiness to combat cybercrimes.
Today’s interconnected world brings both benefits and risks, and with increased connectiv-
ity comes the potential for theft, fraud, and abuse. Law enforcement plays a crucial role in
safeguarding cyberspace by investigating cybercrimes and apprehending those responsible.
In the digital age, reliance on computer networks has become ubiquitous, enabling seamless communication, data exchange, and the delivery of essential services [6]. However, this increasing dependence has also made these networks vulnerable to various forms of cyber-attacks,
which can disrupt normal operations and compromise security. Intrusion detection systems,
or IDS, are becoming more and more crucial for safeguarding digital environments as a re-
sult of these challenges and the continuous evolution of the threat landscape. Monitoring
and assessing system activity for potential security breaches is the process of intrusion de-
tection. Numerous factors, such as malware, illegal access attempts, and even authorized
users abusing their privileges, can lead to these incidents. In order to tackle the increasing
difficulties in this domain, a great deal of work has been done lately to suggest effective in-
trusion detection systems. The capacity of these systems to promptly and accurately detect
and react to possible threats is one of their most important features. Intrusion Detection
Systems are critical components of a comprehensive cyber security strategy [7]. They can be classified in several ways to enhance their effectiveness and adaptability. One way to
classify Intrusion Detection Systems is based on their detection technique. Some examples
of detection techniques include:
4. Hybrid detection: This method boosts the efficacy and accuracy of intrusion detection by combining several detection techniques, such as anomaly-based and signature-based methods.
1.2 Research Motivation
Network security is becoming more and more important for businesses due to the exponential
growth of Internet traffic, which also highlights the shortcomings of the security mechanisms
in place. Effective solutions, like intrusion detection systems (IDS), are therefore required
for threat detection and network monitoring. Nevertheless, poor detection accuracy and
performance problems plague many of the IDSs in use today. Our research suggests using
a fast data processing framework to improve the efficiency of IDS in order to address these issues. This approach aims to improve real-time threat detection and response, offering a comprehensive solution that combines feature selection and optimal parameter selection to boost both accuracy and performance in network security.
An increasing number of systems are becoming open to intrusions by outsiders due to the
growing range of financial opportunities and the quick rise in Internet traffic [8]. Strong
network security measures must be put in place to protect these systems from potential
exploitation and unauthorized access. Computer network security is largely dependent on
intrusion detection [9]. Providing services over a network while guaranteeing network se-
curity has become a major concern due to the increase in attacks and the growing reliance
on industries like engineering, commerce, and medicine [10]. Through the monitoring and
analysis of events occurring within a computer system or network, intrusion detection sys-
tems play a crucial role in network security [11]. IDSs can identify and notify administrators
of potential threats and attacks using a variety of techniques, including anomaly detection,
signature-based detection, and log file analysis. As a result, administrators are able to
swiftly and appropriately stop additional network compromise. Moreover, intrusion detec-
tion systems (IDSs) aid in the mitigation of diverse forms of attacks, such as ransomware and
malware attacks, port scanning, denial of service attacks, and distributed denial of service
attacks. IDSs are not infallible, though, and how well they work relies on the algorithms
they use, how much monitoring and analysis they do, and how quickly and accurately they
can issue alerts [12] [13]. As the number and severity of attacks continue to increase, organizations must ensure that their IDSs are properly configured, regularly updated with
the latest threat intelligence, and integrated into a comprehensive network security strategy.
Furthermore, IDSs should be complemented with other security measures, such as firewalls
and intrusion prevention systems, to provide multiple layers of defense against potential
threats [14]. Moreover, the importance of continuous research and development in the field
of intrusion detection cannot be overlooked. In today’s highly connected and digitalized
world, the security of computer networks and systems is of utmost importance [15] [16]. Organizations are investing significant resources to protect their data from attackers, who are
constantly becoming more sophisticated. Traditional intrusion detection systems have been
used for many years to monitor network traffic and identify potential threats in real-time.
However, these systems have certain limitations, including low detection rates, long training
times, slow processing speeds, and high false alarm rates [17] [18]. As a result, there is a
growing need to improve existing intrusion detection systems or develop new ones that can
effectively detect and respond to evolving security threats. Machine learning-based intrusion
detection systems have emerged as a promising approach to address these limitations [19].
These systems utilize machine learning algorithms to analyze network traffic patterns and
detect any abnormal or malicious activities. By leveraging the power of machine learning,
these IDSs can continuously learn from new data and adapt to emerging threats, improving
their detection capabilities over time. Moreover, machine learning-based IDSs have the po-
tential to reduce false alarms by accurately distinguishing between normal network behavior
and suspicious activities. This approach has gained traction in recent years as organizations
recognize the need for more advanced and intelligent systems to protect their networks.
1.4 Objectives of the Study

1.4.1 General Objective

The general objective of this research is to enhance network intrusion detection systems through the application of machine learning techniques, utilizing an anomaly-based approach.
1.4.2 Specific Objectives
• Prepare the NSL-KDD intrusion dataset that will be utilized for training and testing the model.
• Apply preprocessing techniques to the dataset to ensure its quality and suitability for model training.
• Explore various machine learning classification algorithms suitable for network traffic classification.
• Compare the performance of the trained models using appropriate evaluation metrics.
1.5 Scope and Limitation of the Study

In this thesis, we design intrusion detection systems by applying machine learning techniques
with an anomaly-based approach. The primary focus is on identifying potential incidents,
logging relevant information, and reporting these incidents to security administrators. The
system aims to enhance the detection rate and is capable of identifying both known and
unknown attacks, making it suitable for any organization’s network.
However, this work does not include an intrusion prevention system capable of taking
action against incoming intrusions; it only provides alerts to the administrator. Furthermore,
the research is limited to using a single dataset containing network traffic data to build models
that classify input data as either normal or attack. Additionally, only one feature selection
method is employed to select relevant features for model building and testing.
1.6 Significance of the Study

Research and Development: The research work contributes to the field of anomaly-based
intrusion detection and machine learning by proposing novel methodologies and techniques.
The study explores the integration of machine learning algorithms, Apache Spark, and fea-
ture selection methods in the context of IDS, offering insights into their effectiveness and
applicability.
Overall, the significance of the study lies in its potential to enhance network security, detect
unknown attacks, facilitate practical implementation, improve performance, and contribute
to the advancement of anomaly-based intrusion detection systems and machine learning
techniques. By addressing the limitations of existing IDS systems, the research work aims to
provide valuable insights and solutions to improve the resilience and effectiveness of network
defenses in the face of evolving cyber threats.
1.7 Organization of the Thesis

This section outlines the structure of the remaining chapters of the thesis. Chapter 2 reviews
concepts and methods relevant to the proposed research, discussing various approaches to
Intrusion Detection Systems (IDS) and the methodologies used in developing the proposed
system. It also presents related research works, including the methods used, objectives,
procedures, and key findings of these studies. Chapter 3 provides an overview of the sys-
tem architecture and a detailed description of each component, along with the performance
evaluation metrics used to assess the proposed system. Chapter 4 details the programming
tools and software used to develop the model, the dataset utilized, implementation specifics,
and the experimental results. Finally, Chapter 5 presents the conclusion, summarizes the
contributions of the study, and suggests potential areas for future research.
2 Literature Review
2.1 Intrusion Detection Systems - An Overview

IDS are an essential component of network security, aiming to detect and prevent unauthorized access or malicious activities within a computer network infrastructure [20]. They
monitor network traffic, analyze patterns, and compare them with known attack signatures
or abnormal behavior. Furthermore, intrusion detection systems can employ various tech-
niques such as anomaly detection, signature-based detection, and behavior based detection
to identify potential threats. By detecting and alerting network administrators to potential
security breaches, IDS play a crucial role in maintaining the integrity and confidentiality of
sensitive information [21]. In addition to monitoring network traffic and analyzing patterns,
intrusion detection systems can also be classified into two main types: NID systems and HID
systems. Network-based intrusion detection systems focus on the traffic moving in and out
the network and are positioned at specific points within the network to monitor all traffic.
They analyze packets and can detect suspicious patterns or known attack signatures. Con-
versely, host-based intrusion detection systems focus on the activity within a specific device
or host. They monitor the operating system logs, file systems, and important system files for
any unusual behavior that could indicate a security breach. The following subsections examine the types and detection methods of IDS. Based on their placement, IDS fall into two types: host-based IDS and network-based IDS.
2.1.1 Host-based IDS

A Host-Based Intrusion Detection System is a type of security system that operates within
a computer, node, or device. Its primary purpose is to monitor and detect any unauthorized
or suspicious activity that may indicate a system compromise [22] [1]. This can include monitoring network communication, inspecting system resources, and detecting modifications to
the system’s registry. A HIDS plays a crucial role in ensuring the security and integrity
of a system by constantly analyzing the events and actions occurring within the host. By
monitoring the full communication stream and inspecting system resources, a HIDS can effectively alert administrators to any rogue programs or harmful modifications to the system's registry. NIST Special Publication 800-94, "Guide to Intrusion Detection and Prevention Systems," offers helpful, applicable advice for each of the four types of IDPS: network-based, wireless, network behavior analysis, and host-based. For example, a HIDS can determine whether an attacker has successfully exploited a system vulnerability and notify security administrators of the situation. A HIDS is essential in identifying possible incidents and providing early detection of intrusions. Key characteristics of a HIDS include:
• Examines the entire operating system and can inspect the full communication stream.
• Capable of detecting insider attacks that do not involve network traffic and can monitor
end-to-end encrypted communications.
• Detects intrusions by monitoring system calls, system directories, application logs, and
user activities.
• Can be disabled during certain types of denial of service attacks, leading to potential
functionality loss.
• Consumes host resources as OS audit logs can occupy a large amount of space.
2.1.2 Network-based IDS

A Network-based IDS, in contrast:

• Checks a broad range of network protocols such as TCP, UDP, ICMP, SNMP, and router NetFlow records.
• Faces challenges in recognizing attacks from high-speed encrypted traffic when network
volume is overwhelming.
• Some networks cannot provide comprehensive data analysis due to limited monitoring
port capabilities on switches.
NIDS and HIDS detection can be performed using three main methods: signature-based detection, anomaly-based detection, and hybrid detection [24]. Signature-based detection involves comparing network traffic or system activity against a database of known attack signatures. Anomaly-based detection monitors for deviations from normal patterns of network or system behavior, flagging any unusual activity that may indicate an attack. Hybrid detection combines both approaches.
2.2 Signature-based detection

Signature-based detection is a widely used approach in cyber security that focuses on identifying and analyzing known patterns or signatures of malicious activity or attacks. This
approach relies on a database of known signatures, which are essentially unique character-
istics or patterns associated with specific types of threats or attacks [25]. By comparing
incoming data or network traffic to the signatures in the database, the system can detect
and block any matches, indicating a potential security threat. Furthermore, the majority of systems developed using these methods have significant rates of false positive and false negative detection, as well as an inability to continuously adjust to evolving hostile behaviors.
One disadvantage of signature-based detection is its reliance on a static knowledge base.
This means that signature-based detection may struggle to detect new or unknown attacks
that do not match any existing signatures. Additionally, signature-based detection can be
limited in its ability to detect slight variations or variants of known attacks, as it relies
solely on matching exact signatures. As a result, the accuracy and effectiveness of signature-
based detection heavily depend on the thoroughness and currency of the signature database.
In short, signature-based detection can only identify attacks that have already been documented in its database.
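A minimal sketch of the signature-matching idea described above; the signature patterns and payloads are invented for illustration and are not drawn from any real rule set.

```python
# Minimal illustration of signature-based matching: flag a payload if it
# contains any known attack signature (byte patterns here are invented).
SIGNATURES = {
    "sql_injection": b"' OR 1=1",
    "path_traversal": b"../../etc/passwd",
}

def match_signatures(payload: bytes) -> list[str]:
    """Return the names of all signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

print(match_signatures(b"GET /index.php?id=' OR 1=1--"))  # ['sql_injection']
print(match_signatures(b"GET /home"))                      # []
```

Because matching is exact, a payload that deviates even slightly from a stored pattern goes undetected, which is precisely the static-knowledge-base limitation discussed above.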
2.3 Anomaly-based detection
Figure 1: IDS classification
2.5 Components of IDS

Intrusion detection systems have become essential tools for monitoring and securing networks [30]. They consist of several components, including data sources, sensors, analyzers, and response systems.
• Data sources provide the information that IDS rely on to detect intrusions.
• Sensors are responsible for collecting network traffic and other data from the data
sources.
• Analyzers analyze the collected data, looking for patterns and anomalies that may
indicate potential intrusions.
• Response systems play a crucial role in IDS, as they take action when an intrusion is detected, such as generating alerts, blocking suspicious traffic, or initiating investigative actions [31].

These components work together to form a comprehensive IDS solution that can effectively detect and respond to security threats in real time. In addition to these components, IDS may also include features such as logging and reporting capabilities, which allow for the recording of security events and the generation of detailed reports for analysis and compliance purposes.
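The sensor, analyzer, and response chain described above can be sketched as a toy pipeline; the event fields and the volume threshold below are illustrative assumptions, not part of any real IDS.

```python
# Toy sketch of the IDS component chain: sensor -> analyzer -> response.
# Event fields and the byte threshold are illustrative only.
from dataclasses import dataclass

@dataclass
class Event:
    source_ip: str
    bytes_sent: int

def sensor(raw_records):
    """Collect raw records from a data source and emit events."""
    return [Event(ip, n) for ip, n in raw_records]

def analyzer(events, threshold=10_000):
    """Flag events whose traffic volume exceeds the expected range."""
    return [e for e in events if e.bytes_sent > threshold]

def response(alerts):
    """Generate an alert message for each suspicious event."""
    return [f"ALERT: {a.source_ip} sent {a.bytes_sent} bytes" for a in alerts]

events = sensor([("10.0.0.5", 512), ("10.0.0.9", 50_000)])
print(response(analyzer(events)))  # ['ALERT: 10.0.0.9 sent 50000 bytes']
```

A real deployment would replace the threshold test with the detection methods discussed earlier and add the logging and reporting features mentioned above.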
Figure 2: Intrusion Detection System architecture
2.6 Types of Attack

• Malware attacks: These involve the use of malicious software, such as viruses, worms, Trojans, and ransomware, to gain unauthorized access to systems and networks, steal sensitive information, or disrupt operations.
• Denial of Service attacks: These attacks aim to overwhelm a target system or network with a flood of traffic, rendering it inaccessible to legitimate users.
2.7 Machine learning algorithms

Machine learning algorithms applied to intrusion detection fall into three broad categories:

• Supervised algorithms
• Unsupervised algorithms
• Hybrid algorithms
2.7.1 Supervised algorithms

Supervised algorithms require fully labeled class data. For network intrusion detection using
supervised machine learning algorithms, the dataset is divided into two parts: training data
and testing data. The primary objective is to train the algorithms with labeled data to
create a model. This model is then used to predict the ’unknown’ in the test data. In
their survey of IDS [25], researchers explain that supervised algorithms identify the relevant
features and classes. The inputs are records from network or host sources combined with an
output value specified as a label, typically categorized as either attack (intrusion) or normal
data. Feature selection is applied to eliminate unnecessary features. Once the training data
has learned the relationship between the input data and labels through classification or
regression, the algorithms are tested. In the testing stage, the algorithms predict anomalies
and the relevant class from new incoming data. Rosa et al. [33] additionally note that supervised ML models are usually tuned for a specific situation, either for a particular
process or a single communication protocol. There is a wide variety of supervised machine
learning algorithms; the most commonly used classification algorithms include:

• Naïve Bayes
• Decision Tree (DT)
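The supervised workflow described in this subsection (labeled records, feature selection, training, then prediction on unseen data) can be sketched as follows. The dataset is synthetic, and the choice of selector and classifier is just one possible instantiation, not a prescribed configuration.

```python
# Sketch of the supervised flow: labeled records -> feature selection ->
# train -> predict on held-out "unknown" data. The data is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 20))            # 20 candidate features
y = (X[:, 3] - X[:, 7] > 0).astype(int)   # the label depends on only 2 of them

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Keep the k most relevant features, discarding the unnecessary ones
selector = SelectKBest(f_classif, k=2).fit(X_tr, y_tr)
clf = DecisionTreeClassifier(random_state=0).fit(selector.transform(X_tr), y_tr)

print("selected features:", selector.get_support(indices=True))
print("test accuracy:", clf.score(selector.transform(X_te), y_te))
```

Here feature selection eliminates the 18 uninformative columns before training, mirroring the role described in the survey cited above.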
2.7.2 Unsupervised algorithms

Unsupervised learning employs clustering algorithms to create models from unlabeled data,
enabling the identification of malicious inputs from network traffic or host logs. These
methods analyze data characteristics randomly, without any prior knowledge, based on their
statistical properties. Nisioti et al. [34] highlight that unsupervised methods do not require the time-consuming training stage associated with supervised learning. A key feature of
these algorithms is feature selection. Clustering techniques aim to divide the input data into
clusters that maximize relevance and minimize redundancy by examining the data structure
and defining a threshold. According to Nisioti et al., the number of regular instances in
a dataset typically surpasses the number of anomalies. Once clusters are formed based on
their relationships, each is assigned a unique score. Clusters exceeding the threshold score
are deemed malicious, resulting in regular traffic forming large clusters and attacks forming
smaller ones. This approach is more efficient than the training stage of supervised learning
and allows for the detection of unknown, zero-day attacks without the need for labeled data.
However, a drawback of unsupervised methods is their high false-positive rate. Zaman et al. [35] mention commonly used algorithms in this category, including K-means clustering, Fuzzy clustering, and Hierarchical clustering.
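The clustering-based scoring described above (large clusters treated as regular traffic, small clusters as attacks) can be sketched with K-means. The data, the number of clusters, and the 5% size cutoff are all illustrative assumptions.

```python
# Sketch of clustering-based anomaly scoring: cluster unlabeled traffic,
# then treat points in small clusters as suspicious. The data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))   # bulk of the traffic
attacks = rng.normal(loc=8.0, scale=0.5, size=(10, 4))   # rare outliers
X = np.vstack([normal, attacks])

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_, minlength=5)

# Clusters holding under 5% of the data are deemed malicious (arbitrary cutoff)
suspicious = set(np.where(sizes < 0.05 * len(X))[0])
flagged = np.isin(km.labels_, list(suspicious))
print("flagged points:", int(flagged.sum()))
```

No labels are needed, so previously unseen attack patterns can be caught, but as noted above an unusual-but-benign burst of traffic would be flagged the same way, which is the source of the high false-positive rate.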
2.7.3 Hybrid algorithms

Erbad et al. [36] note that current intrusion detection systems face challenges such as real-time detection, low accuracy, and inconsistent detection rates. To address these issues, hybrid
machine learning algorithms have been developed. The primary goals of these hybrid algo-
rithms are to reduce false-negative and false-positive alarms. By combining supervised and
unsupervised machine learning approaches, these methods aim to enhance the performance
of intrusion detection systems. Buczak et al. [37] describe how hybrid methods leverage
the accuracy of supervised algorithms for known attacks and the ability of unsupervised
algorithms to detect unknown attacks. Given the evolving nature of attacks, it is crucial
to consider solutions beyond traditional IT-based approaches. In hybrid classifiers, different
components perform specific tasks such as pre-processing, classification, and clustering to
optimize network intrusion detection.
Despite the need for further research in this area, some studies have been published.
Anton et al. [38] introduced a hybrid approach combining SVM and Bayes algorithms. The SVM algorithm distinguishes normal data from anomalies, a decision tree identifies known types of anomalies, and a Naive Bayes algorithm detects unknown anomalies. This technique achieved 80.68%.
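A hybrid detector along these lines can be sketched by pairing a supervised classifier for known attack patterns with an unsupervised outlier detector for unknown ones. The models and synthetic data below are assumptions for illustration, not the cited authors' exact method.

```python
# Sketch of a hybrid detector: a supervised model catches known attack
# patterns, an unsupervised model flags novel outliers, and an input is
# reported if either component fires. All data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_train = rng.normal(size=(400, 5))
y_train = (X_train[:, 0] > 1).astype(int)       # the "known" attack pattern

supervised = LogisticRegression().fit(X_train, y_train)
unsupervised = IsolationForest(random_state=0).fit(X_train[y_train == 0])

X_new = np.vstack([rng.normal(size=(5, 5)),     # normal-looking traffic
                   np.full((1, 5), 10.0)])      # novel outlier

known = supervised.predict(X_new) == 1
novel = unsupervised.predict(X_new) == -1       # -1 marks an outlier
print("alerts:", known | novel)
```

Combining the two verdicts this way follows the rationale above: the supervised component is accurate on documented attacks, while the unsupervised one can still react to traffic unlike anything in the training labels.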
2.8 Classification Algorithms

In the field of network traffic analysis, there are several commonly used classification algorithms that help in understanding and characterizing network systems [39]. Some of the
most commonly used classification algorithms for network traffic analysis include [40].
2.8.1 Linear regression

The linear relationship between a dependent variable and one or more independent variables
can be modeled using the supervised learning algorithm known as linear regression [41].
The goal of linear regression is to find the best-fitting straight line through the data points.
Linear regression can be used for both simple and multiple regression analysis.
In simple linear regression, where there is only one independent variable, the line of best
fit is represented by the equation

y = mx + b

where y is the dependent variable, x is the independent variable, m is the slope of the
line, and b is the y-intercept. The slope represents the relationship between x and y, while
the y-intercept is the point at which the line crosses the y-axis.
Figure 3: Linear regression

In multiple linear regression, there are two or more independent variables, and the line of
best fit is represented by the equation y = b0 + b1x1 + b2x2 + ... + bnxn, where y is
the dependent variable, x1, x2, ..., xn are the independent variables, and b0, b1, b2, ..., bn are
the coefficients of the line. Keeping all other independent variables constant, each coefficient
shows the change in the dependent variable for a one-unit change in the corresponding
independent variable. Linear regression requires a continuous dependent variable; categorical
independent variables can be included once they are encoded numerically. The method of
least squares is used to find the coefficients of the line of best fit, which minimizes the sum
of the squared differences between the predicted and actual values.
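As a concrete illustration of the least-squares fit described above, the following sketch recovers the slope and intercept from a small, invented, noise-free dataset (the numbers are chosen purely for demonstration):

```python
import numpy as np

# Toy data generated from y = 2x + 1 (noise-free, so the fit is exact).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix [x, 1] so the least-squares solution is [m, b] of y = mx + b.
A = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(m, b)  # recovers slope 2 and intercept 1 (up to rounding)
```

The same idea extends to multiple regression by adding one column per independent variable to the design matrix.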
Logistic regression is a supervised machine learning algorithm used for classification tasks.
It is applied to predict the likelihood that a given outcome belongs to a specific class [42].
The goal of logistic regression is to find the model that best explains the relationship between
the independent variables and a dependent variable that is binary in nature, such as
true or false, or yes or no. The primary distinction
between the logistic regression model and the linear regression model is the dichotomous
outcome variable and the use of the logistic function, sometimes referred to as the sigmoid
function, to transform the linear equation. The logistic function yields an S-shaped curve
that enables the model to forecast the likelihood that an outcome will fall into one of the two
classes. The graphic below illustrates the distinction between logistic and linear regression:
The logistic regression model estimates the probability that a given input belongs to a
particular class, using a probability threshold. If the predicted probability is greater than
the threshold, the input is classified as belonging to the class. Otherwise, it is classified as
not belonging to the class.

Figure 4: Logistic regression

One of the main advantages of logistic regression is that it is easy
to implement and interpret. It also does not require a large sample size, and it is robust to
noise and outliers. However, it can be prone to overfitting if the number of independent
variables is large compared to the number of observations. Logistic regression is used in
various fields such as healthcare, marketing, and social sciences to predict the likelihood of
a certain event happening.
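To make the threshold rule concrete, here is a minimal sketch using hand-picked, hypothetical coefficients rather than fitted ones; it only demonstrates how the sigmoid maps a linear score to a probability and how the 0.5 cutoff produces class labels:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps a linear score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters for a single feature: score z = w*x + b.
w, b = 1.5, -3.0
x = np.array([0.5, 2.0, 4.0])
probs = sigmoid(w * x + b)

# Classify with a 0.5 probability threshold, as described above.
labels = (probs >= 0.5).astype(int)
print(probs.round(3), labels)
```

In practice the weights would be learned by maximum-likelihood estimation, for example with scikit-learn's LogisticRegression.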
Decision trees are a popular and widely used supervised learning algorithm for both classi-
fication and regression problems. The algorithm creates a tree-like model of decisions and
their possible consequences, with the goal of correctly classifying or predicting the outcome
of new instances. At the top of the tree is a root node that represents the entire dataset.
The root node is then split into two or more child nodes, each representing a subset of the
data that has certain characteristics. This process continues recursively until the leaf nodes
are reached, which represent the final decision or prediction [43].
One of the main advantages of decision trees is their interpretability. The tree structure
makes it easy to understand the logic behind the predictions and the decision-making process.
Figure 5: Decision trees
Additionally, decision trees can handle both categorical and numerical data and can handle
missing values without the need for imputation. However, decision trees also have some
limitations. They can easily overfit the training data, especially if the tree is allowed to grow
deep. To prevent overfitting, techniques such as pruning, limiting the maximum depth of
the tree, or using ensembles of trees like random forests can be used.
Decision trees are also sensitive to small changes in the data. A small change in the
training data can lead to a completely different tree being generated. This can be mitigated
by using ensembles of trees like random forests, which average the predictions of multiple
trees to reduce the variance.
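The depth-limiting idea mentioned above can be sketched with scikit-learn's DecisionTreeClassifier; the dataset here is synthetic, standing in for labeled traffic records:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (not real network traffic).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Capping max_depth is one of the pruning-style controls against overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_tr, y_tr)
acc = tree.score(X_te, y_te)
print(tree.get_depth(), acc)
```

Without the max_depth cap, the tree would be free to grow until every training instance is classified perfectly, which is exactly the overfitting behaviour described above.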
Random forests are an ensemble learning technique for machine learning classification and
regression problems [44]. This decision-tree-based algorithm builds a more reliable and
accurate model by combining several decision trees. The fundamental principle of random
forests is to take random samples of the data, construct a decision tree for each sample, and
then aggregate the output of all the trees to arrive at a final prediction.
The algorithm first uses a bootstrap sample, or random selection of a portion of the
original dataset, to build a random forest. Using this sample, it then constructs a decision
tree and repeats the procedure a predetermined number of times. Each decision tree is built
on a different bootstrap sample, so each tree will have a slightly different structure [45]. The
final predictions are made by taking the majority vote of all the trees in the forest.

Figure 6: Random forests
The randomness in the random forest comes from two sources: the random selection
of data for each tree and the random selection of features for each split in the tree. This
randomness helps to reduce overfitting, which is a common problem in decision tree algorithms.
Random forests are also less sensitive to outliers and noise in the data, making them
more robust than a single decision tree. Random forests can be used for both classification
and regression problems; in either case they handle categorical and numerical features and
can cope with missing data. A real-life application of random forests is in the field of
finance, where it is used for risk management.
Random forests can be used to identify important factors that contribute to risk and to
develop a model that predicts the risk level of a portfolio. It can also be used in medicine to
predict the likelihood of a patient developing a disease based on their medical history and
other factors. In Python, the scikit-learn library provides the RandomForestClassifier and
RandomForestRegressor classes for building and using random forest models. These classes
provide a simple and consistent interface for building, training, and evaluating random forest
models, and they are compatible with other scikit-learn tools such as cross-validation, grid
search, and feature importance analysis.
One of the key advantages of random forests is that they reduce overfitting, which is a
common problem in decision tree algorithms. This is because a random forest is made up of
multiple decision trees, each of which is built on a different subset of the data. As a result,
the final predictions are less sensitive to the specific data points in the training set.
Another advantage of random forests is that they are able to handle missing data and
categorical variables. They can also handle high-dimensional data and are less affected by outliers.
A real-life example of random forests is in the field of finance. Random forests can be used
to predict whether a customer will default on a loan. The algorithm can take into account
factors such as the customer’s credit score, income, and employment history to make the
prediction.
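A minimal sketch of the scikit-learn workflow described above, on synthetic stand-in data (not a real loan or traffic dataset), looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic features; in a real application these would be traffic or loan attributes.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random feature subset per split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())
```

Each tree votes on the class of a new instance, and the forest reports the majority decision, which is what dampens the variance of any single tree.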
Neural networks are a set of algorithms that are designed to recognize patterns in data.
They are inspired by the structure and function of the human brain and are used to model
complex relationships between inputs and outputs.
A neural network is made up of layers of interconnected nodes, also known as neurons.
These layers are organized into input, hidden, and output layers. The input layer receives
data, the hidden layers process the data, and the output layer produces the final result [46].
Each neuron in a layer is connected to the neurons in the next layer through pathways called
edges or connections, which are assigned a weight value. These weight values are adjusted
during the learning process to improve the accuracy of the model. The learning process in a
neural network is called training, and it involves adjusting the weight values of the edges to
minimize the error between the predicted output and the actual output. This is done using
an optimization algorithm, such as stochastic gradient descent, which iteratively updates the
weights in the direction that reduces the error.
Neural networks can be used for a variety of tasks, including image and speech recognition,
natural language processing, and time-series forecasting. They are particularly useful for
problems with large and complex data sets, where traditional machine learning methods
may struggle.

Figure 7: Neural network
Deep learning is a subfield of machine learning that utilizes neural networks to learn from
data. It is useful for problems with large and complex data sets where traditional machine
learning methods may struggle. Neural networks are sets of algorithms designed to recognize
patterns in data; they are inspired by the structure and function of the human brain and are
used to model complex relationships between inputs and outputs.
A simple example of a neural network is a multi-layer perceptron (MLP). An MLP
consists of an input layer, one or more hidden layers, and an output layer (see diagram).
The input layer receives the input data, which is then processed through the hidden layers
using a set of weights and biases. The output of the final hidden layer is passed through the
output layer to produce the network’s prediction.
Let’s say we want to create a neural network that can predict the price of a house based
on its square footage, number of bedrooms, number of bathrooms and age of the house. Our
input layer would have 4 neurons, one for each feature (square footage, number of bedrooms,
number of bathrooms, and age). Our output layer would have 1 neuron, representing the
predicted price of the house.
We can add one or more hidden layers in between the input and output layers to increase
the model’s capacity to learn more complex representations of the data. For example, we
could add a hidden layer with 5 neurons. The hidden layer would use the input data to learn
intermediate representations and pass them on to the next layer.

Figure 8: A simple representation of a neural network
To train the neural network, we use a training dataset consisting of input-output pairs
of house prices and their corresponding square footage, number of bedrooms, number of
bathrooms, and age. We use an optimization algorithm such as stochastic gradient descent to
iteratively adjust the weights and biases of the network so that it can predict the correct
output given an input.
Once the model is trained, it can be used to make predictions on new, unseen data.
This can be useful for a variety of tasks such as predicting housing prices, stock prices, or
even identifying objects in images. It is worth noting that this example is a very simple
representation of a neural network; in practice, neural networks can be much more complex,
with multiple hidden layers and a large number of neurons in each layer. Additionally, there
are many different types of neural networks, such as convolutional neural networks (CNNs)
and recurrent neural networks (RNNs), which are designed to handle specific types of data
such as images and time-series data, respectively.
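The house-price example above can be sketched with scikit-learn's MLPRegressor. The data-generating rule below is entirely made up for illustration; real use would call for feature scaling and hyperparameter tuning:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Invented houses: [square footage, bedrooms, bathrooms, age] -> price (toy rule).
X = rng.uniform([500, 1, 1, 0], [3500, 5, 4, 50], size=(200, 4))
y = 100 * X[:, 0] + 20000 * X[:, 1] + 15000 * X[:, 2] - 1000 * X[:, 3]

# Four inputs, one hidden layer of 5 neurons, one output: the architecture above.
model = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
model.fit(X, y)

pred = model.predict([[1500, 3, 2, 10]])
print(pred)
```

Training repeatedly adjusts the weights and biases with an optimizer (here Adam, a variant of stochastic gradient descent) to shrink the error between predicted and actual prices.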
Overfitting and underfitting are two common problems in machine learning. These problems
occur when a model is too complex or too simple for the data it is trying to fit, resulting in
poor performance on new, unseen data. In this section, we discuss overfitting and underfitting
in detail and provide examples to help understand these concepts.

Figure 9: Overfitting and underfitting

Overfitting occurs when a model is too complex and fits the training data perfectly but
performs poorly on new, unseen data. This happens when the model learns the noise in the
data rather than the underlying pattern. As a result, the model becomes too specific to the
training data and is unable to generalize to new data.
Underfitting occurs when a model is too simple and is unable to capture the underlying
pattern in the data. As a result, the model performs poorly on both the training set and
new, unseen data. Underfitting takes place when a model is too simple to adequately
represent the relationship between the target variable and the features.

Feature engineering is the process of turning unprocessed data into features that can be
applied to a machine learning model. This is an important stage in the machine learning
process because the type and quality of the features can significantly impact the model's
performance. By extracting important information from the raw data, feature engineering
seeks to create new features that can improve the predictive power of the model [47].
Numerous techniques can be used in feature engineering, including the following. Feature
selection is the technique of selecting a subset of the original features according to their
importance or relevance for the prediction task [48]. By reducing the dimensionality of the
data and avoiding the inclusion of irrelevant or redundant features, you can select a subset
of features that contain the most insightful and useful information. As a result, model
performance may improve, training times may be shortened, and interpretability may increase.
There are several approaches to feature selection, some of which are as follows:
• Filter methods: Filter methods evaluate each feature independently and select a
subset based on a criterion such as correlation or mutual information [49]. These
methods are simple and fast, but they do not take interactions among features into
account.

• Wrapper methods: Wrapper methods evaluate candidate feature subsets by training
a specific model on each subset and selecting the subset that yields the best
performance.

• Embedded methods: Embedded methods select features as part of the model train-
ing process. These methods are typically used in ensemble methods and tree-based
models, such as random forests and gradient boosting.
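A filter-style selection can be sketched with scikit-learn's SelectKBest, scoring each feature independently by its mutual information with the class label (the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# 20 features, only 5 informative; a filter method should favour the informative ones.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Score features independently by mutual information and keep the top 8.
selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (300, 8)
```

Because each feature is scored in isolation, this is fast but, as noted above, blind to feature interactions.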
25
how well these algorithms distinguish between different classes, such as normal and malicious
traffic [52]. The primary evaluation metrics for classification algorithms include accuracy,
precision, recall, F1-score, and the receiver operating characteristic (ROC) curve [53].
In the field of machine learning, a confusion matrix is a table used to describe the performance
of a classification model on a set of data for which the true values are known. It is a
useful tool for evaluating a model and helps identify where the model may be making
errors [54]. The matrix compares the predicted values with the true values and contains
four quantities: true positives (TP), false positives (FP), true negatives (TN), and false
negatives (FN) [55]. These values are defined as follows:
• True Positives (TP): The number of instances predicted positive that are actually positive.

• False Positives (FP): The number of instances predicted positive that are actually negative.

• True Negatives (TN): The number of instances predicted negative that are actually negative.

• False Negatives (FN): The number of instances predicted negative that are actually positive.
2.12 Accuracy
Accuracy is the most straightforward metric, representing the proportion of correctly classified
instances out of the total instances [56]. It is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Figure 10: Confusion matrix classification

However, accuracy alone can be misleading, especially in the context of network traffic
classification, where the class distribution is often imbalanced.
2.13 Precision
The precision of the model is defined as the ratio of true positive predictions to all positive
predictions [57]. It shows how many of the predicted anomalies are actual anomalies:

Precision = TP / (TP + FP)

When the cost of false positives is large, high precision is crucial because it guarantees
that the majority of detected anomalies are, in fact, anomalies.
Recall, or sensitivity, measures the proportion of actual positives that are correctly identified
by the model [58]. It reflects the model's ability to detect all relevant instances:

Recall = TP / (TP + FN)

High recall is critical in scenarios where missing an anomaly (false negative) is particularly
costly.
The ROC curve plots the true positive rate (recall) against the false positive rate (1 -
specificity) at various threshold settings [59]. The area under the ROC curve (AUC)
summarizes the performance of the model across all thresholds in a single value. A higher
AUC indicates a better-performing model, with a value of 1 representing a perfect model
and 0.5 representing a model with no discriminative power.
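All of the metrics above can be computed from the confusion matrix. The labels and scores below are invented purely to show the arithmetic (1 = attack, 0 = normal):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7, 0.2, 0.95]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)                           # 4 1 4 1
print(accuracy_score(y_true, y_pred))           # (TP+TN)/(TP+TN+FP+FN) = 0.8
print(precision_score(y_true, y_pred))          # TP/(TP+FP) = 0.8
print(recall_score(y_true, y_pred))             # TP/(TP+FN) = 0.8
print(round(f1_score(y_true, y_pred), 2))       # harmonic mean of the two = 0.8
print(round(roc_auc_score(y_true, scores), 2))  # area under the ROC curve
```

Note that the AUC is computed from the continuous scores, not the thresholded predictions, since it sweeps over all possible thresholds.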
We reviewed literature surveys and previous works to develop the framework we are seeking.
Throughout our research process, we examined various approaches and identified the most
relevant articles that could offer potential solutions to our problem. The following sections
provide a brief overview of the work carried out using anomaly and signature detection
approaches, IDSs based on machine learning techniques, and the various feature selection
methods employed in intrusion detection.
Intrusion detection systems are increasingly critical in research and development due to the
rising frequency of computer and network attacks [60].
Several researchers have focused on anomaly-based network intrusion detection systems
(NIDS) to detect and mitigate network attacks [61] [62]. Such systems monitor abnormal
activities on the network and identify malicious activities by detecting deviations from
normal network traffic behavior.
Kshirsagar and Joshi [63] proposed a rule-based classification model for intrusion de-
tection using data mining frameworks. The primary goal of their research was to evaluate
different rule-based classifiers for intrusion detection systems (IDS) and select the most ef-
fective one. The researchers first preprocessed the raw audit data into ASCII network packet
information, which was then summarized into connection records. Rules for the connection
records were then generated using a data mining algorithm. The authors used the KDD
CUP 99 dataset to evaluate the proposed model on four types of attacks: DoS, probe, U2R,
and R2L.
The proposed model produced good detection accuracy on known attacks, as stated by
the authors. However, the work was unable to classify new attacks, i.e., attacks present in
the test dataset but not available in the training dataset.
This study is similar to the work presented in [64], where the authors focused on de-
veloping meaningful but weak rules by incorporating attack signatures through manual rule
induction. Another study explored the use of association rule mining to generate interest-
ing rules from the KDD dataset, which contains a variety of data types, including binary,
discrete, and continuous.
The authors in the current research work employed a data mining framework to generate
rules for intrusion detection, which aligns with research directions highlighting the need for
further study of data mining algorithms in intrusion detection applications.

This study has certain limitations, as acknowledged by the authors. One potential solution
to the high false alarm rates of anomaly-based intrusion detection systems is to combine
various data mining techniques, which could be further explored in future research.
The study by Karami [65] developed a system to accurately detect anomalies and pro-
vide visualized information to end-users using machine learning techniques. The primary
goal was to achieve a higher detection rate with a low false alarm rate. The author utilized
two self-organizing map (SOM) algorithms, a fuzzy SOM and a standalone SOM, implemented
in three stages. In the first stage, benign outliers were identified. In the second stage,
the SOM was executed, and in the third stage,
the self-organizing map algorithm was re-executed over the selected data. Experiments were
conducted on the Windows platform using MATLAB R2016b. The proposed model was
evaluated on the NSL-KDD, UNSW-NB15, AAGM, and VPN-nonVPN datasets. The exper-
imental results demonstrated that the proposed method provided better lattice adjustment
with fewer overlapped connections among neighbors. However, it did not perform well with
connections between nodes and their neighbors compared to the other two methods.
Amit Mahajan et al. [66] devised a system utilizing diverse machine learning (ML) tech-
niques for detecting and scrutinizing distributed denial of service attacks. Their approach
involved constructing an anomaly-based detection system employing four ML methods:

• Naïve Bayes

• Decision Tree (DT)

• Multilayer Perceptron

• Support Vector Machine (SVM)
The experiments were conducted using the WEKA tool on the Ubuntu 16.04 LTS plat-
form, with 66% of the dataset allocated for training and the remaining 34% for testing
purposes. Evaluation metrics such as accuracy, precision, and recall were employed to compare
the performance of the four ML classifiers. The experimental findings revealed overall
accuracy rates of 96.89%, 98.89%, 98.91%, and 92.31% for Naïve Bayes, DT, Multilayer
Perceptron, and SVM, respectively, with the Multilayer Perceptron achieving the highest
accuracy. Despite the authors curating and annotating their dataset, it was confined to only
two network layers (network and application layers).
Gupta and Kulariya [67] introduced a fast framework for detecting network attacks using
five machine learning (ML) techniques: Logistic Regression (LR), Support Vector Machine
(SVM), Random Forest (RF), Naïve Bayes (NB), and Gradient Boosted Tree (GB Tree).
This framework, implemented on Apache Spark, aims to classify network traffic as either
normal or an attack. Two feature selection methods, correlation-based feature selection
(CFS) and chi-squared feature selection, were employed. The framework was tested on two
real-time network traffic datasets: KDD-CUP and NSL-KDD. In the first experiment, the
CFS method was applied to both datasets to remove highly correlated attributes, followed by
training the classification models. The results showed that LR achieved the highest accuracy
at 91.56%, while SVM had the lowest at 78.84%. NB had the highest sensitivity at 98.62%,
and GB Tree had the lowest at 89.23%. GB Tree achieved the highest specificity at 99%,
whereas SVM had the lowest. NB was the fastest in training and prediction time, taking 80
seconds, while SVM was the slowest, taking 480 seconds on the KDD-CUP dataset. In the
second experiment, chi-squared feature selection was applied to both datasets before training
the classification models. The results showed that LR and SVM maintained their accuracies
from the first experiment, while the accuracies of RF and GB Tree increased
to 92.13% and 91.38%, respectively. Although the chi-squared method improved processing
times, it decreased the accuracy of some classifiers. This study underscores the trade-offs
between different feature selection methods and ML techniques in the context of network
attack detection.
In recent years, there has been a growing interest in the application of machine learning
techniques for developing Intrusion Detection Systems [68]. This literature review aims to
explore the various sources that discuss IDS based on machine learning and their effectiveness
in detecting and preventing cyber threats. One key source in this field is the study
conducted by John et al.,
which compares the performance of different machine learning algorithms in detecting net-
work intrusions. Their research found that Logistic Regression outperformed Decision Tree
and Random Forest in terms of accuracy, precision, and recall when applied to credit scoring
and classification on a Big Data platform [70]. Another source, by Vidya C M, explores
the use of machine learning techniques, specifically logistic regression, for credit analysis.
The study found that logistic regression performed better than decision tree and random
forest in terms of accuracy, precision, and recall for credit scoring on HDFS.
Abdulla Amin [71] conducted a comparison of various machine learning (ML) techniques
used to develop intrusion detection systems (IDS). They provided a detailed discussion of
intrusion classification algorithms and their comparisons. The authors focused on commonly
used ensemble algorithms, including boosting, bagging, stacking, and mixtures of competing
experts. Ensemble methods work by training multiple classifiers simultaneously to solve
the same problem and then combining their outputs to enhance accuracy. This approach
offers better generalization capacity than using a single classifier. Their survey found that
an ensemble classifier based on the majority voting strategy was highly effective for IDS
development. The performance of the proposed method was evaluated using the KDD-CUP
99 dataset.
Bhupendra et al. [72] developed a tree-based intrusion detection system utilizing a
Classification and Regression Tree (CART) decision tree to detect new or unknown attacks.
They used a Correlation-based Feature Selection (CFS) method to effectively identify four
types of attacks—DoS, probe, U2R, and R2L—while minimizing resource usage. The system
followed three main steps: preprocessing the dataset, applying a feature selection method,
and classifying network traffic as normal or attack.
Initially, the authors selected the NSL-KDD dataset and preprocessed the data by con-
verting categorical attributes into numeric values. In the next step, they applied the CFS
subset evaluation to the preprocessed data to determine the optimal subset of features. The
Gain Ratio (GR) feature selection method generated a ranked list of attributes, and CFS
identified the best subset by considering each feature’s individual predictive ability, reducing
the original 41 attributes to the 14 most relevant ones.
The system’s performance was tested using the KDD-CUP 99 and NSL-KDD datasets
with the WEKA tool. Two decision tree models were created: one for five-class classification
(normal and four attack types) and another for binary classification (normal and attack).
Performance metrics included classification accuracy, detection rate for each class, and false
positive rate. The experimental results indicated an overall accuracy of 83.7% for the five-
class classification and 90.2% for the binary classification. However, the report highlighted
that the proposed model was susceptible to overfitting.
Jamal et al. [73] created a classification method that integrates a machine learning-based
decision algorithm with a multilayer perceptron to develop an intrusion detection system
(IDS) characterized by high accuracy, a high detection rate, and a low false alarm rate
(FAR). The authors utilized artificial neural networks (ANNs) to overcome limitations of
the dataset, such as nonlinearity and incompleteness, using the KDD-CUP99 dataset. Their
methodology involved two phases. In the first phase, the dataset was processed using both a
decision tree-based approach and a multilayer perceptron, classifying and labeling the data
as either attack or benign. In the second phase, this newly labeled dataset was input into
the well-trained multilayer perceptron to evaluate the test data. A noted limitation of this
method is that the experiment was only performed on a small network dataset.
Yasir et al. [74] presented a deep learning approach for anomaly detection in network
traffic data, leveraging the capability of deep learning to learn from raw input features
without the need for preprocessing. Their method involved training a deep neural network
(DNN) model directly with raw data, resulting in the development of a multi-layer DNN
named Raw Power specifically designed for network analysis. Initially, they constructed a
1D-CNN model for identifying malware at the packet level. Subsequently, each packet in the
labeled subset was assigned a fixed threshold value (Ts = 1300), with larger packets being
reduced and smaller packets retained. The packets were then classified as either benign
or malware. The model integrated two activation functions (ReLU and max pooling) and
two feed-forward layers, each comprising two hundred neurons. Additionally, a binary LR
algorithm was employed for classification. The model was implemented using Keras and
the TensorFlow framework on the USTCTFC2016 dataset. Comparative analysis with the
RF algorithm, using identical input features, revealed that RF achieved superior performance
with a 70% detection rate and a false alarm rate (FAR) below 3%. However, the model
displayed suboptimal performance when applied to non-processed input data.
2.19 Feature Selection Methods in Network Traffic Classification
Feature selection plays a crucial role in network traffic classification, where the objective
is to identify patterns or characteristics in network data that distinguish between different
types of traffic, such as normal and malicious activities. Various feature selection methods are
employed to extract the most relevant and informative features from raw network data. These
methods help improve the accuracy and efficiency of classification algorithms by reducing
the dimensionality of the data and removing irrelevant or redundant features.
One common feature selection approach is filter methods, which evaluate the relevance of
individual features based on statistical measures such as correlation, mutual information, or
chi-square tests. Features with low relevance or high redundancy are then eliminated from
the dataset. Another approach is wrapper methods, where feature subsets are evaluated
using a specific classification algorithm, and the subset that yields the best performance is
selected. Wrapper methods typically involve an exhaustive search or heuristic algorithms to
explore the space of possible feature combinations.
Embedded methods integrate feature selection directly into the training process of clas-
sification algorithms. These methods leverage the inherent feature selection mechanisms of
certain algorithms, such as decision trees or support vector machines, to identify the most
discriminative features during model training.
In addition to these general approaches, feature selection methods in network traffic
classification often consider domain-specific knowledge and characteristics of network data.
For example, features related to packet headers, protocol types, payload content, or traffic
behavior may be prioritized based on their relevance to specific types of network attacks or
anomalies.
Overall, effective feature selection methods are essential for building accurate and efficient
network traffic classification systems, as they help focus computational resources on the most
informative aspects of the data while reducing noise and improving the interpretability of
the classification results. This sub-section presents some works which used feature selection
methods in the IDS.
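As a concrete illustration of the filter approach described above, the sketch below scores features independently with a chi-square test and keeps the top four. The data is a synthetic stand-in for network traffic features, not the thesis's dataset; the sample and feature counts are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for network traffic features
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=42)
X = np.abs(X)  # chi-square scoring requires non-negative inputs

# Filter method: score each feature independently, keep the 4 best
selector = SelectKBest(score_func=chi2, k=4)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)          # (500, 4)
print(selector.get_support())   # boolean mask of retained features
```

Wrapper and embedded methods differ only in how the subset is scored: a wrapper would retrain a classifier per candidate subset, while an embedded method (e.g. a tree's feature importances) gets the ranking for free during training.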
2.20 Summary
Intrusion Detection Systems play a crucial role in identifying and mitigating various cyber
threats in network environments. However, traditional signature-based IDSs have limitations
in terms of performance, detection rate, training time, and false alarm rate. To address these
limitations, researchers have explored the use of anomaly-based network IDSs and big data
processing techniques. By leveraging anomaly-based detection methods, IDSs can detect un-
known or new attacks that do not have predefined signatures. These IDSs analyze network
traffic data and identify patterns or behaviors that deviate from normal activity, indicating
potential intrusions. Furthermore, by using big data processing frameworks such as Apache
Spark, the performance of intrusion detection can be enhanced. This allows for faster pro-
cessing of large volumes of network traffic data, reducing training time and improving detec-
tion accuracy. Additionally, the use of big data processing can help identify complex attack
patterns and reduce false alarm rates, as it can handle and analyze large amounts of data
in parallel. This combination of anomaly-based detection and big data processing provides
a more effective and efficient approach to intrusion detection, as it can detect both known
and unknown attacks, minimize training time, improve detection accuracy, and reduce false
alarm rates in IDSs.
3 Methodology
This chapter presents the design of the proposed anomaly-based network intrusion detection system enhanced with machine learning techniques. The different components of the proposed anomaly-based IDS are described together with their relevance and the techniques used while building them. The chapter also presents the architecture with the implemented algorithms. Anomaly-based intrusion detection systems are security systems that focus on detecting abnormal or anomalous behavior within a network or system and alerting administrators or security personnel of potential intrusions or attacks. They raise an alarm as soon as an attack is detected, and to be effective they are expected to detect intrusions precisely.
Figure 11: System Architecture
The proposed network Intrusion Detection System (IDS) architecture consists of several key components that work in tandem to effectively detect and mitigate potential security threats. The preprocessing phase is a crucial step that involves converting string data into numeric format, applying one-hot encoding, feature selection, normalization, and vector assembling. Following this, the preprocessed data is split into training and testing sets, with the training set further divided into actual training and validation sets. The machine learning algorithm is then trained on the training data with default parameters, and its performance is evaluated on the testing set.
To optimize the model's performance, a grid search phase is employed. The training and validation data are passed to the grid search, which constructs a parameter grid of hyperparameters for the classification algorithm. Cross-validation is then used to search the parameter grid and select the best-performing parameters, which are used to build the final trained model. A detailed description of each component of the system architecture is presented in the following sections.
3.2 Data Collection and Preparation
In today’s data-driven world, it is essential to have reliable and relevant information for
making informed decisions. This requires effective data collection and preparation processes
[75]. Researchers should ensure that the collected data accurately represents the variables
relevant to their research questions and objectives [76].
In today’s digital landscape, ensuring the security of computer networks and systems has
become increasingly important [77]. This is due to the constant evolution and sophistication
of cyber threats, such as intrusions and malware attacks [78]. To develop an effective intrusion
classification model, a significant amount of network traffic data is required [79]. This data
serves as the foundation for training machine learning algorithms to accurately identify and
classify different types of intrusions [77]. This data can be obtained from various sources,
such as network logs, packet captures, and sensor data. Additionally, collaborations with
organizations or institutions that have access to large-scale network traffic data can also be
beneficial in obtaining the necessary datasets for training the intrusion classifier. For this reason, this study uses the NSL-KDD dataset, a newer version of the KDD Cup 99 dataset [80]. It is one of the most widely used publicly available datasets for
intrusion detection. The NSL-KDD dataset and the KDD99 dataset play crucial roles in
intrusion detection and cyber security research. Despite their widespread use, researchers
must consider certain distinctions when selecting between them. The NSL-KDD dataset
represents an enhanced iteration of the KDD99 dataset, developed in response to criticisms
directed at its predecessor. Aiming to rectify limitations identified in the KDD99 dataset, the
NSL-KDD dataset addresses primary concerns related to its excessive count of redundant and
duplicate records, thereby enhancing the effectiveness of anomaly detection methods [81]. To
address these concerns, the NSL-KDD dataset resolves the problem of redundant records and
furnishes labels for both its training and testing sets. This characteristic renders it a more
apt dataset for the assessment of intrusion detection algorithms and facilitates comparisons
between various proposals. The NSL-KDD dataset, being a subset of the original KDD99
dataset, maintains the same set of features while enhancing overall data quality. Additionally,
the NSL-KDD dataset includes 24 attack types in the training set and 38 attack types in
the testing set [81]. These attack types cover a wide range of intrusion scenarios, making
the NSL-KDD dataset more representative and comprehensive. Furthermore, the number of
records in the NSL-KDD train and test sets is reasonable, ensuring a balanced and reliable
dataset for experimentation [82]. Overall, the NSL-KDD dataset is considered to be an
improvement over the KDD99 dataset, addressing its shortcomings and providing a more
suitable and valid dataset for intrusion detection research [82]. The process involves three
main stages: gathering data from regular network traffic and simulated attacks, selecting
and extracting relevant features, and employing machine learning for classification.
The dataset comprises 125,973 entries and 42 columns, with no missing values, as evi-
denced by the ”Non-Null Count” for each column matching the total number of entries. This
robust dataset includes various data types: 23 columns contain integer data, 15 columns hold
float data, and 4 columns consist of object data, which are likely categorical. This diverse
range of data types provides a comprehensive foundation for thorough analysis and insights.
The dataset used for the experiment, that is the NSL-KDD dataset, has two categories:
KDDTrain and KDDTest. The KDDTrain data is used to train classification algorithms
whereas the KDDTest consists of a set of new instances that are not present in the training
data, which helps to evaluate the performance of a model.
As can be seen, the DoS and normal classes have a large number of instances in both the training and testing sets, whereas the U2R and R2L classes have few instances in both cases. The different types of attack in both the training and testing sets are mapped to their respective classes or categories as follows.
3.3 Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a crucial step in data analysis that involves summarizing
and visualizing the key characteristics of a dataset. The primary goals of EDA are to
understand the structure and patterns in the data, detect anomalies or outliers, and generate
hypotheses for further analysis.
From initial observations, the dataset comprises 125,973 entries spread across 42 columns, with
each column being complete and free from missing values, as evidenced by the ”Non-Null
Count” matching the total number of entries. The data types in the dataset are diverse,
including 23 integer columns, 15 float columns, and 4 columns with object types, which are
likely categorical in nature. This comprehensive structure suggests a well-organized dataset
suitable for various types of analysis.
The dataset reveals that approximately 53.5% of the samples are labeled as ’normal’,
reflecting standard network activity. Conversely, 46.5% of the samples are categorized as
’attacking’, which signifies the presence of malicious or abnormal network activities, includ-
ing various types of cyber-attacks. This distribution highlights a nearly balanced dataset
between normal and anomalous network behaviors, which is crucial for accurate analysis and
detection.
The data indicates that the TCP protocol is the most frequently used, with UDP and
ICMP following. Furthermore, UDP and ICMP exhibit a higher proportion of attacking
instances relative to normal ones, while the TCP protocol shows a more balanced distribution
between normal and attacking instances.
Figure 12: Data samples
These protocol-level observations highlight the varying security dynamics associated with different protocols and can guide targeted security measures.
3.4 Data Preparation
In machine learning, the quality and characteristics of the data play a crucial role in the performance of the model. Data preparation is the process of cleaning, transforming, and organizing the data to make it suitable for the machine learning model. This section describes the data preparation process and the various techniques used to preprocess the data, covering the encoding of categorical columns, feature scaling, feature selection, and data transformation, as well as the importance of evaluating the quality of the data.
3.5 Importing and cleaning
Importing and cleaning data is an essential step in the machine learning process. The quality and characteristics of the data play a crucial role in the performance of the model, and it is important to ensure that the data is in a format that can be understood by the machine learning model. The data preprocessing process, executed in the Python environment, is summarized as follows.
1. Importing libraries
2. Loading the dataset
3. Checking for missing values and categorical data
4. Encoding with a one-hot encoder
5. Feature scaling
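The steps above can be sketched as follows. For a self-contained illustration, a small in-memory DataFrame stands in for the NSL-KDD file (the column names here are only a subset of the real ones); in practice step 2 would be a `pd.read_csv` call on the KDDTrain file.

```python
# Step 1: import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Step 2: load the dataset (toy stand-in for the NSL-KDD training file)
df = pd.DataFrame({
    "duration":      [0, 12, 0, 3],
    "src_bytes":     [181, 239, 0, 145],
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "labels":        ["normal", "attack", "attack", "normal"],
})

# Step 3: check for missing values and identify categorical columns
assert df.isnull().sum().sum() == 0
categorical = df.select_dtypes(include="object").columns.tolist()

# Step 4: one-hot encode the categorical feature columns
df = pd.get_dummies(df, columns=["protocol_type"])

# Step 5: scale the numeric feature columns to zero mean, unit variance
scaler = StandardScaler()
df[["duration", "src_bytes"]] = scaler.fit_transform(
    df[["duration", "src_bytes"]])

print(df.columns.tolist())
```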
In conclusion, formatting data is an important step in data preprocessing, as it ensures that the data is in a consistent and usable format for machine learning models. Formatting involves converting data into a form that can be easily consumed by learning algorithms, including tasks such as converting data types, encoding categorical variables, and standardizing variable names. The chosen technique should be based on the specific characteristics of the data and the research question. Formatting data correctly can improve the performance of machine learning models and make the data easier to work with. It is also important to check the data for inconsistencies or errors, such as typos or mislabeled data, and to correct them as necessary.
3.6 Encoding Categorical Columns
Encoding categorical columns is a crucial preprocessing step in machine learning that transforms categorical data into numerical formats suitable for algorithms. One common method
is Label Encoding, which assigns a unique numerical value to each category and is par-
ticularly useful for ordinal data where categories have a meaningful order. For nominal
data, where categories lack an inherent order, One-Hot Encoding is often employed. This
technique creates binary columns for each category, ensuring that each category is repre-
sented distinctly. Binary Encoding offers a compromise between dimensionality reduction
and categorical information preservation by converting categories into binary code. Fre-
quency Encoding replaces categories with their occurrence frequency in the dataset, which
can be useful when the importance of categories is tied to their frequency. Lastly, Target
Encoding replaces categories with the mean of the target variable for each category, making
it ideal for scenarios where there is a strong relationship between the categorical feature and
the target variable. The choice of encoding method depends on the nature of the categorical
data and the specific requirements of the machine learning model being used.
3.7 Correlation
Correlation is a powerful tool in statistics and data analysis, helping to identify and quantify
the relationship between variables. Understanding correlation can provide insights into how
changes in one variable might affect another, which is valuable for predictive modeling and
data-driven decision-making.
The dataset analysis reveals various degrees of correlation among the features, indicating
different levels of associations and dependencies.
Duration: The ”duration” of the connection exhibits a very low correlation with most
other features in the dataset. This suggests that the length of the connection is largely
independent and has little impact on other characteristics of the connections.
Protocol Type: The ”protocol type” shows low to moderate correlations with various
features. For instance, it has a moderate positive correlation with "logged_in" and "service".
This indicates that certain protocols are more frequently associated with specific services or
logged-in sessions, highlighting patterns in protocol usage depending on the context.
Service: The "service" feature demonstrates moderate correlations with other features, particularly with "dst_host_srv_count" and "dst_host_same_srv_rate". This implies that specific services are commonly linked with a higher number of destination host server counts or consistent service rates, suggesting a pattern in service usage and destination interactions.
Flag: The "flag" feature shows moderate correlations with some features like "srv_rerror_rate" and "srv_diff_host_rate". This indicates that certain flags are more commonly associated with
specific server error rates or differences in host rates, which can be indicative of certain types
of network behaviors or issues.
Labels: The "labels" feature, likely representing the target variable indicating normal or attacking behavior, exhibits moderate correlations with several features such as "logged_in", "srv_count", and "serror_rate". This suggests that these features might be indicative of
whether a connection is normal or potentially malicious, providing insights into patterns of
normal versus attacking behaviors.
This analysis provides a clearer understanding of how various features in the dataset
interact and potentially influence each other.
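The kind of correlation analysis described above amounts to one `DataFrame.corr()` call. The sketch below uses synthetic columns named after NSL-KDD features, with one pair deliberately constructed to be correlated; the construction is an illustration, not the dataset's actual relationships.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
duration = rng.exponential(scale=5.0, size=n)
srv_count = rng.poisson(lam=10, size=n).astype(float)
# serror_rate built to track srv_count, mimicking a correlated pair
serror_rate = 0.8 * srv_count + rng.normal(scale=1.0, size=n)

df = pd.DataFrame({"duration": duration,
                   "srv_count": srv_count,
                   "serror_rate": serror_rate})

corr = df.corr()                 # Pearson correlation matrix
print(corr.round(2))

# The engineered pair stands out; independent duration stays near zero
assert corr.loc["srv_count", "serror_rate"] > 0.8
assert abs(corr.loc["duration", "srv_count"]) < 0.15
```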
Splitting the data into training and testing sets is an important step in the machine learning
process. It is used to evaluate the performance of the model on unseen data, which helps
to prevent overfitting and to assess the generalization ability of the model. The process of
splitting the data involves randomly dividing the data into two subsets: a training set and a
testing set. The training set is used to train the machine learning model, while the testing
set is used to evaluate the performance of the model. The standard split ratio is typically
80% for the training set and 20% for the testing set, although this ratio can be adjusted
depending on the specific needs of the project. Out of the preprocessed dataset, 70% is used for training and the remaining 30% is used for testing.
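A minimal sketch of this split with scikit-learn, on synthetic data, is shown below. The `stratify=y` argument, which keeps the class ratio identical in both subsets, is an added safeguard assumed here rather than stated in the original design.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70/30 split, as used for the preprocessed dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

print(len(X_train), len(X_test))   # 700 300
```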
3.11 Hyperparameter tuning
Hyperparameter tuning is the process of systematically searching for the best combination of hyperparameters in order to optimize the performance of a machine learning model. Hyperparameters are parameters that are not learned from data but are set by the user, such as the learning rate, the number of hidden layers, or the regularization strength. The optimal values of these parameters can greatly affect the performance of the model, and therefore it is important to tune them to achieve the best results. There are several methods for tuning hyperparameters, including grid search and random search.
Grid search is a method for systematically trying all possible combinations of hyperparameters within a predefined range. Although it is computationally expensive, its exhaustive search makes it a reliable way to find the optimal combination of hyperparameters for a machine learning model. In this research, a grid search is used to evaluate and select the best subset of hyperparameters for the selected models.
The process of grid search involves the following steps:
1. Define a grid of candidate hyperparameter values.
2. Choose an evaluation strategy, such as k-fold cross-validation.
3. Train the model using each combination of hyperparameters in the grid.
4. Select the combination of hyperparameters that results in the best performance.
By trying many different combinations and using cross-validation to evaluate their performance, grid search finds the set of hyperparameters that gives the best performance on the data, making it a practical technique for tuning a model to a given dataset.
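With scikit-learn, the whole procedure is encapsulated in `GridSearchCV`. The sketch below uses a hypothetical two-parameter grid on synthetic data; the thesis's actual grids and data are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical parameter grid; the thesis's actual grids may differ
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)          # best combination found by CV
print(round(search.best_score_, 3)) # its mean cross-validated accuracy
```

`search.best_estimator_` is then the model refit on all the data with the winning parameters, matching the "build the final trained model" step in the architecture.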
Figure 13: Grid Search
Model comparison is the process of evaluating and comparing the performance of different
machine learning models. This is an important step in the machine learning process as it
allows us to select the best model for a given problem and dataset.
Several metrics can be used to compare the performance of different models; in this research, accuracy, precision, recall, F1-score, and the confusion matrix are used to evaluate the models. These metrics assess the performance of a model on a given dataset and are often used in combination to provide a more comprehensive view of the model's behavior.
Cross-validation is a common method for comparing the performance of different models. It involves repeatedly partitioning the data into training and validation folds, training each model on the training folds and evaluating it on the held-out fold; the scores are then averaged, and the model with the best average performance is selected.
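This comparison loop can be sketched in a few lines; the two models and the synthetic data below are illustrative stand-ins for the thesis's candidates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

models = {
    "random_forest": RandomForestClassifier(random_state=1),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Compare models by mean 5-fold cross-validated accuracy
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores)
print("best:", best)
```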
3.14 Summary
The proposed system aims to achieve effective detection of intrusions by utilizing a collected
dataset, which is preprocessed and classified into categories based on trained models. Key
aspects of the system include data preprocessing, where methods are applied to clean and
prepare the dataset for analysis, and feature selection, which helps to speed up processing
and improve accuracy by selecting the most relevant features from the dataset. Hyperparameter tuning methods like grid search are used to find the best hyperparameter values, with
validation data, not used in the training phase, helping to validate the model’s effectiveness.
If the initial results are unsatisfactory, the model can be retrained to obtain better hyperparameter values. The selected optimal hyperparameter values are used to build the final
model, which then predicts or classifies new instances into their respective categories. Based
on the evaluation results, decisions can be made regarding the efficiency and effectiveness of
the system. The integration of anomaly detection with machine learning techniques and data
processing using Scikit-learn enables robust and scalable intrusion detection capabilities.
4 Experimentation and results
In this chapter, the experiments carried out to test the effectiveness of the proposed system
are discussed. This includes an overview of the development environment and software tools,
the programming languages and libraries used during the classification process, the dataset
employed, implementation details, and the performance evaluation results of the different
models using various evaluation metrics.
4.1 Overview
The main goal of this research is to design an intrusion detection system using anomaly-based
intrusion detection. This system aims to detect both known and unknown anomalies in a
given network. The process begins with data preparation, followed by a training phase where
the system learns to recognize normal behavior. During the detection phase, the system
identifies deviations from normal network activities. Various algorithms are employed for
dimension reduction, and an artificial neural network is used for training and detection, all
selected to maximize the detection rate and minimize the false alarm rate.
Several tools and techniques are utilized for the implementation of the proposed intrusion detection system. All experiments were performed on laptop and desktop machines with the following configuration: Intel(R) Core i5 CPU at 2.40 GHz, 8 GB RAM, running Windows 11. The table below lists the various programming languages and open-source tools that contributed significantly to this work.
4.3 Anaconda and Anaconda Navigator
Anaconda is an open-source Python distribution for scientific computing that bundles the conda package manager and, through its graphical interface Anaconda Navigator, provides access to tools such as Jupyter Notebook. Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. It is widely used in data science, machine learning, academic research, and more.
4.5 Pandas
Pandas is a powerful and widely used Python library for data manipulation and analysis. It provides data structures such as Series and DataFrame, which are particularly useful for handling and analyzing structured data.
4.6 scikit-learn
Scikit-learn is a popular machine learning library in Python that provides a range of tools
for building and evaluating predictive models. It’s known for its simplicity and ease of use,
making it a great choice for both beginners and experienced data scientists.
To explain how the experiment was conducted, let us detail the preprocessing steps for converting categorical data into numerical form using the NSL-KDD dataset. The key task is to convert the categorical columns into numerical indices.
The preprocessing steps for converting categorical data into numerical form using the
NSL-KDD dataset involve several key tasks. Initially, the NSL-KDD dataset is loaded, which
includes columns that need to be converted into numerical form. During preprocessing, the
categorical data in these columns are identified and converted into numeric indices using the
Label Encoder from scikit-learn.
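A minimal sketch of this LabelEncoder step is shown below. The rows are invented stand-ins carrying the four categorical NSL-KDD columns named in the text; the specific service and attack values are examples, not drawn from the thesis's experiments.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in rows with the NSL-KDD categorical columns named in the text
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service":       ["http", "dns", "ftp", "eco_i"],
    "flag":          ["SF", "SF", "S0", "REJ"],
    "labels":        ["normal", "normal", "neptune", "ipsweep"],
})

encoders = {}
for col in ["protocol_type", "service", "flag", "labels"]:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])   # each category -> an integer
    encoders[col] = le                    # kept so the mapping can be inverted

print(df.dtypes.unique())                  # every column is now integer-typed
print(encoders["protocol_type"].classes_)  # ['icmp' 'tcp' 'udp']
```

Keeping the fitted encoders allows `inverse_transform` to recover the original category names when inspecting predictions.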
To prepare the data for machine learning models, the categorical columns need to be converted into numerical values. This can be achieved using the LabelEncoder from the sklearn.preprocessing library. After encoding, each unique category in the protocol type, service, flag, and labels columns is transformed into a distinct integer, putting the dataset into a numerical format suitable for analysis and enabling seamless integration into various algorithms. This preprocessing step ensures that the categorical data is correctly interpreted by machine learning models, facilitating accurate and effective analysis.
In this section, we analyze the experimental results for two scenarios: learning with default parameters and learning with the optimized parameters obtained through hyperparameter tuning for the chosen classification
algorithms. We evaluate all models using both 2x2 and 5x5 confusion matrices. The con-
fusion matrices categorize predictions into True Positive (TP), True Negative (TN), False
Positive (FP), and False Negative (FN).
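The four cells of the 2x2 case can be extracted directly from scikit-learn's confusion matrix, as the sketch below shows on hypothetical binary predictions (1 = attack, 0 = normal):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical binary predictions (1 = attack, 0 = normal)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# scikit-learn's 2x2 layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 3 1 1 3
```

For the 5x5 case (normal plus the four attack categories), the same `confusion_matrix` call simply returns a 5x5 array, with per-class TP on the diagonal.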
The correlation matrix computed on the data confirms the relationships described in Chapter 3: "duration" shows very low correlations with most other features, suggesting it may not be a strong predictor; "protocol_type", "service", and "flag" exhibit low to moderate correlations with related traffic features; and the "labels" target shows moderate correlations with features such as "logged_in", "srv_count", and "serror_rate", making them important for the classification task.
To refine the model and address multicollinearity, features with very high correlations (greater than 0.95) were identified and removed. The removed features include num_root, srv_serror_rate, srv_rerror_rate, dst_host_serror_rate, dst_host_srv_serror_rate, and dst_host_srv_rerror_rate. High correlation among features can lead to redundancy and
potentially skew the model’s performance. By removing these redundant features, we aim
to reduce multicollinearity, improve the model’s interpretability, and enhance its overall
predictive accuracy.
Reduce Overfitting: High correlation can cause overfitting by making the model too
complex. Removing redundant features helps in creating a simpler model that generalizes
better to unseen data.
Improve Model Performance: Eliminating highly correlated features can enhance model
performance by reducing noise and focusing on features that provide unique and useful
information.
Enhance Interpretability: Fewer features make the model easier to interpret, as each
remaining feature contributes distinct information to the predictive power of the model.
In summary, the correlation matrix analysis helps identify key relationships between
features and informs the feature selection process, ensuring that the model remains robust,
interpretable, and capable of generalizing well to new data.
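One common way to implement this pruning is to scan the upper triangle of the absolute correlation matrix and drop any column correlated above 0.95 with an earlier one. The sketch below demonstrates the idea on a synthetic near-duplicate pair; it is an assumed implementation, not the thesis's exact code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
base = rng.normal(size=500)
df = pd.DataFrame({
    "srv_serror_rate": base,
    "serror_rate": base + rng.normal(scale=0.01, size=500),  # near-duplicate
    "duration": rng.normal(size=500),
})

# Upper triangle of the absolute correlation matrix, so each pair is
# inspected once; drop any column correlated > 0.95 with an earlier one
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)                      # ['serror_rate']
print(df_reduced.columns.tolist())
```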
In the data preparation process, the dataset was first divided into features (X_trn) and target labels (Y_trn). The features were scaled using StandardScaler, which standardizes the data
and improves the performance of the machine learning models. After scaling, the dataset
consisted of 125,973 samples with 35 features each. The data was then split into training
and testing sets using a 30% test size, resulting in a training set with 88,181 samples and a
testing set with 37,792 samples. This split is essential for evaluating the model’s performance
on unseen data and preventing overfitting.
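The scaling and split described above can be reproduced on random placeholder data with the same sample count; the 88,181/37,792 figures follow directly from a 30% test fraction (scikit-learn rounds the test size up). Only 5 placeholder features are used here for speed, where the real data has 35.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same sample count as the preprocessed NSL-KDD data; features are
# random placeholders (5 instead of 35, for speed)
X = np.random.default_rng(0).normal(size=(125_973, 5))
y = np.random.default_rng(1).integers(0, 2, size=125_973)

# Standardize, then hold out 30% for testing
X_scaled = StandardScaler().fit_transform(X)
X_trn, X_tst, y_trn, y_tst = train_test_split(
    X_scaled, y, test_size=0.30, random_state=42)

print(len(X_trn), len(X_tst))   # 88181 37792
```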
Various models were employed for classification, including RandomForestClassifier, which
aggregates predictions from multiple decision trees and is effective for large datasets; DecisionTreeClassifier, which splits data based on feature values but can overfit if not managed
properly; LogisticRegression, a linear model useful for understanding feature relationships;
and MLPClassifier, a neural network-based model capable of capturing complex patterns.
LinearRegression was also mentioned, but it is typically used for regression tasks rather than
classification.
The models were evaluated using multiple metrics: accuracy score, which indicates the
proportion of correctly classified instances; precision score, which measures the accuracy of
positive predictions; recall score, which assesses the ability to identify all actual positives;
F1 score, the harmonic mean of precision and recall providing a balance between them; and
ROC AUC score, which reflects the model’s ability to distinguish between classes. Using
these metrics together offers a comprehensive view of model performance, balancing the
strengths and weaknesses of each metric depending on the problem’s context, particularly
when dealing with class imbalance or varying costs of false positives and negatives.
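All five metrics come straight from sklearn.metrics; note that ROC AUC is computed from predicted scores rather than hard labels. The values below are for hypothetical predictions, purely to show the calls.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted scores
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3, 0.85, 0.15])

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "roc_auc":   roc_auc_score(y_true, y_prob),  # uses scores, not labels
}
print({k: round(v, 2) for k, v in metrics.items()})
```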
4.11 Model Performance Analysis
In this analysis, we evaluated the performance of four machine learning models: Random Forest, Decision Tree, Logistic Regression, and Neural Network (MLP Classifier). Each model's performance was measured using accuracy, precision, recall, F1-score, and AUC-ROC.
The Random Forest classifier demonstrated exceptional performance with an accuracy
of 99.87%, precision of 99.81%, recall of 99.94%, F1-score of 99.87%, and an AUC-ROC of
99.86%. This indicates that the Random Forest model is highly reliable and consistent in
its predictions, effectively handling both false positives and false negatives. The Decision
Tree classifier also performed well, though slightly lower than the Random Forest, with
metrics of 99.77% accuracy, 99.78% precision, 99.79% recall, 99.78% F1-score, and 99.77%
AUC-ROC. The small performance gap highlights the benefit of the ensemble approach in
Random Forest, reducing variance and improving generalization.
Logistic Regression showed a noticeable drop in performance compared to the tree-based
models, with an accuracy of 95.32%, precision of 94.92%, recall of 96.39%, F1-score of
95.65%, and an AUC-ROC of 95.24%. This suggests that Logistic Regression may not
capture complex patterns as effectively as the tree-based models but remains suitable for
simpler, linearly separable data. The Neural Network performed almost as well as the
Random Forest, with metrics of 99.66% accuracy, 99.63% precision, 99.73% recall, 99.68%
F1-score, and 99.65% AUC-ROC. This indicates its capability to learn complex patterns,
albeit with a higher demand for computational resources and longer training times.
Model                  Accuracy   Precision   Recall    F1-Score   AUC-ROC
Random Forest           99.87%     99.81%     99.94%    99.87%     99.86%
Decision Tree           99.77%     99.78%     99.79%    99.78%     99.77%
Logistic Regression     95.32%     94.92%     96.39%    95.65%     95.24%
Neural Network          99.66%     99.63%     99.73%    99.68%     99.65%
4.12 Comparison of Random Forest with Other Models
Based on the performance metrics and analysis, we can draw detailed comparisons between
the Random Forest classifier and the other models: Decision Tree, Logistic Regression, Neu-
ral Network, and Linear Regression. This comparison highlights the strengths and weak-
nesses of each model, guiding the choice for deployment in real-world applications.
Random Forest vs. Decision Tree
Random Forest demonstrates higher accuracy, precision, recall, F1-score, and AUC-ROC
compared to Decision Tree. This is because Random Forest benefits from ensemble learning,
which reduces overfitting and improves generalization by combining the results of multiple
decision trees. On the other hand, while Decision Tree may show slightly lower performance,
it is simpler and faster to train. This simplicity makes Decision Trees preferable in situations
where interpretability and speed are more critical than achieving a slight improvement in
performance.
Random Forest vs. Logistic Regression
Metric        Random Forest   Logistic Regression
Accuracy          99.87%            95.32%
Precision         99.81%            94.92%
Recall            99.94%            96.39%
F1-Score          99.87%            95.65%
AUC-ROC           99.86%            95.24%
Random Forest vs. Neural Network
The Random Forest algorithm is slightly better in all metrics compared to the Neural
Network, and it requires less computational power and training time. On the other hand, a
Neural Network is capable of learning very complex patterns and interactions in data. With
extensive tuning and sufficient computational resources, it has the potential to outperform
Random Forest.
4.12.1 Summary
The Random Forest classifier consistently outperforms other models across all metrics. It is
particularly strong in handling complex data patterns and providing robust, high-accuracy
predictions. The Decision Tree offers simplicity and speed, Logistic Regression provides
interpretability for simpler problems, and Neural Networks can learn complex patterns
given more computational resources; Linear Regression, by contrast, is generally not
suitable for classification tasks.
The Area under the Curve (AUC) for the Receiver Operating Characteristic (ROC) curve
is a critical metric for evaluating the performance of classification models. A higher AUC
value indicates a better-performing model, capable of distinguishing between the classes more
effectively. Below is a detailed comparison of the AUC-ROC values for the models analyzed:

Model                  AUC-ROC
Random Forest           99.86%
Decision Tree           99.77%
Logistic Regression     95.24%
Neural Network          99.65%
Linear Regression       99.77%
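The per-model ROC curves in the figures below are computed from predicted probabilities rather than hard labels; a sketch on synthetic data (the dataset and estimator here stand in for the thesis setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the preprocessed NSL-KDD set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# roc_curve needs scores, not labels: use the probability of class 1.
prob = clf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, prob)
print(f"AUC = {auc(fpr, tpr):.3f}")
```

Plotting `fpr` against `tpr` (for example with matplotlib) produces curves of the kind shown in Figures 14 and 18.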
Figure 14: ROC Curve - Random Forest
The Decision Tree model achieved an AUC-ROC score of 99.77%, which, while slightly
lower than that of the Random Forest model, still indicates strong performance. Despite
this, the Decision Tree is prone to overfitting, which can negatively impact its ability to
generalize well to new data, unlike the more robust Random Forest model.
The AUC-ROC value for the Logistic Regression model is 95.24%. While this is a high
value, it is significantly lower than that of tree-based models. This suggests that Logistic
Regression may struggle with capturing non-linear relationships and interactions within the
data. Logistic Regression is more suitable for problems where the data is linearly separable,
as its performance diminishes in the presence of complex, non-linear patterns.
Figure 18: ROC Curves for Random Forest, Decision Tree, Logistic Regression, and Neural Network
Based on AUC-ROC performance, the Random Forest model emerges as the top choice for
deployment due to its superior ability to handle complex patterns and minimize overfitting,
ensuring reliable and robust classification results. The Neural Network model serves as a
strong alternative, particularly suited for applications demanding the learning of intricate
data patterns, assuming adequate computational resources are available for effective training
and tuning. The Decision Tree model, while offering simplicity and quick training times,
may fall short in generalization compared to Random Forest, making it a viable option when
computational resources are constrained. Logistic Regression remains suitable for scenarios
where model interpretability is critical and the data is linearly separable. However, Linear
Regression should be avoided for classification tasks despite its unusually high AUC-ROC
value, as it is inherently unsuited for such purposes. Overall, Random Forest is recommended
as the primary model for deployment due to its superior performance and robustness.
In evaluating models for detecting potential attacks, we assessed the performance of Random
Forest, Decision Tree, Logistic Regression, Neural Network, and Linear Regression using the
NSL-KDD dataset. The Random Forest model outperformed all others, achieving an accuracy
of 99.90%, precision of 99.88%, recall of 99.93%, F1-score of 99.90%, and an AUC-ROC of
99.89%. The Decision Tree also performed well, with slightly lower metrics. The Neural
Network demonstrated excellent performance, with an accuracy of 99.56%. Logistic Regression,
while effective, was less optimal, and Linear Regression proved unsuitable for classification
tasks. To further assess our approach, we compared it with existing methods from references
[66, 67, 68], which utilized the NSL-KDD dataset. The accuracy values for these methods
were: CFS and chi-squared feature selection at 92.13%, k-means with information gain at
89.60%, and Wrapper-Bayesnet-based supervised feature selection at 95.30%. Our Random
Forest model significantly outperformed these methods with an accuracy of 99.90%.
Method                                        Accuracy (%)
CFS and chi-squared feature selection            92.13
k-means with information gain                    89.60
Wrapper-Bayesnet-based feature selection         95.30
Our Approach (Random Forest)                     99.90
The superior performance of the Random Forest model underscores its effectiveness and
reliability in detecting network anomalies and potential attacks. Based on these results,
Random Forest is recommended for this classification task. The Neural Network is also
a strong alternative, while Logistic Regression remains viable for simpler implementations.
Linear Regression should be avoided for classification tasks due to its unsuitability.
5 Conclusion
The models were implemented using Scikit-learn, a comprehensive library
for ML applications in Python. The performance of the ML techniques was validated through
two experimental setups: initially training models with default parameters and testing on
a separate test set, and subsequently applying a 5-fold cross-validation with grid search to
optimize hyperparameters.
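The second setup (5-fold cross-validation with grid search) can be sketched as follows. The parameter grid here is a small hypothetical one chosen so the example runs quickly; the best parameters actually found are shown in Appendix B:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the thesis tunes on the preprocessed
# NSL-KDD training set instead.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A small hypothetical grid of Random Forest hyperparameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}

# cv=5 performs the 5-fold cross-validation described above.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```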
The process began with preprocessing the NSL-KDD dataset from the University of
New Brunswick, followed by feature selection. The selected features were used to train
and test four different ML models: Decision Tree (DT), Random Forest (RF), Logistic
Regression (LR), and Neural Network (NN). The models were implemented using Python and
evaluated based on accuracy, precision, recall, F1-score, and AUC-ROC metrics. Among the
evaluated models, Random Forest outperformed others with an accuracy of 99.90%, precision
of 99.88%, recall of 99.93%, F1-score of 99.90%, and AUC-ROC of 99.89%. The Decision Tree
and Neural Network also showed excellent performance, while Logistic Regression was slightly
less effective. Linear Regression proved unsuitable for classification tasks. When compared
with existing methods, such as those using CFS and chi-squared feature selection (92.13%
accuracy), k-means with information gain (89.60% accuracy), and Wrapper-Bayesnet-based
supervised feature selection (95.30% accuracy), our Random Forest model demonstrated
superior accuracy, highlighting its effectiveness and reliability in detecting network anomalies
and potential U2R attacks.
While the current study has demonstrated the effectiveness of various ML techniques,
particularly Random Forest, in network traffic classification, several areas warrant further
exploration. Future research could explore advanced feature selection methods to further im-
prove model performance and reduce computational complexity. Implementing these models
in real-time network environments would test their practical applicability and robustness.
Addressing class imbalance in the dataset through techniques like Synthetic Minority Over-
sampling Technique (SMOTE) could enhance model performance. Combining multiple ML
techniques or integrating ML with other technologies, such as deep learning, could yield
more robust and accurate IDSs. Evaluating the resilience of models against adversarial
attacks would help in understanding their robustness in real-world scenarios. Testing the
scalability of these models on larger datasets and more complex network environments would
ensure their effectiveness in diverse settings. Overall, the development and optimization of
ML-based IDSs remain a dynamic and crucial area of research, essential for safeguarding
network security amidst evolving threats.
5.1 Contribution
The contributions of this research work are significant in the field of network intrusion de-
tection. Firstly, the study achieved an enhancement in detection accuracy compared to pre-
vious works. The Random Forest model developed in this research demonstrated superior
performance, achieving an accuracy of 99.90%, which is notably higher than the accura-
cies reported in existing methods such as CFS and chi-squared feature selection (92.13%),
k-means with information gain (89.60%), and Wrapper-Bayesnet-based supervised feature
selection (95.30%).
Secondly, this research improved detection performance by effectively removing irrele-
vant features from the dataset. By applying rigorous preprocessing and feature selection
techniques, the study ensured that only the most relevant features were used for training
and testing the ML models. This not only enhanced the models’ accuracy but also reduced
computational complexity, making the intrusion detection system more efficient and reliable.
Overall, this research provides valuable contributions by significantly advancing the ac-
curacy and efficiency of network intrusion detection systems through the application of ad-
vanced ML techniques and meticulous feature selection.
5.2 Future Work
This research identifies several areas for enhancing the system’s performance and function-
ality:
• Testing on Diverse Datasets: Evaluating the system with various datasets to assess
its versatility and robustness.
• Multi-machine Implementation: Extending the system’s deployment across mul-
tiple machines to improve scalability.
• Advanced Feature Selection: Combining our approach with other feature selection
techniques to explore and implement more effective methods.
References
[1] Y. S. Zheng, “The research on the applications and trends of the information commu-
nication technology in the information society,” Applied Mechanics and Materials, vol.
321, pp. 2760–2763, 2013.
[2] S. Kabanda, M. Tanner, and C. Kent, “Exploring sme cybersecurity practices in de-
veloping countries,” Journal of Organizational Computing and Electronic Commerce,
vol. 28, no. 3, pp. 269–282, 2018.
[3] J. A. Lewis, Assessing the risks of cyber terrorism, cyber war and other cyber threats.
Center for Strategic & International Studies Washington, DC, 2002.
[5] S. De, “Security threat analysis and prevention towards attack strategies,” in Cyber
Defense Mechanisms. CRC Press, 2020, pp. 1–22.
[6] A. Sheth, S. Bhosale, F. Kurupkar, and A. Prof, “Research paper on cyber security,”
Contemporary Research in India (ISSN 2231-2137), Special Issue, April 2021.
[10] A. O. Alzahrani and M. J. Alenazi, “Designing a network intrusion detection system
based on machine learning for software defined networks,” Future Internet, vol. 13, no. 5,
p. 111, 2021.
[11] K. Coulibaly, “An overview of intrusion detection and prevention systems,” arXiv
preprint arXiv:2004.08967, 2020.
[12] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A
comprehensive review,” Journal of Network and Computer Applications, vol. 36, no. 1,
pp. 16–24, 2013.
[14] K. Coulibaly, “An overview of intrusion detection and prevention systems,” arXiv
preprint arXiv:2004.08967, 2020.
[15] J. Ali, “Intrusion detection systems trends to counteract growing cyber-attacks on cyber-
physical systems,” in 2021 22nd International Arab Conference on Information Tech-
nology (ACIT). IEEE, 2021, pp. 1–6.
[16] S. Prasad, M. Srinath, and M. S. Basha, “Intrusion detection systems, tools and
techniques–an overview,” Indian Journal of Science and Technology, vol. 8, no. 35,
pp. 1–7, 2015.
[19] K.-A. Tait, J. S. Khan, F. Alqahtani, A. A. Shah, F. A. Khan, M. U. Rehman,
W. Boulila, and J. Ahmad, “Intrusion detection using machine learning techniques:
an experimental comparison,” in 2021 International Congress of Advanced Technology
and Engineering (ICOTEN). IEEE, 2021, pp. 1–10.
[21] B. Chen, J. Lee, and A. S. Wu, “Active event correlation in bro ids to detect multi-
stage attacks,” in Fourth IEEE International Workshop on Information Assurance
(IWIA’06). IEEE, 2006, pp. 16–pp.
[22] K. A. Scarfone and P. M. Mell, “Sp 800-94. guide to intrusion detection and prevention
systems (idps),” 2007.
[23] B. Pranggono, K. McLaughlin, Y. Yang, and S. Sezer, “Intrusion detection systems for
critical infrastructure,” in The state of the art in intrusion prevention and detection.
CRC Press Boca Raton, FL, USA, 2014, pp. 115–138.
[28] H. Zuhair, M. Salleh, and A. Selamat, “Hybrid features-based prediction for novel phish
websites,” Jurnal Teknologi, vol. 78, no. 12-3, 2016.
[30] E. M. Maseno, Z. Wang, and H. Xing, “A systematic review on hybrid intrusion de-
tection system,” Security and Communication Networks, vol. 2022, no. 1, p. 9663052,
2022.
[31] I. Cvitić, D. Peraković, M. Periša, and A. D. Jurcut, “Methodology for detecting cy-
ber intrusions in e-learning systems during covid-19 pandemic,” Mobile networks and
applications, vol. 28, no. 1, pp. 231–242, 2023.
[34] A. Nisioti, A. Mylonas, P. D. Yoo, and V. Katos, “From intrusion detection to attacker
attribution: A comprehensive survey of unsupervised methods,” IEEE Communications
Surveys & Tutorials, vol. 20, no. 4, pp. 3369–3388, 2018.
[35] M. Zaman and C.-H. Lung, “Evaluation of machine learning techniques for network
intrusion detection,” in NOMS 2018-2018 IEEE/IFIP Network Operations and Man-
agement Symposium. IEEE, 2018, pp. 1–5.
[36] S. Wang, J. F. Balarezo, S. Kandeepan, A. Al-Hourani, K. G. Chavez, and B. Rubin-
stein, “Machine learning in network anomaly detection: A survey,” IEEE Access, vol. 9,
pp. 152 379–152 396, 2021.
[37] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for
cyber security intrusion detection,” IEEE Communications surveys & tutorials, vol. 18,
no. 2, pp. 1153–1176, 2015.
[39] J. Miguéns and J. Mendes, “Travel and tourism: Into a complex network,” Physica A:
Statistical Mechanics and its Applications, vol. 387, no. 12, pp. 2963–2971, 2008.
[40] S. Senthilnathan, “Network analysis: Part 1–an introductory note,” Available at SSRN
2143480, 2012.
[41] C. Raets, C. El Aisati, M. De Ridder, and K. Barbé, “Radiomics for the prediction of
the regression grade of colorectal cancer: a challenging classification problem,” in 2022
IEEE International Symposium on Medical Measurements and Applications (MeMeA).
IEEE, 2022, pp. 1–6.
[42] S. Zahi and B. Achchab, “Modeling car loan prepayment using supervised machine
learning,” Procedia Computer Science, vol. 170, pp. 1128–1133, 2020.
[45] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive un-
certainty estimation using deep ensembles,” Advances in neural information processing
systems, vol. 30, 2017.
[46] J. Berezutskaya, A.-L. Saive, K. Jerbi, and M. v. Gerven, “How does artificial in-
telligence contribute to ieeg research?” in Intracranial EEG: A Guide for Cognitive
Neuroscientists. Springer, 2023, pp. 761–802.
[47] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new
perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35,
no. 8, pp. 1798–1828, 2013.
[48] A. Jović, K. Brkić, and N. Bogunović, “A review of feature selection methods with
applications,” in 2015 38th international convention on information and communication
technology, electronics and microelectronics (MIPRO). IEEE, 2015, pp. 1200–1205.
[49] J. Cai, J. Luo, S. Wang, and S. Yang, “Feature selection in machine learning: A new
perspective,” Neurocomputing, vol. 300, pp. 70–79, 2018.
[50] M. Alalhareth and S.-C. Hong, “An improved mutual information feature selection
technique for intrusion detection systems in the internet of medical things,” Sensors,
vol. 23, no. 10, p. 4971, 2023.
[52] K. Indira and U. Sakthi, “A hybrid intrusion detection system for sdwsn using random
forest (rf) machine learning approach,” International Journal of Advanced Computer
Science and Applications, vol. 11, no. 2, 2020.
[54] G. Varoquaux and O. Colliot, “Evaluating machine learning models and their diagnostic
value,” Machine learning for brain disorders, pp. 601–630, 2023.
[55] T. T. Nguyen and K. Dinh, “Prediction of bridge deck condition rating based on artificial
neural networks,” Journal of Science and Technology in Civil Engineering (JSTCE)-
HUCE, vol. 13, no. 3, pp. 15–25, 2019.
[57] R. Yacouby and D. Axman, “Probabilistic extension of precision, recall, and f1 score for
more thorough evaluation of classification models,” in Proceedings of the first workshop
on evaluation and comparison of NLP systems, 2020, pp. 79–91.
[58] T. Zoppi, A. Ceccarelli, and A. Bondavalli, “Into the unknown: Unsupervised machine
learning algorithms for anomaly-based intrusion detection,” in 2020 50th Annual IEEE-
IFIP International Conference on Dependable Systems and Networks-Supplemental Vol-
ume (DSN-S). IEEE, 2020, pp. 81–81.
[59] J.-F. Mas, “Receiver operating characteristic (roc) analysis,” Geomatic approaches for
modeling land change scenarios, pp. 465–467, 2018.
[62] N. Moustafa, J. Hu, and J. Slay, “A holistic review of network anomaly detection
systems: A comprehensive survey,” Journal of Network and Computer Applications,
vol. 128, pp. 33–55, 2019.
[63] V. Kshirsagar and M. S. Joshi, “Rule based classifier models for intrusion detection
system,” Int. J. Comput. Sci. Inf. Technol, vol. 7, no. 1, pp. 367–370, 2016.
[64] K. S. Wutyi and M. M. S. Thwin, “Heuristic rules for attack detection charged by
nsl kdd dataset,” in Genetic and Evolutionary Computing: Proceedings of the Ninth
International Conference on Genetic and Evolutionary Computing, August 26-28, 2015,
Yangon, Myanmar-Volume 1. Springer, 2016, pp. 137–153.
[65] A. Karami, “An anomaly-based intrusion detection system in presence of benign outliers
with visualization capabilities,” Expert Systems with Applications, vol. 108, pp. 36–60,
2018.
[67] G. P. Gupta and M. Kulariya, “A framework for fast and efficient cyber security network
intrusion detection using apache spark,” Procedia Computer Science, vol. 93, pp. 824–
831, 2016.
[68] X. Wang, “On advances in deep learning with applications in financial market model-
ing,” Ph.D. dissertation, Auburn University, 2020.
[70] X. Dastile, T. Celik, and M. Potsane, “Statistical and machine learning models in credit
scoring: A systematic literature survey,” Applied Soft Computing, vol. 91, p. 106263,
2020.
[72] B. Ingre, A. Yadav, and A. K. Soni, “Decision tree based intrusion detection system for
nsl-kdd dataset,” in Information and Communication Technology for Intelligent Systems
(ICTIS 2017)-Volume 2 2. Springer, 2018, pp. 207–218.
[73] G. Kurdi, “Toward an electronic resource for systematic reviews in computer science,”
2022.
[79] J. R. Blair, C. M. Chewar, R. K. Raj, and E. Sobiesk, “Infusing principles and practices
for secure computing throughout an undergraduate computer science curriculum,” in
Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer
Science Education, 2020, pp. 82–88.
[81] A. Panigrahi and M. R. Patra, “An ann approach for network intrusion detection us-
ing entropy based feature selection,” International Journal of Network Security & Its
Applications (IJNSA), vol. 7, no. 3, pp. 15–29, 2015.
Appendices
Appendix A Sample Code for Data Preprocessing
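The preprocessing code itself appears below as screenshots. As a readable companion, a minimal sketch of the steps described in Chapter 4 (encoding categorical fields, binarizing the label, scaling) might look like the following; the column names follow the NSL-KDD schema, but the values and exact steps are illustrative, not the thesis implementation:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Tiny stand-in for NSL-KDD records; values are made up.
df = pd.DataFrame({
    "duration":      [0, 2, 0, 1],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service":       ["http", "domain_u", "http", "ecr_i"],
    "label":         ["normal", "normal", "neptune", "smurf"],
})

# Encode categorical columns as integers.
for col in ["protocol_type", "service"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Collapse the attack labels to a binary normal/attack target.
df["label"] = (df["label"] != "normal").astype(int)

# Scale the numeric features to the [0, 1] range.
features = df.drop(columns="label")
X = MinMaxScaler().fit_transform(features)
y = df["label"].to_numpy()
print(X.shape, y.tolist())
```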
[Screenshots of the sample code: training the models, generating predictions, and computing the performance metrics.]
Appendix B Screenshot of Hyperparameter Tuning
[Screenshots: best parameters for the Neural Network and Random Forest models.]
Figure 27: Best parameters for RandomForest