Network Intrusion Detection System using Machine Learning and Optimization Technique for Feature Selection

Abstract

The optimization of feature selection is pivotal for bolstering the efficiency and efficacy of
machine learning models, aiming to pinpoint the most informative attributes within datasets.
This project delved into assessing the effectiveness of Particle Swarm Optimization (PSO) and
Hill Climbing algorithms in the realm of feature selection for classification tasks. The primary
goal was to pinpoint an optimal subset of features that maximizes overall classification
performance. The initial phase encompassed dataset preprocessing and the identification of a
subset of features for analysis. Subsequently, the project implemented the PSO algorithm,
iteratively selecting feature subsets by leveraging cognitive and social parameters of particles.
Concurrently, Hill Climbing was applied to further refine the selected features, ultimately
enhancing classification accuracy.
Through iterative optimization, the PSO algorithm converged to multiple potential feature
subsets, each evaluated based on key performance metrics such as accuracy, precision, recall,
and F1-score. The culmination of this process resulted in the identification of a superior
subset comprising 33 features, showcasing noteworthy performance across various
evaluation metrics. Utilizing these selected features, a classification model was trained,
achieving an impressive accuracy of 99.51% on the testing dataset. The model demonstrated
robustness in predicting class labels across diverse categories, exhibiting high accuracy and
precision for the majority of classes.
This project underscores the prowess of PSO and Hill Climbing algorithms in optimizing
feature selection for classification tasks, presenting substantial enhancements in classification
accuracy while mitigating computational overhead. The findings establish a resilient
framework for feature subset selection, holding implications for elevating the performance of
machine learning models across diverse domains.

Keywords: Network intrusion detection, Machine learning, Optimization technique, Particle Swarm Optimization, Hill climbing

Contents
1 Introduction
1.1 Significance of the Project
1.2 Scope of Project
2 Literature Survey
3 Problem Life Cycle
3.1 Problem Identification
3.2 Problem Selection
3.3 Problem Definition
3.3.1 Objectives
3.4 Problem Analysis
3.5 End Users
4 Proposed System
4.1 Utilization of CICIDS Dataset:
4.1.1 Dataset Combination Steps
4.1.2 Data Preprocessing Steps
4.2 Initial Model Evaluation without Feature Selection (Implementation of Random Forest Without Feature Selection on Training and Testing Data)
4.2.1 Data Preparation:
4.2.2 Random Forest Classifier:
4.2.3 Model Evaluation:
4.3 Particle Swarm Optimization (PSO) as Global Optimization:
4.3.1 Initialized PSO Parameters
4.3.2 Executed Main PSO Loop:
4.3.3 Retrieved Global Best Position:
4.4 Hill Climbing Optimization as Local Optimization:
4.4.1 Initialized Hill Climbing Parameters:
4.4.2 Executed Hill Climbing Loop:
4.4.3 Integration:
4.5 Implementation of Random Forest After Feature Selection
4.5.1 Data Preparation:
4.5.2 Feature Selection:
4.5.3 Splitting Data:
4.5.4 Random Forest Classifier:
4.5.5 Model Training:
4.5.6 Prediction:
5 Results and Discussion
5.1 Machine Configuration and Parameter Setting:
5.1.1 Machine Configuration:
5.1.2 Parameters Setting for PSO and Hill Climbing:
5.1.3 Parameters Setting for Random Forest Classifier:
5.2 Manual Feature Selection vs. Global-Only Approach vs. Global-Local Approach
5.2.1 Optimization Techniques Integration
5.2.2 Comparative Analysis of Feature Selection Strategies
5.2.3 Algorithmic Evaluation
5.2.4 Predictions on Training Data:
5.2.5 Testing Data Evaluation:
5.2.6 Visualization of Confusion Matrix:
5.2.7 Attack Identification and Classification
6 Conclusion & Future Work
7 References

List of Figures
3.1 Fish Bone Diagram
4.1 Random Forest Classification Flowchart
4.2 Particle Swarm Optimization Flowchart
4.3 Hill Climbing Optimization Flowchart
4.4 Model Architecture Diagram
5.1 Confusion Matrix

Chapter 1
Introduction
In today's hyperconnected world, securing communication across public and private
networks is of paramount importance. Virtual Private Networks (VPNs) play a crucial role in
safeguarding sensitive data, enabling individuals and organizations to transmit information
securely over the internet. By creating an encrypted tunnel between the user and the
destination network, VPNs provide both privacy and security, making them essential in
scenarios where data integrity and confidentiality are critical. VPNs are widely used in
various settings, including remote work environments, business networks, and geographically
dispersed teams, ensuring that data is protected from eavesdropping or tampering while in
transit.

Despite their widespread use, VPNs face growing challenges as network complexity increases
and cybersecurity threats evolve. The need for VPN optimization is driven by factors such as
network congestion, suboptimal routing, and the increased demand for low-latency
connections in real-time applications. Optimizing VPN routing is essential to improve
efficiency, reduce latency, and ensure smooth communication without compromising security.
Furthermore, with the rise of cyberattacks targeting VPN infrastructures, it has become
equally important to implement stronger security measures alongside optimization efforts.

Traditional VPN routing techniques often rely on static or heuristic algorithms, which may not
be sufficient to cope with the dynamic nature of modern networks. This is where
reinforcement learning (RL) provides a promising alternative. RL, a subfield of machine
learning, allows systems to learn optimal routing strategies through continuous interaction
with the environment. Unlike conventional methods, RL can adapt to changes in network
topology, traffic conditions, and potential security threats, making it a more resilient and
efficient approach for VPN routing optimization.

In our research, we focus on applying four advanced RL algorithms, namely Deep Q-Learning (DQL),
Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), and Q-Learning with
Function Approximation (QLFA), to optimize VPN routing. These algorithms have
demonstrated strong potential in dynamic decision-making environments, and we aim to
leverage their strengths to enhance VPN routing performance. Each algorithm brings unique
advantages in handling large state-action spaces, balancing exploration and exploitation, and
optimizing real-time decision-making.

Furthermore, we study the protocols that VPNs build on: TCP/IP as the underlying transport,
and SSL and IPsec as the security protocols that establish secure communication. SSL and IPsec
provide the encryption, data integrity, and authentication mechanisms of a VPN.
However, as network security requirements grow more complex, reinforcement learning-
based routing can complement these protocols by intelligently selecting optimal paths that
also reduce vulnerability to attacks.

This paper explores the intersection of VPN optimization and reinforcement learning,
aiming to provide a comprehensive solution that improves both the performance and security
of VPN systems. By evaluating and comparing these four RL algorithms, we aim to determine
the best approach for dynamic and secure VPN routing in modern network environments.

1.1 Significance of the Project


The project’s significance lies in its innovative approach to optimize VPN performance and
security by integrating reinforcement learning (RL) algorithms with traditional VPN
methodologies. By comparing four advanced RL algorithms—Deep Q-Learning, Proximal
Policy Optimization, Advantage Actor-Critic, and Q-Learning with Function Approximation—
this research enhances routing efficiency, reduces latency, and adapts to dynamic network
conditions. Additionally, the study addresses critical security aspects by incorporating robust
protocols like TCP/IP, SSL, and IPsec, ensuring data integrity and protection against emerging
threats. This comprehensive approach contributes to the advancement of adaptive and secure
VPN systems, meeting the demands of modern network environments.

1.2 Scope of Project


The scope of this project encompasses a comprehensive exploration and implementation of
VPN optimization and security enhancement using advanced reinforcement learning (RL)
methodologies. The primary focus lies in developing and evaluating adaptive VPN routing
strategies that optimize performance and strengthen security against evolving cyber threats.
The project involves the comparison of four RL algorithms—Deep Q-Learning, Proximal
Policy Optimization, Advantage Actor-Critic, and Q-Learning with Function Approximation—
to determine their effectiveness in reducing latency, improving routing efficiency, and
adapting to dynamic network environments. Additionally, the scope includes an in-depth
study of critical VPN security protocols such as TCP/IP, SSL, and IPsec to ensure robust
encryption, authentication, and data integrity.
By integrating RL-based optimization with traditional security measures, this project aims to
provide a scalable, secure, and efficient solution for modern VPN systems, addressing the
growing demands of today’s complex network infrastructures.

Chapter 2

Literature Survey
In enterprise networks, Layer 3 MPLS VPNs are widely used to provide seamless connectivity
between geographically distributed sites. The most popular VPLS connectivity model is a
direct any-to-any model, which substantially increases the memory demand for routing tables
on provider edge (PE) routers. To solve this issue, a "Relaying" approach was
proposed in which routing information is stored at specific PE routers (hubs) and the
other PE routers (spokes) rely on the hubs for indirect reachability, leading to a memory-
optimized solution. This approach is extended to a multi-VPN environment, taking into
account shared resource constraints such as bandwidth and memory across multiple
VPNs. The proposed algorithms minimize the memory used by routers while maintaining
reliability and cost-efficiency, reducing memory requirements by up to 85% with minimal
impact on latency and network performance. [1]

The Virtual Path (VP) concept used in ATM networks has led to an effective transport
mechanism, but bandwidth utilization remains a significant challenge. Traditional research
on VP management often assumed unlimited bandwidth, a premise increasingly outdated due
to advancements in technology and high-bandwidth applications. This is especially relevant
for bandwidth-constrained networks like wireless and high-demand wired networks, where
user access speeds can quickly saturate even Gigabit links. To address these challenges, recent
studies have focused on algorithms that optimize VP routes by considering VP terminators
and capacity demands while minimizing congestion on individual links. The proposed
solutions, such as efficient VP routing algorithms with provable performance guarantees,
demonstrate the potential for near-optimal VP allocation, paving the way for enhanced
bandwidth efficiency and reduced congestion in modern network infrastructures.[2]

Routing algorithms play a pivotal role in optimizing network performance, particularly in
distributed systems. Previous research has explored various approaches to achieve faster
convergence and stability, yet challenges such as loop formation and inefficiencies in
computation persist. To address these, a distributed optimal one-level routing algorithm
based on Newton's method has been proposed. By employing variable reduction techniques,
the algorithm achieves a diagonal Hessian matrix, significantly enhancing computational
efficiency and accuracy. Comparative studies highlight its superior convergence rate, precise
results, and improved transient behavior over earlier methods. Additionally, the algorithm
demonstrates critical properties such as stability, robustness, and loop-free operation, making
it a reliable solution for dynamic network environments. Its distributed nature also ensures
scalability, which is essential for modern network infrastructures.[3]
Routing policies are crucial in managing traffic flow across the Internet, ensuring the
commercial viability of networks. However, these policies often lead to inefficiencies and fail
to fully leverage the network's topology. Traditional approaches typically select routes based
on individual packet paths that adhere to specific routing policies, such as valley-free routing.
In contrast, the proposed work introduces a novel approach that applies policies at an
aggregate traffic level, avoiding the need for individual packets to strictly follow policy-
compliant paths. This method enhances network connectivity and capacity without violating
the core motivations behind the routing policies. The paper also presents polynomial-time
algorithms for solving otherwise NP-hard problems, such as determining the maximum
policy-observing routing capacity between two sets of Autonomous Systems (ASes),
minimizing cuts that separate policy-observing paths, and maximizing disjoint policy-
observing paths. This approach provides a more efficient and scalable solution to the complex
challenges of policy-constrained routing in modern networks.[4]
The VPN security gateway plays a critical role in providing authentication, confidentiality, and
key management for secure communications. Traditional methods for handling security
policies and key exchanges have often faced challenges in terms of efficiency and
performance. To address these limitations, recent research has focused on optimizing VPN
gateway performance from two key aspects: the security policy database (SPD) configuration
and key exchange mechanisms. The proposed solution applies machine learning techniques,
specifically the ID3 decision tree, to optimize SPD configurations, enhancing decision-making
processes for inbound and outbound packet handling. Additionally, elliptic curve
cryptography (ECC) is employed to optimize key exchange procedures, offering equivalent
security with smaller key sizes compared to other public-key systems. These optimizations
significantly improve the efficiency of VPN security gateways, making them more effective in
handling the increasing demand for secure network communications.[5]

Virtual private networks (VPNs) are essential for securely extending private networks across
public infrastructures, ensuring confidentiality and integrity through encryption and
authentication. However, setting up and managing VPN connections often comes with
significant overhead, including encryption/decryption latency and the complexity of VPN
software installations on both endpoints and private network hosts. To optimize VPN
connections, recent research has proposed a system that improves VPN efficiency by
introducing a routing apparatus within a public network that handles connections from
clients and gateways in a private network. This system encrypts and authenticates packets at
both the client and gateway ends using shared secrets, facilitating secure transmission with
reduced latency. Notably, the system eliminates the need for traditional three-way
handshakes and bypasses checksum verification, reducing the setup time for VPN
connections. The system also optimizes data transmission by adjusting the receive window
and maximum transmission unit (MTU) for each connection, further enhancing performance.
Additionally, by enabling direct connections between clients on the same network, it reduces
reliance on the routing apparatus, optimizing network resources and minimizing latency. This
approach significantly simplifies VPN deployment and improves connection efficiency,
addressing common performance bottlenecks in traditional VPN configurations.[8]
Software-Defined Networking (SDN) has revolutionized network management by separating
the control plane from the data plane, offering dynamic, real-time control and optimization of
network routing. However, SDN faces significant security challenges, particularly from
Distributed Denial of Service (DDoS) attacks, which exploit its centralized control and flow-
table limitations. To address these vulnerabilities, recent advancements have focused on
integrating machine learning techniques to optimize SDN routing while enhancing security. A
novel approach, Trust-Based Proximal Policy Optimization (TBPPO), has been introduced to
improve multi-path routing in SDN by incorporating a trust value mechanism based on
Kullback-Leibler divergence and a node diversity assessment. TBPPO not only mitigates
issues like congestion and network fluctuations but also strengthens SDN defences against
DDoS attacks. Additionally, the method employs an enhanced Depth-First Search (DFS)
algorithm to pre-compute optimal path sets, and an improved Proximal Policy Optimization
(PPO) algorithm to refine multi-path routing, balancing security, network delay, and
variations in path delays. Experimental results demonstrate that TBPPO outperforms
traditional routing methods, achieving a 20% reduction in average delay and a 50% reduction
in delay variation, marking a significant advancement in SDN security and routing efficiency.
[9]
Software-Defined Networking (SDN) offers a flexible and programmable network
architecture, providing centralized control and real-time network optimization. However,
SDN faces security challenges, particularly from Distributed Denial of Service (DDoS) attacks,
which exploit its centralized nature and flow-table limitations. To address these issues, recent
research has focused on enhancing SDN routing efficiency and security through deep
reinforcement learning (DRL). One promising approach is the Trust-Based Proximal Policy
Optimization (TBPPO) algorithm, which integrates a trust value mechanism using Kullback-
Leibler divergence and a node diversity assessment to improve network robustness, reduce
congestion, and mitigate DDoS attacks. This algorithm utilizes an enhanced Depth-First
Search (DFS) for path selection, avoiding routing loops and optimizing multi-path routing by
considering security, network delay, and delay variations. The TBPPO algorithm,
incorporating an improved Proximal Policy Optimization (PPO) model, addresses the
limitations of traditional routing algorithms by enhancing security, stability, and routing
efficiency. Experimental results show that TBPPO outperforms traditional methods, reducing
network delays and enhancing convergence, making it a practical solution for optimizing SDN
performance in dynamic and security-sensitive environments.[10]
VPNs use encrypted tunnels to protect sensitive online transactions, such as banking and stock
trading. Traditional VPN security protocols, including Secure Sockets Layer/Transport Layer
Security (SSL/TLS), are susceptible to attacks because they depend heavily on user
credentials and session establishment. Although Elliptic Curve
Cryptography (ECC) and multilayer authentication systems enhance security, they cannot
completely eliminate risks or protect against unauthorized access.
To overcome these shortcomings, a new framework called SeDIC (Secure On-Demand IP
Based Connection) has been introduced. It enhances security by maintaining forward secrecy,
so that keys used in authentication are valid for only one session and cannot be reused,
which prevents replay attacks. It demonstrates that secure internet applications can be
implemented with lower cryptographic entropy, which offers both security and efficiency for
online transactions. It addresses vulnerabilities inherent in existing methods, ensuring that
only authorized parties have access to sensitive information.[11]
Challenges:

The literature highlights significant advancements in the domain of Intrusion Detection
Systems (IDS), primarily focusing on the integration of machine learning and optimization techniques.
Researchers have extensively explored the effectiveness of various machine learning
algorithms such as artificial neural networks, decision trees, and ensemble methods like
Random Forests for network intrusion detection. Additionally, feature selection and
extraction techniques, including CfsSubset Attribute Evaluator and bio-inspired optimization,
have been investigated to enhance IDS efficiency. Studies have emphasized the importance of
long-term performance evaluation, considering factors like overfitting and the evolving
nature of cyber threats. Integration of machine learning with IDS has shown promise in
efficiently analyzing large volumes of data, improving accuracy, and enhancing detection
capabilities. Nevertheless, certain gaps and areas for further exploration exist.
Limitations of the current studies include potential bias in training data, generalization
challenges across diverse network environments, and the need for standardized evaluation
metrics. Addressing these limitations will contribute to the development of more reliable and
widely applicable intrusion detection solutions. Future research endeavors should aim to
bridge these gaps and advance the understanding and practical implementation of IDS in
evolving cybersecurity landscapes.
To advance the current state of research on Network Intrusion Detection Systems (IDS) that
use machine learning, future studies could pursue several directions. Developing a robust IDS
with machine learning algorithms underscores their role in strengthening network security.
The exploration of Particle Swarm Optimization (PSO) and Hill Climbing Optimization for
enhancing intrusion detection capabilities holds substantial promise. Investigating various
feature selection methods to optimize the feature space efficiently, and identifying distinct
cyber-attacks within network traffic, emerge as critical objectives. Feature reduction
techniques should also be assessed for their impact on detection accuracy and computational
efficiency. Furthermore, an in-depth comparative analysis of machine learning
algorithms, particularly on the comprehensive CICIDS dataset, would help pinpoint the most
effective algorithms for intrusion detection. Dynamic threat adaptation, real-time
detection, explainability and interpretability, scalability, cross-domain applications, and
robustness against adversarial attacks are additional noteworthy directions. Together, these
suggestions advocate a comprehensive approach to future IDS research, aimed at improving
effectiveness, efficiency, and adaptability in network security.

Chapter 3

Problem Life Cycle

3.1 Problem Identification


The initial problem identification phase centered on recognizing the ever-evolving landscape
of cybersecurity threats and the inherent constraints within traditional intrusion detection
systems[1]. Conventional rule-based mechanisms faced challenges in adapting to the dynamic
nature of emerging cyber threats, highlighting the critical need for a more adaptable and
responsive approach to ensure precise and timely detection of intrusions[5]. This recognition
underscored the necessity for a dynamic and adaptable intrusion detection system that could
effectively tackle the evolving nature of cyber threats for enhanced accuracy and swift
response against intrusions[6].

3.2 Problem Selection


Among the diverse network security challenges, the problem selected was the
creation of an effective Intrusion Detection System (IDS) employing machine learning[7]. This
selection primarily aimed at refining detection accuracy while mitigating false positives and
false negatives in discerning multiple cyber attack types within network traffic. The emphasis
was on crafting an IDS that could adeptly differentiate intrusion attempts from regular
network behavior, ensuring precise identification of threats while minimizing the chances of
overlooking or misidentifying potential security breaches[11].

3.3 Problem Definition


The problem centers on creating an adaptive Network Intrusion Detection System (IDS) using
machine learning to overcome rule-based system limitations. Objectives include accurate
attack identification, adapting to evolving threats, evaluating Particle Swarm and Hill
Climbing Optimizations, comparing feature selection methods, classifying 14 attack types,
assessing feature reduction’s impact on efficiency, and comparing algorithms (Random
Forest, Decision Tree, SVM, Naive Bayes) for optimal detection[13]. This aims to build an
efficient IDS capable of dynamic threat detection while optimizing system performance
through advanced machine learning and optimization techniques[14].

3.3.1 Objectives

1. To develop a robust Network Intrusion Detection System (IDS) using machine learning
algorithms.

2. To evaluate Particle Swarm Optimization (PSO) and Hill Climbing Optimization for
enhancing intrusion detection capabilities.

3. To compare various feature selection methods to optimize the feature space efficiently.

4. To identify and categorize 14 distinct cyber-attacks within network traffic to bolster
security measures.

5. To assess the impact of feature reduction techniques on detection accuracy and computational
efficiency.

6. To perform a comprehensive comparative analysis of machine learning algorithms to
identify the most effective for intrusion detection.

3.4 Problem Analysis


The analysis phase involved a deep dive into existing intrusion detection methodologies,
examining their strengths and limitations. Evaluating the CICIDS dataset and understanding
its intricacies was crucial for formulating the problem statement. Additionally, scrutinizing
different optimization techniques, feature selection strategies, and machine learning
algorithms facilitated a structured approach to problem-solving[11].
Figure 3.1: Fish Bone Diagram

3.5 End Users


The end users of this proposed model are the cybersecurity researchers and professionals
vested in enhancing network security measures. The IDS developed through this project aims
to provide a valuable tool for researchers to analyze, optimize, and fortify network security
against an array of cyber threats[1][2]. This model serves as a foundation for advancing
intrusion detection techniques and contributes to the ongoing evolution of cybersecurity
measures.

Chapter 4

Proposed System
The primary goal of this project is to develop a robust Network Intrusion Detection System
(IDS) empowered by machine learning techniques, specifically tailored to enhance network
security. The CICIDS dataset[16], an extensive and diverse collection of network
traffic instances, forms the foundation for training, validating, and testing the IDS model.

4.1 Utilization of CICIDS Dataset:


The CICIDS dataset, comprising various simulated cyber-attacks and normal network
behaviors, facilitates comprehensive analysis and model training. It allows for the
identification and classification of 14 distinct types of cyber-attacks within network traffic.
Tables 4.1 and 4.2 list parts 1 and 2 of the CICIDS dataset features with their respective
column numbers. These numbers are used to index the dataset and to represent features as
numbers in the PSO and hill climbing implementations. The class names (labels) are also
replaced with the integers 0 to 14, one per class, as shown in Table 4.3.

4.1.1 Dataset Combination Steps

Importing Libraries and Data:

Initially, import necessary libraries like Pandas and NumPy, followed by reading multiple CSV
files representing different days or sessions of network traffic data (data1, data2, ..., data8).

Combining Datasets:
Concatenate the individual datasets (data1 to data8) into larger datasets representing
different time frames or sessions (dataset1, dataset2, dataset3, dataset4).

Merging Combined Datasets:


Merge the intermediate datasets (dataset1 to dataset4) into a final comprehensive dataset
(FinalDataset) using the pd.concat() function.

Exporting the Final Dataset:


Save the FinalDataset to a CSV file (’processed data.csv’) using the to_csv() function.
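
The combination steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the helper name combine_frames, the two small in-memory frames standing in for the per-day CSV files (data1 to data8), and the commented-out output path are all assumptions made for the example.

```python
import pandas as pd


def combine_frames(frames):
    """Concatenate per-session DataFrames row-wise into one dataset."""
    # ignore_index=True renumbers the rows so that indices coming from
    # separate files do not collide in the combined dataset.
    return pd.concat(frames, ignore_index=True)


# Two tiny frames stand in for the per-day CSV files (hypothetical data).
data1 = pd.DataFrame({"Flow Duration": [10, 20], "Label": ["BENIGN", "DDoS"]})
data2 = pd.DataFrame({"Flow Duration": [30], "Label": ["PortScan"]})

final_dataset = combine_frames([data1, data2])
# final_dataset.to_csv("combined.csv", index=False)  # hypothetical output path
```

Concatenating all files in a single pd.concat() call gives the same rows as merging intermediate datasets first; the two-stage merge described above mainly helps when inspecting each time frame separately.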

4.1.2 Data Preprocessing Steps

Saving Processed Data:

Save the preprocessed dataset (’processed data afterencoding.csv’) to facilitate further analysis.

Loading the Dataset:


Read the preprocessed dataset file (’preprocessed data without outliers.csv’) using Pandas
and explore the dataset’s basic structure using head(), shape, and info() functions.

Encoding Labels:
Encode categorical labels (e.g., ’Label’) into numerical format using a dictionary mapping and
assign encoded values to a new column (’Labels’).

Handling Missing Values:


Check for missing values using isnull().sum() and handle them by dropping rows or columns
using dropna() function based on the data’s context.

Removing Duplicate Entries:


Use the drop_duplicates() function to remove duplicate rows in the dataset.

Handling Zero-Valued Columns:


Identify columns with all-zero values, using .all() or the summary statistics from describe(),
and drop those columns from the dataset.

Renaming Labels:
Strip leading and trailing spaces from the labels and replace non-printable characters using the
replace() and rename() functions, so that the names match Table 4.1, Table 4.2, and Table 4.3.
Add a new column to distinguish between normal traffic and attacks.
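
The cleaning steps described so far (stripping names, encoding labels, handling missing values, removing duplicates, dropping all-zero columns) can be sketched together as below. The function name clean_dataset, the toy frame, the column called "All Zero Feature", and the two-entry label mapping are illustrative assumptions; the full mapping is given in Table 4.3.

```python
import pandas as pd


def clean_dataset(df, label_map):
    """Apply the basic cleaning steps to a raw traffic DataFrame."""
    df = df.copy()
    # Strip leading and trailing spaces from the column names.
    df.columns = df.columns.str.strip()
    # Encode the textual class label into a new numeric column.
    df["Labels"] = df["Label"].map(label_map)
    # Drop rows with missing values, then exact duplicate rows.
    df = df.dropna().drop_duplicates()
    # Drop feature columns that contain only zeros (label columns excluded).
    features = df.drop(columns=["Label", "Labels"]).select_dtypes("number")
    zero_cols = [c for c in features.columns if (features[c] == 0).all()]
    return df.drop(columns=zero_cols)


# Hypothetical toy data: a padded column name, a missing value,
# a duplicate row, and one all-zero feature.
raw = pd.DataFrame({
    " Flow Duration": [5.0, 5.0, None, 7.0],
    "All Zero Feature": [0, 0, 0, 0],
    " Label": ["BENIGN", "BENIGN", "DDoS", "DDoS"],
})
cleaned = clean_dataset(raw, {"BENIGN": 0, "DDoS": 3})
```

Dropping rows with dropna() is only one of the options mentioned above; whether rows or columns should be dropped depends on where the missing values are concentrated.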

Visualizing Data Distribution:


Visualize the data distribution using count plots and histograms for understanding the balance
between different classes.

Correlation Analysis:
Use heatmap (sns.heatmap()) and correlation matrix (corr()) to visualize the correlation between
features or attributes in the dataset.

Handling Outliers:
Identify outliers using statistical methods like mean and standard deviation, then remove outliers
based on certain thresholds or conditions.

Balancing the Dataset:


Balance the dataset by undersampling majority classes or oversampling minority classes to achieve a
more uniform class distribution.
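
A minimal sketch of the outlier-removal and balancing steps follows. The report does not state the exact thresholds, so the three-standard-deviation cutoff (k=3.0), the per-class cap used for undersampling, and both helper names are assumptions chosen for illustration.

```python
import pandas as pd


def remove_outliers(df, columns, k=3.0):
    """Keep rows whose values lie within k standard deviations of the mean."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        if std > 0:  # constant columns cannot contain outliers
            keep &= (df[col] - mean).abs() <= k * std
    return df[keep]


def undersample(df, label_col, max_per_class, seed=42):
    """Randomly cap every class at max_per_class rows."""
    parts = [
        group.sample(n=min(len(group), max_per_class), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, ignore_index=True)


# Hypothetical toy data: 19 ordinary flows and one extreme value.
toy = pd.DataFrame({
    "Flow Duration": [1.0] * 19 + [1000.0],
    "Label": ["BENIGN"] * 14 + ["DDoS"] * 5 + ["BENIGN"],
})
filtered = remove_outliers(toy, ["Flow Duration"])
balanced = undersample(filtered, "Label", max_per_class=5)
```

Undersampling discards majority-class rows, which keeps training fast; oversampling the minority classes is the alternative mentioned above and keeps all original data at the cost of repeated rows.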

Saving Cleaned Dataset:


Save the cleaned dataset to a new CSV file (’Final.csv’) using the to_csv() function for further analysis or
model development.

4.2 Initial Model Evaluation without Feature Selection

(Implementation of Random Forest Without Feature Selection on Training and Testing Data)
Begin by establishing a baseline performance of the IDS model using the CICIDS dataset
without any feature selection techniques applied. Evaluate the model’s accuracy, precision,
recall, and other relevant metrics to set a benchmark for comparison.

Figure 4.1: Random Forest Classification Flowchart

4.2.1 Data Preparation:

• The code starts by importing the necessary libraries and mounting Google Drive.

• It reads the dataset from a CSV file named preprocessed data without outliers.csv.

• It drops the column named "traffic type" from the dataset.

• It splits the dataset into features (X) and the target variable (y).

• It divides the data into training (X_train, y_train) and testing (X_test, y_test) sets using an 80-20 split
ratio.

4.2.2 Random Forest Classifier:

• Instantiates a Random Forest Classifier with the following parameters:

n_estimators=100: Number of trees in the forest.

max_depth=10: Maximum depth of the trees.

min_samples_split=2: Minimum number of samples required to split an internal node.

min_samples_leaf=1: Minimum number of samples required to be at a leaf node.

random_state=42: Seed for random number generation.

• Fits the Random Forest model to the training data (X_train, y_train).

4.2.3 Model Evaluation:

• Predicts the labels for both the training and testing sets using the trained Random Forest model.

• Prints various evaluation metrics for the training and testing data:

Accuracy Score: Ratio of correctly predicted instances.

Precision (Micro): Metric indicating the accuracy of positive predictions.

Recall: Metric indicating the coverage of positive instances.

F1 Score: Harmonic mean of precision and recall.

Confusion Matrix: Matrix showing the counts of true positive, true negative, false positive, and
false negative predictions.
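
The baseline described in Sections 4.2.1 to 4.2.3 can be sketched with scikit-learn as follows. Synthetic data stands in for the CICIDS features, the seed passed to train_test_split is an assumption (the report only fixes the classifier seed), and average="weighted" follows the description of the metrics in Section 5.2.4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 flows, 5 features, 2 classes (hypothetical).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80-20 split; the split seed is an assumption made for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Classifier parameters as listed in Section 4.2.2.
rf_classifier = RandomForestClassifier(
    n_estimators=100, max_depth=10, min_samples_split=2,
    min_samples_leaf=1, random_state=42)
rf_classifier.fit(X_train, y_train)

y_test_pred = rf_classifier.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_test_pred),
    "precision": precision_score(y_test, y_test_pred,
                                 average="weighted", zero_division=0),
    "recall": recall_score(y_test, y_test_pred,
                           average="weighted", zero_division=0),
    "f1": f1_score(y_test, y_test_pred, average="weighted", zero_division=0),
}
cm = confusion_matrix(y_test, y_test_pred)  # rows: true class, columns: predicted
```

The same calls applied to X_train and y_train give the training-set metrics; a large gap between the two sets of numbers would indicate overfitting.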

4.3 Particle Swarm Optimization (PSO) as Global Optimization:

PSO is applied as a global optimization method to fine-tune parameters and optimize the
overall performance of the IDS model. It aids in optimizing the feature space and parameters,
enhancing the system’s ability to detect intrusions effectively.

4.3.1 Initialized PSO Parameters


Set the number of particles, the number of iterations, the cognitive and social parameters (c1, c2),
the inertia weight (w), and the velocity limits. Created random binary solutions as particles, with
random velocities within the defined limits. Set the initial position and objective value of each
particle as its personal best, and selected the best particle as the global best.

4.3.2 Executed Main PSO Loop:


Ran the iterations of the PSO algorithm.

For each iteration:

• Updated particle velocities considering cognitive and social parameters.

• Updated particle positions based on velocities while ensuring they remained within bounds.

• Evaluated objectives and constraints for the new position.

• Updated personal best if the objective improved and constraints were met.

• Updated global best if it improved the objective.

4.3.3 Retrieved Global Best Position:


Retrieved the selected features from the global best position obtained after the PSO iterations.
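
The PSO procedure of Sections 4.3.1 to 4.3.3 can be sketched in plain Python as below. Several details are assumptions, since the report does not specify them: the inertia weight w=0.7, the rule that a feature counts as selected when its position is at least 0.5, and the toy fitness function, which stands in for the Random Forest accuracy used in the project. Constraints are not modelled here.

```python
import random


def to_mask(position):
    """Threshold a position in [0, 1] into a 0/1 feature mask (assumed rule)."""
    return [1 if x >= 0.5 else 0 for x in position]


def binary_pso(fitness, n_features, n_particles=30, n_iterations=20,
               c1=2.0, c2=2.0, w=0.7, v_max=0.2, seed=0):
    rng = random.Random(seed)
    # Random binary start positions and random velocities within the limits.
    positions = [[float(rng.randint(0, 1)) for _ in range(n_features)]
                 for _ in range(n_particles)]
    velocities = [[rng.uniform(-v_max, v_max) for _ in range(n_features)]
                  for _ in range(n_particles)]
    # Each particle's start is its personal best; the best of these is global.
    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(to_mask(p)) for p in positions]
    best = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * velocities[i][d]
                     + c1 * r1 * (pbest[i][d] - positions[i][d])   # cognitive
                     + c2 * r2 * (gbest[d] - positions[i][d]))     # social
                velocities[i][d] = max(-v_max, min(v_max, v))      # clamp
                # Move and keep the position inside the [0, 1] bounds.
                positions[i][d] = max(0.0, min(1.0, positions[i][d]
                                               + velocities[i][d]))
            fit = fitness(to_mask(positions[i]))
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
            if fit > gbest_fit:
                gbest, gbest_fit = positions[i][:], fit
    return to_mask(gbest), gbest_fit


# Hypothetical fitness: reward "informative" features, penalise the rest.
INFORMATIVE = {0, 2, 5, 7}


def toy_fitness(mask):
    return sum(1.0 if i in INFORMATIVE else -0.5
               for i, bit in enumerate(mask) if bit)


best_mask, best_fit = binary_pso(toy_fitness, n_features=10)
```

In the project itself the fitness call would train and score a classifier on the columns selected by the mask, which makes each evaluation far more expensive than this toy function.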

4.4 Hill Climbing Optimization as Local Optimization:


Hill Climbing is employed as a local optimization technique, focusing on refining specific
parameters or components within the IDS. This method aims to iteratively improve local
areas of the solution space to maximize detection accuracy.

4.4.1 Initialized Hill Climbing Parameters:


Set the maximum iterations for hill climbing.

Figure 4.2: Particle Swarm Optimization Flowchart

4.4.2 Executed Hill Climbing Loop:


Performed hill climbing optimization on the selected features obtained from PSO.

• Started with the features selected from PSO.

• For a defined number of iterations:

– Explored neighboring feature subsets.

– Evaluated accuracy for each new subset.

– Updated the selected features if the accuracy improved.

• Returned the optimized selected features obtained after hill climbing.
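
The loop above can be sketched as follows. How a neighboring subset is generated is not stated in the report; flipping one randomly chosen feature in or out of the subset is an assumption, as are the function name hill_climb and the toy fitness function that replaces classifier accuracy.

```python
import random


def hill_climb(start_mask, fitness, max_iterations=10, seed=0):
    """Refine a 0/1 feature mask by accepting only improving neighbors."""
    rng = random.Random(seed)
    current = list(start_mask)
    current_fit = fitness(current)
    for _ in range(max_iterations):
        # Neighbor: the current subset with one feature toggled (assumed move).
        neighbor = current[:]
        flip = rng.randrange(len(neighbor))
        neighbor[flip] = 1 - neighbor[flip]
        neighbor_fit = fitness(neighbor)
        # Keep the neighbor only if it improves the objective.
        if neighbor_fit > current_fit:
            current, current_fit = neighbor, neighbor_fit
    return current, current_fit


# Hypothetical fitness: reward "informative" features, penalise the rest.
INFORMATIVE = {0, 2, 5, 7}


def toy_fitness(mask):
    return sum(1.0 if i in INFORMATIVE else -0.5
               for i, bit in enumerate(mask) if bit)


# The start mask plays the role of the PSO global best position.
start = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
refined_mask, refined_fit = hill_climb(start, toy_fitness)
```

Because only improving moves are accepted, the refined subset is never worse than the PSO output it starts from, which is what makes hill climbing a safe local refinement step.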

Figure 4.3: Hill Climbing Optimization Flowchart

4.4.3 Integration:

• Integrated the PSO loop that optimized feature selection by updating particle positions and
velocities.

• After obtaining the best global position from PSO, passed it to the hill climbing algorithm to further
refine the selected features based on accuracy improvements.

• Finally, the selected features obtained after hill climbing represented the optimized subset for
classification.

4.5 Implementation of Random Forest After Feature Selection

4.5.1 Data Preparation:

• Imported necessary libraries and mounted Google Drive to access the dataset.

• Loaded the dataset preprocessed data without outliers.csv.

• Removed columns: "traffic type", "Unnamed: 0.1", "Unnamed: 0" from the dataset.

Figure 4.4: Model Architecture Diagram

4.5.2 Feature Selection:

• Used preselected indices of features for training.

• Created a new dataset (new_data) containing only the selected features and the target variable
(’Label’).
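
Selecting columns by their index, as described above, might look like the sketch below. The helper name select_features and the five-column toy frame are assumptions; in the project the indices refer to the column numbers of Tables 4.1 and 4.2, and the reported index lists also contain the label column itself (index 70), which the helper therefore skips when collecting features.

```python
import pandas as pd


def select_features(df, feature_indices, label_col="Label"):
    """Return a frame with the indexed feature columns plus the label column."""
    # Translate positions into names; skip the label if its index is listed.
    selected = [df.columns[i] for i in feature_indices
                if df.columns[i] != label_col]
    return df[selected + [label_col]]


# Hypothetical five-column stand-in for the 71-column dataset.
data = pd.DataFrame({
    "Destination Port": [80, 443],
    "Flow Duration": [100, 200],
    "Total Fwd Packets": [3, 7],
    "Total Backward Packets": [2, 5],
    "Label": [0, 3],
})
new_data = select_features(data, [0, 2, 4])
```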

4.5.3 Splitting Data:

• Split the data into training and testing sets using train_test_split from sklearn.model_selection.

• The split ratio used was 80% for training (X_train, y_train) and 20% for testing (X_test, y_test).

4.5.4 Random Forest Classifier:

Initialized a Random Forest classifier (RandomForestClassifier) with specified parameters:

n_estimators: Number of trees in the forest (set to 100).

max_depth: Maximum depth of the trees (set to 10).

min_samples_split: Minimum number of samples required to split an internal node (set to 2).

min_samples_leaf: Minimum number of samples required to be at a leaf node (set to 1).

random_state: Seed for random number generation (set to 42).

4.5.5 Model Training:

• Fitted the Random Forest classifier with the training data (X_train, y_train) using the fit method.

4.5.6 Prediction:

• Predicted the labels for both the training (y_train_pred) and testing (y_test_pred) datasets using
the trained Random Forest classifier (rf_classifier).

Table 4.1: Features with their respective column number (Part 1)

Column  Feature                      Column  Feature                 Column  Feature
0       Destination Port             12      Bwd Packet Length Mean  24      Fwd IAT Min
1       Flow Duration                13      Bwd Packet Length Std   25      Bwd IAT Total
2       Total Fwd Packets            14      Flow Bytes/s            26      Bwd IAT Mean
3       Total Backward Packets       15      Flow Packets/s          27      Bwd IAT Std
4       Total Length of Fwd Packets  16      Flow IAT Mean           28      Bwd IAT Max
5       Total Length of Bwd Packets  17      Flow IAT Std            29      Bwd IAT Min
6       Fwd Packet Length Max        18      Flow IAT Max            30      Fwd PSH Flags
7       Fwd Packet Length Min        19      Flow IAT Min            31      Fwd URG Flags
8       Fwd Packet Length Mean       20      Fwd IAT Total           32      Fwd Header Length
9       Fwd Packet Length Std        21      Fwd IAT Mean            33      Bwd Header Length
10      Bwd Packet Length Max        22      Fwd IAT Std             34      Fwd Packets/s
11      Bwd Packet Length Min        23      Fwd IAT Max             35      Bwd Packets/s

Table 4.2: Features with their respective column number (Part 2)

Column  Feature                 Column  Feature                  Column  Feature
36      Min Packet Length       48      ECE Flag Count           60      act_data_pkt_fwd
37      Max Packet Length       49      Down/Up Ratio            61      min_seg_size_forward
38      Packet Length Mean      50      Average Packet Size      62      Active Mean
39      Packet Length Std       51      Avg Fwd Segment Size     63      Active Std
40      Packet Length Variance  52      Avg Bwd Segment Size     64      Active Max
41      FIN Flag Count          53      Fwd Header Length.1      65      Active Min
42      SYN Flag Count          54      Subflow Fwd Packets      66      Idle Mean
43      RST Flag Count          55      Subflow Fwd Bytes        67      Idle Std
44      PSH Flag Count          56      Subflow Bwd Packets      68      Idle Max
45      ACK Flag Count          57      Subflow Bwd Bytes        69      Idle Min
46      URG Flag Count          58      Init_Win_bytes_forward   70      Label
47      CWE Flag Count          59      Init_Win_bytes_backward


Table 4.3: Label to Category Mapping


Label Category
0 BENIGN
1 BOT
2 Brute Force
3 DDoS
4 DoS GoldenEye
5 DoS Hulk
6 DoS Slowhttptest
7 DoS slowloris
8 FTP-Patator
9 Heartbleed
10 Infiltration
11 PortScan
12 SSH-Patator
13 Sql Injection
14 XSS
Table 4.4: Initial Parameter Setting for PSO
Parameter Name Variable Value
Cognitive Component C1 2.0
Social Component C2 2.0
No. of Particles n 30
No. of Iterations N 20
Fitness Function - Prediction Accuracy
Chapter 5

Results and Discussion

5.1 Machine Configuration and parameter Setting :

5.1.1 Machine Configuration:


Table 5.1: Machine Specifications
Component      Specification
Processor      1 th Gen Intel(R) Core(TM) [email protected] GHz
Installed RAM  24.0 GB (23.8 GB usable)
System Type    64-bit operating system, x64-based processor
Software       Google Colab for algorithm implementation and result visualization

5.1.2 Parameters Setting For PSO and hill climbing:


Table 5.2 summarizes the optimization setup involving the Particle Swarm Optimization (PSO)
and Hill Climbing algorithms. PSO is run with a population of 30 particles and with the
cognitive and social parameters set to c1 = 2.0 and c2 = 2.0, respectively. Fitness is evaluated
using accuracy, precision, recall, and F1-score. PSO terminates after 10 iterations and returns
the global best position. Hill Climbing then refines this solution over 10 iterations, starting
from the solution derived from PSO. The whole procedure is repeated with variations in the
number of particles, the number of iterations, and the cognitive and social parameters, in
order to find the solution that maximizes accuracy while adhering to the predefined
constraints.

Table 5.2: Statistical Data
Statistical Information              Values
Incorporating Optimization Methods   Particle Swarm Optimization (PSO) and Hill Climbing
PSO Parameters                       Population: 30 particles; c1 = 2.0 (cognitive parameter); c2 = 2.0 (social parameter)
Fitness Metrics                      Accuracy, Precision, Recall, F1-score
PSO Termination Criteria             10 iterations
PSO Output                           Global best position
Hill Climbing Initialization         Initial solution from PSO output
Hill Climbing Iterations             10 iterations
PSO and Hill Climbing Repetition     Multiple iterations with parameter variations

5.1.3 Parameters Setting for Random Forest Classifier:


The Random Forest algorithm is evaluated with the following parameter values: 100 trees in the
forest, a maximum tree depth of 10, a minimum of 2 samples required to split an internal
node, a minimum of 1 sample required at a leaf node, and a random seed of 42.

Table 5.3: Random Forest Algorithm Parameters


Parameter Value
Number of Trees 100
Maximum Depth of Trees 10
Minimum Samples to Split an Internal Node 2
Minimum Samples at a Leaf Node 1
Random Seed 42

5.2 Manual Feature Selection vs. Global-Only Approach vs. Global-Local Approach
Table 5.4 presents an analysis of the impact of the feature reduction techniques on detection accuracy
and computational efficiency. The assessment compares results obtained on the dataset with and
without feature reduction, aiming to strike a balance between a reduced feature set and optimal
system performance.

• Comparative analysis is conducted among different feature selection methodologies.

• Manual selection involves expert domain knowledge to select features.

• Global-only approach uses PSO for feature selection.

• Global+Local approach combines PSO for initial feature selection and Hill Climbing for further
refinement.

• The objective is to determine the most efficient method for reducing feature space while
maintaining or improving detection accuracy.

Table 5.4: Comparison of different features

Row  Optimization Technique    No. of Particles  No. of Selected Features  Accuracy    Precision   Recall Rate  F1-Score
1    PSO + RF                  30                32                        0.98615146  0.98692678  0.98615146   0.98622595
2    PSO + RF                  50                31                        0.97856543  0.98055173  0.97856543   0.97877378
3    PSO + RF                  30                31                        0.98728775  0.98725538  0.98728775   0.98728775
4    PSO + Hill Climbing + RF  50                33                        0.99535652  0.99536316  0.99535652   0.99471716

Selected features for each row (column numbers as in Tables 4.1 and 4.2):
1: [0, 2, 4, 8, 9, 13, 14, 15, 16, 17, 19, 20, 21, 23, 27, 31, 35, 39, 41, 42, 43, 44, 45, 46, 50, 53, 54, 55, 56, 58, 59, 61, 62, 70]
2: [0, 3, 4, 5, 6, 9, 12, 13, 19, 21, 23, 24, 27, 29, 32, 33, 35, 37, 38, 40, 44, 45, 46, 47, 51, 52, 56, 57, 58, 60, 70]
3: [0, 2, 4, 5, 7, 11, 19, 20, 21, 22, 23, 27, 29, 30, 31, 32, 35, 36, 37, 41, 46, 47, 48, 49, 52, 53, 54, 62, 63, 70]
4: [0, 2, 4, 8, 9, 12, 17, 19, 21, 25, 26, 28, 31, 32, 33, 36, 38, 42, 43, 47, 48, 50, 51, 53, 54, 55, 56, 57, 58, 59, 61, 62, 70]

Table 5.5: Precision, Recall, F1-Score, and Support for each class without feature selection.
Class Precision Recall F1-Score Support
0 0.99 1.00 1.00 298481
1 1.00 0.25 0.40 283
2 1.00 0.05 0.09 301
3 1.00 1.00 1.00 25567
4 1.00 0.87 0.93 1637
5 1.00 0.98 0.99 22686
6 0.94 0.94 0.94 955
7 0.99 0.89 0.94 515
8 1.00 1.00 1.00 1203
10 1.00 0.10 0.18 10
11 0.99 1.00 0.99 17137
12 1.00 0.91 0.95 647
13 0.00 0.00 0.00 4
14 1.00 0.01 0.02 125
Table 5.6: Precision, Recall, F1-Score, and Support for each class after feature selection.
Class Precision Recall F1-Score Support
0 1.00 1.00 1.00 298481
1 1.00 0.43 0.61 283
2 1.00 0.05 0.09 301
3 1.00 1.00 1.00 25567
4 1.00 0.89 0.94 1637
5 1.00 0.98 0.99 22686
6 0.99 0.94 0.97 955
7 0.99 0.91 0.95 515
8 1.00 1.00 1.00 1203
10 1.00 0.10 0.18 10
11 0.99 0.99 0.99 17137
12 1.00 0.92 0.96 647
13 0.00 0.00 0.00 4
14 1.00 0.01 0.02 125
5.2.1 Optimization Techniques Integration
The Particle Swarm Optimization algorithm is initialized with a population of 30 particles,
each representing a candidate feature subset. These particles are randomly distributed within
the search space, exploring possible combinations of features. The initialization includes
setting the cognitive and social parameters. The fitness function evaluates
the performance of the classification model using metrics such as accuracy, precision, recall,
and F1-score.
Each particle continually tracks its personal best position, the position that led to its highest
fitness value. Additionally, a global best position is tracked, representing the best solution found
by any particle in the entire swarm. The algorithm iteratively updates particle velocities and
positions, aiming to balance exploration and exploitation. The termination criterion is set at
20 iterations.
The final output of the PSO algorithm is the global best position, which serves as the initial
solution for the subsequent Hill Climbing algorithm. In Hill Climbing, the objective function is
evaluated at the current solution, and iterative improvements are made by generating
neighboring solutions through small perturbations. The process repeats for 10 iterations. If
a newly selected solution fails to improve the objective function compared to the current
solution, the algorithm terminates.
These steps are repeated multiple times by varying the number of particles, the number of
iterations, and the cognitive and social parameter values. The goal is to achieve the best
solution that maximizes accuracy while adhering to promising constraint values, as
summarized in the statistical data of Table 5.2.

Table 5.7: Comparison of Metrics with and without Feature Selection.
Metric Without Feature Selection (%) After Feature Selection (%)
Accuracy 99.5072399 99.5356527
Precision 99.5081940 99.5363161
Recall Rate 99.5072399 99.5356527
F1-Score 99.4339597 99.4717165

In more detail, the velocity update of each particle is influenced by two terms: the cognitive
component, which is the difference between the particle’s personal best and its current
position, and the social component, which is the difference between the swarm’s global best
and the particle’s current position. The maximum velocity for particle updates is set to 0.2 and
the minimum velocity to -0.2. The updated velocity and position are calculated from these
components, aiming to strike a balance between exploration and exploitation. The algorithm
iteratively updates the positions and velocities until the termination criterion is met; the
termination condition is reaching 10 iterations. In Hill Climbing, the objective function is
evaluated for each neighboring solution, and the neighbor that improves the objective function
the most is selected. The search moves to the selected neighbor, and the process repeats until
10 iterations are reached.
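
For reference, the velocity and position updates described in this section can be written in the standard PSO form below. This is a sketch under stated assumptions rather than the exact formula used in the project: the random factors r_1 and r_2 (drawn uniformly from [0, 1]) and the symbols p_{i,d} (personal best) and g_d (global best) follow the textbook formulation, while c_1 = c_2 = 2.0 and the velocity limits of ±0.2 are the values reported above.

```latex
\begin{aligned}
v_{i,d}^{t+1} &= w\,v_{i,d}^{t}
  + c_1 r_1 \left(p_{i,d} - x_{i,d}^{t}\right)
  + c_2 r_2 \left(g_{d} - x_{i,d}^{t}\right), \\
v_{i,d}^{t+1} &\leftarrow \min\!\left(0.2,\ \max\!\left(-0.2,\ v_{i,d}^{t+1}\right)\right), \\
x_{i,d}^{t+1} &= x_{i,d}^{t} + v_{i,d}^{t+1}.
\end{aligned}
```

Here i indexes the particle, d the feature dimension, and t the iteration; the second term is the cognitive component and the third the social component.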

5.2.2 Comparative Analysis of Feature Selection Strategies


Table 5.5 shows the results obtained without feature selection, and Table 5.6 shows the results
after feature selection, using the same machine learning algorithm. The study investigated and
compared manual feature selection, a global-only approach, and a combined global+local
optimization for efficient feature space reduction without compromising detection accuracy.
Particle Swarm Optimization (PSO) was integrated as the global optimization method and Hill
Climbing as the local optimization technique within the IDS to enhance detection accuracy.
The comparison of manual feature selection, global-only feature selection, and the
combined global+local optimization approach identified PSO + Hill Climbing as the most
efficient method for reducing the feature space without compromising detection accuracy. The
comparison table displays the performance metrics of various particles derived from the
Particle Swarm Optimization (PSO) algorithm. Metrics such as accuracy, precision, recall rate,
and F1-score were evaluated for each particle to determine their effectiveness in feature
selection. Among the particles examined, the optimal subset of features was identified. This
optimal subset consists of the 33 selected features indexed as [0, 2, 4, 8, 9, 12, 17, 19,
21, 25, 26, 28, 31, 32, 33, 36, 38, 42, 43, 47, 48, 50, 51, 53, 54, 55, 56, 57, 58, 59, 61, 62, 70].
These indices yielded the most promising performance metrics compared to the other
particles. The study utilized various performance metrics, including Accuracy,
F1-score, Precision, Recall, and combined multi-class metrics, to assess the effectiveness of the
multi-class classification model. The focus was on evaluating the impact of dimensionality
reduction from 71 to 33 features using a combination of Particle Swarm Optimization (PSO)
and Random Forest (RF) on the CICIDS2017 dataset. The results are summarized in Table
5.4, with additional details presented in Tables 5.5, 5.6, and 5.7.
Unlike traditional feature selection methods that produce a subset of precisely identified
features, the PSO + Hill Climbing technique generated new feature patterns with reduced
dimensions. A detailed summary of the proposed framework in terms of Precision,
Recall, F1-Score, and Support is tabulated in Tables 5.5 and 5.6. Table 5.5 shows the results with
71 features (before applying feature selection), while Table 5.6 shows the results using 33
features (after applying feature selection).
The findings indicate that the proposed framework achieved a maximum precision value of
0.995363 and a recall rate of 0.995356 after reducing the feature dimensionality. These
results confirm the efficiency and effectiveness of the intrusion detection process. Notably,
the classifier struggled with instances of Sql Injection, for which
Recall and Precision were both 0.000. This may be attributed to the small
number of instances (11 each) of HeartBleed and Sql Injection originally present in the CICIDS2017
dataset.
The classifier may have misclassified these instances due to this inherent imbalance.

5.2.3 Algorithmic Evaluation


Table 5.7 compares the results obtained without feature selection with the results obtained
after feature selection. The Random Forest classifier, chosen for its
superior detection performance, comes with a notable overhead in terms of the time required
to build and test the model. The Random Forest algorithm combines multiple decision trees
into a single model, and in this study, with a dataset containing over 17 million instances, the
computational demands are substantial. Despite this computational cost, the Random Forest
classifier has demonstrated effectiveness in intrusion detection based on the presented
results. The Random Forest algorithm is evaluated with the parameter values given in Table 5.3.
A comparative analysis was conducted among Random Forest (RF), Decision Tree
Classifier (DTC), Support Vector Machine (SVM), and Naïve Bayes algorithms to determine
the most effective algorithm in terms of accuracy, precision, recall, confusion matrix metrics,
F1-score, and computational time. This analysis established RF as the most proficient
algorithm for intrusion detection, accurately detecting intrusions while
maintaining computational efficiency.

5.2.4 Predictions on Training Data:

Utilized the Random Forest classifier (rf_classifier) to predict labels for the training dataset
(X_train).
Calculate Metrics:
The following metrics are calculated on the training set:

Accuracy: Measure the accuracy of the model’s predictions on the training data using accuracy_score.

Precision (weighted): Compute precision for the multiclass classification problem by taking a weighted
average of the per-class precision scores (precision_score).

Recall: Compute the recall metric for the multiclass classification problem using a weighted average
of the per-class recall scores (recall_score).

F1 Score: Compute the F1 score for the multiclass classification problem using a weighted average of
the per-class F1 scores (f1_score).

Confusion Matrix: Generate a confusion matrix (confusion_matrix) displaying the distribution of true
and predicted classes.
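
To make the weighted averages concrete, the sketch below computes them by hand in plain Python on a small made-up example; the function name weighted_metrics and the label lists are illustrative only. Each class's score is weighted by its share of the true labels (its support), which is why weighted recall always equals accuracy, as the matching accuracy and recall values in this chapter show.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, and F1, plus plain accuracy."""
    total = len(y_true)
    support = Counter(y_true)           # true instances per class
    predicted = Counter(y_pred)         # predicted instances per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = correct[cls]
        p_c = tp / predicted[cls] if predicted[cls] else 0.0
        r_c = tp / support[cls] if support[cls] else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[cls] / total   # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    accuracy = sum(correct.values()) / total
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical three-class example.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
scores = weighted_metrics(y_true, y_pred)
```

Because the weights follow class frequency, a rare class with zero recall barely moves these averages, so per-class tables such as Tables 5.5 and 5.6 remain necessary for judging minority classes.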

5.2.5 Testing Data Evaluation:

Predictions on Testing Data:


Utilized the Random Forest classifier (rf_classifier) to predict labels for the testing dataset
(X_test). The same metrics are calculated on the testing set, as defined above.

5.2.6 Visualization of Confusion Matrix:

Confusion Matrix Visualization:


The confusion matrix is visualized using the seaborn and matplotlib.pyplot libraries together
with confusion_matrix. The confusion matrix is calculated from the model’s
predictions on the testing dataset and displayed as a heatmap (sns.heatmap) to illustrate the
distribution of true and predicted classes. This heatmap provides insight into the model’s
performance in classifying the different classes, as shown in Figure 5.1. The proposed solution
encompasses a comprehensive methodology integrating optimization techniques, feature
selection strategies, attack identification, and algorithmic comparisons. This approach aims to
develop an advanced IDS capable of effectively identifying and mitigating diverse cyber threats
within network traffic while optimizing detection accuracy and computational efficiency.

Figure 5.1: Confusion Matrix

5.2.7 Attack Identification and Classification


The IDS is trained to classify 14 different types of cyber-attacks present in the CICIDS dataset.
This categorization enables the system to distinguish between normal network behaviors and
various intrusion attempts accurately.
The project’s scope is focused on designing an adaptive IDS that optimally detects intrusions,
addresses feature reduction challenges, and identifies the most proficient machine learning
algorithm for robust intrusion detection within network traffic. This scope sets the
groundwork for enhancing network security measures and contributes valuable insights to
the field of cybersecurity.
Chapter 6

Conclusion & Future Work


In conclusion, the evaluation of the model’s performance, utilizing selected features and
rigorous testing metrics, underscores its capabilities on previously unseen data.
The accuracy of approximately 99.54% shows that the model predicts
the majority of instances correctly. Precision, gauged through the ’Micro’
average, is about 99.54%, indicating that relevant instances are identified accurately across
the various classes. The model’s recall score of
approximately 99.54% further emphasizes its ability to capture the majority of
instances for each class, showing a high level of sensitivity. The F1 Score, a balanced
measure of precision and recall, is around 99.47%. This analysis, rooted in comprehensive
testing metrics, affirms the model’s consistent and accurate predictions across diverse classes
and scenarios.
However, the classification report revealed that there were some classes with limited samples
where the model’s performance was impacted. Classes HeartBleed and Sql Injection had very
few predicted samples, resulting in precision, recall, and F1-score of 0 for these classes. These
classes might require further data or specific strategies to improve predictions, as the model
couldn’t effectively predict them due to insufficient instances in the testing set. The selected
subset of features from the dataset, the best-performing subset of 33 features,
contributed to a high-performing model with high accuracy, precision, recall, and F1-score
across most classes. However, there’s a need for further investigation and potentially
more data for classes with limited samples to enhance the model’s predictive capability for
those specific classes.


Future Work:
Future work can examine the classes with insufficient samples more closely and explore data
augmentation or additional data collection to improve the model's performance on them. Further
experiments may evaluate other classification techniques or models that explicitly address
class imbalance, with the aim of improving predictions for minority classes. Hyperparameter
tuning and ensemble methods could also be explored to further optimize the model and improve
its generalization across all classes present in the dataset.
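As one possible starting point for addressing class imbalance, the sketch below computes inverse-frequency class weights, so that errors on rare classes cost more during training. This is an illustration, not the method used in this project. The weight formula n_samples / (n_classes * class_count) is the common "balanced" heuristic, and the label names and counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical, heavily imbalanced training labels.
labels = ["BENIGN"] * 900 + ["DoS"] * 90 + ["HeartBleed"] * 10
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.3f}")
```

The resulting dictionary could be passed to any classifier that accepts per-class weights. Because every class ends up with the same total weight, a rare class such as HeartBleed is no longer drowned out by the majority class, although weighting cannot substitute for collecting more samples.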
