
Enhancing Intrusion Detection in Real-Time

IoT Devices

MASTER’S THESIS

Submitted by

Nalankilli R
Reg. No. 22MDT1038

in partial fulfillment for the award of the degree of

M.Sc. Data Science

Department of Mathematics
School of Advanced Sciences
Vellore Institute of Technology Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127

May - 2024
DECLARATION

I hereby declare that the thesis entitled Enhancing Intrusion Detection in Real-Time
IoT Devices submitted by me to the Division of Mathematics, School of Advanced Sciences,
Vellore Institute of Technology, Chennai Campus, 600 127 in partial fulfillment of the requirements
for the award of the degree of Master of Science in Data Science is a bona fide record of
the work carried out by me under the supervision of Dr. Saroj Kumar Dash. I further declare
that the work reported in this thesis has not been submitted and will not be submitted, either
in part or in full, for the award of any other degree or diploma of this institute or of any other
institute or University.

Place : Chennai Nalankilli R


Date : Reg. No: 22MDT1038

i
CERTIFICATE

This is to certify that the thesis entitled Enhancing Intrusion Detection in Real-Time
IoT Devices, prepared and submitted by Nalankilli R (Reg. No. 22MDT1038) to Vellore
Institute of Technology, Chennai Campus, in partial fulfillment of the requirements for the award
of the degree of Master of Science in Data Science, is a bona fide record of work carried out
under my guidance. The thesis fulfills the requirements as per the regulations of this University and in
my opinion meets the necessary standards for submission. The contents of this report have not
been submitted and will not be submitted, either in part or in full, for the award of any other
degree or diploma, and the same is certified.

Place : Chennai Signature of Guide


Date : Dr. Saroj Kumar Dash

Signature of HOD
Dr. K. Muthunagai

ii
Acknowledgement

With immense pleasure and a deep sense of gratitude, I wish to express my sincere thanks to
my supervisor Dr. Saroj Kumar Dash, School of Advanced Sciences, Vellore
Institute of Technology (VIT), Chennai. Without his motivation and continuous encouragement,
this project would not have been successfully completed.
I am grateful to the Chancellor of VIT, Dr. G.Viswanathan, the Vice Presidents, the Vice
Chancellor and the Pro Vice Chancellor for motivating me to carry out the project at Vellore
Institute of Technology, Chennai.
I express my sincere thanks to Dr. S. Mahalakshmi, Dean, School of Advanced Sciences,
VIT, Chennai and Dr. K. Muthunagai, HOD, Mathematics and Computing, School of Ad-
vanced Sciences, VIT, Chennai for their support and encouragement.


Place : Chennai Nalankilli R


Date : Reg. No: 22MDT1038

iii
Abstract

The Real-Time Internet of Things 2022 (RT-IoT) dataset is a comprehensive resource that is
carefully crafted to detect intrusions in real-time IoT devices by utilising cutting-edge ma-
chine learning algorithms. It is derived from a real-time IoT infrastructure. This dataset offers
a comprehensive collection of traffic from different IoT devices and advanced network attack
techniques, representing actual IoT security scenarios. A combination of benign and malevolent
network behaviours can be found in its contents. Notable Internet of Things devices such as
ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp are highlighted, along with simulated attack
scenarios such as Brute-Force SSH, DDoS utilising Hping and Slowloris, and Nmap scan
patterns. Using the Flowmeter plugin in conjunction with the Zeek network monitoring tool, the
bidirectional aspects are carefully documented, allowing for a comprehensive and in-depth in-
vestigation of network traffic patterns. This study proposes to use the RT-IoT dataset to
drive improvements in Intrusion Detection Systems (IDS) designed for real-time IoT networks,
promoting the development of robust and flexible security solutions by implementing machine
learning classification projects. This study demonstrates how well boosting and forest-based
machine learning algorithms perform in classifying and differentiating between benign and ma-
licious network traffic in the ever-changing context of IoT security. This is achieved through
a thorough analysis of the RT-IoT dataset. This research endeavor’s main goal is to make sub-
stantial progress in strengthening security protocols in real-time IoT environments, protecting
the integrity and guaranteeing the security of both IoT devices and networks.
Keywords: Machine Learning, classification algorithms, Real-time IoT, Intrusions, Network
traffic analysis

iv
Contents

Declaration i

Certificate ii

Acknowledgement iii

Abstract iv

1 Introduction 1

2 Literature Review 5
2.1 Background Study about IOT based Intrusions . . . . . . . . . . . . . . . . . 5
2.2 SMOTE and Class Imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Machine Learning Approaches and Models . . . . . . . . . . . . . . . . . . . . 8

3 Methodology 10
3.1 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Analysis of Protocol Commands and Intrusions in Data . . . . . . . . . . . . . 12
3.3.1 General Commands for Input . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.2 General Commands for Output . . . . . . . . . . . . . . . . . . . . . . 13
3.3.3 Entering inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.4 Outgoing outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

v
Contents

3.3.5 Classes of the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


3.4 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4.1 Null Values and Outliers . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4.2 Winsorizing for handling Outliers . . . . . . . . . . . . . . . . . . . . 18
3.4.3 Undersampling and SMOTE for Class Imbalance . . . . . . . . . . . . 19
3.4.4 SMOTE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Recursive Feature Elimination (RFE) for Feature selection . . . . . . . . . . . 24
3.6 ML Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6.1 LightGBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6.2 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.4 KNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.6.5 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6.6 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4 Results and Discussions 36


4.1 Class imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.1 Undersampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Oversampling SMOTE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5 Summary and Future Work 40


5.1 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

A Code Used 42
A.1 Data Loading and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.2 SMOTE AND RFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.3 ML model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

vi
Contents

References 52

vii
List of Tables

3.1 Attack Patterns Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


3.2 Normal Patterns Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Class Distribution in the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1 Class imbalance After Undersampling . . . . . . . . . . . . . . . . . . . . . . 37


4.2 Class Imbalance after SMOTE . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Model Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

viii
List of Figures

1.1 IOT Devices Picture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3.1 Proposed workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10


3.2 Data collection process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Distribution of Attack Type classes . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Anomaly Detection in Network Traffic: Analysis of DOS_SYN_HPING and
Thing_speak Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Synthetic Minority Over-sampling Technique . . . . . . . . . . . . . . . . . . 23

ix
Chapter 1

Introduction

Rapid IoT adoption has transformed a number of industries, including manufacturing, health-
care, and agriculture. As a result, network connectivity and data exchange capabilities have
significantly increased. Strong cybersecurity measures are vital in IoT ecosystems, as evidenced
by notable breaches like the 2021 Verkada incident, which exposed the live feeds of roughly
150,000 surveillance cameras, and the concerning case in Florida where an attacker tampered
with the chemical composition of the water supply at a treatment facility.
Our research focuses on creating a novel technique for detecting unusual assaults in IoT
network data in order to address these urgent issues. We use the machine learning technique
for anomaly detection by utilising the extensive Real-Time Internet of Things 2022 (RT-IoT)
dataset, which includes both attack and regular traffic data from various devices like Amazon-
Alexa, MQTT-Temp, ThingSpeak-LED, and Wipro-Bulb. We predict significant reconstruction
error (RE) for detecting aberrant traffic patterns because the model was trained on benign net-
work data.

• MQTT-Temp is an IoT device that communicates with other devices using the lightweight
messaging protocol MQTT (Message Queuing Telemetry Transport). The protocol is well suited
to remote monitoring and control systems since it allows for effective data transmission over
networks with constrained bandwidth.

1
Chapter 1. Introduction

Figure 1.1: IOT Devices Picture

• ThingSpeak is an Internet of Things platform that lets users gather, examine, and present
data from different IoT gadgets. It offers data integration APIs and is frequently utilised
for home automation and environmental monitoring applications.

• The Wipro Bulb is a smart lightbulb that can be operated from a distance with the help of
a smartphone or other linked devices. With capabilities like colour change, scheduling,
and dimming, it can be used with smart lighting solutions.

• Amazon created Alexa, a virtual assistant that can communicate via voice and manage
smart home appliances. It can do things like play music, make alarms, give weather
reports, and use voice commands to operate IoT devices.

Among the many risks that Internet of Things (IoT) devices encounter, Distributed Denial
of Service (DDoS) assaults are particularly dangerous. Webcams and printers are among the
many non-legacy IoT devices that are easily targeted by DDoS attacks, which can result in
the creation of malicious botnets. According to a recent Kaspersky analysis, there was
a significant increase in highly skilled DDoS attacks in 2022, with attack durations reaching
a concerning 3000 minutes. The Secure Shell (SSH) brute-force attack is another common
kind of cyberattack that targets devices that have default passwords that are available via the

2
Chapter 1. Introduction

SSH protocol. The sophistication of these attacks keeps increasing as new IoT botnets like
RapperBot actively increase their capabilities.

• DOS_SYN_Hping: Denial of Service SYN Flood with Hping

• ARP_poisoning: Address Resolution Protocol poisoning

• NMAP_UDP_SCAN: Nmap User Datagram Protocol scan

• NMAP_XMAS_TREE_SCAN: Nmap Xmas Tree scan

• NMAP_OS_DETECTION: Nmap Operating System detection

• NMAP_TCP_scan: Nmap Transmission Control Protocol scan

• DDOS_Slowloris: Distributed Denial of Service Slowloris attack

• Metasploit_Brute_Force_SSH: Metasploit Brute Force Secure Shell

• NMAP_FIN_SCAN: Nmap Finish scan

In addition, we explore the domain of supervised learning classification methods in IDS


(intrusion detection systems) designed for Internet of Things devices. One popular supervised
learning technique is anomaly detection, which helps identify unusual occurrences or depar-
tures from the norm. However, it can be difficult to spot abnormalities in IoT infrastructures
due to the lack of IoT trace data in datasets. In order to address this, we have painstakingly
created network traces using IoT devices that operate in real-time, successfully capturing both
typical and malicious patterns.
Because of the limited physical resources of real-time IoT devices, deploying AI-driven
IDS frameworks poses computational challenges. In order to mitigate this, optimisation tech-
niques that minimise memory and energy consumption—such as network pruning and integer
quantization—come to the fore. Through the removal of unnecessary outliers, the handling of
imbalanced data, and the conversion of non-numerical categorical values to numerical form, we greatly reduce

3
Chapter 1. Introduction

computational complexity, which in turn improves performance and renders ML algorithms


more appropriate for integration in Internet of Things contexts.
In conclusion, the overall goal of our research is to strengthen the security of Internet of
Things devices by developing effective detection and mitigation techniques for assaults. We
attempt to overcome the particular difficulties inherent in real-time IoT ecosystems by carefully
integrating machine learning and optimisation approaches, thereby preserving the integrity and
security of IoT networks.

4
Chapter 2

Literature Review

2.1 Background Study about IOT based Intrusions


The Internet of Things, or IoT, is a new paradigm that seeks to connect all intelligent phys-
ical objects so that consumers can access intelligent services from them. Smart grids, smart
cities, smart homes, smart retail, and more are examples of IoT applications. It is difficult to
link IoT systems to the software/application level in order to extract information from massive
amounts of data because of the heterogeneous hardware and networking technologies utilised
in these systems. In order to characterise IoT systems, this paper surveys their varied designs
and protocols and suggests appropriate taxonomies. In-depth treatment of current research ac-
tivities is provided for each technical difficulty, including security and privacy, interoperability,
scalability, and energy efficiency. The goal of the survey is to assist future researchers in iden-
tifying issues unique to the Internet of Things and in selecting the right technology based on
application requirements. C. C. Sobin (2020) [8]
A safe methodology for identifying and stopping data integrity threats in wireless sen-
sor networks inside microgrids is presented in the proposed study. In order to differentiate
between malicious attacks of different degrees of intensity during regular operations, an in-
telligent anomaly detection technique based on prediction intervals (PIs) is introduced. This
technique establishes ideal PIs for smart metre readings for electric users using a lower and

5
Chapter 2. Literature Review

upper bound estimation strategy. In order to solve neural network instability, it also contains a
combinatorial idea of PIs. Neural network parameters are adjusted using a modified optimisa-
tion technique based on symbiotic organisms search to address the complicated and oscillatory
nature of data from electric users. Practical data from a home microgrid is used to assess the
accuracy and performance of the proposed model, proving its usefulness in identifying and
averting data integrity assaults in wireless sensor networks. Tao (2021) [4]
One of the technologies with the fastest global growth is IoT. It completely transforms how
people, machines, and other technologies communicate. IoT security is a serious issue that
arises with this expansion, though. The necessity for strong network security is highlighted
by the growing number of connected devices. The identification of malicious packets requires
the use of an intrusion detection system, or IDS. These systems can develop intelligent models
for identifying harmful data in Internet of Things (IoT) devices by utilising Machine Learning
(ML) algorithms. In order to detect and forecast abnormalities in a dataset of environmental
attributes gathered from sensors in an IoT environment in Bangladesh, this study contrasts
different machine learning techniques. We assess these methods’ performance and suggest a
machine learning model that has a runtime of less than 0.2 seconds and an accuracy of 96.5
percent. Hasan (2022) [5]

2.2 SMOTE and Class Imbalance


Many strategies effectively address imbalanced datasets at the preprocessing stage in order to
improve classification performance by rebalancing the data. SMOTE, a well-liked oversam-
pling method, alters the training set by adding synthetic minority samples. SMOTE might,
however, produce instances outside of safe zones in overlapping and noisy locations. We sug-
gest SMOTE-BFT, which uses the belief function theory to remove created minority instances
that are outside of safe zones, in order to overcome this problem. Every minority instance that
is formed once SMOTE is applied is represented by an evidentiary membership structure that
offers comprehensive details regarding class memberships. Then, created instances in noisy

6
Chapter 2. Literature Review

and overlapped regions are identified and eliminated using rules based on the belief function
theory. Tests performed on synthetic noisy datasets show that our suggestion performs
noticeably better than other widely used oversampling techniques. Eric (2020) [1]
Educational data mining can lead to the development of useful data-driven applications, like
predictive model-based academic achievement prediction or early warning systems in schools.
However, the accuracy of these predictive models may be compromised by the problem of class
imbalance in educational statistics. This is because many models are constructed under
the presumption that the predicted classes are balanced. Although earlier research has suggested
a number of approaches to deal with this imbalance, the majority of them have concentrated
on the technical aspects of each approach, with very few addressing real-world applications,
particularly for datasets that vary in their degree of imbalance. Using the High School Lon-
gitudinal Study of 2009 dataset, we analyse various sampling strategies to handle moderate
and extreme degrees of class imbalance in this study. Random undersampling (RUS), random
oversampling (ROS), and a hybrid resampling method that combines RUS and the synthetic
minority oversampling technique for nominal and continuous features (SMOTE-NC) are all
included in our comparison studies. Surina(2023)[9]
The class imbalance issue typically arises when conventional classification methods are un-
able to correctly identify infrequent occurrences or outliers that are present in a collection. The
usefulness of such algorithms in precisely detecting and categorising smaller or underrepre-
sented classes is limited because they are usually optimised to perform better with bigger or
more equally distributed classes. Researchers have proactively presented a range of novel con-
cepts, approaches, and modifications to the current categorization systems in order to address
this inherent issue. This paper’s scope includes a critical assessment of the related difficul-
ties, limitations, and gaps that continue to exist in the existing literature, as well as an in-depth
investigation of the dominant research trends that attempt to address the problem of class im-
balance. This study provides insight into the variety of approaches that have been developed
to improve classification model performance in the face of unequal class distributions. It also
highlights the significance of ongoing improvements and modifications to algorithmic design

7
Chapter 2. Literature Review

in order to lessen the negative effects of class imbalances on predictive accuracy and model
performance. Gosain (2022)[3]

2.3 Machine Learning Approaches and Models


The amount of data being generated online is increasing at an exponential rate, making cyber-
security more and more necessary every day. The field of cybersecurity has grown significantly
in prominence in recent years and will do so in the future. The number of hackers and other
bad actors is rising, and they are employing a variety of strategies to obtain sensitive user data.
"Phishing" is among the most prevalent yet distinct security threats. It is distinct because it is a
social engineering assault that preys on human weaknesses rather than system flaws. By click-
ing on fraudulent emails or websites, users unintentionally divulge important information about
themselves, including passwords, credit card numbers, and bank account information. The pur-
pose of this research is to create a tool that can identify and distinguish between secure and
phishing websites, protecting users’ personal information and discouraging them from open-
ing dangerous URLs. The main approaches for classification are MultinomialNB and Linear
Regression, with other strategies including Support Vector Machine, Random Forest, and Ar-
tificial Neural Networks. However, the majority of widely used machine learning algorithms
necessitate extensive data training, which might cause a delay in real-time operations. Conse-
quently, the research’s goal is to develop a real-time operating model. With the use of logistic
regression, the pipelined model was constructed and produced an accuracy of almost 98 percent.
Raja (2023) [2]
Malicious websites infecting consumers’ devices are becoming more and more widespread.
Users frequently fail to read the URLs’ details and unintentionally visit these websites, which
can lead to the theft of personal information. Attackers sometimes imitate authentic URLs,
making it difficult to tell them apart from authentic websites. It is vital to recognise these fraud-
ulent websites as a result. This study assesses the efficacy of tree-based classifiers, XGBoost
and Catboost, in identifying phishing websites. High accuracy is shown by both classifiers,

8
Chapter 2. Literature Review

with XGBoost surpassing Catboost by a small margin. Two datasets are used for the evalu-
ation, which is supported by train-test and k-fold validation. We also contrast XGBoost and
Catboost’s performance with those of other traditional classifiers. Sadaf (2023) [7]
The landscape of digital communication is seriously threatened by malware, which can
cause hostile attacks that disrupt network infrastructure and delete important files. Malware
authors have improved their strategies over the last ten years, making it difficult for conven-
tional detection techniques—like signature-based approaches—to keep up with the changing
world of malware. Because of this, conventional techniques have been unable to successfully
identify novel, sophisticated malware strains. As a result, the demand for a reliable and effi-
cient malware detection system that can accurately identify and detect encrypted and hidden
malware is pressing. A possible method for finding hidden and masked malware is machine
learning. This paper presents a system that employs the Random Forest classifier to detect mal-
ware and focuses on analysing malware detection using different machine learning techniques.
At its best, the detection model achieved an accuracy of around 98.5 percent. Manzoor (2023) [6]

9
Chapter 3

Methodology

3.1 Workflow

Figure 3.1: Proposed workflow

10
Chapter 3. Methodology

3.2 Data Collection


The unique RT-IoT2022 dataset was created from a real-time IoT infrastructure and is intended
to be a comprehensive resource for IoT security researchers and practitioners. The dataset is a
useful resource for researching and comprehending both benign and hostile network behaviours
in IoT contexts since it incorporates a wide variety of IoT devices and advanced network attack
techniques.
The RT-IoT2022 dataset’s ability to incorporate data from a variety of IoT devices, such
as ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp, is one of its main advantages. This vari-
ety makes it possible for researchers to investigate various IoT device kinds and their unique
properties, which is crucial for creating efficient security solutions.

Figure 3.2: Data collection process

The RT-IoT2022 dataset contains real-world IoT device data as well as simulated attack
scenarios. These scenarios include popular attack techniques including Nmap patterns, DDoS
attacks with Hping and Slowloris, and Brute-Force SSH attacks. The inclusion of these sim-
ulated attacks in the dataset offers researchers a thorough understanding of the possible risks
that Internet of Things devices may encounter in practical situations.

11
Chapter 3. Methodology

The Zeek network monitoring tool and the Flowmeter plugin are used by the RT-IoT2022
dataset to record the bidirectional properties of network traffic. With the use of these tech-
nologies, researchers can perform in-depth analysis of network traffic, identifying trends and
abnormalities that might point to a security risk.
All things considered, the RT-IoT2022 dataset provides an extensive and thorough view-
point on the intricate nature of network traffic in IoT contexts. With the use of this information,
researchers can improve the functionality of intrusion detection systems (IDS) and create reli-
able, flexible security solutions for real-time Internet of Things networks.

3.3 Analysis of Protocol Commands and Intrusions in Data


Depending on the particular device and its intended use, input (IP) and output (OP) commands
are commonly used in IoT device communication. Typical input and output command types
seen in Internet of Things devices are as follows:

3.3.1 General Commands for Input

• Sensor Readings: Instructs to retrieve information from sensors, including light, motion,
temperature, and humidity.

• Configuration Settings: These are commands used to set up parameters on the device,
like thresholds, sampling rates, and network characteristics.

• Control signals: Orders to operate actuators or other devices, such as doors and lights, to
open and close, etc.

• User inputs: include button presses, voice commands, and other user-inputted com-
mands.

12
Chapter 3. Methodology

3.3.2 General Commands for Output

• Actuator Control: Instructions to move actuators in response to inputs from the user or
sensor data.

• Data transmission: refers to the process of sending commands to other network devices
or a central server.

• Status Updates: Orders to transmit alarms or status updates, such as low battery life or high
temperature, among other things.

• Feedback: Instructions to give users feedback in the form of error alerts, confirmation
messages, etc. These orders are usually sent via a network utilising protocols for com-
munication like MQTT, CoAP, HTTP, etc., based on the needs and network connectivity
of the device.

3.3.3 Entering inputs

• Sensor Readings: Information on a range of environmental parameters, including tem-


perature, humidity, and light intensity, obtained from Internet of Things devices like
ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp.

• Network traffic data: Information pertaining to fictitious attack scenarios, such as DDoS
attacks employing Hping and Slowloris, Nmap patterns, and Brute-Force SSH attacks, to
mimic hostile network activities.

• Configuration Settings: Details about device settings, network setups, and other aspects
that affect the security and functionality of Internet of Things devices.

13
Chapter 3. Methodology

3.3.4 Outgoing outputs

• Zeek network monitoring outputs: The Zeek network monitoring tool and the Flowmeter
plugin were used to record the bidirectional properties of the network traffic. The results
of the analysis included trends, abnormalities, and other insights.

• Alerts from Intrusion Detection Systems (IDS): IDS systems use the dataset to generate
alerts and notifications that highlight possible security risks and assaults in real-time
Internet of Things networks.

• Security Solutions: The creation and assessment of algorithms and security solutions
with the goal of reducing network attacks and improving the security of IoT settings in
real-time.

These input and output formats are essential for researching and comprehending the in-
tricate structure of network traffic in Internet of Things environments and for creating
security solutions that effectively guard networks and IoT devices from cyberattacks.

3.3.5 Classes of the Data

Attack Pattern Count


DOS_SYN_Hping 94659
ARP_poisioning 7750
NMAP_UDP_SCAN 2590
NMAP_XMAS_TREE_SCAN 2010
NMAP_OS_DETECTION 2000
NMAP_TCP_scan 1002
DDOS_Slowloris 534
Metasploit_Brute_Force_SSH 37
NMAP_FIN_SCAN 28

Table 3.1: Attack Patterns Details

14
Chapter 3. Methodology

Normal Pattern Count


MQTT 8108
Thing_speak 4146
Wipro_bulb_Dataset 253
Amazon-Alexa 86842

Table 3.2: Normal Patterns Details

1. DOS_SYN_Hping: A denial-of-service (DoS) attack in which a target is overloaded


with SYN packets via the Hping tool, rendering it unreachable.

2. ARP_poisioning: An attacker can intercept communication intended for a genuine de-


vice by sending fictitious ARP messages over a local network that link their MAC address
to the IP address of the legitimate device.

3. NMAP_UDP_SCAN: To find out whether services or applications are listening on a


certain port, use the Network Mapper (Nmap) scan function, which sends UDP packets
to the target port.

4. NMAP_XMAS_TREE_SCAN: This Nmap scan, sometimes referred to as a "Christmas


Tree" scan, sets the FIN, URG, and PUSH flags in a packet to check target ports for
vulnerabilities.

5. NMAP_OS_DETECTION: An Nmap scan that precisely ascertains the operating sys-


tems of target devices by examining their answers.

6. NMAP_TCP_scan: This Nmap scan finds open ports and services that are using those
ports by sending TCP packets to target ports.

7. DDOS_Slowloris: A Distributed Denial-of-Service (DDoS) attack that uses the Slowloris


tool to keep numerous connections open at once in an attempt to deplete a server’s re-
sources.

15
Chapter 3. Methodology

8. Metasploit_Brute_Force_SSH: Brute-force attack employing the Metasploit framework


to attempt a variety of username and password combinations in an attempt to guess an
SSH server’s credentials.

9. NMAP_FIN_SCAN: Nmap scan that checks if target ports are filtered or closed by a
firewall by sending FIN packets to them.

10. MQTT: The Message Queuing Telemetry Transport protocol is a lightweight messaging
protocol that helps devices communicate with one another by allowing publishers and
subscribers to publish and subscribe to messages.

11. Thing_speak: An IoT platform that makes it possible to monitor and manage IoT appli-
cations by allowing users to gather, examine, and visualise data from IoT devices.

12. Wipro_bulb_Dataset: Wipro-provided dataset valuable for IoT application develop-


ment and analysis, comprising information on the functionality and behaviour of Wipro-
branded IoT bulbs.

13. Amazon-Alexa: A virtual assistant that the company created. It can do a number of
things, like controlling smart home appliances, giving information, and playing music or
audiobooks, by using voice recognition and natural language processing.
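As a rough, hedged illustration (not the thesis's exact code), the class counts in Tables 3.1 and 3.2 and the distribution shown in Figure 3.3 could be reproduced with pandas; the file name and the Attack_type column name are assumptions about the RT-IoT2022 CSV layout:

import pandas as pd
import matplotlib.pyplot as plt

# Load the RT-IoT2022 data (file name and "Attack_type" column are assumed)
df = pd.read_csv("RT_IOT2022.csv")

# Count instances per class, mirroring Tables 3.1 and 3.2
class_counts = df["Attack_type"].value_counts()
print(class_counts)

# Bar chart of the class distribution, similar in spirit to Figure 3.3
class_counts.plot(kind="bar", figsize=(10, 4), title="Distribution of Attack Type classes")
plt.tight_layout()
plt.show()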

Figure 3.3: Distribution of Attack Type classes

16
Chapter 3. Methodology

According to the graphic below, there appears to be an abnormality in the fwd_pkts_tot


(total forwards packets) values, which are larger during the DOS_SYN_HPING assault and
Thing_speak activity. This is followed by a spike in bwd_pkts_tot (total backward packets).
In the instance of DOS_SYN_HPING, the attack probably consists of flooding a target
system with a large number of SYN packets (fwd_pkts_tot). Subsequent to this deluge of
packets, there is a notable surge in response packets (bwd_pkts_tot), indicating anomalous
behaviour and possible intrusion.
Comparably, Thing_speak displays greater fwd_pkts_tot numbers as well, which would
point to regular data transfer or operation. The subsequent rise in bwd_pkts_tot, however,
might point to a particular pattern or behaviour peculiar to Thing_speak’s functioning.
Overall, the finding that bwd_pkts_tot is impacted in a similar way as fwd_pkts_tot in-
creases points to a link between these two measurements that may be important for identifying
anomalies and comprehending how network traffic behaves in these situations.

Figure 3.4: Anomaly Detection in Network Traffic: Analysis of DOS_SYN_HPING and


Thing_speak Activities

17
Chapter 3. Methodology

3.4 Data Preprocessing

3.4.1 Null Values and Outliers

The dataset has zero null values, which means that all of the data are present and there are no
missing values. Furthermore, as will be explained below, the dataset has outliers in a number
of columns.
The dataset consists of 83 columns in total, all of which have had their outliers examined.
The results are summarised as follows:
There are outliers in 76 of the columns. There are no outliers found in 7 columns. The
columns with no outliers detected are as follows:

• fwd_pkts_per_sec

• bwd_pkts_per_sec

• flow_pkts_per_sec

• bwd_URG_flag_count

• flow_CWR_flag_count

• flow_ECE_flag_count

• payload_bytes_per_second
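A minimal sketch of how the null-value check and the per-column outlier scan summarised above could be performed; the 1.5 × IQR rule and the DataFrame name df are illustrative assumptions rather than the thesis's exact procedure:

# df is assumed to hold the RT-IoT2022 data loaded earlier
print("Total null values:", df.isnull().sum().sum())   # expected to be 0

# Flag numeric columns that contain outliers under the 1.5 * IQR rule
numeric_cols = df.select_dtypes(include="number").columns
cols_with_outliers = []
for col in numeric_cols:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    is_outlier = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    if is_outlier.any():
        cols_with_outliers.append(col)

print(len(cols_with_outliers), "columns contain outliers")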

3.4.2 Winsorizing for handling Outliers

A statistical method for handling extreme values in a dataset is winsorizing outliers. Winsoriz-
ing replaces outliers with the closest non-outlier value as opposed to eliminating them. This
method preserves important data points while lessening the impact of outliers on statistical
analysis. Typically, winsorizing involves setting the outlier values to a predetermined dataset
percentile, like the 95th or 99th percentile. In doing so, the extreme values are pushed closer

18
Chapter 3. Methodology

to the average, strengthening the dataset’s resistance to outliers while maintaining the original
data’s general distribution and properties.
In Winsorizing, the values below the p-th percentile are set to the value at the p-th percentile,
and the values above the (100−p)-th percentile are set to the value at the (100−p)-th percentile.
This can be mathematically represented as:





\[
\text{Winsorized value} =
\begin{cases}
\text{percentile}_{p} & \text{if value} < \text{percentile}_{p} \\
\text{value} & \text{if } \text{percentile}_{p} \leq \text{value} \leq \text{percentile}_{100-p} \\
\text{percentile}_{100-p} & \text{if value} > \text{percentile}_{100-p}
\end{cases}
\]

where $\text{percentile}_{p}$ is the p-th percentile and $\text{percentile}_{100-p}$ is the (100 − p)-th percentile of
the dataset.
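The percentile clipping described above can be sketched as follows; the choice of p = 5 (clipping at the 5th and 95th percentiles) and the reuse of df and numeric_cols from the earlier sketch are illustrative assumptions, not necessarily the settings used in the thesis:

import numpy as np

def winsorize_series(values, p=5):
    """Clip values below the p-th percentile and above the (100 - p)-th percentile."""
    lower = np.percentile(values, p)
    upper = np.percentile(values, 100 - p)
    return np.clip(values, lower, upper)

# Apply winsorizing to every numeric column of the (assumed) DataFrame df
for col in numeric_cols:
    df[col] = winsorize_series(df[col].to_numpy())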

3.4.3 Undersampling and SMOTE for Class Imbalance

The class imbalance issue in the dataset, where there is a notable difference in the number
of occurrences across distinct classes, is a typical machine learning obstacle. The model may
perform worse as a result of this problem if it predicts the overrepresented class more accurately
while ignoring the underrepresented ones. In particular, the class DOS_SYN_Hping stands out
in the dataset presented with a significantly higher number of instances than other classes,
which may bias the model’s predictions in favour of this class and, as a result, jeopardise the
precision and dependability of predictions for the minority classes.
Metasploit_Brute_Force_SSH and NMAP_FIN_SCAN are two of these minority classes,
with only 37 and 28 instances, respectively. It is difficult for the model to correctly learn the
distinctive patterns of these classes due to their low representation, which affects the model’s
ability to effectively categorise cases that belong to these classes. It is crucial to resolve this
imbalance in the class distribution in order to improve the model’s overall performance, since
it may result in reduced classification accuracy, precision, and recall for these minority classes.

19
Chapter 3. Methodology

If the class disparity is not addressed, it can negatively impact the model’s learning process
and result in predictions that are skewed towards the majority class and ignore the minority
classes. As a result, the model may perform unevenly across classes, performing well in pre-
dicting the majority class but having trouble with the minority ones. Various solutions, such as
resampling approaches, alternative evaluation metrics, and the usage of algorithms intended to
manage class imbalance well, can be used to alleviate the negative impacts of class imbalance.
The intention is to rebalance the class distribution by putting these tactics into practice,
which will allow the model to learn from every class equally and produce more reliable predic-
tions across the board. In order to increase the model’s usefulness and practical applicability in
real-world scenarios, it is imperative to address the issue of class imbalance. This will also en-
sure that the predictions produced by the model are impartial and trustworthy across all classes.

Class Instances
DOS_SYN_Hping 94659
Thing_Speak 8108
ARP_poisoning 7750
MQTT_Publish 4146
NMAP_UDP_SCAN 2590
NMAP_XMAS_TREE_SCAN 2010
NMAP_OS_DETECTION 2000
NMAP_TCP_scan 1002
DDOS_Slowloris 534
Wipro_bulb 253
Metasploit_Brute_Force_SSH 37
NMAP_FIN_SCAN 28

Table 3.3: Class Distribution in the Dataset

Reducing the number of instances in the overrepresented class to equal the number of in-
stances in the underrepresented class is a technique known as undersampling, which is used to

20
Chapter 3. Methodology

address class imbalance. In this work, undersampling is used to eliminate occur-
rences of the class "DOS_SYN_Hping" from the dataset, as follows:

1. Determine the row indices for the class "DOS_SYN_Hping."

2. Select a random subset of these indices to eliminate, and set the subset’s size to 84659 (the
required total number of removals).
Through random selection and removal of instances, the "DOS_SYN_Hping" class size
was reduced from its initial count of 94,659 to 10,000 instances. This adjustment brought the
"DOS_SYN_Hping" class much closer in size to the other classes in the dataset.
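A hedged sketch of this random undersampling step, assuming the data sit in a pandas DataFrame df with an Attack_type label column; the figure of 10,000 retained DOS_SYN_Hping rows follows the text, while the fixed seed is an assumption for reproducibility:

import numpy as np

rng = np.random.default_rng(42)   # fixed seed for reproducibility (an assumption)

# Indices of the overrepresented DOS_SYN_Hping class
dos_idx = df.index[df["Attack_type"] == "DOS_SYN_Hping"]

# Randomly pick 84,659 of them to drop, leaving 10,000 instances of this class
drop_idx = rng.choice(dos_idx, size=len(dos_idx) - 10_000, replace=False)
df_balanced = df.drop(index=drop_idx)

print(df_balanced["Attack_type"].value_counts())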

3.4.4 SMOTE

Synthetic Minority Over-sampling Technique, or SMOTE for short, is a popular technique used
to address class imbalances in machine learning datasets. By producing synthetic data points for the
underrepresented minority classes, this strategy is essential to achieving a fairer allocation
of classes. This effectively addresses the problem of model predictions that are biased due to
differences in class instances.
It is crucial to understand the basic dataset properties before delving into the details of how
SMOTE functions within the given code. The dataset in question is divided into several classes,
some of which have significantly fewer samples than others. Class imbalances of this kind have
the potential to distort model results, giving majority classes accurate forecasts but minority
classes poor performance.
The SMOTE module from the imblearn library—a Python tool created especially for man-
aging unbalanced datasets—is included in the code extract. By creating fake data points within
minority classes, SMOTE, an oversampling approach, aims to increase the representation of
those groups.
In order to achieve reproducible results, it is imperative that an instance of the SMOTE class
be initialised and configured with a particular random state value, like 42. Synthetic samples

21
Chapter 3. Methodology

aimed at correcting the class distribution are generated by applying the ‘fit_resample()‘ method
on the feature matrix x and the target variable y.
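The imblearn usage described in this paragraph might look like the following sketch; x and y denote the feature matrix and target variable from the text, and random_state=42 matches the value mentioned above:

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Class counts before SMOTE:", Counter(y))

smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x, y)

print("Class counts after SMOTE:", Counter(y_resampled))   # all classes now equal in size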
After SMOTE is applied, the target variable y and the feature matrix x have different forms.
As a result, the dataset becomes more balanced as the quantity of instances belonging to mi-
nority classes rises to match that of majority classes.
By comparing the counts of each class in the target variable y before and after the SMOTE
intervention, the effect of SMOTE on the class distribution is closely examined. Interestingly,
there is a large bias in counts prior to SMOTE, substantially favouring the majority classes. All
class counts are equalised after SMOTE, though, suggesting that the dataset has been success-
fully balanced.
SMOTE is important for reasons that go beyond just correcting for class disparities. SMOTE
dramatically improves the predicted accuracy and generalizability of machine learning models
trained on skewed datasets by reducing the disproportionate representation of classes. SMOTE
effectively combats biases that may occur during model training by promoting a more fair
dataset distribution, enabling a more robust and unbiased model performance across different
class instances.
SMOTE is essentially a game-changing technique for managing class imbalances in ma-
chine learning datasets. SMOTE boosts machine learning model performance and reliability
by generating synthetic samples for underrepresented classes, especially in situations when
complicated class imbalances are prevalent.
The algorithm for SMOTE can be described as follows:

1. For each sample in the minority class, find its k nearest neighbors.

2. Randomly select one of the k nearest neighbors and calculate the difference between the
sample and the selected neighbor.

3. Multiply this difference by a random number between 0 and 1, and add it to the sample to
create a new synthetic sample.

4. Repeat this process for each sample in the minority class to generate the desired number of
synthetic samples.

22
Chapter 3. Methodology

Mathematically, the SMOTE algorithm can be represented as follows:


Let $x_i$ be a sample from the minority class, and $x_{zi}$ be a randomly selected nearest neighbor
of $x_i$. The difference between $x_i$ and $x_{zi}$ is calculated as $\text{diff} = x_{zi} - x_i$.
A new synthetic sample is then generated as $x_{\text{new}} = x_i + \lambda \times \text{diff}$, where $\lambda$ is a random
number between 0 and 1.
This process is repeated for each sample in the minority class to generate synthetic samples
and balance the class distribution.
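To make the interpolation formula concrete, the following toy sketch (not imblearn's internal implementation) generates one synthetic sample from a minority point and one of its k nearest neighbours; the small toy array is purely illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = np.array([[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [1.1, 2.3]])   # toy minority class

k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)   # +1 because each point is its own neighbour
_, idx = nn.kneighbors(minority)

x_i = minority[0]
x_zi = minority[rng.choice(idx[0][1:])]   # one of the k nearest neighbours of x_i
lam = rng.random()                        # lambda, a random number between 0 and 1
x_new = x_i + lam * (x_zi - x_i)          # x_new = x_i + lambda * diff
print(x_new)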

Figure 3.5: Synthetic Minority Over-sampling Technique

23
Chapter 3. Methodology

3.5 Recursive Feature Elimination (RFE) for Feature selec-


tion
The goal of the feature selection process, known as recursive feature elimination (RFE), is to
extract the most pertinent characteristics from a given dataset. Recursively fitting a model and
removing the least significant features according to their relevance ranking is how it operates.
The first step in the procedure is to train a model using all of the features and give each
feature a relevance value. Then, the features that are now present are reduced to the least
significant features. Recursively repeating this technique continues until the target number of
features is attained.
The model is trained using the smaller set of features at each iteration, and the significance
scores are recalculated. This enables RFE to take into account the interdependencies between
features and choose a subset of characteristics that give the model the best overall prediction
power.
When working with high-dimensional datasets—those with more characteristics than sam-
ples—RFE is very helpful. RFE can enhance the model’s performance by lowering overfitting
and enhancing generalisation to unobserved data by choosing a subset of features.
RFE’s computational cost is one of its drawbacks, particularly when dealing with huge
datasets and intricate models. Furthermore, the model and feature count that are chosen will
determine how effective RFE is; these decisions may need to be adjusted depending on the
particular dataset and issue at hand.
In conclusion, by identifying the most pertinent features from a given dataset, Recursive
Feature Elimination is a potent feature selection strategy that can assist enhance the perfor-
mance and interpretability of machine learning models.
Recursive Feature Elimination (RFE) is a feature selection technique commonly used in
classification problems. The mathematical formula for RFE in the context of a classification
problem involves recursively fitting a classification model and eliminating the least important
features based on their ranking. Here’s the general mathematical formulation:

24
Chapter 3. Methodology

Let X represent the feature matrix of shape (n, m), where n is the number of samples and m
is the number of features. Let y represent the target variable of shape (n, ), containing the class
labels for each sample. Let model denote the classification model used for feature ranking.
Let RFE(X, y, k) represent the Recursive Feature Elimination function, where k is the desired
number of features to select.
1. Initialize the working feature set F with all m features.

2. Repeat while |F| > k:

   (a) Train the classification model on $X_F$, the columns of X corresponding to the features currently in F.

   (b) Compute an importance score for each feature in F based on the trained model.

   (c) Identify the least important feature f from the importance scores.

   (d) Remove f from F.

3. Return the remaining features S = F as the set of selected features.
The importance scores are typically derived from the model’s coefficients or feature impor-
tances, depending on the type of classification model used (e.g., logistic regression, decision
trees, etc.). The process continues recursively until the desired number of features (k) is se-
lected.
Overall, RFE aims to identify the subset of features that maximizes the classification per-
formance of the model while reducing the dimensionality of the feature space.
The formula for Recursive Feature Elimination (RFE) in a classification problem can be
expressed as follows:
Given a feature matrix X of shape (n, m) and a target variable y of shape (n, ), where n is
the number of samples and m is the number of features, RFE selects a subset of features of size
k that maximizes the classification performance of a given classification model.
Let model denote the classification model used for feature ranking, and let RFE(X, y, k)
represent the Recursive Feature Elimination function.
The RFE algorithm can be summarized by the following formula:

\[
\mathrm{RFE}(X, y, k) = \arg\max_{S \subseteq \{1, 2, \dots, m\},\ |S| = k} \mathrm{Performance}\big(\mathrm{model}(X_S), y\big)
\]

25
Chapter 3. Methodology

where:

• S is the set of selected features,

• $X_S$ represents the subset of features of X indexed by S,

• $\mathrm{Performance}(\mathrm{model}(X_S), y)$ is a metric measuring the classification performance of the
model using the selected features $X_S$ and the target variable y.
The goal of RFE is to find the subset of features S that maximizes the classification perfor-
mance of the model.
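A minimal scikit-learn sketch of the RFE procedure formulated above; the random-forest ranking model, the choice of k = 20 features, and the assumption that x_resampled is a DataFrame with named columns are illustrative rather than the thesis's exact configuration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Rank features with a random forest and keep the k = 20 most important ones (illustrative k)
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=20, step=1)
rfe.fit(x_resampled, y_resampled)

# Boolean mask of retained features (assumes a DataFrame with named columns)
selected_features = x_resampled.columns[rfe.support_]
print(list(selected_features))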

3.6 ML Algorithms

3.6.1 LightGBM

A gradient boosting system called LightGBM makes use of tree-based learning methods. Due
to its efficient and distributed design, it can be used for issues involving high-dimensional
features and massive amounts of data. LightGBM introduces various improvements to the
conventional gradient boosting framework, including:

• Gradient-based One-Side Sampling (GOSS): GOSS is a technique used by LightGBM


to decrease the quantity of training data examples while preserving gradient information.
It randomly samples the cases with small gradients and retains the examples with large
gradients, which are more instructive.

• Exclusive Feature Bundling (EFB): To cut down on the number of features and boost efficiency, Light-
GBM bundles mutually exclusive features together. This is very helpful when working with data that
has many dimensions.

• Histogram-based Splitting: LightGBM determines the optimal split points for every fea-
ture using an algorithm based on a histogram. It initially classifies the data, then utilises
the histogram to choose the ideal split points rather than utilising all of the data points.

• Leaf-wise Tree development: Compared to level-wise tree development, LightGBM


grows trees in a leaf-wise manner, which minimises the loss more efficiently and typically results
in a lower overall loss.

26
Chapter 3. Methodology


• Regularisation: To avoid overfitting, LightGBM offers a number of regularisation param-


eters, including L1 and L2 regularisation.

LightGBM’s classification workflow:

• Initialisation: LightGBM initialises the model with a single leaf whose value is computed
from the overall mean of the target variable.

• Tree Growth: Trees are grown iteratively by LightGBM. At each iteration it splits into two
child leaves the leaf whose split reduces the loss the most. This
operation is repeated until either the maximum number of leaves is reached or the
loss reduction falls below a predetermined threshold.

• Leaf-wise Growth: Unlike standard algorithms, which split all leaves at the same level,
LightGBM grows trees leaf-wise, that is, by splitting the leaf that reduces the loss the
most.

• Prediction: LightGBM uses the feature values of the input instance to navigate the tree from the root
to a leaf in order to produce a prediction. The anticipated value is then derived from the
leaf’s mean value.

• Regularisation: L1 and L2 regularisation are two regularisation strategies that LightGBM


uses to stop overfitting.

Because of its efficient tree-growing technique and capacity to handle sparse data, Light-
GBM is an all-around efficient and effective algorithm for classification tasks, particu-
larly for large-scale datasets with high-dimensional features.

The LightGBM algorithm optimizes the following objective function in each iteration:

27
Chapter 3. Methodology

\[
\mathrm{obj}(\theta) = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \sum_{i=1}^{T} \Omega(f_i)
\]

where:

• θ is the set of parameters to be optimized,

• n is the number of data points,

• ℓ is the loss function (e.g., cross-entropy for binary classification),

• $y_i$ is the true label of the i-th data point,

• $\hat{y}_i$ is the predicted label of the i-th data point,

• T is the number of trees in the model,

• Ω is the regularization term to prevent overfitting,

• $f_i$ is the i-th tree in the model.
The objective function is optimized using a gradient boosting framework, where each new
tree is added to minimize the objective function. The prediction of the model for a new data
point x is given by:

\[
\hat{y}(x) = \sum_{i=1}^{T} f_i(x)
\]

where $f_i(x)$ is the prediction of the i-th tree for the data point x.
The algorithm iteratively adds new trees to the model, with each tree minimizing the ob-
jective function by fitting the negative gradient of the loss function. This process continues
until a stopping criterion is met, such as reaching the maximum number of trees or achieving a
minimum improvement in the objective function.
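A hedged sketch of fitting LightGBM on the preprocessed, SMOTE-balanced data; the train/test split and hyperparameters below are illustrative assumptions rather than the thesis's exact settings:

from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the balanced data for evaluation (assumed split)
x_train, x_test, y_train, y_test = train_test_split(
    x_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled
)

lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
lgbm.fit(x_train, y_train)
print("LightGBM accuracy:", accuracy_score(y_test, lgbm.predict(x_test)))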

3.6.2 Gradient Boosting

One well-liked machine learning method for classification applications is gradient boosting.
It functions by successively combining several weak learners (often decision trees), with each
new learner concentrating on the errors committed by the older ones. An objective function
that gauges the effectiveness of the model and seeks to reduce errors serves as the process’s
direction.
An initial weak learner is used by the algorithm to create predictions based on the input
features. A loss function, such as binary cross-entropy for binary classification or categorical
cross-entropy for multi-class classification, is used to determine the errors when the predictions

28
Chapter 3. Methodology

and actual labels are compared. The objective is to minimise this loss function by varying the
weak learner’s parameters.
To fix the mistakes produced by the previous weak learner, a new one is added to the ensem-
ble in the following step. The residual errors—the discrepancy between expected and actual
values—of the prior learner are used to train this one. Iteratively, the procedure is repeated,
with each new student concentrating on the mistakes that remain.
Utilising gradients, or derivatives of the loss function with regard to the model’s predictions,
is the fundamental principle of gradient boosting. To minimise the loss, these gradients show
which way the model’s predictions should be changed. The model eventually gets better at
what it does by adding more learners that are trained to minimise these gradients iteratively.
Because gradient boosting can manage non-linearities and capture complicated relation-
ships in the data, it is a useful technique. To avoid overfitting and attain optimal performance,
it is crucial to adjust the algorithm’s hyperparameters, such as the learning rate and the number
of learners.
The update rule for gradient boosting can be represented as follows:

\[
F_m(x) = F_{m-1}(x) + \gamma \cdot h_m(x)
\]

where:

• $F_m(x)$ is the ensemble of weak learners up to iteration m,

• $F_{m-1}(x)$ is the ensemble of weak learners up to iteration m − 1,

• γ is the learning rate,

• $h_m(x)$ is the weak learner added at iteration m.
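As an illustration of the boosting update above, scikit-learn's GradientBoostingClassifier could be fitted as follows, reusing the x_train/x_test split from the LightGBM sketch; the learning rate of 0.1 and 100 estimators are assumed values, not the thesis's tuned hyperparameters:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# learning_rate plays the role of gamma; each of the 100 trees is a weak learner h_m(x)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(x_train, y_train)
print("Gradient Boosting accuracy:", accuracy_score(y_test, gb.predict(x_test)))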

3.6.3 Random Forest

A potent ensemble learning technique for both regression and classification applications is
called Random Forest. It is predicated on the idea of decision trees, in which a number of
trees are constructed to generate predictions, with the aggregate of the forecasts from each tree
serving as the ultimate prediction.

29
Chapter 3. Methodology

Several decision trees in a Random Forest are trained via a method known as bagging
(bootstrap aggregating). In bagging, a decision tree is trained on each of the bootstrapped
samples—random samples with replacement—that are created from the original dataset. This
lessens overfitting and enhances the model’s overall performance by introducing diversity among
the trees.
The Random Forest further increases the diversity of the trees by considering only a random
subset of features at each split. This method is referred to as random feature selection or feature bagging. The
trees grow more robust and less correlated at each split by taking into account only a portion of
the attributes, which results in a more accurate and stable model.
The Random Forest combines the forecasts from each individual tree to create predictions.
The average of each tree’s predictions is the final prediction for regression tasks. When it comes
to classification jobs, the ultimate prediction is decided by a majority vote among all the trees’
predictions.
The capacity of Random Forest to handle big datasets with lots of characteristics and high
dimensionality is one of its main advantages. It is a well-liked option for many machine learn-
ing problems since it is less likely to overfit than individual decision trees.
All things considered, Random Forest is a popular option for many machine learning appli-
cations since it is a flexible and strong algorithm that blends the advantages of decision trees
with ensemble learning.
The mathematical formula for the Random Forest algorithm can be expressed as follows:
Let X be the input feature matrix with m samples and n features, and Y be the correspond-
ing target variable. Random Forest consists of N decision trees, where each tree is built using
a bootstrap sample of the training data:
1. For $i = 1$ to $N$:

   (a) Draw a bootstrap sample $X_i$ of size m from X with replacement.

   (b) Train a decision tree $h_i$ on $X_i$ and the corresponding labels from Y.

2. To make a prediction for a new sample x, aggregate the predictions of all trees:
\[
\hat{Y}(x) = \frac{1}{N} \sum_{i=1}^{N} h_i(x) \quad \text{(for regression)},
\]
or use majority voting:
\[
\hat{Y}(x) = \mathrm{mode}\{h_i(x)\}_{i=1}^{N} \quad \text{(for classification)}.
\]

30
Chapter 3. Methodology

The final prediction is the average (for regression) or majority vote (for classification) of
the predictions of all trees.
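A brief scikit-learn sketch of the bagging-plus-feature-subsampling procedure described above, again reusing the x_train/x_test split from the earlier sketches; 100 trees and the square-root feature rule are common defaults assumed here for illustration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# N = 100 bootstrapped trees, each split considering roughly sqrt(m) features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(x_train, y_train)
print(classification_report(y_test, rf.predict(x_test)))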

3.6.4 KNN

A straightforward yet powerful approach for machine learning classification problems is K-


Nearest Neighbours (KNN). The fundamental principle of KNN is to categorise a new data
point in the feature space according to the majority class of its K nearest neighbours. This is
how KNN functions:

• Initialization: A training dataset with labelled samples is used to begin the algorithm. A
class identifier and a collection of features, or attributes, make up each example.

• Distance Calculation: KNN determines the distance between each new data point to be
classified and every other point in the training dataset. While Manhattan distance is one
of the available metrics, Euclidean distance is the most widely used one.

• Finding Neighbours: Using the computed distances, KNN then finds the K data points (neighbours) that are closest to the new point. These neighbours form the set of points used for classification.

• Majority Voting: KNN classifies the new data point by taking a majority vote among its K neighbours; the point is assigned to the class that occurs most frequently among them.

• Selecting K: K is a hyperparameter that must be chosen carefully. A large value of K may result in underfitting, while a small value of K may cause overfitting. Cross-validation is typically used to select K.

• Decision Boundary: KNN saves all training instances and their labels instead of learning
a model explicitly. The distribution of the training data establishes the non-linear decision
boundary in a KNN.


KNN is a flexible technique that works well with situations involving binary and multi-
class classification. Because it is easy to use and comprehend, it is a well-liked option for
those new to machine learning. It can, however, be computationally expensive, particu-
larly for large datasets, since all training instances must have their distances calculated
and stored.

The k-nearest neighbors (KNN) algorithm can be summarized in a mathematical formula as


follows:
Given a training dataset D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} where xi is the feature vector
and yi is the class label of the ith instance, and a test instance xq for which we want to predict
the class label:
1. Calculate the distance between xq and each xi in the training dataset using a distance
metric, such as Euclidean distance:
d(x_q, x_i) = √( Σ_{j=1}^{m} (x_{qj} − x_{ij})² )

2. Select the k instances in D with the smallest distances to xq .


3. Assign the class label to xq by majority voting among the k nearest neighbors:
ŷ_q = arg max_y Σ_{i=1}^{k} I(y_i = y)

where yˆq is the predicted class label for xq , yi is the class label of the ith nearest neighbor,
and I(yi = y) is an indicator function that is 1 if yi = y and 0 otherwise.
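The steps above can be illustrated with scikit-learn as follows; the synthetic data, the candidate values of K, and the use of feature scaling are assumptions made only for this sketch.

# Minimal sketch: KNN with Euclidean distance and K chosen by cross-validation (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Scaling matters because KNN relies on distances; metric="euclidean" matches the formula above.
best_k, best_score = None, -1.0
for k in (3, 5, 7, 9, 11):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k, metric="euclidean"))
    score = cross_val_score(knn, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

final = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final.fit(X_train, y_train)
print(f"Chosen K = {best_k}, hold-out accuracy = {final.score(X_test, y_test):.3f}")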

3.6.5 Logistic regression

Logistic regression is a simple and widely used statistical technique for binary classification problems. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that an instance belongs to a particular class. It is especially useful when the dependent variable is binary (e.g., 0 or 1, true or false).


The logistic function, sometimes referred to as the sigmoid function, is used by the logistic
regression model to translate the linear combination of the input data into a probability score
that ranges from 0 to 1. The definition of the logistic function is:

σ(z) = 1 / (1 + e^{−z})
where z = β0 + β1 x1 + β2 x2 + . . . + βn xn is the linear combination of input features and
coefficients, and β0 , β1 , . . . , βn are the model parameters to be learned from the training data.
Finding the ideal values for the coefficients that maximise the likelihood of the observed
data is the goal of logistic regression during the training phase. Usually, optimisation tech-
niques like gradient descent are used for this. After being trained, the model can use the fol-
lowing formula to forecast the likelihood that a new instance would belong to the positive class
(such as class 1):

P (y = 1|x) = σ(β0 + β1 x1 + β2 x2 + . . . + βn xn )

where P (y = 1|x) is the probability of the positive class given the input features x.
Logistic regression employs a decision threshold (often 0.5) to generate a binary prediction.
The occurrence is classed as belonging to the positive class if the anticipated probability is
higher than the threshold; if not, it is classified as belonging to the negative class.
Because of its simplicity, interpretability, and efficacy in binary classification problems,
logistic regression is widely utilised in a variety of industries, including healthcare (e.g., fore-
casting illness risk), marketing (e.g., customer churn prediction), and finance (e.g., credit risk
assessment).
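The following short sketch, on synthetic data and purely for illustration, checks that the fitted coefficients, the sigmoid function, and the 0.5 threshold reproduce scikit-learn's own predictions.

# Minimal sketch: the sigmoid mapping and the 0.5 decision threshold (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=1000, n_features=5, random_state=7)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Recompute P(y = 1 | x) from the learned coefficients beta and the intercept beta_0.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
proba_manual = sigmoid(z)
print(np.allclose(proba_manual, clf.predict_proba(X)[:, 1]))  # True

# The binary prediction applies the 0.5 threshold to the probability.
pred = (proba_manual >= 0.5).astype(int)
print((pred == clf.predict(X)).mean())  # 1.0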

3.6.6 AdaBoost

Adaptive Boosting, or AdaBoost, is a well-liked ensemble learning technique for classification


applications. It combines several weak classifiers to produce a strong classifier. A weak classifier is a model, such as a decision tree with a single split, that performs only slightly better than random guessing.


AdaBoost’s main concept is to train a sequence of weak classifiers on the same dataset in
a sequential fashion, with each new classifier placing greater emphasis on the cases that the
earlier classifiers misclassified. This is accomplished by changing the training instance weights
so that the incorrectly identified examples receive larger weights in the ensuing training cycles.
Because it concentrates on the cases that are hard to classify, AdaBoost can also cope comparatively well with imbalanced datasets.
More accurate classifiers are given larger weights throughout the training process, which is
based on the accuracy of each weak classifier. In the final classification, a weighted sum—where
the weights are the classifiers’ given weights—is used to aggregate the predictions of each weak
classifier. The final prediction is based on this weighted sum, and a majority vote among the
classifiers determines the class label.
AdaBoost has gained recognition for its ease of use and potency in enhancing classification
results, particularly in contrast to standalone weak classifiers. It is, however, susceptible to
outliers and noisy data, as these can adversely affect the performance of the weak classifiers
and, in turn, the AdaBoost method as a whole.
The AdaBoost decision function can be written as

F(x) = Σ_{t=1}^{T} α_t f_t(x),

where F(x) is the final classification function, T is the total number of weak classifiers, α_t is the weight assigned to the weak classifier f_t(x), and f_t(x) is a weak classifier trained on the data.
This formula represents the linear combination of weak classifiers weighted by αt to form a
strong classifier. Each weak classifier is trained sequentially, with the weights αt being adjusted
to focus on incorrectly classified samples in subsequent iterations.
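A minimal sketch with scikit-learn's AdaBoostClassifier is given below; the settings are illustrative, and the fitted estimator_weights_ attribute corresponds to the weights α_t in the formula above.

# Minimal sketch: AdaBoost with decision stumps as weak classifiers (illustrative settings only).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# The default base learner is a depth-1 decision tree (a stump); n_estimators is T in the formula.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=3)
ada.fit(X_train, y_train)

print("Hold-out accuracy:", ada.score(X_test, y_test))
print(ada.estimator_weights_[:5])  # the first few alpha_t values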


3.7 Evaluation metrics


1. Precision: Precision is the number of true positive results divided by the number of all predicted positive results, including those identified incorrectly:

   Precision = True Positives / (True Positives + False Positives)

2. Recall (Sensitivity or True Positive Rate): Recall is the number of true positive results divided by the number of all actual positive instances:

   Recall = True Positives / (True Positives + False Negatives)

3. F1 Score: The F1 score is the harmonic mean of precision and recall, giving a single score that balances both measures:

   F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

4. Accuracy: Accuracy is the ratio of correctly predicted observations to the total number of observations:

   Accuracy = (True Positives + True Negatives) / Total Population
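All four metrics can be computed directly with scikit-learn; the toy label vectors below are illustrative only, and for the multi-class attack labels the per-class scores would be combined with an averaging scheme such as "macro" or "weighted".

# Minimal sketch: the four metrics on toy binary labels (illustrative only).
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))   # 4 TP / (4 TP + 1 FP) = 0.8
print("Recall:   ", recall_score(y_true, y_pred))      # 4 TP / (4 TP + 1 FN) = 0.8
print("F1 score: ", f1_score(y_true, y_pred))          # harmonic mean = 0.8
print("Accuracy: ", accuracy_score(y_true, y_pred))    # 8 correct out of 10 = 0.8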

Chapter 4

Results and Discussions

4.1 Class imbalance

4.1.1 Undersampling

Undersampling reduces the number of instances in an over-represented class so that the class distribution becomes less skewed. Here, the dominant "DOS_SYN_Hping" class was undersampled from 94,659 to 10,000 instances. Reducing this imbalance can improve the performance of machine learning models, which otherwise tend to favour the majority class. A minimal sketch of this step is shown below; the exact code used is listed in Appendix A.
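The sketch assumes df holds the RT-IoT2022 data with its Attack_type column; it caps the dominant class with pandas sampling, which has the same effect as the index-dropping code in Appendix A.

# Minimal sketch: cap the dominant class at 10,000 rows (assumes df is already loaded).
import pandas as pd

majority = df[df["Attack_type"] == "DOS_SYN_Hping"].sample(n=10_000, random_state=42)
others = df[df["Attack_type"] != "DOS_SYN_Hping"]
df_balanced = pd.concat([majority, others]).sample(frac=1.0, random_state=42)  # shuffle rows
print(df_balanced["Attack_type"].value_counts())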

4.2 Oversampling SMOTE


Table 4.1 summarizes the shape and label counts of the dataset before oversampling (i.e., after the undersampling step above), and Table 4.2 summarizes them after applying SMOTE, which synthesises minority-class samples until every class reaches the majority-class count of 10,000. A minimal sketch of this step follows.
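The sketch assumes x and y already hold the encoded feature matrix and labels, as in Appendix A.

# Minimal sketch: SMOTE raises every minority class to the majority-class count (10,000 here).
from collections import Counter
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)      # assumes x (features) and y (encoded labels) already exist
x_res, y_res = sm.fit_resample(x, y)
print(sorted(Counter(y_res).items()))  # each label should now appear 10,000 times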


Description Value
Before OverSampling, the shape of X (38458, 83)
Before OverSampling, the shape of y (38458,)
Before OverSampling, counts of label ’0’ 7750
Before OverSampling, counts of label ’1’ 534
Before OverSampling, counts of label ’2’ 10000
Before OverSampling, counts of label ’3’ 4146
Before OverSampling, counts of label ’4’ 37
Before OverSampling, counts of label ’5’ 28
Before OverSampling, counts of label ’6’ 2000
Before OverSampling, counts of label ’7’ 1002
Before OverSampling, counts of label ’8’ 2590
Before OverSampling, counts of label ’9’ 2010
Before OverSampling, counts of label ’11’ 8108
Before OverSampling, counts of label ’12’ 253

Table 4.1: Class imbalance after undersampling (before SMOTE)

4.3 Model Performance


The performance of the models (LightGBM, Gradient Boosting, Random Forest, KNN, Logistic Regression, and AdaBoost) was evaluated on the classification task. LightGBM achieved
the highest accuracy of 97.48%, outperforming all other models. This superior performance
can be attributed to LightGBM’s ability to handle large datasets efficiently and its capability to
deal with high-dimensional data effectively.
Gradient Boosting, with an accuracy of 96.90%, also performed well, showcasing its strength
in ensemble learning and sequential improvement of weak learners. Random Forest achieved
an accuracy of 96.35%, demonstrating its robustness against overfitting and its effectiveness in
handling noisy data.


Description Value
After OverSampling, the shape of X (120000, 83)
After OverSampling, the shape of y (120000,)
After OverSampling, counts of label ’0’ 10000
After OverSampling, counts of label ’1’ 10000
After OverSampling, counts of label ’2’ 10000
After OverSampling, counts of label ’3’ 10000
After OverSampling, counts of label ’4’ 10000
After OverSampling, counts of label ’5’ 10000
After OverSampling, counts of label ’6’ 10000
After OverSampling, counts of label ’7’ 10000
After OverSampling, counts of label ’8’ 10000
After OverSampling, counts of label ’9’ 10000
After OverSampling, counts of label ’11’ 10000
After OverSampling, counts of label ’12’ 10000

Table 4.2: Class imbalance after SMOTE oversampling

KNN, with an accuracy of 86.18%, showed decent performance but lagged behind the tree-
based models. This can be attributed to KNN’s sensitivity to irrelevant features and its reliance
on distance metrics, which may not always be suitable for high-dimensional data.
Both Logistic Regression and AdaBoost exhibited lower accuracies of 54.78% and 55.63%,
respectively. These models might have struggled due to the complexity of the dataset, as they
are less adept at capturing non-linear relationships compared to ensemble methods like Light-
GBM and Gradient Boosting.
In conclusion, LightGBM emerged as the top-performing model due to its efficient handling
of high-dimensional data and its ability to capture complex patterns in the dataset. Gradient
Boosting and Random Forest also showed strong performances, highlighting the effectiveness
of ensemble learning techniques in classification tasks. Conversely, KNN, Logistic Regression,


and AdaBoost exhibited lower accuracies, suggesting that they may not be as suitable for this
particular classification problem.

Algorithm Accuracy
LightGBM 0.9748
Gradient Boosting 0.9690
Random Forest 0.9635
KNN 0.8618
Logistic Regression 0.5478
AdaBoost 0.5563

Table 4.3: Model Performances
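For reference, a condensed sketch of how a comparison like Table 4.3 can be generated is shown below; it assumes the train/test split from Appendix A and uses default hyperparameters, so the printed numbers will not necessarily match the table exactly.

# Minimal sketch: fit the six classifiers and collect hold-out accuracies
# (assumes X_train, X_test, y_train, y_test from the earlier split).
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

models = {
    "LightGBM": LGBMClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracy:.4f}")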

Chapter 5

Summary and Future Work

5.1 Findings
This project focused on classifying IoT network traffic using boosting and forest-based machine
learning algorithms on the Real-Time Internet of Things 2022 dataset. The goal was to enhance
IoT security by accurately classifying normal patterns and attack types.
Several machine learning algorithms were employed and evaluated for this classification
task. The algorithms included LightGBM, Gradient Boosting, Random Forest, KNN, Logistic
Regression, and AdaBoost. These algorithms were chosen for their effectiveness in handling
classification problems and their suitability for this particular dataset.
The evaluation metrics used to assess the performance of the models included accuracy,
precision, recall, F1 score, and area under the ROC curve. These metrics provide a comprehen-
sive view of how well the models are performing in terms of both overall accuracy and their
ability to correctly classify different classes of network traffic.
The results of the project showed that LightGBM performed the best among the algorithms,
achieving an accuracy of 97.48%. This superior performance can be attributed to LightGBM’s
ability to handle large datasets efficiently and its capability to deal with high-dimensional data
effectively. Gradient Boosting and Random Forest also showed strong performances, with
accuracies of 96.90% and 96.35%, respectively.


On the other hand, KNN, Logistic Regression, and AdaBoost exhibited lower accuracies,
suggesting that they may not be as suitable for this particular classification problem. These
algorithms struggled due to the complexity of the dataset and their limitations in capturing non-
linear relationships compared to ensemble methods like LightGBM and Gradient Boosting.
In conclusion, the project successfully developed and evaluated machine learning models
for classifying network traffic in IoT environments. The results highlight the effectiveness of
ensemble learning techniques, particularly LightGBM, in handling complex classification tasks
in IoT environments. These findings can be valuable for enhancing the security of IoT devices
and networks by accurately detecting and classifying different types of network traffic.

5.2 Future Scope


In future work, the project can explore the integration of deep learning techniques, such as
convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to further en-
hance the classification of network traffic in IoT environments. Deep learning models can
automatically extract features from raw data, potentially improving the accuracy and efficiency
of classification algorithms.
Additionally, developing an IoT network analysis system based on the trained machine learning models can provide real-time monitoring and detection of network anomalies. This
system can leverage the trained models to continuously analyze network traffic and identify
suspicious patterns or behaviors, helping to enhance the security and performance of IoT net-
works.
The benefits of integrating deep learning techniques into the project include the ability to
handle complex and high-dimensional data more effectively, as well as the potential for im-
proved accuracy and generalization. By leveraging deep learning, the project can stay at the
forefront of advancements in machine learning and IoT security, contributing to the develop-
ment of more robust and intelligent IoT network analysis systems.
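As a purely hypothetical illustration of this direction (no deep learning model was implemented in this work), a compact 1D-CNN over the 83 per-flow features could be defined as follows; the layer sizes and optimiser settings are arbitrary assumptions.

# Hypothetical sketch only: a small 1D-CNN classifier over per-flow feature vectors.
import tensorflow as tf

n_features, n_classes = 83, 12   # assumed to match the RT-IoT2022 feature and label counts

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features, 1)),   # flows reshaped to (n_samples, 83, 1)
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()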

Appendix A

Code Used

A.1 Data Loading and Preprocessing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r'C:\Users\nalan\Documents\riot22.csv')
print(df.shape)
# C:\Users\nalan\Documents\riot22.csv
df.drop(columns=['Unnamed: 0'], inplace=True)
df.duplicated().sum()
duplicate_rows = df[df.duplicated()]

ddf = pd.DataFrame(duplicate_rows)
ddf.head(25)

pd.set_option('display.max_rows', 500)


pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

duplicate_counts = df.duplicated().sum()
ddf = pd.DataFrame({'Column': df.columns, 'Duplicate Count': duplicate_counts})
ddf.head()  # cat add encod

for i in df.columns:
    if df[i].dtypes == 'object':
        print(i)
        print()
        print('the values are :')
        print(df[i].value_counts())
        print()
        print()

unique_attacks = df['Attack_type'].unique()
print(unique_attacks)  # 9 malwares

unique_attacks = df['Attack_type'].value_counts()
print(unique_attacks)

# 9 attack classes of RT-IoT2022 (list completed from the dataset's Attack_type labels)
bad_attacks = ['ARP_poisioning', 'DDOS_Slowloris', 'DOS_SYN_Hping', 'Metasploit_Brute_Force_SSH',
               'NMAP_FIN_SCAN', 'NMAP_OS_DETECTION', 'NMAP_TCP_scan', 'NMAP_UDP_SCAN',
               'NMAP_XMAS_TREE_SCAN']

# Filter the DataFrame to include only rows where the Attack_type is in bad_attacks
bad_attacks_df = df[df['Attack_type'].isin(bad_attacks)]

# Get the value counts for Attack_type in the filtered DataFrame


bad_attacks_counts = bad_attacks_df['Attack_type'].value_counts()

print(bad_attacks_counts)

for column in df.columns:
    null_count = df[column].isnull().sum()
    print(f"Column '{column}' has {null_count} null values.")

import matplotlib.pyplot as plt

# Filter numerical columns
numerical_columns = df.select_dtypes(include=['number']).columns

# Create box plots for each numerical column
for column in numerical_columns:
    plt.figure(figsize=(8, 6))
    df.boxplot(column=[column])
    plt.title(f'Box plot for {column}')
    plt.ylabel('Value')
    plt.show()

def detect_outliers_iqr(data):
    # Calculate Q1 (25th percentile) of the given data
    Q1 = data.quantile(0.25)

    # Calculate Q3 (75th percentile) of the given data
    Q3 = data.quantile(0.75)


    # Calculate IQR (Interquartile Range)
    IQR = Q3 - Q1

    # Calculate lower bound
    lower_bound = Q1 - 1.5 * IQR

    # Calculate upper bound
    upper_bound = Q3 + 1.5 * IQR

    # Find outliers
    outliers = (data < lower_bound) | (data > upper_bound)

    return outliers

# Iterate over each numerical column and detect outliers
for column in numerical_columns:
    outliers = detect_outliers_iqr(df[column])
    print(f"Column '{column}': {outliers.sum()} outliers detected.")

# outlier removal
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

for column in numeric_columns:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1


    print(f"Column '{column}': IQR = {IQR}")

import pandas as pd

outliers_columns = [
    'id.resp_p', 'flow_duration', 'fwd_pkts_tot', 'bwd_pkts_tot', 'fwd_da
    'down_up_ratio', 'fwd_header_size_tot', 'fwd_header_size_min', 'fwd_h
    'bwd_header_size_min', 'bwd_header_size_max',
    'fwd_PSH_flag_count', 'bwd_PSH_flag_count', 'flow_ACK_flag_count',
    'fwd_pkts_payload.min', 'fwd_pkts_payload.max', 'fwd_pkts_payload.to
    'fwd_pkts_payload.avg', 'fwd_pkts_payload.std', 'bwd_pkts_payload.min
    'bwd_pkts_payload.tot', 'bwd_pkts_payload.avg', 'bwd_pkts_payload.std
    'flow_pkts_payload.max', 'flow_pkts_payload.tot', 'flow_pkts_payload.
    'fwd_iat.min', 'fwd_iat.max', 'fwd_iat.tot', 'fwd_iat.avg', 'fwd_iat.
    'bwd_iat.max', 'bwd_iat.tot', 'bwd_iat.avg', 'bwd_iat.std', 'flow_iat
    'flow_iat.tot', 'flow_iat.avg', 'flow_iat.std', 'payload_bytes_per_sec
    'bwd_subflow_pkts', 'fwd_subflow_bytes', 'bwd_subflow_bytes', 'fwd_bu
    'fwd_bulk_packets', 'bwd_bulk_packets', 'fwd_bulk_rate', 'bwd_bulk_ra
    'active.tot', 'active.avg', 'active.std', 'idle.min', 'idle.max', 'id
    'fwd_init_window_size', 'bwd_init_window_size', 'fwd_last_window_size
]

dfout = df[outliers_columns].copy()
dfout

def detect_outliers_iqr(data):
    # Calculate Q1 (25th percentile) of the given data
    Q1 = data.quantile(0.25)


    # Calculate Q3 (75th percentile) of the given data
    Q3 = data.quantile(0.75)

    # Calculate IQR (Interquartile Range)
    IQR = Q3 - Q1

    # Calculate lower bound
    lower_bound = Q1 - 1.5 * IQR

    # Calculate upper bound
    upper_bound = Q3 + 1.5 * IQR

    # Find outliers
    outliers = (data < lower_bound) | (data > upper_bound)

    return outliers

for col in dfout.columns:
    Q1 = dfout[col].quantile(0.25)
    Q3 = dfout[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    dfout[col] = dfout[col].apply(lambda x: lower_bound if x < lower_bound
                                  else (upper_bound if x > upper_bound else x))
dfout


# Winsorizing outliers (replacing outliers with the nearest non-outlier value)

import pandas as pd
import numpy as np

# Assuming 'Attack_type' is the column containing the class labels
# Assuming 'DOS_SYN_Hping' is the class label to remove

# Find indices of rows with 'DOS_SYN_Hping'
indices_to_remove = df[df['Attack_type'] == 'DOS_SYN_Hping'].index

# Choose random indices to remove
np.random.seed(42)  # Set seed for reproducibility
indices_to_remove = np.random.choice(indices_to_remove, size=84659, replace=False)

# Remove rows from the DataFrame
df = df.drop(indices_to_remove)

# Save the filtered DataFrame for further processing
df.to_csv('filtered_data.csv', index=False)

from sklearn.preprocessing import LabelEncoder

# Assuming 'df' is your DataFrame and non_numerical_columns is a list of the non-numeric columns
non_numerical_columns = df.select_dtypes(exclude=['float64', 'int64']).columns

label_encoders = {}
for feature in non_numerical_columns:


    label_encoders[feature] = LabelEncoder()
    df[feature] = label_encoders[feature].fit_transform(df[feature])

x = df.drop(['Attack_type'], axis=1)
y = df['Attack_type']

A.2 SMOTE AND RFE

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x, y.ravel())

x_res = pd.DataFrame(x_res)
# Renaming column name of Target variable
y_res = pd.DataFrame(y_res)
y_res.columns = ['Attack_type']
df = pd.concat([x_res, y_res], axis=1)

# after feature extraction

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Assuming 'X' is your feature matrix and 'y' is your target variable
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=16)
fit = rfe.fit(x, y)


selected_features = x.columns[rfe.support_]
print(selected_features)

A.3 ML model fitting

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_selected = x[selected_features]  # restrict to the RFE-selected features

# Split the resampled data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)  # test_size=0.2 is an assumed value

# Logistic Regression
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)

# XGBoost
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

# Random Forest


rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

# Predict using each model
lr_pred = lr_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
dt_pred = dt_model.predict(X_test)

# Calculate accuracy for each model
lr_accuracy = accuracy_score(y_test, lr_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)
rf_accuracy = accuracy_score(y_test, rf_pred)
dt_accuracy = accuracy_score(y_test, dt_pred)

print(f"Logistic Regression Accuracy: {lr_accuracy}")
print(f"XGBoost Accuracy: {xgb_accuracy}")
print(f"Random Forest Accuracy: {rf_accuracy}")
print(f"Decision Tree Accuracy: {dt_accuracy}")
