Enhancing Intrusion Detection in Real-Time IoT Devices
MASTER’S THESIS
Submitted by
Nalankilli R
Reg. No. 22MDT1038
Department of Mathematics
School of Advanced Sciences
Vellore Institute of Technology Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127
May - 2024
DECLARATION
I hereby declare that the thesis entitled Enhancing Intrusion Detection in Real-Time
IoT Devices, submitted by me to the Division of Mathematics, School of Advanced Sciences,
Vellore Institute of Technology, Chennai Campus, 600 127, in partial fulfillment of the
requirements for the award of the degree of Master of Science in Data Science, is a bona fide
record of the work carried out by me under the supervision of Dr. Saroj Kumar Dash. I further
declare that the work reported in this thesis has not been submitted and will not be submitted,
either in part or in full, for the award of any other degree or diploma of this institute or of
any other institute or university.
CERTIFICATE
This is to certify that the thesis entitled Enhancing Intrusion Detection in Real-Time
IoT Devices, prepared and submitted by Nalankilli R (Reg. No. 22MDT1038) to Vellore
Institute of Technology, Chennai Campus, in partial fulfillment of the requirement for the award
of the degree of Master of Science in Data Science, is a bona fide record of work carried out
under my guidance. The thesis fulfills the requirements as per the regulations of this University
and in my opinion meets the necessary standards for submission. The contents of this report have
not been submitted and will not be submitted, either in part or in full, for the award of any
other degree or diploma, and the same is certified.
Signature of HOD
Dr. K. Muthunagai
Acknowledgement
With immense pleasure and a deep sense of gratitude, I wish to express my sincere thanks to
my supervisor Dr. Saroj Kumar Dash, Designation, School of Advanced Sciences, Vellore
Institute of Technology (VIT), Chennai; without his motivation and continuous encouragement,
this project would not have been successfully completed.
I am grateful to the Chancellor of VIT, Dr. G. Viswanathan, the Vice Presidents, the Vice
Chancellor and the Pro Vice Chancellor for motivating me to carry out the project at Vellore
Institute of Technology, Chennai.
I express my sincere thanks to Dr. S. Mahalakshmi, Dean, School of Advanced Sciences,
VIT, Chennai and Dr. K. Muthunagai, HOD, Mathematics and Computing, School of Ad-
vanced Sciences, VIT, Chennai for their support and encouragement.
Abstract
The Real-Time Internet of Things 2022 (RT-IoT) dataset is a comprehensive resource carefully
crafted for detecting intrusions in real-time IoT devices using cutting-edge machine learning
algorithms. It is derived from a real-time IoT infrastructure and offers a broad collection of
IoT devices and advanced network attack techniques, representing actual IoT security scenarios.
Its contents combine benign and malevolent network behaviours. Notable Internet of Things
devices such as ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp are highlighted, along with simulated
attack scenarios such as Brute-Force SSH, DDoS using Hping and Slowloris, and Nmap scan
patterns. Using the Flowmeter plugin in conjunction with the Zeek network monitoring tool, the
bidirectional flow features are carefully documented, allowing a comprehensive and in-depth
investigation of network traffic patterns. This work uses the RT-IoT dataset to drive
improvements in Intrusion Detection Systems (IDS) designed for real-time IoT networks,
promoting the development of robust and flexible security solutions through machine learning
classification. Through a thorough analysis of the RT-IoT dataset, this study demonstrates how
well boosting and forest-based machine learning algorithms perform in classifying and
differentiating between benign and malicious network traffic in the ever-changing context of
IoT security. The main goal of this research is to make substantial progress in strengthening
security protocols in real-time IoT environments, protecting the integrity and guaranteeing the
security of both IoT devices and networks.
Keywords: Machine Learning, classification algorithms, Real-time IoT, Intrusions, Network
traffic analysis
Contents
Declaration i
Certificate ii
Acknowledgement iii
Abstract iv
1 Introduction 1
2 Literature Review 5
2.1 Background Study about IoT-based Intrusions . . . . . . . . . . . . . . . . 5
2.2 SMOTE and Class Imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Machine Learning Approaches and Models . . . . . . . . . . . . . . . . . . 8
3 Methodology 10
3.1 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Analysis of Protocol Commands and Intrusions in Data . . . . . . . . . . . . . 12
3.3.1 General Commands for Input . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.2 General Commands for Output . . . . . . . . . . . . . . . . . . . . . . 13
3.3.3 Entering inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.4 Outgoing outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
A Code Used 42
A.1 Data Loading and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.2 SMOTE AND RFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.3 ML model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
References 52
List of Tables
List of Figures
Chapter 1
Introduction
Rapid IoT adoption has transformed a number of industries, including manufacturing, healthcare,
and agriculture. As a result, network connectivity and data exchange capabilities have
significantly increased. Strong cybersecurity measures are vital in IoT ecosystems, as evidenced
by notable breaches like the Verkada incident in 2021, which exposed the live feeds of about
150,000 surveillance cameras, and the concerning case in Florida where an attacker tampered with
the chemical composition of a water treatment facility's supply.
Our research focuses on creating a novel technique for detecting unusual assaults in IoT
network data in order to address these urgent issues. We use machine learning techniques
for anomaly detection, utilising the extensive Real-Time Internet of Things 2022 (RT-IoT)
dataset, which includes both attack and regular traffic data from various devices such as
Amazon-Alexa, MQTT-Temp, ThingSpeak-LED, and Wipro-Bulb. We expect a significant reconstruction
error (RE) for aberrant traffic patterns because the model is trained on benign network data.
• MQTT-Temp is an IoT device that communicates using the lightweight messaging protocol
MQTT (Message Queuing Telemetry Transport). MQTT is well suited to remote monitoring and
control systems since it allows effective data transmission over networks with constrained
bandwidth.
• ThingSpeak is an Internet of Things platform that lets users gather, examine, and present
data from different IoT gadgets. It offers data integration APIs and is frequently utilised
for home automation and environmental monitoring applications.
• The Wipro Bulb is a smart lightbulb that can be operated from a distance with the help of
a smartphone or other linked devices. With capabilities like colour change, scheduling,
and dimming, it can be used with smart lighting solutions.
• Amazon created Alexa, a virtual assistant that can communicate via voice and manage
smart home appliances. It can do things like play music, make alarms, give weather
reports, and use voice commands to operate IoT devices.
Among the many risks that Internet of Things (IoT) devices encounter, Distributed Denial
of Service (DDoS) assaults are particularly dangerous. Webcams and printers are among the
many non-legacy IoT devices that are easily targeted by DDoS attacks, which can result in
the creation of malicious botnets. According to a recent Kaspersky analysis, 2022 saw
a significant increase in highly skilled DDoS attacks, with attack durations reaching
a concerning 3000 minutes. The Secure Shell (SSH) brute-force attack is another common
kind of cyberattack that targets devices that have default passwords that are available via the
SSH protocol. The sophistication of these attacks keeps increasing as new IoT botnets like
RapperBot actively increase their capabilities.
Chapter 2
Literature Review
upper bound estimation strategy. In order to address neural network instability, it also
contains a combinatorial idea of PIs. Neural network parameters are adjusted using a modified
optimisation technique based on symbiotic organisms search to address the complicated and
oscillatory nature of data from electric users. Practical data from a home microgrid is used to
assess the accuracy and performance of the proposed model, proving its usefulness in identifying
and averting data integrity assaults in wireless sensor networks. Tao (2021) [4]
One of the technologies with the fastest global growth is IoT. It completely transforms how
people, machines, and other technologies communicate. IoT security is a serious issue that
arises with this expansion, though. The necessity for strong network security is highlighted
by the growing number of connected devices. The identification of malicious packets requires
the use of an intrusion detection system, or IDS. These systems can develop intelligent models
for identifying harmful data in Internet of Things (IoT) devices by utilising Machine Learning
(ML) algorithms. In order to detect and forecast abnormalities in a dataset of environmental
attributes gathered from sensors in an IoT environment in Bangladesh, this study contrasts
different machine learning techniques. We assess these methods’ performance and suggest a
machine learning model that has a runtime of less than 0.2 seconds and an accuracy of 96.5
percent. Hasan (2022) [5]
and overlapped regions are identified and eliminated using rules based on belief function
theory. Tests performed on synthetic noisy datasets show that our suggestion performs
noticeably better than other widely used oversampling techniques. Eric (2020) [1]
Educational data mining can lead to the development of useful data-driven applications, like
predictive model-based academic achievement prediction or early warning systems in schools.
However, the accuracy of these predictive models may be compromised by the problem of class
imbalance in educational statistics. This is thus because many models are constructed under
the presumption of a forecast class that is balanced. Although earlier research has suggested
a number of approaches to deal with this imbalance, the majority of them have concentrated
on the technical aspects of each approach, with very few addressing real-world applications,
particularly for datasets that vary in their degree of imbalance. Using the High School Lon-
gitudinal Study of 2009 dataset, we analyse various sampling strategies to handle moderate
and extreme degrees of class imbalance in this study. Random undersampling (RUS), random
oversampling (ROS), and a hybrid resampling method that combines RUS and the synthetic
minority oversampling technique for nominal and continuous features (SMOTE-NC) are all
included in our comparison studies. Surina (2023) [9]
The class imbalance issue typically arises when conventional classification methods are un-
able to correctly identify infrequent occurrences or outliers that are present in a collection. The
usefulness of such algorithms in precisely detecting and categorising smaller or underrepre-
sented classes is limited because they are usually optimised to perform better with bigger or
more equally distributed classes. Researchers have proactively presented a range of novel con-
cepts, approaches, and modifications to the current categorization systems in order to address
this inherent issue. This paper’s scope includes a critical assessment of the related difficul-
ties, limitations, and gaps that continue to exist in the existing literature, as well as an in-depth
investigation of the dominant research trends that attempt to address the problem of class
imbalance. This study provides insight into the variety of approaches that have been developed
to improve classification model performance in the face of unequal class distributions. It also
highlights the significance of ongoing improvements and modifications to algorithmic design
in order to lessen the negative effects of class imbalances on predictive accuracy and model
performance. Gosain (2022)[3]
with XGBoost surpassing CatBoost by a small margin. Two datasets are used for the evaluation,
which is supported by train-test and k-fold validation. We also contrast the performance of
XGBoost and CatBoost with those of other traditional classifiers. Sadaf (2023) [7]
The landscape of digital communication is seriously threatened by malware, which can
cause hostile attacks that disrupt network infrastructure and delete important files. Malware
authors have improved their strategies over the last ten years, making it difficult for conven-
tional detection techniques—like signature-based approaches—to keep up with the changing
world of malware. Because of this, conventional techniques have been unable to successfully
identify novel, sophisticated malware strains. As a result, the demand for a reliable and effi-
cient malware detection system that can accurately identify and detect encrypted and hidden
malware is pressing. A possible method for finding hidden and masked malware is machine
learning. This paper presents a system that employs the Random Forest classifier to detect
malware and focuses on analysing malware detection using different machine learning techniques.
At its highest, the reported accuracy was around 98.5 percent. Manzoor (2023) [6]
Chapter 3
Methodology
3.1 Workflow
The RT-IoT2022 dataset contains real-world IoT device data as well as simulated attack
scenarios. These scenarios include popular attack techniques including Nmap patterns, DDoS
attacks with Hping and Slowloris, and Brute-Force SSH attacks. The inclusion of these sim-
ulated attacks in the dataset offers researchers a thorough understanding of the possible risks
that Internet of Things devices may encounter in practical situations.
The Zeek network monitoring tool and the Flowmeter plugin are used by the RT-IoT2022
dataset to record the bidirectional properties of network traffic. With the use of these tech-
nologies, researchers can perform in-depth analysis of network traffic, identifying trends and
abnormalities that might point to a security risk.
All things considered, the RT-IoT2022 dataset provides an extensive and thorough view-
point on the intricate nature of network traffic in IoT contexts. With the use of this information,
researchers can improve the functionality of intrusion detection systems (IDS) and create reli-
able, flexible security solutions for real-time Internet of Things networks.
• Sensor Readings: Instructs to retrieve information from sensors, including light, motion,
temperature, and humidity.
• Configuration Settings: These are commands used to set up parameters on the device,
like thresholds, sampling rates, and network characteristics.
• Control signals: Orders to operate actuators or other devices, such as doors and lights, to
open and close them, etc.
• User inputs: Button presses, voice commands, and other commands entered by users.
• Actuator Control: Instructions to move actuators in response to inputs from the user or
sensor data.
• Data transmission: The process of sending data or commands to other network devices
or a central server.
• Status Updates: Orders to transmit alarms or status updates, such as low battery life or high
temperature, among other things.
• Feedback: Instructions to give users feedback in the form of error alerts, confirmation
messages, etc. These commands are usually sent over a network using communication protocols
such as MQTT, CoAP, or HTTP, depending on the needs and network connectivity of the device.
• Network traffic data: Information pertaining to fictitious attack scenarios, such as DDoS
attacks employing Hping and Slowloris, Nmap patterns, and Brute-Force SSH attacks, to
mimic hostile network activities.
• Configuration Settings: Details about device settings, network setups, and other aspects
that affect the security and functionality of Internet of Things devices.
• Zeek network monitoring tool: The Zeek tool and the Flowmeter plugin were used to record the
bidirectional properties of the network traffic. The results of the analysis included trends,
abnormalities, and other insights.
• Intrusion Detection System (IDS) alerts: IDS systems use the dataset to generate alerts and
notifications that highlight possible security risks and assaults in real-time Internet of
Things networks.
• Security Solutions: The creation and assessment of algorithms and security solutions
with the goal of reducing network attacks and improving the security of IoT settings in
real-time.
These input and output formats are essential for researching and comprehending the in-
tricate structure of network traffic in Internet of Things environments and for creating
security solutions that effectively guard networks and IoT devices from cyberattacks.
6. NMAP_TCP_scan: This Nmap scan finds open ports and services that are using those
ports by sending TCP packets to target ports.
9. NMAP_FIN_SCAN: Nmap scan that checks if target ports are filtered or closed by a
firewall by sending FIN packets to them.
10. MQTT: The Message Queuing Telemetry Transport protocol is a lightweight messaging
protocol that helps devices communicate with one another by allowing publishers and
subscribers to publish and subscribe to messages.
11. Thing_speak: An IoT platform that makes it possible to monitor and manage IoT appli-
cations by allowing users to gather, examine, and visualise data from IoT devices.
13. Amazon-Alexa: A virtual assistant created by Amazon. It can do a number of things, such
as controlling smart home appliances, giving information, and playing music or audiobooks,
using voice recognition and natural language processing.
The dataset has zero null values, which means that all of the data are present and there are no
missing values. Furthermore, as explained below, the dataset has outliers in a number of
columns.
The dataset consists of 83 columns in total, all of which have been examined for outliers.
The results are summarised as follows: outliers are present in 76 of the columns, while no
outliers were found in the remaining 7 columns. The columns with no outliers detected are as
follows:
• fwd_pkts_per_sec
• bwd_pkts_per_sec
• flow_pkts_per_sec
• bwd_URG_flag_count
• flow_CWR_flag_count
• flow_ECE_flag_count
• payload_bytes_per_second
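This per-column examination can be sketched as follows. The 1.5 × IQR (Tukey) fence used here is an assumption, since the text does not state which outlier criterion was applied, and the data below is an illustrative toy frame, not RT-IoT2022:

```python
import numpy as np
import pandas as pd

def columns_with_outliers(df, k=1.5):
    """Return the columns that contain values outside the Tukey fences
    [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual convention."""
    flagged = []
    for col in df.select_dtypes(include=[np.number]).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        if ((df[col] < lo) | (df[col] > hi)).any():
            flagged.append(col)
    return flagged

# toy illustration: one well-behaved column, one with an extreme value
toy = pd.DataFrame({
    "fwd_pkts_per_sec": [1.0, 2.0, 3.0, 4.0, 5.0],
    "flow_duration":    [1.0, 2.0, 3.0, 4.0, 100.0],  # 100.0 is an outlier
})
print(columns_with_outliers(toy))   # → ['flow_duration']
```

Running the same check over all 83 columns would reproduce the 76/7 split reported above.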
A statistical method for handling extreme values in a dataset is winsorizing outliers. Winsoriz-
ing replaces outliers with the closest non-outlier value as opposed to eliminating them. This
method preserves important data points while lessening the impact of outliers on statistical
analysis. Typically, winsorizing involves setting the outlier values to a predetermined dataset
percentile, like the 95th or 99th percentile. In doing so, the extreme values are pushed closer
to the average, strengthening the dataset’s resistance to outliers while maintaining the original
data’s general distribution and properties.
In Winsorizing, the values below the p-th percentile are set to the value at the p-th percentile,
and the values above the (100−p)-th percentile are set to the value at the (100−p)-th percentile.
This can be mathematically represented as:
$$
\text{Winsorized value} =
\begin{cases}
\text{percentile}_{p} & \text{if value} < \text{percentile}_{p} \\
\text{value} & \text{if } \text{percentile}_{p} \le \text{value} \le \text{percentile}_{100-p} \\
\text{percentile}_{100-p} & \text{if value} > \text{percentile}_{100-p}
\end{cases}
$$
where $\text{percentile}_{p}$ is the p-th percentile and $\text{percentile}_{100-p}$ is the
(100 − p)-th percentile of the dataset.
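A minimal sketch of this percentile-capping rule; p = 5 here is an arbitrary choice for illustration, as the text does not fix p:

```python
import numpy as np

def winsorize(values, p=5):
    """Cap values below the p-th percentile and above the (100-p)-th
    percentile at those percentile values, per the rule above."""
    values = np.asarray(values, dtype=float)
    lower = np.percentile(values, p)
    upper = np.percentile(values, 100 - p)
    return np.clip(values, lower, upper)

x = np.arange(101, dtype=float)   # 0, 1, ..., 100
w = winsorize(x, p=5)
print(w.min(), w.max())           # → 5.0 95.0
```

Unlike dropping outlier rows, every observation is retained; only the extreme values are pulled in to the fences.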
The class imbalance issue in the dataset, where there is a notable difference in the number
of occurrences across distinct classes, is a typical machine learning obstacle. The model may
perform worse as a result of this problem if it predicts the overrepresented class more accurately
while ignoring the underrepresented ones. In particular, the class DOS_SYN_Hping stands out
in the dataset presented with a significantly higher number of instances than other classes,
which may bias the model’s predictions in favour of this class and, as a result, jeopardise the
precision and dependability of predictions for the minority classes.
Metasploit_Brute_Force_SSH and NMAP_FIN_SCAN are two of these minority classes,
with only 37 and 28 instances, respectively. It is difficult for the model to correctly learn the
distinctive patterns of these classes due to their low representation, which affects the model’s
ability to effectively categorise cases that belong to these classes. It is crucial to resolve this
imbalance in the class distribution in order to improve the model’s overall performance, since
it may result in reduced classification accuracy, precision, and recall for these minority classes.
If the class disparity is not addressed, it can negatively impact the model’s learning process
and result in predictions that are skewed towards the majority class and ignore the minority
classes. As a result, the model may perform unevenly across classes, performing well in pre-
dicting the majority class but having trouble with the minority ones. Various solutions, such as
resampling approaches, alternative evaluation metrics, and the usage of algorithms intended to
manage class imbalance well, can be used to alleviate the negative impacts of class imbalance.
The intention is to rebalance the class distribution by putting these tactics into practice,
which will allow the model to learn from every class equally and produce more reliable predic-
tions across the board. In order to increase the model’s usefulness and practical applicability in
real-world scenarios, it is imperative to address the issue of class imbalance. This will also en-
sure that the predictions produced by the model are impartial and trustworthy across all classes.
Class Instances
DOS_SYN_Hping 94659
Thing_Speak 8108
ARP_poisoning 7750
MQTT_Publish 4146
NMAP_UDP_SCAN 2590
NMAP_XMAS_TREE_SCAN 2010
NMAP_OS_DETECTION 2000
NMAP_TCP_scan 1002
DDOS_Slowloris 534
Wipro_bulb 253
Metasploit_Brute_Force_SSH 37
NMAP_FIN_SCAN 28
Reducing the number of instances in the overrepresented class to equal the number of in-
stances in the underrepresented class is a technique known as undersampling, which is used to
address class imbalance. Undersampling is used in the code to eliminate occurrences of the class
"DOS_SYN_Hping" from the dataset:
1. Find the indices of all instances belonging to the "DOS_SYN_Hping" class.
2. Select a random subset of these indices to eliminate, and set the subset's size to 84659 (the
required total number of removals).
Through random selection and removal of instances, the "DOS_SYN_Hping" class size was reduced
from its initial count of 94,659 to 10,000 instances. This adjustment brought the
"DOS_SYN_Hping" class much closer in size to the other classes in the dataset.
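The undersampling procedure described above can be sketched as follows; the label column name Attack_type is an assumption, as the actual RT-IoT2022 label field may be named differently:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed for reproducibility

def undersample(df, label_col, cls, keep):
    """Step 1: find the indices of all rows of class `cls`.
    Step 2: randomly choose all but `keep` of them and drop them."""
    idx = df.index[df[label_col] == cls].to_numpy()
    drop = rng.choice(idx, size=len(idx) - keep, replace=False)
    return df.drop(index=drop)

# toy illustration with a heavily overrepresented class
toy = pd.DataFrame(
    {"Attack_type": ["DOS_SYN_Hping"] * 1000 + ["NMAP_FIN_SCAN"] * 28}
)
balanced = undersample(toy, "Attack_type", "DOS_SYN_Hping", keep=100)
print(balanced["Attack_type"].value_counts().to_dict())
# → {'DOS_SYN_Hping': 100, 'NMAP_FIN_SCAN': 28}
```

With the real counts, `keep=10000` would remove the 84,659 excess "DOS_SYN_Hping" rows.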
3.4.4 SMOTE
Synthetic Minority Over-sampling Technique, or SMOTE for short, is a popular technique used
to address class imbalances in machine learning datasets. By producing false data points for the
underrepresented minority classes, this strategy is essential to achieving a more fair allocation
of classes. This effectively addresses the problem of model predictions that are biassed due to
differences in class instances.
It is crucial to understand the basic dataset properties before delving into the details of how
SMOTE functions within the given code. The dataset in question is divided into several classes,
some of which have significantly fewer samples than others. Class imbalances of this kind have
the potential to distort model results, giving majority classes accurate forecasts but minority
classes poor performance.
The SMOTE module from the imblearn library, a Python tool created especially for managing
unbalanced datasets, is included in the code. As an oversampling approach, SMOTE aims to
increase the representation of minority classes by creating synthetic data points within them.
In order to achieve reproducible results, it is imperative that an instance of the SMOTE class
be initialised and configured with a particular random state value, like 42. Synthetic samples
aimed at correcting the class distribution are generated by applying the fit_resample() method
on the feature matrix x and the target variable y.
After SMOTE is applied, the shapes of the feature matrix x and the target variable y change.
As a result, the dataset becomes more balanced, as the number of instances belonging to minority
classes rises to match that of the majority classes.
By comparing the counts of each class in the target variable y before and after the SMOTE
intervention, the effect of SMOTE on the class distribution is closely examined. Interestingly,
there is a large bias in counts prior to SMOTE, substantially favouring the majority classes. All
class counts are equalised after SMOTE, though, suggesting that the dataset has been success-
fully balanced.
SMOTE is important for reasons that go beyond just correcting for class disparities. SMOTE
dramatically improves the predicted accuracy and generalizability of machine learning models
trained on skewed datasets by reducing the disproportionate representation of classes. SMOTE
effectively combats biases that may occur during model training by promoting a more fair
dataset distribution, enabling a more robust and unbiased model performance across different
class instances.
SMOTE is essentially a game-changing technique for managing class imbalances in ma-
chine learning datasets. SMOTE boosts machine learning model performance and reliability
by generating synthetic samples for underrepresented classes, especially in situations when
complicated class imbalances are prevalent.
The algorithm for SMOTE can be described as follows:
1. For each sample in the minority class, find its k nearest neighbors.
2. Randomly select one of the k nearest neighbors and calculate the difference between the
sample and the selected neighbor.
3. Multiply this difference by a random number between 0 and 1, and add it to the sample to
create a new synthetic sample.
4. Repeat this process for each sample in the minority class to generate the desired number of
synthetic samples.
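These four steps can be sketched from scratch as follows. A real pipeline would simply call SMOTE from imblearn; this standalone version only illustrates the neighbour-interpolation idea:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote(X_min, n_new, k=5):
    """Generate n_new synthetic samples for the minority class X_min by
    interpolating between a chosen sample and one of its k nearest
    minority-class neighbours (steps 1-4 above)."""
    n = len(X_min)
    k = min(k, n - 1)
    # step 1: k nearest neighbours inside the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-distance
    nn = np.argsort(d, axis=1)[:, :k]
    out = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)                     # a minority sample
        j = rng.choice(nn[i])                   # step 2: one of its k-NN
        gap = rng.random()                      # step 3: random in [0, 1)
        out[t] = X_min[i] + gap * (X_min[j] - X_min[i])   # step 4
    return out

X_min = rng.random((28, 3))   # e.g. the 28 NMAP_FIN_SCAN rows, 3 toy features
synthetic = smote(X_min, n_new=100)
print(synthetic.shape)        # → (100, 3)
```

Each synthetic point lies on the segment between two real minority samples, so the new data stays inside the minority class's region of the feature space.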
Let X represent the feature matrix of shape (n, m), where n is the number of samples and m
is the number of features. Let y represent the target variable of shape (n, ), containing the class
labels for each sample. Let model denote the classification model used for feature ranking.
Let RFE(X, y, k) represent the Recursive Feature Elimination function, where k is the desired
number of features to select.
1. Initialize the set of selected features S to contain all m features.
2. Repeat until |S| = k:
(a) Train the classification model on the features XS , where XS is the subset of the columns
of X indexed by S.
(b) Compute an importance score for each feature in S from the trained model.
(c) Identify the least important feature f based on the importance scores.
(d) Remove feature f from S.
3. Return the set of selected features S.
The importance scores are typically derived from the model’s coefficients or feature impor-
tances, depending on the type of classification model used (e.g., logistic regression, decision
trees, etc.). The process continues recursively until the desired number of features (k) is se-
lected.
Overall, RFE aims to identify the subset of features that maximizes the classification per-
formance of the model while reducing the dimensionality of the feature space.
The formula for Recursive Feature Elimination (RFE) in a classification problem can be
expressed as follows:
Given a feature matrix X of shape (n, m) and a target variable y of shape (n, ), where n is
the number of samples and m is the number of features, RFE selects a subset of features of size
k that maximizes the classification performance of a given classification model.
Let model denote the classification model used for feature ranking, and let RFE(X, y, k)
represent the Recursive Feature Elimination function.
The RFE algorithm can be summarized by the following formula:
$$S^{*} = \underset{S \subseteq \{1, \dots, m\},\; |S| = k}{\arg\max}\;
\text{Performance}(model(X_S),\, y)$$
where:
• S is the set of selected features.
• XS represents the subset of features of X selected in S.
• Performance(model(XS ), y) is a metric measuring the classification performance of the model
using the selected features XS and the target variable y.
The goal of RFE is to find the subset of features S that maximizes the classification perfor-
mance of the model.
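A compact sketch of this backward-elimination loop. The absolute least-squares coefficients used as the importance score here are a stand-in, since in practice the score would come from the fitted classifier (e.g. a tree model's feature importances or a logistic regression's coefficients):

```python
import numpy as np

def rfe(X, y, k):
    """Repeatedly fit a model on the remaining features and drop the least
    important one until only k features are left, as described above."""
    S = list(range(X.shape[1]))
    while len(S) > k:
        # fit on the current feature subset; |coefficient| = importance
        coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        S.pop(int(np.argmin(np.abs(coef))))   # eliminate least important
    return S

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]   # only features 0 and 1 matter
print(sorted(rfe(X, y, k=2)))       # → [0, 1]
```

In a real pipeline this loop is what sklearn's RFE wrapper performs around an estimator.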
3.6 ML Algorithms
3.6.1 LightGBM
A gradient boosting system called LightGBM makes use of tree-based learning methods. Due
to its efficient and distributed design, it can be used for issues involving high-dimensional
features and massive amounts of data. LightGBM introduces various improvements to the
conventional gradient boosting framework, including:
• Exclusive Feature Bundling (EFB): To cut down on the number of features and boost efficiency,
LightGBM bundles mutually exclusive features together. This is very helpful when working with
high-dimensional data.
• Histogram-based Splitting: LightGBM determines the optimal split points for every feature
using a histogram-based algorithm. It first buckets continuous feature values into discrete
bins, then uses the histogram to choose the ideal split points rather than evaluating every
individual data point.
• Initialisation: LightGBM initialises the model with a single leaf whose value is the overall
mean of the target variable.
• Tree Growth: Trees are grown iteratively by LightGBM. At each iteration it splits into two
child leaves the leaf that yields the greatest loss reduction. This operation is repeated until
either the maximum number of leaves is reached or the loss reduction falls below a
predetermined threshold.
• Leaf-wise Growth: Unlike standard algorithms, which split all leaves at the same level,
LightGBM grows trees leaf-wise, that is, by splitting the leaf that yields the greatest loss
reduction.
• Prediction: To produce a prediction, LightGBM uses the feature values of the input instance
to navigate each tree from the root to a leaf. The predicted value is then derived from the
leaf's value.
Because of its efficient tree-growing technique and capacity to handle sparse data, Light-
GBM is an all-around efficient and effective algorithm for classification tasks, particu-
larly for large-scale datasets with high-dimensional features.
The LightGBM algorithm optimizes the following objective function in each iteration:
$$\text{obj}(\theta) = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \sum_{i=1}^{T} \Omega(f_i)$$
where $\ell$ is the loss function and $\Omega(f_i)$ is a regularisation term penalising the
complexity of the i-th tree. The model's prediction is the sum of the outputs of all $T$ trees:
$$\hat{y}(x) = \sum_{i=1}^{T} f_i(x)$$
where $f_i(x)$ is the prediction of the i-th tree for the data point x.
The algorithm iteratively adds new trees to the model, with each tree minimizing the ob-
jective function by fitting the negative gradient of the loss function. This process continues
until a stopping criterion is met, such as reaching the maximum number of trees or achieving a
minimum improvement in the objective function.
Gradient boosting is a popular machine learning method for classification tasks. It works by sequentially combining several weak learners (often decision trees), with each new learner concentrating on the errors made by its predecessors. The process is guided by an objective function that measures the model's performance and is progressively minimised.
An initial weak learner makes predictions from the input features. The predictions are compared with the actual labels, and the errors are measured by a loss function, such as binary cross-entropy for binary classification or categorical cross-entropy for multi-class classification. The objective is to minimise this loss by adjusting the weak learner's parameters.
In the next step, a new weak learner is added to the ensemble to correct the mistakes of the previous one. It is trained on the residual errors, the discrepancy between predicted and actual values, of the prior learner. The procedure is repeated iteratively, with each new learner concentrating on the errors that remain.
The fundamental principle of gradient boosting is the use of gradients, i.e. derivatives of the loss function with respect to the model's predictions. These gradients indicate the direction in which the predictions should be adjusted to reduce the loss. By iteratively adding learners trained to follow the negative gradient, the model steadily improves.
Gradient boosting is a powerful technique because it can capture complicated, non-linear relationships in the data. To avoid overfitting and attain optimal performance, it is crucial to tune the algorithm's hyperparameters, such as the learning rate and the number of learners.
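The residual-fitting loop described above can be sketched from scratch for squared-error loss, where the negative gradient is simply the residual y - F(x). The data and hyperparameters below are invented for illustration:

```python
# Minimal from-scratch sketch of gradient boosting with squared-error loss.
# Illustrative only; not the thesis configuration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
F = np.full_like(y, y.mean())   # initial constant model
trees = []
for m in range(100):
    residual = y - F                       # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += learning_rate * tree.predict(X)   # F_m = F_{m-1} + gamma * h_m
    trees.append(tree)

mse = np.mean((y - F) ** 2)
print(f"training MSE after boosting: {mse:.4f}")
```

Each iteration implements the update F_m(x) = F_{m-1}(x) + γ h_m(x), with h_m fitted to the current residuals.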
The gradient boosting update can be written as:

F_m(x) = F_{m-1}(x) + \gamma \, h_m(x)

where F_m(x) is the ensemble of weak learners up to iteration m, F_{m-1}(x) is the ensemble up to iteration m-1, \gamma is the learning rate, and h_m(x) is the weak learner fitted at iteration m.
Random Forest is a powerful ensemble learning technique for both regression and classification. It is based on decision trees: a number of trees are constructed, and the aggregate of their individual predictions serves as the final prediction.
Several decision trees in a Random Forest are trained via a method known as bagging
(bootstrap aggregating). In bagging, a decision tree is trained on each of the bootstrapped
samples—random samples with replacement—that are created from the original dataset. This
lessens overfitting and enhances the model’s overall performance by introducing diversity among
the trees.
Random Forest further increases the diversity of the trees through random feature selection, also called feature bagging: at each split, only a random subset of the features is considered. This makes the trees less correlated and more robust, resulting in a more accurate and stable model.
The Random Forest combines the forecasts from each individual tree to create predictions.
The average of each tree’s predictions is the final prediction for regression tasks. When it comes
to classification jobs, the ultimate prediction is decided by a majority vote among all the trees’
predictions.
The capacity of Random Forest to handle big datasets with lots of characteristics and high
dimensionality is one of its main advantages. It is a well-liked option for many machine learn-
ing problems since it is less likely to overfit than individual decision trees.
All things considered, Random Forest is a popular option for many machine learning appli-
cations since it is a flexible and strong algorithm that blends the advantages of decision trees
with ensemble learning.
The mathematical formula for the Random Forest algorithm can be expressed as follows:
Let X be the input feature matrix with m samples and n features, and Y be the correspond-
ing target variable. Random Forest consists of N decision trees, where each tree is built using
a bootstrap sample of the training data:
1. For i = 1 to N : - Sample a bootstrap sample Xi of size m from X with replacement. -
Train a decision tree hi (Xi ) using Xi and Y .
2. To make a prediction for a new sample x, aggregate the predictions of all trees:

\hat{Y}(x) = \frac{1}{N} \sum_{i=1}^{N} h_i(x) \quad \text{(regression)}, \qquad \hat{Y}(x) = \mathrm{mode}\{h_i(x)\}_{i=1}^{N} \quad \text{(classification)}.
The final prediction is the average (for regression) or majority vote (for classification) of
the predictions of all trees.
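The two steps above (bootstrap sampling, then aggregation by majority vote) can be sketched by hand with a few decision trees. Everything here, including the tree count and dataset, is illustrative:

```python
# Hand-rolled sketch of bagging plus majority voting, mirroring steps 1-2.
# Illustrative only; a production model would use RandomForestClassifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

N_TREES = 25
trees = []
for _ in range(N_TREES):
    # Step 1: bootstrap sample of size m, drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' gives the random feature subset at each split
    t = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(t.fit(X[idx], y[idx]))

# Step 2: majority vote over the N trees for each sample
votes = np.stack([t.predict(X) for t in trees])          # (N_TREES, n_samples)
y_hat = np.array([np.bincount(col).argmax() for col in votes.T])
print(f"training accuracy: {(y_hat == y).mean():.3f}")
```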
3.6.4 KNN
• Initialization: The algorithm begins with a training dataset of labelled samples, each consisting of a set of features (attributes) and a class label.
• Distance Calculation: KNN determines the distance between each new data point to be
classified and every other point in the training dataset. While Manhattan distance is one
of the available metrics, Euclidean distance is the most widely used one.
• Finding Neighbours: Using the computed distances, KNN finds the K training points (neighbours) closest to the new point. These neighbours form the basis for classification.
• Majority Voting: KNN classifies a new data point by taking a majority vote among its K neighbours; the point is assigned to the class that occurs most frequently among them.
• Decision Boundary: KNN saves all training instances and their labels instead of learning
a model explicitly. The distribution of the training data establishes the non-linear decision
boundary in a KNN.
KNN is a flexible technique that works well with situations involving binary and multi-
class classification. Because it is easy to use and comprehend, it is a well-liked option for
those new to machine learning. It can, however, be computationally expensive, particu-
larly for large datasets, since all training instances must have their distances calculated
and stored.
The KNN decision rule can be written as:

\hat{y}_q = \arg\max_{y} \sum_{i=1}^{K} I(y_i = y)

where \hat{y}_q is the predicted class label for x_q, y_i is the class label of the i-th nearest neighbour, and I(y_i = y) is an indicator function that is 1 if y_i = y and 0 otherwise.
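The distance, neighbour-finding, and voting steps can be sketched on a tiny invented dataset; K and the points below are illustrative only:

```python
# Minimal KNN sketch: Euclidean distance, K nearest neighbours, majority vote.
# The training points and K are invented for illustration.
import numpy as np

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x_q, K=3):
    # Distance from the query to every training point
    dists = np.linalg.norm(X_train - x_q, axis=1)
    # Indices of the K nearest neighbours
    nearest = np.argsort(dists)[:K]
    # Majority vote among the neighbours' labels
    return np.bincount(y_train[nearest]).argmax()

print(knn_predict(np.array([0.5, 0.5])))  # near the first cluster  -> 0
print(knn_predict(np.array([5.5, 5.5])))  # near the second cluster -> 1
```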
3.6.5 Logistic Regression
Logistic regression is a basic and popular statistical technique for binary classification problems. Unlike linear regression, which predicts continuous values, logistic regression forecasts the probability that an instance belongs to a particular class. It is especially useful when the dependent variable is binary (e.g., 0 or 1, true or false).
The logistic function, sometimes referred to as the sigmoid function, is used by the logistic
regression model to translate the linear combination of the input data into a probability score
that ranges from 0 to 1. The definition of the logistic function is:
\sigma(z) = \frac{1}{1 + e^{-z}}
where z = β0 + β1 x1 + β2 x2 + . . . + βn xn is the linear combination of input features and
coefficients, and β0 , β1 , . . . , βn are the model parameters to be learned from the training data.
Finding the ideal values for the coefficients that maximise the likelihood of the observed
data is the goal of logistic regression during the training phase. Usually, optimisation tech-
niques like gradient descent are used for this. After being trained, the model can use the fol-
lowing formula to forecast the likelihood that a new instance would belong to the positive class
(such as class 1):
P (y = 1|x) = σ(β0 + β1 x1 + β2 x2 + . . . + βn xn )
where P (y = 1|x) is the probability of the positive class given the input features x.
Logistic regression employs a decision threshold (often 0.5) to generate a binary prediction.
The occurrence is classed as belonging to the positive class if the anticipated probability is
higher than the threshold; if not, it is classified as belonging to the negative class.
Because of its simplicity, interpretability, and efficacy in binary classification problems,
logistic regression is widely utilised in a variety of industries, including healthcare (e.g., fore-
casting illness risk), marketing (e.g., customer churn prediction), and finance (e.g., credit risk
assessment).
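The sigmoid mapping and the 0.5 decision threshold can be demonstrated directly; the coefficients below are invented for illustration, not fitted values:

```python
# Sketch of the logistic-regression decision rule with hypothetical
# (not fitted) parameters beta0 and beta.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta = -1.0, np.array([2.0, -0.5])   # hypothetical parameters
x = np.array([1.5, 1.0])                    # one input instance

p = sigmoid(beta0 + beta @ x)               # P(y = 1 | x)
label = int(p > 0.5)                        # threshold at 0.5
print(f"P(y=1|x) = {p:.3f}, predicted class = {label}")
```

Here z = -1.0 + 2.0·1.5 - 0.5·1.0 = 1.5, so the predicted probability is σ(1.5) ≈ 0.818 and the instance is assigned to the positive class.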
3.6.6 AdaBoost
AdaBoost’s main concept is to train a sequence of weak classifiers on the same dataset in
a sequential fashion, with each new classifier placing greater emphasis on the cases that the
earlier classifiers misclassified. This is accomplished by changing the training instance weights
so that the incorrectly identified examples receive larger weights in the ensuing training cycles.
Put differently, AdaBoost is very good at managing imbalanced datasets because it concentrates
more on the challenging cases.
During training, each weak classifier is weighted according to its accuracy, with more accurate classifiers receiving larger weights. The final classification aggregates the weak classifiers' predictions as a weighted sum using these assigned weights, and the class label is determined by this weighted majority vote.
AdaBoost has gained recognition for its ease of use and potency in enhancing classification
results, particularly in contrast to standalone weak classifiers. It is, however, susceptible to
outliers and noisy data, as these can adversely affect the performance of the weak classifiers
and, in turn, the AdaBoost method as a whole.
The final AdaBoost classifier can be expressed as:

F(x) = \sum_{t=1}^{T} \alpha_t f_t(x)

where F(x) is the final classification function, T is the total number of weak classifiers, \alpha_t is the weight assigned to the weak classifier f_t(x), and f_t(x) is a weak classifier trained to classify the data.
This formula represents the linear combination of weak classifiers weighted by αt to form a
strong classifier. Each weak classifier is trained sequentially, with the weights αt being adjusted
to focus on incorrectly classified samples in subsequent iterations.
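The sequential stump-boosting scheme can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision stump. The data and T = 50 are illustrative assumptions:

```python
# Hedged sketch: AdaBoost over decision stumps (the default weak learner).
# Dataset and n_estimators (T) are illustrative, not the thesis setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# n_estimators is T, the number of weak classifiers f_t combined with
# weights alpha_t into the final classifier F(x).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```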
1. Precision: Precision is the number of true positive results divided by the number of all positive results returned.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

2. Recall (Sensitivity or True Positive Rate): Recall is the number of true positive results divided by the number of positive results that should have been returned.

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

3. F1 Score: The F1 score is the harmonic mean of precision and recall, giving a single score that balances both measures.

\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

4. Accuracy: Accuracy is the ratio of correctly predicted observations to the total observations.

\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Population}}
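As a quick worked example, the four metrics can be computed from invented confusion-matrix counts (the numbers below are illustrative, not results from the thesis):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)                       # 80 / 90
recall    = tp / (tp + fn)                       # 80 / 100
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + fp + fn + tn)      # 170 / 200

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# -> 0.889 0.8 0.842 0.85
```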
Chapter 4
4.1.1 Undersampling
Undersampling involves reducing the number of instances in the overrepresented class to match
the number of instances in the underrepresented class. In this case, undersampling was used to
reduce the instances of the "DOS_SYN_Hping" class from 94659 to 10000. This process helps
balance the class distribution, which can improve the performance of machine learning models,
particularly in handling imbalanced datasets.
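This undersampling step can be sketched in pandas. Only the majority-class counts (94659 down to 10000) and the Attack_type column name come from the thesis; the DataFrame itself and the feature column are synthetic:

```python
# Sketch of undersampling the majority class to 10000 rows.
# The DataFrame below is synthetic; only the class counts mirror the thesis.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Attack_type": ["DOS_SYN_Hping"] * 94659 + ["Normal"] * 8000,
    "feature": rng.normal(size=94659 + 8000),   # illustrative feature column
})

# Keep a random sample of 10000 majority-class rows, all other rows unchanged
majority = df[df["Attack_type"] == "DOS_SYN_Hping"].sample(
    n=10000, random_state=42)
rest = df[df["Attack_type"] != "DOS_SYN_Hping"]
balanced = pd.concat([majority, rest])
print(balanced["Attack_type"].value_counts())
```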
Chapter 4. Results and Discussions
Description Value
Before OverSampling, the shape of X (38458, 83)
Before OverSampling, the shape of y (38458,)
Before OverSampling, counts of label ’0’ 7750
Before OverSampling, counts of label ’1’ 534
Before OverSampling, counts of label ’2’ 10000
Before OverSampling, counts of label ’3’ 4146
Before OverSampling, counts of label ’4’ 37
Before OverSampling, counts of label ’5’ 28
Before OverSampling, counts of label ’6’ 2000
Before OverSampling, counts of label ’7’ 1002
Before OverSampling, counts of label ’8’ 2590
Before OverSampling, counts of label ’9’ 2010
Before OverSampling, counts of label ’11’ 8108
Before OverSampling, counts of label ’12’ 253
Description Value
After OverSampling, the shape of X (120000, 83)
After OverSampling, the shape of y (120000,)
After OverSampling, counts of label ’0’ 10000
After OverSampling, counts of label ’1’ 10000
After OverSampling, counts of label ’2’ 10000
After OverSampling, counts of label ’3’ 10000
After OverSampling, counts of label ’4’ 10000
After OverSampling, counts of label ’5’ 10000
After OverSampling, counts of label ’6’ 10000
After OverSampling, counts of label ’7’ 10000
After OverSampling, counts of label ’8’ 10000
After OverSampling, counts of label ’9’ 10000
After OverSampling, counts of label ’11’ 10000
After OverSampling, counts of label ’12’ 10000
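The counts above are produced by SMOTE, which creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The mechanism can be sketched in NumPy; this is an illustration of the interpolation idea only, not the imbalanced-learn implementation used in the experiments, and the toy minority cloud is invented:

```python
# NumPy sketch of the SMOTE interpolation idea (not the imblearn library).
# The minority "cloud" and k are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
minority = rng.normal(loc=5.0, scale=1.0, size=(20, 3))  # toy minority class

def smote_sample(X, k=5):
    i = rng.integers(len(X))
    x = X[i]
    # k nearest minority neighbours of x (index 0 is x itself, so skip it)
    d = np.linalg.norm(X - x, axis=1)
    neighbours = np.argsort(d)[1:k + 1]
    nb = X[rng.choice(neighbours)]
    gap = rng.uniform()              # random position on the segment x -> nb
    return x + gap * (nb - x)        # synthetic sample between x and nb

new_points = np.array([smote_sample(minority) for _ in range(30)])
print(new_points.shape)              # 30 synthetic minority samples
```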
KNN, with an accuracy of 86.18%, showed decent performance but lagged behind the tree-
based models. This can be attributed to KNN’s sensitivity to irrelevant features and its reliance
on distance metrics, which may not always be suitable for high-dimensional data.
Both Logistic Regression and AdaBoost exhibited lower accuracies of 54.78% and 55.63%,
respectively. These models might have struggled due to the complexity of the dataset, as they
are less adept at capturing non-linear relationships compared to ensemble methods like Light-
GBM and Gradient Boosting.
In conclusion, LightGBM emerged as the top-performing model due to its efficient handling
of high-dimensional data and its ability to capture complex patterns in the dataset. Gradient
Boosting and Random Forest also showed strong performances, highlighting the effectiveness
of ensemble learning techniques in classification tasks. Conversely, KNN, Logistic Regression,
and AdaBoost exhibited lower accuracies, suggesting that they may not be as suitable for this
particular classification problem.
Algorithm Accuracy
LightGBM 0.9748
Gradient Boosting 0.9690
Random Forest 0.9635
KNN 0.8618
Logistic Regression 0.5478
AdaBoost 0.5563
Chapter 5
5.1 Findings
This project focused on classifying IoT network traffic using boosting and forest-based machine
learning algorithms on the Real-Time Internet of Things 2022 dataset. The goal was to enhance
IoT security by accurately classifying normal patterns and attack types.
Several machine learning algorithms were employed and evaluated for this classification
task. The algorithms included LightGBM, Gradient Boosting, Random Forest, KNN, Logistic
Regression, and AdaBoost. These algorithms were chosen for their effectiveness in handling
classification problems and their suitability for this particular dataset.
The evaluation metrics used to assess the performance of the models included accuracy,
precision, recall, F1 score, and area under the ROC curve. These metrics provide a comprehen-
sive view of how well the models are performing in terms of both overall accuracy and their
ability to correctly classify different classes of network traffic.
The results of the project showed that LightGBM performed the best among the algorithms,
achieving an accuracy of 97.48%. This superior performance can be attributed to LightGBM’s
ability to handle large datasets efficiently and its capability to deal with high-dimensional data
effectively. Gradient Boosting and Random Forest also showed strong performances, with
accuracies of 96.90% and 96.35%, respectively.
Chapter 5. Summary and Future Work
On the other hand, KNN, Logistic Regression, and AdaBoost exhibited lower accuracies,
suggesting that they may not be as suitable for this particular classification problem. These
algorithms struggled due to the complexity of the dataset and their limitations in capturing non-
linear relationships compared to ensemble methods like LightGBM and Gradient Boosting.
In conclusion, the project successfully developed and evaluated machine learning models
for classifying network traffic in IoT environments. The results highlight the effectiveness of
ensemble learning techniques, particularly LightGBM, in handling complex classification tasks
in IoT environments. These findings can be valuable for enhancing the security of IoT devices
and networks by accurately detecting and classifying different types of network traffic.
Appendix A
Code Used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r'C:\Users\nalan\Documents\riot22.csv')
print(df.shape)
# C:\Users\nalan\Documents\riot22.csv
df.drop(columns=['Unnamed: 0'], inplace=True)
df.duplicated().sum()
duplicate_rows = df[df.duplicated()]
ddf = pd.DataFrame(duplicate_rows)
ddf.head(25)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
duplicate_counts = df.duplicated().sum()
ddf = pd.DataFrame({'Column': df.columns, 'Duplicate Count': duplicate_counts})
ddf.head()  # cat add encod
for i in df.columns:
    if df[i].dtypes == 'object':
        print(i)
        print()
        print('the values are :')
        print(df[i].value_counts())
        print()
        print()
unique_attacks = df['Attack_type'].unique()
print(unique_attacks)  # 9 malwares
unique_attacks = df['Attack_type'].value_counts()
print(unique_attacks)
bad_attacks = ['ARP_poisioning', 'DDOS_Slowloris', 'DOS_SYN_Hping', 'Meta
# Get the value counts for Attack_type in the filtered DataFrame
print(bad_attacks_counts)
for column in df.columns:
    null_count = df[column].isnull().sum()
    print(f"Column '{column}' has {null_count} null values.")

import matplotlib.pyplot as plt
# Filter numerical columns
numerical_columns = df.select_dtypes(include=['number']).columns
# Create box plots for each numerical column
for column in numerical_columns:
    plt.figure(figsize=(8, 6))
    df.boxplot(column=[column])
    plt.title(f'Box plot for {column}')
    plt.ylabel('Value')
    plt.show()

def detect_outliers_iqr(data):
    # Calculate Q1 (25th percentile) of the given data
    Q1 = data.quantile(0.25)
    # Calculate Q3 (75th percentile) of the given data
    Q3 = data.quantile(0.75)
    # Calculate IQR (Interquartile Range)
    IQR = Q3 - Q1
    # Calculate lower bound
    lower_bound = Q1 - 1.5 * IQR
    # Calculate upper bound
    upper_bound = Q3 + 1.5 * IQR
    # Find outliers
    outliers = (data < lower_bound) | (data > upper_bound)
    return outliers

# Iterate over each numerical column and detect outliers
for column in numerical_columns:
    outliers = detect_outliers_iqr(df[column])
    print(f"Column '{column}': {outliers.sum()} outliers detected.")

# outlier removal
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
for column in numeric_columns:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
import pandas as pd

outliers_columns = [
    'id.resp_p', 'flow_duration', 'fwd_pkts_tot', 'bwd_pkts_tot', 'fwd_da
    'down_up_ratio', 'fwd_header_size_tot', 'fwd_header_size_min', 'fwd_h
    'bwd_header_size_min', 'bwd_header_size_max',
    'fwd_PSH_flag_count', 'bwd_PSH_flag_count', 'flow_ACK_flag_count',
    'fwd_pkts_payload.min', 'fwd_pkts_payload.max', 'fwd_pkts_payload.to
    'fwd_pkts_payload.avg', 'fwd_pkts_payload.std', 'bwd_pkts_payload.min
    'bwd_pkts_payload.tot', 'bwd_pkts_payload.avg', 'bwd_pkts_payload.std
    'flow_pkts_payload.max', 'flow_pkts_payload.tot', 'flow_pkts_payload.
    'fwd_iat.min', 'fwd_iat.max', 'fwd_iat.tot', 'fwd_iat.avg', 'fwd_iat.
    'bwd_iat.max', 'bwd_iat.tot', 'bwd_iat.avg', 'bwd_iat.std', 'flow_iat
    'flow_iat.tot', 'flow_iat.avg', 'flow_iat.std', 'payload_bytes_per_sec
    'bwd_subflow_pkts', 'fwd_subflow_bytes', 'bwd_subflow_bytes', 'fwd_bu
    'fwd_bulk_packets', 'bwd_bulk_packets', 'fwd_bulk_rate', 'bwd_bulk_ra
    'active.tot', 'active.avg', 'active.std', 'idle.min', 'idle.max', 'id
    'fwd_init_window_size', 'bwd_init_window_size', 'fwd_last_window_size
]

dfout = df[outliers_columns].copy()
dfout
for col in dfout.columns:
    Q1 = dfout[col].quantile(0.25)
    Q3 = dfout[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Clip each value to the IQR bounds (winsorizing)
    dfout[col] = dfout[col].apply(
        lambda x: lower_bound if x < lower_bound
        else (upper_bound if x > upper_bound else x))
dfout
# Winsorizing outliers (replacing outliers with the nearest non-outlier value)
import pandas as pd
import numpy as np

# Assuming 'Attack_type' is the column containing the class labels
# Assuming 'DOS_SYN_Hping' is the class label to remove
# Find indices of rows with 'DOS_SYN_Hping'
indices_to_remove = df[df['Attack_type'] == 'DOS_SYN_Hping'].index
# Save the filtered DataFrame for further processing
df.to_csv('filtered_data.csv', index=False)

from sklearn.preprocessing import LabelEncoder
# Assuming 'df' is your DataFrame and non_numerical_columns is a list of
non_numerical_columns = df.select_dtypes(exclude=['float64', 'int64']).columns
label_encoders = {}
for feature in non_numerical_columns:
    label_encoders[feature] = LabelEncoder()
    df[feature] = label_encoders[feature].fit_transform(df[feature])
x = df.drop(['Attack_type'], axis=1)
y = df['Attack_type']

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x, y.ravel())
x_res = pd.DataFrame(x_res)
# Renaming column name of Target variable
y_res = pd.DataFrame(y_res)
y_res.columns = ['Attack_type']
df = pd.concat([x_res, y_res], axis=1)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
selected_features = x.columns[rfe.support_]
print(selected_features)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Split the resampled data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_s

# Logistic Regression
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)

# XGBoost
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

from sklearn.metrics import accuracy_score
# Predict using each model
lr_pred = lr_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
dt_pred = dt_model.predict(X_test)

# Calculate accuracy for each model
lr_accuracy = accuracy_score(y_test, lr_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)
rf_accuracy = accuracy_score(y_test, rf_pred)
dt_accuracy = accuracy_score(y_test, dt_pred)
References
[1] Fares Grina, Zied Elouedi, and Eric Lefevre. A preprocessing approach for class-
imbalanced data using smote and belief function theory. In Cesar Analide, Paulo No-
vais, David Camacho, and Hujun Yin, editors, Intelligent Data Engineering and Automated
Learning – IDEAL 2020, pages 3–11, Cham, 2020. Springer International Publishing.
[2] Ashish Kumar Jha, Raja Muthalagu, and Pranav M. Pawar. Intelligent phishing website de-
tection using machine learning. Multimedia Tools and Applications, 82(19):29431–29456,
2023.
[3] Prabhjot Kaur and Anjana Gosain. Issues and challenges of class imbalance problem in
classification. International Journal of Information Technology, 14(1):539–545, 2022.
[4] Abdollah Kavousi-Fard, Wencong Su, and Tao Jin. A machine-learning-based cyber at-
tack detection model for wireless sensor networks in microgrids. IEEE Transactions on
Industrial Informatics, 17(1):650–658, 2021.
[5] Saadat Hasan Khan, Aritro Roy Arko, and Amitabha Chakrabarty. Anomaly Detection in
IoT Using Machine Learning, pages 237–254. Springer International Publishing, Cham,
2022.
[6] Mohsin Manzoor and Bhavna Arora. Framework for detection of malware using random
forest classifier. In Yashwant Singh, Chaman Verma, Illés Zoltán, Jitender Kumar Chhabra,
and Pradeep Kumar Singh, editors, Proceedings of International Conference on Recent
Innovations in Computing, pages 727–740, Singapore, 2023. Springer Nature Singapore.
[7] Kishwar Sadaf. Phishing website detection using xgboost and catboost classifiers. In 2023
International Conference on Smart Computing and Application (ICSCA), pages 1–6, 2023.
[8] C. C. Sobin. A survey on architecture, protocols and challenges in iot. Wireless Personal
Communications, 112(3):1383–1429, 2020.
[9] Tarid Wongvorachan, Surina He, and Okan Bulut. A comparison of undersampling, over-
sampling, and smote methods for dealing with imbalanced classification in educational
data mining. Information, 14(1), 2023.