Table of Content
INTRODUCTION
The advancement of fifth-generation (5G) mobile communication technology
has led to the diversification of access environments and the establishment of
distributed networks, enabling the transmission of various types of data through
network systems. These data, originating from sensors, computers, and the
Internet of Things (IoT), are now processed more efficiently due to the
expanded capacity of network systems. However, the increased diversity of
access points has also expanded the attack surface, making network systems
more susceptible to potential threats. Furthermore, cyber-attack techniques
have evolved to become more intricate and frequent, underscoring the critical
importance of cybersecurity. Consequently, numerous studies are actively
being conducted to mitigate potential network threats. A key challenge in
cybersecurity lies in the identification of network threats, with various findings
emerging in the realm of network intrusion detection systems (NIDSs). Recent
studies have predominantly focused on integrating artificial intelligence (AI)
technology into NIDS, resulting in significant advancements in performance.
Initially, research efforts concentrated on applying traditional machine learning
models like decision trees (DTs) and support vector machines (SVMs) to
existing intrusion detection systems, with current emphasis shifting towards
deep learning methodologies such as convolutional neural networks (CNNs),
long short-term memory (LSTM), and autoencoders. While these approaches
have shown promising results in anomaly detection, challenges persist in their
practical deployment in real-world systems.
The majority of network flow data is typically normal traffic, with rare
occurrences of malicious behavior that can lead to service failure. Furthermore,
within the realm of malicious behavior, most of the data consists of well-
known attacks, while specific types of attacks are exceptionally uncommon.
This data imbalance poses a challenge for AI models deployed in Network
Intrusion Detection Systems (NIDS), as they struggle to adequately learn the
characteristics of specific network threats. Consequently, this can leave
network systems vulnerable to attacks due to poor detection performance.
1.1 PROBLEM STATEMENT
Most of the techniques used in modern IDSs cannot cope with the
flexible and complex environment of Internet attacks on computer networks.
Deep learning methods offer acceptable computation and communication
costs.
This study describes how deep learning behaves when used to identify intruders,
which is helpful in preventing intrusions from several kinds of related attacks.
The model can also achieve real-time intrusion identification based on dimensionality
reduction and a simple classifier. This study focuses on the following points:
• Selecting the appropriate algorithm for each task depending on the
data types, data size, network behavior, and requirements.
• Combining several processing techniques for data acquisition, analysis, modeling, and
feature engineering in a well-chosen order to obtain the best
accuracy with a small data representation size.
1.3 SCOPE OF THE PROJECT
CHAPTER 2
LITERATURE SURVEY
ABSTRACT:
Networks play important roles in modern life, and cyber security has
become a vital research area. An intrusion detection system (IDS), which is an
important cyber security technique, monitors the state of software and
hardware running in the network. Despite decades of development, existing
IDSs still face challenges in improving the detection accuracy, reducing the
false alarm rate and detecting unknown attacks. To solve the above problems,
many researchers have focused on developing IDSs that capitalize on machine
learning methods. Machine learning methods can automatically discover the
essential differences between normal data and abnormal data with high
accuracy. In addition, machine learning methods have strong generalizability,
so they are also able to detect unknown attacks. Deep learning is a branch of
machine learning, whose performance is remarkable and has become a research
hotspot. This survey proposes a taxonomy of IDS that takes data objects as the
main dimension to classify and summarize machine learning-based and deep
learning-based IDS literature. We believe that this type of taxonomy
framework is fit for cyber security researchers. The survey first clarifies the
concept and taxonomy of IDSs. Then, the machine learning algorithms
frequently used in IDSs, metrics, and benchmark datasets are introduced. Next,
combined with the representative literature, we take the proposed taxonomic
system as a baseline and explain how to solve key IDS issues with machine
learning and deep learning techniques. Finally, challenges and future
developments are discussed by reviewing recent representative studies.
Merits:
Demerits:
ABSTRACT:
Merits:
Demerits:
ABSTRACT:
After the digital revolution, large quantities of data have been generated
with time through various networks. The networks have made the process of
data analysis very difficult by detecting attacks using suitable techniques.
While Intrusion Detection Systems (IDSs) secure resources against threats,
they still face challenges in improving detection accuracy, reducing false alarm
rates, and detecting the unknown ones. This paper presents a framework to
integrate data mining classification algorithms and association rules to
implement network intrusion detection. Several experiments have been
performed and evaluated to assess various machine learning classifiers based
on the KDD99 intrusion dataset. Our study focuses on several data mining
algorithms such as naïve Bayes, decision trees, support vector machines,
decision tables, k-nearest neighbor algorithms, and artificial neural networks.
Moreover, this paper is concerned with the association process in creating
attack rules to identify those in the network audit data, by utilizing a KDD99
dataset anomaly detection. The focus is on false negative and false positive
performance metrics to enhance the detection rate of the intrusion detection
system. The implemented experiments compare the results of each algorithm
and demonstrate that the decision tree is the most powerful algorithm as it has
the highest accuracy (0.992) and the lowest false positive rate (0.009).
Merits:
Demerits:
ABSTRACT:
Merits:
Demerits:
ABSTRACT:
Merits:
Demerits:
ABSTRACT:
Merits: Introduction of a novel AI-based NIDS utilizing generative adversarial
networks to resolve data imbalance and enhance detection performance.
ABSTRACT:
Merits:
Demerits:
ABSTRACT:
Merits: Utilization of deep learning architectures for adaptive network
intrusion detection, enhancing detection capabilities against evolving threats.
ABSTRACT:
Merits:
Demerits:
ABSTRACT:
Cyber attacks are a very common issue in the modern world, and
since there is a growing array of challenges in accurately detecting intrusion,
this results in damage to security services, i.e. confidentiality, integrity, and
availability of data. Attackers devise new types of attacks day by day, so
each type of attack should first be analyzed properly with the help of an IDS in order to
prevent it and respond correctly. Signature-based (SIDS) and
anomaly-based (AIDS) intrusion detection systems are separate methods proposed for intrusion
detection to manage security threats. This paper has reviewed numerous deep
learning algorithms that have been proposed to detect intrusion, i.e.,
Convolutional Neural Network, Recurrent Neural Network, Restricted
Boltzmann Machine, Deep Belief Network and Autoencoder. It is designed to
support the choice of an IDS approach based on a deep learning (DL) algorithm by
comparing the literature and by providing expertise in both intrusion
detection and deep learning algorithms.
Merits:
Demerits:
Potential challenges in selecting the most suitable deep learning
algorithm and fine-tuning parameters for optimal performance in different
network environments.
CHAPTER 3
SYSTEM ANALYSIS
network layer, it lacks the capability to detect application layer intrusions due
to its lack of a state concept, making it vulnerable to HTTP Post Flooding
attacks. To address these limitations, a machine learning approach is
recommended.
3.4 PROPOSED SYSTEM
The proposed framework introduces a cutting-edge network intrusion
detection system that utilizes Deep learning techniques. It combines various
machine learning models such as ANN, CNN, and LSTM to improve the
identification of threats. The main objective of this system is to harness the
strengths of multiple algorithms and address the weaknesses of individual
models. By leveraging Deep learning, it aims to enhance detection accuracy,
adaptability to evolving threats, and resilience against adversarial attacks. The
system prioritizes ensemble diversity and consensus decision-making to
minimize false positives and effectively handle complex network behaviors.
Ultimately, the goal is to develop a robust, versatile, and collaborative system
that can proactively identify and counter emerging cyber threats in intricate
network environments.
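The consensus step described above can be sketched as follows. The source does not state which consensus rule is used, so simple majority voting is assumed here, and the per-model predictions are hypothetical values chosen for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most models."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-model outputs for three network flows (not real results).
ann_pred = ["normal", "attack", "attack"]
cnn_pred = ["normal", "normal", "attack"]
lstm_pred = ["attack", "normal", "attack"]

# Combine the three models flow by flow.
consensus = [majority_vote(votes) for votes in zip(ann_pred, cnn_pred, lstm_pred)]
print(consensus)
```

With an odd number of models and two classes there is always a strict majority, so a single dissenting model cannot raise an alarm on its own.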
3.5 PROPOSED SYSTEM ADVANTAGES
Improved Precision: Deep learning integrates multiple models to enhance
detection accuracy.
Enhanced Resilience: The combination of diverse models helps to reduce
individual vulnerabilities, thus enhancing the overall system's robustness.
Minimized Overfitting: Ensemble methods are frequently used to prevent
overfitting, thereby improving the system's ability to generalize.
Superior Adaptability: These models are adept at adjusting to emerging
threats by utilizing a variety of perspectives for comprehensive threat
detection.
CHAPTER 4
SOFTWARE SPECIFICATION
These are the requirements for carrying out the project. Without these
tools and software, the project cannot be completed. The requirements fall into
two groups:
1. Hardware Requirements.
2. Software Requirements.
SYSTEM REQUIREMENTS
The hardware requirements may serve as the basis for a contract for the
implementation of the system and should therefore be a complete and
consistent specification of the whole system. They are used by software
engineers as the starting point for the system design. It shows what the system
does and not how it should be implemented.
PROCESSOR : Intel I5
RAM : 4GB
HARD DISK : 500 GB
4.2 SOFTWARE REQUIREMENTS
PROGRAMMING LANGUAGE : Python
ANACONDA
The big difference between Conda and the pip package manager is in how
package dependencies are managed, which is a significant challenge for Python
data science and the reason Conda exists. Pip installs all Python package
dependencies required, whether or not those conflict with other packages you
installed previously.
So your working installation of, for example, Google Tensorflow, can suddenly
stop working when you pip install a different package that needs a different
version of the Numpy library. More insidiously, everything might still appear
to work but now you get different results from your data science, or you are
unable to reproduce the same results elsewhere because you didn't pip install in
the same order.
Conda analyzes your current environment, everything you have installed, any
version limitations you specify (e.g. you only want tensorflow >= 2.0) and
figures out how to install compatible dependencies. Or it will tell you that what
you want can't be done. Pip, by contrast, will just install the thing you wanted
and any dependencies, even if that breaks other things. Open source packages
can be individually installed from the Anaconda repository, Anaconda Cloud
(anaconda.org), or your own private repository or mirror, using the conda
install command. Anaconda Inc compiles and builds all the packages in the
Anaconda repository itself, and provides binaries for Windows 32/64 bit, Linux
64 bit and MacOS 64-bit. You can also install anything on PyPI into a Conda
environment using pip, and Conda knows what it has installed and what pip has
installed. Custom packages can be made using the conda build command, and
can be shared with others by uploading them to Anaconda Cloud, PyPI or other
repositories. The default installation of Anaconda2 includes Python 2.7 and
Anaconda3 includes Python 3.7. However, you can create new environments
that include any version of Python packaged with Conda. Anaconda Navigator
is a desktop Graphical User Interface (GUI) included in Anaconda distribution
that allows users to launch applications and manage conda packages,
environments and channels without using command-line commands. Navigator
can search for packages on Anaconda Cloud or in a local Anaconda Repository,
install them in an environment, run the packages and update them. It is
available for Windows, macOS and Linux.
JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glueviz
Orange
Rstudio
Visual Studio Code
Microsoft .NET is a set of Microsoft software technologies for rapidly building
and integrating XML Web services, Microsoft Windows-based applications,
and Web solutions. The .NET Framework is a language-neutral platform for
writing programs that can easily and securely interoperate. There’s no language
barrier with .NET: there are numerous languages available to the developer
including Managed C++, C#, Visual Basic and Java Script. The .NET
framework provides the foundation for components to interact seamlessly,
whether locally or remotely on different platforms. It standardizes common
data types and communications protocols so that components created in
different languages can easily interoperate.
“.NET” is also the collective name given to various software components built
upon the .NET platform. These will be both products (Visual Studio.NET and
Windows.NET Server, for instance) and services (like Passport, .NET My
Services, and so on).
Easy to code
Free and Open Source
Object-Oriented Language
GUI Programming Support
High-Level Language
Extensible feature
Python is Portable language
Python is Integrated language
Interpreted
Large Standard Library
Dynamically Typed Language
PYTHON
1. Easy to code:
Python is a high-level programming language. It is very easy to learn
compared with other languages such as C, C#, JavaScript, and Java. It is
very easy to write code in Python, and anybody can learn the basics of Python in
a few hours or days. It is also a developer-friendly language.
3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python
supports object-oriented concepts such as classes, objects, and
encapsulation.
5. High-Level Language:
Python is a high-level language. When we write programs in python, we do not
need to remember the system architecture, nor do we need to manage the
memory.
6. Extensible feature:
Python is an extensible language. We can write parts of our Python code in C or
C++ and compile that code with a C/C++ compiler.
9. Interpreted Language:
Python is an interpreted language because Python code is executed line by
line. Unlike languages such as C, C++, and Java, there is no separate compile step for
Python code, which makes it easier to debug. The source code of Python
is converted into an intermediate form called bytecode.
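The bytecode mentioned above can be inspected with the standard-library dis module. This is a small illustration only; the function add_one is a hypothetical example, and the exact instruction names differ between Python versions.

```python
import dis

def add_one(x):
    return x + 1

# Every function object carries its compiled bytecode as a bytes string.
raw_bytecode = add_one.__code__.co_code
print(type(raw_bytecode).__name__, len(raw_bytecode) > 0)

# dis.get_instructions() decodes the raw bytes into named instructions.
opnames = [ins.opname for ins in dis.get_instructions(add_one)]
print(opnames)
```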
APPLICATIONS OF PYTHON :
WEB APPLICATIONS
You can create scalable Web Apps using frameworks and CMS
(Content Management System) that are built on Python. Some of the
popular platforms for creating Web Apps are: Django, Flask, Pyramid,
Plone, Django CMS.
Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
SCIENTIFIC AND NUMERIC COMPUTING:
CHAPTER 5
MODULE DESCRIPTION
Data loading is the process of copying and loading data or data sets from
a source file, folder or application to a database or similar application. It is
usually implemented by copying digital data from a source and pasting or
loading the data to a data storage or processing utility. Data loading is used in
database-based extraction and loading techniques. Typically, such data is
loaded into the destination application as a different format than the original
source location.
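A minimal sketch of the loading step with pandas is shown below. The two sample rows, the reduced column list and the helper name load_flows are assumptions made for illustration; a real run would pass the path of the dataset file instead of an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a comma-separated flow layout.
RAW = "0,tcp,ftp_data,SF,491,0,normal\n0,udp,other,SF,146,0,normal\n"
# Hypothetical, shortened column list used only for this sketch.
COLUMNS = ["duration", "protocol_type", "service", "flag",
           "src_bytes", "dst_bytes", "label"]

def load_flows(source):
    # header=None stops pandas from consuming the first data row as column names.
    return pd.read_csv(source, header=None, names=COLUMNS)

flows = load_flows(io.StringIO(RAW))
print(flows.shape)
print(flows.dtypes)
```

Passing explicit column names at load time avoids having to rename columns afterwards when the file has no header row.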
Missing values were imputed to guarantee that all the algorithms would
be able to handle them, even though some algorithms, such as XGBoost, can deal with missing
values automatically without imputation. To limit the
complexity of the comparison, the missing values were imputed based on their data
type. For numerical data, the missing entries were replaced by the median
value of the complete entries. For categorical data, the missing entries were
replaced by the mode value of the complete entries.
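The imputation rule above can be sketched with pandas as follows. The helper name impute_by_dtype and the sample frame are hypothetical, and the sketch assumes every column has at least one observed value.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(df):
    """Fill numeric gaps with the column median and other gaps with the column mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() can return several values; take the first one.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical sample with one numeric and one categorical gap.
sample = pd.DataFrame({
    "src_bytes": [100.0, np.nan, 300.0, 500.0],
    "protocol_type": ["tcp", "udp", None, "tcp"],
})
clean = impute_by_dtype(sample)
print(clean)
```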
In this module the data is cleaned. After cleaning, the data is
grouped as required; this grouping of data is known as data clustering.
The data set is then checked for missing values. If there are
missing values, they are replaced by a default value. After that, any data that needs
a change of format is converted. The whole process carried out before prediction is known
as data pre-processing. After that, the data is used for the prediction and
forecasting step.
For each experiment, we split the entire dataset into a 70% training set
and a 30% test set. We used the training set for resampling, hyperparameter
tuning, and training the model, and we used the test set to evaluate the performance of
the trained model. While splitting the data, we specified a random seed (any
fixed integer), which ensured the same data split every time the program
was executed.
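A minimal sketch of such a reproducible 70/30 split with scikit-learn follows. The toy arrays and the seed value 42 are assumptions for illustration; the seed actually used in the experiments is not stated in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2
SEED = 42  # assumed value; any fixed integer gives a repeatable split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED)

# Repeating the call with the same seed reproduces the same partition.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.30, random_state=SEED)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```

Resampling and hyperparameter tuning would then be applied to X_train and y_train only, so that the test set stays untouched until the final evaluation.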
Now, even if you’ve stored a vast amount of well-structured data, it might not
be labeled in a way that actually works for training your model. For example,
autonomous vehicles don’t just need pictures of the road, they need labeled
images where each car, pedestrian, street sign and more are annotated;
sentiment analysis projects require labels that help an algorithm understand
when someone’s using slang or sarcasm; chatbots need entity extraction and
careful syntactic analysis, not just raw language.
In other words, the data you want to use for training usually needs to be
enriched or labeled. Or you might just need to collect more of it to power your
algorithms. But chances are, the data you’ve stored isn’t quite ready to be used
to train your classifiers.
5.8 ALGORITHMS:
LSTM
ANN
FIG 5.2 LSTM MODEL
LSTMs are a special subset of RNNs that can capture context-specific
temporal dependencies over long periods of time. Each LSTM neuron is a
memory cell that can store additional information, i.e., it maintains its own cell state.
While neurons in normal RNNs merely take in their previous hidden state and
the current input to output a new hidden state, an LSTM neuron also takes in its
old cell state and outputs its new cell state.
1. Forget gate:
The forget gate determines which parts of the cell state will be
replaced with more recent information. It outputs values close to 1 for parts of
the cell state to be kept, and close to 0 for values to be discarded.
2. Input gate:
Based on the input (i.e., the previous output o(t-1), the input x(t), and the
previous cell state c(t-1)), this part of the network learns the conditions under
which any information should be stored (or updated) in the cell state.
3. Output gate:
Depending on the input and the cell state, this component determines
which information is forwarded to the next location in the network.
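The three gates can be sketched as one time step of an LSTM cell in NumPy. This is the common textbook formulation, in which the gates read the previous hidden state and the current input but not the previous cell state directly, so it is an assumption rather than the exact variant used in this project. The weight layout, sizes and random inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the four gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # shape (4 * n,)
    f = sigmoid(z[0 * n:1 * n])   # forget gate: near 1 keeps, near 0 discards
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much new content to store
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much of the cell to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate values for the cell state
    c_t = f * c_prev + i * g      # new cell state
    h_t = o * np.tanh(c_t)        # new hidden state (the cell's output)
    return h_t, c_t

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):   # a sequence of 5 input vectors
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

In practice a framework layer would replace this hand-written step; the sketch only makes the data flow between the gates explicit.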
Thus, LSTM networks are ideal for exploring how variation in one
stock's price can affect the prices of several other stocks over a long period of
time. They can also decide (in a dynamic fashion) for how long information
about specific past trends in stock price movement needs to be retained in order
to more accurately predict future trends in the variation of stock prices.
Advantages of LSTM:
The main advantage of LSTM is its ability to learn context-specific temporal dependencies.
Each unit remembers information for a long or short period without explicitly
using an activation function within the recurrent components. An important
fact is that any cell state is multiplied only by the output of the forget gate,
which varies between 0 and 1. That is to say, the forget gate in the
LSTM cell is responsible for both the weights and the activation function of the cell
state. Thus, the information from the previous cell state can pass through a
cell unchanged instead of increasing or decreasing exponentially at each time step or
layer, and the weights can converge to their appropriate values in a reasonable amount of
time. This allows LSTM to solve the vanishing gradient problem: because the
value stored in the memory cell is not iteratively modified, the
gradient does not vanish when the network is trained with backpropagation.
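The effect can be illustrated with a short calculation. The per-step factors 0.5 and 0.99 are assumed values chosen for illustration, not measurements: the first stands for a plain recurrent connection whose gradient shrinks at every step, the second for a forget gate that stays close to 1.

```python
def gradient_after(steps, per_step_factor):
    """Product of identical per-step gradient factors over a number of time steps."""
    grad = 1.0
    for _ in range(steps):
        grad *= per_step_factor
    return grad

STEPS = 50
# Assumed illustrative factors, not measured values.
plain_rnn = gradient_after(STEPS, 0.5)
lstm_cell = gradient_after(STEPS, 0.99)
print(f"plain recurrent path: {plain_rnn:.3e}")
print(f"LSTM cell-state path: {lstm_cell:.3f}")
```

After 50 steps the first product is numerically negligible, while the second is still of the same order of magnitude as the starting value.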
ARTIFICIAL NEURAL NETWORK (ANN)
Deep learning ANNs are crucial in machine learning (ML) and support broader
artificial intelligence (AI) technologies. An artificial neural network typically
comprises three or more interconnected layers.
The initial layer contains input neurons, which transmit data to deeper layers,
culminating in the final output layer. The intermediate layers, termed hidden
layers, process information adaptively through transformations. Each layer acts
as both input and output, enabling the ANN to comprehend complex objects.
Units within the hidden layers learn by weighting information based on internal
guidelines, producing transformed outputs for subsequent layers.
Backpropagation, a learning process, enables the ANN to adjust its outputs by
considering errors. During supervised training, errors are propagated backward,
and weights are updated accordingly to minimize discrepancies between
desired and actual outcomes. Training ANNs involves selecting appropriate
models and associated algorithms. One of the main advantages of ANNs is
their ability to learn from data observations, serving as effective tools for
function approximation and cost-effective solution estimation.
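The forward pass and the backward weight update described above can be sketched in NumPy as follows. The toy task (logical OR), the layer sizes, the learning rate and the number of epochs are all assumptions for illustration and are not the configuration used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: 4 samples, 2 input features, binary target (logical OR).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 3 units and one output unit.
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)
LEARNING_RATE = 0.5
losses = []

for epoch in range(2000):
    # Forward pass: input layer -> hidden layer -> output layer.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    error = output - y
    losses.append(float(np.mean(error ** 2)))

    # Backward pass: propagate the error from the output to the hidden layer.
    # The constant factor of the squared-error gradient is absorbed into the learning rate.
    delta_out = error * output * (1.0 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1.0 - hidden)

    # Gradient descent update of weights and biases.
    W2 -= LEARNING_RATE * hidden.T @ delta_out
    b2 -= LEARNING_RATE * delta_out.sum(axis=0)
    W1 -= LEARNING_RATE * X.T @ delta_hidden
    b1 -= LEARNING_RATE * delta_hidden.sum(axis=0)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```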
ANNs analyze data samples rather than entire sets, saving time and resources.
They find applications in various domains, including predictive analytics, spam
detection, natural language processing, and more.
Keras additionally requires either Theano or TensorFlow to be installed. In the
examples in this chapter we are using Theano as a backend, however the code
will work identically for either backend. You can install Theano using pip, but
it has a number of dependencies that must be installed first. Refer to the
Theano and TensorFlow documentation for more information [12].
Keras is a modular API. It allows you to create neural networks by building a
stack of modules, from the input of the neural network, to the output of the
neural network, piece by piece until you have a complete network. Also, Keras
can be configured to use your Graphics Processing Unit, or GPU. This makes
training neural networks far faster than if we were to use a CPU. We begin by
importing Keras:
We may want to view the network’s accuracy on the test set (or its loss on the
training set) over time (measured at each epoch), to get a better idea how well it
is learning. An epoch is one complete cycle through the training data.
Fortunately, this is quite easy to plot as Keras’ fit function returns a history
object which we can use to do exactly this:
This will result in a plot similar to that shown. Often you will also want to plot
the loss on the test set and training set, and the accuracy on the test set and
training set.
Plotting the loss and accuracy can be used to see if you are overfitting (you
experience tiny loss on the training set, but large loss on the test set) and to see
when your training has plateaued.
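The same check can be made numerically. The history object returned by Keras' fit function exposes a history attribute, a dictionary of per-epoch lists in which val_loss is present only when validation data is supplied. The sketch below works on a plain dictionary of that shape; the helper name summarize_history and the numbers are hypothetical.

```python
def summarize_history(history):
    """Summarize per-epoch losses from a Keras-style history dictionary."""
    train, val = history["loss"], history["val_loss"]
    best_epoch = min(range(len(val)), key=val.__getitem__)
    return {
        "best_epoch": best_epoch + 1,          # epochs are reported 1-based
        "final_gap": val[-1] - train[-1],      # a large positive gap suggests overfitting
        "val_loss_rising": val[-1] > val[best_epoch],
    }

# Hypothetical numbers shaped like the dictionary in `history.history`.
example = {
    "loss": [0.90, 0.60, 0.40, 0.25, 0.15, 0.08],
    "val_loss": [0.95, 0.70, 0.55, 0.50, 0.58, 0.70],
}
summary = summarize_history(example)
print(summary)
```

A training loss that keeps falling while the validation loss rises after its best epoch is the pattern the text describes as overfitting.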
CHAPTER 6
SYSTEM DESIGN
Data flow diagrams are used to graphically represent the flow of data in
a business information system. DFD describes the processes that are involved
in a system to transfer data from the input to the file storage and reports
generation. Data flow diagrams can be divided into logical and physical. The
logical data flow diagram describes flow of data through a system to perform
certain functionality of a business. The physical data flow diagram describes
the implementation of the logical data flow.
FIG 6.2 DATA FLOW MODEL
Use case diagrams are a way to capture the system's functionality and
requirements in UML. They capture the dynamic behavior of a live
system. A use case diagram consists of use cases and actors. Here, the data
owner and the user have separate registration and login; the data owner then
uploads the text document, using a symmetric key to encrypt the cloud
data.
FIG 6.3 USE CASE DIAGRAM
FIG 6.4 CLASS DIAGRAM
6.6 ACTIVITY DIAGRAM
CHAPTER 7
SOFTWARE TESTING
Testing is done for each module. After testing all the modules, the modules are
integrated and the final system is tested with test data specially
designed to show that the system will operate successfully under all
conditions. Thus, system testing is a confirmation that all is correct and an
opportunity to show the user that the system works. Inadequate testing or no
testing leads to errors that may appear a few months later. This creates two
problems: a time delay between the cause and the appearance of the problem, and the
effect of the system errors on files and records within the system. The purpose
of system testing is to consider all the likely variations to which the system will be
subjected and to push the system to its limits. The testing process focuses on the
logical internals of the software, ensuring that all the statements have been
tested, and on the functional externals, i.e., conducting tests to uncover errors and
ensure that defined inputs will produce actual results that agree with the
required results. Testing has to be done using two common steps: unit
testing and integration testing. In the project, system testing is carried out as follows.
Procedure-level testing is done first. By giving improper inputs, the errors
that occur are noted and eliminated. This is the final step in the system life cycle.
Here we deploy the tested, error-free system into a real-life environment and
make necessary changes while it runs online. System
maintenance is done every month or year based on company policies, and the system is
checked for errors such as runtime errors and long-run errors, together with other maintenance tasks
such as table verification and reports. Integration testing is a level of software
testing where individual units are combined and tested as a group. The purpose
of this level is to expose faults in the interaction between integrated units. Test
drivers and test stubs are used to assist in integration testing. Any of the black-box
testing, white-box testing, and grey-box testing methods can be used.
Normally, the method depends on your definition of ‘unit.’
TASKS:
7.1 UNIT TESTING
Unit testing focuses verification efforts on the smallest unit of software design,
the module. This is known as “Module Testing.” The modules are tested separately.
This testing is carried out during the programming stage itself. In this testing
step, each module is found to be working satisfactorily with regard to the
expected output from the module.
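A module-level test of this kind can be sketched with Python's built-in unittest framework. The function encode_protocol is a hypothetical pre-processing helper introduced only for this example; it is not taken from the project code.

```python
import unittest

def encode_protocol(name):
    """Hypothetical pre-processing helper: map a protocol name to an integer code."""
    codes = {"tcp": 0, "udp": 1, "icmp": 2}
    key = name.strip().lower()
    if key not in codes:
        raise ValueError(f"unknown protocol: {name!r}")
    return codes[key]

class EncodeProtocolTest(unittest.TestCase):
    def test_known_protocols(self):
        # Expected output for valid inputs, including untidy spacing and case.
        self.assertEqual(encode_protocol("tcp"), 0)
        self.assertEqual(encode_protocol(" UDP "), 1)

    def test_improper_input_is_rejected(self):
        # Improper input must raise an error instead of returning a code.
        with self.assertRaises(ValueError):
            encode_protocol("bogus")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(EncodeProtocolTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Each module is exercised in isolation here, both with valid inputs and with an improper input, before it is combined with other modules.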
7.2 BLACK BOX TESTING
Black box testing, also known as Behavioral Testing, is a software
testing method in which the internal structure/ design/ implementation of the
item being tested is not known to the tester. These tests can be functional or
non-functional, though usually functional.
7.3 WHITE-BOX TESTING
White-box testing (also known as clear box testing, glass box testing,
transparent box testing, and structural testing) is a method of testing software
that tests internal structures or workings of an application, as opposed to its
functionality (i.e. black-box testing).
7.4 GREY BOX TESTING
Grey box testing is a technique for testing an application with
limited knowledge of its internal workings. Grey box testing is usually
used to test Web Services applications. It is
performed by end-users and also by testers and developers.
7.5 INTEGRATION TESTING
Integration testing is a systematic technique for constructing tests to
uncover errors associated with interfaces. In the project, all the modules are
combined and then the entire program is tested as a whole. In the
integration-testing step, all the errors uncovered are corrected before the next testing
steps. Software integration testing is the incremental integration testing of two
or more integrated software components on a single platform to produce
failures caused by interface defects. The task of the integration test is to check
that components or software applications interact without error.
7.6 ACCEPTANCE TESTING
User Acceptance Testing is a critical phase of any project and requires
significant participation by the end user. It also ensures that the system meets
the functional requirements.
ACCEPTANCE TESTING FOR DATA SYNCHRONIZATION
The Acknowledgements will be received by the Sender Node after the
Packets are received by the Destination Node. The Route add operation is done
only when there is a Route request in need. The Status of Nodes information is
done automatically in the Cache Updating process.
BUILD THE TEST PLAN
Any project can be divided into units that can be processed
in further detail. A testing strategy is then carried out for each of these units.
Unit testing helps to identify the possible bugs in an individual component, so
that the component that has bugs can be identified and corrected.
CHAPTER 8
CONCLUSION AND FUTURE WORK
As the number of devices used to access the internet
increases day by day, the danger of intrusion also increases at an
alarming rate. Most current systems, such as IPSs and IDSs, which are used
to detect and prevent intrusions, are not able to detect and prevent
attacks that have new signatures or attacks that have not yet been identified.
Therefore, machine learning and pattern recognition are used
to enable systems such as IDSs or IPSs to analyze new forms of intrusion
and prevent them without user intervention. Algorithms such as ANN
and LSTM help to classify and cluster the packets inbound to the network.
This project focuses in depth on identifying intrusions based on UDP
flooding; classifying other types of intrusion such as TCP flood,
ICMP flood, Smurf attack and HTTP flood can be researched later as future
work. Based on the findings from the analyses conducted, recommendations
can be proposed to enhance the effectiveness of the Intrusion Detection
Systems. One suggestion is to enhance performance by implementing ongoing
real-time model training instead of relying on training the model with static
data. Additionally, a combination of Machine Learning (ML) and Deep
Learning (DL) can be utilized to further boost performance, where features are
extracted from the hidden layers of DL models and inputted into other ML or
DL models for further refinement. It is also advisable to subject the hybrid
model to various attacks, including zero-day attacks, in a real-world or
simulated network environment. This will allow for the identification of
vulnerabilities and enable the models to be retrained to detect such attacks.
Furthermore, integrating ANDAE with other anomaly detection techniques can
improve the detection of abnormal traffic patterns. Lastly, it is recommended to
explore the use of tools like Spark to improve the training and detection
capabilities of the model.
APPENDIX 1 SAMPLE CODINGS
# Imports used throughout the notebook
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
import pickle
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
# Load the training set. The file has no header row, so pandas uses the
# first record as column names; they are renamed below.
data = pd.read_csv(r"c:\users\deepi\music\project\dataset\data\kddtrain+.txt")
data
data.columns
data.head()
data.normal.value_counts()
#renaming columns
data=data.rename(
columns={
'0':'duration',
'tcp':'protocol_type',
'ftp_data':'service',
'SF':'flag',
'491':'src_bytes',
'0.1':'dst_bytes',
'0.2':'land',
'0.3':'wrong_fragment',
'0.4':'urgent',
'0.5':'hot',
'0.6':'num_failed_logins',
'0.7':'logged_in',
'0.8':'num_compromised',
'0.9':'root_shell',
'0.10':'su_attempted',
'0.11':'num_root',
'0.12':'num_file_creations',
'0.13':'num_shells',
'0.14':'num_access_files' ,
'0.15':'num_outbound_cmds',
'0.16':'is_host_login',
'0.17':'is_guest_login',
'2':'count',
'2.1':'srv_count',
'0.00':'serror_rate',
'0.00.1':'srv_serror_rate',
'0.00.2':'rerror_rate',
'0.00.3':'srv_rerror_rate',
'1.00':'same_srv_rate',
'0.00.4':'diff_srv_rate',
'0.00.5':'srv_diff_host_rate',
'150':'dst_host_count',
'25':'dst_host_srv_count',
'0.17.1':'dst_host_same_srv_rate',
'0.03':'dst_host_diff_srv_rate',
'0.17.2':'dst_host_same_src_port_rate',
'0.00.6':'dst_host_srv_diff_host_rate',
'0.00.7':'dst_host_serror_rate',
'0.00.8':'dst_host_srv_serror_rate',
'0.05':'dst_host_rerror_rate',
'0.00.9':'dst_host_srv_rerror_rate',
'normal':'class',
'20':'num'})
data.head()
data.info()
data.describe()
data.isnull().sum()
data.isnull().any()
from sklearn.preprocessing import LabelEncoder
columns = data.columns
label_encoder = LabelEncoder()
from sklearn.utils import resample
# Upsample df_c1 ... df_c4 with replacement to 500 samples each.
# These per-class frames are defined in a part of the notebook not listed here.
df_c1_upsampled = resample(df_c1,
                           replace=True, n_samples=500, random_state=100)
df_c2_upsampled = resample(df_c2,
                           replace=True, n_samples=500, random_state=100)
df_c3_upsampled = resample(df_c3,
                           replace=True, n_samples=500, random_state=100)
df_c4_upsampled = resample(df_c4,
                           replace=True, n_samples=500, random_state=100)
# df_majority_downsampled = resample(df_majority,
#                                    replace=False, n_samples=2500, random_state=100)
# Encode the class labels as integers, then one-hot encode them
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)
from keras.utils import to_categorical
y1 = to_categorical(y)
enc.classes_
# Hold out 30% of the samples for testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y1, test_size=0.3,
                                                    random_state=42)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
x_test.to_csv(r"c:\users\deepi\music\project\test.csv")
from keras.utils import to_categorical  # convert to one-hot encoding
from keras.models import Sequential
from keras.layers import (Dense, Dropout, Flatten, Conv1D, MaxPool1D,
                          GlobalAvgPool1D, GlobalMaxPooling1D)
from keras.layers import LSTM, RepeatVector, TimeDistributed
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.optimizers import Adam
# Reshape the data to (samples, 41 features, 1 channel)
x_test1 = x_test.values.reshape((len(x_test), 41, 1))
x_train1 = x_train.values.reshape((len(x_train), 41, 1))
x_train1.shape
y_train.shape
import tensorflow as tf
# LSTM classifier with a 5-class softmax output
model = Sequential()
model.add(LSTM(100, input_shape=(41, 1)))
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dense(5, activation='softmax'))
# model.compile(loss='categorical_crossentropy', optimizer='adam',
#               metrics=['accuracy'])
model.compile(loss='categorical_crossentropy',
              # optimizer='sgd',  # almost same
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
history1 = model.fit(x_train1, y_train, batch_size=128,
                     epochs=30, validation_data=(x_test1, y_test))
plt.plot(history1.history['accuracy'], 'r', label='train accuracy')
plt.plot(history1.history['val_accuracy'], 'b', label='test accuracy')
plt.legend()
plt.show()
score = model.evaluate(x_test1, y_test, verbose=0)
print('test accuracy:', score[1])
score = model.evaluate(x_train1, y_train, verbose=0)
print('train accuracy:', score[1])
# Save the model (file_name must hold the output path; it is not set in this listing)
tf.keras.models.save_model(model, file_name)
# Confusion matrix (rows: actual class, columns: predicted class)
from sklearn.metrics import confusion_matrix
class_names = enc.classes_
y_true = np.argmax(y_test, axis=1)
y_pred = np.argmax(model.predict(x_test1), axis=1)
df_heatmap = pd.DataFrame(confusion_matrix(y_true, y_pred),
                          columns=class_names, index=class_names)
df_heatmap
# heatmap = sns.heatmap(df_heatmap, fmt="d")
enc.classes_
# Spot-check individual test samples (1-based positions 3, 6 and 8)
for i in (3, 6, 8):
    y_pred = model.predict(x_test1[i-1:i])
    classes_x = np.argmax(y_pred, axis=1)
    act = np.argmax(y_test[i-1])
    print("predicted class: {}".format(enc.classes_[classes_x]))
    print("actual class: {}".format(enc.classes_[act]))
all_model_result = pd.DataFrame(
    columns=['model', 'test_accuracy', 'train_accuracy'])
new = ['lstm',91, 91]
all_model_result.loc[0] = new
# 1D CNN classifier: three pairs of convolutions, each followed by dropout
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=3, activation='relu',
                 input_shape=(41, 1)))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(Dropout(0.4))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(Dropout(0.4))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(Dropout(0.4))
model.add(Flatten())
model.add(Dense(256, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(5, activation="softmax"))
model.summary()
model.compile(optimizer='rmsprop', loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train1, y_train, batch_size=128,
                    epochs=30, validation_data=(x_test1, y_test))
plt.plot(history.history['accuracy'], 'r', label='train accuracy')
plt.plot(history.history['val_accuracy'], 'b', label='test accuracy')
plt.legend()
plt.show()
score1 = model.evaluate(x_test1, y_test, verbose=0)
print('test accuracy:', score1[1])
score = model.evaluate(x_train1, y_train, verbose=0)
print('train accuracy:', score[1])
# Save the model (file_name must hold the output path; it is not set in this listing)
tf.keras.models.save_model(model, file_name)
# Column order is model, test_accuracy, train_accuracy
new = ['cnn 1d', score1[1], score[1]]
all_model_result.loc[1] = new
all_model_result
# Initialising the ANN
classifier = Sequential()
# Flatten the (41, 1) input used above into a 41-feature vector
classifier.add(Flatten(input_shape=(41, 1)))
# Adding the first hidden layer
classifier.add(Dense(units=45, kernel_initializer='uniform',
                     activation='relu'))
# Adding the second hidden layer
classifier.add(Dense(units=20, kernel_initializer='uniform', activation='relu'))
# Adding the output layer
classifier.add(Dense(units=5, kernel_initializer='uniform', activation='sigmoid'))
# Compiling the ANN (Adam optimizer, categorical cross-entropy loss)
classifier.compile(optimizer='adam', loss='categorical_crossentropy',
                   metrics=['accuracy'])
history = classifier.fit(x_train1, y_train, batch_size=128,
                         epochs=200, validation_data=(x_test1, y_test))
plt.plot(history.history['accuracy'], 'r', label='train accuracy')
plt.plot(history.history['val_accuracy'], 'b', label='test accuracy')
plt.legend()
plt.show()
score1 = classifier.evaluate(x_train1, y_train, verbose=0)
print('train accuracy:', score1[1])
score = classifier.evaluate(x_test1, y_test, verbose=0)
print('test accuracy:', score[1])
# Save the model (file_name must hold the output path; it is not set in this listing)
tf.keras.models.save_model(classifier, file_name)
# Confusion matrix (rows: actual class, columns: predicted class)
from sklearn.metrics import confusion_matrix
class_names = enc.classes_
y_true = np.argmax(y_test, axis=1)
y_pred = np.argmax(classifier.predict(x_test1), axis=1)
df_heatmap = pd.DataFrame(confusion_matrix(y_true, y_pred),
                          columns=class_names, index=class_names)
df_heatmap
# heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")
enc.classes_
i=8
y_pred = classifier.predict(x_test1[i-1:i])
classes_x=np.argmax(y_pred,axis=1)
act = np.argmax(y_test[i-1])
print("predicted class: {}".format(enc.classes_[classes_x]))
print("actual class: {}".format(enc.classes_[act]))
new = ['ann',score[1], score1[1]]
all_model_result.loc[2] = new
all_model_result
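Because the traffic classes are heavily imbalanced, overall accuracy can hide poor detection of rare attack types. The confusion matrices computed above can be summarised per class, as in the following sketch. It assumes the scikit-learn convention (rows are actual classes, columns are predicted classes); the function per_class_metrics, the class names, and the matrix values are made up for illustration and are not results from this project.

```python
def per_class_metrics(matrix, class_names):
    """Return {class_name: (precision, recall)} from a confusion matrix.

    matrix[i][j] counts samples whose actual class is i and whose
    predicted class is j.
    """
    n = len(class_names)
    metrics = {}
    for k, name in enumerate(class_names):
        true_positive = matrix[k][k]
        actual_total = sum(matrix[k])                           # row sum
        predicted_total = sum(matrix[i][k] for i in range(n))   # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        metrics[name] = (precision, recall)
    return metrics


# Illustrative values only: a large class, a medium class and a rare class.
example_classes = ["class_a", "class_b", "class_c"]
example_matrix = [
    [90, 8, 2],
    [5, 40, 5],
    [4, 1, 5],
]

results = per_class_metrics(example_matrix, example_classes)
total = sum(sum(row) for row in example_matrix)
overall_accuracy = sum(example_matrix[k][k] for k in range(3)) / total

print(f"overall accuracy: {overall_accuracy:.3f}")
for name, (precision, recall) in results.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```

In this made-up matrix the overall accuracy is about 0.84, yet only half of the rare class_c samples are detected. With the models above, the same function could be applied to df_heatmap.values.tolist() and list(enc.classes_).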
APPENDIX 2: SNAPSHOTS