Privilege Escalation Attack Detection and Mitigation in Cloud Using Machine Learning
Cloud computing faces significant security threats from privilege escalation attacks, which compromise sensitive data and resources. This research proposes a machine learning-based solution to detect and mitigate such attacks. A customized dataset built from multiple files of the CERT dataset is used. Using the Random Forest, AdaBoost, XGBoost, and LightGBM algorithms, our model achieves 97% detection accuracy with a 0.78% false positive rate. By identifying internal attackers and distinguishing between legitimate and malicious activity, our approach enhances cloud security and minimizes false alerts.
Keywords: Privilege escalation, insider attack, machine learning, Random Forest, AdaBoost, XGBoost, LightGBM, classification.
CHAPTER-01
INTRODUCTION
Cloud computing is a new way of thinking about how to facilitate and provide services through the Internet. The current financial crisis, as well as expanding computing demands, has necessitated significant changes to the current Cloud model in terms of data storage, processing, and display. Cloud computing spares people from spending heavily on equipment maintenance and purchases by utilizing cloud infrastructure. Cloud storage providers adopt fundamental security measures for their systems and the data they handle, including encryption, access control, and authentication. Depending on the accessibility, speed, and frequency of data access, the cloud has an almost infinite capacity for storing any type of data in different cloud data storage structures. Sensitive data breaches, both inadvertent and malicious, might occur due to the volume of data that moves between businesses and cloud service providers. The characteristics that make online services easy to use for workers and IT systems also make it harder for businesses to prevent unwanted access. Authentication and open interfaces are new security vulnerabilities that enterprises using Cloud services face. Hackers with advanced skills utilize their knowledge to access Cloud systems. Machine learning employs a variety of approaches and algorithms to address this security challenge and better manage data. Many datasets are private and cannot be released owing to privacy concerns, or they may be missing crucial statistical properties. The fast rise of the Cloud industry creates privacy and security risks governed by regulations.
Employee access privileges may not necessarily change when employees change roles or positions within a Cloud company. As a result, old privileges can be misused to steal and harm valuable data. Each account that communicates with a computer has some level of authority. Server databases, confidential files, and other services are often restricted to approved users. A malicious attacker can access a sensitive system by gaining control of a higher-privileged user account and exploiting or expanding its privileges. Based on their objectives, attackers can move horizontally to obtain control of more systems, or vertically to obtain admin and root access until they have complete control of the whole environment. When a user gets the access permissions of another user with the same access level, this is known as horizontal privilege escalation. An attacker can use horizontal privilege escalation to access data that does not necessarily relate to him. In badly designed apps, an attacker may be able to uncover holes in a web application that provide entry to other people’s information. Because the attacker has completed a horizontal elevation-of-privileges exploit, they can see, alter, and copy sensitive information. Figure 1 illustrates how a horizontal privilege escalation attack unfolds among the entities of an organization. This form of assault usually necessitates a thorough knowledge of the weaknesses that affect specific operating systems and the usage of malicious programs.
Privilege escalation is also called a privilege elevation assault, which entails giving a user, software, or other asset more rights or privileged access than it already has. Moving from a low degree of privileged access to a greater level of special access is the key objective of the attacker. To achieve vertical access, the attacker may need to take various actions to overcome or override security restrictions. Vertical privilege controls are finer-grained versions of security models that implement business objectives such as separation of roles and least privilege, as shown in Figure 2. An attacker, for example, takes control of an ordinary registered user on a network and tries to acquire administrative or root access. Anomalous activity on organizational systems or user accounts can be detected using behavioral analytics, which might signal intrusion or privilege escalation.
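To make the two escalation paths concrete, the following minimal sketch (role names and privilege levels are hypothetical, not part of the proposed system) shows an access check that denies both vertical requests above a caller's role and horizontal requests against a peer's resource:

```python
# Roles ordered by privilege level; hypothetical values for illustration only.
ROLE_LEVEL = {"user": 1, "manager": 2, "admin": 3}

def authorize(actor, required_level, resource_owner):
    """Return True only if the request is neither a vertical nor a
    horizontal privilege escalation."""
    # Vertical check: the action demands a higher role than the actor holds.
    if required_level > ROLE_LEVEL[actor["role"]]:
        return False
    # Horizontal check: an ordinary user may only touch resources they own.
    if actor["role"] == "user" and resource_owner != actor["name"]:
        return False
    return True
```

A horizontal attack corresponds to the second check being bypassed (same level, different owner), and a vertical attack to the first.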
To build better security protection systems, we need intelligent algorithms, such as ML algorithms, to classify and predict insider attacks. In addition, knowing how ML algorithms perform on classifying insider attacks allows one to choose the most appropriate algorithm for each case, and to identify which algorithms need improvement, thereby providing a higher level of security protection. This research aims to apply effective and efficient ML algorithms to insider attack scenarios to gain better and faster results. Four ML algorithms have been applied and evaluated in this regard: Random Forest, AdaBoost, XGBoost, and LightGBM. The principle behind the boosting strategy is to take a weak classifier and train it into a very good one by iteratively improving the predictions of the classification algorithm. Random Forest, AdaBoost, and XGBoost classified insider threats accurately and quickly. These are the contributions that this research intends to make:
1) In order to generate findings that represent real-world situations, this work assumes a realistic context for ML model training. The work then emphasizes the differences from training under conventional ML conditions.
2) Create and analyze a user-centered insider attack detection process, including data collection, pre-processing, and ML model-based data analysis.
3) To better understand insider attack situations, offer a detailed result reporting procedure where instance- and user-based results are presented and malicious incidents are evaluated.
To the best of our knowledge, this is the first paper that measures the performance of these four machine learning algorithms (Random Forest, AdaBoost, XGBoost, and LightGBM) on classifying insider attacks and uses this performance comparison to quickly identify appropriate defense tools that improve the level of security protection. Recent insider threat detection and classification studies used different models and ensemble techniques. Those studies implemented the models individually on different datasets and then reported the classification results. This paper implements the four ensemble models on a single customized dataset to better detect and classify insider threats, and presents the best results of the applied ensemble algorithms.
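To make the boosting principle concrete, the following is a minimal pure-Python AdaBoost sketch using one-feature threshold stumps as weak classifiers. It is illustrative only: the actual work uses library implementations on the CERT dataset, not this toy code.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: a one-feature threshold rule returning +1 or -1."""
    return polarity if x >= threshold else -polarity

def train_adaboost(X, y, n_rounds=5):
    """Train an AdaBoost ensemble of stumps on 1-D data with labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for threshold in sorted(set(X)):
            for polarity in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Reweight: misclassified samples gain weight, so the next
        # weak learner concentrates on the hard cases.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, threshold, polarity))
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak learners."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

The reweighting step is the essence of boosting: each round, the ensemble is forced to pay attention to the examples it previously got wrong.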
1.1 PURPOSE
The scope of using machine learning (ML) for detecting and mitigating privilege escalation
attacks in cloud environments can be summarized in three key areas:
1. Anomaly Detection and Behavior Analysis: The goal is to detect deviations from normal usage patterns that could indicate privilege escalation attempts.
2. Automated Threat Mitigation: This includes automated actions such as restricting access and notifying administrators, reducing response time and potential damage.
3. Continuous Learning and Adaptation: Implement continuous learning mechanisms to adapt to new attack vectors and evolving threat landscapes. By retraining models with new data, the system can become more resilient to advanced, unknown attacks while reducing false positives over time.
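The first area can be sketched with a simple per-user baseline: learn each user's normal daily count of privileged actions, then flag days that deviate strongly. This z-score rule is a hypothetical stand-in for the ML models used in this work, with made-up user names and counts:

```python
import statistics

def build_baseline(history):
    """history: {user: [daily privileged-action counts]} -> {user: (mean, std)}."""
    return {u: (statistics.mean(c), statistics.pstdev(c) or 1.0)
            for u, c in history.items()}

def flag_anomalies(baseline, today, z_threshold=3.0):
    """Return users whose activity today spikes far above their own baseline."""
    alerts = []
    for user, count in today.items():
        mu, sigma = baseline.get(user, (0.0, 1.0))
        if (count - mu) / sigma > z_threshold:   # one-sided: spikes only
            alerts.append(user)
    return alerts
```

A user who suddenly performs many more privileged actions than usual is flagged for review, which mirrors the behavioral-analytics idea described earlier.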
1.2 ABBREVIATIONS AND DEFINITIONS
PEA - Privilege Escalation Attack: An attack that exploits a vulnerability to gain elevated
access to resources that are normally protected from user access.
IDS - Intrusion Detection System: A system designed to monitor network or system activities
for malicious activities or policy violations.
IPS - Intrusion Prevention System: A network security technology that examines network
traffic for malicious activity and takes action to prevent it.
ACL - Access Control List: A list of permissions attached to an object that specifies which
users or system processes can access that object and what operations they can perform.
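The ACL concept defined above can be sketched as a mapping from object to principal to allowed operations (object and principal names here are hypothetical):

```python
# A toy ACL: each object maps principals to the operations they may perform.
acl = {
    "payroll.db": {"alice": {"read"}, "admin": {"read", "write"}},
}

def allowed(acl, obj, principal, op):
    """Default-deny: anything not explicitly granted is refused."""
    return op in acl.get(obj, {}).get(principal, set())
```

The default-deny lookup is what makes an ACL useful against privilege escalation: a permission absent from the list is never granted.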
CHAPTER-02
LITERATURE SURVEY
Le et al. [9] discussed that insider threats are among the most expensive and difficult-to-detect forms of assault, since insiders have access to a company’s networked systems and are familiar with its structure and security processes. Insider malware detection faces a unique set of challenges, such as extremely unbalanced data, limited ground truth, and behavioral drifts and shifts. Machine learning is used to analyze data at several levels of detail under realistic situations to identify harmful behaviors, especially malicious insider attacks. Random Forest beats the other ML methods, achieving good detection performance and F1-score with low false positive rates in most situations. The proposed work achieved an accuracy of 85% and a false positive rate of only 0.78%.
Janjua et al. [10] discussed that preventing malicious insiders from acting maliciously in an organization’s system is a significant cybersecurity challenge. The paper’s main goal is to use several machine learning approaches to classify email from the TWOS dataset. The supervised learning techniques applied to the dataset are AdaBoost, Naïve Bayes (NB), Logistic Regression, KNN, Linear Regression, and Support Vector Machine (SVM). Experiments reveal that AdaBoost has the best classification accuracy for malicious and non-malicious emails, with a 98% accuracy rate. Although the model was trained on the original dataset, the data is limited; the model’s results might improve with a bigger dataset.
Kumar et al. [11] discussed that, due to the large number of diverse apps operating on shared resources, implementing security and resilience on a Cloud platform is necessary but difficult. Based on the idea of clustering, a novel malware detection technique inside the Cloud infrastructure was suggested: trend micro locality sensitive hashing (TLSH). They utilized the Cuckoo sandbox, which generates dynamic file analysis results by running files in an isolated environment. Principal component analysis (PCA), random forest, and Chi-square feature selection approaches are also used to choose the essential features. Experimental outcomes for clustering and non-clustering methods are obtained for three proposed classifiers. According to the outcomes of the experiments, Random Forest obtains the best accuracy among the classifiers. Cloud security has long been a serious concern. Attackers target data sources because they want the most valuable and sensitive information. If data is lost, every Cloud user’s privacy and security is seriously threatened. Internal attackers get access to a system by compromising a susceptible user node. Internally linked to the cloud network, they conduct assaults while posing as trusted users. The use of an Improvised LSTM (ILSTM) to identify internal attackers in a cloud network is offered as a security technique. Not only does the proposed ILSTM identify internal attackers, but it also minimizes false alert rates by distinguishing broken and new user nodes from malfunctioning nodes.
Le and Zincir-Heywood discussed that insider threat actions can be taken intentionally or accidentally, such as information system sabotage or irresponsible handling of cloud resources. One of the difficulties in researching insider threats is that a malicious insider has access to the organization’s network systems and is familiar with its security processes. To assist cybersecurity experts in detecting harmful insider activity on unseen data, ANN, RF, and LR machine learning techniques are trained on limited ground truth. User-session data appears to be the best choice of data granularity, since it enables a system with a significant malicious insider detection rate and quick response times. Machine learning techniques such as RF and ANN performed well in this work. Because RF provides excellent precision, it can be used when manpower for examining alerts is restricted.
Tripathy et al. discussed that conventional web-based and cloud apps are vulnerable to the most popular online threats. One of the greatest threats to a SaaS application is the SQL injection attack. They construct and test classifiers for SQL attack detection using machine learning methods. They explore the ability of machine learning models to identify SQL injection attacks, including the AdaBoost classifier, Random Forest, Deep Learning using ANN, TensorFlow’s Linear Classifier, and the Boosted Trees classifier. Malicious writing operations are more critical than malicious reading activities. The random forest classifier surpasses all others on the dataset and obtains better accuracy.
Sun et al. discussed that the network is becoming increasingly integral to businesses and organizations, so network security threats are increasing. Data leakage incidents from 15 nations and 17 industry groups were examined for Ponemon’s 2018 Cost of a Data Breach Study; 48% were malicious operations, while insiders’ faulty actions caused 27% of the incidents. They used a tree structure technique to study user behavior and create the feature sequence. To distinguish between the feature patterns and detect unusual users, the COPOD approach is adopted. The detection effect outperforms the standard unsupervised learning approach, and this approach offers benefits when processing vast amounts of complicated and diverse data.
Kim et al. discussed that an authorized user’s malicious acts, such as stealing intellectual property or sensitive information, fraud, and sabotage, are examples of insider risks. Although insider threats are far less common than external network assaults, they can still do significant harm. There are three widely used research methodologies for detecting insider threats. The first is building a rule-based detection system. The second is creating a network graph and monitoring modifications in the graph’s structure to spot suspicious people or bad behavior. The third uses historical data to create a statistical or machine-learning model that can predict potentially dangerous activity. They utilized the ‘‘CERT Insider Threat Tools’’ dataset, since obtaining genuine business system logs is extremely challenging. The CERT dataset includes employees’ computer action logs and certain organizational data, such as employees’ departments and responsibilities. They built insider-threat detection models to emulate real-world companies using machine learning-based methods. Experiments indicate that the suggested system can detect harmful insider activities relatively effectively.
Liu et al. discussed that information communication technology systems are increasingly vulnerable to cyber security attacks, most of which come from within the organization. Detecting and mitigating insider threats is a complicated challenge, because insiders are hidden behind enterprise-level security defense measures and frequently have privileged network access. By gathering and reassembling information from the literature, they present the many types of insiders and the threats they bring. Insider threats are of three types: masquerader, traitor, and unintentional perpetrator. Prevention may be viewed as a set of defensive procedures that can help prevent, or enhance the identification of, various internal threats. They examine the suggested efforts from a data analytics viewpoint, presenting them in terms of host, network, and contextual data analytics. Relevant studies are also analyzed and compared, with a brief overview of their benefits and drawbacks.
Wang et al. discussed that insiders are an organization’s trusted partners who have access to the organization’s assets, information, and network. Over 60% of all security breaches or assaults documented worldwide in 2015 were committed entirely by insiders. As a result, preventing insider threats is a severe problem. The major goal of this work is to create a reliable insider threat detection system that can distinguish between malicious and non-malicious insider activity. The examination of human behavioral activities is the main emphasis of this paper. They examine three scenarios related to the behavioral activity of the insider user:
• A user performs activities after working hours using a removable device to access and steal data.
• Before leaving the current job, the user’s frequency of using a thumb drive increases, and it is then used to steal important company data.
• Users download spyware to get the passwords of the organization’s employees, then try to steal the supervisor’s credentials; after that, they generate fake alarming emails to create panic in the organization.
Tariq et al. discussed that deep learning, also known as multilevel and deep-structured learning, is a subset of machine learning methods that can be supervised or unsupervised. A main issue for DL is that its encrypted data comes from learning and interface modules. Due to the widespread adoption of DL models in numerous applications, security and privacy concerns are of utmost importance. Because deep neural networks rely on a significant quantity of input training data, privacy issues constantly exist. Industry and researchers have focused on many deep learning security threats and associated defenses.
Berman et al. discussed that the set of procedures, methods, tools, and technologies collectively called ‘‘cyber security’’ is used to safeguard the availability, confidentiality, and integrity of computing resources. There is evidence of compromise throughout an attack’s life cycle, and there may even be important warning signs of an upcoming attack; finding these indications, which could be scattered across the environment, is difficult. Much data is produced by machine-to-machine and human-to-machine exchanges from apps, websites, electronic objects, and other cyber-enabled resources. Malware threats are becoming more prevalent and diverse, making it more challenging to protect against them using conventional techniques. DL offers the chance to create generalized models for malware detection and classification. Network behavior-based approaches are required to identify complex malware, since they focus on the synchronized command and control traffic from the malware.
Pang et al. discussed that insider threat, which may lead to data theft or system sabotage, is one of today’s main cyber security issues. Although insider threats can cause substantial harm, their objectives and activities might differ greatly. Anomaly-based intrusion detection methods are a useful way of identifying both known and undiscovered/unknown threats. The type of anomaly is a key element in anomaly identification; there are three subcategories: point anomalies, contextual anomalies, and collective anomalies. In anomaly-based intrusion detection systems, a model is created by training the system using ‘‘normal’’ network data. When the model is ready, it is used to determine whether or not new events, objects, or traffic are abnormal.
Deep learning is a subset of machine learning algorithms that uses several levels of information processing steps in hierarchical structures to learn features without supervision and to evaluate or classify patterns. On the KDDCup99 dataset, the system is developed using two techniques: RBM and Autoencoder. Based on the data available from the KDD99 dataset, a statistical analysis is done on the values of each characteristic. Tests using connections from the KDDCup99 network traffic have demonstrated that deep learning algorithms efficiently detect intrusions with minimal error rates.
Coppolino et al. discussed that security is a major issue, since cloud services handle sensitive data that may be accessible from anywhere over the Internet. Malevolent insiders frequently cause significantly more harm than anticipated. Such attackers inject insecure code into the cloud and use their equipment as a channel. When correctly injected, this code performs maliciously, and the user running it has control over it. How much access this code can give the malicious user depends on the strength of the developed code and the degree of security measures implemented by the cloud. A Cloud Ecosystem is proposed that aims to offer security controls across all Clouds. The system aims to guarantee data privacy and security, from the user authentication procedure through cloud storage. One Time Password (OTP) support was made possible by the system design’s consideration of verified authentication. The CloudSim simulator is used to model both the proposed system and algorithm.
Abdelsalam et al. discussed a deep learning (DL)-based malware detection technique. Employing raw process behavior (performance metrics) data, the study demonstrated the usefulness of a 2D Convolutional Neural Network (CNN) for malware detection. The study illustrates the effectiveness of the proposed method by first developing a standard 2D CNN model that does not include a time window, and then comparing it to a newly developed 3D CNN model that greatly enhances detection accuracy by using a time window as the third dimension, thereby minimizing the problem of mislabeling. Results revealed a reasonable accuracy of 79% on the testing dataset using the 2D CNN.
Jaafar et al. illustrated that information systems are created to provide services and functions to a large number of people. As a result, it is common to have multiple levels of privilege for different users on the same information system. Several studies have been published that find security issues or attacks associated with privilege escalation by identifying irregularities and flaws in information systems. In the article, the study first introduces a new distance-based outlier detection technique for detecting unexpected situations of privilege escalation assaults without making any assumptions about the dataset or its distribution. Second, based on known privilege escalation scenarios, the study identifies four kinds of privilege escalation assaults and the justification for their specifications.
Alhebaishi et al. discussed that the growing use of cloud computing brings with it plenty of new security and privacy issues. In order to execute their assigned maintenance responsibilities, remote administrators must be given the proper privileges, which may include direct access to the underlying cloud infrastructure. A dishonest remote administrator, or an attacker who has stolen an administrator’s credentials, might pose serious internal risks to the cloud. The study starts by modeling the maintenance jobs and their associated rights. The study then uses the existing k-zero day safety metric to represent the insider threats caused by remote administrators allocated to maintenance tasks.
Yuan et al. demonstrated that a common approach for detecting insider threats is to model a user’s usual behaviour and look for anomalies. To determine whether user behaviour is normal or abnormal, a novel insider threat detection approach is presented. Using an LSTM to categorize the user action sequence directly is inefficient, because each sequence is represented by a single piece of information in the LSTM output. The proposed model works in two stages: in the first stage, the LSTM is used to extract temporal features of the user’s behaviour, and these features are converted to fixed-size feature matrices. In the second stage, a CNN is used to classify the fixed-size matrices as normal or anomalous.
Mohammed demonstrated that cloud computing has become vital for today’s businesses to satisfy their demands. The present popularity of cloud web services is a result of their affordability and accessibility. Several adaptable service models, including IaaS, SaaS, PaaS, and multitenancy, are used to achieve this. The security and privacy risks associated with these cloud services are serious. By limiting illegal access, the combination of verification and attribute-based access control enhances the performance of the cloud web application. One of the biggest risks is relying on identification and access control systems alone to prevent unauthorized use of the systems, which is among the most typical circumstances. Despite the risks and challenges, an IAM system offers more benefits than problems. Future cost savings and use cases will emerge, but they will probably only be available to businesses with strong cloud identity standards. The majority of firms looking to establish themselves for long-term success have found that identity as a service provides the best route forward.
CHAPTER-03
SYSTEM ANALYSIS
DISADVANTAGES
Imbalanced Data
Complexity and interpretability
Privacy concerns
Resource Intensive
The existing work achieved an accuracy of 97% and a false positive rate of only 0.78%.
The proposed model applies a GRU algorithm to the CERT dataset.
This algorithm provides 98.3% accuracy and requires less time.
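For reference, the gate computation underlying a GRU can be sketched as a single scalar cell in pure Python. This is a toy illustration with untrained, hypothetical weights; the actual proposed model would use a deep learning framework with vector states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h, p):
    """One scalar GRU step; p holds the gate weights (toy values)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])               # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])               # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"]) # candidate state
    return (1.0 - z) * h + z * h_tilde                             # blend old and new

def run_sequence(xs, p):
    """Fold a sequence of inputs through the cell, starting from h = 0."""
    h = 0.0
    for x in xs:
        h = gru_cell(x, h, p)
    return h
```

The update gate z decides how much of the new candidate state replaces the old state, which is what lets a GRU retain or discard information across a sequence of user events.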
ADVANTAGES
Early Detection
Adaptability
Performance: The application should have better accuracy and should provide predictions in less time.
Scalability: The system must have the potential to be enlarged to accommodate growth.
Capability: The storage capacity should be high so that a large amount of data can be stored in order to train the model.
⮚ ECONOMICAL FEASIBILITY
⮚ TECHNICAL FEASIBILITY
⮚ SOCIAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
CHAPTER-04
SYSTEM REQUIREMENTS
➢ RAM - 4 GB (min)
➢ Hard Disk - 20 GB
➢ Key Board - Standard Windows Keyboard
➢ Mouse - Two or Three Button Mouse
➢ Monitor - SVGA
❖ Front-End : Python.
❖ Back-End : Django
CHAPTER-05
SOFTWARE DESIGN
A crucial step in the cyber-attack chain is privilege escalation, which often includes the execution of a privilege escalation vulnerability caused by a flaw in the system, a configuration error, or insufficient access controls. The following are countermeasures against privilege escalation attacks:
A. SECURITY POLICY
An effective security policy should, at the very least, outline the mitigation of security threats. Including measures in your security strategy to avoid and identify misuse is one of the greatest strategies to stop insider threats. Rules for handling insider misuse investigations should also be included in the policy.
B. MULTIFACTOR AUTHENTICATION
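Requiring a second factor beyond the password raises the cost of account takeover that precedes most escalation attacks. As one possible second factor, an HMAC-based one-time password (HOTP, RFC 4226) can be generated with only the Python standard library; this is a generic sketch, not a component of the proposed system:

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HMAC-based one-time password from a shared secret and counter."""
    msg = struct.pack(">Q", counter)                      # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                            # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

With the RFC 4226 test secret b"12345678901234567890", the first two generated codes are 755224 and 287082, matching the specification's test vectors.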
C. SECURE DESKTOPS
Certain services can lock down PCs throughout a corporation. Because businesses cannot rely on their staff to handle all their setups with the appropriate level of responsibility, these services are quite helpful. To further assist companies in preventing dangers, these services also let companies lock down certain areas of a user’s computer programs.
Utilize software that scans for violations of the company’s policies and notifies authorities when employees break them on the network. To be sure that company staff are not revealing business secrets, there is software available that will examine the content of outgoing emails.
Because most businesses are too focused on looking for external dangers, it frequently happens that an employee abuses a company’s trust without expecting to be held accountable. As a result, it is best to look into any suspicious behavior that occurs on your business’s LAN. Keep in mind that there are rules governing monitoring, so inform yourself before breaking any of them.
F. BEHAVIORAL BIOMETRICS
To strengthen the defense against insider attacks, many biometrics have been deployed. Studied methods include behavioral biometrics (e.g., typing patterns, eye and head motions). Keystroke dynamics is a type of biometrics where insiders are continuously verified depending on their typing style. The variances among insiders’ key presses and releases are computed. As soon as an abnormal typing pattern is discovered, which is regarded as a masquerader attack, the tasks being executed are promptly blocked.
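The keystroke-dynamics idea can be sketched by comparing a session's inter-key (flight) times against an enrolled profile. The timings and tolerance below are hypothetical, chosen only to illustrate the comparison:

```python
import statistics

def flight_times(timestamps):
    """Key-press timestamps (ms) -> gaps between consecutive presses."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def is_masquerader(profile_gaps, session_timestamps, tolerance=3.0):
    """Flag the session if its mean flight time deviates strongly
    from the enrolled user's typing profile."""
    mu = statistics.mean(profile_gaps)
    sigma = statistics.pstdev(profile_gaps) or 1.0
    session_mu = statistics.mean(flight_times(session_timestamps))
    return abs(session_mu - mu) / sigma > tolerance
```

A session typed at the enrolled rhythm passes, while a markedly slower or faster typist is treated as a possible masquerader and can have their tasks blocked.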
G. PHYSIOLOGICAL BIOMETRICS
The primary objective of access control models is to control access to digital assets using various authentication techniques, such as passwords, tokens, fingerprints, etc., so access may only be granted to people who have the appropriate permissions and are approved. One of the main issues with access control models is that if a user is trusted for the duration of a session, they can abuse the capabilities they have been given without being noticed. The Intent-Based Access Control (IBAC) model was developed to solve this issue. IBAC confirms the integrity of insiders’ intent rather than their identity, in contrast to conventional access control schemes. IBAC is based on the theory that physiological traits, including brain signals, may be used to assess the sincerity of intents and prevent insider threats.
1) MITIGATION STRATEGIES
The threat of sensitive information being accessed will be reduced by using best practices to prevent insider attacks. These practices are as follows:
• All connections, including mobile ones, are monitored and under remote access control.
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
3. A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction and may be partitioned into levels that represent increasing information flow and functional detail.
UML is a method for describing the system architecture in detail using a blueprint. UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems. UML is a very important part of developing object-oriented software and the software development process. UML mostly uses graphical notations to express the design of software projects. Using UML helps project teams communicate, explore potential designs, and validate the architectural design of the software.
5.3.1 USE CASE DIAGRAM
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagra
m defined by and created from a Use-case analysis. Its purpose is to present a graphical overv
iew of the functionality provided by a system in terms of actors, their goals (represented as us
e cases), and any dependencies between those use cases. The main purpose of a use case diag
ram is to show what system functions are performed for which actor. Roles of the actors in th
e system can be depicted.
5.3.2 ACTIVITY DIAGRAM
Activity diagrams are graphical representations of work flows of stepwise activities and actio
ns with support for choice, iteration and concurrency. In the Unified Modeling Language, acti
vity diagrams can be used to describe the business and operational step-by-step work flows of
components in a system. An activity diagram shows the overall flow of control.
5.3.3 CLASS DIAGRAM
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's cla
sses, their attributes, operations (or methods), and the relationships among the classes. It expl
ains which class contains information.
CHAPTER -06
SYSTEM IMPLEMENTATION
The malicious insider is a crucial threat to the organization, since insiders have more access
and opportunity to produce significant damage. Unlike outsiders, insiders possess privileged
and proper access to information and resources. Furthermore, insiders are well-versed in the
organization’s vital assets. As a result, identifying and understanding insider attackers and their
objectives requires good internal threat classification. Insider risks may be defined and addressed
using criteria including insider indications, detection approaches, and insider kinds. There
are two sorts of analysis intervals: real-time, which may identify malicious activity as it happens,
and offline anomaly detection, which gathers log data and looks for certain patterns. Both
purposeful and accidental cyber-attacks on information, and the use of unauthorized activities
to affect the information’s availability, integrity, or secrecy, are examples of authorized
misuse actions. The threat approach determines the method for detecting malicious agents.
Attackers might easily introduce random data into a distributed algorithm to prevent it from
converging. Figure 3 shows the attacker’s approaches toward a user’s system in an organization
for stealing sensitive information or doing serious damage to the data. Attackers can also attack
through email by sending malicious code or URLs to the desired user accounts. Once an attacker
gains control of a user's credentials, those details can be used for further attacks.
The limitations of traditional machine learning for attack detection include their inability to
automatically design features, poor detection rates, and inability to identify small mutants of
known attacks and insider attacks. In the majority of circumstances, an ensemble of models will
perform better than the individual models on insider threat detection and classification.
FIGURE 3. Privilege Escalation Attack Process.
The combined output of several models is almost always less noisy than that of the individual
models. Model consistency and robustness result from this approach. Both linear and non-linear
relationships in the data are captured by ensemble models; this is achieved by combining
separate models into one ensemble. The proposed methodology consists of well-known
supervised machine learning algorithms, i.e., Random Forest, AdaBoost, XGBoost, and
LightGBM. These models utilize the dataset for detection and classification. Implementing these
models involves a series of steps, including dataset preprocessing, model training, model testing,
detection, and classification. Several challenges were faced when implementing the proposed
ML algorithms. A major concern was dataset collection for insider attacks. After evaluating
datasets from multiple sources such as dataset repositories and websites, the final dataset was
selected for implementation. Before training the models, the dataset was analyzed completely for
quality. Some features contained missing values and outliers. The ‘size’ feature of the dataset
contained some outliers, which were removed by averaging the neighboring values. The ‘File
Copy’ feature had some missing values, which were filled using the pattern of that feature.
High-quality data is very important for good results. Irrelevant features in the dataset also
impacted the training of the models, so the irrelevant features, i.e., ‘employee’ and ‘file tree’,
were removed and the models were trained on the specifically selected features. The initial
results during evaluation of the proposed models were not promising, so we tuned the
parameters, i.e., learning rate, maximum depth, and K-fold, to get efficient results.
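The preprocessing steps described above can be sketched as follows. The small frame below is a toy stand-in for the merged CERT-derived dataset; only the feature names ('size', 'File Copy', 'employee') come from the text, while the values and the outlier threshold are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the merged dataset; values are invented for illustration.
df = pd.DataFrame({
    "size": [120.0, 130.0, 99999.0, 125.0, 128.0],   # one obvious outlier
    "File Copy": [1.0, None, 0.0, None, 1.0],        # missing values
    "employee": ["e1", "e2", "e3", "e4", "e5"],      # irrelevant, dropped
})

# Outlier removal by averaging the neighboring values, as described above.
# The median-based threshold here is an assumption, not from the study.
s = df["size"]
mask = (s - s.median()).abs() > 3 * s.median()
df.loc[mask, "size"] = ((s.shift(1) + s.shift(-1)) / 2)[mask]

# Fill missing 'File Copy' values from the feature's own pattern
# (forward/backward fill used here as a simple stand-in).
df["File Copy"] = df["File Copy"].ffill().bfill()

# Drop features the text identifies as irrelevant.
df = df.drop(columns=["employee"])
print(df)
```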
Random Forest is one of the most widely recognized supervised machine learning techniques.
It may be used to address various regression and classification problems. It is an ensemble
learning method that combines many classifiers to tackle complex problems and enhance the
model’s efficiency. The advantages of Random Forest are that it handles both classification and
regression problems, it can handle large datasets with many dimensions, it increases model
accuracy and mitigates overfitting, and it provides the relative feature importance, which helps
choose the classifier’s most beneficial features. The implementation steps are as follows:
• Preprocessing of Dataset
• Random Forest algorithm Training
• Random Forest algorithm Testing
• Model Accuracy
• Visualize Results
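The steps above can be sketched with scikit-learn; the synthetic data below stands in for the preprocessed insider-attack dataset (labels 0 = no attack, 1 = attack, as in the text), and the hyperparameter values are illustrative, not the tuned values from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))              # six preprocessed features (toy)
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy attack / no-attack label

# Steps: preprocess -> train -> test -> accuracy (visualization elsewhere)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```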
The relevance of each feature in the random forest is determined using Gini importance, or
mean decrease in impurity (MDI). The Gini importance is also known as the total decline in
node impurity: the reduction in model fit or accuracy caused by removing a variable. The larger
the decline, the more significant the variable; the mean decline is key in determining which
variables to use. The total explanatory power of the variables may be expressed using the Gini
index. Random forest works based on decision trees. Multiple decision trees are created based on
randomly selected features from the dataset. For the insider attack dataset, the random forest
classifier selects the features and builds a decision tree on that dataset. Classification yields 0 for
no attack and 1 for an attack, and every generated decision tree yields 0 or 1. The combination of
bootstrapping and aggregation is what makes a random forest work. The outcome of each
decision tree is checked, and the outcome of the random forest is the majority of those outcomes;
for example, if the majority of the decision trees output 1, the final prediction will be 1, and vice
versa. Figure 4 demonstrates the working of random forest for classifying insider attacks:
starting from the main dataset, random subsets of the dataset are drawn for generating decision
trees, each decision tree yields 0 or 1, and the random forest algorithm then outputs 0 or 1 on the
basis of the majority. Some important features of random forest are as follows:
• Diversity
• Immune to the curse of dimensionality
• Parallelization
• Train-Test Split
• Stability
Figure 4 shows the Random Forest’s stepwise working for classifying insider threats. The
dataset is subdivided into subsets, and a random classifier builds a decision tree on each subset.
Each decision tree predicts an outcome, all the decision-tree outcomes are evaluated by majority
voting, and the final prediction is the most frequent outcome of the decision trees.
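The Gini importance described above is exposed by scikit-learn as feature_importances_, a sketch of which is shown below; the feature names are illustrative stand-ins, not the actual columns of the study's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 2] > 0).astype(int)   # only feature 2 is informative (by design)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in impurity (Gini importance), normalized to sum to 1.
for name, imp in zip(["logon", "device", "file", "http"],
                     clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```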
6.2 AdaBoost
AdaBoost (Adaptive Boosting) is a boosting ensemble method. It builds a model that gives each
piece of data equal weight; the improperly classified points are subsequently given more weight,
and all points with higher weights are prioritized in the next model. Models keep being trained
until a low error is reached. The weight-assigning approach used after each iteration
distinguishes the AdaBoost algorithm from other boosting algorithms and is its strongest
attribute. Unlike other algorithms, it is easy to use and requires little parameter tuning. AdaBoost
is relatively resistant to overfitting and can help weak classifiers increase their accuracy.
AdaBoost is an iterative ensemble algorithm. AdaBoost classifier combines several weak clas
sifiers to create a powerful classifier that has a high degree of accuracy. The fundamental idea
underlying Adaboost is to train the data sample and adjust the classifier weights in each iterat
ion to provide accurate predictions of uncommon observations. It strives to minimize training
errors to offer the best fit possible for these instances in each iteration. The AdaBoost algorith
m selects the random training subset for the insider attack dataset. The subset of the dataset is
utilized for the training of the Adaboost algorithm. AdaBoost gives incorrectly classified obse
rvations a higher weight so that they will have a higher chance of being correctly classified in
the next iteration. Additionally, based on the trained classifier’s accuracy, weight is assigned t
o it in each iteration. The more precise classifier will be given more weight. This method itera
tes until the entire training set fits perfectly or until the largest number of estimators has been
reached. When choosing a base learner, Gini and Entropy are considered. The base learner will
be the stump with the lowest Gini or Entropy. The output created while traveling through the
first stump may be 1; through the second stump, the output may again be 1; and it may produce 0
when going through the third stump. As in random forests, majority voting in the AdaBoost
method occurs among the stumps, and the final prediction is achieved by voting among all
stumps. The mathematical approach of the AdaBoost classifier is given below. Eq 1 is used to
assign the sample weights to the target class of the dataset. The Gini Impurity is then calculated
using Eq 2. The Alpha value is calculated using Eq 3 to measure the correctness of the sample
classification. After the first iteration, the weights for the next iteration are calculated for both
correctly and incorrectly classified samples using Eq 4 and Eq 5.
Step 5: Randomly select a new sample of the dataset based on the new sample weights.
Step 6: Repeat the process N times.
In figure 5, the implementation of the Adaboost classifier is demonstrated. The subset of the
dataset is utilized for the training of the Adaboost algorithm. AdaBoost gives incorrectly class
ified observations a higher weight so that they will have a higher chance of being correctly cl
assified in the next iteration. All the models on a subset of the dataset with higher weightage s
cenarios were then analyzed on the basis of majority voting. The final prediction is based on t
he majority outcome of the models.
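The stump-based boosting procedure above can be sketched with scikit-learn's AdaBoostClassifier, whose default base learner is a depth-1 decision stump; the data and hyperparameters below are illustrative assumptions, not those of the study.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0.2).astype(int)   # toy label separable on one feature

# Default base learner is a decision stump (tree of depth 1), matching the
# stump-voting description above; misclassified samples are re-weighted
# each iteration, and stump votes are combined for the final prediction.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
print(f"training accuracy: {train_acc:.2f}")
```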
6.3 XGBoost
XGBoost is a flexible and extremely accurate gradient-boosting system that pushes the
boundaries of computing capability for boosted tree methods. It has the advantages of improving
the algorithm and modifying the model, and can also be used in distributed computing
infrastructure.
FIGURE 6. XGBoost Classifier for Insider threat classification.
Regression and classification problems are among those it is used to address. The method
involves building decision trees one at a time. Weights are highly important in XGBoost:
weights are assigned to each independent variable, which is then fed into the decision tree that
predicts the outcome. It is less time-consuming than Gradient Boosting and is designed to deal
with incomplete data using its built-in abilities. The user may perform cross-validation after each
loop. It is suited to small through large datasets. Two equations are used to calculate the
similarity scores and the new residuals: Eq 6 is used for calculating the similarity score, and
Eq 7 is then used to calculate the new residuals for the next iteration of the algorithm.
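The tree-at-a-time boosting described above can be sketched as follows. The actual study would use xgboost.XGBClassifier; since that package may not be installed, the sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in with the same fit/predict interface, on synthetic data with illustrative hyperparameters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 1] + 0.5 * X[:, 3] > 0).astype(int)   # toy label

# Trees are built one at a time; learning_rate and max_depth are among the
# parameters the text reports tuning.
clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
print(f"training accuracy: {train_acc:.2f}")
```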
6.4 LightGBM
LightGBM is a boosting strategy that applies tree-based learning techniques, which are regarded
as a very effective processing method. Unlike other algorithms, which construct their trees
level-wise, the LightGBM method grows trees leaf-wise. With its processing speed and speedy
delivery of results, LightGBM is termed ‘‘Light.’’
In contrast to previous boosting algorithms that develop trees level by level, LightGBM splits
the tree leaf-wise, selecting the leaf with the greatest delta loss for growth. Figure 7 shows the
implementation architecture of LightGBM. The leaf-wise growth technique is more effective
since it only splits the leaves with the greatest information gain within the same layer. The
learning rate (Lr), number of leaves, and maximum depth are a few key parameters that we
altered during the construction of LightGBM. Lr is a hyperparameter that regulates how quickly
the model’s internal parameters are updated. LightGBM is a robust decision-tree-based machine
learning model that is fast, stable, and has good accuracy and predictive power.
The main goal of ensemble learning is to enhance a model’s performance. Ensemble learning
includes bagging, boosting, and stacking. Bagging and boosting techniques are used in our st
udy to detect and classify insider threats. Data aggregation during the data pre-processing pha
se enables us to extract data that offers insightful information about how well the models wor
k. Data normalization is a very helpful method of transforming characteristics to be on a com
parable scale during the data preparation stage. The model performs better and maintains trai
ning stability as a result. The feature extraction process became essential in lowering the volu
me of redundant data in the data collection. In the end, the data reduction speeds up the learni
ng and generalizations phases of the machine learning process while also enabling the model
to be built with less machine effort. In order to reduce training mistakes, boosting is an ensem
ble learning technique that combines a number of base learners into strong learners. So in this
perspective, the boosting technique is utilized, which lies in ensemble learning.
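The normalization step mentioned above, transforming features onto a comparable scale, can be sketched as follows, assuming standardization (zero mean, unit variance) as the transformation; the two-column array is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization brings each column to mean 0 and standard deviation 1,
# which keeps training stable, as described above.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```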
The best algorithm among these four is LightGBM which shows the highest accuracy.
Table 1 demonstrates the factors which play their roles in the high performance of these
algorithms.
One of the most often used computer languages is Python, which has displaced many other la
nguages in the field primarily due to its enormous library set. We have implemented the follo
wing best Python libraries as shown in table 2 in the proposed study.
Figure 9 explains the flow of the proposed technique, in which data is gathered from the datas
et and is preprocessed. Machine learning models are trained using the data, and testing is perf
ormed based on the ratio of the dataset.
The experiments are done on a Linux operating system, Ubuntu 18.04, on an ACER Aspire 5349
machine with a 3rd-generation Intel Core i5 processor and 6 GB of RAM for the multichain
network. In this work, the CERT dataset is utilized. This dataset includes many features for
the detection and classification of insider threats. This dataset includes multiple files for diffe
rent activities performed by the user or attacker. From this newly updated dataset, this paper
selects multiple features from multiple files and merges them to make a single dataset file for
effective experimental setup and results evaluation.
FIGURE 8. Overview of Privilege Escalation Attack Proposed Models.
All the files in that dataset are in CSV format, so they are easily analyzed, preprocessed, and
used by the proposed models for better results.
Figure 10 represents the features and the value range of the given features. These features are
specifically gathered from multiple dataset files. These relevant features show the most impor
tant ways the attacker performs insider attacks.
Figure 11 shows the distribution of user actions along with the pc number they contain. It is s
hown that most of the action
6.7 MODULES
Load Data
Data collection
Data pre-processing
Feature Selection
Feature Extraction
6.7.1 LOAD DATA:
Pandas allows you to import data from a wide range of data sources directly into a dataframe.
These can be static files, such as CSV, TSV, fixed width files, Microsoft Excel, JSON, SAS a
nd SPSS files, as well as a range of popular databases, such as MySQL, PostgreSQL and Goo
gle BigQuery. You can even scrape data directly from web pages into Pandas dataframes.
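Loading one of the CSV files into a dataframe can be sketched as follows; io.StringIO stands in for a file on disk, and the column names are illustrative, not the actual CERT columns.

```python
import io
import pandas as pd

# In the real setup this would be pd.read_csv("logon.csv") or similar;
# here a small in-memory CSV stands in for the file.
csv_text = "user,action,size\nU1,logon,120\nU2,file_copy,340\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```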
6.7.2 DATA COLLECTION:
Data collection means pooling data by scraping, capturing, and loading it from multiple sources,
including offline and online sources. High-volume data collection or data creation can be the
hardest part of a machine learning project, especially at scale. Data collection allows you to
capture a record of past events so that data analysis can find recurring patterns. From those
patterns, you build predictive models using machine learning algorithms that look for trends and
predict future changes. Predictive models are only as good as the data from which they are built,
so good data collection practices are crucial to developing high-performing models. The data
needs to be error-free and contain relevant information for the task at hand. For example, a loan
default model would not benefit from tiger population sizes but could benefit from gas prices
over time.
6.7.3 DATA PRE-PROCESSING:
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and most crucial step when creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted
data, and before doing any operation with data it is mandatory to clean it and put it in a
formatted way; for this we use data preprocessing.
Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be directly used by machine learning models. Data preprocessing is the required task
of cleaning the data and making it suitable for a machine learning model, which also increases
the accuracy and efficiency of the model.
○ Importing libraries
○ Importing datasets
○ Feature scaling
6.7.4 FEATURE SELECTION:
The goal of feature selection techniques is to find the best set of features that allows one to build
optimized models of the studied phenomena. Feature selection techniques can be broadly
classified into the following categories. Supervised techniques can be used for labeled data, to
identify the relevant features for increasing the efficiency of supervised models like
classification and regression; for example, linear regression, decision trees, SVM, etc.
Unsupervised techniques can be used for unlabeled data; for example, K-Means clustering,
Principal Component Analysis, hierarchical clustering, etc. From a taxonomic point of view,
these techniques are classified into filter, wrapper, embedded, and hybrid methods.
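A supervised, filter-style feature selection can be sketched with scikit-learn's SelectKBest; the data is synthetic and the choice of the ANOVA F-score (f_classif) is an illustrative assumption.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 3] > 0).astype(int)   # only column 3 is informative (by design)

# Filter method: score each feature against the label, keep the top k.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support())   # boolean mask of kept features
```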
6.7.5 FEATURE EXTRACTION:
Feature extraction is part of the dimensionality reduction process, in which an initial set of raw
data is divided and reduced to more manageable groups, making it easier to process. The most
important characteristic of these large data sets is that they have a large number of variables,
which require a lot of computing resources to process. Feature extraction helps to get the best
features from those big data sets by selecting and combining variables into features, thus
effectively reducing the amount of data. These features are easy to process, but are still able to
describe the actual data set with accuracy and originality. Color features, for example, are
obtained by extracting statistical features from image histograms and are used to provide a
general description of color statistics in an image.
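Combining variables into a smaller set of features can be sketched with Principal Component Analysis; the data below is synthetic, built with three latent variables so that three components recover almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
base = rng.normal(size=(150, 3))
# Ten correlated columns generated from three latent variables.
X = base @ rng.normal(size=(3, 10))

# Extract three combined features; they capture (almost) all the variance
# because the data has only three underlying degrees of freedom.
pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```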
Deep learning is a subset of machine learning that focuses on the use of artificial neural
networks to model and solve complex problems. It's inspired by the structure and function of
the human brain, specifically the interconnected layers of neurons. Deep learning has gained
significant popularity and success in various fields, thanks to its ability to automatically learn
hierarchical representations from data.
Here are some key points about deep learning:
1. Neural Networks:At the core of deep learning are neural networks, which are composed of
layers of interconnected nodes or artificial neurons. The "deep" in deep learning refers to the
depth of these networks, meaning they have multiple layers (deep architectures).
2. Deep Neural Networks (DNNs): Deep learning often involves training deep neural
networks, which can range from a few layers (shallow networks) to many layers (deep
networks). The depth allows the network to learn intricate features and representations from
the input data.
3. Training with Backpropagation: Deep neural networks are trained using a process called
backpropagation. It involves iteratively adjusting the weights of connections between neurons
to minimize the difference between the predicted output and the actual target. This is
typically done using optimization algorithms like stochastic gradient descent.
ResNet (Residual Network) and DenseNet (Densely Connected Convolutional Network) are
two popular architectures in the field of deep learning, specifically designed to address
challenges related to training very deep neural networks. Let's take a brief look at each:
ResNet (Residual Network):
1. Introduction: ResNet was introduced by Kaiming He et al. in 2015. The key innovation is
the use of residual blocks, which contain shortcut connections that skip one or more layers.
This helps in addressing the vanishing gradient problem and enables the training of very deep
networks.
2. Benefits: ResNet has been successful in training extremely deep networks, reaching
hundreds of layers. It has been widely used in image classification, object detection, and
other computer vision tasks.
DenseNet (Densely Connected Convolutional Network):
1. Introduction: DenseNet, introduced by Gao Huang et al. in 2017, takes a different approach
by promoting dense connectivity between layers. In a DenseNet, each layer receives input
from all preceding layers and passes its output to all subsequent layers.
2. Dense Blocks: The fundamental building block in DenseNet is the dense block. It consists
of multiple layers, and each layer receives the feature maps from all preceding layers as
input. This dense connectivity enhances feature reuse and promotes gradient flow.
3. Transition Blocks: To manage the growth of parameters and computation in dense blocks,
transition blocks are used to reduce the number of feature maps before passing them to the
next dense block. This helps in maintaining computational efficiency.
4. Benefits: DenseNet often requires fewer parameters compared to traditional architectures,
leading to more efficient models. The dense connectivity also helps in mitigating the
vanishing gradient problem and encourages feature reuse.
5. Applications: DenseNet has been successfully applied to tasks such as image classification
and segmentation, achieving competitive performance with fewer parameters.
Training Dataset
The training data is the biggest (in size) subset of the original dataset, which is used to train
or fit the machine learning model. Firstly, the training data is fed to the ML algorithms,
which lets them learn how to make predictions for the given task.
Test Dataset
Once we train the model with the training dataset, it's time to test the model with the test
dataset. This dataset evaluates the performance of the model and ensures that the model can
generalize well with the new or unseen dataset. The test dataset is another subset of
original data, which is independent of the training dataset. However, it has some similar
types of features and class probability distribution and uses it as a benchmark for model
evaluation once the model training is completed. Test data is a well-organized dataset that
contains data for each type of scenario for a given problem that the model would be facing
when used in the real world. Usually, the test dataset is approximately 20-25% of the total
original data for an ML project.
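The split described above can be sketched with scikit-learn's train_test_split; the 20% hold-out matches the 20-25% range stated in the text, and the data is a synthetic stand-in.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)   # toy feature matrix
y = np.arange(100) % 2               # toy labels

# Hold out ~20% of the data as the test set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))
```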
Convolutional Neural Networks (CNNs) are a type of deep learning algorithm commonly
used for image and video recognition. Here's a brief overview of the working process of a
CNN:
1. Input Layer:
Takes in the raw input data, which is usually an image in the case of computer vision tasks.
2. Convolutional Layers:
- Apply a set of filters (kernels) to the input data, performing convolution operations.
- Filters detect patterns and features in the input, such as edges, textures, or shapes.
- Convolution helps preserve spatial relationships within the data.
3. Activation Function:
- Introduces non-linearity to the system, typically using functions like ReLU (Rectified
Linear Unit).
- Helps the network learn complex patterns and relationships.
4. Pooling Layers:
- Downsample the feature maps (e.g., max pooling), reducing spatial size and computation.
5. Flattening:
- Converts the pooled feature maps into a one-dimensional vector.
6. Fully Connected Layers:
- Combine the extracted features to reason about the input as a whole.
7. Output Layer:
- Produces the final predictions based on the learned features and patterns.
- The activation function in the output layer depends on the task (e.g., softmax for
classification).
8. Loss Function:
- Measures the difference between the predicted output and the actual target.
- The goal is to minimize this difference during training.
9. Backpropagation:
- Calculates the gradient of the loss function with respect to the weights.
- Adjusts the weights using optimization algorithms (e.g., stochastic gradient descent) to
minimize the loss.
10. Training:
- Iteratively updates the network's parameters using backpropagation and optimization.
- Continues until the model performs well on the training data.
11. Testing/Prediction:
- The trained model is used to make predictions on new, unseen data.
In summary, CNNs are designed to automatically and adaptively learn spatial hierarchies of
features from input data, making them well-suited for tasks like image recognition and object
detection.
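The convolution step (item 2 above) can be illustrated with plain NumPy; the 4x4 "image" and 2x2 edge-style kernel are illustrative assumptions, and deep learning frameworks perform the same operation over many filters and channels.

```python
import numpy as np

def conv2d(image, kernel):
    """2-D convolution with valid padding and stride 1."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise multiply the window by the kernel and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[1, -1],
                   [1, -1]], dtype=float)   # responds to vertical edges

feature_map = conv2d(image, kernel)
print(feature_map)   # strongest response where the 0->1 edge sits
```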
Update Gate(z): It determines how much of the past knowledge needs to be passed along
into the future. It is analogous to the Output Gate in an LSTM recurrent unit.
Reset Gate(r): It determines how much of the past knowledge to forget. It is analogous to the
combination of the Input Gate and the Forget Gate in an LSTM recurrent unit.
The basic work-flow of a Gated Recurrent Unit Network is similar to that of a basic
Recurrent Neural Network when illustrated, the main difference between the two is in the
internal working within each recurrent unit as Gated Recurrent Unit networks consist of gates
which modulate the current input and the previous hidden state.
6.7.7 MODEL SELECTION IN Deep Learning:
Model selection in Deep Learning is the process of selecting the best algorithm and model
architecture for a specific job or dataset. It entails assessing and contrasting various models to
identify the one that best fits the data & produces the best results. Model complexity, data
handling capabilities, and generalizability to new examples are all taken into account while
choosing a model. Models are evaluated and contrasted using methods like cross-validation,
and grid search, as well as indicators like accuracy and mean squared error. Finding a model
that balances complexity and performance to produce reliable predictions and strong
generalization abilities is the aim of model selection.
6.8 TECHNOLOGIES
PYTHON
Django
6.8.1 Python
Python is a high-level interpreted programming language. Python provides many GUI
(Graphical User Interface) development possibilities; Tkinter, the standard Python interface to
the Tk GUI toolkit, is among the most frequently used, while Flask is a popular framework for
building web interfaces.
Python with Flask is among the quickest and simplest ways of creating web-based interfaces,
and creating an interface with Flask is a simple job. Python is a common, flexible, and popular
programming language.
It is excellent as a first language since it is succinct and simple to understand, and it is also good
to have in any programmer's toolkit because it can be utilized for everything from web
development to software development. Its basic, easy-to-use grammar makes it an ideal first
language for learning computer programming.
Most implementations of Python (including CPython) include a read-eval-print loop (REPL)
that lets the user act as a command-line interpreter, entering instructions in sequence and
receiving results immediately. Other shells, such as IDLE and IPython, provide extra features
such as auto-completion, session retention, and syntax highlighting.
Invoking the interpreter without passing a script file as a parameter brings up the following
prompt:
$ python
Python 2.4.3 (#1, Nov 11 2010, 13:34:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Type the following text at the Python prompt and press Enter:
>>> print "Hello, Python!"
If you are running a newer version of Python, you need to use the print statement with
parentheses, as in print ("Hello, Python!"). In Python version 2.4.3, however, this produces the
following result:
Hello, Python!
Invoking the interpreter with a script parameter begins execution of the script and continues
until the script is finished. When the script is finished, the interpreter is no longer active.
Let us write a simple Python program in a script. Python files have the extension .py. Type the
following source code in a test.py file:
print "Hello, Python!"
We assume that you have the Python interpreter set in your PATH variable. Now try to run this
program; it produces:
Hello, Python!
Let us try another way to execute a Python script. Here is the modified test.py file:
#!/usr/bin/python
print "Hello, Python!"
We assume that the Python interpreter is available in the /usr/bin directory. Now try to run this
program as follows:
$ chmod +x test.py # This is to make the file executable
$ ./test.py
This produces the following result:
Hello, Python!
6.8.2 Django
Django is a Python framework that makes it easier to create web sites using Python.
Django takes care of the difficult stuff so that you can concentrate on building your web appli
cations.
Django emphasizes reusability of components, also referred to as DRY (Don't Repeat Yourse
lf), and comes with ready-to-use features like login system, database connection and CRUD o
perations (Create Read Update Delete).
● Model - The data you want to present, usually data from a database.
● View - A request handler that returns the relevant template and content - based on the
request from the user.
● Template - A text file (like an HTML file) containing the layout of the web page, with
logic on how to display the data.
Model
The most common way to extract data from a database is SQL. One problem with SQL is that
you have to have a pretty good understanding of the database structure to be able to work wit
h it.
Django, with ORM, makes it easier to communicate with the database, without having to writ
e complex SQL statements.
View
A view is a function or method that takes http requests as arguments, imports the relevant mo
del(s), and finds out what data to send to the template, and returns the final result.
Template
A template is a file where you describe how the result should be represented.
Templates are often .html files, with HTML code describing the layout of a web page, but they can also be in other file formats to present other results; here we will concentrate on .html files.
Django uses standard HTML to describe the layout, but uses Django tags to add logic:
<h1>My Homepage</h1>
<h2>My name is {{ firstname }}.</h2>
Django also provides a way to navigate around the different pages in a website.
When a user requests a URL, Django decides which view it will send it to.
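To illustrate the request flow described above without a full Django project, here is a plain-Python sketch that mimics the Model-View-Template pattern; the names (get_member, member_view, TEMPLATE) are stand-ins, not real Django APIs.

```python
# Model: supplies the data (here a hard-coded stand-in for a database row).
def get_member(member_id):
    return {"firstname": "Alice"}

# Template: describes how the result is represented; {firstname} plays the
# role of a Django template tag like {{ firstname }}.
TEMPLATE = "<h1>My Homepage</h1><h2>My name is {firstname}.</h2>"

# View: takes the request, pulls data from the model, and returns
# the rendered template as the final result.
def member_view(request):
    data = get_member(request["member_id"])
    return TEMPLATE.format(**data)

html = member_view({"member_id": 1})
print(html)
```

In real Django, the URL configuration would route the request to `member_view`, and the template engine would render the .html file instead of `str.format`.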
CHAPTER -09
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests. Each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly, and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at the component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
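As a minimal illustration of the component-level tests described above, the following sketch unit-tests a small helper with Python's built-in unittest module; the is_privileged function is an invented example, not code from this project.

```python
import unittest

def is_privileged(role):
    """Hypothetical helper: only 'admin' and 'root' count as privileged."""
    return role in {"admin", "root"}

class TestIsPrivileged(unittest.TestCase):
    # Each test validates one decision branch of the unit under test.
    def test_privileged_role(self):
        self.assertTrue(is_privileged("admin"))

    def test_unprivileged_role(self):
        self.assertFalse(is_privileged("guest"))

# Run the suite programmatically so the script can be executed directly.
suite = unittest.TestLoader().loadTestsFromTestCase(TestIsPrivileged)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```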
Integration testing
Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are availa
ble as specified by the business and technical requirements, system documentation, and user
manuals.
White Box Testing is testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.
Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the softwar
e lifecycle, although it is not uncommon for coding and unit testing to be conducted as two di
stinct phases.
Field testing will be performed manually and functional tests will be written in detail.
Test objectives
Features to be tested
Integration Testing
Software integration testing is the incremental integration testing of two or more integrated so
ftware components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications, e.
g. components in a software system or – one step up – software applications at the company l
evel – interact without error.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participatio
n by the end user. It also ensures that the system meets the functional requirements.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
TEST CASES:
CHAPTER-10
EXPERIMENTAL RESULTS
10.1 EXPERIMENTAL PARAMETERS
Table 3 shows the experimental parameters considered during classification using the machine learning algorithms. The core parameters that helped in increasing the accuracy of LightGBM are learning_rate, num_leaves, and bagging_freq (bagging frequency). For increasing the performance of the XGBoost classifier the core parameters are max_depth, learning_rate, min_child_weight, and gamma. Table 4 shows the comparative analysis of the performance of the classification algorithms used in this paper. LightGBM's performance is the highest, producing the best results in terms of accuracy. The other applied algorithms, XGBoost, AdaBoost, and Random Forest, also achieved better accuracy than the algorithms reported in previous studies.
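The tuned parameters named above can be collected in plain dictionaries, as in the sketch below; the parameter names are the standard LightGBM and XGBoost ones, but the values shown are illustrative placeholders, not the exact settings from Table 3.

```python
# Core LightGBM parameters mentioned above (values are illustrative).
lightgbm_params = {
    "learning_rate": 0.1,  # step-size shrinkage applied each boosting round
    "num_leaves": 31,      # maximum number of leaves per tree
    "bagging_freq": 5,     # perform bagging every k iterations
}

# Core XGBoost parameters mentioned above (values are illustrative).
xgboost_params = {
    "max_depth": 6,         # maximum depth of each tree
    "learning_rate": 0.1,
    "min_child_weight": 1,  # minimum sum of instance weight needed in a child
    "gamma": 0.0,           # minimum loss reduction required to make a split
}

print(sorted(lightgbm_params), sorted(xgboost_params))
```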
10.2 EXPERIMENTAL RESULTS
One technique to enhance the precision of a decision tree is to boost it. Each training sample is given an equal weight at first. After a classifier is learned, the weights are modified so that the next classifier pays greater attention to the previously misclassified samples. It has been seen from the results that the proposed models achieve the best accuracy on the given dataset: RF has 86%, AdaBoost has 88%, XGBoost has 88.27%, and the best of them all, LightGBM, gives the highest accuracy of 97%. The measurements utilized for the assessment of the proposed models are given below.
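The reweighting idea described above can be sketched in a few lines of plain Python; this is a simplified AdaBoost-style update on toy data, not the project's actual training code.

```python
import math

# Toy labels and the predictions of one weak classifier.
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]  # sample at index 1 is misclassified

# Start with equal weights, as in the first boosting round.
weights = [1 / len(y_true)] * len(y_true)

# Weighted error and the classifier coefficient (alpha).
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
alpha = 0.5 * math.log((1 - err) / err)

# Increase the weight of misclassified samples, decrease the rest,
# then renormalize so the next classifier focuses on the hard cases.
weights = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]
print(weights)
```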
Figure 12 illustrates the confusion matrix built on the Random Forest classification. It plots the predicted values against the actual values. The Random Forest classifier predicted most samples of the dataset correctly, which helps in improving the accuracy of the classifier.
Figure 13 illustrates the confusion matrix built on the classification of the AdaBoost algorithm. It plots the predicted values against the actual values. The AdaBoost classifier predicted most samples of the dataset correctly, which helps in improving the accuracy of the classifier. The prediction results of AdaBoost are better than those of Random Forest on the same dataset.
Figure 14 illustrates the confusion matrix built on the classification of the XGBoost algorithm. It plots the predicted values against the actual values. The XGBoost classifier predicted most samples of the dataset correctly, which helps in improving the accuracy of the classifier. The prediction results of XGBoost are better than those of AdaBoost on the same dataset.
Figure 15 illustrates the confusion matrix built on the classification of the LightGBM algorithm. It plots the predicted values against the actual values. The LightGBM classifier predicted most samples of the dataset correctly, which helps in improving the accuracy of the classifier. The prediction results of LightGBM are better than those of XGBoost on the same dataset.
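A binary confusion matrix like the ones in Figures 12-15 simply counts predictions against actual values; the sketch below builds one from toy labels (1 = insider attack, 0 = benign), not from the paper's data.

```python
# Toy actual and predicted labels (1 = insider attack, 0 = benign).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

# Rows are the actual class, columns are the predicted class.
confusion = [[tn, fp],
             [fn, tp]]
print(confusion)
```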
10.3 DISCUSSION
This work applies four machine learning algorithms to classify insider attacks. Figure 17 is th
e graphical representation of the applied algorithms for classifying the insider attack. The best
algorithm among these is LightGBM which shows the highest accuracy. These algorithms we
re applied to the same dataset, and their comparative classification results are shown in Figur
e 17.
The field of machine learning algorithms is vast, and each algorithm has its own benefits and limitations. Different machine learning algorithms have been applied to various datasets to perform classification. In this work, one bagging and three boosting algorithms have been applied to the same dataset to perform classification. The results show that the boosting algorithms achieve higher accuracy than Random Forest. All of these algorithms have also been used for the classification of different
datasets. Figure 18 demonstrates multiple algorithms applied to the customized CERT dataset for classification. SVM, Naïve Bayes, and Gradient Boosting algorithms were analyzed on the same dataset. The Naïve Bayes algorithm did not perform well, while the other two algorithms gave better results in terms of classifying threats.
Figure 19 demonstrates the recall of the proposed techniques; LightGBM has the highest value among them.
FIGURE 19. Recall score of proposed algorithms.
LightGBM returns the most relevant results and therefore achieves the best recall value. The F-measure summarizes the accuracy of a test; it is determined using the test's precision and recall. Figure 20 shows the F1 measure of the proposed techniques. It can be seen in the figure that the LightGBM algorithm obtained the highest value of 0.97.
Figure 21 shows the comparative analysis of the different algorithms applied to the customized dataset. The results are compared on the basis of Recall, Precision, and F1 Score. The ensemble learning approaches produced strong results because of their learning and classification approach. Figure 18 shows the highest accuracy value, 97%, achieved by the LightGBM algorithm.
With various hyperparameters that can be tweaked for optimal efficiency, such as the number of leaves per tree, the learning rate, and the regularization parameters, LightGBM enables greater control over the training process. Due to its ability to efficiently learn complicated associations between features and targets in huge, high-dimensional datasets, LightGBM performs better than the other algorithms at classification tasks. Thanks to its histogram-based approach, LightGBM provides several benefits for classification problems, including a faster training speed and greater efficiency. Moreover, it employs two novel techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which further reduce training time and memory consumption.
The proportion of false alarms, or false positives, produced by a model is measured by the false alarm rate, making it a crucial parameter in machine learning. False alarms can lead to unneeded interruptions or interventions, which can be harmful in some situations. The accuracy and efficiency of machine learning models must thus be increased while simultaneously lowering the false alarm rate. Figure 22 is the graphical representation of the False Alarm Rate.
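The evaluation measures used above all follow directly from the confusion-matrix counts; the sketch below computes them for illustrative counts, not the paper's actual results.

```python
# Illustrative confusion-matrix counts (not the paper's results).
tp, tn, fp, fn = 90, 95, 5, 10

precision = tp / (tp + fp)          # flagged events that were real attacks
recall = tp / (tp + fn)             # fraction of real attacks detected
f1 = 2 * precision * recall / (precision + recall)
false_alarm_rate = fp / (fp + tn)   # benign events flagged as attacks

print(round(precision, 3), round(recall, 3),
      round(f1, 3), round(false_alarm_rate, 3))
```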
Table 5 compares the algorithms from the literature with those of the proposed methodology. LightGBM achieved the highest accuracy among all the algorithms applied in this work as well as those from recent studies.
CHAPTER-11
CONCLUSION
The proposed project, "Privilege Escalation Attack Detection and Mitigation in the Cloud using Machine Learning", presents a highly efficient and accurate model for detecting privilege escalation attacks. Key outcomes of this project include:
● Improved Accuracy: The proposed GRU-based model achieves an accuracy of 98.3%, surpassing the existing system's 97% accuracy.
● Reduced False Positives: The model maintains a minimal false positive rate, enhancing its reliability.
● Time Efficiency: The GRU algorithm processes time-series data more efficiently, ensuring faster detection of malicious activities.
FUTURE ENHANCEMENT
The scope of using machine learning (ML) for detecting and mitigating privilege escalation attacks in cloud environments can be summarized in three key areas:
1. Anomaly Detection and Behavior Analysis: The goal is to detect deviations from normal usage patterns that could indicate privilege escalation attempts.
2. Automated Threat Mitigation: This includes automated actions like restricting access and notifying administrators, reducing response time and potential damage.
3. Continuous Learning and Adaptation: Implement continuous learning mechanisms to adapt to new attack vectors and evolving threat landscapes.
CHAPTER-12
REFERENCES
[1] U. A. Butt, R. Amin, H. Aldabbas, S. Mohan, B. Alouffi, and A. Ahmadian, ‘‘Cloud-base
d email phishing attack using machine and deep learning algorithm,’’ Complex Intell. Syst., p
p. 1–28, Jun. 2022.
[2] D. C. Le and A. N. Zincir-Heywood, ‘‘Machine learning based insider threat modelling a
nd detection,’’ in Proc. IFIP/IEEE Symp. Integr. Netw. Service Manag. (IM), Apr. 2019, pp.
1–6.
[3] P. Oberoi, ‘‘Survey of various security attacks in clouds based environments,’’ Int. J. Adv. Res. Comput. Sci., vol. 8, no. 9, pp. 405–410, Sep. 2017.
[4] A. Ajmal, S. Ibrar, and R. Amin, ‘‘Cloud computing platform: Performance analysis of pr
ominent cryptographic algorithms,’’ Concurrency Comput., Pract. Exper., vol. 34, no. 15, p. e
6938, Jul. 2022.
[5] U. A. Butt, R. Amin, M. Mehmood, H. Aldabbas, M. T. Alharbi, and N. Albaqami, ‘‘Clou
d security threats and solutions: A survey,’’ Wireless Pers. Commun., vol. 128, no. 1, pp. 387
–413, Jan. 2023.
[6] H. Touqeer, S. Zaman, R. Amin, M. Hussain, F. Al-Turjman, and M. Bilal, ‘‘Smart home
security: Challenges, issues and solutions at different IoT layers,’’ J. Supercomput., vol. 77, n
o. 12, pp. 14053–14089, Dec. 2021.
[7] S. Zou, H. Sun, G. Xu, and R. Quan, ‘‘Ensemble strategy for insider threat detection from
user activity logs,’’ Comput., Mater. Continua, vol. 65, no. 2, pp. 1321–1334, 2020.
[8] G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido, and M. Marchetti, ‘‘On the effectivene
ss of machine and deep learning for cyber security,’’ in Proc. 10th Int. Conf. Cyber Conflict
(CyCon), May 2018, pp. 371–390.
[9] D. C. Le, N. Zincir-Heywood, and M. I. Heywood, ‘‘Analyzing data granularity levels for
insider threat detection using machine learning,’’ IEEE Trans. Netw. Service Manag., vol. 17,
no. 1, pp. 30–44, Mar. 2020.
[10] F. Janjua, A. Masood, H. Abbas, and I. Rashid, ‘‘Handling insider threat through supervi
sed machine learning techniques,’’ Proc. Comput. Sci., vol. 177, pp. 64–71, Jan. 2020.
[11] R. Kumar, K. Sethi, N. Prajapati, R. R. Rout, and P. Bera, ‘‘Machine learning based mal
ware detection in cloud environment using clustering approach,’’ in Proc. 11th Int. Conf. Co
mput., Commun. Netw. Technol. (ICCCNT), Jul. 2020, pp. 1–7.
[12] D. Tripathy, R. Gohil, and T. Halabi, ‘‘Detecting SQL injection attacks in cloud SaaS us
ing machine learning,’’ in Proc. IEEE 6th Int. Conf. Big Data Secur. Cloud (BigDataSecurit
y), Int. Conf. High Perform. Smart Comput., (HPSC), IEEE Int. Conf. Intell. Data Secur. (ID
S), May 2020, pp. 145–150.
[13] X. Sun, Y. Wang, and Z. Shi, ‘‘Insider threat detection using an unsupervised learning m
ethod: COPOD,’’ in Proc. Int. Conf. Commun., Inf. Syst. Comput. Eng. (CISCE), May 2021,
pp. 749–754.
[14] J. Kim, M. Park, H. Kim, S. Cho, and P. Kang, ‘‘Insider threat detection based on user b
ehavior modeling and anomaly detection algorithms,’’ Appl. Sci., vol. 9, no. 19, p. 4018, Sep.
2019.
[15] L. Liu, O. de Vel, Q.-L. Han, J. Zhang, and Y. Xiang, ‘‘Detecting and preventing cyber i
nsider threats: A survey,’’ IEEE Commun. Surveys Tuts., vol. 20, no. 2, pp. 1397–1417, 2nd
Quart., 2018.
[16] P. Chattopadhyay, L. Wang, and Y.-P. Tan, ‘‘Scenario-based insider threat detection fro
m cyber activities,’’ IEEE Trans. Computat. Social Syst., vol. 5, no. 3, pp. 660–675, Sep. 201
8.
[17] G. Ravikumar and M. Govindarasu, ‘‘Anomaly detection and mitigation for wide-area d
amping control using machine learning,’’ IEEE Trans. Smart Grid, early access, May 18, 202
0, doi: 10.1109/TSG.2020.2995313.
[18] M. I. Tariq, N. A. Memon, S. Ahmed, S. Tayyaba, M. T. Mushtaq, N. A. Mian, M. Imra
n, and M. W. Ashraf, ‘‘A review of deep learning security and privacy defensive technique
s,’’ Mobile Inf. Syst., vol. 2020, pp. 1–18, Apr. 2020.
[19] D. S. Berman, A. L. Buczak, J. S. Chavis, and C. L. Corbett, ‘‘A survey of deep learning
methods for cyber security,’’ Information, vol. 10, no. 4, p. 122, 2019.
[20] N. T. Van and T. N. Thinh, ‘‘An anomaly-based network intrusion detection system usin
g deep learning,’’ in Proc. Int. Conf. Syst. Sci. Eng. (ICSSE), 2017, pp. 210–214.
[21] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, ‘‘Deep learning for anomaly detection:
A review,’’ ACM Comput. Surv., vol. 54, no. 2, pp. 1–38, Mar. 2021.
[22] A. Arora, A. Khanna, A. Rastogi, and A. Agarwal, ‘‘Cloud security ecosystem for data s
ecurity and privacy,’’ in Proc. 7th Int. Conf. Cloud Comput., Data Sci. Eng., Jan. 2017, pp. 2
88–292.
[23] L. Coppolino, S. D’Antonio, G. Mazzeo, and L. Romano, ‘‘Cloud security: Emerging thr
eats and current solutions,’’ Comput. Electr. Eng., vol. 59, pp. 126–140, Apr. 2017.
[24] M. Abdelsalam, R. Krishnan, Y. Huang, and R. Sandhu, ‘‘Malware detection in cloud in
frastructures using convolutional neural networks,’’ in Proc. IEEE 11th Int. Conf. Cloud Com
put. (CLOUD), Jul. 2018, pp. 162–169.
[25] F. Jaafar, G. Nicolescu, and C. Richard, ‘‘A systematic approach for privilege escalation
prevention,’’ in Proc. IEEE Int. Conf. Softw. Quality, Rel. Secur. Companion (QRS-C), Aug.
2016, pp. 101–108.
[26] N. Alhebaishi, L. Wang, S. Jajodia, and A. Singhal, ‘‘Modeling and mitigating the inside
r threat of remote administrators in clouds,’’ in Proc. IFIP Annu. Conf. Data Appl. Secur. Pri
vacy. Bergamo, Italy: Springer, 2018, pp. 3–20.
[27] F. Yuan, Y. Cao, Y. Shang, Y. Liu, J. Tan, and B. Fang, ‘‘Insider threat detection with d
eep neural network,’’ in Proc. Int. Conf. Comput. Sci. Wuxi, China: Springer, 2018, pp. 43–5
4.
[28] I. A. Mohammed, ‘‘Cloud identity and access management—A model proposal,’’ Int. J.
Innov. Eng. Res. Technol., vol. 6, no. 10, pp. 1–8, 2019.
[29] F. M. Okikiola, A. M. Mustapha, A. F. Akinsola, and M. A. Sokunbi, ‘‘A new framewor
k for detecting insider attacks in cloud-based e-health care system,’’ in Proc. Int. Conf. Math.,
Comput. Eng. Comput. Sci. (ICMCECS), Mar. 2020, pp. 1–6.
[30] G. Li, S. X. Wu, S. Zhang, and Q. Li, ‘‘Neural networks-aided insider attack detection fo
r the average consensus algorithm,’’ IEEE Access, vol. 8, pp. 51871–51883, 2020.
[31] A. R. Wani, Q. P. Rana, U. Saxena, and N. Pandey, ‘‘Analysis and detection of DDoS at
tacks on cloud computing environment using machine learning techniques,’’ in Proc. Amity I
nt. Conf. Artif. Intell. (AICAI), Feb. 2019, pp. 870–875.
[32] N. M. Sheykhkanloo and A. Hall, ‘‘Insider threat detection using supervised machine lea
rning algorithms on an extremely imbalanced dataset,’’ Int. J. Cyber Warfare Terrorism, vol.
10, no. 2, pp. 1–26, Apr. 2020.
[33] M. Idhammad, K. Afdel, and M. Belouch, ‘‘Distributed intrusion detection system for cl
oud environments based on data mining techniques,’’ Proc. Comput. Sci., vol. 127, pp. 35–41,
Jan. 2018.
[34] P. Kaur, R. Kumar, and M. Kumar, ‘‘A healthcare monitoring system using random fore
st and Internet of Things (IoT),’’ Multimedia Tools Appl., vol. 78, no. 14, pp. 19905–19916,
2019.
[35] J. L. Leevy, J. Hancock, R. Zuech, and T. M. Khoshgoftaar, ‘‘Detecting cybersecurity att
acks using different network features with LightGBM and XGBoost learners,’’ in Proc. IEEE
2nd Int. Conf. Cognit. Mach. Intell. (CogMI), Oct. 2020, pp. 190–197.
[36] R. A. Alsowail and T. Al-Shehari, ‘‘Techniques and countermeasures for preventing insi
der threats,’’ PeerJ Comput. Sci., vol. 8, p. e938, Apr. 2022.
[37] B. Alouffi, M. Hasnain, A. Alharbi, W. Alosaimi, H. Alyami, and M. Ayaz, ‘‘A systema
tic literature review on cloud computing security: Threats and mitigation strategies,’’ IEEE A
ccess, vol. 9, pp. 57792–57807, 2021.