COMPARATIVE STUDY OF FILELESS MALWARE
DETECTION USING MACHINE LEARNING
A Project report submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by
JAISON.V.R
20BAM027
Under the Guidance of
MR. [Link]
Assistant Professor and Head, Department of AI & ML
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE
LEARNING
SREE SARASWATHI THYAGARAJA COLLEGE
(Autonomous)
An Autonomous, NAAC Re-Accredited with A Grade, ISO 21001:2018 Certified
Institution, Affiliated to Bharathiar University, Coimbatore
Approved by AICTE for MBA/MCA and by UGC for 2(f) & 12(B) status
Pollachi-642 107
CERTIFICATE
This is to certify that the project report entitled COMPARATIVE STUDY OF
FILELESS MALWARE DETECTION USING MACHINE LEARNING submitted
to Sree Saraswathi Thyagaraja College (Autonomous), Pollachi, affiliated to
Bharathiar University, Coimbatore in partial fulfillment of the requirements for the
award of the degree of BACHELOR OF ARTIFICIAL INTELLIGENCE AND
MACHINE LEARNING is a record of original work done by JAISON.V.R under
my supervision and guidance and the report has not previously formed the basis for
the award of any Degree / Diploma / Associateship / Fellowship or other similar title
to any candidate of any University.
Date: 10-11-2022 Guide
Place: Pollachi (Mr. [Link])
Counter Signed by
PC PRINCIPAL
Viva-voce Examination held on -------------------
INTERNAL EXAMINER EXTERNAL EXAMINER
DECLARATION
I, JAISON.V.R hereby declare that the project report entitled COMPARATIVE
STUDY OF FILELESS MALWARE DETECTION USING MACHINE
LEARNING submitted to Sree Saraswathi Thyagaraja College
(Autonomous), Pollachi, affiliated to Bharathiar University, Coimbatore in
partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ARTIFICIAL INTELLIGENCE AND MACHINE
LEARNING is a record of original work done by me under the guidance of
Mr. [Link], Assistant Professor and Head, Department of
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING and it has
not previously formed the basis for the award of any Degree/Diploma
/Associateship /Fellowship or other similar title to any candidate of any
University.
Place: Pollachi
Date:10-11-2022 Signature of the Candidate
ACKNOWLEDGEMENT
I take this opportunity to express my gratitude and sincere thanks to everyone who
helped me in my project.
I wish to express my heartfelt thanks to the Management of Sree Saraswathi
Thyagaraja College for providing me with excellent infrastructure during the course
of study and project.
I wish to express my deep sense of gratitude to Dr. A. SOMU, Principal, Sree
Saraswathi Thyagaraja College for providing me excellent facilities and
encouragement during the course of study and project.
I express my deep sense of gratitude and sincere thanks to my Head of the
Department MRS. GEETHA & my beloved staff members MR. VIVIN JOSE,
MR. [Link] & MRS. [Link], who allowed me to carry out this
project and gave me complete freedom to utilize the resources of the department.
It's my prime duty to solemnly express my deep sense of gratitude and sincere thanks
to the guide Mr. [Link], Assistant Professor and Head, UG
Department of Artificial Intelligence and Machine Learning, for his valuable
advice and excellent guidance to complete the project successfully.
I also convey my heartfelt thanks to my parents, friends and all the staff members of
the Department of ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING for
their valuable support which energized me to complete this project.
PROJECT CONTENT
[Link]
[Link]
1.2 Malware Detection
[Link]
[Link]
[Link]
2.1.1 Drawbacks of Existing System
[Link]
2.2.1 Advantages of Proposed System [Link]
[Link]
[Link]
3.3 XGBOOST
3.4 Random Forest
[Link]
[Link]
[Link]
[Link]
[Link]
[Link]
INTRODUCTION:
Idealistic hackers attacked computers in the early days because they were eager
to prove themselves. Cracking machines, however, is an industry in today's
world. Despite recent improvements in software and computer hardware
security, both in frequency and sophistication, attacks on computer systems have
increased. Regrettably, there are major drawbacks to current methods for
detecting and analyzing unknown code samples. The Internet is a critical part of
our everyday lives today. On the internet, there are many services and they are
rising daily as well. Numerous reports indicate that malware's effect is
worsening at an alarming pace. Although malware diversity is growing,
anti-virus scanners are unable to fulfill security needs, resulting in attacks on
millions of hosts. According to Kaspersky Labs, around 6,563,145 different hosts
were targeted, and 4,000,000 unique malware artifacts were found in 2015.
Juniper Research (2016), in particular, projected that by 2019 the cost of data
breaches would rise to $2.1 trillion globally. Current studies show that more and
more attacks are either generated by script kiddies or are automated. To date,
attacks on commercial and government organizations, such as ransomware and
malware, continue to pose a significant threat and challenge. Such attacks can
come in various ways and sizes. An enormous challenge is the ability of the
global security community to develop and provide expertise in cybersecurity.
There is widespread awareness of the global scarcity of cybersecurity and talent.
Cybercrimes, such as financial fraud, child exploitation online and payment
fraud, are so common that they demand international 24-hour response and
collaboration between multinational law enforcement agencies. For single users
and organizations, malware defense of computer systems is therefore one of the
most critical cybersecurity activities, as even a single attack may result in
compromised data and sufficient losses.
Malware attacks have been one of the most serious cyber risks faced by different
countries. The number of vulnerabilities reporting and malware is also
increasing rapidly. Researchers have received tremendous attention in the study
of malware behaviors. There are several factors that lead to the development of
malware attacks. Malware authors create and deploy malware that can
mutate and take different forms, such as ransomware and fileless malware,
in order to avoid detection. It is difficult to detect
the malware and cyber attacks using the traditional cyber security procedures.
Solutions for the new generation cyber attacks rely on various Machine learning
techniques.
EVOLUTION OF MALWARE
In order to protect networks and computer systems from attacks, the diversity,
sophistication and availability of malicious software present enormous
challenges. Malware is continually changing and challenges security researchers
and scientists to strengthen their cyber defenses to keep pace. Owing to the use
of polymorphic and metamorphic methods used to avoid detection and conceal
its true intent, the prevalence of malware has increased. To mutate the code
while keeping the original functionality intact, polymorphic malware uses a
polymorphic engine. The two most common ways to conceal code are packing
and encryption. Through one or more layers of compression, packers cover a
program's real code. Then the unpacking routines restore the original code and
execute it in memory at runtime. To make it harder for researchers to analyze the
software, crypters encrypt and manipulate malware or part of its code. A crypter
includes a stub that is used for malicious code encryption and decryption.
Whenever it's propagated, metamorphic malware rewrites the code to an
equivalent. Multiple transformation techniques, including but not limited to,
register renaming, code permutation, code expansion, code shrinking and
insertion of garbage code, can be used by malware authors. The combination of
the above techniques has resulted in rapidly increasing quantities of malware,
making forensic investigations of malware cases time-consuming, expensive and
more complicated. There are some issues with conventional
antivirus solutions that rely on signature-based and heuristic/behavioral
methods. A signature is a unique feature or collection of features that, like a
fingerprint, uniquely identifies an executable. Signature-based approaches,
however, are unable to identify unknown types of malware. Security researchers
suggested behavior-based detection to overcome these problems, which analyzes
the features and behavior of the file to decide whether it is indeed malware,
although it may take some time to search and evaluate. Researchers have begun
implementing machine learning to supplement their solutions in order to solve
the previous drawbacks of conventional antivirus engines and keep pace with
new attacks and variants, as machine learning is well suited for processing large
quantities of data.
1. MALWARE DETECTION
Hackers present malware in ways designed to persuade people to
install it. Because it appears legitimate, users do not know what the
program really is. Usually, we install it thinking that it is safe, but on the
contrary, it is a major threat. That is how malware gets into a system.
Once on the machine, it disperses and hides in numerous files,
making it very difficult to identify. In order to access and record
personal or useful information, it may connect directly to the operating
system and start encrypting it. Malware detection is defined as the
process of searching for malware files and directories. There are several
tools and methods available to detect malware efficiently and
reliably. Some of the general strategies for malware detection are:
○ Signature-based
○ Heuristic Analysis
○ Anti-malware Software
○ Sandbox
Several classifiers have been implemented,
such as linear classifiers (logistic regression, the naive
Bayes classifier), support vector machines, neural
networks, random forests, etc. Through both static and
dynamic analysis, malware can be identified by:
○ Static analysis (without executing the code)
○ Dynamic (behavioural) analysis
2. NEED FOR MACHINE LEARNING IN MALWARE
DETECTION
Machine learning has created a drastic change in many industries,
including cybersecurity, over the last decade. Among cybersecurity
experts, there is a general belief that AI-powered anti-malware tools
can help detect modern malware attacks and boost scanning engines.
Proof of this belief is the number of studies on malware detection
strategies that exploit machine learning reported in the last few years.
According to Google Scholar, the number of research papers released in
2018 was 7,720, a 95 percent rise over 2015 and a 476 percent increase
over 2010. This rise in the number of studies is the product of
several factors, including but not limited to the increase in publicly
labeled malware feeds, the increase in computing capacity at the same
time as its price decrease, and the evolution of the field of machine
learning, which has achieved ground-breaking success in a wide range
of tasks such as computer vision and speech recognition. Depending
on the type of analysis, conventional machine learning methods can be
categorized into two main categories, static and dynamic approaches.
The primary difference between them is that static methods extract
features from the static malware analysis, while dynamic methods
extract features from the dynamic analysis. A third category may be
considered, known as hybrid approaches. Hybrid methods incorporate
elements of both static and dynamic analysis. In addition, neural
networks have outshone other approaches at learning features from raw
inputs in diverse fields. Recent developments in machine learning for
cybersecurity mirror this performance of neural networks in the malware
domain.
Brief:
Malware, short for malicious software, consists of programming (code, scripts,
active content, and other software) designed to disrupt or deny operation, gather
information that leads to loss of privacy or exploitation, gain unauthorized
access to system resources, and other abusive behavior. It is a general term used
to define a variety of forms of hostile, intrusive, or annoying software or
program code. Software is considered to be malware based on the perceived
intent of the creator rather than any particular features. Malware includes
computer viruses, worms, Trojan horses, spyware, dishonest adware,
crime-ware, most rootkits, and other malicious and unwanted software or
programs.
In 2008, Symantec published a report that "the release rate of malicious code
and other unwanted programs may be exceeding that of legitimate software
applications.” According to F-Secure, "As much malware was produced in 2007
as in the previous 20 years altogether.”
Since the rise of widespread Internet access, malicious software has been
designed for a profit, for example forced advertising. For instance, since 2003,
the majority of widespread viruses and worms have been designed to take
control of users' computers for black-market exploitation. Another category of
malware is spyware: programs designed to monitor users' web browsing and
steal private information. Spyware programs do not spread like viruses; instead,
they are installed by exploiting security holes or are packaged with user-installed
software, such as peer-to-peer applications.
Clearly, there is a very urgent need to find, not just a suitable method to detect
infected files, but to build a smart engine that can detect new viruses by
studying the structure of system calls made by malware.
2. Current Antivirus Software
Antivirus software is used to prevent, detect, and remove malware, including
but not limited to computer viruses, computer worms, Trojan horses, spyware
and adware. A variety of strategies are typically employed by the antivirus
engines. Signature-based detection involves searching for known patterns of
data within executable code. However, it is possible for a computer to be
infected with a new virus for which no signatures exist. To counter such
“zero-day” threats, heuristics can be used to identify new viruses or variants of
existing viruses by looking for known malicious code. Some antivirus can also
make predictions by executing files in a sandbox and analyzing results.
Often, antivirus software can impair a computer's performance. Any incorrect
decision may lead to a security breach, since it runs at the highly trusted kernel
level of the operating system. If the antivirus software employs heuristic
detection, success depends on achieving the right balance between false
positives and false negatives. Today, malware may no longer be executable
files. Powerful macros in Microsoft Word could also present a security risk.
Traditionally, antivirus software heavily relied upon signatures to identify
malware. However, because of newer kinds of malware, signature-based
approaches are no longer effective.
Although standard antivirus can effectively contain virus outbreaks, for large
enterprises, any breach could be potentially fatal. Virus makers are employing
"oligomorphic", "polymorphic" and, "metamorphic" viruses, which encrypt
parts of themselves or modify themselves as a method of disguise, so as to not
match virus signatures in the dictionary.
Studies in 2007 showed that the effectiveness of antivirus software had
decreased drastically, particularly against unknown or zero day attacks.
Detection rates have dropped from 40-50% in 2006 to 20-30% in 2007. The
problem is magnified by the changing intent of virus makers. Independent
testing on all the major virus scanners consistently shows that none provide
100% virus detection.
The work can be described as follows:
● Describing the details: The dataset is imported and the different
columns in the dataset are discussed.
● Data cleaning: After examining the dataset, the required steps are
taken so that the dataset is cleaned; all null values and columns of
little significance are removed so that they will not be of any
concern in the training part.
● Data training and testing: When the data is clean and ready for
training, we split it into a training dataset and a testing dataset in
an 80:20 ratio. (A minimal sketch of this step is shown after this
list.)
In this paper, as we try to achieve the highest accuracy, we apply
several algorithms to see which gives better precision.
● Applying different algorithms [ML algorithms]
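A minimal sketch of the cleaning and 80:20 split step described above, assuming a pandas DataFrame loaded from the PE-header CSV used later in this report (the column names are taken from that dataset; everything else is illustrative, not the exact project code):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset and drop rows/columns that do not help training
dataset = pd.read_csv('Malware_Detection_data.csv', sep='|', low_memory=False)
dataset = dataset.dropna()                               # remove null values
X = dataset.drop(['Name', 'md5', 'legitimate'], axis=1)  # identifier columns carry no signal
y = dataset['legitimate']

# Split the cleaned data into training and testing sets in an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)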
MAIN ALGORITHMS APPLIED:
1. DECISION TREE
2. RANDOM FOREST
3. SVM
4. XGBOOST
DECISION TREE:
The decision tree Algorithm belongs to the family of supervised
machine learning algorithms. It can be used for both a
classification problem as well as for a regression problem.
The goal of this algorithm is to create a model that predicts the
value of a target variable, for which the decision tree uses the
tree representation to solve the problem in which the leaf node
corresponds to a class label and attributes are represented on
the internal node of the tree.
Let’s take a sample data set to move further ….
Suppose we have a data set of 14 patients and we have to
predict which drug, A or B, to suggest to a patient.
Let’s say we pick cholesterol as the first attribute to split data
It will split our data into two branches High and Normal based
on cholesterol, as you can see in the above figure.
Let’s suppose our new patient has high cholesterol. By the
above split of our data we cannot say whether Drug B or Drug
A will be suitable for the patient.
Also, if the patient’s cholesterol is normal we still do not have
enough information to determine whether Drug A or Drug B
is suitable for the patient.
Let us take another attribute, Age. As we can see, age has three
categories in it: young, middle-aged and senior. Let’s try to split on it.
From the above figure, we can now say that we can easily
predict which drug to give to a patient based on his or
her reports.
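As a small illustration of the splits discussed above, here is a hedged scikit-learn sketch on a made-up version of the patient data set; the age/cholesterol values and drug labels below are invented for illustration only:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: age group, cholesterol level and the drug that worked (illustrative values)
patients = pd.DataFrame({
    'Age':         ['young', 'young', 'middle', 'senior', 'senior', 'middle', 'young', 'senior'],
    'Cholesterol': ['high', 'normal', 'high', 'normal', 'high', 'normal', 'high', 'high'],
    'Drug':        ['A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
})

# The tree needs numeric inputs, so one-hot encode the categorical features
X = pd.get_dummies(patients[['Age', 'Cholesterol']])
y = patients['Drug']

clf = DecisionTreeClassifier(criterion='entropy')   # split quality measured via entropy/information gain
clf.fit(X, y)

# Predict the drug for a new senior patient with high cholesterol
new_patient = pd.get_dummies(pd.DataFrame({'Age': ['senior'], 'Cholesterol': ['high']}))
new_patient = new_patient.reindex(columns=X.columns, fill_value=0)
print(clf.predict(new_patient))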
Assumptions that we make while using the decision tree:
– In the beginning, we consider the whole training set as the
root.
– Feature values are preferred to be categorical; if the values are
continuous, they are discretized before building the model.
– Records are distributed recursively on the basis of attribute values.
– We use a statistical method for ordering attributes as the root
node or an internal node.
Mathematics behind the decision tree algorithm: before coming to
information gain, we first have to understand entropy.
Entropy: Entropy is a measure of impurity, disorder,
or uncertainty in a bunch of examples.
Purpose of entropy:
Entropy controls how a decision tree decides to split the
data. It affects how a decision tree draws its boundaries.
For a two-class problem, entropy values range from 0 to 1; the lower
the entropy, the more trustworthy the split.
Suppose we have features F1, F2 and F3, and we selected the F1
feature as our root node.
F1 contains 9 yes labels and 5 no labels; after splitting on F1
we get F2, which has 6 yes / 2 no, and F3, which has 3 yes / 3
no.
Now if we try to calculate the entropy of F2 using the
entropy formula
Entropy(S) = −p(yes) log2 p(yes) − p(no) log2 p(no)
and putting in the values:
Entropy(F2) = −(6/8) log2(6/8) − (2/8) log2(2/8) ≈ 0.81
Here, 6 is the number of yes labels (taken as positive when
calculating the probability) and 8 is the total number of rows in
F2.
Similarly, if we compute the entropy for F3 we get 1 bit, the worst
case for an attribute, since it contains 50% yes and 50% no.
This splitting will be going on unless and until we get a pure
subset.
What is a pure subset?
A pure subset is a situation where we get either all yes or
all no labels.
We have performed this with respect to one node. After
splitting F2 we may also require some other attribute to reach
the leaf node, and we then have to take the entropy of those
values and sum them up; for that we have the concept of
information gain.
Information Gain: Information gain is used to decide which
feature to split on at each step in building the tree. Simplicity is
best, so we want to keep our tree small. To do so, at each step
we should choose the split that results in the purest daughter
nodes. A commonly used measure of purity is called
information.
For each node of the tree, the information value measures how
much information a feature gives us about the class. The split
with the highest information gain will be taken as the first split
and the process will continue until all children nodes are pure,
or until the information gain is 0.
The algorithm calculates the information gain for each split and
the split which is giving the highest value of information gain is
selected.
We can say that in information gain we compute the weighted
average of the entropies after the specific split and subtract it from
the entropy before the split:
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
where Sv is a subset obtained after the split (for F2, |Sv| = 8 rows) and
S is the total sample before the split (F1 = 9 + 5 = 14).
Now calculating the information gain for this split:
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94
Gain = 0.94 − (8/14)(0.81) − (6/14)(1.0) ≈ 0.05
Like this, the algorithm will perform this for n number of splits,
and the information gain for whichever split is higher it is going
to take it in order to construct the decision tree.
The higher the value of information gain of the split the higher
the chance of it getting selected for the particular split.
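The entropy and information-gain arithmetic above can be checked with a few lines of Python; this is a sketch using the 9-yes/5-no example (the helper function name is mine, not from the report):

import math

def entropy(pos, neg):
    # Entropy of a node containing pos 'yes' and neg 'no' examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

# Parent node F1: 9 yes / 5 no; children after the split: F2 = 6 yes / 2 no, F3 = 3 yes / 3 no
e_parent = entropy(9, 5)                          # about 0.94
e_f2, e_f3 = entropy(6, 2), entropy(3, 3)         # about 0.81 and exactly 1.0
gain = e_parent - (8 / 14) * e_f2 - (6 / 14) * e_f3
print(round(e_parent, 2), round(e_f2, 2), round(e_f3, 2), round(gain, 2))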
Gini Impurity:
Gini Impurity is a measurement used to build Decision Trees to
determine how the features of a data set should split nodes to
form the tree. More precisely, the Gini impurity of a data set is
a number between 0 and 0.5, which indicates the likelihood of new,
random data being misclassified if it were given a random
class label according to the class distribution in the data
set.
Entropy vs Gini Impurity
The maximum value for entropy is 1 whereas the maximum
value for Gini impurity is 0.5.
As the Gini impurity does not contain any logarithmic function
to calculate, it takes less computational time as compared to
entropy.
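For comparison, a short sketch of the Gini impurity calculation for the same kind of two-class node (the function name is illustrative):

def gini_impurity(pos, neg):
    # Gini impurity of a node with pos 'yes' and neg 'no' examples (0 = pure, 0.5 = worst case)
    total = pos + neg
    p_pos, p_neg = pos / total, neg / total
    return 1.0 - (p_pos ** 2 + p_neg ** 2)

print(gini_impurity(6, 2))   # mixed node, 0.375
print(gini_impurity(3, 3))   # evenly mixed node, 0.5 (the maximum for two classes)
print(gini_impurity(8, 0))   # pure node, 0.0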
2. SVM Algorithm
“Support Vector Machine” (SVM) is a supervised machine
learning algorithm that can be used for both classification and
regression challenges. However, it is mostly used in
classification problems. In the SVM algorithm, we plot each
data item as a point in n-dimensional space (where n is a
number of features you have) with the value of each feature
being the value of a particular coordinate. Then, we perform
classification by finding the hyper-plane that differentiates the
two classes very well (look at the below snapshot).
Support Vectors are simply the coordinates of individual
observation. The SVM classifier is a frontier that best
segregates the two classes (hyper-plane/ line).
How does it work?
Above, we got accustomed to the process of segregating the
two classes with a hyper-plane. Now the burning question is
“How can we identify the right hyper-plane?”. Don’t worry, it’s
not as hard as you think!
Let’s understand:
● Identify the right hyper-plane (Scenario-1): Here, we have
three hyper-planes (A, B, and C). Now, identify the right
hyper-plane to classify stars and circles.
You need to
remember a thumb rule to identify the right hyper-plane:
“Select the hyper-plane which segregates the two classes
better”. In this scenario, hyper-plane “B” has excellently
performed this job.
● Identify the right hyper-plane (Scenario-2): Here, we have
three hyper-planes (A, B, and C) and all are segregating
the classes well. Now, How can we identify the right
hyper-plane?
Here, maximizing the
distances between nearest data point (either class) and
hyper-plane will help us to decide the right hyper-plane.
This distance is called Margin. Let’s look at the below
snapshot: Above,
you can see that the margin for hyper-plane C is high as
compared to both A and B. Hence, we name the right
hyper-plane as C. Another compelling reason for selecting
the hyper-plane with the higher margin is robustness. If we
select a hyper-plane having a low margin then there is a high
chance of misclassification.
● Identify the right hyper-plane (Scenario-3):Hint: Use the
rules as discussed in previous section to identify the right
hyper-plane
Some of you may have
selected the hyper-plane B as it has higher margin compared
to
A. But, here is the catch, SVM selects the hyper-plane which
classifies the classes accurately prior to maximizing margin.
Here, hyper-plane B has a classification error and A has
classified all correctly. Therefore, the right hyper-plane is A.
● Can we classify two classes (Scenario-4)?: Below, I am
unable to segregate the two classes using a straight line,
as one of the stars lies in the territory of other(circle) class
as an outlier.
As I have
already mentioned, one star at other end is like an outlier
for star class. The SVM algorithm has a feature to ignore
outliers and find the hyper-plane that has the maximum
margin. Hence, we can say, SVM classification is robust to
outliers.
● Find the hyper-plane to segregate two classes (Scenario-5):
In the scenario below, we can’t have a linear hyper-plane
between the two classes, so how does SVM classify these
two classes? Till now, we have only looked at the linear
hyper-plane.
SVM can solve this
problem easily! It does so by introducing an
additional feature. Here, we will add a new feature
z=x^2+y^2. Now, let’s plot the data points on the x and
z axes:
In above plot, points to consider are:
○ All values for z would be positive always because z is
the squared sum of both x and y
○ In the original plot, red circles appear close to the
origin of x and y axes, leading to lower value of z and
star relatively away from the origin result to higher
value of z.
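A minimal scikit-learn sketch of the idea in Scenario-5: points that are not linearly separable in (x, y) become separable once z = x^2 + y^2 is added, which an RBF-kernel SVM does implicitly. The data below is synthetic and only for illustration:

import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: circles near the origin (class 0) and stars farther away (class 1)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),    # inner ring -> class 0
                        rng.uniform(2.0, 3.0, 100)])   # outer ring -> class 1
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Option 1: add the z = x^2 + y^2 feature explicitly and use a linear SVM
z = (X ** 2).sum(axis=1).reshape(-1, 1)
linear_svm = SVC(kernel='linear').fit(np.hstack([X, z]), y)

# Option 2: let the RBF kernel perform the non-linear mapping implicitly
rbf_svm = SVC(kernel='rbf').fit(X, y)

print(linear_svm.score(np.hstack([X, z]), y), rbf_svm.score(X, y))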
3. XGBOOST
Ever since its introduction in 2014, XGBoost has been lauded as
the holy grail of machine learning hackathons and
competitions. From predicting ad click-through rates to
classifying high energy physics events, XGBoost has proved its
mettle in terms of performance – and speed.
I always turn to XGBoost as my first algorithm of choice in any
ML hackathon. The accuracy it consistently gives, and the time
it saves, demonstrates how useful it is. But how does it actually
work? What kind of mathematics power XGBoost? We’ll figure
out the answers to these questions soon.
Tianqi Chen, one of the co-creators of XGBoost, announced (in
2016) that the innovative system features and algorithmic
optimizations in XGBoost have rendered it 10 times faster than
the most sought-after machine learning solutions. A truly amazing
technique!
In this article, we will first look at the power of XGBoost, and
then deep dive into the inner workings of this popular and
powerful technique. It’s good to be able to implement it in
Python or R, but understanding the nitty-gritty of the
algorithm will help you become a better data scientist.
The Power of XGBoost
The beauty of this powerful algorithm lies in its scalability,
which drives fast learning through parallel and distributed
computing and offers efficient memory usage.
It’s no wonder then that CERN recognized it as the best
approach to classify signals from the Large Hadron Collider. This
particular challenge posed by CERN required a solution that
would be scalable to process data being generated at the rate
of 3 petabytes per year and effectively distinguish an extremely
rare signal from background noises in a complex physical
process. XGBoost emerged as the most useful, straightforward
and robust solution.
Now, let’s deep dive into the inner workings of XGBoost.
Why ensemble learning?
XGBoost is an ensemble learning method. Sometimes, it may
not be sufficient to rely upon the results of just one machine
learning model. Ensemble learning offers a systematic solution
to combine the predictive power of multiple learners. The
resultant is a single model which gives the aggregated output
from several models.
The models that form the ensemble, also known as base
learners, could be either from the same learning algorithm or
different learning algorithms. Bagging and boosting are two
widely used ensemble learners. Though these two techniques
can be used with several statistical models, the most
predominant usage has been with decision trees.
Let’s briefly discuss bagging before taking a more detailed look
at the concept of boosting.
Bagging
While decision trees are one of the most easily interpretable
models, they exhibit highly variable behavior. Consider a single
training dataset that we randomly split into two parts. Now,
let’s use each part to train a decision tree in order to obtain two
models.
When we fit both these models, they would yield different
results. Decision trees are said to be associated with high
variance due to this behavior. Bagging or boosting aggregation
helps to reduce the variance in any learner. Several decision
trees which are generated in parallel, form the base learners of
bagging technique. Data sampled with replacement is fed to
these learners for training. The final prediction is the averaged
output from all the learners.
Boosting
In boosting, the trees are built sequentially such that each
subsequent tree aims to reduce the errors of the previous tree.
Each tree learns from its predecessors and updates the residual
errors. Hence, the tree that grows next in the sequence will
learn from an updated version of the residuals.
The base learners in boosting are weak learners in which the
bias is high, and the predictive power is just a tad better than
random guessing. Each of these weak learners contributes
some vital information for prediction, enabling the boosting
technique to produce a strong learner by effectively combining
these weak learners. The final strong learner brings down both
the bias and the variance.
In contrast to bagging techniques like Random Forest, in which
trees are grown to their maximum extent, boosting makes use
of trees with fewer splits. Such small trees, which are not very
deep, are highly interpretable. Parameters like the number of
trees or iterations, the rate at which the gradient boosting
learns, and the depth of the tree, could be optimally selected
through validation techniques like k-fold cross validation.
Having a large number of trees might lead to overfitting. So, it
is necessary to carefully choose the stopping criteria for
boosting.
The boosting ensemble technique consists of three simple
steps:
● An initial model F0 is defined to predict the target variable
y. This model will be associated with a residual (y – F0).
● A new model h1 is fit to the residuals from the previous step.
● Now, F0 and h1 are combined to give F1, the boosted
version of F0. The mean squared error from F1 will be
lower than that from F0:
F1(x) = F0(x) + h1(x)
To improve the performance of F1, we could model the
residuals of F1 and create a new model F2:
F2(x) = F1(x) + h2(x)
This can be done for ‘m’ iterations, until the residuals have been
minimized as much as possible:
Fm(x) = Fm-1(x) + hm(x)
Here, the additive learners do not disturb the functions created
in the previous steps. Instead, they impart information of their
own to bring down the errors.
Demonstrating the Potential of Boosting
Consider the following data where the years of experience is
predictor variable and salary (in thousand dollars) is the target.
Using regression trees as base learners, we can create an
ensemble model to predict the salary. For the sake of simplicity,
we can choose square loss as our loss function and our
objective would be to minimize the square error.
As the first step, the model should be initialized with a function
F0(x). F0(x) should be a function which minimizes the loss
function or MSE (mean squared error), in this case:
F0(x) = argmin_γ Σi (yi − γ)^2
Taking the first derivative of the above expression with respect
to γ, it is seen that the function is minimized at the mean
(1/n) Σ(i=1..n) yi. So, the boosting model could be initialized with
F0(x) = ȳ, the mean of the observed targets.
F0(x) gives the predictions from the first stage of our model.
Now, the residual error for each instance is (yi – F0(x)).
We can use the residuals from F0(x) to create h1(x). h1(x) will
be a regression tree which will try and reduce the residuals
from the previous step. The output of h1(x) won’t be a
prediction of y; instead, it will help in predicting the successive
function F1(x) which will bring down the residuals.
The additive model h1(x) computes the mean of the residuals
(y – F0) at each leaf of the tree. The boosted function F1(x) is
obtained by summing F0(x) and h1(x). This way h1(x) learns
from the residuals of F0(x) and suppresses it in F1(x).
This can be repeated for 2 more iterations to compute h2(x)
and h3(x). Each of these additive learners, hm(x), will make
use of the residuals from the preceding function, Fm-1(x).
The MSEs for F0(x), F1(x) and F2(x) are 875, 692 and 540. It’s
amazing how these simple weak learners can bring about a
huge reduction in error!
Note that each learner, hm(x), is trained on the residuals. All
the additive learners in boosting are modeled after the residual
errors at each step. Intuitively, it could be observed that the
boosting learners make use of the patterns in residual errors.
At the stage where maximum accuracy is reached by boosting,
the residuals appear to be randomly distributed without any
pattern.
Plots of Fn and hn
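A hedged sketch of the demonstration above: hand-rolled boosting with shallow regression trees on a small, made-up experience/salary table (the numbers below are illustrative, not the actual data from the figure):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: years of experience vs salary (in thousand dollars)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([40., 45., 52., 60., 66., 74., 83., 90.])

# F0(x): the constant that minimizes squared error, i.e. the mean of y
F = np.full_like(y, y.mean())
print('MSE of F0:', np.mean((y - F) ** 2))

# Each new tree h_m(x) is fit on the residuals of the current model
for m in range(1, 4):
    residuals = y - F
    h = DecisionTreeRegressor(max_depth=1).fit(X, residuals)   # a weak learner (a stump)
    F = F + h.predict(X)                                       # boosted model F_m = F_{m-1} + h_m
    print(f'MSE of F{m}:', np.mean((y - F) ** 2))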
Using gradient descent for optimizing the loss function
In the case discussed above, MSE was the loss function. The
mean minimized the error here. When MAE (mean absolute
error) is the loss function, the median would be used as F0(x)
to initialize the model. A unit change in y would cause a unit
change in MAE as well, whereas for MSE the change observed
would be quadratic, since the error is squared.
Instead of fitting hm(x) on the residuals, fitting it on the
gradient of loss function, or the step along which loss occurs,
would make this process generic and applicable across all loss
functions.
Gradient descent helps us minimize any differentiable function.
Earlier, the regression tree for hm(x) predicted the mean
residual at each terminal node of the tree. In gradient boosting,
the average gradient component would be computed.
For each node, there is a factor γ with which hm(x) is
multiplied. This accounts for the difference in impact of each
branch of the split. Gradient boosting helps in predicting the
optimal gradient for the additive model, unlike classical
gradient descent techniques which reduce error in the output at
each iteration.
The following steps are involved in gradient boosting:
● F0(x) – with which we initialize the boosting algorithm – is
to be defined:
F0(x) = argmin_γ Σi L(yi, γ)
● The gradient of the loss function is computed iteratively:
rim = −[∂L(yi, F(xi)) / ∂F(xi)], evaluated at F = Fm-1
● Each hm(x) is fit on the gradient obtained at each step.
● The multiplicative factor γm for each terminal node is
derived and the boosted model Fm(x) is defined:
Fm(x) = Fm-1(x) + γm hm(x)
(A short sketch of this generic loop follows.)
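The same loop written against the gradient of the loss instead of the raw residuals, which is what makes the procedure generic across loss functions; a sketch assuming squared or absolute loss (function and variable names are mine):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def negative_gradient(y, F, loss):
    # Pseudo-residuals: -dL/dF for the chosen loss
    if loss == 'mse':   # L = (y - F)^2 / 2  ->  negative gradient is the residual y - F
        return y - F
    if loss == 'mae':   # L = |y - F|        ->  negative gradient is sign(y - F)
        return np.sign(y - F)
    raise ValueError(loss)

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([40., 45., 52., 60., 66., 74., 83., 90.])

loss = 'mae'
F = np.full_like(y, np.median(y) if loss == 'mae' else y.mean())  # F0: median for MAE, mean for MSE
learning_rate = 0.5
for m in range(1, 6):
    h = DecisionTreeRegressor(max_depth=1).fit(X, negative_gradient(y, F, loss))
    F = F + learning_rate * h.predict(X)        # F_m = F_{m-1} + gamma * h_m
print('final MAE:', np.mean(np.abs(y - F)))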
Unique features of XGBoost
XGBoost is a popular implementation of gradient boosting. Let’s
discuss some features of XGBoost that make it so interesting.
● Regularization: XGBoost has an option to penalize
complex models through both L1 and L2
regularization. Regularization helps in preventing
overfitting
● Handling sparse data: Missing values or data processing
steps like one-hot encoding make data sparse. XGBoost
incorporates a sparsity-aware split finding algorithm to
handle different types of sparsity patterns in the data
● Weighted quantile sketch: Most existing tree based
algorithms can find the split points when the data points
are of equal weights (using quantile sketch algorithm).
However, they are not equipped to handle weighted data.
XGBoost has a distributed weighted quantile sketch
algorithm to effectively handle weighted data
● Block structure for parallel learning: For faster computing,
XGBoost can make use of multiple cores on the CPU. This
is possible because of a block structure in its system
design. Data is sorted and stored in in-memory units
called blocks. Unlike other algorithms, this enables the
data layout to be reused by subsequent iterations, instead
of computing it again. This feature also serves useful for
steps like split finding and column sub-sampling
● Cache awareness: In XGBoost, non-continuous memory
access is required to get the gradient statistics by row
index. Hence, XGBoost has been designed to make
optimal use of hardware. This is done by allocating
internal buffers in each thread, where the gradient
statistics can be stored
● Out-of-core computing: This feature optimizes the
available disk space and maximizes its usage when
handling huge datasets that do not fit into memory
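A short usage sketch of the xgboost library on the same kind of PE-header feature matrix prepared later in this report (this assumes the xgboost package is installed; the parameter values are illustrative, not tuned results):

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

dataset = pd.read_csv('Malware_Detection_data.csv', sep='|', low_memory=False).dropna()
X = dataset.drop(['Name', 'md5', 'legitimate'], axis=1)
y = dataset['legitimate'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_estimators, max_depth, learning_rate and the regularization terms are the usual knobs to tune
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1,
                      reg_lambda=1.0, n_jobs=-1)
model.fit(X_train, y_train)
print('XGBoost accuracy:', model.score(X_test, y_test))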
4. RANDOM FOREST
Random forest is a Supervised Machine Learning Algorithm
that is used widely in Classification and Regression problems.
It
builds decision trees on different samples and takes
their majority vote for classification and average in case
of regression.
One of the most important features of the Random Forest
Algorithm is that it can handle the data set containing
continuous variables as in the case of regression and
categorical variables as in the case of classification. It gives
better results for classification problems.
Real Life Analogy
Let’s dive into a real-life analogy to understand this concept
further. A student named X wants to choose a course after his
10+2, and he is confused about the choice of course based on
his skill set. So he decides to consult various people like his
cousins, teachers, parents, degree students, and working
people. He asks them varied questions, such as why he should
choose it, the job opportunities with that course, the course fee, etc.
Finally, after consulting various people about the course, he
decides to take the course suggested by most of the people.
Working of Random Forest Algorithm
Before understanding the working of the random forest we
must look into the ensemble technique. Ensemble simply
means combining multiple models. Thus a collection of models
is used to make predictions rather than an individual model.
Ensemble uses two types of methods:
1. Bagging– It creates a different training subset from sample
training data with replacement & the final output is based on
majority voting. For example, Random Forest.
2. Boosting– It combines weak learners into strong learners by
creating sequential models such that the final model has the
highest accuracy. For example, ADA BOOST, XG BOOST
As mentioned earlier, Random forest works on the Bagging
principle. Now let’s dive in and understand bagging in
detail.
Bagging
Bagging, also known as Bootstrap Aggregation is the ensemble
technique used by random forest. Bagging chooses a random
sample from the data set. Hence each model is generated from
the samples (Bootstrap Samples) provided by the Original Data
with replacement known as row sampling. This step of row
sampling with replacement is called bootstrap. Now each
model is trained independently which generates results. The
final output is based on majority voting after combining the
results of all models. This step which involves combining all the
results
and generating output based on majority voting is known as
aggregation.
Now let’s look at an example by breaking it down with the help
of the following figure. Here the bootstrap samples are taken from
the actual data (Bootstrap sample 01, Bootstrap sample 02, and
Bootstrap sample 03) with replacement, which means there is
a high possibility that each sample won’t contain unique data.
Now the models (Model 01, Model 02, and Model 03) obtained
from these bootstrap samples are trained independently. Each
model generates results as shown. Now the Happy emoji has
a majority when compared to the Sad emoji. Thus, based on
majority voting, the final output is obtained as the Happy emoji.
(A compact sketch of this bagging procedure follows.)
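A compact sketch of the bagging procedure just described: bootstrap samples drawn with replacement, one tree per sample, and a majority vote at the end (all names and the tiny data set are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_models=3, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        # Row sampling with replacement: a bootstrap sample may repeat rows
        idx = rng.integers(0, len(X_train), size=len(X_train))
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.array(votes).astype(int)
    # Aggregation: majority vote across the independently trained models
    return np.array([np.bincount(col).argmax() for col in votes.T])

X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(bagging_predict(X, y, np.array([[1], [4]])))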
Steps involved in random forest algorithm:
Step 1: In Random forest n number of random records are
taken from the data set having k number of records.
Step 2: Individual decision trees are constructed for each
sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or
Averaging for Classification and regression respectively.
For example: consider the fruit basket as the data as shown in
the figure below. Now n number of samples are taken from the
fruit basket and an individual decision tree is constructed for
each sample. Each decision tree will generate an output as
shown in the figure. The final output is considered based on
majority voting. In the below figure you can see that the
majority decision tree gives output as an apple when compared
to a banana, so the final output is taken as an apple.
Important Features of Random Forest
1. Diversity- Not all attributes/variables/features are
considered while making an individual tree, each tree is
different.
2. Immune to the curse of dimensionality- Since each tree does
not consider all the features, the feature space is reduced.
3. Parallelization-Each tree is created independently out of
different data and attributes. This means that we can make full
use of the CPU to build random forests.
4. Train-Test split- In a random forest we don’t have to
segregate the data for train and test as there will always be
30% of the data which is not seen by the decision tree.
5. Stability- Stability arises because the result is based on
majority voting/ averaging.
Difference Between Decision Tree & Random Forest
A random forest is a collection of decision trees; still, there are a
lot of differences in their behavior.
Decision trees | Random forest
1. Decision trees normally suffer from the problem of overfitting if allowed to grow without any control. | 1. Random forests are created from subsets of data and the final output is based on average or majority ranking, so the problem of overfitting is taken care of.
2. A single decision tree is faster in computation. | 2. A random forest is comparatively slower.
3. When a data set with features is taken as input by a decision tree, it formulates some set of rules to do prediction. | 3. A random forest randomly selects observations, builds decision trees and takes the average result. It doesn’t use any set of formulas.
Thus random forests are much more successful than decision
trees only if the trees are diverse and acceptable.
Important Hyperparameters
Hyperparameters are used in random forests to either enhance
the performance and predictive power of models or to make
the model faster.
The following hyperparameters increase the predictive power:
1. n_estimators – the number of trees the algorithm builds
before averaging the predictions.
2. max_features – the maximum number of features random forest
considers when splitting a node.
3. min_samples_leaf – the minimum number of samples
required to be at a leaf node.
The following hyperparameters increase the speed:
1. n_jobs – tells the engine how many processors it is allowed
to use. If the value is 1, it can use only one processor, but if the
value is -1 there is no limit.
2. random_state – controls the randomness of the sampling. The
model will always produce the same results if it has a definite
value of random_state and if it has been given the same
hyperparameters and the same training data.
3. oob_score – OOB means out of the bag. It is a random
forest cross-validation method. In it, roughly one-third of the
samples are not used to train each tree and are instead used to
evaluate its performance. These samples are called out-of-bag samples.
(A brief sketch showing these hyperparameters follows.)
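A brief sketch showing where the hyperparameters above plug into scikit-learn's RandomForestClassifier (the values are illustrative defaults, not tuned results from this report):

from sklearn.ensemble import RandomForestClassifier

# Predictive-power knobs: n_estimators, max_features, min_samples_leaf
# Speed/reproducibility knobs: n_jobs, random_state, oob_score
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees built before averaging/voting
    max_features='sqrt',   # features considered when splitting a node
    min_samples_leaf=1,    # minimum samples allowed in a leaf
    n_jobs=-1,             # use all available processors
    random_state=42,       # fixed seed for reproducible results
    oob_score=True,        # evaluate on the out-of-bag samples
)
# After forest.fit(X_train, y_train), the OOB estimate is available as forest.oob_score_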
SOURCE CODE
In [ ]: import os
In [ ]: !pip install sklearn
Requirement already satisfied: sklearn in /usr/local/lib/python3.7/dist-packa
ges (0.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist
packages (from sklearn) (0.22.2.post1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.7/dist
-packages (from scikit-learn->sklearn) (1.19.5)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.7/dist
-packages (from scikit-learn->sklearn) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist
packages (from scikit-learn->sklearn) (1.0.1)
In [ ]: import os
import pandas
import numpy
import sklearn.ensemble as ek   # module path assumed; 'ek' is used for the ensemble classifiers below
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.externals import joblib
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression
In [ ]: dataset = pandas.read_csv('Malware_Detection_data.csv',sep='|',
low_memory=False)
In [ ]: dataset.head()
Out[ ]:
Name md5 Machine SizeOfOptionalHeader Characteristic 0 [Link]
631ea355665f28d4707448e442fbf5b8 332 224 25 1 [Link] 9d10f99a6712e28f8acd5641e3a7ea6b 332 224
333 2 [Link] 4d92f518527353c0db88a70fddcfd390 332 224 333 3 [Link]
a41e524f8d45f0074fd07805ff0c9b12 332 224 25 4 [Link] c87e561258f2f8650cef999bf643a731 332 224
25
In [ ]: dataset.tail()
Out[ ]:
Name md5 Machine
45080 VirusShare_b8a2fd495c3e170ee0e50cb5251539f8 b8a2fd495c3e170ee0e50cb5251539f8 332 45081
VirusShare_fd35430b011a41b265151939c02f1902 fd35430b011a41b265151939c02f1902 332 45082
VirusShare_0876461ffea8c11041a69baae76bc868 0876461ffea8c11041a69baae76bc868 332 45083
VirusShare_c2325fe7e5f0638eff3b5a1ba4ae1046 c2325fe7e5f0638eff3b5a1ba4ae1046 332 45084
VirusShare_e57735f42657563a27f01e6a5cee1757 e57735f42657563a27f01e6a5cee1757 332
In [ ]: dataset.describe()
Out[ ]:
Machine SizeOfOptionalHeader Characteristics MajorLinkerVersion MinorLinkerVersio count
45085.000000 45085.000000 45085.000000 45085.000000 45085.00000 mean 12307.278962 229.624576
6888.325984 8.720062 1.47248
std 16267.140560 7.639284 4093.550179 1.942843 5.31230 min 332.000000 224.000000 2.000000 0.000000
0.00000 25% 332.000000 224.000000 8226.000000 8.000000 0.00000 50% 332.000000 224.000000
8226.000000 9.000000 0.00000 75% 34404.000000 240.000000 8450.000000 9.000000 0.00000 max
34404.000000 240.000000 41358.000000 255.000000 255.00000
In [ ]: dataset.groupby(dataset['legitimate']).size()
Out[ ]: legitimate
0.0 3761
1.0 41323
dtype: int64
In [ ]: X = dataset.drop(['Name','md5','legitimate'],axis=1).values
y = dataset['legitimate'].values
Part 1
In [ ]: import pandas as pd
import numpy as np
In [ ]: malware_csv = pd.read_csv('[Link]', sep='|')
legit = malware_csv[0:41323].drop(['legitimate'],axis=1)
malware = malware_csv[41323::].drop(['legitimate'],axis=1)
In [ ]: malware_csv
Out[ ]:
Name md5 Mach
0 [Link] 631ea355665f28d4707448e442fbf5b8
1 [Link] 9d10f99a6712e28f8acd5641e3a7ea6b
2 [Link] 4d92f518527353c0db88a70fddcfd390
3 [Link] a41e524f8d45f0074fd07805ff0c9b12
4 [Link] c87e561258f2f8650cef999bf643a731
... ... ...
138042 VirusShare_8e292b418568d6e7b87f2a32aee7074b 8e292b418568d6e7b87f2a32aee7074b 138043
VirusShare_260d9e2258aed4c8a3bbd703ec895822 260d9e2258aed4c8a3bbd703ec895822 138044
VirusShare_8d088a51b7d225c9f5d11d239791ec3f 8d088a51b7d225c9f5d11d239791ec3f 138045
VirusShare_4286dccf67ca220fe67635388229a9f3 4286dccf67ca220fe67635388229a9f3 138046
VirusShare_d7648eae45f09b3adb75127f43be6d11 d7648eae45f09b3adb75127f43be6d11
138047 rows × 57 columns
In [ ]: malware_csv.head()
Out[ ]:
Name md5 Machine SizeOfOptionalHeader Characteristic 0 [Link]
631ea355665f28d4707448e442fbf5b8 332 224 25 1 [Link] 9d10f99a6712e28f8acd5641e3a7ea6b 332 224
333 2 [Link] 4d92f518527353c0db88a70fddcfd390 332 224 333 3 [Link]
a41e524f8d45f0074fd07805ff0c9b12 332 224 25 4 [Link] c87e561258f2f8650cef999bf643a731 332 224
25
In [ ]: malware_csv.tail()
Out[ ]:
Name md5 Mach
138042 VirusShare_8e292b418568d6e7b87f2a32aee7074b 8e292b418568d6e7b87f2a32aee7074b 138043
VirusShare_260d9e2258aed4c8a3bbd703ec895822 260d9e2258aed4c8a3bbd703ec895822 138044
VirusShare_8d088a51b7d225c9f5d11d239791ec3f 8d088a51b7d225c9f5d11d239791ec3f 138045
VirusShare_4286dccf67ca220fe67635388229a9f3 4286dccf67ca220fe67635388229a9f3 138046
VirusShare_d7648eae45f09b3adb75127f43be6d11 d7648eae45f09b3adb75127f43be6d11
In [ ]: malware_csv.describe()
Out[ ]:
Machine SizeOfOptionalHeader Characteristics MajorLinkerVersion MinorLinkerVers count
138047.000000 138047.000000 138047.000000 138047.000000 138047.0000 mean 4259.069274 225.845632
4444.145994 8.619774 3.8192
std 10880.347245 5.121399 8186.782524 4.088757 11.8626 min 332.000000 224.000000 2.000000 0.000000
0.0000 25% 332.000000 224.000000 258.000000 8.000000 0.0000 50% 332.000000 224.000000 258.000000
9.000000 0.0000 75% 332.000000 224.000000 8226.000000 10.000000 0.0000 max 34404.000000 352.000000
49551.000000 255.000000 255.0000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138047 entries, 0 to 138046
Data columns (total 57 columns):
# Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name
138047 non-null object 1 md5 138047 non-null object 2 Machine 138047
non-null int64 3 SizeOfOptionalHeader 138047 non-null int64 4
Characteristics 138047 non-null int64 5 MajorLinkerVersion 138047 non-null
int64 6 MinorLinkerVersion 138047 non-null int64 7 SizeOfCode 138047
non-null int64 8 SizeOfInitializedData 138047 non-null int64 9
SizeOfUninitializedData 138047 non-null int64 10 AddressOfEntryPoint 138047
non-null int64 11 BaseOfCode 138047 non-null int64 12 BaseOfData 138047
non-null int64
13 ImageBase 138047 non-null float64 14 SectionAlignment 138047 non-null
int64 15 FileAlignment 138047 non-null int64 16
MajorOperatingSystemVersion 138047 non-null int64 17
MinorOperatingSystemVersion 138047 non-null int64 18 MajorImageVersion
138047 non-null int64 19 MinorImageVersion 138047 non-null int64 20
MajorSubsystemVersion 138047 non-null int64 21 MinorSubsystemVersion 138047
non-null int64 22 SizeOfImage 138047 non-null int64 23 SizeOfHeaders
138047 non-null int64 24 CheckSum 138047 non-null int64 25 Subsystem
138047 non-null int64 26 DllCharacteristics 138047 non-null int64 27
SizeOfStackReserve 138047 non-null int64 28 SizeOfStackCommit 138047
non-null int64 29 SizeOfHeapReserve 138047 non-null int64 30
SizeOfHeapCommit 138047 non-null int64 31 LoaderFlags 138047 non-null int64
32 NumberOfRvaAndSizes 138047 non-null int64 33 SectionsNb 138047 non-null
int64 34 SectionsMeanEntropy 138047 non-null float64 35 SectionsMinEntropy
138047 non-null float64
36 SectionsMaxEntropy 138047 non-null float64 37 SectionsMeanRawsize 138047
non-null float64 38 SectionsMinRawsize 138047 non-null int64 39
SectionMaxRawsize 138047 non-null int64 40 SectionsMeanVirtualsize 138047
non-null float64 41 SectionsMinVirtualsize 138047 non-null int64 42
SectionMaxVirtualsize 138047 non-null int64 43 ImportsNbDLL 138047 non-null
int64 44 ImportsNb 138047 non-null int64 45 ImportsNbOrdinal 138047
non-null int64 46 ExportNb 138047 non-null int64 47 ResourcesNb 138047
non-null int64 48 ResourcesMeanEntropy 138047 non-null float64 49
ResourcesMinEntropy 138047 non-null float64 50 ResourcesMaxEntropy 138047
non-null float64 51 ResourcesMeanSize 138047 non-null float64
52 ResourcesMinSize 138047 non-null int64 53 ResourcesMaxSize 138047
non-null int64 54 LoadConfigurationSize 138047 non-null int64 55
VersionInformationSize 138047 non-null int64 56 legitimate 138047 non-null
int64 dtypes: float64(10), int64(45), object(2)
memory usage: 60.0+ MB
In [ ]: import matplotlib.pyplot as plt
import seaborn as sns
In [ ]: malware_csv.plot()
Out[ ]: <matplotlib.axes._subplots.AxesSubplot at 0x7f922700bb50>
Out[ ]: array of matplotlib AxesSubplot objects (8 × 7 grid of per-feature subplots; full object listing omitted)
In [ ]: print("The no of samples are %s and no of features are %s for
legitimate part "%([Link][0],[Link][1]))
print("The no of samples are %s and no of features are %s for malware part "%(
[Link][0],[Link][1]))
The no of samples are 41323 and no of features are 56 for legitimate part The
no of samples are 96724 and no of features are 56 for malware part
In [ ]: pd.set_option("display.max_columns",None)
malware
Out[ ]:
Name md5 Mach
41323 VirusShare_4a400b747afe6547e09ce0b02dae7f1c 4a400b747afe6547e09ce0b02dae7f1c 41324
VirusShare_9bd57c8252948bd2fa651ad372bd4f13 9bd57c8252948bd2fa651ad372bd4f13 41325
VirusShare_d1456165e9358b8f61f93a5f2042f39c d1456165e9358b8f61f93a5f2042f39c 41326
VirusShare_e4214cc73afbba0f52bb72d5db8f8bb1 e4214cc73afbba0f52bb72d5db8f8bb1 41327
VirusShare_710890c07b3f93b90635f8bff6c34605 710890c07b3f93b90635f8bff6c34605
... ... ... 138042 VirusShare_8e292b418568d6e7b87f2a32aee7074b 8e292b418568d6e7b87f2a32aee7074b
138043 VirusShare_260d9e2258aed4c8a3bbd703ec895822 260d9e2258aed4c8a3bbd703ec895822 138044
VirusShare_8d088a51b7d225c9f5d11d239791ec3f 8d088a51b7d225c9f5d11d239791ec3f 138045
VirusShare_4286dccf67ca220fe67635388229a9f3 4286dccf67ca220fe67635388229a9f3 138046
VirusShare_d7648eae45f09b3adb75127f43be6d11 d7648eae45f09b3adb75127f43be6d11
96724 rows × 56 columns
In [ ]: from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
In [ ]: malware_csv
Out[ ]:
Name md5 Mach
0 [Link] 631ea355665f28d4707448e442fbf5b8 1 [Link] 9d10f99a6712e28f8acd5641e3a7ea6b 2
[Link] 4d92f518527353c0db88a70fddcfd390 3 [Link] a41e524f8d45f0074fd07805ff0c9b12 4
[Link] c87e561258f2f8650cef999bf643a731
... ... ... 138042 VirusShare_8e292b418568d6e7b87f2a32aee7074b 8e292b418568d6e7b87f2a32aee7074b
138043 VirusShare_260d9e2258aed4c8a3bbd703ec895822 260d9e2258aed4c8a3bbd703ec895822 138044
VirusShare_8d088a51b7d225c9f5d11d239791ec3f 8d088a51b7d225c9f5d11d239791ec3f 138045
VirusShare_4286dccf67ca220fe67635388229a9f3 4286dccf67ca220fe67635388229a9f3 138046
VirusShare_d7648eae45f09b3adb75127f43be6d11 d7648eae45f09b3adb75127f43be6d11
138047 rows × 57 columns
In [ ]: data_input = malware_csv.drop(['Name','md5','legitimate'],axis =
1).values labels = malware_csv['legitimate'].values
extratrees = ExtraTreesClassifier().fit(data_input, labels)
select = SelectFromModel(extratrees, prefit = True)
data_input_new = select.transform(data_input)
In [ ]: import numpy as np
features = data_input_new.shape[1]
importances = extratrees.feature_importances_
indices = np.argsort(importances)[::-1]
for i in range(features):
    print("%d"%(i+1),malware_csv.columns[2+indices[i]],importances[indices[i]])
1 DllCharacteristics 0.18192824351590617
2 Characteristics 0.10840711225864559
3 Machine 0.09972369581559354
4 Subsystem 0.06886261002211971
5 VersionInformationSize 0.05465157639605862
6 SectionsMaxEntropy 0.04926051040315489
7 ImageBase 0.04548174292036617
8 MajorSubsystemVersion 0.043129379250107805
9 SizeOfOptionalHeader 0.041849160410714396
10 ResourcesMinEntropy 0.03683297953662699
11 SizeOfStackReserve 0.03062319891509856
12 ResourcesMaxEntropy 0.029344981855075357
13 SectionsMeanEntropy 0.020449232460599844
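With its default threshold, SelectFromModel retains only the features whose importance is at or above the mean importance reported by the ExtraTrees model, which is why the loop above prints just 13 feature names. As a minimal cross-check (a sketch that simply reuses the select and malware_csv objects defined in the cells above), the retained columns can be listed directly:

        # Sketch: list the columns retained by SelectFromModel (reuses objects from the cells above).
        feature_names = malware_csv.drop(['Name', 'md5', 'legitimate'], axis=1).columns
        kept_mask = select.get_support()        # boolean mask, one entry per input feature
        print(list(feature_names[kept_mask]))   # the retained columns, in original column order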
In [ ]: from sklearn.ensemble import RandomForestClassifier
        # split the selected features and labels into train/test sets (80/20)
        legit_train, legit_test, mal_train, mal_test = train_test_split(data_input_new, labels, test_size=0.2)
        classifier = RandomForestClassifier(n_estimators=50)
        classifier.fit(legit_train, mal_train)
Out[ ]: RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                               criterion='gini', max_depth=None, max_features='auto',
                               max_leaf_nodes=None, max_samples=None,
                               min_impurity_decrease=0.0, min_impurity_split=None,
                               min_samples_leaf=1, min_samples_split=2,
                               min_weight_fraction_leaf=0.0, n_estimators=50,
                               n_jobs=None, oob_score=False, random_state=None,
                               verbose=0, warm_start=False)
In [ ]: print("The score of algorithm is " +
str([Link](legit_test,mal_test) *100))
The score of algorithm is 99.39876856211518
Confusion Matrix
In [ ]: from sklearn.metrics import confusion_matrix
        result = classifier.predict(legit_test)
        conf_matrix = confusion_matrix(mal_test, result)
In [ ]: conf_matrix
Out[ ]: array([[19244, 91],
[ 75, 8200]])
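In scikit-learn's convention the rows of this matrix are the true classes and the columns the predicted classes, in sorted label order (0, then 1), so the off-diagonal cells are exactly the misclassification counts used in the false-positive and false-negative rates computed below. As an optional cross-check (a sketch that assumes mal_test and result from the cell above are still in scope, and assumes the usual meaning of the legitimate column, 0 = malware and 1 = legitimate), the standard per-class metrics can be printed as well:

        # Sketch: per-class precision/recall/F1 as a cross-check on the confusion matrix.
        # Label meaning assumed: 0 = malware, 1 = legitimate (the 'legitimate' column).
        from sklearn.metrics import accuracy_score, classification_report

        print("Accuracy:", accuracy_score(mal_test, result))
        print(classification_report(mal_test, result, target_names=['malware (0)', 'legitimate (1)']))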
In [ ]: print("False Positives:", conf_matrix[0][1]*100/sum(conf_matrix[0]))
        print("False Negatives:", conf_matrix[1][0]*100/sum(conf_matrix[1]))
False Positives: 0.4706490819756918
False Negatives: 0.9063444108761329
Gradient Boost
In [ ]: from sklearn.ensemble import GradientBoostingClassifier
grad_boost = GradientBoostingClassifier(n_estimators=50)
grad_boost.fit(legit_train,mal_train)
Out[ ]: GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                                   learning_rate=0.1, loss='deviance', max_depth=3,
                                   max_features=None, max_leaf_nodes=None,
                                   min_impurity_decrease=0.0, min_impurity_split=None,
                                   min_samples_leaf=1, min_samples_split=2,
                                   min_weight_fraction_leaf=0.0, n_estimators=50,
                                   n_iter_no_change=None, presort='deprecated',
                                   random_state=None, subsample=1.0, tol=0.0001,
                                   validation_fraction=0.1, verbose=0, warm_start=False)
In [ ]: print("Score:", grad_boost.score(legit_test,mal_test)*100) Score:
98.85910901847157
Part 2
In [ ]: import os
import pandas
import numpy
import sklearn.ensemble as ek
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.externals import joblib
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression
In [ ]: model = { "DecisionTree":[Link](max_depth=10),
"RandomForest":[Link](n_estimators=50),
"Adaboost":[Link](n_estimators=50),
"LinearRegression":LinearRegression()
}
In [ ]: results = {}
        for algo in model:
            clf = model[algo]
            clf.fit(legit_train, mal_train)
            score = clf.score(legit_test, mal_test)
            print("%s : %s " % (algo, score))
            results[algo] = score
DecisionTree : 0.9909815284317276
RandomForest : 0.994313654473017
Adaboost : 0.9844983701557407
LinearRegression : 0.5834840523494268
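Since every score is stored in the results dictionary, a short follow-up (a sketch that only uses objects already defined in the loop above) can pick out the best-performing algorithm programmatically:

        # Sketch: select the highest-scoring model from the results dict built above.
        best_name = max(results, key=results.get)
        best_model = model[best_name]            # already fitted inside the loop above
        print("Best model: %s (accuracy %.4f)" % (best_name, results[best_name]))

Note that LinearRegression is a regression model, so its score() here is an R² value rather than a classification accuracy, which is one reason it trails the tree-based classifiers by such a wide margin.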
In [ ]: #your_project_completes
In [ ]:
Conclusion:
We have proposed a malware detection module based on data mining and machine learning. Because the approach is computationally heavy, it may not be practical for home users, but it can be deployed at the enterprise gateway level as a central detection engine that supplements the antivirus software already running on end-user computers. Such an engine not only detects known malware easily, but also acts as a knowledge base capable of flagging newer forms of harmful files. Although a model of this kind requires costly infrastructure, it helps protect invaluable enterprise data from security threats and can prevent immense financial damage.
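As an illustrative sketch (not the report's implementation) of how such a gateway engine could reuse the notebook's artefacts, the trained classifier and the fitted feature selector could be persisted once and then loaded by a scoring service. The file names and the classify_feature_vector helper below are hypothetical, and the standalone joblib package is assumed to be available:

        # Hypothetical sketch of a gateway-side scoring step (file names and helper are illustrative only).
        # Assumes `classifier` (the trained RandomForest) and `select` (the fitted SelectFromModel)
        # from the notebook above.
        import joblib
        import numpy as np

        # One-time persistence after training:
        joblib.dump(classifier, "classifier.pkl")
        joblib.dump(select, "feature_selector.pkl")

        def classify_feature_vector(raw_features):
            """Score one PE file's raw numeric features (the same columns used to build data_input)."""
            clf = joblib.load("classifier.pkl")
            selector = joblib.load("feature_selector.pkl")
            reduced = selector.transform(np.asarray(raw_features, dtype=float).reshape(1, -1))
            return bool(clf.predict(reduced)[0])   # True -> predicted legitimate (1), False -> malware (0)

A gateway service would call classify_feature_vector once per inbound executable, after extracting the same PE-header features that were used to train the model.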
REFERENCES:
[1] [Link] [Link]
[2] "Defining Malware: FAQ".
[Link] Retrieved 2009-09-10.
[3] F-Secure Corporation (December 4, 2007). "F-Secure
Reports Amount of Malware Grew by 100% during 2007". Press
release. Retrieved 2007-12-11.
[4] History of Viruses. [Link] 3_1_1.html
[5] Landesman, Mary (2009). "What is a Virus Signature?" Retrieved 2009-06-18.
[6] Christodorescu,M., Jha, S., 2003. Static analysis of
executables to detect malicious patterns. In: Proceedings of the
12th USENIX Security Symposium, Washington, pp. 105-120.
[7] Filiol, E.,2005. Computer Viruses: from Theory to
Applications. New York, Springer, ISBN-10: 2-287-23939-1.
[8] Filiol, E., Jacob, G., Liard, M.L., 2007: Evaluation
methodology and theoretical model for antiviral
behavioral detection strategies. J. Comput. 3, pp 27–37.
[9] H. Witten and E. Frank. 2005. Data mining: Practical
machine learning tools with Java implementations. Morgan
Kaufmann, ISBN-10: 0120884070.
[10] J. Kolter and M. Maloof, 2004. Learning to detect malicious
executables in the wild. In Proceedings of KDD'04, pp 470-478.
[11] J. Wang, P. Deng, Y. Fan, L. Jaw, and Y. Liu, 2003. Virus detection using data mining techniques. In: Proceedings of IEEE International Conference on Data Mining.
[12] Kephart, J., Arnold, W., 1994. Automatic extraction of
computer virus signatures. In: Proceedings of 4th Virus Bulletin
International Conference, pp. 178–184.
[13] L. Adleman, 1990. An abstract theory of computer viruses
(invited talk). CRYPTO ’88: Proceedings on Advances in
Cryptology, New York, USA. Springer, pp: 354–374.
[14] Lee, T., Mody, J., 2006. Behavioral classification. In: Proceedings of the European Institute for Computer Antivirus Research (EICAR) Conference.
[15] Lo, R., Levitt, K., Olsson, R., 1995. MCF: A malicious code filter. Comput. Secur. 14, pp. 541-566.
[16] M. Schultz, E. Eskin, and E. Zadok, 2001. Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 38-49.
[17] McGraw, G., Morrisett, G., 2002: Attacking malicious code: report to the Infosec Research Council. IEEE Software, pp. 33-41.
[18] P. Szor, 2005. The Art of Computer Virus Research and Defense. New Jersey, Addison Wesley for Symantec Press. ISBN-10: 0321304543.
[19] Rabek, J., Khazan, R., Lewandowski, S., Cunningham,
R., 2003. Detection of injected, dynamically generated, and
obfuscated malicious code. In: Proceedings of the 2003 ACM
Workshop on Rapid Malcode, pp. 76–82.
[20] S. Hashemi, Y. Yang, D. Zabihzadeh, and M. Kangavari, 2008. Detecting intrusion transactions in databases using data item dependencies and anomaly analysis. Expert Systems, 25, 5, pp. 460-473. DOI: 10.1111/j.1468-0394.2008.00467.x
[21] Sung, A., Xu, J., Chavez, P., Mukkamala, S., 2004. Static analyzer of vicious executables (SAVE). In: Proceedings of the 20th Annual Computer Security Applications Conference. IEEE Computer Society Press, ISBN 0-7695-2252-1, pp. 326-334.
[22] Virus dataset, Available from: [Link]
[23] Y. Ye, D. Wang, T. Li, and D. Ye, 2008. An intelligent PE-malware detection system based on association mining. Journal in Computer Virology, 4, 4, pp. 323-334. DOI: 10.1007/s11416-008-0082-4.
[24] Zakorzhevsky, 2011. Monthly Malware Statistics. Available from: [Link] lware_Statistics_June_2011.
[25] Dan Goodin (December 21, 2007). "Anti-virus protection
gets worse". Channel Register. Retrieved 2011-02-24.