
COMPARATIVE STUDY OF FILELESS MALWARE

DETECTION USING MACHINE LEARNING

A Project report submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by

JAISON.V.R

20BAM027

Under the Guidance of

MR. G.MURUGESAN

Assistant Professor and Head, Department of AI & ML

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
SREE SARASWATHI THYAGARAJA COLLEGE
(Autonomous)

An Autonomous, NAAC Re-Accredited with A Grade, ISO 21001:2018 Certified Institution, Affiliated to Bharathiar University, Coimbatore

Approved by AICTE for MBA/MCA and by UGC for 2(f) & 12(B) status

Pollachi-642 107
CERTIFICATE

This is to certify that the project report entitled COMPARATIVE STUDY OF


FILELESS MALWARE DETECTION USING MACHINE LEARNING submitted
to Sree Saraswathi Thyagaraja College (Autonomous), Pollachi, affiliated to
Bharathiar University, Coimbatore in partial fulfillment of the requirements for the
award of the degree of BACHELOR OF ARTIFICIAL INTELLIGENCE AND
MACHINE LEARNING is a record of original work done by JAISON.V.R under
my supervision and guidance, and the report has not previously formed the basis for
the award of any Degree / Diploma / Associateship / Fellowship or other similar title
to any candidate of any University.

Date: 10-11-2022 Guide

Place: Pollachi (Mr. G.MURUGESAN)

Counter Signed by

PC PRINCIPAL

Viva-voce Examination held on -------------------

INTERNAL EXAMINER EXTERNAL EXAMINER


DECLARATION

I, JAISON.V.R hereby declare that the project report entitled COMPARATIVE


STUDY OF FILELESS MALWARE DETECTION USING MACHINE
LEARNING submitted to Sree Saraswathi Thyagaraja College
(Autonomous), Pollachi, affiliated to Bharathiar University, Coimbatore in
partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ARTIFICIAL INTELLIGENCE AND MACHINE
LEARNING is a record of original work done by me under the guidance of
Mr. G.MURUGESAN, Assistant Professor and Head, Department of
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING and it has
not previously formed the basis for the award of any Degree / Diploma /
Associateship / Fellowship or other similar title to any candidate of any
University.

Place: Pollachi

Date:10-11-2022 Signature of the Candidate


ACKNOWLEDGEMENT

I take this opportunity to express my gratitude and sincere thanks to everyone who
helped me in my project.

I wish to express my heartfelt thanks to the Management of Sree Saraswathi


Thyagaraja College for providing me with excellent infrastructure during the course
of study and project.

I wish to express my deep sense of gratitude to Dr. A. SOMU, Principal, Sree


Saraswathi Thyagaraja College for providing me excellent facilities and
encouragement during the course of study and project.

I express my deep sense of gratitude and sincere thanks to my Head of the


Department MRS. GEETHA & my beloved staff members MR. VIVIN JOSE,
MR. S.GUNASEKARAN & MRS. M.LEELAVATHI, who allowed me to carry out this
project and gave me complete freedom to utilize the resources of the department.

It's my prime duty to solemnly express my deep sense of gratitude and sincere thanks
to the guide Mr. G.MURUGESAN, Assistant Professor and Head, UG
Department of Artificial Intelligence and Machine Learning, for his valuable
advice and excellent guidance to complete the project successfully.

I also convey my heartfelt thanks to my parents, friends and all the staff members of
the Department of ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING for
their valuable support which energized me to complete this project.
PROJECT CONTENT

1. Introduction
1.1. Evolution of Malware
1.2. Malware Detection
1.3. Need for Machine Learning in Malware Detection
2. System Study
2.1. Existing System
2.1.1. Drawbacks of Existing System
2.2. Proposed System
2.2.1. Advantages of Proposed System
3. Algorithms Applied
3.1. Decision Tree
3.2. SVM
3.3. XGBoost
3.4. Random Forest
4. Testing
4.1. Testing Methodologies
5. Conclusion and Future Enhancement
6. Source Code
7. Bibliography
8. References
INTRODUCTION:

Idealistic hackers attacked computers in the early days because they were eager
to prove themselves. Cracking machines, however, is an industry in today's
world. Despite recent improvements in software and computer hardware
security, both in frequency and sophistication, attacks on computer systems have
increased. Regrettably, there are major drawbacks to current methods for
detecting and analyzing unknown code samples. The Internet is a critical part of
our everyday lives today. On the internet, there are many services and they are
rising daily as well. Numerous reports indicate that malware's impact is
worsening at an alarming pace. Although malware diversity is growing,
antivirus scanners are unable to fulfill security needs, resulting in attacks on
millions of hosts. According to Kaspersky Labs, around 6,563,145 different
hosts were targeted and 4,000,000 unique malware artifacts were found in 2015.
Juniper Research (2016), in particular, projected that by 2019 the cost of data
breaches would rise to $2.1 trillion globally. Current studies show that more
and more attacks are either generated by script-kiddies or automated. To date,
attacks on commercial and government organizations, such as ransomware and
malware, continue to pose a significant threat and challenge. Such attacks can
come in various ways and sizes. An enormous challenge is the ability of the
global security community to develop and provide expertise in cybersecurity.
There is widespread awareness of the global scarcity of cybersecurity and talent.
Cybercrimes, such as financial fraud, child exploitation online and payment
fraud, are so common that they demand international 24-hour response and
collaboration between multinational law enforcement agencies. For single users
and organizations, malware defense of computer systems is therefore one of the
most critical cybersecurity activities, as even a single attack may result in
compromised data and significant losses.
Malware attacks have been one of the most serious cyber risks faced by different
countries. The number of reported vulnerabilities and malware samples is also
increasing rapidly, and the study of malware behavior has received tremendous
attention from researchers. There are several factors that lead to the development of
malware attacks. Malware authors create and deploy malware that can
mutate and take different forms, such as ransomware and fileless malware.
This is done in order to avoid detection. It is difficult to detect
malware and cyber attacks using traditional cyber security procedures.
Solutions for the new generation cyber attacks rely on various Machine learning
techniques.

EVOLUTION OF MALWARE

In order to protect networks and computer systems from attacks, the diversity,
sophistication and availability of malicious software present enormous
challenges. Malware is continually changing and challenges security researchers
and scientists to strengthen their cyber defenses to keep pace. Owing to the use
of polymorphic and metamorphic methods used to avoid detection and conceal
its true intent, the prevalence of malware has increased. To mutate the code
while keeping the original functionality intact, polymorphic malware uses a
polymorphic engine. The two most common ways to conceal code are packing
and encryption. Through one or more layers of compression, packers cover a
program's real code. Then the unpacking routines restore the original code and
execute it in memory at runtime. To make it harder for researchers to analyze the
software, crypters encrypt and manipulate malware or part of its code. A crypter
includes a stub that is used for malicious code encryption and decryption.
Whenever it's propagated, metamorphic malware rewrites the code to an
equivalent. Multiple transformation techniques, including but not limited to,
register renaming, code permutation, code expansion, code shrinking and
insertion of garbage code, can be used by malware authors. The combination of
the above techniques has resulted in steadily increasing quantities of malware,
making forensic investigations of malware cases time-consuming, expensive and
more complicated. There are some issues with conventional
antivirus solutions that rely on signature-based and heuristic/behavioral
methods. A signature is a unique feature or collection of features that like a
fingerprint, uniquely differentiates an executable. Signature-based approaches
are unable to identify unknown types of malware, however. Security researchers
suggested behavior-based detection to overcome these problems, which analyzes
the features and behavior of the file to decide whether it is indeed malware,
although it may take some time to search and evaluate. Researchers have begun
implementing machine learning to supplement their solutions in order to solve
the previous drawbacks of conventional antivirus engines and keep pace with
new attacks and variants, as machine learning is well suited for processing large
quantities of data.

1. MALWARE DETECTION
Hackers present malware in ways designed to persuade people to
install it. As it appears legitimate, users do not know what the
programme really is. Usually, we install it thinking that it is secure, but on the
contrary, it is a major threat. That is how the malware gets into your
system. Once inside, it disperses and hides in numerous files,
making it very difficult to identify. In order to access and record
personal or useful information, it may connect directly to the operating
system and start encrypting it. Detection of malware is defined as the
process of searching for malware files and directories. There are several tools
and methods available to detect malware that make detection efficient and
reliable. Some of the general strategies for malware detection are:

○ Signature-based
○ Heuristic Analysis
○ Anti-malware Software
○ Sandbox
Several classifiers have been implemented,
such as linear classifiers (logistic regression, naive
Bayes classifier), support vector machines, neural
networks, random forests, etc. Through both static and
dynamic analysis, malware can be identified by:

○ Without Executing the code


○ Behavioural Analysis

2. NEED FOR MACHINE LEARNING IN MALWARE


DETECTION
Machine learning has created a drastic change in many industries,
including cybersecurity, over the last decade. Among cybersecurity
experts, there is a general belief that AI-powered anti-malware tools
can help detect modern malware attacks and boost scanning engines.
Proof of this belief is the number of studies on malware detection
strategies that exploit machine learning reported in the last few years.
The number of research papers released in 2018 is 7720, a 95 percent
rise over 2015 and a 476 percent increase over 2010, according to
Google Scholar. This rise in the number of studies is the product of
several factors, including but not limited to the increase in publicly
labeled malware feeds, the increase in computing capacity at the same
time as its price decrease, and the evolution of the field of machine
learning, which has achieved ground-breaking success in a wide range
of tasks such as computer vision and speech recognition. Depending
on the type of analysis, conventional machine learning methods can be
categorized into two main categories, static and dynamic approaches.
The primary difference between them is that static methods extract
features from the static malware analysis, while dynamic methods
extract features from the dynamic analysis. A third category may be
considered, known as hybrid approaches. Hybrid methods incorporate
elements of both static and dynamic analysis. In addition, neural
networks that learn features from raw inputs have excelled in diverse
fields. Recent developments in machine learning for cybersecurity
mirror this performance of neural networks in the malware
domain.

Brief:

Malware, short for malicious software, consists of programming (code, scripts,


active content, and other software) designed to disrupt or deny operation, gather
information that leads to loss of privacy or exploitation, gain unauthorized
access to system resources, and other abusive behavior. It is a general term used
to define a variety of forms of hostile, intrusive, or annoying software or
program code. Software is considered to be malware based on the perceived
intent of the creator rather than any particular features. Malware includes
computer viruses, worms, Trojan horses, spyware, dishonest adware,
crime-ware, most rootkits, and other malicious and unwanted software or
programs.

In 2008, Symantec published a report stating that "the release rate of malicious code
and other unwanted programs may be exceeding that of legitimate software
applications." According to F-Secure, "As much malware was produced in 2007
as in the previous 20 years altogether."

Since the rise of widespread Internet access, malicious software has been
designed for a profit, for example forced advertising. For instance, since 2003,
the majority of widespread viruses and worms have been designed to take
control of users' computers for black-market exploitation. Another category of
malware is spyware: programs designed to monitor users' web browsing and
steal private information. Spyware programs do not spread like viruses; instead,
they are installed by exploiting security holes or are packaged with user-installed
software, such as peer-to-peer applications.

Clearly, there is a very urgent need to find, not just a suitable method to detect
infected files, but to build a smart engine that can detect new viruses by
studying the structure of system calls made by malware.

2. Current Antivirus Software

Antivirus software is used to prevent, detect, and remove malware, including


but not limited to computer viruses, computer worms, Trojan horses, spyware
and adware. A variety of strategies are typically employed by antivirus
engines. Signature-based detection involves searching for known patterns of
data within executable code. However, it is possible for a computer to be
infected with a new virus for which no signatures exist. To counter such
“zero-day” threats, heuristics can be used to identify new viruses or variants of
existing viruses by looking for known malicious code. Some antivirus can also
make predictions by executing files in a sandbox and analyzing results.

Often, antivirus software can impair a computer's performance. Any incorrect


decision may lead to a security breach, since it runs at the highly trusted kernel
level of the operating system. If the antivirus software employs heuristic
detection, success depends on achieving the right balance between false
positives and false negatives. Today, malware may no longer be executable
files. Powerful macros in Microsoft Word could also present a security risk.
Traditionally, antivirus software heavily relied upon signatures to identify
malware. However, because of newer kinds of malware, signature-based
approaches are no longer effective.

Although standard antivirus can effectively contain virus outbreaks, for large
enterprises, any breach could be potentially fatal. Virus makers are employing
"oligomorphic", "polymorphic" and, "metamorphic" viruses, which encrypt
parts of themselves or modify themselves as a method of disguise, so as to not
match virus signatures in the dictionary.

Studies in 2007 showed that the effectiveness of antivirus software had


decreased drastically, particularly against unknown or zero day attacks.
Detection rates have dropped from 40-50% in 2006 to 20-30% in 2007. The
problem is magnified by the changing intent of virus makers. Independent
testing on all the major virus scanners consistently shows that none provide
100% virus detection.
The work can be described as follows:

● Describing the data: the dataset is imported and the different columns
in the dataset are discussed.
● Data cleaning: after examining the dataset, the required steps are taken
to clean it; null values and columns of little significance are removed so
that they do not affect the training step.
● Data training and testing: when the data is clean and ready for
training, it is split into a training dataset and a testing dataset in an
80:20 ratio (a minimal sketch of this split follows this list).
● Applying different algorithms (ML algorithms): as we try to achieve the
highest accuracy, several algorithms are compared to see which gives the
better precision.
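The 80:20 split described above can be sketched as follows. This is a minimal illustration that reuses the dataset file and column names appearing later in the source-code section; it is not the exact code used in the project.

import pandas as pd
from sklearn.model_selection import train_test_split

# load the pipe-separated dataset used later in this report
dataset = pd.read_csv('Malware_Detection_data.csv', sep='|', low_memory=False)

# drop identifier columns; 'legitimate' is the label (1 = legitimate, 0 = malware)
X = dataset.drop(['Name', 'md5', 'legitimate'], axis=1).values
y = dataset['legitimate'].values

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)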

MAIN ALGORITHMS APPLIED:

1. DECISION TREE
2. RANDOM FOREST
3. SVM
4. XGBOOST
DECISION TREE:
The decision tree Algorithm belongs to the family of supervised

machine learning algorithms. It can be used for both a

classification problem as well as for a regression problem.

The goal of this algorithm is to create a model that predicts the

value of a target variable, for which the decision tree uses the

tree representation to solve the problem in which the leaf node

corresponds to a class label and attributes are represented on

the internal node of the tree.

Let’s take a sample data set to move further.

Suppose we have a sample data set of 14 patients and we have

to predict which drug, A or B, to suggest to a patient.

Let’s say we pick cholesterol as the first attribute to split the data.

It will split our data into two branches, High and Normal, based

on cholesterol, as you can see in the above figure.

Let’s suppose our new patient has high cholesterol. By the

above split of our data we cannot say whether Drug B or Drug

A will be suitable for the patient.

Also, if the patient’s cholesterol is normal we still do not have

enough information to determine whether Drug A or Drug B

is suitable for the patient.

Let us take another attribute, Age. Age has three

categories in it: young, middle-aged and senior. Let’s try to split on it.

From the above figure, we can now say that we can easily

predict which drug to give to a patient based on his or her reports.

Assumptions that we make while using the Decision Tree:

– In the beginning, we consider the whole training set as the root.

– Feature values are preferred to be categorical; if the values are

continuous, they are converted to discrete values before building the model.

– Records are distributed recursively on the basis of attribute values.

– We use a statistical method for ordering attributes as the root

node or an internal node.

Mathematics behind the Decision Tree algorithm: before going to

Information Gain, we first have to understand entropy.

Entropy: Entropy is the measure of impurity, disorder,

or uncertainty in a bunch of examples.

Purpose of Entropy:

Entropy controls how a Decision Tree decides to split the

data. It affects how a Decision Tree draws its boundaries.

Entropy values range from 0 to 1; the lower the entropy of a

split, the more reliable that split is.


Suppose we have features F1, F2 and F3, and we selected the F1

feature as our root node.

F1 contains 9 yes labels and 5 no labels; after splitting on F1

we get F2, which has 6 yes / 2 no, and F3, which has 3 yes / 3 no.

Now we calculate the entropy of F2 by putting the values into the

entropy formula: 6 is the number of yes labels (taken as the positive

class when calculating the probability), and 8 is the total number of

rows present in F2.
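A minimal reconstruction of the missing formula, using the standard definition of entropy and the counts given above (6 yes and 2 no out of 8 rows in F2):

$$E(S) = -\sum_{i} p_i \log_2 p_i$$

$$E(F_2) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} \approx 0.811$$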

Similarly, if we compute the entropy for F3 we will get 1 bit, which is

the worst case for an attribute, as it contains 50% yes and 50% no.

This splitting goes on until we get a pure subset.

What is a Pure Subset?

A pure subset is a situation where we get either all yes or all no.

We have performed this for one node. After splitting F2 we may also

require some other attribute to reach the leaf node, and we then have to

take the entropy of those values as well and add them up. To combine all

those entropy values we have the concept of information gain.

Information Gain: Information gain is used to decide which

feature to split on at each step in building the tree. Simplicity is

best, so we want to keep our tree small. To do so, at each step

we should choose the split that results in the purest daughter

nodes. A commonly used measure of purity is called

information.

For each node of the tree, the information value measures how
much information a feature gives us about the class. The split

with the highest information gain will be taken as the first split

and the process will continue until all children nodes are pure,

or until the information gain is 0.

The algorithm calculates the information gain for each split and

the split which is giving the highest value of information gain is

selected.

We can say that in information gain we compute the weighted

average of all the entropies based on the specific split.

Sv = the number of samples in a subset after the split (for example, F2

has 8 samples: 6 yes and 2 no)

S = the total number of samples before the split, as in F1 = 9 + 5 = 14

Now calculating the Information Gain:
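A minimal reconstruction of the missing formula, using the standard definition of information gain with the symbols defined above:

$$IG(S, A) = E(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\, E(S_v)$$

With the counts above, E(S) for 9 yes and 5 no is about 0.940, so the gain of this split is approximately 0.940 − (8/14)(0.811) − (6/14)(1) ≈ 0.048.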

Like this, the algorithm computes the information gain for each of the

n candidate splits, and whichever split gives the highest information

gain is taken in order to construct the decision tree.

The higher the information gain of a split, the higher the chance of it

being selected.

Gini Impurity:

Gini Impurity is a measurement used when building Decision Trees to

determine how the features of a data set should split nodes to

form the tree. More precisely, the Gini Impurity of a data set is

a number between 0 and 0.5, which indicates the likelihood of new,

random data being misclassified if it were given a random

class label according to the class distribution in the data set.
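For reference, a minimal statement of the standard Gini impurity formula assumed in the discussion above:

$$Gini(S) = 1 - \sum_{i} p_i^2$$

For a two-class node this peaks at 0.5 when both classes are equally likely, which is why the text describes it as a number between 0 and 0.5.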

Entropy vs Gini Impurity


The maximum value for entropy is 1 whereas the maximum

value for Gini impurity is 0.5.

As Gini Impurity does not involve any logarithmic function in its

calculation, it takes less computational time as compared to

entropy.
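A minimal sketch comparing the two split criteria in scikit-learn; the synthetic data and parameter values are illustrative, not the report's dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# illustrative synthetic data standing in for the report's dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# train one tree per split criterion discussed above
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X_train, y_train)
    print(criterion, clf.score(X_test, y_test))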

2. SVM Algorithm

“Support Vector Machine” (SVM) is a supervised machine


learning algorithm that can be used for both classification and

regression challenges. However, it is mostly used in

classification problems. In the SVM algorithm, we plot each

data item as a point in n-dimensional space (where n is a

number of features you have) with the value of each feature

being the value of a particular coordinate. Then, we perform

classification by finding the hyper-plane that differentiates the

two classes very well (look at the below snapshot).

Support Vectors are simply the coordinates of individual

observation. The SVM classifier is a frontier that best

segregates the two classes (hyper-plane/ line).


How does it work?


Above, we got accustomed to the process of segregating the

two classes with a hyper-plane. Now the burning question is

“How can we identify the right hyper-plane?”. Don’t worry, it’s


not as hard as you think!

Let’s understand:

● Identify the right hyper-plane (Scenario-1): Here, we have


three hyper-planes (A, B, and C). Now, identify the right
hyper-plane to classify stars and circles.

You need to
remember a thumb rule to identify the right hyper-plane:
“Select the hyper-plane which segregates the two classes
better”. In this scenario, hyper-plane “B” has excellently
performed this job.
● Identify the right hyper-plane (Scenario-2): Here, we have
three hyper-planes (A, B, and C) and all are segregating
the classes well. Now, How can we identify the right
hyper-plane?

Here, maximizing the


distances between nearest data point (either class) and
hyper-plane will help us to decide the right hyper-plane.
This distance is called Margin. Let’s look at the below
snapshot: Above,
you can see that the margin for hyper-plane C is high as
compared to both A and B. Hence, we name the right
hyper-plane as C. Another compelling reason for selecting
the hyper-plane with the higher margin is robustness. If we
select a hyper-plane having a low margin then there is a high
chance of misclassification.
● Identify the right hyper-plane (Scenario-3): Hint: Use the
rules as discussed in the previous section to identify the right
hyper-plane.

Some of you may have


selected the hyper-plane B as it has higher margin compared
to
A. But here is the catch: SVM selects the hyper-plane which

classifies the classes accurately prior to maximizing the margin.

Here, hyper-plane B has a classification error and A has

classified all correctly. Therefore, the right hyper-plane is A.

● Can we classify two classes (Scenario-4)?: Below, I am


unable to segregate the two classes using a straight line,
as one of the stars lies in the territory of the other (circle) class
as an outlier.

As I have
already mentioned, one star at other end is like an outlier
for star class. The SVM algorithm has a feature to ignore
outliers and find the hyper-plane that has the maximum
margin. Hence, we can say, SVM classification is robust to
outliers.

● Find the hyper-plane to segregate two classes (Scenario-5):


In the scenario below, we can’t have a linear hyper-plane
between the two classes, so how does SVM classify these
two classes? Till now, we have only looked at linear
hyper-planes.
SVM can solve this
problem easily! It solves this problem by introducing an
additional feature. Here, we will add a new feature
z = x^2 + y^2. Now, let’s plot the data points on the x and
z axes:

In above plot, points to consider are:


○ All values for z would be positive always because z is
the squared sum of both x and y
○ In the original plot, red circles appear close to the
origin of the x and y axes, leading to a lower value of z, and
stars relatively far away from the origin result in a higher
value of z.
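A minimal sketch of this idea on illustrative data (circles near the origin, stars farther out): the same non-linear boundary can be obtained either by adding the z = x^2 + y^2 feature by hand and using a linear SVM, or by letting SVC use a non-linear (RBF) kernel. The data and names below are illustrative assumptions, not part of the report.

import numpy as np
from sklearn.svm import SVC

# illustrative data: class 0 near the origin, class 1 on an outer ring
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.concatenate([np.zeros(100), np.ones(100)])

# option 1: add the new feature z = x^2 + y^2 and separate linearly in (x, y, z)
z = (X ** 2).sum(axis=1).reshape(-1, 1)
linear_on_z = SVC(kernel="linear").fit(np.hstack([X, z]), y)

# option 2: let an RBF kernel learn the non-linear boundary directly
rbf = SVC(kernel="rbf").fit(X, y)

print(linear_on_z.score(np.hstack([X, z]), y), rbf.score(X, y))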

3. XGBOOST
Ever since its introduction in 2014, XGBoost has been lauded as
the holy grail of machine learning hackathons and

competitions. From predicting ad click-through rates to

classifying high energy physics events, XGBoost has proved its

mettle in terms of performance – and speed.

I always turn to XGBoost as my first algorithm of choice in any

ML hackathon. The accuracy it consistently gives, and the time

it saves, demonstrates how useful it is. But how does it actually

work? What kind of mathematics power XGBoost? We’ll figure

out the answers to these questions soon.

Tianqi Chen, one of the co-creators of XGBoost, announced (in

2016) that the innovative system features and algorithmic

optimizations in XGBoost have rendered it 10 times faster than

the most sought-after machine learning solutions. A truly amazing

technique!

In this article, we will first look at the power of XGBoost, and

then deep dive into the inner workings of this popular and

powerful technique. It’s good to be able to implement it in

Python or R, but understanding the nitty-gritties of the

algorithm will help you become a better data scientist.

Table of Contents
● The Power of XGBoost
● Why Ensemble Learning?
○ Bagging
○ Boosting
● Demonstrating the Potential of Boosting
● Using gradient descent for optimizing the loss function
● Unique Features of XGBoost
The Power of XGBoost
The beauty of this powerful algorithm lies in its scalability,

which drives fast learning through parallel and distributed

computing and offers efficient memory usage.

It’s no wonder then that CERN recognized it as the best

approach to classify signals from the Large Hadron Collider. This

particular challenge posed by CERN required a solution that

would be scalable to process data being generated at the rate

of 3 petabytes per year and effectively distinguish an extremely

rare signal from background noises in a complex physical

process. XGBoost emerged as the most useful, straightforward

and robust solution.

Now, let’s deep dive into the inner workings of XGBoost.

Why ensemble learning?


XGBoost is an ensemble learning method. Sometimes, it may

not be sufficient to rely upon the results of just one machine

learning model. Ensemble learning offers a systematic solution

to combine the predictive power of multiple learners. The

resultant is a single model which gives the aggregated output

from several models.

The models that form the ensemble, also known as base


learners, could be either from the same learning algorithm or

different learning algorithms. Bagging and boosting are two

widely used ensemble learners. Though these two techniques

can be used with several statistical models, the most

predominant usage has been with decision trees.

Let’s briefly discuss bagging before taking a more detailed look

at the concept of boosting.

Bagging
While decision trees are one of the most easily interpretable

models, they exhibit highly variable behavior. Consider a single

training dataset that we randomly split into two parts. Now,

let’s use each part to train a decision tree in order to obtain two

models.

When we fit both these models, they would yield different

results. Decision trees are said to be associated with high

variance due to this behavior. Bagging, or bootstrap aggregation,

helps to reduce the variance in any learner. Several decision

trees which are generated in parallel, form the base learners of

bagging technique. Data sampled with replacement is fed to

these learners for training. The final prediction is the averaged

output from all the learners.

Boosting
In boosting, the trees are built sequentially such that each

subsequent tree aims to reduce the errors of the previous tree.

Each tree learns from its predecessors and updates the residual

errors. Hence, the tree that grows next in the sequence will

learn from an updated version of the residuals.

The base learners in boosting are weak learners in which the

bias is high, and the predictive power is just a tad better than

random guessing. Each of these weak learners contributes

some vital information for prediction, enabling the boosting

technique to produce a strong learner by effectively combining

these weak learners. The final strong learner brings down both

the bias and the variance.

In contrast to bagging techniques like Random Forest, in which

trees are grown to their maximum extent, boosting makes use

of trees with fewer splits. Such small trees, which are not very

deep, are highly interpretable. Parameters like the number of

trees or iterations, the rate at which the gradient boosting

learns, and the depth of the tree, could be optimally selected

through validation techniques like k-fold cross validation.

Having a large number of trees might lead to overfitting. So, it

is necessary to carefully choose the stopping criteria for

boosting.

The boosting ensemble technique consists of three simple

steps:

● An initial model F0 is defined to predict the target variable
y. This model will be associated with a residual (y – F0)
● A new model h1 is fit to the residuals from the previous step
● Now, F0 and h1 are combined to give F1, the boosted
version of F0. The mean squared error from F1 will be
lower than that from F0:

To improve the performance of F1, we could model the

residuals of F1 and create a new model F2.

This can be done for ‘m’ iterations, until the residuals have been

minimized as much as possible:
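A minimal reconstruction of the additive-model formulas this passage refers to, in the notation used above:

$$F_1(x) = F_0(x) + h_1(x)$$

$$F_2(x) = F_1(x) + h_2(x)$$

$$F_m(x) = F_{m-1}(x) + h_m(x)$$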

Here, the additive learners do not disturb the functions created

in the previous steps. Instead, they impart information of their

own to bring down the errors.

Demonstrating the Potential of Boosting


Consider the following data, where years of experience is the

predictor variable and salary (in thousand dollars) is the target.

Using regression trees as base learners, we can create an

ensemble model to predict the salary. For the sake of simplicity,

we can choose square loss as our loss function and our


objective would be to minimize the square error.

As the first step, the model should be initialized with a function

F0(x). F0(x) should be a function which minimizes the loss

function, or MSE (mean squared error), in this case.

Taking the first derivative of the loss with respect

to γ, it is seen that it is minimized at the mean

$\frac{1}{n}\sum_{i=1}^{n} y_i$. So, the boosting model could be initialized with:

$$F_0(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$$

F0(x) gives the predictions from the first stage of our model.

Now, the residual error for each instance is (yi – F0(x)).


We can use the residuals from F0(x) to create h1(x). h1(x) will

be a regression tree which will try and reduce the residuals

from the previous step. The output of h1(x) won’t be a

prediction of y; instead, it will help in predicting the successive

function F1(x) which will bring down the residuals.

The additive model h1(x) computes the mean of the residuals

(y – F0) at each leaf of the tree. The boosted function F1(x) is

obtained by summing F0(x) and h1(x). This way h1(x) learns

from the residuals of F0(x) and suppresses it in F1(x).


This can be repeated for 2 more iterations to compute h2(x)

and h3(x). Each of these additive learners, hm(x), will make

use of the residuals from the preceding function, Fm-1(x).


The MSEs for F0(x), F1(x) and F2(x) are 875, 692 and 540. It’s

amazing how these simple weak learners can bring about a

huge reduction in error!

Note that each learner, hm(x), is trained on the residuals. All

the additive learners in boosting are modeled after the residual

errors at each step. Intuitively, it could be observed that the

boosting learners make use of the patterns in residual errors.

At the stage where maximum accuracy is reached by boosting,

the residuals appear to be randomly distributed without any

pattern.
Plots of Fn and hn

Using gradient descent for optimizing the loss function


In the case discussed above, MSE was the loss function. The

mean minimized the error here. When MAE (mean absolute

error) is the loss function, the median would be used as F0(x)

to initialize the model. A unit change in y would cause a unit

change in MAE as well.

For MSE, the change observed would be roughly quadratic rather than linear.

Instead of fitting hm(x) on the residuals, fitting it on the


gradient of the loss function, i.e. the direction along which the loss

changes, would make this process generic and applicable across all loss

functions.

Gradient descent helps us minimize any differentiable function.

Earlier, the regression tree for hm(x) predicted the mean

residual at each terminal node of the tree. In gradient boosting,

the average gradient component would be computed.

For each node, there is a factor γ with which hm(x) is

multiplied. This accounts for the difference in impact of each

branch of the split. Gradient boosting helps in predicting the

optimal gradient for the additive model, unlike classical

gradient descent techniques which reduce error in the output at

each iteration.

The following steps are involved in gradient boosting (a minimal

reconstruction of the corresponding formulas follows this list):

● F0(x) – with which we initialize the boosting algorithm – is
to be defined
● The gradient of the loss function is computed iteratively
● Each hm(x) is fit on the gradient obtained at each step
● The multiplicative factor γm for each terminal node is
derived and the boosted model Fm(x) is defined
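A minimal reconstruction of the formulas for these steps, written in standard gradient-boosting notation consistent with the text above:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big)$$

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$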
Unique features of XGBoost
XGBoost is a popular implementation of gradient boosting. Let’s

discuss some features of XGBoost that make it so interesting.

● Regularization: XGBoost has an option to penalize


complex models through both L1 and L2
regularization. Regularization helps in preventing
overfitting
● Handling sparse data: Missing values or data processing
steps like one-hot encoding make data sparse. XGBoost
incorporates a sparsity-aware split finding algorithm to
handle different types of sparsity patterns in the data
● Weighted quantile sketch: Most existing tree based
algorithms can find the split points when the data points
are of equal weights (using quantile sketch algorithm).
However, they are not equipped to handle weighted data.
XGBoost has a distributed weighted quantile sketch
algorithm to effectively handle weighted data
● Block structure for parallel learning: For faster computing,
XGBoost can make use of multiple cores on the CPU. This
is possible because of a block structure in its system
design. Data is sorted and stored in in-memory units
called blocks. Unlike other algorithms, this enables the
data layout to be reused by subsequent iterations, instead
of computing it again. This feature also proves useful for
steps like split finding and column sub-sampling
● Cache awareness: In XGBoost, non-continuous memory
access is required to get the gradient statistics by row
index. Hence, XGBoost has been designed to make
optimal use of hardware. This is done by allocating
internal buffers in each thread, where the gradient
statistics can be stored
● Out-of-core computing: This feature optimizes the
available disk space and maximizes its usage when
handling huge datasets that do not fit into memory
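A minimal sketch of training an XGBoost classifier, assuming the xgboost package is installed; the data and parameter values are illustrative and not the project's own configuration.

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# illustrative data standing in for the PE-header features used in this report
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# a small boosted-tree ensemble with the regularization options discussed above
model = XGBClassifier(
    n_estimators=200,    # number of boosting rounds (trees)
    max_depth=4,         # shallow trees, as is typical for boosting
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    reg_alpha=0.0,       # L1 regularization term
    reg_lambda=1.0,      # L2 regularization term
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))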

4. RANDOM FOREST
Random forest is a Supervised Machine Learning Algorithm

that is used widely in Classification and Regression problems.

It

builds decision trees on different samples and takes

their majority vote for classification and average in case

of regression.

One of the most important features of the Random Forest

Algorithm is that it can handle the data set containing

continuous variables as in the case of regression and

categorical variables as in the case of classification. It gives

better results for classification problems.

Real Life Analogy


Let’s dive into a real-life analogy to understand this concept

further. A student named X wants to choose a course after his

10+2, and he is confused about the choice of course based on

his skill set. So he decides to consult various people like his

cousins, teachers, parents, degree students, and working

people. He asks them varied questions like why he should

choose that course, job opportunities with that course, the course fee, etc.


Finally, after consulting various people about the course he

decides to take the course suggested by most of the people.

Working of Random Forest Algorithm


Before understanding the working of the random forest we

must look into the ensemble technique. Ensemble simply

means combining multiple models. Thus a collection of models

is used to make predictions rather than an individual model.

Ensemble uses two types of methods:

1. Bagging– It creates a different training subset from sample

training data with replacement & the final output is based on

majority voting. For example, Random Forest.

2. Boosting– It combines weak learners into strong learners by

creating sequential models such that the final model has the

highest accuracy. For example, AdaBoost, XGBoost.


As mentioned earlier, Random forest works on the Bagging

principle. Now let’s dive in and understand bagging in

detail.

Bagging
Bagging, also known as Bootstrap Aggregation is the ensemble

technique used by random forest. Bagging chooses a random

sample from the data set. Hence each model is generated from

the samples (Bootstrap Samples) provided by the Original Data

with replacement known as row sampling. This step of row

sampling with replacement is called bootstrap. Now each

model is trained independently which generates results. The

final output is based on majority voting after combining the

results of all models. This step which involves combining all the

results

and generating output based on majority voting is known as

aggregation.
Now let’s look at an example by breaking it down with the help

of the following figure. Here the bootstrap sample is taken from

actual data (Bootstrap sample 01, Bootstrap sample 02, and

Bootstrap sample 03) with replacement, which means there is

a high possibility that each sample won’t contain unique data.

Now the model (Model 01, Model 02, and Model 03) obtained

from this bootstrap sample is trained independently. Each

model generates results as shown. Now Happy emoji is having

a majority when compared to sad emoji. Thus based on

majority voting final output is obtained as Happy emoji.


Steps involved in random forest algorithm:
Step 1: In Random forest n number of random records are

taken from the data set having k number of records.

Step 2: Individual decision trees are constructed for each

sample.

Step 3: Each decision tree will generate an output.

Step 4: Final output is considered based on Majority Voting or

Averaging for Classification and regression respectively.


For example: consider the fruit basket as the data as shown in

the figure below. Now n number of samples are taken from the

fruit basket and an individual decision tree is constructed for

each sample. Each decision tree will generate an output as

shown in the figure. The final output is considered based on

majority voting. In the below figure you can see that the

majority decision tree gives output as an apple when compared

to a banana, so the final output is taken as an apple.


Important Features of Random Forest
1. Diversity- Not all attributes/variables/features are

considered while making an individual tree, each tree is

different.

2. Immune to the curse of dimensionality- Since each tree does

not consider all the features, the feature space is reduced.

3. Parallelization-Each tree is created independently out of

different data and attributes. This means that we can make full

use of the CPU to build random forests.

4. Train-Test split- In a random forest we don’t have to

segregate the data for train and test as there will always be

30% of the data which is not seen by the decision tree.


5. Stability- Stability arises because the result is based on

majority voting/ averaging.

Difference Between Decision Tree & Random Forest

Random forest is a collection of decision trees; still, there are a

lot of differences in their behavior.

Decision Trees:

1. Decision trees normally suffer from the problem of overfitting if

allowed to grow without any control.

2. A single decision tree is faster in computation.

3. When a data set with features is taken as input by a decision tree, it

formulates some set of rules to do prediction.

Random Forest:

1. Random forests are created from subsets of data and the final output is

based on average or majority ranking, and hence the problem of overfitting

is taken care of.

2. It is comparatively slower.

3. Random forest randomly selects observations, builds a decision tree and

takes the average result. It doesn’t use any set of formulas.

Thus random forests are much more successful than decision

trees only if the trees are diverse and acceptable.

Important Hyperparameters
Hyperparameters are used in random forests to either enhance

the performance and predictive power of models or to make

the model faster.

The following hyperparameters increase the predictive power:

1. n_estimators – the number of trees the algorithm builds

before averaging the predictions.

2. max_features – the maximum number of features random forest

considers when splitting a node.

3. min_samples_leaf – the minimum number of samples required to be

at a leaf node.

The following hyperparameters increase the speed:

1. n_jobs– it tells the engine how many processors it is allowed

to use. If the value is 1, it can use only one processor but if the

value is -1 there is no limit.

2. random_state– controls randomness of the sample. The

model will always produce the same results if it has a definite

value of random state and if it has been given the same

hyperparameters and the same training data.

3. oob_score – OOB means out of bag. It is a random

forest cross-validation method in which about one-third of the sample

is not used to train the model and is instead used to evaluate its

performance. These samples are called out-of-bag samples.
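A minimal sketch showing where these hyperparameters plug into scikit-learn's RandomForestClassifier; the values are illustrative, not tuned for the report's dataset.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees built before voting/averaging
    max_features="sqrt",   # features considered when splitting a node
    min_samples_leaf=2,    # minimum samples required at a leaf node
    n_jobs=-1,             # use all available processors
    random_state=42,       # make results reproducible
    oob_score=True,        # evaluate on the out-of-bag samples
)
# after fitting: rf.fit(X_train, y_train); print(rf.oob_score_)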

SOURCE CODE
In [ ]: import os

In [ ]: !pip install sklearn

Requirement already satisfied: sklearn in /usr/local/lib/python3.7/dist-packages (0.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from sklearn) (0.22.2.post1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->sklearn) (1.19.5)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->sklearn) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->sklearn) (1.0.1)

In [ ]: import os
import pandas
import numpy

import sklearn.ensemble as ek
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.externals import joblib  # deprecated path; newer scikit-learn versions use "import joblib"
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression

In [ ]: dataset = pandas.read_csv('Malware_Detection_data.csv', sep='|', low_memory=False)

In [ ]: dataset.head()

Out[ ]:
           Name                               md5  Machine  SizeOfOptionalHeader  Characteristics  ...
0   memtest.exe  631ea355665f28d4707448e442fbf5b8      332                   224               25  ...
1       ose.exe  9d10f99a6712e28f8acd5641e3a7ea6b      332                   224              333  ...
2     setup.exe  4d92f518527353c0db88a70fddcfd390      332                   224              333  ...
3      DW20.EXE  a41e524f8d45f0074fd07805ff0c9b12      332                   224               25  ...
4  dwtrig20.exe  c87e561258f2f8650cef999bf643a731      332                   224               25  ...
In [ ]: dataset.tail()

Out[ ]:
                                              Name                               md5  Machine  ...
45080  VirusShare_b8a2fd495c3e170ee0e50cb5251539f8  b8a2fd495c3e170ee0e50cb5251539f8      332  ...
45081  VirusShare_fd35430b011a41b265151939c02f1902  fd35430b011a41b265151939c02f1902      332  ...
45082  VirusShare_0876461ffea8c11041a69baae76bc868  0876461ffea8c11041a69baae76bc868      332  ...
45083  VirusShare_c2325fe7e5f0638eff3b5a1ba4ae1046  c2325fe7e5f0638eff3b5a1ba4ae1046      332  ...
45084  VirusShare_e57735f42657563a27f01e6a5cee1757  e57735f42657563a27f01e6a5cee1757      332  ...

In [ ]: dataset.describe()

Out[ ]:
            Machine  SizeOfOptionalHeader  Characteristics  MajorLinkerVersion  MinorLinkerVersion
count  45085.000000          45085.000000     45085.000000        45085.000000         45085.00000
mean   12307.278962            229.624576      6888.325984            8.720062             1.47248
std    16267.140560              7.639284      4093.550179            1.942843             5.31230
min      332.000000            224.000000         2.000000            0.000000             0.00000
25%      332.000000            224.000000      8226.000000            8.000000             0.00000
50%      332.000000            224.000000      8226.000000            9.000000             0.00000
75%    34404.000000            240.000000      8450.000000            9.000000             0.00000
max    34404.000000            240.000000     41358.000000          255.000000           255.00000

In [ ]: dataset.groupby(dataset['legitimate']).size()

Out[ ]: legitimate
0.0 3761
1.0 41323
dtype: int64

In [ ]: X = dataset.drop(['Name','md5','legitimate'],axis=1).values
y = dataset['legitimate'].values

Part 1
In [ ]: import pandas as pd
import numpy as np

In [ ]: malware_csv = pd.read_csv('MalwareData.csv', sep='|')


# the first 41323 rows are the legitimate samples; the remaining rows are malware
legit = malware_csv[0:41323].drop(['legitimate'],axis=1)
malware = malware_csv[41323::].drop(['legitimate'],axis=1)

In [ ]: malware_csv

Out[ ]:
                                               Name                               md5  ...
0                                       memtest.exe  631ea355665f28d4707448e442fbf5b8  ...
1                                           ose.exe  9d10f99a6712e28f8acd5641e3a7ea6b  ...
2                                         setup.exe  4d92f518527353c0db88a70fddcfd390  ...
3                                          DW20.EXE  a41e524f8d45f0074fd07805ff0c9b12  ...
4                                      dwtrig20.exe  c87e561258f2f8650cef999bf643a731  ...
...                                             ...                               ...  ...
138042  VirusShare_8e292b418568d6e7b87f2a32aee7074b  8e292b418568d6e7b87f2a32aee7074b  ...
138043  VirusShare_260d9e2258aed4c8a3bbd703ec895822  260d9e2258aed4c8a3bbd703ec895822  ...
138044  VirusShare_8d088a51b7d225c9f5d11d239791ec3f  8d088a51b7d225c9f5d11d239791ec3f  ...
138045  VirusShare_4286dccf67ca220fe67635388229a9f3  4286dccf67ca220fe67635388229a9f3  ...
138046  VirusShare_d7648eae45f09b3adb75127f43be6d11  d7648eae45f09b3adb75127f43be6d11  ...

138047 rows × 57 columns

In [ ]: malware_csv.head()

Out[ ]:
           Name                               md5  Machine  SizeOfOptionalHeader  Characteristics  ...
0   memtest.exe  631ea355665f28d4707448e442fbf5b8      332                   224               25  ...
1       ose.exe  9d10f99a6712e28f8acd5641e3a7ea6b      332                   224              333  ...
2     setup.exe  4d92f518527353c0db88a70fddcfd390      332                   224              333  ...
3      DW20.EXE  a41e524f8d45f0074fd07805ff0c9b12      332                   224               25  ...
4  dwtrig20.exe  c87e561258f2f8650cef999bf643a731      332                   224               25  ...

In [ ]: malware_csv.tail()

Out[ ]:
                                               Name                               md5  ...
138042  VirusShare_8e292b418568d6e7b87f2a32aee7074b  8e292b418568d6e7b87f2a32aee7074b  ...
138043  VirusShare_260d9e2258aed4c8a3bbd703ec895822  260d9e2258aed4c8a3bbd703ec895822  ...
138044  VirusShare_8d088a51b7d225c9f5d11d239791ec3f  8d088a51b7d225c9f5d11d239791ec3f  ...
138045  VirusShare_4286dccf67ca220fe67635388229a9f3  4286dccf67ca220fe67635388229a9f3  ...
138046  VirusShare_d7648eae45f09b3adb75127f43be6d11  d7648eae45f09b3adb75127f43be6d11  ...

In [ ]: malware_csv.describe()

Out[ ]:
             Machine  SizeOfOptionalHeader  Characteristics  MajorLinkerVersion  MinorLinkerVersion
count  138047.000000         138047.000000    138047.000000       138047.000000         138047.0000
mean     4259.069274            225.845632      4444.145994            8.619774              3.8192
std     10880.347245              5.121399      8186.782524            4.088757             11.8626
min       332.000000            224.000000         2.000000            0.000000              0.0000
25%       332.000000            224.000000       258.000000            8.000000              0.0000
50%       332.000000            224.000000       258.000000            9.000000              0.0000
75%       332.000000            224.000000      8226.000000           10.000000              0.0000
max     34404.000000            352.000000     49551.000000          255.000000            255.0000


In [ ]: malware_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138047 entries, 0 to 138046
Data columns (total 57 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   Name                         138047 non-null  object
 1   md5                          138047 non-null  object
 2   Machine                      138047 non-null  int64
 3   SizeOfOptionalHeader         138047 non-null  int64
 4   Characteristics              138047 non-null  int64
 5   MajorLinkerVersion           138047 non-null  int64
 6   MinorLinkerVersion           138047 non-null  int64
 7   SizeOfCode                   138047 non-null  int64
 8   SizeOfInitializedData        138047 non-null  int64
 9   SizeOfUninitializedData      138047 non-null  int64
 10  AddressOfEntryPoint          138047 non-null  int64
 11  BaseOfCode                   138047 non-null  int64
 12  BaseOfData                   138047 non-null  int64
 13  ImageBase                    138047 non-null  float64
 14  SectionAlignment             138047 non-null  int64
 15  FileAlignment                138047 non-null  int64
 16  MajorOperatingSystemVersion  138047 non-null  int64
 17  MinorOperatingSystemVersion  138047 non-null  int64
 18  MajorImageVersion            138047 non-null  int64
 19  MinorImageVersion            138047 non-null  int64
 20  MajorSubsystemVersion        138047 non-null  int64
 21  MinorSubsystemVersion        138047 non-null  int64
 22  SizeOfImage                  138047 non-null  int64
 23  SizeOfHeaders                138047 non-null  int64
 24  CheckSum                     138047 non-null  int64
 25  Subsystem                    138047 non-null  int64
 26  DllCharacteristics           138047 non-null  int64
 27  SizeOfStackReserve           138047 non-null  int64
 28  SizeOfStackCommit            138047 non-null  int64
 29  SizeOfHeapReserve            138047 non-null  int64
 30  SizeOfHeapCommit             138047 non-null  int64
 31  LoaderFlags                  138047 non-null  int64
 32  NumberOfRvaAndSizes          138047 non-null  int64
 33  SectionsNb                   138047 non-null  int64
 34  SectionsMeanEntropy          138047 non-null  float64
 35  SectionsMinEntropy           138047 non-null  float64
 36  SectionsMaxEntropy           138047 non-null  float64
 37  SectionsMeanRawsize          138047 non-null  float64
 38  SectionsMinRawsize           138047 non-null  int64
 39  SectionMaxRawsize            138047 non-null  int64
 40  SectionsMeanVirtualsize      138047 non-null  float64
 41  SectionsMinVirtualsize       138047 non-null  int64
 42  SectionMaxVirtualsize        138047 non-null  int64
 43  ImportsNbDLL                 138047 non-null  int64
 44  ImportsNb                    138047 non-null  int64
 45  ImportsNbOrdinal             138047 non-null  int64
 46  ExportNb                     138047 non-null  int64
 47  ResourcesNb                  138047 non-null  int64
 48  ResourcesMeanEntropy         138047 non-null  float64
 49  ResourcesMinEntropy          138047 non-null  float64
 50  ResourcesMaxEntropy          138047 non-null  float64
 51  ResourcesMeanSize            138047 non-null  float64
 52  ResourcesMinSize             138047 non-null  int64
 53  ResourcesMaxSize             138047 non-null  int64
 54  LoadConfigurationSize        138047 non-null  int64
 55  VersionInformationSize       138047 non-null  int64
 56  legitimate                   138047 non-null  int64
dtypes: float64(10), int64(45), object(2)
memory usage: 60.0+ MB

In [ ]: import matplotlib.pyplot as plt


import seaborn as sns

In [ ]: malware_csv.plot()

Out[ ]: <matplotlib.axes._subplots.AxesSubplot at 0x7f922700bb50>


Out[ ]: array of 8 × 7 matplotlib AxesSubplot objects (per-feature subplot grid)

In [ ]: print("The no of samples are %s and no of features are %s for legitimate part " % (legit.shape[0], legit.shape[1]))
print("The no of samples are %s and no of features are %s for malware part " % (malware.shape[0], malware.shape[1]))

The no of samples are 41323 and no of features are 56 for legitimate part
The no of samples are 96724 and no of features are 56 for malware part

In [ ]: pd.set_option("display.max_columns", None)
malware

Out[ ]:
                                               Name                               md5  ...
41323   VirusShare_4a400b747afe6547e09ce0b02dae7f1c  4a400b747afe6547e09ce0b02dae7f1c  ...
41324   VirusShare_9bd57c8252948bd2fa651ad372bd4f13  9bd57c8252948bd2fa651ad372bd4f13  ...
41325   VirusShare_d1456165e9358b8f61f93a5f2042f39c  d1456165e9358b8f61f93a5f2042f39c  ...
41326   VirusShare_e4214cc73afbba0f52bb72d5db8f8bb1  e4214cc73afbba0f52bb72d5db8f8bb1  ...
41327   VirusShare_710890c07b3f93b90635f8bff6c34605  710890c07b3f93b90635f8bff6c34605  ...
...                                             ...                               ...  ...
138042  VirusShare_8e292b418568d6e7b87f2a32aee7074b  8e292b418568d6e7b87f2a32aee7074b  ...
138043  VirusShare_260d9e2258aed4c8a3bbd703ec895822  260d9e2258aed4c8a3bbd703ec895822  ...
138044  VirusShare_8d088a51b7d225c9f5d11d239791ec3f  8d088a51b7d225c9f5d11d239791ec3f  ...
138045  VirusShare_4286dccf67ca220fe67635388229a9f3  4286dccf67ca220fe67635388229a9f3  ...
138046  VirusShare_d7648eae45f09b3adb75127f43be6d11  d7648eae45f09b3adb75127f43be6d11  ...

96724 rows × 56 columns

In [ ]: from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

In [ ]: malware_csv

Out[ ]:
        Name                                          md5                               Machine ...
0       memtest.exe                                   631ea355665f28d4707448e442fbf5b8  ...
1       ose.exe                                       9d10f99a6712e28f8acd5641e3a7ea6b  ...
2       setup.exe                                     4d92f518527353c0db88a70fddcfd390  ...
3       DW20.EXE                                      a41e524f8d45f0074fd07805ff0c9b12  ...
4       dwtrig20.exe                                  c87e561258f2f8650cef999bf643a731  ...
...     ...                                           ...                               ...
138042  VirusShare_8e292b418568d6e7b87f2a32aee7074b   8e292b418568d6e7b87f2a32aee7074b  ...
138043  VirusShare_260d9e2258aed4c8a3bbd703ec895822   260d9e2258aed4c8a3bbd703ec895822  ...
138044  VirusShare_8d088a51b7d225c9f5d11d239791ec3f   8d088a51b7d225c9f5d11d239791ec3f  ...
138045  VirusShare_4286dccf67ca220fe67635388229a9f3   4286dccf67ca220fe67635388229a9f3  ...
138046  VirusShare_d7648eae45f09b3adb75127f43be6d11   d7648eae45f09b3adb75127f43be6d11  ...

138047 rows × 57 columns (Machine and the remaining feature columns are truncated in this display)

In [ ]: data_input = malware_csv.drop(['Name','md5','legitimate'], axis=1).values
        labels = malware_csv['legitimate'].values
        extratrees = ExtraTreesClassifier().fit(data_input, labels)
        select = SelectFromModel(extratrees, prefit=True)
        data_input_new = select.transform(data_input)
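
With prefit=True, SelectFromModel keeps only the columns whose ExtraTrees importance is at or above the mean importance, which reduces the 54 numeric PE-header features to the 13 listed in the next cell. A quick shape check (a minimal sketch reusing the arrays from the cell above):

# How many features survive the mean-importance threshold?
print("Before selection:", data_input.shape)      # (138047, 54) — all numeric features
print("After selection :", data_input_new.shape)  # (138047, 13) — the selected subset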

In [ ]: import numpy as np
        features = data_input_new.shape[1]
        importances = extratrees.feature_importances_
        indices = np.argsort(importances)[::-1]
        for i in range(features):
            print("%d" % (i+1), malware_csv.columns[2+indices[i]], importances[indices[i]])

1 DllCharacteristics 0.18192824351590617
2 Characteristics 0.10840711225864559
3 Machine 0.09972369581559354
4 Subsystem 0.06886261002211971
5 VersionInformationSize 0.05465157639605862
6 SectionsMaxEntropy 0.04926051040315489
7 ImageBase 0.04548174292036617
8 MajorSubsystemVersion 0.043129379250107805
9 SizeOfOptionalHeader 0.041849160410714396
10 ResourcesMinEntropy 0.03683297953662699
11 SizeOfStackReserve 0.03062319891509856
12 ResourcesMaxEntropy 0.029344981855075357
13 SectionsMeanEntropy 0.020449232460599844
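
The same ranking can also be plotted for the report; a minimal sketch, assuming the importances, indices and features variables from the cell above and matplotlib available as in the earlier histogram cell:

import matplotlib.pyplot as plt

# Bar chart of the selected features, ordered by ExtraTrees importance
top_names = [malware_csv.columns[2 + indices[i]] for i in range(features)]
top_scores = [importances[indices[i]] for i in range(features)]
plt.figure(figsize=(8, 4))
plt.bar(range(features), top_scores)
plt.xticks(range(features), top_names, rotation=90)
plt.ylabel("ExtraTrees importance")
plt.tight_layout()
plt.show()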

In [ ]: from sklearn.ensemble import RandomForestClassifier

        # Note: despite the names, legit_train/legit_test hold the feature matrices (X)
        # and mal_train/mal_test hold the labels (y)
        legit_train, legit_test, mal_train, mal_test = train_test_split(data_input_new, labels, test_size=0.2)

        classifier = RandomForestClassifier(n_estimators=50)
        classifier.fit(legit_train, mal_train)

Out[ ]: RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                               criterion='gini', max_depth=None, max_features='auto',
                               max_leaf_nodes=None, max_samples=None,
                               min_impurity_decrease=0.0, min_impurity_split=None,
                               min_samples_leaf=1, min_samples_split=2,
                               min_weight_fraction_leaf=0.0, n_estimators=50,
                               n_jobs=None, oob_score=False, random_state=None,
                               verbose=0, warm_start=False)

In [ ]: print("The score of algorithm is " + str(classifier.score(legit_test, mal_test) * 100))

The score of algorithm is 99.39876856211518


Confusion Matrix

In [ ]: from sklearn.metrics import confusion_matrix

        result = classifier.predict(legit_test)
        conf_matrix = confusion_matrix(mal_test, result)

In [ ]: conf_matrix

Out[ ]: array([[19244,    91],
               [   75,  8200]])


In [ ]: print("False Positives:",conf_matrix[0][1]*100/sum(conf_matrix[0]))
print("False Negatives:",conf_matrix[1][0]*100/sum(conf_matrix[1]))

False Positives: 0.4706490819756918


False Negatives: 0.9063444108761329
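
The same confusion matrix can be summarised with the standard per-class metrics; a minimal sketch, assuming the mal_test and result variables from the Random Forest cells above (classification_report is not used elsewhere in the notebook):

from sklearn.metrics import classification_report

# Precision, recall and F1-score for both classes of the 'legitimate' label
print(classification_report(mal_test, result, digits=4))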

Gradient Boosting

In [ ]: from sklearn.ensemble import GradientBoostingClassifier

        grad_boost = GradientBoostingClassifier(n_estimators=50)
        grad_boost.fit(legit_train, mal_train)

Out[ ]: GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                                   learning_rate=0.1, loss='deviance', max_depth=3,
                                   max_features=None, max_leaf_nodes=None,
                                   min_impurity_decrease=0.0, min_impurity_split=None,
                                   min_samples_leaf=1, min_samples_split=2,
                                   min_weight_fraction_leaf=0.0, n_estimators=50,
                                   n_iter_no_change=None, presort='deprecated',
                                   random_state=None, subsample=1.0, tol=0.0001,
                                   validation_fraction=0.1, verbose=0, warm_start=False)

In [ ]: print("Score:", grad_boost.score(legit_test,mal_test)*100) Score:

98.85910901847157
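
For a like-for-like comparison with the Random Forest, the gradient boosting model's confusion matrix and error rates can be computed the same way; a minimal sketch reusing the variables defined above:

# Confusion matrix and error rates for the gradient boosting model
gb_result = grad_boost.predict(legit_test)
gb_conf = confusion_matrix(mal_test, gb_result)
print(gb_conf)
print("False Positives:", gb_conf[0][1] * 100 / sum(gb_conf[0]))
print("False Negatives:", gb_conf[1][0] * 100 / sum(gb_conf[1]))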

Part 2: Comparing Multiple Classifiers
In [ ]: import os
import pandas
import numpy

import sklearn.ensemble as ek
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.externals import joblib   # deprecated; in scikit-learn >= 0.23 use: import joblib
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression
In [ ]: model = { "DecisionTree":tree.DecisionTreeClassifier(max_depth=10),
"RandomForest":ek.RandomForestClassifier(n_estimators=50),
"Adaboost":ek.AdaBoostClassifier(n_estimators=50),
"LinearRegression":LinearRegression()
}

In [ ]: results = {}
        for algo in model:
            clf = model[algo]
            clf.fit(legit_train, mal_train)
            score = clf.score(legit_test, mal_test)
            print("%s : %s " % (algo, score))
            results[algo] = score

DecisionTree : 0.9909815284317276
RandomForest : 0.994313654473017
Adaboost : 0.9844983701557407
LinearRegression : 0.5834840523494268
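
cross_validate and joblib were imported earlier but not used; a minimal sketch of how they could round off the comparison, assuming the model, results, data_input_new and labels variables from the cells above (the output filename is hypothetical):

# 5-fold cross-validation of the best-scoring model, then persist it to disk
best_name = max(results, key=results.get)
best_model = model[best_name]

cv = cross_validate(best_model, data_input_new, labels, cv=5, scoring="accuracy")
print(best_name, "mean CV accuracy:", cv["test_score"].mean())

best_model.fit(data_input_new, labels)              # refit on the full dataset
joblib.dump(best_model, "malware_classifier.pkl")   # hypothetical output filename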

In [ ]: #your_project_completes

In [ ]:

Conclusion:

We have proposed a malware detection module based on data mining and machine learning. Because the approach is computationally heavy, it may not be suitable for home users, but it can be deployed at the enterprise gateway as a central detection engine that supplements the antivirus software already running on end-user machines. Such a system would not only detect known malware easily but also build a knowledge base capable of flagging newer, previously unseen harmful files. Although the model demands substantial infrastructure, it can protect invaluable enterprise data from security threats and prevent severe financial damage.