Project JAISON
A Project report submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by
JAISON.V.R
20BAM027
MR. G.MURUGESAN
Approved by AICTE for MBA/MCA and by UGC for 2(f) & 12(B) status
Pollachi-642 107
CERTIFICATE
Counter Signed by
PC PRINCIPAL
Place: Pollachi
I take this opportunity to express my gratitude and sincere thanks to everyone who
helped me in my project.
It's my prime duty to solemnly express my deep sense of gratitude and sincere thanks
to the guide Mr. G.MURUGESAN, Assistant Professor and Head, UG
Department of Artificial Intelligence and Machine Learning, for his valuable
advice and excellent guidance to complete the project successfully.
I also convey my heartfelt thanks to my parents, friends and all the staff members of
the Department of ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING for
their valuable support which energized me to complete this project.
PROJECT CONTENT
1. Introduction
1.1. Evolution of Malware
1.2. Malware Detection
1.3. Need for Machine Learning in Malware Detection
2. System Study
2.1. Existing System
2.1.1. Drawbacks of Existing System
2.2. Proposed System
2.2.1. Advantages of Proposed System
3. Algorithms Applied
3.1. Decision Tree
3.2. SVM
3.3. XGBoost
3.4. Random Forest
4. Testing
4.1. Testing Methodologies
5. Conclusion and Future Enhancement
6. Source Code
7. Bibliography
8. References
INTRODUCTION:
Idealistic hackers attacked computers in the early days because they were eager
to prove themselves. Cracking machines, however, is an industry in today's
world. Despite recent improvements in software and computer hardware
security, attacks on computer systems have increased in both frequency and
sophistication. Regrettably, there are major drawbacks to current methods for
detecting and analyzing unknown code samples. The Internet is a critical part of
our everyday lives today. The internet offers many services, and their number
is rising daily. Numerous reports indicate that malware's effect is worsening at
an alarming pace. Although malware diversity is growing, anti-virus scanners
are unable to fulfill security needs, resulting in attacks on
millions of hosts. According to Kaspersky Labs, around 6,563,145 different
hosts were targeted and 4,000,000 unique malware artifacts were found in
2015. Juniper Research (2016), in particular, projected that by 2019 the cost of
data breaches would rise to $2.1 trillion globally. Current studies show that
more and more attacks are either generated by script kiddies or automated. To date,
attacks on commercial and government organizations, such as ransomware and
malware, continue to pose a significant threat and challenge. Such attacks can
come in various ways and sizes. An enormous challenge is the ability of the
global security community to develop and provide expertise in cybersecurity.
There is widespread awareness of the global scarcity of cybersecurity and talent.
Cybercrimes, such as financial fraud, child exploitation online and payment
fraud, are so common that they demand international 24-hour response and
collaboration between multinational law enforcement agencies. For single users
and organizations, malware defense of computer systems is therefore one of the
most critical cybersecurity activities, as even a single attack may result in
compromised data and substantial losses.
Malware attacks have been one of the most serious cyber risks faced by different
countries. The number of reported vulnerabilities and malware samples is also
increasing rapidly, and the study of malware behavior has received tremendous
attention from researchers. Several factors lead to the development of malware
attacks. Malware authors create and deploy malware that can mutate and take
different forms, such as ransomware and fileless malware, in order to evade
detection. It is difficult to detect such malware and cyber attacks using
traditional cybersecurity procedures.
Solutions for the new generation cyber attacks rely on various Machine learning
techniques.
EVOLUTION OF MALWARE
In order to protect networks and computer systems from attacks, the diversity,
sophistication and availability of malicious software present enormous
challenges. Malware is continually changing and challenges security researchers
and scientists to strengthen their cyber defenses to keep pace. Owing to the use
of polymorphic and metamorphic methods used to avoid detection and conceal
its true intent, the prevalence of malware has increased. To mutate the code
while keeping the original functionality intact, polymorphic malware uses a
polymorphic engine. The two most common ways to conceal code are packing
and encryption. Through one or more layers of compression, packers cover a
program's real code. Then the unpacking routines restore the original code and
execute it in memory at runtime. To make it harder for researchers to analyze the
software, crypters encrypt and manipulate malware or part of its code. A crypter
includes a stub that is used for malicious code encryption and decryption.
Whenever it's propagated, metamorphic malware rewrites the code to an
equivalent. Multiple transformation techniques, including but not limited to,
register renaming, code permutation, code expansion, code shrinking and
insertion of garbage code, can be used by malware authors. The combination of
these techniques has resulted in ever-increasing quantities of malware, making
forensic investigations of malware cases time-consuming, expensive and more
complicated. There are some issues with conventional
antivirus solutions that rely on signature-based and heuristic/behavioral
methods. A signature is a unique feature or collection of features that, like a
fingerprint, uniquely differentiates an executable. Signature-based approaches
are unable to identify unknown types of malware, however. Security researchers
suggested behavior-based detection to overcome these problems, which analyzes
the features and behavior of the file to decide whether it is indeed malware,
although it may take some time to search and evaluate. Researchers have begun
implementing machine learning to supplement their solutions in order to solve
the previous drawbacks of conventional antivirus engines and keep pace with
new attacks and variants, as machine learning is well suited for processing large
quantities of data.
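A practical consequence of packing and encryption is that the transformed code has unusually high byte entropy, a property that machine-learning feature sets for malware often measure. Below is a minimal sketch of such a Shannon-entropy check; the function name and the sample inputs are illustrative, not taken from this report.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Repetitive plain data has low entropy; uniformly distributed bytes
# (typical of compressed or encrypted payloads) approach the 8-bit maximum.
low = byte_entropy(b"AAAAABBBBB" * 100)
high = byte_entropy(bytes(range(256)) * 10)
```

A scanner can flag executable sections whose entropy is close to 8 bits per byte as likely packed or encrypted, which is why features such as section entropy appear in malware data sets.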
1. MALWARE DETECTION
Hackers present malware in ways designed to persuade people to
install it. Because it appears legitimate, users do not know what the
program really is; they usually install it thinking it is secure when, on
the contrary, it is a major threat. That is how malware gets into a
system. Once there, it disperses and hides in numerous files, making it
very difficult to identify, and it may connect directly to the operating
system to access, record and even encrypt personal or useful
information. Malware detection is defined as the process of searching
for malware files and directories. Several tools and methods are
available that make malware detection efficient and reliable. Some of
the general strategies for malware detection are:
○ Signature-based
○ Heuristic Analysis
○ Anti-malware Software
○ Sandbox
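As a concrete illustration of the signature-based strategy above, a scanner can compare a file's hash against a database of known-malware fingerprints. The sketch below uses a stand-in database; the one digest shown is the widely published MD5 of the EICAR antivirus test file, included purely for illustration.

```python
import hashlib

# Hypothetical signature database: MD5 digests of known malware samples.
KNOWN_MALWARE_MD5 = {
    "44d88612fea8a8f36de82e1278abb02f",  # EICAR test file (illustrative entry)
}

def md5_of_bytes(content: bytes) -> str:
    """Fingerprint of the file contents, like the md5 column in the data set."""
    return hashlib.md5(content).hexdigest()

def is_known_malware(content: bytes) -> bool:
    """Signature match: flag the file only if its digest is in the database."""
    return md5_of_bytes(content) in KNOWN_MALWARE_MD5
```

This also demonstrates the main drawback noted in the text: any sample not already in the database, including a trivially mutated copy of a known one, passes the check, which motivates heuristic and machine-learning approaches.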
Several classifiers have been implemented,
such as linear classifiers (logistic regression, the naive
Bayes classifier), support vector machines, neural
networks, random forests, etc. Malware can be identified
through both static and dynamic analysis.
Brief:
In 2008, Symantec published a report stating that "the release rate of malicious
code and other unwanted programs may be exceeding that of legitimate
software applications". According to F-Secure, "As much malware was
produced in 2007 as in the previous 20 years altogether".
Since the rise of widespread Internet access, malicious software has been
designed for a profit, for example forced advertising. For instance, since 2003,
the majority of widespread viruses and worms have been designed to take
control of users' computers for black-market exploitation. Another category of
malware, spyware, comprises programs designed to monitor users' web
browsing and steal private information. Spyware programs do not spread like
viruses; instead they are installed by exploiting security holes or are packaged
with user-installed software, such as peer-to-peer applications.
Clearly, there is a very urgent need to find, not just a suitable method to detect
infected files, but to build a smart engine that can detect new viruses by
studying the structure of system calls made by malware.
Although standard antivirus can effectively contain virus outbreaks, for large
enterprises, any breach could be potentially fatal. Virus makers employ
"oligomorphic", "polymorphic" and "metamorphic" viruses, which encrypt
parts of themselves or modify themselves as a method of disguise, so as not to
match virus signatures in the dictionary.
1. DECISION TREE
2. RANDOM FOREST
3. SVM
4. XGBOOST
DECISION TREE:
The decision tree algorithm belongs to the family of supervised
learning algorithms. It can be used to predict the value of a target
variable, for which the decision tree uses a tree representation: each
internal node tests an attribute, each branch corresponds to an
attribute value, and each leaf node holds a class label.
Let us take an attribute such as Age; as we can see, Age has three
possible values, and by splitting the records on attributes like this we
can easily predict which drug to give to a patient based on his or her
reports. The attribute that best separates the classes is placed at the
root, and the procedure is repeated on each branch to grow the model.
Purpose of Entropy:
Entropy measures the impurity of a subset, i.e. how mixed its yes and
no labels are. The entropy formula is
Entropy(S) = -p(yes) log2 p(yes) - p(no) log2 p(no)
where p(yes) and p(no) are the proportions of each class in the subset.
What is a pure subset?
A pure subset is a situation where we will get either all yes or all no
labels. Such a subset becomes a leaf node; otherwise we also have to
take the entropy of the child subsets into account when computing the
information.
For each node of the tree, the information value measures how
much information a feature gives us about the class. The split
with the highest information gain will be taken as the first split,
and the process will continue until all children nodes are pure.
The algorithm calculates the information gain for each candidate
split, and the split with the highest gain is selected.
Like this, the algorithm evaluates n candidate splits, and whichever
split has the higher information gain is chosen. The higher the value
of the information gain of a split, the better that split separates the
classes.
Gini Impurity:
Gini Impurity is a measurement used to build Decision Trees to
form the tree. More precisely, the Gini Impurity of a data set is the
probability that a randomly chosen element would be labelled
incorrectly if it were labelled at random according to the class
distribution of the set. It is slightly cheaper to compute than
entropy because it avoids the logarithm.
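The entropy and information-gain calculation described above can be sketched in a few lines of Python; the toy yes/no labels are illustrative, not from the report's data set.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: 0 for a pure subset, 1 for a 50/50 split."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Toy example: splitting 4 "yes" / 4 "no" records two different ways.
parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]         # both children pure: best case
bad_split = [["yes", "yes", "no", "no"]] * 2   # children as mixed as the parent
```

Here `information_gain(parent, pure_split)` is 1.0 (a perfect split) while `information_gain(parent, bad_split)` is 0.0, which is exactly the criterion the tree uses to pick its root.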
2. SVM Algorithm
Let’s understand how to identify the right hyper-plane. You need to
remember a thumb rule to identify the right hyper-plane:
“Select the hyper-plane which segregates the two classes
better”. In the first scenario, hyper-plane “B” has excellently
performed this job.
● Identify the right hyper-plane (Scenario 2): Here, we have
three hyper-planes (A, B, and C) and all are segregating
the classes well. Now, how can we identify the right
hyper-plane? The answer is to choose the one with the
maximum margin, i.e. the largest distance to the nearest
data point of either class.
● As I have already mentioned, one star at the other end is like
an outlier for the star class. The SVM algorithm has a feature
to ignore outliers and find the hyper-plane that has the
maximum margin. Hence, we can say, SVM classification is
robust to outliers.
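The maximum-margin behaviour described above can be reproduced with scikit-learn's SVC. The toy points below are illustrative: two well-separated classes, plus one outlier "star" placed among the circles, which the soft margin (a moderate C value) allows the classifier to ignore.

```python
from sklearn.svm import SVC

# Two well-separated classes plus one outlier "star" inside the circle region.
X = [[1, 1], [1, 2], [2, 1],     # class 0 (circles)
     [6, 6], [6, 7], [7, 6],     # class 1 (stars)
     [1.5, 1.5]]                 # outlier star among the circles
y = [0, 0, 0, 1, 1, 1, 1]

# A soft margin (C=1.0) lets SVM treat the outlier as a margin violation
# and keep the maximum-margin hyper-plane between the two main groups.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.predict([[0, 0], [8, 8]]))
```

Points deep inside each cluster are assigned to that cluster's class despite the outlier, illustrating the robustness claim; raising C far enough would instead force the boundary to chase the outlier.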
3. XGBOOST
Ever since its introduction in 2014, XGBoost has been lauded as
the holy grail of machine learning hackathons and competitions, and
it remains a widely used technique. This section first motivates
ensemble learning and then dives into the inner workings of this
popular algorithm.
Table of Contents
● The Power of XGBoost
● Why Ensemble Learning?
○ Bagging
○ Boosting
● Demonstrating the Potential of Boosting
● Using gradient descent for optimizing the loss function
● Unique Features of XGBoost
The Power of XGBoost
The beauty of this powerful algorithm lies in its scalability, which
enables fast learning through parallel and distributed computing, and
in its efficient memory usage.
Bagging
While decision trees are one of the most easily interpretable models,
they suffer from high variance. In bagging, the training data is divided
into random samples; for example, split the data into two parts and
let’s use each part to train a decision tree in order to obtain two
models, whose predictions are then aggregated.
Boosting
In boosting, the trees are built sequentially such that each subsequent
tree aims to reduce the errors of the previous tree. Each tree learns
from its predecessors and updates the residual errors. Hence, the tree
that grows next in the sequence will learn from an updated version of
the residuals. The base learners are weak learners in which the bias is
high, and the predictive power is just a tad better than random
guessing; the final strong learner brings down both the bias and the
variance of these weak learners. In contrast to bagging, which uses
fully grown trees, boosting uses trees with fewer splits. Such small
trees, which are not very deep, are highly interpretable and work well
in boosting.
Gradient boosting fits the model in the following steps:
● An initial model F0 is defined to predict the target variable
y. This model will be associated with a residual (y – F0).
● A new model h1 is fit to the residuals from the previous step.
● Now, F0 and h1 are combined to give F1, the boosted
version of F0: F1(x) = F0(x) + h1(x). The mean squared error
from F1 will be lower than that from F0.
This can be done for ‘m’ iterations, until the residuals have been
minimized as much as possible: Fm(x) = Fm-1(x) + hm(x). Here F0(x)
gives the predictions from the first stage of our model, and each
subsequent model hm fits the residual pattern left by the previous
stage. Replacing the raw residuals with the negative gradient of the
loss function would make this process generic and applicable across
all loss functions, with the additive model updated in the direction of
the negative gradient at each iteration.
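The F0, h1, F1 steps above can be sketched numerically by fitting shallow regression trees to the residuals. The data below is synthetic (a noisy quadratic), chosen only to show the training error dropping at every boosting stage.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)  # noisy quadratic target

# F0: the initial model simply predicts the mean of y.
F = np.full_like(y, y.mean())
mse = [np.mean((y - F) ** 2)]

# m boosting iterations: fit a small tree h_m to the residuals, then update F.
for m in range(5):
    residuals = y - F
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + h.predict(X)
    mse.append(np.mean((y - F) ** 2))
```

Each entry of `mse` is no larger than the one before it, which is the claim made above for F1 versus F0: every stage removes part of the remaining residual error on the training data.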
4. RANDOM FOREST
Random forest is a Supervised Machine Learning Algorithm that is
widely used in classification and regression problems. It builds
decision trees on different samples and takes their majority vote for
classification and their average in the case of regression. This
contrasts with boosting, which works by creating sequential models
such that the final model has the highest accuracy; random forest
instead relies on bagging, described below in detail.
Bagging
Bagging, also known as Bootstrap Aggregation, is the ensemble
technique used by random forest. Each model is trained on a random
sample drawn with replacement from the data set. Hence each model
is generated from a different subset of the data, and the final
prediction combines the results of all models. This step which
involves combining all the results is known as aggregation.
Now let’s look at an example by breaking it down with the help of
the figure below. Suppose n number of samples are taken from the
original data set with replacement. A model (Model 01, Model 02,
and Model 03) is obtained from each bootstrap sample, and the final
output is produced by majority voting. In the below figure you can
see that the samples, and therefore the trained models, are all
different, because each tree is trained with different data and
attributes. This means that we can make full use of the data set, and
there is no strict need to segregate the data for train and test, as there
will always be samples left out of each bootstrap that the
corresponding tree has never seen.
Important Hyperparameters
Hyperparameters are used in random forests either to enhance the
performance and predictive power of the model or to make the model
faster. For example, n_jobs tells the engine how many processors it is
allowed to use. If the value is 1, it can use only one processor, but if
the value is -1 there is no limit.
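The bagging scheme and the processor-limit hyperparameter described above can be seen with scikit-learn's RandomForestClassifier. The data set here is synthetic, standing in for the malware feature table; `n_jobs=-1` is the "no limit" setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data standing in for the malware feature table.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each of the 50 trees is trained on a bootstrap sample of the rows;
# n_jobs=-1 removes the limit on how many processors the engine may use.
forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, oob_score=True,
                                random_state=42).fit(X, y)

# Because every tree leaves some rows out of its bootstrap sample, the
# out-of-bag (OOB) score estimates accuracy without a separate test split.
print("OOB accuracy:", forest.oob_score_)
```

The `oob_score_` attribute makes concrete the point above about not strictly needing a train/test split: each sample is scored only by the trees that never saw it.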
SOURCE CODE
In [ ]: import os
import pandas
import numpy
import sklearn.ensemble as ek
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
import joblib  # sklearn.externals.joblib is removed in newer scikit-learn
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression
In [ ]: dataset = pandas.read_csv('Malware_Detection_data.csv', sep='|',
low_memory=False)
In [ ]: dataset.head()
Out[ ]:
[DataFrame preview truncated in the original: first five rows, columns include
Name, md5, Machine, SizeOfOptionalHeader, Characteristics, …]
In [ ]: dataset.tail()
Out[ ]:
[DataFrame preview truncated in the original: columns include Name, md5, Machine, …]
In [ ]: dataset.describe()
Out[ ]:
          Machine  SizeOfOptionalHeader  Characteristics  MajorLinkerVersion  MinorLinkerVersion
std  16267.140560              7.639284      4093.550179            1.942843             5.31230
min    332.000000            224.000000         2.000000            0.000000             0.00000
25%    332.000000            224.000000      8226.000000            8.000000             0.00000
50%    332.000000            224.000000      8226.000000            9.000000             0.00000
75%  34404.000000            240.000000      8450.000000            9.000000             0.00000
max  [values truncated in the original]
In [ ]: dataset.groupby(dataset['legitimate']).size()
Out[ ]: legitimate
0.0 3761
1.0 41323
dtype: int64
In [ ]: X = dataset.drop(['Name','md5','legitimate'],axis=1).values
y = dataset['legitimate'].values
Part 1
In [ ]: import pandas as pd
import numpy as np
In [ ]: malware_csv
Out[ ]:
                                          Name                               md5
0                                  memtest.exe  631ea355665f28d4707448e442fbf5b8
1                                      ose.exe  9d10f99a6712e28f8acd5641e3a7ea6b
2                                    setup.exe  4d92f518527353c0db88a70fddcfd390
3                                     DW20.EXE  a41e524f8d45f0074fd07805ff0c9b12
4                                 dwtrig20.exe  c87e561258f2f8650cef999bf643a731
...
   VirusShare_d7648eae45f09b3adb75127f43be6d11  d7648eae45f09b3adb75127f43be6d11
[remaining columns truncated in the original]
In [ ]: malware_csv.head()
Out[ ]:
[DataFrame preview truncated in the original: first five rows, columns include
Name, md5, Machine, SizeOfOptionalHeader, Characteristics, …]
In [ ]: malware_csv.tail()
Out[ ]:
[DataFrame preview truncated in the original: last rows, including
VirusShare_d7648eae45f09b3adb75127f43be6d11  d7648eae45f09b3adb75127f43be6d11]
In [ ]: malware_csv.describe()
Out[ ]:
          Machine  SizeOfOptionalHeader  Characteristics  MajorLinkerVersion  MinorLinkerVersion
std  10880.347245              5.121399      8186.782524            4.088757             11.8626
min    332.000000            224.000000         2.000000            0.000000              0.0000
25%    332.000000            224.000000       258.000000            8.000000              0.0000
50%    332.000000            224.000000       258.000000            9.000000              0.0000
75%    332.000000            224.000000      8226.000000           10.000000              0.0000
max  34404.000000            352.000000   [remaining values truncated in the original]
In [ ]: malware_csv.plot()
The number of samples is 41323 with 56 features for the legitimate part, and
96724 samples with 56 features for the malware part.
In [ ]: pd.set_option("display.max_columns",None)
malware
Out[ ]:
                                          Name                               md5
...
   VirusShare_710890c07b3f93b90635f8bff6c34605  710890c07b3f93b90635f8bff6c34605
   VirusShare_d7648eae45f09b3adb75127f43be6d11  d7648eae45f09b3adb75127f43be6d11
[remaining columns truncated in the original]
In [ ]: from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
In [ ]: malware_csv
Out[ ]:
                                          Name                               md5
...
                                  dwtrig20.exe  c87e561258f2f8650cef999bf643a731
   VirusShare_d7648eae45f09b3adb75127f43be6d11  d7648eae45f09b3adb75127f43be6d11
[remaining columns truncated in the original]
In [ ]: data_input = malware_csv.drop(['Name','md5','legitimate'], axis=1).values
labels = malware_csv['legitimate'].values
extratrees = ExtraTreesClassifier().fit(data_input, labels)
select = SelectFromModel(extratrees, prefit = True)
data_input_new = select.transform(data_input)
In [ ]: import numpy as np
features = data_input_new.shape[1]
importances = extratrees.feature_importances_
indices = np.argsort(importances)[::-1]
for i in range(features):
    print("%d" % (i+1), malware_csv.columns[2+indices[i]], importances[indices[i]])
1 DllCharacteristics 0.18192824351590617
2 Characteristics 0.10840711225864559
3 Machine 0.09972369581559354
4 Subsystem 0.06886261002211971
5 VersionInformationSize 0.05465157639605862
6 SectionsMaxEntropy 0.04926051040315489
7 ImageBase 0.04548174292036617
8 MajorSubsystemVersion 0.043129379250107805
9 SizeOfOptionalHeader 0.041849160410714396
10 ResourcesMinEntropy 0.03683297953662699
11 SizeOfStackReserve 0.03062319891509856
12 ResourcesMaxEntropy 0.029344981855075357
13 SectionsMeanEntropy 0.020449232460599844
In [ ]: classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(legit_train, mal_train)
In [ ]: conf_matrix
Gradient Boost
In [ ]: print("False Positives:",conf_matrix[0][1]*100/sum(conf_matrix[0]))
print("False Negatives:",conf_matrix[1][0]*100/sum(conf_matrix[1]))
98.85910901847157
part2
In [ ]: import os
import pandas
import numpy
import sklearn.ensemble as ek
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
import joblib  # sklearn.externals.joblib is removed in newer scikit-learn
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression
In [ ]: model = { "DecisionTree":tree.DecisionTreeClassifier(max_depth=10),
"RandomForest":ek.RandomForestClassifier(n_estimators=50),
"Adaboost":ek.AdaBoostClassifier(n_estimators=50),
"LinearRegression":LinearRegression()
}
In [ ]: results = {}
for algo in model:
    clf = model[algo]
    clf.fit(legit_train, mal_train)
    score = clf.score(legit_test, mal_test)
    print("%s : %s " % (algo, score))
    results[algo] = score
DecisionTree : 0.9909815284317276
RandomForest : 0.994313654473017
Adaboost : 0.9844983701557407
LinearRegression : 0.5834840523494268
In [ ]: #your_project_completes
Conclusion:
[1] http://www.us-cert.gov/control_systems/pdf/undirected_attack0905.pdf
[15] Lo, R., Levitt, K., Olsson, R., 1995. MCF: a malicious code
filter. Computers & Security 14, pp. 541–566.
[16] Schultz, M., Eskin, E., Zadok, E., 2001. Data mining
methods for detection of new malicious executables. In:
Proceedings of the IEEE Symposium on Security and Privacy, pp. 38–49.
[21] Sung, A., Xu, J., Chavez, P., Mukkamala, S., 2004. Static
analyzer of vicious executables (SAVE). In: Proceedings of the
20th Annual Computer Security Applications Conference. IEEE
Computer Society Press, ISBN 0-7695-2252-1, pp. 326–334.