Malware Detection Using Supervised
Machine Learning
Submitted to
By
Niteesh Kunwar
Semester- V
Abstract
This project applies supervised machine learning techniques to malware
detection, using primary data collected for the project. Malware is analyzed
by performing dynamic malware analysis in a malware analysis lab. The sandbox
is Flare VM running on Windows 10. Both malware and goodware are run on Flare VM
to obtain log files. Data is extracted from these log files using an NLP
technique called bag of words and is then labeled. This is followed by model
training and evaluation. Four algorithms were used to train models: Random
Forest, Decision Tree, Logistic Regression, and SVM. Among them, Random Forest
gave the highest accuracy, 99.99806%.
Background
Cyber security is essential for securing sensitive information, data, and patents in
today's world of increasing digitization. Malware, short for malicious software,
is software designed to disrupt, damage, or gain unauthorized access to a device
or network. Common examples include viruses, worms, Trojans, spyware, adware,
and ransomware. Malware analysis is of two types: static and dynamic. This
project focuses on dynamic malware analysis, where malware files are executed in
a controlled environment for analysis. The log files generated are used to
extract data and train the machine learning model.
Problem Statement
As cyber-attacks increase, traditional programming methods struggle to deal with
them efficiently because of the sheer volume and variety of malware. Machine
learning (ML) can learn patterns from data and can therefore be used to detect
malware in a system with greater accuracy.
Objective
The main objective of this project is to integrate machine learning with cyber
security in order to detect malware in a system based on generated log files.
Motivation and Significance
Any compromise of cyber security has the potential to harm an organization in
both the short and the long term. Cyber security is an important component of a
country's overall security. ML is good at dealing with huge amounts of data and
overcomes the limitations of conventional programming methods when dealing
with cyber-attacks.
This ability to handle large amounts of data makes machine learning well suited
to strengthening cyber security. It is common to deploy an unsupervised machine
learning model to detect anomalies in network traffic and warn cyber security
systems. Many businesses have escaped ransomware attacks. Financial
organizations, such as banks, are increasingly relying on machine learning for
cyber protection.
Malware is a term used to describe malicious software that is designed to disrupt,
damage, or gain unauthorized access to a device or network. Viruses, worms,
Trojan horses, spyware, adware, and ransomware are all examples of prevalent
malware.
Malware analysis can be divided into two categories. The first is static analysis,
in which a malware file is examined without running it. File attributes such as
size, hash, and extension are examined. The second is dynamic analysis, which
entails running the malware file in a sandbox and observing how it behaves.
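As a rough illustration of the static attributes mentioned above, the sketch below reads a file's size, SHA-256 hash, and extension without executing it. The function name `static_file_features` and the temporary stand-in file are hypothetical and only serve the example; they are not part of the project's lab setup.

```python
import hashlib
import os
import tempfile


def static_file_features(path):
    """Collect simple static attributes of a file without running it."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large samples do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            sha256.update(chunk)
    return {
        "size_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),
        "extension": os.path.splitext(path)[1].lower(),
    }


# Harmless temporary file used as a stand-in for a real sample (assumption).
with tempfile.NamedTemporaryFile(suffix=".EXE", delete=False) as tmp:
    tmp.write(b"hello")
    sample_path = tmp.name

features = static_file_features(sample_path)
os.remove(sample_path)
print(features)
```

A hash computed this way can be compared against known-malware hash lists, which is one common use of static attributes.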
Ricardo Calix used log file data for malware detection. He ran roughly fifty
samples of both goodware and malware in an isolated environment and gathered
each program's log files. The files were named good1, good2, and so on for
goodware, and badrabbit1, badrabbit2, and so on for malware. He extracted data
using a bag of words technique and labeled it 1 (goodware) or -1 (malware).
After that, model training took place. This project takes a similar approach;
however, the data is retrieved from two big log files, one from a
malware-infected system and one from a non-infected system, rather than from
several short log files each representing one goodware/malware program.
Methodology
System Requirement
Software Tools for Log File Generation
Malware Analysis Lab:
Kali Linux Host (preferable)
Oracle VirtualBox
Windows 10 VM
Flare VM
ProcMon
Malware samples from VirusBazaar and GitHub
Goodware samples
For Data Extraction, Model Training, and Evaluation
Google Colab
Code: Python
Hardware Specification
8 GB RAM for the system
Modules
Pandas
NumPy
Seaborn
Matplotlib
Scikit-learn (sklearn)
System Design
Fig 1: System for Log File Generation
Algorithm Used
Decision Tree
A decision tree is a flowchart-like tree structure in which each internal node
denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (terminal node) holds a class label. Decision trees often
give good results.
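The node/branch/leaf structure can be seen in a minimal sketch using scikit-learn's `DecisionTreeClassifier`. The two features (word counts for "regsetvalue" and "createfile") and the six rows are hypothetical toy data, and the 1 (goodware) / -1 (malware) labels are borrowed from the convention described for Calix's work; none of this is the project's actual dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: each row is [count of "regsetvalue", count of "createfile"].
X = [[0, 1], [1, 0], [5, 2], [6, 1], [0, 3], [7, 0]]
y = [1, 1, -1, -1, 1, -1]  # 1 = goodware, -1 = malware (assumed convention)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Print the learned tests (internal nodes) and class labels (leaves) as text.
print(export_text(tree, feature_names=["regsetvalue", "createfile"]))

# Classify an unseen row by following the tests from the root to a leaf.
prediction = tree.predict([[4, 1]])
print(prediction)
```

On this toy data a single test on the first feature is enough to separate the two classes, so the printed tree has one internal node and two leaves.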
Random Forest
Random forest is a supervised learning algorithm that builds a forest from an
ensemble of decision trees. It is an easy-to-use machine learning algorithm that
often produces good results even without hyper-parameter tuning. Because it
combines the results of multiple decision trees, it usually gives better
accuracy than a single decision tree.
Fig 2: Random Forest Diagram
Advantages
It reduces overfitting by averaging or combining the results of different
decision trees.
Random forests work better for a large range of data items than a single
decision tree does.
Random forest has less variance than a single decision tree.
Random forests are very flexible and possess very high accuracy.
The random forest algorithm does not require scaling of the data.
Random Forest algorithms maintain good accuracy even if a large proportion
of the data is missing.
Disadvantages
It takes a lot of storage to store the model.
It takes more time and computation power than a single decision tree.
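The idea of combining many trees can be sketched as follows. The data is synthetic (generated with `make_classification`), and the choice of 25 trees is arbitrary; both are assumptions for illustration and do not reproduce the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted features (assumption, not project data).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

# Ask every tree in the forest for its opinion on one test sample.
# scikit-learn averages per-tree class probabilities, which behaves much
# like majority voting when the trees are fully grown.
sample = X_test[:1]
votes = [int(t.predict(sample)[0]) for t in forest.estimators_]
print("Votes for class 1:", sum(votes), "of", len(votes))
print("Forest prediction:", forest.predict(sample)[0])

tree_acc = single_tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print("Single tree accuracy:", tree_acc)
print("Random forest accuracy:", forest_acc)
```

Storing 25 fitted trees instead of one also makes the storage and computation cost listed above concrete.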
Logistic Regression
Logistic regression is a supervised classification algorithm. In a
classification problem, the target variable (or output), y, can take only discrete
values for a given set of features (or inputs), X. Since malware detection here
is a binary classification problem (malware or goodware), logistic regression is
suitable for use.
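For a binary problem, logistic regression passes a weighted sum of the features through the sigmoid function and thresholds the resulting probability. The sketch below shows only that prediction step; the weights, bias, feature values, and the 0.5 threshold are hypothetical, since in practice the weights are learned from training data.

```python
import math


def predict_proba(features, weights, bias):
    """Return the sigmoid of the weighted feature sum (probability of class 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two word-count features (not learned from real data).
weights = [0.8, -0.5]
bias = -1.0

p = predict_proba([3, 1], weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a discrete class
print(round(p, 4), label)
```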
SVM
A support vector machine (SVM) finds a line or hyperplane that separates the
data into classes. It tends to perform poorly when the classes overlap or when
there is too much noise in the data. It was considered suitable for this project
because the data has a large number of features.
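A separating hyperplane can be illustrated with scikit-learn's `SVC` and a linear kernel. The six two-dimensional points and the linear kernel are assumptions chosen so the classes are cleanly separable; the kernel used in the project is not stated.

```python
from sklearn.svm import SVC

# Hypothetical, cleanly separable toy points (not project data).
X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [1, 1, 1, -1, -1, -1]  # 1 = goodware, -1 = malware (assumed convention)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

# The hyperplane is w . x + b = 0, with w in coef_ and b in intercept_.
print("w =", svm.coef_[0], "b =", svm.intercept_[0])

predictions = svm.predict([[0.5, 0.5], [4.5, 4.5]])
print(predictions)
```

Points on either side of the learned hyperplane receive different class labels, which is how new samples are classified.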
Workflow
Data Collection
Log file generation
Set up a malware analysis lab and take a snapshot.
Collect malware and goodware samples.
Run malware samples and save the log files as malware1.csv.
Restore the lab to the previous snapshot.
Run goodware samples and save the log files as good1.csv.
Data Extraction from log files
The Natural Language Processing technique bag of words is applied to extract
data from the log files, with the help of CountVectorizer from sklearn. The
dataset is also labeled in the process.
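A minimal sketch of this extraction step is shown below. The log lines are invented examples rather than real ProcMon output, and treating each log line as one sample labeled by its source file (-1 for the malware log, 1 for the goodware log) is an assumption about how the labeling was done.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented log lines; real ProcMon CSV exports contain more columns.
malware_lines = [
    "evil.exe RegSetValue HKLM Run SUCCESS",
    "evil.exe WriteFile C Users doc.locked SUCCESS",
]
goodware_lines = [
    "notepad.exe ReadFile C Users notes.txt SUCCESS",
    "notepad.exe CloseFile C Users notes.txt SUCCESS",
]

# Label each line by the log file it came from (assumed scheme).
corpus = malware_lines + goodware_lines
labels = [-1] * len(malware_lines) + [1] * len(goodware_lines)

# Bag of words: one column per distinct word, one row per log line,
# each cell holding how often that word occurs in that line.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
print(labels)
```

With default settings, CountVectorizer lowercases the text and ignores single-character tokens, so the drive letter "C" does not become a feature.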
Model Training
Four models are trained and the algorithms used are Decision Tree, Random
Forest, Logistic Regression, and SVM.
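The training and evaluation loop for the four algorithms might look like the sketch below. It runs on synthetic data from `make_classification`, and the 80/20 split, the hyper-parameters, and the linear SVM kernel are all assumptions; the accuracies it prints are unrelated to the figures reported for this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labeled bag-of-words features (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Evaluating on a held-out split rather than on the training data gives a fairer estimate of how each model would perform on unseen log data.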
Conclusion
The aim of this project is to use Machine Learning to detect malware in the
system. This helps us identify infected systems and improve cyber security at the
individual or organizational level.