
APT classification using GP-Adaboost and

deep NN

By

Yahya

Thesis submitted to the Department of Computer and Information


Sciences (DCIS) in partial fulfillment of requirements for the Degree of
BS Computer and Information Sciences (Session 2016-2020)

Department of Computer and Information Sciences


Pakistan Institute of Engineering & Applied Sciences,
Nilore, Islamabad, Pakistan
October 2020

In the name of Almighty ALLAH,


Most Gracious, Most Merciful

Department of Computer and Information Sciences,


Pakistan Institute of Engineering and Applied Sciences (PIEAS)
Nilore, Islamabad 45650, Pakistan

Declaration of Originality
I hereby declare that the work contained in this thesis and the intellectual content of
this thesis are the product of my own work. This thesis has not been previously
published in any form, nor does it contain any verbatim copy of published resources
which could be treated as infringement of international copyright law.

I also declare that I understand the terms 'copyright' and 'plagiarism,' and that in
case of any copyright violation or plagiarism found in this work, I will be held fully
responsible for the consequences of any such violation.

Signature: _______________________________

Name: Yahya

Certificate of Approval
This is to certify that the work contained in this thesis entitled

“APT Classification using GP-Adaboost and Deep


Neural Networks”
was carried out by

Yahya
under my supervision and that in my opinion, it is fully adequate, in
scope and quality, for the degree of BS Computer and information
Sciences from Pakistan Institute of Engineering and Applied Sciences
(PIEAS)

Approved By:

Signature: ________________________

Supervisor: Dr. Asifullah Khan

Verified By:

Signature: ________________________

Head, DCIS

Stamp:

Dedication

Dedicated to my parents for their


endless support, love, and affection…

Acknowledgement
Gratitude and endless thanks to Allah Almighty, the Lord of the Worlds, who
bestowed upon mankind the light of knowledge through laurels of perception,
learning and reasoning, in the way of searching, inquiring and finding the ultimate
truth. To whom we serve, and to whom we pray for help.

I feel privileged and honored to express my sincerest gratitude to my supervisor


Dr. Asifullah Khan and my co-supervisor Earum Mushtaq for their kind help,
guidance, suggestions, and support throughout the course of this project. I would also
like to express my most sincere gratitude and thanks to my beloved parents and
friends.

I would also like to thank the Pakistan Institute of Engineering and Applied Sciences
and the Pattern Recognition Lab, DCIS, for providing a very conducive educational
environment and adequate resources to carry out the research contained herein.

Yahya
Dept. of Computer & Information Sciences (DCIS)
PIEAS, Nilore, Islamabad

Table of Contents
CERTIFICATE OF APPROVAL............................................................................IV

DEDICATION.............................................................................................................V

ACKNOWLEDGEMENT........................................................................................VI

TABLE OF CONTENTS.........................................................................................VII

LIST OF FIGURES...................................................................................................IX

LIST OF TABLES......................................................................................................X

TABLE OF CODE LISTINGS.................................................................................XI

NOMENCLATURE.................................................................................................XII

ABSTRACT............................................................................................................XIII

CHAPTER 1: INTRODUCTION............................................................................1

1.1. Purpose of Project.........................................................................................................1

1.2. The Problem of APT classification................................................................................1


1.2.1. Defining Advanced Persistent Threat (APT).........................................................................2
1.2.2. The Need for efficient APT Classification Solutions..............................................................2

1.3. The Machine Learning Solution for APT Classification................................................3

1.4. Scope of Research.........................................................................................................4

1.5. Project Objectives..........................................................................................................4

1.6. Thesis Organization.......................................................................................................5

CHAPTER 2: THEORETICAL BACKGROUND................................................6

2.1. Machine Learning Concepts..........................................................................................6


2.1.1. Artificial Neural Networks.....................................................................................................7
2.1.2. Convolutional Neural Networks.............................................................................................8
2.1.3. Residual Neural Networks.....................................................................................................9
2.1.4. Transfer Learning................................................................................................................10
2.1.5. Meta-classification...............................................................................................................12
2.1.6. Genetic Programming..........................................................................................................13
2.1.7. Adaptive Boosting................................................................................................................13

2.2. Literature Review........................................................................................................14

CHAPTER 3: PROPOSED METHODOLOGY..................................................15

3.1. Distribution of the Dataset...........................................................................................16

3.2. Data pre-processing.....................................................................................................17


3.2.1. Conversion of Vector Data to Image Data..........................................................................18

3.3. Transfer Learning and Decision Space........................................................................19


3.3.1. Inception-ResNet-v2.............................................................................................................20
3.3.2. ResNet34..............................................................................................................................21
3.3.3. AlexNet.................................................................................................................................22
3.3.4. Custom DNN Model.............................................................................................................23

3.4. GP-AdaBoost Ensemble..............................................................................................24

CHAPTER 4: EXPERIMENTATION DETAILS...............................................26

4.1. Dataset.........................................................................................................................26

4.2. Development Environment..........................................................................................27

4.3. Experimentation Details..............................................................................................28


4.3.1. Transfer Learning using DNN Models................................................................................28
4.3.2. GP-AdaBoost Parameters....................................................................................................32

CHAPTER 5: PERFORMANCE ANALYSIS.....................................................35

5.1. Performance Metrics...................................................................................................35


5.1.1. Prediction Accuracy.............................................................................................................36

5.2. Results & Discussion...................................................................................................36


5.2.1. Transfer Learning & DNN Results......................................................................................36
5.2.2. Overall Results of the Proposed Method.............................................................................40

CHAPTER 6: CONCLUSION & FUTURE WORK...........................................42

6.1. Summary of the Research............................................................................................42

6.2. Future Work................................................................................................................43

REFERENCES...........................................................................................................44

List of Figures
Figure 1: Financial damage done by cyber attacks in 2019.......................................................2
Figure 2: An overview of a complete APT classification model...............................................4
Figure 3: The working of a single neuron (perceptron) within a neural network......................7
Figure 4: A convolutional neural network with different labelled layers..................................9
Figure 5: A single ResNet block with the identity mapping....................................................10
Figure 6: A comparison of the learning processes of (a) Traditional machine learning tasks (b)
Transfer learning between different tasks.......................................................................11
Figure 7: An overview of the functionality of a stacked (meta) classifier...............................12
Figure 8: An overview of the Genetic Programming approach...............................................13
Figure 9: An overview of the proposed methodology.............................................................15
Figure 10: Procedure followed by the proposed method w.r.t the input data splits.................17
Figure 11: An overview of model predictions in a CSV file.......................................17
Figure 12: Visual illustration of file-to-image conversion......................................18
Figure 13: Sections obtained from PE file and information obtained from images..................19
Figure 14: A skip connection in ResNet network..................................................22
Figure 15: An overview of the Weka interface (1 of 4)...........................................33
Figure 16: An overview of the Weka interface (2 of 4)...........................................34
Figure 17: An overview of the Weka interface (3 of 4)...........................................34
Figure 18: An overview of the Weka interface (4 of 4)...........................................35
Figure 19: Confusion matrix showing the test results of Inception-ResNet-v2 model on
APTMalware dataset......................................................................................................38
Figure 20: Confusion matrix showing the test results of custom DNN model on APTMalware
dataset............................................................................................................................38
Figure 21: Confusion matrix showing the test results of AlexNet model on APTMalware
dataset............................................................................................................................39
Figure 22: Confusion matrix showing the test results of AlexNet model on APTMalware
dataset............................................................................................................................39
Figure 23: Confusion matrix of the test results of GP-AdaBoost on APTMalware dataset.....40

List of Tables
Table 2-1: Different settings of Transfer Learning.................................................................11
Table 3-1: Width of malware images w.r.t their sizes.............................................................19
Table 4-1: APTMalware dataset details..................................................................................26
Table 4-2: A summary of modifications in DNN architectures and their effects....................31
Table 4-3: General parameters of GP-AdaBoost algorithm....................................................32
Table 5-1: Prediction accuracy of the four DNNs on APTMalware dataset in Transfer
Learning task..................................................................................................................37
Table 5-4: Final results obtained from GP-AdaBoost ensemble on APTMalware dataset......40

Table of Code Listings


Listing 2-1: Listing of the AdaBoost algorithm pseudocode...................................................14
Listing 3-2: Pseudocode of GP-AdaBoost algorithm..............................................................24
Listing 4-1: Example of TFR directory files after data conversion.........................................29
Listing 4-2: Some parameters for training of DNN model from scratch.................................29
Listing 4-3: Some parameters for training of a pre-trained DNN model.................................30

Nomenclature
ANN Artificial Neural Network

APT Advanced Persistent Threat

AUC Area Under the Curve

CNN Convolutional Neural Network

DNN Deep Neural Network

EA Evolutionary Algorithm

GP Genetic Programming

k-NN k-Nearest Neighbor

ResNet Residual Network

ROC Receiver Operating Characteristic

TL Transfer Learning

WEKA Waikato Environment for Knowledge Analysis



Abstract
In the past decade, there have been many APT attacks that caused various government
and private organizations far greater financial and intellectual-property losses than
common malware ever has; examples include Stuxnet, Cozy Bear, Ocean Lotus, and
WannaCry. APT malware, unlike common malware, poses a greater threat to
organizations. Considerable research has been done on classifying APT malware to
mitigate its activity and shorten the time between attack and detection. Traditional
malware analysis includes static and dynamic analysis. In static analysis, the malware
binary is analyzed without executing it, while dynamic analysis executes the binary in
an emulated environment and records its API calls, memory access patterns, network
usage, and so on. APT malware, however, keeps growing more complex and advanced
and has found ways to bypass traditional analysis techniques.

In this research project, I incorporated the image-processing technique proposed by


Nataraj et al. to convert malware binaries into images and used Convolutional Neural
Networks to classify them into families. For this purpose, I used APT malware
binaries from the APTMalware repository on GitHub. I trained deep learning models
to classify the samples into families and then improved the classification accuracy
using GP-AdaBoost, a hybrid technique of genetic programming and boosting, for the
final classification.
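The binary-to-image conversion mentioned above reads a file's raw bytes as grayscale pixel intensities. A minimal sketch of this Nataraj-style conversion is given below; the fixed width of 256 is an illustrative choice (the thesis selects width according to file size, as in Table 3-1):

```python
import numpy as np

def binary_to_image(path, width=256):
    """Read a malware binary as raw bytes and reshape the byte
    stream into a 2-D grayscale image of the given width; any
    trailing partial row is dropped."""
    data = np.fromfile(path, dtype=np.uint8)
    height = len(data) // width
    return data[:height * width].reshape(height, width)
```

The resulting array can be saved as a PNG or fed directly to a CNN after resizing.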

Chapter 1: Introduction

The information security industry is in a constant


struggle to find new ways to protect corporations from cyber-attacks, which keep
growing more severe, more complex in nature, and vaster in scope. Many cyber-security
researchers have come up with sophisticated models and methodologies for detecting,
classifying, and mitigating APT malware, and some of these methodologies were quite
successful, but every technique has its own drawbacks. Every day, the digital market
comes across a malware sample that seems to have found a way to bypass traditional
security measures and analysis techniques. The competition is fierce on both sides, so
there is a constant need to research and propose new techniques to analyze, classify,
and mitigate APT malware.

In the past decade, machine learning and especially deep learning models have gained
popularity in image classification because their classification error is much lower than
that of human beings. Theoretically, then, any data that can be represented in image
form can be classified into categories with deep learning models. This research work
follows the same direction: we used deep learning models to classify APT malware
represented in image form.

1.1. Purpose of Project


This research and experimentation-based project was geared towards exploring the
possibilities of an efficient APT classification system. For all intents and purposes, the
objective here has been to produce a classification system that performs
comparatively well on the currently available APT malware dataset using modern
machine learning and deep learning techniques.

1.2. The Problem of APT classification


Let's briefly discuss what exactly an Advanced Persistent Threat (APT) is and why it is
a major threat to organizations; then we discuss the proposed solution and the scope
of our work.

APT Malware classification using GP-Adaboost and deep Neural Networks



1.2.1. Defining Advanced Persistent Threat (APT)

As described by Nicho (2014), an "Advanced Persistent Threat (APT) is a term used


for a new breed of insidious threats that use multiple attack techniques and vectors
conducted by stealth to avoid detection so that hackers can retain control over target
systems unnoticed, for long periods of time." It is clear from this description that an
Advanced Persistent Threat is a malicious computer program that follows a slow
profile and has a specific purpose rather than spontaneous monetary gain. It goes after
intellectual property, which can be used to destabilize a whole organization, in
political campaigns, in business disruption, or in controlling the masses through social
propaganda.

1.2.2. The Need for efficient APT Classification Solutions

In recent years, cyber attacks have been witnessed that have gained more strength and
done more damage than the malware of the past decade combined. As is clear from
last year's IC3 report, the total financial damage done in 2019 was worth $3,500
million. This is apart from the huge amount of intellectual property that has been
stolen from major corporations and government organizations.

Figure 1: Financial damage done by cyber attacks in 2019


Most of this damage is done by a special type of malware, APT malware, because its
complexity and sophistication make it harder for antivirus software to detect, classify,
and remove from the system. These malware families have dedicated authors who
receive huge funding and support from organizations. They keep trying to discover
new exploits and weaknesses in the systems and software used by the target
organization and thus easily succeed in bypassing its installed defense systems. APT
classification is a major problem in the process of malware risk mitigation. When an
APT attack is detected, the best strategy is to classify the detected malware to find its
author, because most malware belonging to the same group shares major similarities
in code, tools, and methodology. Once the author or group of the APT malware is
found, it is easier to mitigate the risk or damage done by the attack by following
measures tailored to the particular APT group.

Therefore, there is a need to come up with new and innovative ideas and approaches
to mitigate such threats. In other words, there is a need for efficient solutions to the
APT classification problem.

1.3. The Machine Learning Solution for APT Classification


Having established the importance of APT classification in the information security
industry, let's discuss the proposed machine learning solution for this classification
problem. In the past decade, different researchers built classification models for APT
malware with different approaches and different types of datasets, as discussed in the
literature review section. Many of them produced results with high accuracy, but each
had its own pros and cons. Accuracy and efficiency are the main tradeoff of this
classification problem. Machine learning, and specifically deep learning, models are
very accurate and efficient in image classification problems because they find hidden
patterns in the form of dynamic features in the image data and can predict the class of
the data in real time. In APT malware classification, time matters a lot: the shorter the
classification time, the shorter the time to mitigate the risk and damage done by the
malware. That is why machine learning and deep learning solutions are more accurate
than static analysis and more efficient than dynamic analysis.

Figure 2 presents an overview of such a prediction model that can be used for APT
classification.


Figure 2: An overview of a complete APT classification model

As seen in Figure 2, we have a dataset of APT malware images belonging to


different APT groups. This dataset is split into multiple subsets, usually training and
testing. The training set is used to train the model, and the performance of the model
is evaluated using the test set. Once we are satisfied with the model's performance, it
can be used to classify real-time data, based on which risk mitigation steps can be
carried out according to the group or family of the APT malware.
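The split described above can be sketched as a simple shuffle-and-cut over the sample list; the 80/20 fraction and the helper's name are illustrative choices, not the thesis's actual split:

```python
import numpy as np

def train_test_split(paths, labels, test_frac=0.2, seed=0):
    """Shuffle the (sample, label) pairs with a fixed seed and cut
    them into training and testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(paths))
    cut = int(len(paths) * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]
    return ([paths[i] for i in train], [labels[i] for i in train],
            [paths[i] for i in test],  [labels[i] for i in test])
```

Fixing the seed makes the split reproducible across experiments, which matters when comparing the four DNN baselines on the same held-out data.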

1.4. Scope of Research


In this thesis, Deep Neural Networks (DNNs), Transfer Learning (TL), Genetic
Programming (GP), and ensemble-based classifiers were studied to develop an
efficient and reliable APT classification system for the information security industry.
These topics are relevant to Artificial Intelligence, Computer Vision, Machine
Learning, and Pattern Recognition domains of computer science.

1.5. Project Objectives


The major objectives of the project included:

 Literature survey of APT classification methodologies used in previous


relevant research.
 Learning classification techniques in general, and ensemble (meta) classifiers
in particular.


 Developing several individual learners (Deep Convolutional Neural Network-


based classifiers, using Transfer Learning) for APT classification, and using
their results as a baseline and for the meta-classification task.
 Developing a GP-AdaBoost ensemble of meta learners.
 Testing, troubleshooting, and removal of shortcomings.
 Performance analysis of the proposed technique.

1.6. Thesis Organization


This thesis report has been organized as follows. The first chapter briefly introduces
the problem being tackled and its proposed solution in general terms. Chapter 2 covers
the literature review and theoretical background relevant to the project, which includes
a brief summary of the machine learning concepts employed in this project and
previously proposed solutions for the problem under view. Chapter 3 introduces the
proposed methodology which the author used to solve the problem of APT
classification. Chapter 4 covers the complete experimentation details of the project,
from the concepts of dataset pre-processing to the implementation framework.
Chapter 5 is dedicated to the performance analysis of the proposed methodology.
Chapter 6 provides concluding remarks and comments about future work that may be
carried out considering this work as a foundation.


Chapter 2: Theoretical Background

Machine Learning is a rapidly-advancing domain within computer science which


deals with the theory and implementation of intelligent, automated systems that learn
from input data. This data can be sensory, environmental, feature vector-based, or in
any other form which is acceptable to the system. Such systems can be made to
achieve a diverse array of objectives, ranging from weather forecasting, making
financial predictions, e-mail filtering, network intrusion detection, optical character
recognition, and much more. With advancements in theoretical knowledge and
computational power, it has become possible to devise and implement powerful and
advanced machine learning algorithms whose learning potential is very high; hence
they are able to solve complex problems.

Finding the perfect solution for a given class of problems is, however, still not
possible. No matter how much computational power is at our disposal, there will
always remain a margin of error because of noise, missing values, discrepancies and
errors in the input data, limitations and restrictions on learning and implementation,
etc. Moreover, a model that works well for a problem may or may not work just as
well for some other problem. Therefore, researchers have to try and experiment with a
range of models and tune their parameters in order to settle on a model that works
well for a given problem.

Similarly, in order to tackle the issue of APT classification, several wide-ranging


solutions have been proposed. They vary in their theory, complexity,
comprehensibility, accuracy, and reliability. This chapter covers some of the major
work carried out in this domain, and follows up with a theoretical primer on the
frameworks and algorithms that have been used in the proposed solution.

2.1. Machine Learning Concepts


Before moving on to the detailed implementation of the project, some of the core
concepts which have been used during the planning, implementation, and testing
phases of this project must be elaborated upon. This section shall introduce some of


those necessary machine learning-related concepts and terminologies which shall be


used in the following chapters.

2.1.1. Artificial Neural Networks

Artificial Neural Networks are computational models inspired by the human
brain. They are composed of simple computation units called perceptrons (also
known by their biological equivalent, neurons) and rely on numerical input vectors
and a threshold value for their output. Consider such a unit, which has been shown in
Figure 3 (source: [Ale16]).

Figure 3: The working of a single neuron (perceptron) within a neural network

The neuron's input is actually a weighted sum of several input values, some of which
might come from other neurons. The varying weights mean that each input value can
have a different influence on the result (in other words, they encode how important
that input is to the overall result). A neuron applies a transformation (represented by
the transfer function in Figure 3) to the input before evaluating it against an
activation threshold. If the transformed value equals or exceeds the threshold, the
neuron is said to have fired, or activated, and it outputs a positive value in whatever
terms have been used to define this phenomenon (such as outputting 1); otherwise,
when the transformed value falls below the threshold, the neuron does not fire. A
combination of such neurons stacked within layers constitutes a neural network.
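The firing rule just described can be written in a few lines; the weights and bias below are illustrative values chosen to realise an AND gate, not learned parameters:

```python
import numpy as np

def perceptron(x, w, b, threshold=0.0):
    """Weighted sum of inputs plus bias (the transfer function),
    followed by a step activation: the neuron fires (outputs 1)
    when the transformed value reaches the threshold."""
    z = np.dot(w, x) + b
    return 1 if z >= threshold else 0

# A single neuron acting as an AND gate with hand-picked weights
assert perceptron([1, 1], [0.5, 0.5], -0.7) == 1   # both inputs on: fires
assert perceptron([1, 0], [0.5, 0.5], -0.7) == 0   # one input on: silent
```

Training replaces the hand-picked weights with values found by an optimization algorithm, as discussed next.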

The feature that makes artificial neural networks interesting is the adaptive
weights along the paths of the network. These weights can be optimized by a learning


algorithm using a cost function that tries to determine the best values of these weights
against a required output (which may or may not be known beforehand).

Several other concepts are also involved in the learning process of an artificial neural
network which are required to be taken care of by a researcher working on them.
Some of these concepts and terminologies are defined below.

 Optimization techniques. A technique used with the cost function to find


the best values and/or distribution of weights for the observed data.
Gradient descent is one of the most well-known optimization techniques used
by neural networks; it involves learning by example alongside
backpropagation of errors to learn an optimal weight distribution.
 Parameters and hyperparameters. The values that must be decided when
designing a neural network. These include the number of neurons in hidden
layers, the learning rate (how fast or slow the weights change), momentum (used
to prevent the system from getting stuck in local minima), the number of epochs or
iterations (which decides when to stop the training process), the batch size (the number
of examples used to train the network at a single time), etc. Not all of these values
are known up front; they may have to be found using search and
optimization techniques.
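The gradient descent technique and the hyperparameters above (learning rate, number of epochs) can be illustrated with a minimal batch gradient descent on a linear model; the example problem and parameter values are hypothetical:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=100):
    """Fit the weights of a linear model y ~ X @ w by repeatedly
    stepping against the gradient of the mean-squared-error cost.
    lr and epochs are the hyperparameters discussed above."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of MSE w.r.t. w
        w -= lr * grad                         # descend along the gradient
    return w

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])           # true relationship: y = 2x
w = gradient_descent(X, y)              # converges toward w ~ [2.0]
```

In a deep network, the same idea applies, with backpropagation supplying the gradient for every layer's weights.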

There are several types of artificial neural networks, each type having its own
strengths and weaknesses. A couple of them, Convolutional Neural Networks (CNNs
or ConvNets) and Residual Neural Networks (ResNets) which have been used in this
project are briefly described below.

2.1.2. Convolutional Neural Networks

Convolutional Neural Networks are the type of deep (a term which refers to a stacked
implementation of neural network units), feed-forward ANNs which use the
mathematical operation of convolution somewhere along their architecture. They have
high learning capacity because of the concept of local receptive fields. In order to
learn more about them, let’s take a look at an example of a convolutional neural
network and its building blocks.


As can be seen in Figure 4, a convolutional neural network does not simply consist
of convolution layers [The15]; several other components exist alongside them.

Figure 4: A convolutional neural network with different labelled layers

Here’s a brief overview of the components of a CNN:

 Convolution layers. Layers which carry out the convolution operation on
their input. Unlike conventional neural networks with full connectivity
between layers, a convolutional neural network’s convolution layers use
small filters (small compared to the input size) which are repeatedly
multiplied and slid across the input volume. This produces 2-D activation
maps that give the response of the filters at every spatial position. Over
time, the CNN learns filters that activate when they see some type of
important visual feature: simple features, such as edges, at first, then
features of increasing complexity, such as faces, as learning progresses.
 Sub-sampling layers. Also known as pooling layers, these reduce the
spatial size of the representation as the layers progress and are placed at
intervals throughout the network. This reduces the representation size and
the computational complexity. To collect summary information from the
previous layers, different operations like max, min, or average pooling can
be used.
 ReLU layers. These layers apply an element-wise activation function.
 Fully-connected layer(s). A layer that has connections to all of the activations
in its preceding layer. These layers compute the class scores that are, in the
end, used for the final prediction of the label.
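To make the convolution and pooling operations above concrete, here is a minimal NumPy sketch (illustrative only, not part of the thesis code) of a single convolution, ReLU, and max-pooling pass over a toy 6x6 image; the edge-detecting kernel is a standard Sobel filter chosen purely for demonstration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Slide the small filter across the input volume.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def relu(x):
    """Element-wise activation, as applied by a ReLU layer."""
    return np.maximum(x, 0)

# A toy "image" with a vertical bright-to-dark edge between columns 2 and 3.
image = np.array([[1, 1, 1, 0, 0, 0]] * 6, dtype=float)
sobel_x = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)

activation = relu(conv2d(image, sobel_x))   # 4x4 activation map
pooled = max_pool(activation)               # 2x2 pooled summary
```

The activation map responds only where the filter overlaps the edge, and pooling keeps the strongest response in each window.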


2.1.3. Residual Neural Networks

Residual Neural Networks (or ResNets, as they are commonly known) are based upon a
simple improvement over regular convolutional neural networks. Instead of learning a
desired mapping directly, a residual block estimates the difference between the
mapping of the input we want to obtain and the input itself [ CITATION HeK \l 1033 ].
This difference is added back to the original input to obtain the actual mapping, as
described in the following equation:

y = F(x; {W_i}) + x

The same concept is illustrated in Figure 5 (source: [ CITATION HeK \l 1033 ]).

Figure 5: A single ResNet block with the identity mapping

Residual Networks are built on a simple theoretical observation: deeper networks
should perform at least as well as their shallower counterparts, since a shallow
model can always be replicated as a deeper model whose extra layers are set to
identity mappings. In the case of ResNets, the network should in theory perform
better, since each layer receives not only the features in the form of weighted
activations from the previous layer but also, through the identity mappings, the
original features; this also mitigates the problem of vanishing gradients.
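As a minimal sketch of the equation above (assuming F consists of two weight layers with a ReLU between them, as in the original ResNet formulation), the following NumPy snippet shows how the identity shortcut makes the block default to an identity mapping when the residual weights are zero:

```python
import numpy as np

def residual_block(x, W1, W2):
    """Compute y = F(x; {W1, W2}) + x for one residual block.

    F is the residual function: two weight layers with a ReLU in between.
    The shortcut adds the unmodified input x back onto F(x).
    """
    relu = lambda v: np.maximum(v, 0.0)
    f = W2 @ relu(W1 @ x)   # the learned residual F(x)
    return relu(f + x)      # identity shortcut addition, then the final ReLU

x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))

# With all residual weights at zero, F(x) = 0 and the block reduces to the
# identity mapping (for non-negative input) -- the "easy" solution that makes
# deeper networks no worse than shallower ones.
y = residual_block(x, zero, zero)
```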

2.1.4. Transfer Learning

Transfer Learning, in general terms, means the study of dependency of some


procedure, learning process, or performance based on some prior experience. In terms
of machine learning, it refers to the ability of an intelligent learning system to
recognize and apply knowledge & skills acquired from some previous task to a novel
& new task, which share at least some sort of common ground.


The difference between a traditional machine learning solution and a solution based on
transfer learning is illustrated in Figure 6 (based on [ CITATION Yan10 \l 1033 ]).

Figure 6: A comparison of the learning processes of (a) Traditional machine learning
tasks (b) Transfer learning between different tasks

The need for Transfer Learning arises from the fact that in real life the datasets
available to us are often neither very large nor well-defined. Training a neural
network on such a dataset with random weight initialization would be a tedious and
usually futile exercise. It is therefore preferred to take an artificial neural
network that has previously been trained on a large dataset (like ImageNet, with 1.2
million images and 1000 classes), modify it according to the needs of our own
problem, and then fine-tune it on the new data. This has several benefits, such as
removing the need to design a new architecture and train it from scratch, a probable
improvement in generalization, and faster training, evaluation, and prototyping
times [ CITATION Yos14 \l 1033 ].
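The fine-tuning recipe described above can be sketched in a few lines. The snippet below is an illustration only, not the project's actual pipeline: the frozen "pretrained" feature extractor uses random weights as a stand-in for a backbone trained on a large dataset, and only the new classification head is fitted on the small target dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a frozen feature extractor. In a real
# transfer-learning setup these weights would come from training on a large
# dataset such as ImageNet; here they are random purely for illustration.
W_frozen = rng.normal(size=(20, 32))

def backbone(X):
    """Frozen forward pass: these layers are never updated during fine-tuning."""
    return np.maximum(X @ W_frozen, 0.0)

# A small "target task" dataset -- too small to train a full network on.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fine-tuning here means fitting ONLY the new classification head on top of
# the frozen features.
head = LogisticRegression(max_iter=1000).fit(backbone(X), y)
train_acc = head.score(backbone(X), y)
```

The same pattern, freeze the early layers, replace and retrain the final layers, is what the TF-Slim fine-tuning commands in Chapter 4 perform.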

Table 2-1: Different settings of Transfer Learning

Transfer learning setting        Labeled data in   Labeled data in   Tasks
                                 source domain     target domain
Inductive Transfer Learning      No                Yes               Classification,
                                 Yes               Yes               Regression, …
Transductive Transfer Learning   Yes               No                Classification,
                                                                     Regression
Unsupervised Transfer Learning   No                No                Clustering, …


Transfer Learning can be of two types: positive, when learning in one context
improves performance in another context; and negative, when learning in one
context has a negative impact on performance in another. The settings in which
Transfer Learning can occur are summarized in Table 2-1.

2.1.5. Meta-classification

Meta-classification is a type of ensemble learning within pattern recognition which


combines predictions from several base classifiers. A new instance that needs to be
classified is first passed on to multiple base (weak) classifiers which individually
provide their predictions for the given feature vector. These predictions are then
appended with the original feature vector to create an extended feature space, and then
a meta-classifier (which can be an ensemble or a stronger classifier) uses this new
feature space to make a prediction. Theoretically speaking, the classification accuracy
of this overall system should be at least as good as that of any individual learning
algorithm used, or better.

An illustration of the overall working of this type of learning is shown in Figure 7
(source: [ CITATION Cla17 \l 1033 ]).

Figure 7: An overview of the functionality of a stacked (meta) classifier
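The stacking scheme just described can be reproduced with off-the-shelf tools. The sketch below is illustrative (the thesis itself uses DNN base learners and a GP-AdaBoost meta-classifier): two weak base classifiers are stacked under a logistic-regression meta-classifier with scikit-learn, where passthrough=True appends the base predictions to the original feature vector, i.e. the extended feature space described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base (weak) classifiers whose predictions extend the feature space.
base = [("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier())]

# The meta-classifier learns from the base predictions; passthrough=True
# also feeds it the original features alongside those predictions.
meta = StackingClassifier(estimators=base,
                          final_estimator=LogisticRegression(max_iter=1000),
                          passthrough=True)
acc = meta.fit(X_tr, y_tr).score(X_te, y_te)
```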


2.1.6. Genetic Programming

In artificial intelligence, Genetic Programming is a technique in which computer
programs are automatically formed from a high-level statement of a problem. Genetic
programming borrows several concepts from natural evolution: it starts from a
high-level statement of “what needs to be done,” encodes several computer programs
as sets of genes, and evolves these programs to solve the problem using an
evolutionary algorithm. The major tasks involved in working with a GP-based
solution include choosing how to encode a computer program in an artificial
chromosome and how to evaluate its fitness with respect to the predefined task.

Genetic programming begins with a population of randomly created computer
programs. This population is progressively evolved over a series of generations.
The evolutionary search uses the Darwinian principle of natural selection (survival
of the fittest) and analogs of various naturally occurring operations, including
crossover, mutation, gene duplication, and gene deletion.

An overview of how this process works is provided in Figure 8.

Figure 8: An overview of the Genetic Programming approach
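The generational loop in the figure above can be demonstrated on a toy symbolic-regression task. The self-contained sketch below is illustrative only: it encodes programs as expression trees, uses mutation with truncation (elitist) selection, and omits crossover for brevity; the target program x*x + 1 and all names are my own.

```python
import operator
import random

random.seed(0)

OPS = [(operator.add, "+"), (operator.sub, "-"), (operator.mul, "*")]
TERMINALS = ["x", 1.0, 2.0]

def random_tree(depth=3):
    """Grow a random program: internal nodes are operators, leaves terminals."""
    if depth <= 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return (random.choice(OPS), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    """Run the encoded program on input x."""
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    (fn, _), left, right = tree
    return fn(evaluate(left, x), evaluate(right, x))

def mutate(tree, depth=3):
    """Replace a randomly chosen subtree with a freshly grown one."""
    if not isinstance(tree, tuple) or random.random() < 0.2:
        return random_tree(depth)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left, depth - 1), right)
    return (op, left, mutate(right, depth - 1))

def fitness(tree):
    """Squared error against the target program x*x + 1 (lower is better)."""
    return sum((evaluate(tree, x) - (x * x + 1)) ** 2 for x in range(-5, 6))

population = [random_tree() for _ in range(60)]
initial_best = min(fitness(t) for t in population)

# Generational loop: keep the fittest third, refill by mutating survivors.
for _ in range(40):
    population.sort(key=fitness)
    survivors = population[:20]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(40)]

best_fitness = min(fitness(t) for t in population)
```

Because the fittest programs survive each generation, the best fitness never worsens as evolution proceeds.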

2.1.7. Adaptive Boosting

Adaptive Boosting, more commonly known as AdaBoost, is a meta-algorithm
proposed by Y. Freund and R. Schapire. It can use several types of base learning
algorithms (weak learners), whose outputs are combined into a weighted sum that
constitutes the final output of the meta-classifier. The adaptivity of the AdaBoost
algorithm comes from the fact that the weak learners are constantly tweaked in
favor of those instances which were misclassified by the learners in the previous round of

the learning stage. The individual learners may not be very good at classification by
themselves (their guesses may be only slightly better than random chance), but a
combination of such weak learners can be proven to have strong classification
ability. The pseudocode of the AdaBoost algorithm is provided below.

Listing 2-1: Listing of the AdaBoost algorithm pseudocode
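As a concrete counterpart to the pseudocode, the behavior can be illustrated with scikit-learn's off-the-shelf implementation on synthetic data (an illustrative sketch, not the ensemble used later in this thesis; by default the weak learner is a depth-1 decision tree, i.e. a stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# A single decision stump: a weak learner on its own.
stump_acc = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr).score(X_te, y_te)

# AdaBoost re-weights the training samples after every round in favor of the
# instances the previous stumps misclassified, then combines all stumps into
# a weighted-majority vote (its default base learner is a depth-1 tree).
boosted = AdaBoostClassifier(n_estimators=50, random_state=1)
boosted_acc = boosted.fit(X_tr, y_tr).score(X_te, y_te)
```

Comparing stump_acc with boosted_acc on such data typically shows the weighted combination outperforming any single weak learner.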

2.2. Literature Review


In the paper “End-to-End Deep Neural Networks and Transfer Learning for
Automatic Analysis of Nation-State Malware”, Ishai Rosenberg, Guillaume Sicard
and Eli (Omid) David used dynamic features as raw input to a deep neural network to
classify APT samples into families. They extracted raw behavioral features from
about 1000 Chinese and Russian malware samples, fed them to deep neural networks,
and classified them with an accuracy of 98%.

Laurenza, Lazzeretti, and Mazzotti proposed a static-analysis-based machine
learning model to identify APTs. They achieved 95% classification accuracy using
only static features extracted from Portable Executable files and a random forest
classifier.

Lamprak et al., from Zurich, Switzerland, proposed a technique based on web
request graphs to classify APT samples into different families.


Chapter 3: Proposed Methodology

The proposed machine learning-based intelligent solution for the APT classification
problem uses several advanced concepts from the domains of computational
intelligence and pattern recognition, including Deep Neural Networks, Transfer
Learning, Genetic Programming, Adaptive Boosting, and meta- and ensemble
classification techniques. An overview of the overall system is provided in
Figure 9.

Figure 9: An overview of the proposed methodology

As we saw in Section 2.1.5, the concept of meta-classification states that the
predictions obtained from several base classifiers can be used by a meta-classifier
on top of them in order to improve the prediction results. This concept is used in
our project as follows:

1. The datasets are first converted to image form.
2. Multiple neural network models are trained on a portion of the required
dataset.


3. A new instance that needs to be classified is first converted from an
executable file to image format (2-D or 3-D, as per the requirement of the
neural network architecture being used).
4. The Deep Neural Network models trained previously take this image as input
and provide their individual, independent predictions.
5. The predictions obtained from these DNN models are stored in a CSV file
along with their respective nominal labels and passed on to the GP-AdaBoost
ensemble.
6. The GP-AdaBoost ensemble classifier provides the final prediction for the
input instance.

The deep CNN architectures referred to in step 2 obviously need to be trained on
some input data before they can provide predictions. Let us discuss how this data
split is carried out.

3.1. Distribution of the Dataset


Since we are using four convolutional neural networks for the initial classification
step and an ensemble learning algorithm for the final decision, we used one
distribution of the dataset for the CNNs; after the first classification step, we
changed the distribution for the GP-AdaBoost algorithm, because our final dataset
consisted of only 355 samples.

The distribution is as follows:

 We used 90% of the original image dataset for training the neural networks.
Let us call it Subset A.
 The remaining portion of the dataset (10%) is used for testing the said
neural networks. Let us call this portion Subset B.
 After training the networks on the train portion (Subset A), we used the
test portion (Subset B). The predictions of the neural networks on the test
set are used to create a dataset for training and testing the GP-AdaBoost
ensemble. Since we used Subset B to create this dataset, we call it B’.
 Subset B’ is divided into two portions, Subset C and Subset D, where C is
used for training and D is used for testing of the GP-AdaBoost ensemble.


An overview of the dataset splits and how they are used for training and testing of
the models is given in Figure 10.

Since subsets A and B are used by Convolutional Neural Network models, they must
be in image form. The original APTMalware dataset is in Portable Executable (PE)
form, which needs to be converted to images before it can be processed by a CNN.
For this purpose, the technique proposed by Nataraj et al. is used to convert the
binary files of the dataset into image form. A summary of this process is provided
in Section 3.2.1.

Figure 10: Procedure followed by the proposed method w.r.t the input data splits

The predictions of the trained models are stored in a CSV file along with their
respective image names and class labels. An outline of this dataset is given in the
figure below. Since Weka does not accept numeric labels, the labels were converted
to nominal form.


Figure 11: An overview of model predictions in csv file

3.2. Data pre-processing


Convolutional Neural Networks (CNNs) are well known for their ability to recognize
patterns within images [ CITATION Ale16 \l 1033 ]. They are powerful tools for this
job because of concepts like local receptive fields, accumulation of increasingly
complex information about the input, and learnable filters. This is the reason this
work proposes converting the PE malware files to their equivalent image form.

The malware files contained in the APTMalware dataset are all Portable Executable
(PE) files which are given the extension ‘.file’ to avoid accidental execution. The
convolutional deep neural networks, however, require the input data to be images so
that they can extract the visual features used to classify them. For this
conversion, we can use the method described by Nataraj et al. in his paper.

3.2.1. Conversion of Vector Data to Image Data

To convert a PE file to an image, we read the file’s binary content as a vector. We
then perform 8-bit sampling so that every unit of data represents an 8-bit pixel of
a grayscale image. Finally, we reshape the array into matrix form and save it as a
PNG image.
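A minimal NumPy sketch of this conversion follows (illustrative; the function and variable names are my own, and the final PNG save via Pillow is shown only as a comment):

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Convert a binary file's contents to a 2-D grayscale image array.

    Every byte becomes one 8-bit pixel (0-255); the byte vector is padded
    with zeros to a multiple of `width` and reshaped into rows of that
    width, following the visualization approach of Nataraj et al.
    """
    pixels = np.frombuffer(data, dtype=np.uint8)
    pad = (-len(pixels)) % width
    pixels = np.concatenate([pixels, np.zeros(pad, dtype=np.uint8)])
    return pixels.reshape(-1, width)

# In-memory stand-in for a PE file read via open(path, "rb").read().
blob = bytes(range(256)) * 4          # 1024 bytes of dummy content
img = bytes_to_image(blob, width=32)  # 32 x 32, one byte per pixel
# Image.fromarray(img, mode="L").save("sample.png")  # requires Pillow
```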


Figure 12: Visual illustration of file to image conversion

After conversion of a malware file to an image file, the image shows different
areas that correspond to different sections of the binary executable malware file.
This pattern is similar for malware files that belong to the same APT family, and
this similarity of patterns among related malware files is what makes this
image-based approach to malware classification successful.

Figure 13: sections obtained from PE file and information obtained from images

The choice of the width of the image depends on the size of the malware file, as
given in the table below.

Table 3-2: Width of malware images w.r.t their sizes

File Size Range      Image Width

< 10 kB              32
10 kB – 30 kB        64
30 kB – 60 kB        128
60 kB – 100 kB       256
100 kB – 200 kB      384
200 kB – 500 kB      512
500 kB – 1000 kB     768
> 1000 kB            1024
The files in the dataset are of various sizes, which results in images of different
sizes. But neural networks require input images of fixed height and width, so we
have to resize the images in accordance with the specific neural network we are
using to classify those malware images.
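The width lookup of Table 3-2 and the subsequent resizing can be sketched as follows. This is an illustration under assumptions the thesis does not state: 1 kB is taken as 1024 bytes, the range boundaries are treated as exclusive upper limits, and a simple nearest-neighbour resize stands in for whatever resizing the chosen framework provides.

```python
import numpy as np

# (upper size limit in bytes, image width) pairs from Table 3-2,
# assuming 1 kB = 1024 bytes.
WIDTH_TABLE = [(10 * 1024, 32), (30 * 1024, 64), (60 * 1024, 128),
               (100 * 1024, 256), (200 * 1024, 384), (500 * 1024, 512),
               (1000 * 1024, 768)]

def image_width(file_size: int) -> int:
    """Pick the malware-image width for a file of `file_size` bytes."""
    for limit, width in WIDTH_TABLE:
        if file_size < limit:
            return width
    return 1024   # files larger than 1000 kB

def resize_nearest(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize to the fixed input size a CNN expects."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

small = image_width(5 * 1024)                          # width for a 5 kB file
square = resize_nearest(np.arange(64).reshape(8, 8), 4, 4)
```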

3.3. Transfer Learning and Decision Space


The concept of meta-classification dictates that the predictions from a combination
of different base learners may be fed, alongside the original features, to a main
(meta) classifier, which uses those predictions from the individual learners in
order to make a more accurate decision. This essentially adds an additional layer
of learning to the process.

The individual learners, whose predictions are passed to the GP-AdaBoost ensemble
for the meta-classification task, were chosen as different Deep Neural Network
architectures. Transfer learning was carried out on these architectures using the
APT malware dataset (converted to image form), and the predictions of the
individual DNN models were used for the meta-classification task in the next step.
The DNN models used for this task were:

 Inception-ResNet-v2
 ResNet34
 AlexNet
 Custom DNN Model

3.3.1. Inception-ResNet-v2

The Inception series is a class of deep convolutional neural network architectures
proposed by Google. They use significantly more computational power than AlexNet
because of their more complex architecture. The concept behind Inception-ResNet-v2
[CITATION Chr16 \l 1033 ] was to train deeper neural network architectures using
skip connections [ CITATION HeK \l 1033 ] and to automate the filter size
selection. A single block in the Inception-ResNet architecture uses a filter bank

with multiple filter sizes (5x5, 3x3, 1x1); apart from learning the filter weights,
the network also learns which filter sizes work for the given problem.

On the ILSVRC 2012 image classification benchmark, this architecture achieved a
Top-5 accuracy of 95.3%.

Figure 3-5: The architecture of Inception-ResNet-v2

Figure 3-6: A closer look at the Inception-ResNet-v2 module

3.3.2. ResNet34

ResNet is one of the most powerful deep neural network architectures and achieved
outstanding results in the ILSVRC 2015 classification challenge. ResNet has also
achieved excellent generalization performance on other recognition tasks and won

first place in ImageNet detection, ImageNet localization, COCO detection and
COCO segmentation in the ILSVRC and COCO 2015 competitions. There are many
variants of the ResNet architecture, i.e. the same concept but with a different
number of layers: ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-110,
ResNet-152, ResNet-164, ResNet-1202, etc. The name ResNet followed by a number
simply denotes the ResNet architecture with that number of neural network layers.
In this research, we use ResNet-34.

The basic logic behind residual networks is that neural networks are good function
approximators, so they should be able to easily learn the identity function, where
the output equals the input:

h(x) = x

Following this logic, if we bypass the input of a block to its output through a
shortcut connection, the block only needs to learn the residual F(x), and the
output becomes the learned function with the input added back to it:

h(x) = F(x) + x

The intuition is that when the desired mapping is the identity, learning F(x) = 0
is easy for the network.

Figure 3-7: Overview of ResNet34 architecture


Figure 14: A skip connection in ResNet network

3.3.3. AlexNet

Originally designed by Alex Krizhevsky for the ImageNet classification challenge,
AlexNet is a classic deep convolutional neural network model
[ CITATION Ale12 \l 1033 ].

Figure 3-8: AlexNet architecture as proposed by Alex et al.

The architecture is composed of eight layers in total: the first five are
convolutional and the next three are fully-connected, leading to a final 1000-way
softmax layer which produces the distribution over the 1000 class labels. The
network maximizes the multinomial logistic regression objective. The first layer
takes a 224x224x3 image as input and filters it with 96 kernels of size 11x11x3,
applied with a 4-pixel stride. Later layers have kernels smaller in spatial size
but with an increasing number of channels, such as 5x5x48 and 3x3x256. Each
fully-connected layer has 4096 neurons, and the final 1000-way softmax layer is
present at the end. The network uses data augmentation and dropout to reduce the
effects of overfitting.


3.3.4. Custom DNN Model

For comparison purposes, a custom deep neural network model was also trained after
trial-and-error experimentation with different architecture parameters. The
architecture follows the layer pattern
conv-LeakyReLU-pool-conv-LeakyReLU-pool-fc-softmax-classification. It takes a
64x64 input image, applies a 3x3x32 filter with no padding at the first
convolution layer followed by 2x2 pooling, another 3x3x32 filter at the second
layer followed by 2x2 pooling, and ends with a fully-connected layer, softmax, and
a classification layer with 12 neurons (since we have 12 APT classes).
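The spatial dimensions implied by this architecture can be checked with standard convolution arithmetic (assuming stride-1 convolutions and non-overlapping 2x2 pooling, which the description suggests):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Non-overlapping pooling (stride equal to the window size)."""
    return size // window

s = 64                # 64x64 input image
s = conv_out(s, 3)    # first 3x3 convolution, no padding -> 62x62
s = pool_out(s, 2)    # 2x2 pooling -> 31x31
s = conv_out(s, 3)    # second 3x3 convolution -> 29x29
s = pool_out(s, 2)    # 2x2 pooling -> 14x14
flat = s * s * 32     # feature count entering the fully-connected layer
```

So the fully-connected layer receives 14 x 14 x 32 = 6272 features per image.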

Keeping the dataset splits from Section 3.1 in view, Subset A is provided to the
DNN architectures separately, and predictions are acquired from them on Subset B.
Then, by storing the predictions along with the corresponding labels in a CSV
file, we obtain the Prediction Dataset, which is used for training and testing of
the GP-AdaBoost ensemble with an 85% train and 15% test split.

3.4. GP-AdaBoost Ensemble


The GP- and AdaBoost-based ensemble approach used here was proposed by Idris et
al. in [ CITATION Adn12 \l 1033 ]. GP and AdaBoost are two renowned learning
approaches. Genetic Programming has shown promising performance for tasks such as
association discovery, regression, and clustering, among others; the advantages
and flexibility offered by the GP algorithm also make it a good choice for
classification, and hence for prediction systems as well. AdaBoost, on the other
hand, is a boosting technique that combines several weak classifiers to create a
stronger, higher-level one.


Listing 3-2: Pseudocode of GP-AdaBoost algorithm

In the GP-AdaBoost ensemble, every base classifier in the subsequent steps is
evolved to identify the ‘hard samples’ that were incorrectly classified by the
preceding classifier. The algorithm first splits the dataset into training and
testing folds. Multiple GP programs are then evolved per class, as specified by
the elite size, with the sample weights boosted in AdaBoost style. The
optimization benefits offered by GP are enhanced by adding AdaBoost-style boosting
to the mix, which is why this ensemble performs better. The fitness function in
this case is the prediction accuracy. In the final step, the resulting programs
make the classification by taking, for each class, a weighted sum of the outputs
of its GP programs and choosing the class with the highest total. The
AdaBoost-style boosting concept is used only to extend the weight updating
required to handle misclassified samples.


Figure 3-8: A flowchart depicting how the GP-AdaBoost approach works

It has been observed that integrated boosting is much more effective in terms of
both results and time, since it is not necessary to generate a whole new
population for each subsequent program of the classes. In summary, when a test
data instance is given to the GP-with-boosting algorithm, the following steps are
carried out:

 All mini programs individually produce an output using the input data
 The results are summed up for each class
 The class with the maximum output is identified, i.e. the final prediction is
made
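The final voting step can be sketched in a few lines of NumPy. The program outputs and weights below are random placeholders for what the evolved GP programs and their AdaBoost-style weights would actually provide; the elite size of 4 is a hypothetical value.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classes, programs_per_class = 12, 4   # 12 APT classes, hypothetical elite size

# Placeholder outputs of the evolved "mini programs" for one test instance
# (one row per class) and their AdaBoost-style program weights.
outputs = rng.random((n_classes, programs_per_class))
weights = rng.random((n_classes, programs_per_class))

# Sum the weighted program outputs per class; the argmax is the prediction.
class_scores = (outputs * weights).sum(axis=1)
prediction = int(np.argmax(class_scores))
```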


Chapter 4: Experimentation Details

This chapter covers the details of the dataset used for classification, the
implementation details (development environment, technologies, etc.), the
experimentation carried out, and its outcomes.

4.1. Dataset
There is no standard dataset available on the internet for APT malware. Most
researchers have to create their own malware datasets suited to their research
strategy, but this requires a lot of time and resources, for example setting up a
honeypot on a web server to monitor transmitted files and collecting suspicious
files into a dataset. Since we did not have the required time and resources, I
used a non-standard APT malware dataset that a GitHub user collected in his
repository.

For this research, I used the APTMalware dataset available on GitHub. The owner
created this dataset by collecting hashes of APT malware detected across different
domains on the internet by FireEye, a cyber surveillance and threat hunting
company, and obtained the actual malware files from VirusTotal, a cyber security
company which collects and sells malware data globally.

The malware files are PE executable files given the ‘.file’ extension for safety,
so that no one accidentally double-clicks and runs the malware on their computer,
and they are named with their MD5 hash signatures. All files are sorted by their
group/family name and placed in a folder with the name of the group. The dataset
is highly imbalanced, with 32 samples for the APT 19 group and 964 for the Gorgon
Group. The details of the dataset are shown in Table 4-1.

Table 4-1: APTMalware dataset details

Country        APT Group        Family            Requested    Downloaded

China          APT 1                              1007         405
China          APT 10           i.a. PlugX        300          244
China          APT 19           Derusbi           33           32
China          APT 21           TravNet           118          106
Russia         APT 28           Bears             230          214
Russia         APT 29           Dukes             281          281
China          APT 30                             164          164
North Korea    DarkHotel        DarkHotel         298          273
Russia         Energetic Bear   Havex             132          132
USA            Equation Group   Fannyworm         395          395
Pakistan       Gorgon Group     Different RATs    1085         961
China          Winnti                             406          387
               Total                              4449         3594

All samples are named according to their SHA-256 hash and grouped by APT group.

4.2. Development Environment


Python is the most popular language among Artificial Intelligence researchers
because it is easy to use and allows a lot of work to be done with just a few
lines of code. Furthermore, with modern computers equipped with powerful GPUs, the
relative performance of Python versus other languages such as C++ is no longer an
issue. The AI community continues to build and share useful resources online; as a
result, Python has more data science related libraries and frameworks than other
languages, and it has the biggest community of AI researchers.

The first part of the project, which includes conversion of the dataset into image
form, dataset preprocessing, splitting into train and test subsets, training the
CNNs on the train subset, and taking predictions on the test subset, was all done
in Anaconda, a Python environment for data science work. I used Jupyter Notebook
for most of the tasks because of the ease and efficiency it provides in coding and
debugging. The frameworks and libraries used in this project are as follows:

 Anaconda, Jupyter Notebook, TF-Slim, TensorFlow, Keras
 Libraries: Scikit-learn, Matplotlib, NumPy, Pandas, Seaborn, Pillow
 Weka


For the second part, the meta-classification task, I used Weka, an open-source
machine learning and data analysis tool provided by the University of Waikato. It
has most of the common machine learning algorithms built in, along with most of
the necessary data preprocessing tools. Weka is developed in Java, so it runs on
any machine with Java installed. Weka accepts data in CSV format, among other file
formats, so I converted the prediction dataset into a CSV file; all further
preprocessing was done inside Weka.

Due to the Covid-19 pandemic, I did not have access to the Pattern Recognition
Lab’s resources; therefore, all of this work was done on a Windows 10 x64 Dell
laptop with an Intel Core i5 4th-generation processor, an Intel integrated
graphics 4000 series GPU, and 8 GB of RAM.

4.3. Experimentation Details

4.3.1. Transfer Learning using DNN Models

TF-Slim is a Python-based high-level API that offers good implementations of
different well-known DNN architectures like Inception, Inception-v2,
Inception-ResNet-v2, and more. This library has been used for the transfer
learning task with DNN models in our project.

TF-Slim takes its input as a TF-Record (TFR) list, a data type generated from
images in which each entry contains tags like image height, image width, color
channels, etc. that describe the images it contains. In order to use our APT
malware images with a DNN model, we first have to convert them to the TFR data
format. This can be achieved with the help of a Python script included in the
official TF-Slim repository named download_and_convert_data.py. To use the script,
run the following command at the Anaconda command prompt:

python download_and_convert_data.py ^
--dataset_name=APTMalware ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp

After generation, the TFR directory will contain the following separate train and
validation files:


Listing 4-3: Example of TFR directory files after data conversion

APTMalware_train_00000-of-00005.tfrecord
APTMalware_train_00001-of-00005.tfrecord
APTMalware_train_00002-of-00005.tfrecord
APTMalware_train_00003-of-00005.tfrecord
APTMalware_train_00004-of-00005.tfrecord
APTMalware_validation_00000-of-00005.tfrecord
APTMalware_validation_00001-of-00005.tfrecord
APTMalware_validation_00002-of-00005.tfrecord
APTMalware_validation_00003-of-00005.tfrecord
APTMalware_validation_00004-of-00005.tfrecord
labels.txt

The next step in the implementation is the training and evaluation of the DNN
models. The following script shows how to train the Inception-ResNet model on a
dataset:

Listing 4-4: Some parameters for training of DNN model from scratch

python train_image_classifier.py ^
--train_dir=C:\Users\PC\Downloads\slim\tmp\Train ^
--dataset_name=APTMalware ^
--dataset_split_name=train ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp\Dataset ^
--model_name=inception_resnet_v2

Instead of training a model from scratch, it is recommended to fine-tune the
parameters of an architecture that has previously been trained on a large dataset
such as ImageNet. To indicate a checkpoint from which to fine-tune, we call
training with the --checkpoint_path flag and assign it the absolute path of a
checkpoint file.

During fine-tuning, care should be taken about the restoration of checkpoint
weights. In particular, when fine-tuning a model for a new task with a different
number of output labels, the final logits (classification) layer should not be
restored; this is the reason the --checkpoint_exclude_scopes flag is used. This
flag prevents certain variables from being initialized from the checkpoint. When
fine-tuning on a classification task with a different number of classes, the new
model will have a classification layer whose dimensions differ from those of the
pre-trained model. For example, if we fine-tune a model trained on ImageNet using
the APTMalware dataset, the pre-trained final layer will have dimensions
[2048 x 1001] because of the ImageNet task, while the new logits layer will be


of dimension [2048 x 12]. Hence, this flag indicates to the API not to load these
weights from the checkpoint file.

We have to keep in mind here that warm-starting the fine-tuning from a checkpoint
affects the model’s weights only during initialization. As soon as the model
starts training, a new checkpoint file is generated in the tmp\Train directory. If
fine-tuning is restarted, the weights are restored not from the checkpoint_path
but from this new checkpoint file. Consequently, the flags --checkpoint_path
and --checkpoint_exclude_scopes are used only for the 0-th global step (the
initialization of the model). Normally we only want to train a subset of layers
when fine-tuning, so the --trainable_scopes flag lets us decide which layers
should be trained and which should remain unchanged.

Below is how we fine-tune Inception-ResNet-v2, originally trained on ImageNet with
1000 classes, on the APTMalware dataset with multi-class numerical labels. Since
the dataset is quite small, we will only train the new layers.

python train_image_classifier.py ^
--train_dir=C:\Users\PC\Downloads\slim\tmp\Train ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp\Dataset ^
--dataset_name=APTMalware ^
--dataset_split_name=train ^
--model_name=inception_resnet_v2 ^
--checkpoint_path=C:\Users\PC\Downloads\slim ^
--checkpoint_exclude_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits ^
--trainable_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits

Listing 4-5: Command for fine-tuning a pre-trained DNN model

To evaluate the performance of a model (whether pretrained or our own), we can use
the eval_image_classifier.py script, as shown below.

python eval_image_classifier.py ^
--alsologtostderr ^
--checkpoint_path=C:\Users\PC\Downloads\slim\tmp\Train ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp\Dataset ^
--dataset_name=APTMalware ^
--dataset_split_name=validation ^
--model_name=inception_resnet_v2

Once evaluation is done, we shall have the final accuracy of the model as the output.


The experimentation in this step mostly involves modifying the parameters of the DNN
models, including the following values in the command:

...
--max_number_of_steps=7188 ^
--batch_size=5 ^
--learning_rate=0.01 ^
--learning_rate_decay_type=fixed ^
--save_interval_secs=600 ^
--save_summaries_secs=60 ^
--log_every_n_steps=10 ^
--optimizer=rmsprop ^
--weight_decay=0.00004
...

We can make some intuitive deductions about these values: since the dataset is small,
the batch_size has to be small, and the learning_rate should be low as well. The
other parameters can be tuned through trial-and-error experimentation.
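As a rough sanity check on these numbers, the relationship between steps, batch size, and epochs can be written down directly (the training-split size of 3,594 images used below is purely hypothetical, for illustration):

```python
def images_seen(max_steps, batch_size):
    # Each training step consumes one batch, so total examples = steps * batch size.
    return max_steps * batch_size

def epochs(max_steps, batch_size, train_set_size):
    # Number of full passes over the training split.
    return images_seen(max_steps, batch_size) / train_set_size

print(images_seen(7188, 5))   # 35940 examples processed in total
print(epochs(7188, 5, 3594))  # 10.0 epochs for a hypothetical 3,594-image split
```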

ResNet34 and AlexNet are implemented using public GitHub repositories. Their
implementation is straightforward. They need a dataset folder with one subfolder of
images per class, plus a text-based label file listing each image path and its
corresponding numerical class label. For training and testing, one runs the train.py
and test.py scripts included in the repositories.

The custom DNN model is implemented in the Keras framework with the following
parameters:

For the transfer learning task using AlexNet, we have to remove the final layer,
which is a 1000-way softmax for ImageNet's 1000 classes. We replace it with a layer
containing only twelve neurons for our 12-class classification problem.
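What this replacement amounts to can be sketched in plain Python (a toy 4-dimensional feature vector stands in for AlexNet's 4096-dimensional penultimate output, and the Keras/TensorFlow mechanics are omitted):

```python
import math

NUM_CLASSES = 12  # the twelve APT families in our problem

def softmax(logits):
    """Numerically stable softmax, the operation the new final layer applies."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def new_head(features, weights, biases):
    """A 12-way classification head: one neuron (row of weights) per class."""
    logits = [sum(f * w for f, w in zip(features, row)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Tiny illustrative dimensions: 4 features instead of AlexNet's 4096.
probs = new_head([1.0, 0.5, -0.2, 0.3],
                 [[0.1] * 4 for _ in range(NUM_CLASSES)],
                 [0.0] * NUM_CLASSES)
print(len(probs), round(sum(probs), 6))  # 12 classes, probabilities sum to 1
```

Only this head is trained from scratch; the convolutional features feeding it keep their ImageNet-learned weights.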

We experimented with different parameters and modifications of the DNN models. The
modifications and their effects are summarized in Table 4-4.

Table 4-4: A summary of modifications in DNN architectures and their effects

Experiment                                   Result
Removal of dropout layers                    Slight improvement
Addition of convolution layer(s)             Improvement, then degradation
Removal of multiple fully-connected layers   Significant degradation
Increasing number of epochs                  Slight improvement, then degradation
Reducing batch size                          Slight improvement, then degradation
Reducing the learning rate                   Slight improvement

4.3.2. GP-AdaBoost Parameters

Consider the GP-AdaBoost pseudocode in Listing 2-1. It provides insight into some of
the parameters used by the algorithm, which are discussed below.

Table 4-5: General parameters of GP-AdaBoost algorithm


Parameter name Value
Number of generations 20
Elite size 5
Fitness function Prediction accuracy
Functions +,-,/,*,If,<,>,Pow,&,|,Max,Min,Exp,Log
Max depth of trees 5
Cross-over 0.9
Mutation 0.07
Reproduction of new programs 0.03
Population size 100
Population initializer Ramped half-and-half

The elite size specifies the number of GP programs retained per class. Population
generation is controlled with a crossover rate of 0.9, a mutation rate of 0.07, and a
new-program reproduction rate of 0.03; the high crossover rate ensures diversity in
each subsequent generation. The ramped half-and-half method is used for population
initialization. Since the focus of the APT classification problem is to attain a
classifier with the highest accuracy on the malware classes, prediction accuracy is
used as the fitness measure to evaluate the suitability of the evolved classifiers.
The parameters used for GP-AdaBoost learning are provided in Table 4-5.
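The ramped half-and-half initializer named in the table can be illustrated with a small, generic sketch (this is not Weka's implementation; the function subset, terminal set, and the 0.3 terminal probability for "grow" trees are arbitrary illustrative choices):

```python
import random

FUNCTIONS = ["+", "-", "*", "/"]  # subset of the function set in Table 4-5
TERMINALS = ["x0", "x1", "1.0"]

def make_tree(depth, full, rng):
    """Build one GP tree. 'full' forces function nodes until max depth is hit;
    'grow' may stop early with a terminal."""
    if depth == 0 or (not full and rng.random() < 0.3):
        return rng.choice(TERMINALS)
    return [rng.choice(FUNCTIONS),
            make_tree(depth - 1, full, rng),
            make_tree(depth - 1, full, rng)]

def tree_depth(node):
    if isinstance(node, str):
        return 0
    return 1 + max(tree_depth(node[1]), tree_depth(node[2]))

def ramped_half_and_half(pop_size, max_depth, seed=0):
    """Alternate 'full' and 'grow' trees while ramping depths 2..max_depth."""
    rng = random.Random(seed)
    pop = []
    for i in range(pop_size):
        depth = 2 + (i // 2) % (max_depth - 1)  # ramp depths 2..max_depth
        pop.append(make_tree(depth, full=(i % 2 == 0), rng=rng))
    return pop

population = ramped_half_and_half(100, 5)
print(len(population), max(tree_depth(t) for t in population))  # 100 5
```

This mix of shapes and depths is what gives the initial generation its structural diversity before crossover and mutation take over.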

APT classification is in fact a multi-class classification problem. The GP-AdaBoost
algorithm used here treats it as a set of one-class problems, where the involved
classes are handled separately. Multiple one-class classifiers per class are evolved
using AdaBoost-style boosting into one big collection.


The weighted outputs are then summed for each binary class, and the highest total
indicates the class to which the test instance belongs. The experimental results show
the proposed approach achieving higher accuracy for classifying APT malware.
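The per-class summation described above can be sketched as follows (the family names, votes, and AdaBoost weights are toy values for illustration, not measured outputs):

```python
def predict_class(instance_scores, alphas):
    """Sum the weighted votes of each class's boosted one-vs-rest ensemble and
    return the class with the highest total."""
    totals = {}
    for label, outputs in instance_scores.items():
        # outputs[i] in {-1, +1}: vote of the i-th boosted GP program for this class
        totals[label] = sum(a * o for a, o in zip(alphas, outputs))
    return max(totals, key=totals.get), totals

# Three APT families, three boosting rounds with AdaBoost weights 'alphas'.
scores = {
    "APT1":  [+1, +1, -1],
    "APT10": [-1, +1, +1],
    "APT28": [-1, -1, -1],
}
label, totals = predict_class(scores, alphas=[0.9, 0.6, 0.4])
print(label)  # APT1, the family whose weighted sum is largest
```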

For the implementation part, Weka is used as mentioned earlier. It is a Java-based
machine learning tool used for data analysis. Weka ships with implementations of most
common machine learning and data processing algorithms. We converted the predictions
from the CNN models into CSV format to make them compatible with Weka's input. After
the dataset is loaded in the Weka Explorer with the Open file button under the
Preprocess tab, different preprocessing functions can be selected for cleaning and
arranging the dataset before the classification task. As our dataset does not need
preprocessing, we select the classifier from the Choose button under the Classify
tab. After expanding the classifiers and then the meta options, we select the
AdaBoostM1 classifier. We then click on AdaBoostM1 in the window and set the
parameters for AdaBoost. After that, clicking on the last argument of AdaBoost, we
change the base classifier, which is Decision Stump by default, to Genetic
Programming under classifiers and functions in the Choose menu. Parameters for
Genetic Programming can be changed by clicking on GeneticProgramming in the window,
as shown in the figures below.
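The CSV-conversion step mentioned above can be sketched as follows (the column names and prediction values are hypothetical; in Weka's CSV loader the last attribute conventionally serves as the class):

```python
import csv
import io

def predictions_to_csv(rows, header):
    """Write one row per test sample: the per-model predictions followed by
    the true class label, in the attribute-last layout Weka expects."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical prediction-space rows: four model predictions plus the true label.
header = ["alexnet", "resnet34", "inception_resnet_v2", "custom_dnn", "class"]
rows = [[3, 3, 5, 3, 3],
        [11, 11, 11, 11, 11]]
print(predictions_to_csv(rows, header).splitlines()[0])
```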

Figure 15: An overview of the Weka interface (1 of 4)


Figure 16: An overview of the Weka interface (2 of 4)

Figure 17: An overview of the Weka interface (3 of 4)


Figure 18: An overview of the Weka interface (4 of 4)

Chapter 5: Performance Analysis

This chapter provides a complete performance analysis and discussion of the results
produced by the proposed techniques. First, we briefly describe the metrics used for
the performance analysis, and then move on to the results and discussion.

5.1. Performance Metrics


The choice of performance measures for the analysis of a proposed solution is
important for several reasons. First and foremost, we have to compare our own results
with those reported in the literature and in currently-proposed solutions, so that a
fair and balanced comparison can be made. Next, in order to retain consistency of
results throughout the training, testing, and deployment phases, a good choice of
performance metrics helps researchers make more informed decisions. A well-chosen and
consistent performance metric also ensures that the reported results are bias-free.

For our project, the following performance metric has been chosen.


5.1.1. Prediction Accuracy

Prediction accuracy measures the proportion of correct predictions made on the data,
in percentage terms. For example, a prediction accuracy of 75% means that of every
100 predictions made by the system, 75 were correct and the remaining 25 were
incorrect. Accuracy is closely related to prediction error: in our example, the
prediction error would be 25%.

The reason for choosing this measure is that most previously-reported models use it
as well; this choice helps compare different models consistently.
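The metric can be written down directly; the worked example below mirrors the 75%/25% split described above:

```python
def prediction_accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# 3 of 4 predictions correct -> 75% accuracy, 25% prediction error.
acc = prediction_accuracy([0, 1, 2, 2], [0, 1, 2, 3])
print(acc)        # 75.0
print(100 - acc)  # 25.0, the prediction error
```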

5.2. Results & Discussion


We shall first see how well the Transfer Learning task performs i.e. the performance
of the Convolutional Neural Networks models; then, we shall have a look at the
overall performance of the ensemble model (both with and without using the
prediction space from DNN models).

5.2.1. Transfer Learning & DNN Results

As we saw in Section 3.3, four different Convolutional Neural Networks have been used
for the Transfer Learning task in our project: AlexNet, ResNet34,
Inception-ResNet-v2, and a custom DNN model. All four models have their own set of
tunable parameters, and the degree of control over the architecture varies between
them. Inception-ResNet-v2 was implemented in Python's TF-Slim library, the custom
CNN in Keras, and the other two in the TensorFlow framework.

Still, in order to keep the results consistent, the same performance measure of
prediction accuracy has been used throughout. Wherever the libraries and APIs allow,
relevant graphs and charts have also been plotted.

Table 5-6 presents a summary of the performance of the four deep neural networks,
averaged over 10 runs, on the APTMalware dataset.

Table 5-6: Prediction accuracy of the four DNNs on APTMalware dataset in Transfer
Learning task

Runs       Prediction Accuracy (%)
Run 1      86.97   57.83   72.92   88.88
Run 2      87.45   58.65   73.08   87.92
Run 3      87.77   57.44   71.57   89.40
Run 4      85.53   57.25   73.60   89.65
Run 5      87.80   59.48   72.41   87.06
Run 6      86.96   57.98   73.35   89.84
Run 7      85.36   57.17   73.41   89.26
Run 8      87.20   56.27   71.20   88.43
Run 9      86.37   59.55   73.40   89.55
Run 10     86.74   57.05   72.37   89.83
Average    86.62   57.63   73.03   89.31

An assortment of confusion matrices showing the test results of the DNN models used
for transfer learning is given below in Figure 19, Figure 20, Figure 21, and
Figure 22.

Figure 19: Confusion matrix showing the test results of Inception-ResNet-v2 model
on APTMalware dataset


Figure 20: Confusion matrix showing the test results of custom DNN model on
APTMalware dataset

Figure 21: Confusion matrix showing the test results of AlexNet model on
APTMalware dataset


Figure 22: Confusion matrix showing the test results of ResNet34 model on
APTMalware dataset

5.2.2. Overall Results of the Proposed Method

After appending the prediction results obtained from the different neural network
architectures to the original test dataset, we can use this prediction-space dataset
to train, and obtain predictions from, the GP-AdaBoost ensemble model.

Table 5-7: Final results obtained from GP-AdaBoost ensemble on APTMalware dataset

Runs       Prediction Accuracy (%)
Run 1      92.07
Run 2      91.86
Run 3      92.30
Run 4      92.65
Run 5      91.42
Run 6      92.38
Run 7      92.74
Run 8      92.40
Run 9      91.91
Run 10     92.84
Average    92.59
The performance results of the proposed method on the APTMalware dataset are
reproduced in Table 5-7 as an average of 10 independent runs of the algorithm.
Furthermore, the confusion matrix and the area under the ROC curve are shown in the
figures below.

Figure 23: Confusion matrix of the test results of GP-AdaBoost on APTMalware dataset

These results show that a decent improvement can be achieved by using our ensemble
approach instead of an approach that uses Transfer Learning alone or GP-AdaBoost
alone. Further improvements could be made to these results by selecting another DNN
architecture and modifying it in depth for the image input data.


Chapter 6: Conclusion & Future Work

6.1. Summary of the Research


This thesis reported the research and experimentation work carried out in the domain
of APT malware classification on the APTMalware dataset. The methodology
proposed hereby uses several concepts from advanced machine learning, including
but not limited to Deep Neural Networks, Transfer Learning, Ensemble and Meta
classification, Adaptive Boosting, and Genetic Programming. The flow of the
proposed technique goes as follows. First, we convert the data from executable to
image format and then use Transfer Learning to fine-tune several CNN models on these
malware images. Next, the predictions obtained from the CNN models are stored in a
CSV file along with their corresponding labels, and this prediction-space dataset is
used to train and test the GP-AdaBoost ensemble classifier. The novelty of this
technique lies in converting malware executables into images, using these images to
train convolutional neural networks, and using the predictions from these models as
input for the GP-AdaBoost meta-classifier.

The purpose of the extended experimentation carried out as part of this thesis was to
ascertain the following things:

 Can the process of Transfer Learning be used for APT malware classification?
o Yes. As we saw, transfer learning provided very good classification results.
 Does using the GP-AdaBoost ensemble improve classification accuracy over the
predictions of the base learners (CNN models)?
o Yes. A noticeable increase in classification accuracy was observed when
predictions from the base learners (DNN models) were used to train and then test
GP-AdaBoost.
 Is there one best way to represent an APT malware file in image format before
feeding it to the neural network models?
o Yes. Following the technique described by Nataraj et al., reading the malware
executable as a binary file, sampling it into 8-bit values, and then converting
it into an image whose width is chosen relative to its size produces the best
results. Malware executables can be of various sizes, and choosing the width
this way ensures the similarity of images from different-sized malware samples
of the same family.
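The width-selection scheme described above can be sketched as follows; the size-to-width breakpoints below are reproduced from memory from Nataraj et al.'s paper, so treat them as indicative rather than authoritative:

```python
def image_width(num_bytes):
    """Image width as a function of file size, following Nataraj et al.'s
    scheme (breakpoints reproduced from memory, indicative only)."""
    kb = num_bytes / 1024
    for limit, width in [(10, 32), (30, 64), (60, 128), (100, 256),
                         (200, 384), (500, 512), (1000, 768)]:
        if kb < limit:
            return width
    return 1024

def bytes_to_rows(data):
    """Sample the binary into 8-bit values and fold them into image rows,
    where each byte becomes one grey-scale pixel."""
    width = image_width(len(data))
    pixels = list(data)
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return rows, width

rows, width = bytes_to_rows(bytes(range(256)) * 100)  # a 25,600-byte toy "binary"
print(width, len(rows))  # a 25 KB file maps to a 64-pixel-wide image of 400 rows
```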

Although the results reported in this thesis are at least as good as, or better than,
other techniques reported in the literature, there is still room for improvement,
which is briefly described in the Future Work section below.

6.2. Future Work


Our experiments, which achieved high accuracy with a small and unbalanced dataset,
establish that a deep learning approach to APT malware image classification is an
accurate and efficient way to classify APTs compared to less accurate static analysis
and highly resource-intensive dynamic analysis of APT malware. But there is still
room for improvement in this research work. From our point of view, there are the
following possible ways to improve the results for anyone carrying the research
forward based on our methodology:

 The most important improvement would be to adopt a new, balanced, and standard
APT malware dataset, should one become available in the future
 Improve the Transfer Learning-based CNN models' results by experimenting
with diverse modifications of architectures, parameters, and hyperparameters

Using these methods, it is quite possible to improve the results of the proposed
technique.

References

[1] A. Castrounis, "Artificial Intelligence, Deep Learning, and Neural Networks,


Explained," KDnuggets, October 2016. [Online]. Available:
http://www.kdnuggets.com/2016/10/artificial-intelligence-deep-learning-neural-
networks-explained.html. [Accessed October 2017].
[2] Theano Development Team, "Convolutional Neural Networks (LeNet)," January
2015. [Online]. Available: http://deeplearning.net/tutorial/lenet.html. [Accessed
October 2017].
[3] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image
Recognition," in Computer Vision and Pattern Recognition.
[4] S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on
Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[5] J. Yosinski, J. Clune, Y. Bengio and H. Lipson, "How transferable are features
in deep neural networks?," in Advances in Neural Information Processing
Systems, Montreal, 2014.
[6] "Class: StackingClassifier," MLXtend, 2017. [Online]. Available:
https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/.
[Accessed 30 October 2017].
[7] C. Szegedy, S. Ioffe, V. Vanhoucke and A. Alemi, "Inception-v4, Inception-
ResNet and the Impact of Residual Connections on Learning," in AAAI 2016,
2016.
[8] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with
Deep Convolutional Neural Networks," in Advances in Neural Information
Processing Systems, 2012.
[22] Y. Freund and R. E. Schapire, "A Short Introduction to Boosting," Journal of
Japanese Society for Artificial Intelligence, vol. 14, no. 5, pp. 771-780, 1999.
[23] Genetic Programming Inc., "How Genetic Programming Works," Genetic
Programming Inc., 8 July 2007. [Online]. Available: http://www.genetic-
programming.org/. [Accessed October 2017].
[24] TensorFlow, "TensorFlow Slim Repository," TensorFlow, September 2017.
[Online]. Available:
https://github.com/tensorflow/models/tree/master/research/slim. [Accessed
October 2017].
[25] K. Sin, "Transfer learning in TensorFlow using a pre-trained inception-resnet-v2
model," 11 February 2017. [Online]. Available:
https://kwotsin.github.io/tech/2017/02/11/transfer-learning.html. [Accessed
October 2017].

