Yahya Thesis - Draft
deep NN
By
Yahya
Declaration of Originality
I hereby declare that the work contained in this thesis and the intellectual content of
this thesis are the product of my own work. This thesis has not been previously
published in any form, nor does it contain any verbatim content from published resources
which could be treated as an infringement of international copyright law.
I also declare that I do understand the terms ‘copyright’ and ‘plagiarism,’ and that in
case of any copyright violation or plagiarism found in this work, I will be held fully
responsible for the consequences of any such violation.
Signature: _______________________________
Name: Yahya
Certificate of Approval
This is to certify that the work contained in this thesis entitled
Yahya
was carried out under my supervision and that, in my opinion, it is fully adequate, in
scope and quality, for the degree of BS Computer and Information
Sciences from Pakistan Institute of Engineering and Applied Sciences
(PIEAS)
Approved By:
Signature: ________________________
Verified By:
Signature: ________________________
Head, DCIS
Stamp:
Dedication
Acknowledgement
Gratitude and endless thanks to Allah Almighty, the Lord of the Worlds, who
bestowed upon mankind the light of knowledge through laurels of perception,
learning and reasoning, in the way of searching, inquiring and finding the ultimate
truth. To whom we serve, and to whom we pray for help.
I would also like to thank Pakistan Institute of Engineering and Applied Sciences and
the Pattern Recognition Lab, DCIS for providing a very conducive educational
environment and adequate resources to carry out the research contained herein.
Yahya
Dept. of Computer & Information Sciences (DCIS)
PIEAS, Nilore, Islamabad
Table of Contents
CERTIFICATE OF APPROVAL
DEDICATION
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOMENCLATURE
ABSTRACT
CHAPTER 1: INTRODUCTION
4.1. Dataset
REFERENCES
List of Figures
Figure 1: Financial damage done by cyber attacks in 2019
Figure 2: An overview of a complete APT classification model
Figure 3: The working of a single neuron (perceptron) within a neural network
Figure 4: A convolutional neural network with different labelled layers
Figure 5: A single ResNet block with the identity mapping
Figure 6: A comparison of the learning processes of (a) Traditional machine learning tasks (b) Transfer learning between different tasks
Figure 7: An overview of the functionality of a stacked (meta) classifier
Figure 8: An overview of the Genetic Programming approach
Figure 9: An overview of the proposed methodology
Figure 10: Procedure followed by the proposed method w.r.t the input data splits
Figure 11: An overview of model predictions in csv file
Figure 12: Visual illustration of file to image conversion
Figure 13: Sections obtained from PE file and information obtained from images
Figure 14: A skip connection in ResNet network
Figure 15: 1-An overview of Weka interface
Figure 16: 2-An overview of Weka interface
Figure 17: 3-An overview of Weka interface
Figure 18: 4-An overview of Weka interface
Figure 19: Confusion matrix showing the test results of Inception-ResNet-v2 model on APTMalware dataset
Figure 20: Confusion matrix showing the test results of custom DNN model on APTMalware dataset
Figure 21: Confusion matrix showing the test results of AlexNet model on APTMalware dataset
Figure 22: Confusion matrix showing the test results of AlexNet model on APTMalware dataset
Figure 23: Confusion matrix of the test results of GP-AdaBoost on APTMalware dataset
List of Tables
Table 2-1: Different settings of Transfer Learning
Table 3-1: Width of malware images w.r.t their sizes
Table 4-1: APTMalware dataset details
Table 4-2: A summary of modifications in DNN architectures and their effects
Table 4-3: General parameters of GP-AdaBoost algorithm
Table 5-1: Prediction accuracy of the four DNNs on APTMalware dataset in Transfer Learning task
Table 5-4: Final results obtained from GP-AdaBoost ensemble on APTMalware dataset
Nomenclature
ANN Artificial Neural Network
EA Evolutionary Algorithm
GP Genetic Programming
TL Transfer Learning
Abstract
In the past decade, there have been many APT attacks, such as Stuxnet, Cozy Bear,
Ocean Lotus and WannaCry, which made various government and private organizations
suffer much larger financial and intellectual property losses than common malware
has ever caused. APT malware, unlike common malware, poses a greater threat to
organizations. Considerable research has been done on the classification of APT
malware to mitigate its activities and shorten the time between attack and
detection. Traditional malware analysis includes static and dynamic analysis. In static
analysis, the malware binary is analyzed without executing it, while dynamic analysis
is based on executing the binary in an emulated environment and recording its API
calls, memory access patterns, network usage and so on. But APT malware keeps
getting more complex and advanced and has found ways to bypass traditional
analysis techniques.
Chapter 1: Introduction
In the past decade, machine learning and especially deep learning models have gained
popularity in image classification, because on some benchmarks their classification
error is lower than that of humans. So, theoretically, any data that can be represented
in image form can be classified into categories with deep learning models. This research
work is in the same direction. We used deep learning models to classify APT malware
represented in image form.
In recent years, cyber attacks have been witnessed which have gained more strength and
done more damage than the malware of the past decade combined. As an IC3 report
shows, the total financial damage done in 2019 was about $3.5 billion. This is apart
from the huge amount of intellectual property that has been stolen from major
corporations and government organizations.
Most of this damage is done by a special type of malware called APT malware,
because its complexity and sophistication make it harder for antivirus software to
detect, classify and remove it from the system. Such malware has dedicated authors
who receive huge funding and support from organizations. They keep trying to discover
new exploits and weaknesses in the systems and software used by the target
organization and thus succeed in bypassing its installed defense systems easily.
APT classification is a major problem in the process of malware risk mitigation.
When an APT attack is detected, the best strategy is to classify the detected
malware to find its author, because malware samples belonging to the same group share
major similarities in code, tools used and methodologies. Once the author or
group of the APT malware is found, it is easier to mitigate the risk or damage done by
the attack by following measures specific to that APT group.
Therefore, there is a need to come up with new and innovative ideas and approaches
to mitigate such threats. In other words, there is a need for efficient solutions to the
APT classification problem.
Figure 2 presents an overview of such a prediction model that can be used for APT
classification.
Finding the perfect solution for a given class of problems is, however, still not
possible. No matter how much computational power is at our disposal, there will
always remain a margin of error because of noise, missing values, discrepancies and
errors in the input data, limitations and restrictions on learning and implementation,
etc. Moreover, a model that works well for a problem may or may not work just as
well for some other problem. Therefore, researchers have to try and experiment with a
range of models and tune their parameters in order to settle on a model that works
well for a given problem.
Artificial Neural Networks are computational models inspired by the human
brain. They are composed of simple computation units called perceptrons (also
known by the name of their biological equivalent, neurons) and rely on numerical input
vectors and a threshold value for their output. Consider such a unit, shown in
Figure 3 (source: [ CITATION Ale16 \l 1033 ]).
The neuron’s input is actually a weighted sum of several input values, some of which
might be coming from other neurons. The varying weights mean that each input
value can have a different influence on the result (in other words, how important that
input is to the overall result). A neuron applies a transformation (represented by
the transfer function in Figure 3) to the input before evaluating it against an
activation threshold. If the transformed value equals or exceeds the threshold value,
the neuron is said to have fired, or activated, and its output is positive in whatever
terms have been used to define this phenomenon (like outputting 1, or a positive
value, or some other representation), in contrast to when the transformed value stays
below the threshold and the neuron does not fire. A combination of such neurons
stacked within layers constitutes a neural network.
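The behaviour of a single neuron described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `neuron_fires`, the choice of `tanh` as transfer function, and the example inputs, weights and threshold are all assumptions made for the sake of the example.

```python
import numpy as np

def neuron_fires(inputs, weights, threshold=0.0, transfer=np.tanh):
    """Return 1 if the transformed weighted sum reaches the threshold, else 0."""
    weighted_sum = float(np.dot(inputs, weights))  # each input scaled by its weight
    transformed = float(transfer(weighted_sum))    # transfer function (assumed: tanh)
    return 1 if transformed >= threshold else 0

x = np.array([1.0, 0.5, -1.0])  # hypothetical input vector
w = np.array([0.8, 0.2, 0.4])   # hypothetical weights
print(neuron_fires(x, w, threshold=0.3))
```

Here the weighted sum is 0.5, so the neuron fires for a threshold of 0.3 but not for a threshold of 0.5, since tanh(0.5) is roughly 0.46.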
The feature that makes artificial neural networks interesting is the adaptive
weights along the paths of the network. These weights can be optimized by a learning
algorithm using a cost function that tries to determine the best values of these weights
against a required output (which may or may not be known beforehand).
Several other concepts are also involved in the learning process of an artificial neural
network which are required to be taken care of by a researcher working on them.
Some of these concepts and terminologies are defined below.
There are several types of artificial neural networks, each type having its own
strengths and weaknesses. A couple of them, Convolutional Neural Networks (CNNs
or ConvNets) and Residual Neural Networks (ResNets) which have been used in this
project are briefly described below.
Convolutional Neural Networks are the type of deep (a term which refers to a stacked
implementation of neural network units), feed-forward ANNs which use the
mathematical operation of convolution somewhere along their architecture. They have
high learning capacity because of the concept of local receptive fields. In order to
learn more about them, let’s take a look at an example of a convolutional neural
network and its building blocks.
As can be seen in Figure 4, a convolutional neural network does not consist
of convolution layers alone [ CITATION The15 \l 1033 ]; several other components
exist alongside them.
Residual Neural Networks (or ResNets, as they are now commonly known) are based
upon a simple improvement over regular convolutional neural
networks. The improvement involves estimating the difference between the
mapping of the input we want to obtain and the input itself [ CITATION HeK \l 1033 ].
This difference is added to the original input to get the actual mapping, as described
in the following equation:
$$y = F(x, \{W_i\}) + x$$
The same concept is illustrated in Figure 5 (source: [ CITATION HeK \l 1033 ]).
Residual Networks work on the basis of a simple theoretical concept: deeper networks
should perform at least as well as their shallower counterparts. This is intuitive since
we can replicate the shallow model as a deeper model and simply set all of its extra
layers as identity mappings. In the case of ResNets, the network should theoretically
perform better since it learns not only the features in the form of weighted
activations from the previous layer but also the original features because of the
identity mappings, which helps mitigate the problem of vanishing gradients.
The difference between a traditional machine learning solution and a solution based on
transfer learning is illustrated in Figure 6 (based on [ CITATION Yan10 \l 1033 ]).
The need for Transfer Learning arises from the fact that in real life the datasets
available to us are often not very large or well-defined. Training a neural network on
such a dataset with random weight initialization would be a laborious and usually
futile exercise. It is therefore preferred that an artificial neural network that has
previously been trained on a large dataset (like ImageNet with 1.2 million images and
1000 classes) be taken, modified according to the needs of our own problem, and then
fine-tuned on this new data. This has several benefits, like removing the need to design
a new architecture and train it from scratch, a probable improvement in generalization,
and faster training, evaluation, and prototyping times [ CITATION Yos14 \l 1033 ].
Transfer Learning can be of two types: positive, when learning in one context
improves performance in some other context; and negative, when learning in one
context has a negative impact on performance in another context. Moreover, the
settings in which Transfer Learning can occur are provided in Table 2-1.
2.1.5. Meta-classification
Figure 7 illustrates the overall working of this type of learning (source:
[ CITATION Cla17 \l 1033 ]).
the learning stage. The individual learners may not be very good at classification by
themselves (their guess may be only slightly better than random chance), but a
combination of such weak learners can be proven to have a strong classification
ability. The pseudocode of the AdaBoost algorithm is provided below.
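As a rough illustration of the boosting idea, the following sketch implements the textbook binary AdaBoost procedure with one-feature threshold "stumps" as weak learners. It is not the GP-AdaBoost variant used in this work: the stump learner, the function names and the toy data are assumptions chosen to keep the example short.

```python
import numpy as np

def fit_stump(X, y, w):
    """Find the (feature, threshold, polarity) stump with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - t) >= 0, 1, -1)
                err = np.sum(w[pred != y])          # weighted misclassification
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best

def stump_predict(X, stump):
    j, t, polarity, _ = stump
    return np.where(polarity * (X[:, j] - t) >= 0, 1, -1)

def adaboost_fit(X, y, rounds=10):
    """Train an ensemble of weighted stumps; labels y must be +1 or -1."""
    w = np.full(len(y), 1.0 / len(y))               # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = max(stump[3], 1e-10)                  # avoid division by zero
        if err >= 0.5:                              # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)       # vote of this weak learner
        w *= np.exp(-alpha * y * stump_predict(X, stump))  # boost the mistakes
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(alpha * stump_predict(X, s) for alpha, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D data that no single stump can separate.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
model = adaboost_fit(X, y, rounds=3)
print(adaboost_predict(X, model))
```

On this toy data each stump alone misclassifies at least two points, while the weighted vote of three stumps classifies all six correctly.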
Laurenza et al., Lazzeretti et al. and Mazzotti et al. proposed a static-analysis-based
machine learning model to identify APTs. They achieved 95% classification accuracy
using only static features extracted from Portable Executable files and
random forest classifiers.
Lamprak et al. from Zurich, Switzerland proposed a technique based on web request
graphs to classify APT samples into different families.
We used 90% of the original images dataset for training the neural networks.
Let’s call it Subset A.
Another subset consisting of the remaining portion of the dataset
(10%) is used for testing the results of the said neural networks. Let’s
call this portion Subset B.
After training the networks on the training portion of the dataset (Subset A), we
obtained their predictions on the test portion, Subset B. The predictions of the
neural networks for the test dataset are used to create a dataset for training and
testing the GP-AdaBoost ensemble. Since we used Subset B to create this dataset,
we call it B’.
Subset B’ is divided into two portions, Subset C and Subset D, where Subset C is used
for training and Subset D is used for testing of the GP-AdaBoost ensemble.
An overview of the dataset splits and how they are used for testing and training of the
models is given in Figure 10.
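The splitting scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `split`, the fixed seed and the 1000 placeholder image names are hypothetical, the 85%/15% ratio for Subsets C and D is the one stated later for the prediction dataset, and a plain random split is used although a stratified split may be preferable for an unbalanced dataset.

```python
import random

def split(items, first_fraction, seed=0):
    """Shuffle a copy of items and cut it into two parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(round(first_fraction * len(items)))
    return items[:cut], items[cut:]

image_ids = [f"img_{i:04d}" for i in range(1000)]  # hypothetical image names

subset_a, subset_b = split(image_ids, 0.90)  # A trains the DNNs, B tests them
# B' holds one row of DNN predictions for every image in Subset B.
subset_c, subset_d = split(subset_b, 0.85)   # C trains, D tests the ensemble

print(len(subset_a), len(subset_b), len(subset_c), len(subset_d))
```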
Since Subsets A and B are used by Convolutional Neural Network models,
they must be in image form. The original APTMalware dataset is in Portable Executable
(PE) form. This type of data needs to be converted to image form before it can be
processed by a CNN. For this purpose, the technique proposed by Nataraj et al. is used
to convert the binary files of the dataset into image form. A summary of this process
is provided in Section 3.2.1.
Figure 10: Procedure followed by the proposed method w.r.t the input data splits
The predictions of the trained models are stored in a csv file along with their
respective image names and class labels. An outline of the dataset is given in the
figure below. Since Weka does not accept numeric class labels, I converted the labels
to nominal.
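A minimal sketch of how such a prediction file could be written is given below. The column layout, the helper name `write_prediction_rows` and the `c` prefix are assumptions for illustration; the prefix simply turns each numeric class index into a non-numeric string so that a csv reader treats the column as a set of categories rather than as numbers.

```python
import csv
import io

def write_prediction_rows(rows, model_names, out):
    """Write one csv row per image: name, one prediction per model, true label."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["image"] + model_names + ["label"])
    for image_name, predictions, label in rows:
        nominal = [f"c{p}" for p in predictions]   # 3 -> "c3", no longer numeric
        writer.writerow([image_name] + nominal + [f"c{label}"])

models = ["inception_resnet_v2", "resnet34", "alexnet", "custom_dnn"]
rows = [("sample_0001.png", [3, 3, 7, 3], 3)]      # hypothetical predictions
buffer = io.StringIO()
write_prediction_rows(rows, models, buffer)
print(buffer.getvalue())
```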
The malware files contained in the APTMalware dataset are all Portable Executable
(PE) files which are given the extension ‘.file’ to avoid accidental execution. But the
convolutional deep neural networks require the input data to be images so that they
can extract the features used to classify them. In order to do so, we can use
the method described by Nataraj et al. in their paper.
To convert a PE file to an image, we read the malware binary as a vector of bytes.
Then we do 8-bit sampling so that every unit of data represents an 8-bit pixel of
a grayscale image. Then we reshape the array into matrix form and save it as a PNG
image.
After conversion of a malware file to an image file, the image appears to have different
areas that correspond to different sections of the binary executable malware file. That
pattern is similar for malware files that belong to the same APT family. This similarity
of patterns among related malware files is what makes the image-based approach to
malware classification work.
Figure 13: sections obtained from PE file and information obtained from images
The width of the image depends on the size of the malware file, as given in
Table 3-1.
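The conversion can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the width thresholds follow the values usually quoted from Nataraj et al. and should be checked against Table 3-1, and trailing bytes that do not fill a complete row are simply dropped.

```python
import numpy as np

# (upper size limit in kB, image width); assumed values, see the note above.
WIDTH_TABLE = [(10, 32), (30, 64), (60, 128), (100, 256),
               (200, 384), (500, 512), (1000, 768)]

def width_for_size(num_bytes):
    """Pick the image width from the file size."""
    size_kb = num_bytes / 1024
    for limit_kb, width in WIDTH_TABLE:
        if size_kb < limit_kb:
            return width
    return 1024

def bytes_to_image(data, width=None):
    """Interpret raw bytes as 8-bit grayscale pixels arranged row by row."""
    pixels = np.frombuffer(data, dtype=np.uint8)   # one byte = one pixel
    width = width or width_for_size(len(pixels))
    height = len(pixels) // width
    return pixels[: height * width].reshape(height, width)

# In practice `data` would come from open(path, "rb").read().
data = bytes(range(256)) * 8                       # 2 kB of stand-in bytes
image = bytes_to_image(data)
print(image.shape, image.dtype)
```

The resulting array can then be saved as a PNG, for instance with Pillow via `Image.fromarray(image).save(...)`.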
The individual learners, whose predictions are passed to the GP-AdaBoost ensemble
for the meta-classification task, have been chosen as different Deep Neural
Network architectures. The Transfer Learning task is carried out on these neural
network architectures using the APT malware dataset (converted to image form), and
the predictions from these individual DNN models are used for the meta-classification
task in the next step. The DNN models used for this task are:
Inception-ResNet-v2
ResNet34
AlexNet
Custom DNN Model
3.3.1. Inception-ResNet-v2
with multiple sizes of filters (5x5, 3x3, 1x1), and apart from learning filter weights,
the network also learns the filter sizes which work for the given problem.
3.3.2. ResNet34
ResNet is one of the most powerful deep neural networks and achieved
excellent performance results in the ILSVRC 2015 classification challenge. ResNet
has also achieved excellent generalization performance on other recognition tasks and
won first place on ImageNet detection, ImageNet localization, COCO detection and
COCO segmentation in the ILSVRC and COCO 2015 competitions. There are many
variants of the ResNet architecture, i.e. the same concept but with a different number
of layers: ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-110,
ResNet-152, ResNet-164, ResNet-1202, etc. The name ResNet followed by a number of two
or more digits simply denotes the ResNet architecture with that number of
neural network layers. In our research, we use ResNet-34.
The basic logic behind residual networks is that neural networks are good function
approximators. They should be able to easily learn the identity function, where the
output of a function is the input itself.
F(x) = x
Following the same logic, if we bypass the input of the first layer of a block and add
it to the output of the last layer of the block, the network should be able to predict
whatever function it was learning before, with the input added to it.
F(x) + x = h(x)
The intuition is that learning F(x) = 0 has to be easy for the network.
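This intuition can be checked with a small numerical sketch. The block below uses fully-connected layers instead of convolutions to keep it short; the function name `residual_block`, the layer shapes and the random input are assumptions. When the weights of the residual branch are zero, F(x) = 0 and the block reduces to the identity mapping h(x) = x.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Compute h(x) = F(x) + x with a two-layer residual branch F."""
    f = relu(x @ w1) @ w2   # F(x): what the block has to learn
    return f + x            # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # hypothetical batch of 4 vectors

zero = np.zeros((8, 8))
identity_output = residual_block(x, zero, zero)  # F(x) = 0, so h(x) = x
print(np.allclose(identity_output, x))
```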
3.3.3. AlexNet
The architecture is composed of eight layers in total. The first five are convolutional
and the next three are fully-connected, leading to the last 1000-way softmax layer which
produces the distribution over the 1000 class labels. The network maximizes the
multinomial logistic regression objective. It takes a 224x224x3 image as
input, with 96 kernels of size 11x11x3 in the first layer and a stride of 4 pixels.
Further layers have kernels that are smaller in size but have an increasing number of
channels, like 5x5x48 and 3x3x256. The fully-connected layers each have 4096 neurons,
and a final 1000-way softmax layer is present at the end. The network uses data
augmentation and dropout to reduce the effects of overfitting.
For comparison purposes, a custom deep neural network model was also trained after
trial-and-error experimentation with different architecture parameters. The architecture
follows the layer pattern conv-LeakyReLU-pool-conv-LeakyReLU-pool-fc-softmax-class.
It takes a 64x64 input image and applies 3x3x32 filters with no padding at the first
convolution layer, then 2x2 pooling, 3x3x32 filters at the second convolution layer
and 2x2 pooling again, followed by a fully-connected layer, softmax and a
classification layer with 12 neurons (since we have 12 APT classes).
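The spatial size of the feature maps in this custom model can be worked out with the usual output-size formula. The sketch below assumes a convolution stride of 1, no padding in either convolution layer, and non-overlapping 2x2 pooling with stride 2; these settings and the helper names are assumptions, since only the filter and pooling sizes are stated above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output width/height of a pooling layer."""
    return (size - window) // stride + 1

sizes = [64]                              # 64x64 input image
sizes.append(conv_out(sizes[-1], 3))      # first 3x3 convolution
sizes.append(pool_out(sizes[-1]))         # first 2x2 pooling
sizes.append(conv_out(sizes[-1], 3))      # second 3x3 convolution
sizes.append(pool_out(sizes[-1]))         # second 2x2 pooling

flattened = sizes[-1] * sizes[-1] * 32    # inputs to the fully-connected layer
print(sizes, flattened)
```

Under these assumptions the feature maps shrink from 64 to 62, 31, 29 and finally 14 pixels per side, so the fully-connected layer receives 14 x 14 x 32 = 6272 values.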
Keeping the dataset splits from Section 3.1 in view, Subset A is provided to each of the
DNN architectures separately and predictions are acquired from them on Subset B. Then,
by storing the predictions along with the corresponding labels in a csv file, we get the
prediction dataset, which is used for training and testing of the GP-AdaBoost ensemble
with an 85% train and 15% test split.
It has been observed that integrated boosting is much more effective in terms of both
results and time, since it is not necessary to generate a whole new
population for each subsequent program of the classes. In summary, when a test data
instance is given to the GP with boosting algorithm, the following steps are carried out:
All mini programs individually produce an output using the input data
The results are summed up for each class
The class is identified using maximum produced output i.e. final prediction is
made
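The three steps above can be sketched as follows. This is only an illustration of the aggregation step: the function name `predict_class`, the class names and the output values are hypothetical, and the outputs of the evolved mini programs are taken as given numbers rather than computed by actual GP programs.

```python
def predict_class(program_outputs):
    """Sum the outputs of each class's mini programs and pick the largest total.

    program_outputs maps a class name to the list of outputs produced by the
    mini programs evolved for that class on one test instance.
    """
    totals = {cls: sum(outputs) for cls, outputs in program_outputs.items()}
    predicted = max(totals, key=totals.get)
    return predicted, totals

# Hypothetical outputs of three mini programs per class for one instance.
outputs = {
    "APT 19": [0.2, -0.1, 0.3],
    "Gorgon Group": [0.9, 0.4, 0.6],
    "APT 28": [-0.5, 0.1, 0.0],
}
predicted, totals = predict_class(outputs)
print(predicted)
```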
This chapter covers the details about the datasets used for classification, the
implementation details (development environment & technologies, etc.), possibilities
and research into the experimentation theory, and the outcomes of all such
experimentation, etc.
4.1. Dataset
There is no standard dataset on the internet for APT malware. Most
researchers have to create their own malware datasets that suit their research strategy,
but creating such a dataset requires a lot of time and resources, for example setting up
a honeypot on a web server to monitor transmitted files and collecting suspicious
files into a dataset. Since I did not have the required time and resources, I used a
non-standard APT malware dataset which a GitHub user collected in his repository.
For this research, I used the APTMalware dataset available on GitHub. The owner of the
dataset created it by collecting the SHA256 hashes of APT malware detected
across different domains on the internet by FireEye, a cyber surveillance and threat
hunting company, and then buying the actual malware files from VirusTotal, a cyber
security company which collects and sells malware data globally.
The malware files are PE executable files with the ‘.file’ extension for safety, so that
no one accidentally double-clicks and runs the malware on their computer, and they
are named after their hash signature. All the files are sorted on the basis of their
group/family name and placed in a folder with the name of the group. The dataset is
highly unbalanced, with 32 samples for the APT 19 group and 964 for Gorgon Group.
The details of the dataset are shown in Table 4-1.
All samples are named according to their SHA-256 hash and grouped by APT group.
The first part of the project, which includes conversion of the dataset into image form,
dataset preprocessing, splitting into train and test subsets, training CNNs on the train
subset and taking predictions on the test subset, was all done in Anaconda, which is a
Python distribution for data science work. I used Jupyter Notebook for most of the
tasks because of the ease and efficiency in coding and debugging it provides. All
the frameworks and libraries used in this project are as follows:
For the second part, which is the meta-classification task, I used Weka, an open
source machine learning and data analysis tool provided by the University of Waikato.
It has most of the machine learning algorithms built in, as well as most of the data
preprocessing tools. Weka is developed in Java, so it runs on any machine which has Java
installed. Weka accepts data in the form of a ‘csv’ file, among other file formats.
So, I converted the prediction dataset into a csv file, and all the preprocessing was
done inside Weka.
Due to the Covid-19 pandemic, I did not have access to the Pattern Recognition Lab’s
resources; therefore, all this work was done on a Dell laptop running Windows 10 x64
with a 4th generation Intel Core i5 processor, an Intel integrated graphics 4000
series GPU and 8 GB of RAM.
TF-Slim takes its input as a list of TFRecord files. TFRecord is a data format that is
generated from images, and each entry in a TFRecord file contains fields such as image
height, image width and color channels that describe the image it holds. In order to
use our APT malware images with a DNN model, we first have to convert them to the
TFRecord format. This can be achieved with the help of a Python script included in the
official TF-Slim repository named download_and_convert_data.py. To use the script,
run the following command in the Anaconda command prompt:
python download_and_convert_data.py ^
--dataset_name=APTMalware ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp
After generation, the TFRecord directory contains the following separate train and
validation files:
APTMalware_train_00000-of-00005.tfrecord
APTMalware_train_00001-of-00005.tfrecord
APTMalware_train_00002-of-00005.tfrecord
APTMalware_train_00003-of-00005.tfrecord
APTMalware_train_00004-of-00005.tfrecord
APTMalware_validation_00000-of-00005.tfrecord
APTMalware_validation_00001-of-00005.tfrecord
APTMalware_validation_00002-of-00005.tfrecord
APTMalware_validation_00003-of-00005.tfrecord
APTMalware_validation_00004-of-00005.tfrecord
labels.txt
The next step in the implementation is the training and evaluation of the DNN models.
The following command shows how to train the Inception-ResNet-v2 model on a dataset:
Listing 4-4: Some parameters for training of DNN model from scratch
python train_image_classifier.py ^
--train_dir=C:\Users\PC\Downloads\slim\tmp\Train ^
--dataset_name=APTMalware ^
--dataset_split_name=train ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp\Dataset ^
--model_name=inception_resnet_v2
During the fine-tuning task, care should be taken about the restoration of checkpoint
weights. In particular, when fine-tuning a model for a new task with a different
number of output labels, the final logits (classification) layer cannot be restored.
This is the reason that the --checkpoint_exclude_scopes flag is used. This flag prevents
some of the variables from being loaded from the checkpoint. The new model has a
classification layer whose dimensions are different from those of the pre-trained model
whenever it is fine-tuned on a classification task with a different number of classes
than the task on which the network was originally trained. For example, if we fine-tune
a model trained on ImageNet using the APTMalware dataset, the pre-trained final layer
will have dimensions [2048 x 1001] because of the ImageNet task, while the new logits
layer will be of dimension [2048 x 12]. Hence, this flag tells the API not to load
these weights from the checkpoint file.
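The effect of excluding a scope can be illustrated with plain dictionaries of arrays. This is a conceptual sketch and not TF-Slim code: the function `restore_from_checkpoint` and the variable names such as `Net/Logits/weights` are hypothetical stand-ins for the real variable scopes.

```python
import numpy as np

def restore_from_checkpoint(model_vars, checkpoint_vars, exclude_scopes):
    """Copy checkpoint values into the model, skipping excluded scopes."""
    restored = []
    for name in model_vars:
        if any(name.startswith(scope) for scope in exclude_scopes):
            continue                      # keeps its fresh initialization
        model_vars[name] = checkpoint_vars[name].copy()
        restored.append(name)
    return restored

# Pre-trained model: 1001 output classes; new model: 12 APT classes.
checkpoint = {"Net/Conv/weights": np.ones((3, 3)),
              "Net/Logits/weights": np.ones((2048, 1001))}
model = {"Net/Conv/weights": np.zeros((3, 3)),
         "Net/Logits/weights": np.zeros((2048, 12))}

restored = restore_from_checkpoint(model, checkpoint, ["Net/Logits"])
print(restored, model["Net/Logits/weights"].shape)
```

Only the convolution weights are taken from the checkpoint; the logits layer keeps its new [2048 x 12] shape, which is why it has to be trained on the new data.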
We have to keep in mind here that warm-starting the fine-tuning from a checkpoint
has an impact on the model's weights only during the model’s initialization. As soon
as a model starts training, a new checkpoint file is generated in the tmp\Train
directory. If the fine-tuning is stopped and restarted, the weights are restored not
from the checkpoint_path but from this new checkpoint file. As a result, the
flags --checkpoint_path and --checkpoint_exclude_scopes are used only at
the 0-th global step (initialization of the model). Normally, we only want to train a
subset of layers in the case of fine-tuning, so the --trainable_scopes flag lets
us decide which layers should be trained and which should remain unchanged.
python train_image_classifier.py ^
--train_dir=C:\Users\PC\Downloads\slim\tmp\Train ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp\Dataset ^
--dataset_name=APTMalware ^
--dataset_split_name=train ^
--model_name=inception_resnet_v2 ^
--checkpoint_path=C:\Users\PC\Downloads\slim ^
--checkpoint_exclude_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits ^
--trainable_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits
To evaluate the performance of a model (whether pretrained or our own), we can use
the eval_image_classifier.py script, as shown below.
python eval_image_classifier.py ^
--alsologtostderr ^
--checkpoint_path=C:\Users\PC\Downloads\slim\tmp\Train ^
--dataset_dir=C:\Users\PC\Downloads\slim\tmp\Dataset ^
--dataset_name=APTMalware ^
--dataset_split_name=validation ^
--model_name=inception_resnet_v2
Once evaluation is done, we shall have the final accuracy of the model as the output.
The experimentation in this step is mostly related to modifying the parameters of the
DNN models. These include the following values in the command:
...
--max_number_of_steps=7188 ^
--batch_size=5 ^
--learning_rate=0.01 ^
--learning_rate_decay_type=fixed ^
--save_interval_secs=600 ^
--save_summaries_secs=60 ^
--log_every_n_steps=10 ^
--optimizer=rmsprop ^
--weight_decay=0.00004
...
Some of these values can be deduced intuitively: since the dataset is small,
batch_size has to be small, and learning_rate should be small as well. Other
parameters can be found by trial-and-error experimentation.
ResNet34 and AlexNet are implemented using public GitHub repositories. Their
implementation is straightforward. They need a dataset folder whose subfolders
each contain the images belonging to one class, and a text-based label file containing
each image path and its corresponding numerical class label. For training and testing,
one has to run the train.py and test.py scripts included in the repositories.
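A label file of the kind described above can be generated from the folder structure itself. The following is a minimal sketch; the function name and the line format (image path, a space, then the class index) are assumptions, since the exact format expected by each repository may differ.

```python
import os


def write_label_file(dataset_dir, label_path):
    """Write one 'image_path class_index' line per image found under dataset_dir."""
    # Each subfolder is treated as one class; sorting keeps the numbering stable.
    class_names = sorted(
        d for d in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, d))
    )
    lines = []
    for class_index, class_name in enumerate(class_names):
        class_dir = os.path.join(dataset_dir, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            image_path = os.path.join(class_dir, file_name)
            if os.path.isfile(image_path):
                lines.append(f"{image_path} {class_index}")
    with open(label_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + ("\n" if lines else ""))
    return class_names, lines
```

The returned list of class names records which numerical label was assigned to which subfolder, which is needed later to read the confusion matrices.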
For the transfer learning task using AlexNet, we have to remove the final layer, which
has a 1000-way softmax for ImageNet’s 1000 classes. We replace it with a layer
containing only twelve neurons for our multi-class classification problem consisting
of 12 classes.
Experiment                          Result
Removal of dropout layers           Slight improvement
Addition of convolution layer(s)    Improvement then degradation
The elite size specifies the number of GP programs evolved per class. Population
generation is controlled with probabilities of 0.9 for crossover, 0.07 for mutation and
0.03 for reproduction of new programs. The higher crossover rate ensures diversity in
each subsequent generation. The ramped half-and-half method is used for population
initialization, and prediction accuracy is used as the fitness function. The focus of the
APT classification problem is to attain a classifier with the highest accuracy in malware
class classification. Thus, AUC is considered as a fitness measure to evaluate the
suitability of the evolved classifier. The parameters used for GP-AdaBoost learning are
provided in Table 4-5. Then, the weighted results are summed for each binary class, and
the highest output indicates the class to which the test instance belongs. The
experimental results show that the proposed approach achieves higher accuracy for
classifying APT malware.
For the implementation, Weka is used, as mentioned earlier. It is a Java-based
machine learning tool used for data analysis. By default, Weka includes the source
code of most machine learning and data processing algorithms in its repository. We
converted the predictions from our CNN models into CSV format to make them
compatible with Weka input. After the dataset is loaded in the Weka Explorer using
the Open file button under the Preprocess tab, we can select different preprocessing
functions for cleaning and arranging the dataset before the classification task. As our
dataset does not need preprocessing, we select the classifier using the Choose button
under the Classify tab. After expanding classifiers and then the meta options, we
select the AdaBoostM1 classifier. We then click on AdaBoostM1 in the window and set
the parameters for AdaBoost. After that, by clicking on the last argument of
AdaBoost, we select the base classifier, which is Decision Stump by default; we change
it to Genetic Programming under classifiers and functions in the Choose menu. We can
change the parameters for Genetic Programming by clicking on GeneticProgramming
in the window, as shown in the figures below.
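The conversion of the CNN predictions into a CSV file for Weka can be sketched as below. The column layout (one predicted label per model, followed by the true class), the function name and the "c" prefix are assumptions made for illustration; the prefix is there so that the values are non-numeric and Weka is likely to load the columns as nominal attributes, which AdaBoostM1 requires for the class.

```python
import csv


def write_predictions_csv(csv_path, model_names, predictions, true_labels):
    """Write one row per test image: one predicted label per model, then the true class."""
    header = list(model_names) + ["class"]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row_predictions, label in zip(predictions, true_labels):
            # Prefix every label so that it is written as text, not as a number.
            writer.writerow([f"c{p}" for p in row_predictions] + [f"c{label}"])
    return header
```

A call such as write_predictions_csv("predictions.csv", ["alexnet", "inception_resnet_v2"], [[3, 3], [0, 7]], [3, 7]) would write a header row followed by two data rows; the file name and the values here are placeholders.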
This chapter provides a complete performance analysis and discussion of the
results produced by the proposed techniques. First, we shall briefly describe the
metrics used for the performance analysis, and then move on to the results and
discussion part.
The reason for choosing this measure is that most of the previously reported models
use it as well; this choice allows different models to be compared consistently.
As we saw in Section 3.3, three different Convolutional Neural Networks have been
used for the Transfer Learning task in our project. These three models are AlexNet,
Inception-ResNet-v2, and a custom DNN model. Each model has its own set of tunable
parameters, and the ability to modify them varies. Some offer more control over the
architecture than others. Inception-ResNet-v2 was implemented using Python’s
TF-Slim library, the custom CNN in Keras, and the other two in the TensorFlow
framework.
Table 5-6 presents a summary of the performance of the three deep neural networks,
averaged over 10 runs, on the APTMalware dataset.
An assortment of confusion matrices showing the test results of the DNN models
used for transfer learning is also given below in Figure 19, Figure 20, Figure 21 and
Figure 22.
Figure 19: Confusion matrix showing the test results of Inception-ResNet-v2 model
on APTMalware dataset
Figure 20: Confusion matrix showing the test results of custom DNN model on
APTMalware dataset
Figure 21: Confusion matrix showing the test results of AlexNet model on
APTMalware dataset
Figure 22: Confusion matrix showing the test results of ResNet34 model on
APTMalware dataset
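Confusion matrices such as those in the figures above can be computed directly from the true and predicted class labels. The sketch below assumes that rows index the true class and columns the predicted class, which may differ from the convention used in the figures; the labels are made up.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count instances per (true class, predicted class) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels for a three-class toy problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
acc = accuracy(cm)
print(cm, acc)
```

The diagonal entries count the correctly classified instances, so the overall accuracy is the sum of the diagonal divided by the total number of test instances.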
After obtaining the predictions of the different neural network architectures on the
original test dataset, we can now use these predictions as a dataset to train, and
obtain predictions from, the GP-AdaBoost ensemble model.
Average 92.59
The performance results of the proposed method on the APTMalware dataset are reported
in Table 5-7 as an average over 10 independent runs of the algorithm. Furthermore, the
confusion matrix and the area under the ROC curve are also shown in the figures below.
These results show that a decent improvement can be achieved by using our ensemble
approach instead of an approach that uses Transfer Learning only, or the GP-AdaBoost
approach only. Further improvements can be made to these results by selecting some
other DNN architecture and modifying it in depth according to the image input data.
The purpose of the extended experimentation carried out as part of this thesis was to
ascertain the following things:
Can the process of Transfer Learning be used for APT malware classification?
o Yes. As we saw, transfer learning provided very good classification results.
Does using the GP-AdaBoost ensemble improve classification accuracy over the
predictions of base learners (CNN models)?
o Yes. A noticeable increase in the classification accuracy was observed when
predictions from base learners (DNN models) were used to train and then test
GP-AdaBoost.
Is there one best way to represent an APT malware file in image format before
feeding it to the neural network models?
o Yes. We followed the technique described by Nataraj et al.: the malware
executable is read from memory as a binary file, sampled into 8-bit values, and
converted into an image whose width depends on the file size. This produces
the best results because malware executable files can be of various sizes, and
choosing the width as in the table ensures that images of malware from the
same family look similar even when the files differ in size.
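The conversion of an executable into a grayscale image can be sketched as follows. The width thresholds below are the values commonly quoted from Nataraj et al. and are an assumption here; the table used in this thesis takes precedence. Dropping an incomplete last row (rather than padding it) is also our own choice for the sketch, as are the function names.

```python
# (upper file-size bound in kB, image width); assumed values, see the note above.
WIDTH_BY_SIZE_KB = [
    (10, 32), (30, 64), (60, 128), (100, 256),
    (200, 384), (500, 512), (1000, 768),
]
MAX_WIDTH = 1024  # width used for files larger than the last bound


def image_width(num_bytes):
    """Choose the image width from the file size."""
    size_kb = num_bytes / 1024
    for upper_kb, width in WIDTH_BY_SIZE_KB:
        if size_kb < upper_kb:
            return width
    return MAX_WIDTH


def bytes_to_rows(data):
    """Arrange raw bytes as rows of 8-bit grayscale pixel values (0 to 255)."""
    width = image_width(len(data))
    usable = len(data) - len(data) % width  # drop an incomplete final row
    return [list(data[i:i + width]) for i in range(0, usable, width)]


def file_to_rows(path):
    """Read an executable as a binary file and convert it to pixel rows."""
    with open(path, "rb") as handle:
        return bytes_to_rows(handle.read())
```

The resulting rows can be handed to any image library to save the picture; because the width is fixed per size range, the height grows with the file size.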
Although the results reported in this thesis are at least as good as those of other
techniques reported in the literature, or better, there is still room for improvement,
which is briefly described below in the Future Work section.
The most important improvement will be to find a new, balanced and standard
APT malware dataset, if one becomes available in the future
Improve the Transfer Learning-based CNN models’ results by experimenting
with diverse modifications in architectures, parameters, and hyperparameters
Using these methods, it is quite possible to improve the results of the proposed
technique.
References