
IEEE TRANSACTIONS ON CYBERNETICS 1

DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection

Suleiman Y. Yerima, Member, IEEE, and Sakir Sezer, Member, IEEE

Abstract—Android malware has continued to grow in volume and complexity, posing significant threats to the security of mobile devices and the services they enable. This has prompted increasing interest in employing machine learning to improve Android malware detection. In this paper, we present a novel classifier fusion approach based on a multilevel architecture that enables effective combination of machine learning algorithms for improved accuracy. The framework (called DroidFusion) generates a model by training base classifiers at a lower level and then applies a set of ranking-based algorithms on their predictive accuracies at the higher level in order to derive a final classifier. The induced multilevel DroidFusion model can then be utilized as an improved accuracy predictor for Android malware detection. We present experimental results on four separate datasets to demonstrate the effectiveness of our proposed approach. Furthermore, we demonstrate that the DroidFusion method can also effectively enable the fusion of ensemble learning algorithms for improved accuracy. Finally, we show that the prediction accuracy of DroidFusion, despite only utilizing a computational approach in the higher level, can outperform Stacked Generalization, a well-known classifier fusion method that employs a meta-classifier approach in its higher level.

Index Terms—Android Malware Detection, Mobile Security, Machine Learning, Classifier Fusion, Ensemble Learning, Stacked Generalization.

S. Y. Yerima is with the Cyber Technology Institute, School of Computer Science and Informatics, De Montfort University, Leicester, England (e-mail: [email protected]).
S. Sezer is with the Centre for Secure Information Technologies (CSIT), Queen's University Belfast, Northern Ireland, UK (e-mail: [email protected]).
Manuscript received June 03, 2017; revised September 11, 2017; accepted November 11, 2017.

I. INTRODUCTION

In recent years, Android has become the leading mobile operating system with a substantially higher percentage of the global market share. Over 1 billion Android devices have been sold, with an estimated 65 billion app downloads from Google Play alone [1]. The growth in popularity of Android and the proliferation of third party app markets have also made it a popular target for malware. Last year, McAfee reported that there were more than 12 million Android malware samples, with nearly 2.5 million new samples discovered every year [2]. Android malware can be embedded in a variety of applications such as banking apps, gaming apps, lifestyle apps, educational apps, etc. These malware-infected apps can then compromise security and privacy by allowing unauthorized access to privacy-sensitive information, rooting devices, turning devices into remotely controlled bots, etc.

Zero-day Android malware has the ability to evade traditional signature-based defences. Hence, there is an urgent need to develop more effective detection methods. Recently, machine learning based methods are increasingly being applied to Android malware detection. However, classifier fusion approaches have not been as extensively explored as they have been in other domains like network intrusion detection.

In this paper, we present and investigate a novel classifier fusion approach that utilizes a multilevel architecture to increase the predictive power of machine learning algorithms. The framework, called DroidFusion, is designed to induce a classification model for Android malware detection by training a number of base classifiers at the lower level. A set of ranking-based algorithms are then utilized to derive combination schemes at the higher level, one of which is selected to build a final model. The framework is capable of leveraging not only traditional singular learning algorithms like Decision Trees or Naive Bayes, but also ensemble learning algorithms like Random Forest, Random Subspace, Boosting, etc., for improved classification accuracy.

In order to demonstrate the effectiveness of the DroidFusion approach, we performed extensive experiments on four datasets derived from extracting features from two publicly available and widely used malware sample collections (i.e. the Android Malgenome project [3] and DREBIN [4]) and a collection of samples provided by Intel Security (formerly McAfee). The unique contributions of this paper can be summarized as follows:

◦ We propose a novel general-purpose classifier fusion approach (DroidFusion) and present its evaluation on four different datasets. DroidFusion can be applied to not only traditional learners but also ensemble learners.
◦ We propose four ranking-based algorithms that enable classifier fusion within the DroidFusion framework. The algorithms are utilized in building a final improved classification model for Android malware detection.
◦ We present the results of extensive experiments to demonstrate the effectiveness of our proposed approach. The results of experiments with singular classifiers and ensemble classifiers are presented.
◦ Furthermore, we present results of a performance comparison of DroidFusion with Stacked Generalization (or Stacking), a well-known classifier fusion method that is also based on a multilevel architecture.
◦ Datasets that we created from the feature extraction process with DREBIN and Malgenome project malware samples are released in the supplementary material.

The rest of the paper is structured as follows. Section II discusses related work, while Section III presents the DroidFusion framework. The investigation methodology is presented in Section IV, while Section V presents results with analyses and discussions. Finally, the conclusion is given in Section VI.
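The two-level training flow summarised in this introduction (level-1 accuracy estimation on held-out folds, then ranking-derived weights for fusing the base predictions) can be sketched as follows. This is our own runnable toy illustration, not the authors' implementation; the function names and the single-feature threshold-rule "classifiers" are invented for the example.

```python
# Toy sketch of the two-level idea: level 1 estimates each base
# classifier's accuracy on held-out folds; level 2 converts the accuracy
# ranking into weights (best = Z, ..., worst = 1) for a weighted vote.
from statistics import mean

def kfold_accuracy(predict, data, n_folds=5):
    """Mean held-out accuracy of a fixed predictor over n_folds folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    return mean(mean(1.0 if predict(x) == y else 0.0 for x, y in fold)
                for fold in folds)

def rank_to_weights(accuracies):
    """Assign weight Z to the most accurate classifier, down to 1."""
    order = sorted(range(len(accuracies)), key=lambda k: -accuracies[k])
    weights = [0] * len(accuracies)
    for rank, k in enumerate(order):
        weights[k] = len(accuracies) - rank
    return weights

def weighted_vote(weights, votes):
    """Classify as malware (1) if the weighted vote reaches 0.5."""
    s = sum(w * v for w, v in zip(weights, votes))
    return 1 if s / sum(weights) >= 0.5 else 0

# Toy base "classifiers": threshold rules on a single feature.
clfs = [lambda x: int(x > 0.5), lambda x: int(x > 0.2), lambda x: 1]
data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1), (0.4, 0)]
accs = [kfold_accuracy(c, data) for c in clfs]
weights = rank_to_weights(accs)   # most accurate rule gets largest weight
```

The actual framework ranks classifiers by several criteria (described in Section III) rather than by raw accuracy alone; the sketch only shows the shape of the lower/higher-level split.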

II. RELATED WORK

In this section we review related work on machine learning based Android malware detection. Static and/or dynamic analysis is used to extract model training features, and both methods have pros and cons. Static analysis is prone to obfuscation [5], but is generally faster and less resource intensive than dynamic analysis. Dynamic analysis is resistant to obfuscation but can be hampered by anti-virtualization [6]–[9] and code coverage limitations [10], [34].

A. Static analysis with traditional classifiers

Recent Android malware detection works that employ machine learning with static features include the following. DroidMat [11] proposed applying k-means and k-NN algorithms based on static features from permissions, intents and API (application program interface) calls, to classify apps as benign or malware. Arp et al. [4] proposed SVM based on permissions, API calls, network access, etc. for lightweight on-device detection. Yerima et al. [12], [14] proposed an Eigenspace analysis approach, as well as Random Forest ensemble learning models. The machine learning based detection proposed in these papers was based on API calls, intents, permissions and embedded commands. Varsha et al. [15] investigated SVM, Random Forest and Rotation Forests on three datasets; their detection method employed static features extracted from the manifest and application executable files. Sharma and Dash [16] utilized API calls and permissions to build Naive Bayes and k-NN based detection systems. In [17], API classes were used with Random Forest, J48 and SVM classifiers. Wang et al. [18] evaluated the usefulness of risky permissions for malware detection using SVM, Decision Trees and Random Forest. DAPASA [19] focused on detecting malware piggybacked onto benign apps by utilizing sensitive subgraphs to construct five features depicting invocation patterns. The features are fed into machine learning algorithms, i.e. Random Forest, Decision Tree, k-NN and PART, with Random Forest yielding the best detection performance. Cen et al. [20] proposed a detection method based on API calls from decompiled code and permissions. Their proposed method applies a probabilistic discriminative model based on regularized logistic regression (RLR). RLR is compared to SVM, Decision Tree, k-NN, Naive Bayes with information priors, and a hierarchical mixture of Naive Bayes.

Wang et al. [52] applied Logistic Regression, linear SVM, Decision Tree and Random Forest with static analysis for the detection of malicious apps. They utilized app-specific static features and platform-specific static features for training the machine learning algorithms. The authors reported a maximum true positive rate of 96% and a false positive rate of 0.06% with the Logistic Regression classifier, based on experiments conducted on 18,363 malware apps and 217,619 benign apps. Other research papers that have investigated static features with machine learning for Android malware detection include [21]–[23], [45], [47], [48] and [54].

B. Dynamic & hybrid analysis with traditional classifiers

Some of the detection methods utilized dynamic features with machine learning, for example AntiMalDroid [24]. AntiMalDroid is a dynamic analysis behavior based malware detection framework that uses logged behavior sequences as features with SVM. DroidDolphin [25] also employed SVM with dynamically obtained features. Afonso et al. [26] utilized dynamic API calls and system call traces and investigated SVM, J48, IBk (an instance based classifier), BayesNet K2, BayesNet TAN, Random Forest and Naive Bayes. Alzaylaee et al. [27] investigated SVM, Naive Bayes, PART, Random Forest, J48, MLP (multi-layer perceptron), and Simple Logistic by comparing their performances on real phones vs. emulators using dynamically obtained features. Ni et al. [46] proposed a real-time malicious behavior detection system that records API calls, permission uses, and other real-time features such as user operations. In their paper, they used SVM and Naive Bayes algorithms for detection with these run-time features.

Mahindru and Singh [53] extracted 123 dynamic permissions from 11000 Android applications, which were subsequently applied to several individual machine learning classifiers including Naive Bayes, Decision Tree, Random Forest, Simple Logistic and k-star. In their experiments Simple Logistic was found to perform marginally better than the others, but the malware classification accuracies of Random Forest, Decision Tree (J48) and Simple Logistic were comparable.

Other works such as MARVIN [28] adopt a hybrid static and dynamic feature based approach with machine learning (SVM and an L2-regularized linear classifier). MARVIN assesses the risk associated with unknown Android apps in the form of a malice score ranging from 0 to 10. Similarly, Su et al. [49] adopted a hybrid static and dynamic feature approach by performing experiments on 1200 (900 clean and 300 malware) samples. Several machine learning algorithms were investigated including Bayes Net, Naive Bayes, k-NN, J48, and SVM. The best overall accuracy of 91.1% was attained with SVM.

C. Android malware detection with classifier fusion

Previous works in intrusion detection systems such as [29]–[32] investigated classifier fusion for improving detection accuracy. This method is also being applied to the detection of Android malware. For example, Milosevic et al. [50] investigated a classifier fusion approach with static analysis based on Android permissions and source code-based analysis. They used SVM, C4.5, Decision Trees, Random Tree, Random Forests, JRip and linear regression classifiers. The authors experimented with ensembles that contained odd combinations of three and five classifiers using the majority voting fusion method. The best fusion model achieved an accuracy rate of 95.6% using the source-code based features. However, the number of samples used in the experiments was limited (387 samples for the permissions-based experiments and 368 for source code-based analysis).

Yerima et al. [13] compared several classifier fusion methods, i.e. majority vote, product of probabilities, maximum probability, and average of probabilities, using J48, Naive Bayes, PART, RIDOR, and Simple Logistic classifiers. The classifiers were trained with static features extracted from 6,863 app samples, and in the experiments presented, the fused models performed better than the single classifiers.

TABLE I: Overview of some of the papers that apply classifier fusion for Android malware detection. NB = Naive Bayes; SL = Simple Logistic; LR = Linear Regression; DT = Decision Tree; VP = Voted Perceptron. AveP = average of probabilities; ProdP = product of probabilities; MaxP = maximum probability.

Paper/Year | ML algorithms | Fusion approach | # samples
Yerima et al. [13] (2014) | SVM, J48, PART, Ridor, NB, SL | Majority vote, ProdP, AveP, MaxP | 6,863
Coronado-de-Alba et al. [33] (2016) | Random Forest, Random Committee | Meta-ensembling Random Forest in Random Comm. | 3,062
Milosevic et al. [50] (2017) | SVM, C4.5, RT, DT, JRip, LR, Random Forest | Majority vote | 387 / 368
Wang et al. [51] (2017) | SVM, KNN, NB, CART, Random Forest | Majority vote | 116,028
Idrees et al. [55] (2017) | MLP, DT, Decision Table | Majority vote, AveP, ProdP | 1,745
DroidFusion (This paper) | RT, J48, RepTree, VP, Random Forest, Random Comm., Random Sub., AdaBoost | Multilevel weighted ranking-based approach | 3,799 / 15,036 / 36,183

Wang et al. [51] extracted 11 types of static features and employed multiple classifiers in a majority vote fusion approach. The classifiers include SVM, K-Nearest Neighbour, Naive Bayes, Classification and Regression Tree (CART) and Random Forest. Their experiments on 116,028 app samples showed more robustness with the majority voting ensemble than with the individual base classifiers.

Idrees et al. [55] utilize permissions and intents as features to train machine learning models and applied classifier fusion for improved performance. Their experiments were performed on 1745 app samples, starting with a performance comparison between MLP, Decision Table, Decision Tree, Random Forest, Naive Bayes and Sequential Minimal Optimization classifiers. The Decision Table, MLP, and Decision Tree classifiers were then combined using three schemes: average of probabilities, product of probabilities and majority voting. Coronado-de-Alba et al. [33] proposed and investigated a classifier fusion method based on Random Forest and Random Committee ensemble classifiers. Their approach embeds Random Forest within Random Committee to produce a meta-ensemble model. The meta-model outperformed the individual classifiers in experiments performed with 1531 malware and 1531 benign samples. Table I summarizes papers that have investigated classifier fusion for Android malware detection.

In contrast to all of the existing Android malware detection works, this paper proposes a novel classifier fusion approach that utilizes four ranking based algorithms within a multilevel framework (DroidFusion). We evaluated DroidFusion extensively and compared its performance to Stacking and other classifier fusion methods. Next, we present DroidFusion.

III. DROIDFUSION: GENERAL PURPOSE FRAMEWORK FOR CLASSIFIER FUSION

The DroidFusion framework consists of a multilevel architecture for classifier fusion. It is designed as a general purpose classifier fusion system, so that it can be applied to both traditional singular classifiers and ensemble classifiers (which themselves usually employ a base classifier to produce different randomly induced models that are subsequently combined). At the lower level, the (DroidFusion) base classifiers are trained on a training set using a stratified N-fold cross validation technique to estimate their relative predictive accuracies. The outcomes are utilized by four different ranking-based algorithms (in the higher layer) that define certain criteria for the selection and subsequent combination of a subset (or all) of the applicable base classifiers. The outcomes of the ranking algorithms are combined in pairs in order to find the strongest pair, which is subsequently used to build the final DroidFusion model (after testing against an unweighted parallel combination of the base classifiers).

A. DroidFusion model construction

The model building, i.e. training, process is distinct from the prediction or testing phase, as the former utilizes a training-validation set to build a multilevel ensemble classifier which is then evaluated on a separate test set in the latter phase. Figure 1 illustrates the 2-level architecture of DroidFusion. It shows the training paths (solid arrows) and the testing/prediction path (dashed arrows). First, at the lower level each base classifier undergoes an N-fold cross validation based estimate of class performance accuracies. Let the N-fold cross validated predictive accuracies for K base classifiers be expressed by P_base, a K-tuple of the class accuracies of the K base classifiers:

P_base = {[P_1m, P_1b], [P_2m, P_2b], ..., [P_Km, P_Kb]}   (1)

The elements of P_base are applied to the ranking based algorithms AAB, CDB, RAPC and RACD described later in Section III-B. Let X be the total number of instances with M malware and B benign instances, where the M instances possess a label L=1 denoting malware and the B instances from X possess a label L=0 denoting benign. All X instances are also represented by feature vectors with f binary representations, where f is the number of features extracted from the given app. The features in the vectors take on 0 or 1, representing the absence or presence of the given feature. Additionally, after the N-fold cross validation process (as shown in Fig. 1), a K-tuple of class predictions is derived for every instance x, given by:

V(x) = {v_1, v_2, ..., v_K}   (2)

Note that v_1, v_2, ..., v_K could be crisp predictions or probability estimates from the base classifiers. Adding the original (known) class label, l, we obtain:

V̇(x) = {v_1, v_2, ..., v_K, l}, l ∈ {0, 1}   (3)
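As a concrete illustration of the level-1 quantities just defined in Eqs. (1) to (3), the toy code below (our own sketch, not the authors' implementation) computes P_base from out-of-fold base-classifier predictions and assembles the per-instance tuples V̇(x) that feed the level-2 ranking algorithms.

```python
# Toy computation of P_base (per-class accuracies, Eq. (1)) and the
# per-instance prediction tuples V_dot(x) = {v_1, ..., v_K, l} (Eq. (3))
# from out-of-fold predictions gathered during N-fold cross validation.

def level1_outputs(oof_preds, labels):
    """oof_preds: K lists of 0/1 out-of-fold predictions, one per instance.
    labels: true labels (1 = malware, 0 = benign)."""
    K, X = len(oof_preds), len(labels)
    M = sum(labels)            # number of malware instances
    B = X - M                  # number of benign instances
    P_base = []
    for preds in oof_preds:
        p_mal = sum(1 for p, l in zip(preds, labels) if l == 1 and p == 1) / M
        p_ben = sum(1 for p, l in zip(preds, labels) if l == 0 and p == 0) / B
        P_base.append((p_mal, p_ben))              # [P_km, P_kb] of Eq. (1)
    V_dot = [tuple(oof_preds[k][x] for k in range(K)) + (labels[x],)
             for x in range(X)]                    # Eq. (3)
    return P_base, V_dot

labels = [1, 1, 0, 0]
oof = [[1, 1, 0, 0],   # base classifier 1: perfect on this toy set
       [1, 0, 0, 1]]   # base classifier 2: one miss in each class
P_base, V_dot = level1_outputs(oof, labels)
```

In the framework itself these quantities come from stratified N-fold cross validation of the trained base classifiers; the sketch only shows how the bookkeeping fits together.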

Fig. 1: DroidFusion 2-layer model architecture.

P_base and V̇(x), ∀x ∈ X, will be utilized in the level-2 computation during the DroidFusion model construction. Let us denote the set of four ranking based schemes by S = {S1, S2, S3, S4}. The pairwise combinations of the elements of S will result in 6 possibilities:

φ = {S1S2, S1S3, S1S4, S2S3, S2S4, S3S4}   (4)

Our goal is to select the best pair of ranking-based schemes from S, and if its performance exceeds that of an unweighted combination of the original base classifiers, it would be selected to construct the final DroidFusion model. In the event that the unweighted combination performance is greater, DroidFusion will be configured to apply a majority vote (or average of probabilities) of the base classifiers in the final constructed model. In order to estimate the accuracy performance of each scheme in S or each pairwise combination in set φ, a re-classification of the X instances (in the training-validation set) is performed for each scheme or pair of schemes. The re-classification is accomplished using V̇(x), x ∈ X, based on the criteria defined by the schemes in S using P_base. Each scheme in S derives a set of Z weights that will be applied with V̇(x), x ∈ X, for every instance during the re-classification process. Let ω_i, i ∈ {1, ..., Z}, Z ≤ K, be the set of weights derived for a particular scheme in S. Then, to reclassify an instance x according to the scheme's criterion, its class prediction will be given by:

C_Sj(x) = 1 if (Σ_{i=1}^{Z} ω_i·v_i) / (Σ_{i=1}^{Z} ω_i) ≥ 0.5; 0 otherwise, ∀j ∈ {1, 2, 3, 4}   (5)

Hence, the benign class accuracy performance for the given scheme is calculated from:

P_Sj^ben = [Σ_{x=1}^{X} (C_Sj(x) + 1) | C_Sj(x) = 0, l(x) = 0] / B   (6)

Where B is the number of benign instances, while the malware accuracy performance is calculated from:

P_Sj^mal = [Σ_{x=1}^{X} C_Sj(x) | C_Sj(x) = 1, l(x) = 1] / (X − B)   (7)

Thus the average performance accuracy is simply:

Ṗ_Sj = [B·P_Sj^ben + (X − B)·P_Sj^mal] / X   (8)

Likewise, to determine the performance of each pairwise combination in φ: Let ω_i, i ∈ {1, ..., Z}, Z ≤ K, be the first set of weights, derived for the first scheme in the pair, and let µ_i, i ∈ {1, ..., Z}, Z ≤ K, be those derived for the second scheme in the pair. Then, to reclassify the X instances in the training-validation set according to the combination pair, the class prediction of each instance x will be given by:

C_SjSn(x) = 1 if (Σ_{i=1}^{Z} ω_i·v_i + Σ_{i=1}^{Z} µ_i·v_i) / (Σ_{i=1}^{Z} ω_i + Σ_{i=1}^{Z} µ_i) ≥ 0.5; 0 otherwise,
∀j ∈ {1, 2, 3, 4}, ∀n ∈ {1, 2, 3, 4}, j ≠ n, SjSn ≡ SnSj   (9)

Therefore, computing benign class accuracy and malware class accuracy will utilize:

P_SjSn^ben = [Σ_{x=1}^{X} (C_SjSn(x) + 1) | C_SjSn(x) = 0, l(x) = 0] / B   (10)

and

P_SjSn^mal = [Σ_{x=1}^{X} C_SjSn(x) | C_SjSn(x) = 1, l(x) = 1] / (X − B)   (11)

respectively. The average performance accuracy for the pairwise schemes will then be given by:

Ṗ_SjSn = [B·P_SjSn^ben + (X − B)·P_SjSn^mal] / X   (12)

∀j ∈ {1, 2, 3, 4}, ∀n ∈ {1, 2, 3, 4}, j ≠ n, SjSn ≡ SnSj.
Equivalently, the unweighted majority vote class prediction for instance x is given by:

C_mv(x) = 1 if (Σ_{k=1}^{K} v_k) / K ≥ 0.5; 0 otherwise   (13)

Hence, the benign class accuracy performance for the unweighted scheme will be given by:

P_mv^ben = [Σ_{x=1}^{X} (C_mv(x) + 1) | C_mv(x) = 0, l(x) = 0] / B   (14)

Likewise, the malware class accuracy performance for the unweighted scheme is given by:

P_mv^mal = [Σ_{x=1}^{X} C_mv(x) | C_mv(x) = 1, l(x) = 1] / (X − B)   (15)

Finally, the average accuracy performance for the unweighted scheme is given by:

Ṗ_mv = [B·P_mv^ben + (X − B)·P_mv^mal] / X   (16)

After all the re-classifications are completed and the average accuracies computed, the applicable scheme that will be utilized to construct the DroidFusion model is selected thus:

arg max_φ (Ṗ_φ), φ ∈ {S1S2, S1S3, S1S4, S2S3, S2S4, S3S4, mv}   (17)

Suppose that the S1 and S3 pair is selected by the operation in Eq. (17); then the class of a given unlabeled instance during testing or unknown-instance prediction (during model deployment) will be computed by Equation (9) with j=1 and n=3. Next, we describe the four ranking-based algorithms underpinning the schemes in set S that utilize P_base to accomplish all of the above described DroidFusion level-2 steps.

B. Proposed ranking based algorithms

The design of our proposed algorithms is influenced by the observation that most typical classifiers perform differently for both classes. That is, class accuracy performances for benign and malware are very rarely equal in magnitude. The proposed ranking based algorithms include:

◦ An average accuracy based ranking scheme (AAB).
◦ A class differential based ranking scheme (CDB).
◦ A ranked aggregate of per class performance based scheme (RAPC).
◦ A ranked aggregate of average accuracy and class differential based scheme (RACD).

1) The Average Accuracy Based (AAB) ranking scheme: With the AAB method, the ranking is designed to be directly proportional to the average prediction accuracies across the classes. In this case, base classifiers with larger overall accuracy performance will rank higher. AAB doesn't take into account how well a base classifier performs for a particular class. Let AAB be the first scheme, S1, from set S. The algorithm is summarized as follows:
Let P_base be the set of performance accuracies P_k,c ∈ P_base of K base classifiers. If m denotes malware and b benign, then the average accuracy of the k-th base classifier is given by:

a_k = 0.5 × Σ_{c=m,b} P_k,c, k ∈ {1, ..., K}, 0 < P_k,c ≤ 1   (18)

Let A ← a_k, ∀k ∈ {1, ..., K}, be a set of the average predictive accuracies, to which a ranking function Rank_desc(.) is applied:

Ā ← Rank_desc(A)   (19)

Thus, Ā contains an ordered ranking of the level-1 base classifiers' average predictive accuracies in descending order. Next, the top Z rankings are utilized in weight assignments as follows:

ω_1 = Z, ω_2 = Z − 1, ..., ω_Z = 1, Z ≤ K   (20)

Thus, the AAB class prediction C(x) for instance x in the training-validation set is given by Eq. (5), or by Eq. (9) when used in pairwise combination with another scheme.

2) The Class Differential Based (CDB) ranking scheme: With the CDB method, the ranking is directly proportional to the average predictive accuracy and inversely proportional to the absolute value of the performance difference between the classes. Assuming a binary classification problem, this approach will be less likely to favour the decision from a base classifier that exhibits much higher accuracy in one class over the other, but will assign larger weights to good classifiers that perform relatively well in both classes. The CDB procedure is described as follows:
Suppose the CDB method is taken as scheme S2. Let the average accuracy of each base classifier be given by a_k in equation (18), and define D̄ with cardinality K as a set of ordered rankings in descending order of magnitude. Calculate d_k proportional to the average accuracies and inversely proportional to the absolute difference of inter-class accuracies:

d_k = a_k / |P_k,m − P_k,b|, k ∈ {1, ..., K}   (21)

Let D ← d_k, ∀k ∈ {1, ..., K}, be a set of the d_k values, to which the ranking function Rank_desc(.) is applied:

D̄ ← Rank_desc(D)   (22)

With D̄ containing the ordered rankings of d_k values, the top Z rankings are also utilized to assign weights according to

Eq. (20). Thus, the S2=CDB class prediction for an instance x is determined from Eq. (5). Whenever S2=CDB is used (in conjunction with another scheme) within a pair in the set expressed by Eq. (4), then Eq. (9) will be used for the class prediction of the instance.

3) The Ranked Aggregate of Per Class accuracies (RAPC) based scheme: In the RAPC method, the ranking is directly proportional to the sum of the initial per class rankings of the accuracies of the base classifiers. This method is more likely to assign a larger weight to a base classifier that performs very well in both classes. RAPC is summarized as follows.
With F̄ defined as the set of ordered rankings with cardinality K, given the initial performance accuracies P_k,c of the K base classifiers:

{ P_m ← P_k,c where c ≠ b; P_b ← P_k,c where c ≠ m }, k ∈ {1, ..., K}, c ∈ {m, b}   (23)

We then apply the ranking function Rank_desc(.) to both:

{ P̄_m ← Rank_desc(P_m); P̄_b ← Rank_desc(P_b) }   (24)

The per-class rankings for each base classifier are aggregated and then ranked again:

{ f_k ← P̄_k,m + P̄_k,b; F ← f_k }, ∀k ∈ {1, ..., K}   (25)

F̄ ← Rank_desc(F)   (26)

Finally, from the set F̄, comprising K ordered values of F, we select the top Z rankings and use them to assign weights according to Eq. (20). Suppose the RAPC scheme is taken as S3; we can determine the class prediction for an instance x from Eq. (5). If S3=RAPC is used (in conjunction with another scheme) within a pair in the set expressed by Eq. (4), then Eq. (9) will be employed for the class prediction of the instance.

4) The Ranked aggregate of Average accuracy and Class Differential (RACD) scheme: With RACD, the ranking is directly proportional to the sum of the initial rankings of the average performance accuracies and the initial rankings of the difference in performance between the classes. This method is designed to assign a larger weight to the base classifiers with good initial overall accuracy that also have a relatively smaller difference in performance between the classes. The algorithm is described as follows.
Suppose we take the RACD method as scheme S4. Define a set H̄ for ordered values with cardinality K. Given A, the set of computed average accuracies for each base classifier (determined in the AAB scheme), compute the class differential for each corresponding classifier as follows:

g_k ← |P_k,m − P_k,b|, k ∈ {1, ..., K}   (27)

Define G ← g_k, ∀k ∈ {1, ..., K}, as the ordered set of g_k values, to which a ranking function Rank_ascen(.) is applied to rank g_k in ascending order of magnitude:

Ḡ ← Rank_ascen(G)   (28)

Then, for each base classifier, aggregate the values and apply the ranking function Rank_desc(.):

{ h_k ← A_k + G_k; H ← h_k }, A_k ∈ Ā, G_k ∈ Ḡ, ∀k ∈ {1, ..., K}   (29)

H̄ ← Rank_desc(H)   (30)

Thus, H̄ is the set containing the ranked values of H in descending order of magnitude. The top Z rankings are then used according to Eq. (20) to assign the weights.

C. Model complexity

As mentioned earlier, the base classifiers' initial accuracies are estimated using a stratified N-fold cross validation technique. This procedure will be performed only once during training (on the training-validation set), and the preliminary predictions for all x instances in X for every base classifier will be determined from the procedure. The configurations (weights) computed from each algorithm are applied together with these initial (base classifier) predictions to re-classify each instance accordingly. Since level-2 training prediction of instances requires only re-classification using V̇(x), ∀x ∈ X, the time complexity for utilizing R level-2 algorithms to predict the classes of X instances using Eq. (5) will be given by O(RX). The pairwise class predictions also involve re-classification, thus the complexity involved for predicting the class of X instances using Eq. (9) will be given by O(JX), where J = C(R, 2) is the number of pairwise scheme combinations. Likewise, for the unweighted majority vote the complexity will be O(X), as re-classification is involved also. Since we utilize unweighted majority vote and pairwise combinations for final model building (Eq. (17)), the total training time complexity in level-2 is therefore given by O(X) + O(JX) = O((J + 1)X), where J = C(R, 2) for the R level-2 ranking-based algorithms.

IV. INVESTIGATION METHODOLOGY

A. Automated static analyzer for feature extraction

The features used in the experimental evaluation of the DroidFusion system are obtained using an automated static analysis tool developed with Python. The tool enables us to extract permissions and intents from the application manifest file after decompiling with AXMLPrinter2 (a library for decompiling Android manifest files). In addition, API calls are extracted through reverse engineering of the .dex files by means of the Baksmali disassembler. The static analyzer also searches for dangerous Linux commands in the application files and checks for the presence of embedded .dex, .jar, .so, and .exe files within the application. Previous works [35] have shown that this set of static application attributes provides discriminative features for machine learning based Android malware detection; hence, we utilized them for the DroidFusion experiments. Furthermore, while extracting API calls, third party libraries are excluded using the list of popular ad libraries obtained from [36]. Fig. 2 shows an overview of the feature

extraction process using our static app analyzer. The features are represented in binary form and labelled with class values in all the datasets.

Fig. 2: Overview of the Python based static analyzer for automated feature extraction.

B. Feature selection

Feature ranking and selection is usually applied for dimensionality reduction, which in turn lowers model computational cost. The study in this paper utilized four datasets for evaluating DroidFusion. One of the datasets is derived from feature reduction of an initial set of 350 features down to 100 by applying the Information Gain (IG) feature ranking approach to rank the features and then selecting the top n features. IG evaluates the features by calculating the information gain achieved by each feature. Specifically, given a feature X, IG is expressed as:

IG = E(X) − E(X/Y)   (31)

Where E(X) and E(X/Y) represent the entropy of the feature X before and after observing the feature Y respectively. The entropy of feature X is given by:

Where TP (true positives) is the number of correct predictions of malware classification and FN (false negatives) is the number of misclassified malware instances in the set. TPR is also synonymous with recall and sensitivity.
◦ False positive ratio (FPR): The ratio of incorrectly classified benign instances to the total number of benign instances, given by:

FPR = FP / (TN + FP)   (35)

Where FP (false positives) is the number of incorrect predictions of benign classifications, while TN (true negatives) is the number of correct predictions of benign instances.
◦ Precision: also known as positive predictive rate, is calculated as follows:

Precision = TP / (TP + FP)   (36)

◦ F-measure: This metric combines precision and recall as follows:

FM = (2 · precision · recall) / (precision + recall)   (37)

In [20], it has been shown that (especially for unbalanced datasets) F-measure is a better metric than the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC), which uses values of TPR and FPR to plot a graph for different thresholds. Thus, in our experiments we utilize F-measure as the main indicator of predictive power. Note that precision and recall can be calculated for both malware and benign classes. Hence, if Fm and Fb are the F-measures for malware and benign classes respectively, while Nm and Nb are the number of instances in each class, the combined metric known as weighted F-measure is the sum of F-measures weighted by the number of instances in each class, given by:
E(X) = − p(x)log2 (p(x)) (32) F m · Nm + F b · Nb
x∈X
WFM = (38)
Where p(x) is the marginal probability density function for Nm + Nb
the random variable X. Similarly, the entropy of X relative ◦ Time taken to test the model. This is the time in seconds
to Y is given by [38]: to test a constructed model from the testing set. All
X X models were evaluated on a Windows 7 Enterprise 64
E(X/Y ) = − p(x) p(x|y)log2 (p(x|y)) bit PC with 32GB of RAM and Intel Xeon CPU 3.10
x∈X x∈X
(33) GHz speed.
Where p(x|y) is the conditional probability of x given y. The
higher the reduction of the entropy of feature X, the greater
D. Datasets description
the significance of the feature.
The experiments performed to evaluate DroidFusion was
done using four datasets from three collections of Android app
C. Model Evaluation Metrics samples. Table II shows the details of each of the datasets. The
The following performance metrics are considered in the first one (Malgenome-215) consists of feature vectors from
evaluation of the models: 3,799 app samples where 2,539 were benign and 1,260 were
◦ True positive ratio (TPR): The ratio of correctly clas- malware samples from the Android malware genome project
sified malicious apps to the total number of malicious [3], a reference malware samples collection widely used by
apps. This is given by: the malware research community. This dataset contains 215
TP features. The second dataset (Drebin-215) also consists of
TPR = (34) vectors of 215 features from 15,036 app samples; of these,
TP + FN
9,476 were benign samples while the remaining 5,560 were malware samples from the Drebin project [4]. The Drebin samples are also publicly available and widely used in the research community. Both the Drebin-215 and Malgenome-215 datasets are made available as supplementary material.

TABLE II: Datasets used for the DroidFusion evaluation experiments.

Datasets        #samples  #malware  #benign  #features
Malgenome-215   3799      1260      2539     215
Drebin-215      15036     5560      9476     215
McAfee-350      36183     13805     22378    350
McAfee-100      36183     13805     22378    100

The final two datasets come from the same source of samples. These are McAfee-350 and McAfee-100 in the table. They both have 36,183 instances of feature vectors derived from 13,805 malware samples and 22,378 benign samples made available to us by Intel Security (formerly McAfee). Dataset #3 has 350 features, while Dataset #4 has the top 100 features with the largest information gain out of the original 350 features in Dataset #3. In the experiments presented, Datasets #1, #2 and #3 are used to investigate DroidFusion with singular base classifiers, while Dataset #4 is used to study the fusion of ensemble base classifiers with DroidFusion. Note that all of the features were extracted using our static app analysis tool described in section IV-A.

V. RESULTS AND DISCUSSIONS

In this section, we present and discuss the results of four sets of experiments performed to evaluate DroidFusion performance. We utilized the open source Waikato Environment for Knowledge Analysis (WEKA) toolkit [37] to implement and evaluate DroidFusion. Feature ranking and reduction of Dataset #3 into Dataset #4 was also done with WEKA. In all the experiments we set K=5, i.e. five base classifiers are utilized. Also, we take N=10 and Z=3 for the cross validation and weight assignments respectively. In the first three sets of experiments, non-ensemble base classifiers were used: J48, REPTree, Voted Perceptron and Random Tree. The Random Tree learner was used to build two separate classifier models using different configurations, i.e. Random Tree-100 and Random Tree-9. With Random Trees, the number of variables selected at each split during tree construction is a configuration parameter which by default (in WEKA) is given by log2 f + 1, where f is the number of features (# variables = 9 for f = 350 with the McAfee-350 dataset). The same configuration is used in the Drebin-215 and Malgenome-215 experiments for consistency. Thus, selecting 100 and 9 for Random Tree-100 and Random Tree-9 respectively results in two different base classifier models. Random Tree, REPTree, J48 and Voted Perceptron were selected as example base classifiers (out of 12 base classifiers) because of their combined accuracy and training time performance as determined from preliminary investigation; a different set of learning algorithms can be used with DroidFusion since it is designed to be general-purpose, and not specific to a particular type of machine learning algorithm.

A. Performance of DroidFusion with the Malgenome-215 dataset.

In order to evaluate DroidFusion on the Malgenome-215 dataset, we split the dataset into two parts, one for testing and another for training-validation. The ratio was training-validation: 80%, testing: 20%. The stratified 10-fold cross validation approach was used to construct the DroidFusion model using the training-validation set. Table III shows the per-class accuracies of each of the 5 base classifiers resulting from 10-fold cross-validation on the training-validation set. The subsequent rankings determined from AAB, CDB, RAPC and RACD are also presented. Each of the algorithms induced a different set of rankings from the base classifier accuracies. After applying Eq. (9) to the instances in the training-validation set and computing the accuracies with Eqs. (10)-(12), we obtained the performances of the pairwise combinations of the level-2 algorithms as shown in Table IV.

TABLE III: Malgenome-215 train-validation set results and Level-2 algorithm based rankings for the base classifiers (5 = highest rank, 1 = lowest).

Classifier        TPR    TNR    AAB  CDB  RAPC  RACD
J48               0.975  0.983  4    4    5     5
REPTree           0.961  0.974  1    2    1     1
Random Tree-100   0.972  0.982  3    3    3     2
Random Tree-9     0.966  0.973  2    5    1     4
Voted Perceptron  0.971  0.991  5    1    4     2

TABLE IV: Malgenome-215 train-validation set Level-2 combination schemes intermediate results.

Combination  PrecM  RecalM  PrecB  RecalB  W-FM
AAB+CDB      0.980  0.985   0.993  0.990   0.9883
AAB+RAPC     0.984  0.984   0.992  0.992   0.9893
AAB+RACD     0.982  0.984   0.992  0.991   0.9887
CDB+RAPC     0.982  0.984   0.992  0.991   0.9887
CDB+RACD     0.976  0.983   0.992  0.988   0.9864
RAPC+RACD    0.982  0.984   0.992  0.991   0.9887

The results in Table IV clearly depict the overall performance improvement achieved by the level-2 combination schemes over the individual base classifiers. From Table III, J48 has the best malware recall of 0.975, but its recall for the benign class is 0.983. On the other hand, Voted Perceptron had the best benign class recall of 0.991, but its recall for the malware class is 0.971. On the training-validation set, the best combination is AAB+RAPC (i.e. the S1S3 pair), having 0.984 recall for malware and 0.992 recall for the benign class, and a weighted F-measure of 0.9893. J48 and Voted Perceptron had weighted F-measures of 0.9804 and 0.9843 respectively. These were below all of the weighted F-measures achieved by the combination schemes shown in Table IV. Hence, these intermediate training-validation set results already show the capability of the DroidFusion approach to produce stronger models from the weaker base classifiers.
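The ranking step behind Table III can be illustrated with a short sketch. This is not the authors' implementation: it simply recomputes the AAB (average accuracy based) and CDB (class differential, Eq. 27) rank columns of Table III from the reported TPR/TNR values, assuming no ties (the paper's tie-handling rules are not shown in this excerpt).

```python
# Illustrative sketch: reproduce the AAB and CDB rank columns of Table III.
# Input: per-class accuracies (TPR, TNR) of the 5 base classifiers.
per_class_accuracy = {
    "J48":             (0.975, 0.983),
    "REPTree":         (0.961, 0.974),
    "RandomTree-100":  (0.972, 0.982),
    "RandomTree-9":    (0.966, 0.973),
    "VotedPerceptron": (0.971, 0.991),
}

def ranks(scores, higher_is_better=True):
    """Map each classifier to a rank in 1..K, with K = best (Table III convention)."""
    ordered = sorted(scores, key=scores.get, reverse=not higher_is_better)
    # ordered[0] is the worst performer -> rank 1, ordered[-1] the best -> rank K
    return {name: position + 1 for position, name in enumerate(ordered)}

# AAB: rank by the average of the two per-class accuracies (higher is better).
aab_ranks = ranks({n: (tpr + tnr) / 2 for n, (tpr, tnr) in per_class_accuracy.items()})

# Class differential g_k = |P_k,m - P_k,b| (Eq. 27); a smaller gap ranks higher.
cdb_ranks = ranks({n: abs(tpr - tnr) for n, (tpr, tnr) in per_class_accuracy.items()},
                  higher_is_better=False)

print(aab_ranks["VotedPerceptron"], cdb_ranks["RandomTree-9"])  # 5 5
```

Both computed rank sets match the AAB and CDB columns of Table III; the RAPC and RACD schemes additionally combine such rankings (Eqs. 28 onward) before selecting the pairwise scheme.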
TABLE V: Malgenome-215 comparison of DroidFusion with base classifiers and traditional combination schemes on the test set.

Classifier                PrecM  RecM   PrecB  RecB   W-FM    T(s)
J48                       0.948  0.948  0.974  0.974  0.9654  0.02
REPTree                   0.960  0.956  0.978  0.980  0.9720  0.01
Random Tree-100           0.967  0.956  0.978  0.984  0.9747  0.03
Random Tree-9             0.955  0.944  0.972  0.978  0.9667  0.02
Voted Perceptron          0.971  0.956  0.978  0.986  0.9760  0.05
Maj. voting               0.988  0.960  0.980  0.994  0.9827  0.05
Average of Probabilities  0.988  0.960  0.980  0.994  0.9827  0.06
Maximum Probability       0.906  0.988  0.994  0.949  0.9623  0.04
MultiScheme               0.983  0.956  0.978  0.992  0.9800  0.07
DroidFusion               0.984  0.968  0.984  0.992  0.9840  0.07

After the model has been built with the help of the training-validation set, the full DroidFusion model (featuring AAB+RAPC in level-2) was evaluated on the test set. For comparison, the base classifier models were re-trained on the complete training-validation set and then tested on the same test set. The results are shown in Table V. Figure 3 is a graph of the respective weighted F-measures. The results of DroidFusion are also compared to those of three classifier combination methods: Majority Vote, Maximum Probability and Average of Probabilities [13], and a meta learning method known as MultiScheme. The MultiScheme approach evaluates a given number of base classifiers in order to select the best model. In WEKA, it can be configured to use cross-validation or to build its model on the entire training set. In our experiments we selected the 10-fold cross validation configuration for the MultiScheme learner to enable a comparative equivalent to DroidFusion. Time T(s) depicts the testing time on the entire set of instances in the test set for each of the methods in Table V.

On the test set, Random Tree-100 recorded the best weighted F-measure out of the 5 base classifiers. Table V shows that higher precision, recall (for both classes) and a larger weighted F-measure were obtained with DroidFusion compared to all of the base classifiers. DroidFusion also performed better than MultiScheme and all three combination schemes. These results with the Malgenome-215 dataset demonstrate the effectiveness of the DroidFusion approach.

Fig. 3: Weighted F-measure results from the Malgenome-215 dataset experiments.

B. Performance of DroidFusion with the Drebin-215 dataset.

In this section, we present the evaluation of DroidFusion on the Drebin-215 dataset. Table VI shows the predictive accuracies of the 5 non-ensemble base classifiers on the training-validation set during DroidFusion model training. The split ratio for the training-validation and testing sets was 90%:10%, and the 10-fold cross-validation procedure was utilized during training. The rankings induced by the AAB, CDB, RAPC and RACD algorithms are also shown. Again, applying Eq. (9) to the instances in the training-validation set and computing accuracies with Eqs. (10)-(12), the performances of the pairwise combinations of the level-2 algorithms are shown in Table VII.

TABLE VI: Drebin-215 train-validation set results and Level-2 algorithm based rankings for the base classifiers (5 = highest rank, 1 = lowest).

Classifier        TPR    TNR    AAB  CDB  RAPC  RACD
J48               0.959  0.983  4    3    5     4
REPTree           0.950  0.979  1    1    1     1
Random Tree-100   0.968  0.981  5    5    4     5
Random Tree-9     0.958  0.977  2    4    2     3
Voted Perceptron  0.956  0.982  3    2    3     2

TABLE VII: Drebin-215 train-validation set Level-2 combination schemes intermediate results.

Combination  PrecM  RecalM  PrecB  RecalB  W-FM
AAB+CDB      0.966  0.972   0.984  0.980   0.9771
AAB+RAPC     0.984  0.972   0.984  0.991   0.9840
AAB+RACD     0.966  0.972   0.984  0.980   0.9771
CDB+RAPC     0.981  0.972   0.984  0.989   0.9827
CDB+RACD     0.966  0.972   0.984  0.980   0.9771
RAPC+RACD    0.981  0.976   0.986  0.989   0.9842

From Table VI, Random Tree-100 had the best recall rate for the malware class (i.e. 0.968) while J48 had the best recall rate for the benign class (0.983). On the training-validation set, the weighted F-measure for Random Tree-100 was 0.9762, while that of J48 was 0.9741. Looking at Table VII, all of the combination schemes had better weighted F-measures (than the base classifiers), indicating accuracy performance enhancement potential at this stage. The best combination is the RAPC+RACD (S3S4 pair) scheme, whose configuration is selected to build the final DroidFusion model.

After the full DroidFusion model was built, it was then evaluated on the test set. The base classifiers were re-trained on the entire training-validation set and tested on the test set for comparison. The results are presented in Table VIII, where Random Tree-100 can be seen to have the best weighted F-measure (0.9824) out of the 5 base classifiers. The DroidFusion model recorded the best precision and recall (for both classes) compared to the base classifiers, resulting in a weighted F-measure of 0.9872. Figure 4 illustrates the graph of F-measures for the test set results. DroidFusion can be seen to also perform better than Majority Vote, Maximum Probability, Average of Probabilities and MultiScheme. These results clearly demonstrate the effectiveness of the DroidFusion approach.

TABLE VIII: Drebin-215 comparison of DroidFusion with base classifiers and traditional combination schemes on the test set.

Classifier                PrecM  RecM   PrecB  RecB   W-FM    T(s)
J48                       0.972  0.964  0.979  0.984  0.9766  0.03
REPTree                   0.976  0.951  0.972  0.986  0.9730  0.04
Random Tree-100           0.975  0.978  0.987  0.985  0.9824  0.04
Random Tree-9             0.947  0.971  0.983  0.968  0.9692  0.04
Voted Perceptron          0.969  0.950  0.971  0.982  0.9701  0.37
Maj. voting               0.983  0.973  0.984  0.990  0.9837  0.32
Average of Probabilities  0.983  0.973  0.984  0.990  0.9837  0.31
Maximum Probability       0.908  0.996  0.998  0.941  0.9617  0.33
MultiScheme               0.984  0.969  0.982  0.984  0.9784  0.05
DroidFusion               0.981  0.984  0.991  0.989  0.9872  0.38

Fig. 4: Weighted F-measure results from the Drebin-215 dataset experiments.

C. Performance of DroidFusion with the McAfee-350 dataset.

In this section, the results of experiments on the McAfee-350 dataset are presented. The same split ratios for training-validation/testing and the procedures applied in the previous experiment were adopted. The rankings from AAB, CDB, RAPC and RACD are shown in Table IX alongside the per-class accuracy performances on the validation set that induced the rankings. Just like in the previous experiments, we apply Eq. (9) to the instances in the training-validation set and compute the accuracies with Eqs. (10)-(12). The resulting performances of the pairwise combinations of the level-2 algorithms are shown in Table X.

TABLE IX: McAfee-350 train-validation set results and Level-2 algorithm based rankings for the base classifiers (5 = highest rank, 1 = lowest).

Classifier        TPR    TNR    AAB  CDB  RAPC  RACD
J48               0.941  0.973  4    3    4     3
REPTree           0.928  0.966  2    2    2     2
Random Tree-100   0.948  0.968  5    5    4     5
Random Tree-9     0.935  0.962  3    4    2     3
Voted Perceptron  0.917  0.959  1    1    1     1

TABLE X: McAfee-350 train-validation set Level-2 combination schemes intermediate results.

Combination  PrecM  RecalM  PrecB  RecalB  W-FM
AAB+CDB      0.945  0.955   0.972  0.966   0.9618
AAB+RAPC     0.970  0.956   0.973  0.982   0.9720
AAB+RACD     0.969  0.956   0.973  0.981   0.9714
CDB+RACD     0.969  0.955   0.972  0.981   0.9710
RAPC+RACD    0.970  0.957   0.974  0.982   0.9724

From Table IX (training-validation set results for the base classifiers), J48 had the best benign class recall of 0.973 amongst the 5 base classifiers. Random Tree-100 had the best malware class recall of 0.948 out of the 5 base classifiers. J48 had the highest weighted F-measure of 0.9684. This is less than the weighted F-measures of all the combination schemes (shown in Table X) except the AAB+CDB scheme, which had a weighted F-measure of 0.9618. These intermediate results of the DroidFusion approach demonstrate the potential performance improvement obtainable in the final model.

Table XI shows the results of the base classifiers and the final DroidFusion model on the test set. The table, and the graphs in Figure 5, clearly show that DroidFusion increases performance accuracy over the single-algorithm base classifiers. The DroidFusion results are equal to those of Majority vote and Average of Probabilities, but better than the Maximum probability and MultiScheme methods. This is because Eq. (17) selected mv as the strongest classifier over any of the pairs, based on the computations on the initial N-fold cross validation predictions of the base classifiers. The mv scheme in this case achieved a W-FM of 0.9735, compared to 0.9724 obtained by CDB+RAPC (the S2S3 pair) and RAPC+RACD (the S3S4 pair). Therefore, DroidFusion was configured to use Eqs. (13)-(16) on the test set. However, if either of the strongest pairs had been used, it would result in a weighted F-measure performance of 0.9777 on the test set, which still surpasses the weighted F-measure from Maximum probability (0.9423) and those of the five original base classifiers. These results once again confirm the effectiveness of the proposed DroidFusion approach. In the next section we present results obtained from experiments investigating ensemble learners as base classifiers.
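The weighted F-measure values compared above follow directly from Eqs. (37) and (38). The sketch below (not the authors' code) recomputes DroidFusion's W-FM on the McAfee-350 test set from its Table XI precision/recall values; the class counts Nm = 1380 and Nb = 2238 are an assumption, taken as an approximate stratified 10% test split of the 13,805 malware and 22,378 benign samples, since the exact per-split counts are not stated.

```python
# Hedged sketch of the F-measure (Eq. 37) and weighted F-measure (Eq. 38).

def f_measure(precision, recall):
    # Eq. (37): harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def weighted_f_measure(fm, fb, n_malware, n_benign):
    # Eq. (38): per-class F-measures weighted by class sizes.
    return (fm * n_malware + fb * n_benign) / (n_malware + n_benign)

# DroidFusion on the McAfee-350 test set (Table XI): PrecM/RecM, PrecB/RecB.
fm = f_measure(0.980, 0.964)   # malware-class F-measure
fb = f_measure(0.978, 0.988)   # benign-class F-measure

# ASSUMED class counts for a stratified 10% test split (not given in the text).
wfm = weighted_f_measure(fm, fb, 1380, 2238)
print(round(wfm, 4))  # 0.9788
```

With these assumed counts the result reproduces the 0.9788 W-FM reported for DroidFusion in Table XI to four decimal places.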
TABLE XI: McAfee-350 comparison of DroidFusion with base classifiers and traditional combination schemes on the test set.

Classifier                PrecM  RecM   PrecB  RecB   W-FM    T(s)
J48                       0.967  0.950  0.969  0.980  0.9685  0.11
REPTree                   0.942  0.943  0.965  0.964  0.9560  0.11
Random Tree-100           0.954  0.951  0.970  0.972  0.9640  0.11
Random Tree-9             0.952  0.936  0.961  0.971  0.9576  0.12
Voted Perceptron          0.928  0.917  0.949  0.956  0.9411  6.76
Maj. voting               0.980  0.964  0.978  0.988  0.9788  6.76
Average of Probabilities  0.980  0.964  0.978  0.988  0.9788  7.01
Maximum Probability       0.874  0.990  0.993  0.912  0.9423  6.54
MultiScheme               0.967  0.950  0.969  0.980  0.9685  0.12
DroidFusion               0.980  0.964  0.978  0.988  0.9788  7.02

Fig. 5: Weighted F-measure results from the McAfee-350 dataset experiments.

D. Performance of DroidFusion with the McAfee-100 dataset using ensemble learners as base classifiers.

In this section we present results of experiments performed to investigate the feasibility of utilizing DroidFusion to enhance accuracy performance by combining ensemble classifiers rather than traditional singular classifiers. Ensemble learners have been shown to perform well in classification problems [14], [33]. Our goal is to investigate whether, by using DroidFusion for the fusion of ensemble classifiers, further accuracy improvements can be achieved. For our ensemble learning based experiments, we reduced the number of features from 350 down to 100 using the information gain feature ranking technique (Eqs. 31-33). The ensemble learners considered as example base classifiers include: Random Forest [39], AdaBoost [40] (with Random Tree base classifier), Random Committee (with Random Tree base classifier), Random Subspace [41] (with Random Tree base classifier), and Random Subspace with REPTree base classifier. Note that the two Random Subspace learners with different base classifiers yield completely different models. In terms of the number of iterations for the ensemble learners, the configurations used were: AdaBoost (25 iterations); Random Forest, Random Committee, and Random Subspace (10 iterations each). Our choice of Random Tree as base learner for the ensemble (base) classifiers comes from our preliminary experiments (omitted due to space constraints), which also confirm the previous suggestion that it produces the strongest classifiers for most ensemble methods [42]. In the preliminary experiments, it was also found that by taking the top 100 features only a marginal drop in performance was observed for the ensemble base classifiers. Hence, this enabled us to undertake the experiments with ensemble classifiers using a significantly reduced dimension while using the same number of instances (i.e. 36,183).

Table XII shows the accuracy performance of the 5 ensemble models used as the DroidFusion base classifiers on the training-validation set instances (using 10-fold cross-validation). The corresponding AAB, CDB, RAPC and RACD rankings are also depicted. Similar to the previous experiments, the level-2 combination schemes' performance improves on that of the individual ensemble classifiers. This is also indicative of the potential performance improvement obtainable when the final model is constructed. In this case, AAB+RAPC (the S1S3 pair) is the recommended configuration, as seen from the Table XIII results.

TABLE XII: McAfee-100 train-validation set results and Level-2 algorithm based rankings for the (ensemble) base classifiers (5 = highest rank, 1 = lowest).

Classifier                  TPR    TNR    AAB  CDB  RAPC  RACD
Random Forest               0.959  0.979  4    4    2     4
Random Sub. (REPTree)       0.923  0.984  1    1    2     1
AdaBoost (Random Tree)      0.949  0.979  2    3    1     2
Random Sub. (Random Tree)   0.951  0.985  5    2    4     2
Random Comm. (Random Tree)  0.961  0.980  3    5    4     5

TABLE XIII: McAfee-100 train-validation set Level-2 combination schemes intermediate results.

Combination  PrecM  RecalM  PrecB  RecalB  W-FM
AAB+CDB      0.980  0.955   0.973  0.988   0.9753
AAB+RAPC     0.982  0.954   0.972  0.989   0.9756
AAB+RACD     0.980  0.955   0.973  0.988   0.9753
CDB+RAPC     0.980  0.955   0.973  0.988   0.9753
CDB+RACD     0.977  0.957   0.974  0.986   0.9749
RAPC+RACD    0.980  0.955   0.973  0.988   0.9753

In Table XIV, the test set results of the ensemble classifiers and those of DroidFusion are given. The results of MultiScheme, Majority vote, Average of probabilities and Maximum probability are also shown. DroidFusion improves benign recall rates over all of the ensemble models in the base classifier level. The overall weighted F-measure of DroidFusion is the highest, as shown in Table XIV and the Figure 6 graphs. This shows that the DroidFusion approach can also be effectively applied for the fusion of ensemble classifiers.
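The Majority vote baseline that DroidFusion is compared against in Tables XI and XIV can be sketched in a few lines. This is an illustration, not the paper's WEKA setup: it fuses the binary labels emitted by the K = 5 base classifiers for each instance, with 1 denoting malware; the example predictions are hypothetical.

```python
# Illustrative sketch of the Majority vote combiner over K base classifiers.
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label among one instance's K predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of five base classifiers for three apps (1 = malware).
per_instance = [
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0],
]
fused = [majority_vote(p) for p in per_instance]
print(fused)  # [1, 0, 1]
```

With an odd K (as used throughout the experiments, K = 5) no tie-breaking rule is needed; DroidFusion differs from this baseline by weighting the base classifiers via the level-2 ranking algorithms instead of counting their votes equally.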
TABLE XIV: McAfee-100 comparison of DroidFusion with (ensemble) base classifiers and traditional combination schemes on the test set.

Classifier                  PrecM  RecM   PrecB  RecB   W-FM    T(s)
Random Forest               0.960  0.965  0.978  0.975  0.9712  0.09
Random Sub. (REPTree)       0.971  0.931  0.958  0.983  0.9630  0.05
AdaBoost (Random Tree)      0.963  0.957  0.974  0.977  0.9694  0.11
Random Sub. (Random Tree)   0.974  0.957  0.974  0.984  0.9737  0.06
Random Comm. (Random Tree)  0.963  0.964  0.978  0.977  0.9720  0.08
Maj. voting                 0.977  0.959  0.975  0.986  0.9757  0.24
Average of Probabilities    0.978  0.958  0.974  0.987  0.9759  0.19
Maximum Probability         0.972  0.958  0.974  0.983  0.9734  0.20
MultiScheme                 0.960  0.968  0.980  0.975  0.9724  0.08
DroidFusion                 0.983  0.958  0.974  0.990  0.9777  0.22

Fig. 6: Weighted F-measure results from the McAfee-100 dataset experiments with ensemble base classifiers.

E. Performance of DroidFusion vs. Stacked generalization.

Stacked Generalization [43] has a similar (multilevel) architecture to DroidFusion. It is also a well-known framework for classifier fusion which has been extensively studied and applied to many machine learning problems. For this reason, we compared our proposed approach to the Stacked Generalization method. One noticeable difference between our approach and Stacked Generalization is that, instead of training with a meta-learner in level-2, we utilized a computational approach where ranking algorithms are used to combine the outcomes of the lower level classifiers. We used the StackingC implementation in WEKA, which uses a Linear Regression meta classifier in level-2. Note that this is considered to be the most effective Stacked Generalization configuration [44] (given that any learning algorithm can be chosen as the meta classifier). The StackingC learner is also configured to use 10-fold cross validation when combining the base learners.

Applying the Stacked Generalization algorithm to the same base classifiers and with the same four datasets, the results are given in Figure 7 and Table XV. From Figure 7, the weighted F-measure comparative results for the four datasets show that StackingC achieved a better performance only in the case of the Malgenome-215 dataset. On all the other three datasets, DroidFusion performed better. A notable advantage of DroidFusion over Stacking is that it provides a wider range of criteria for weighting and fusion of base classifiers through the use of four separate algorithms; by contrast, Stacking (with a linear regression meta classifier) effectively combines classifiers based on only one criterion, i.e. weighting the base classifiers according to their relative strengths (overall performance accuracies) [44].

Fig. 7: DroidFusion vs. Stacking results (weighted F-measure).

TABLE XV: DroidFusion vs. Stacked Generalization for the four datasets.

Method/dataset          PrecM  RecalM  PrecB  RecalB  W-FM
DroidFusion-Malgenome   0.984  0.968   0.984  0.992   0.9840
StackingC-Malgenome     0.992  0.964   0.982  0.996   0.9853
DroidFusion-Drebin      0.981  0.984   0.991  0.989   0.9872
StackingC-Drebin        0.988  0.969   0.982  0.993   0.9841
DroidFusion-McAfee-350  0.980  0.964   0.978  0.988   0.9788
StackingC-McAfee-350    0.974  0.967   0.980  0.984   0.9775
DroidFusion-McAfee-100  0.983  0.958   0.974  0.990   0.9777
StackingC-McAfee-100    0.978  0.958   0.974  0.987   0.9759

F. Analysis of time performance

As mentioned earlier, the app processing to extract features was done using our bespoke Python based tool described in section IV-A. Table XVI presents an overview of app processing time estimates. This is dependent on the size of the app, which can range from a few kilobytes to several megabytes. The average unzipping and disassembly time was 0.739 seconds, while the average time to analyse the manifest and extract permissions, intents etc. was 0.0048 seconds. The rest of the processing involves mining the disassembled files and scanning for other attributes; this took on average 6.4 seconds. The total average processing time for the apps was therefore approximately 7.145 seconds. During the experiments the feature vectors were fed into trained models for testing. The DroidFusion model testing times were 0.07
IEEE TRANSACTIONS ON CYBERNETICS 13

TABLE XVI: Analysis of app processing time [2] McAfee Labs. McAfee Labs Threat Predictions Report. March 2016.
[3] Y. Zhou and X. Jiang, ”Dissecting android malware: Characterization and
Task Lowest (s) Highest (s) Average (s) evolution” In proc. 2012 IEEE Symposium on Security and Privacy (SP),
Unzipping and San Fransisco, CA, USA, 20-23 May, 2012 , pp. 95-109.
dissassemby 0.392 1.18 0.739 [4] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, ”Drebin:
Manifest analysis 0.0013 0.0088 0.0048 Efficient and Explainable Detection of Android Malware in Your Pocket”
Code analysis 3.428 15.47 6.4 In proc. 20th Annual Network & Distributed System Security Symposium
Total 7.145 (NDSS), San Diego, CA, USA, 23-26 Feb. 2014.
[5] A. Apvrille, and R. Nigam. Obfuscation in Android Malware, and
how to fight back. Virus Bulletin, July 2014. Available from:
https://www.virusbulletin.com/virusbulletin/2014/07/obfuscation-
android-malware-and-how-fight-back [Accessed Sept. 2017]
seconds (for 759 instances), 0.38 seconds (for 1503 instances), [6] Y. Jing, Z. Zhao, G.-J. Ahn, and H. Hu, ”Morpheus: Automatically
7.02 seconds (for 3618 instances), and 0.22 seconds (for 3618 Generating Heuristics to Detect Android Emulators” In proc. 30th An-
instances) in the four sets of experimental results presented nual Computer Security Applications Conference (ACSAC 2014),New
Orleans, Louisiana, USA, Dec. 8-12, 2014, pp. 216-225.
earlier. These figures clearly illustrate the scalability of static- [7] T. Vidas and N. Christin, ”Evading Android runtime analysis via sandbox
based features solution with only an average of just over 7 detection” In proc. 9th ACM Symposium on Information, Computer and
seconds required to process an app and classify it using a Communications Security, Kyoto, Japan, June 04-06, 2014, pp. 447-458.
[8] T. Petsas, G. Voyatzis, E. Athanasopoulos, M. Polychronakis, and S.
trained DroidFusion model. Thus, it is feasible in practice to Ioannidis, ”Rage against the virtual machine: hindering dynamic analysis
deploy the system for scenarios requiring large scale vetting of android malware” In proc.7th European Workshop on System Security
of apps. (EuroSec ’14), Amsterdam, Netherlands, April 13, 2014.
[9] F. Matenaar and P. Schulz. Detecting android sandboxes.
Note that although our study is based on specific static http://dexlabs.org/blog/btdetect, August 2012. [Accessed: Sept. 2017].
features, classifiers trained from other types of features can [10] S. R. Choudhary, A. Gorla, A. Orso, ”Automated test input generation
also be combined using DroidFusion. Basically, DroidFusion for Android: are we there yet?” In proc. 30th IEEE/ACM international
conference on Automated Software Engineering (ASE 2015), Nov. 9-13,
is agnostic to the feature engineering process. 2015, pp. 429-440.
[11] W. Dong-Jie, M. Ching-Hao, W. Te-En, L. Hahn-Ming, and W. Kuo-
Ping, ”DroidMat: Android malware detection through manifest and API
G. Limitations of DroidFusion calls tracing,” In proc. Seventh Asia Joint Conference on Information
Although the proposed general-purpose DroidFusion ap- Security(Asia JCIS), 2012, pp. 62-69.
[12] S. Y. Yerima, S. Sezer, and I. Muttik. ”Android malware detection:
proach has been demonstrated empirically to enable improved An eigenspace analysis approach” In proc. Science and Information
accuracy performance by classifier fusion, there is scope Conference (SAI 2015), London, UK, 28-30 July 2015, pp.1236-1242.
for further improvement. The current DroidFusion design is [13] S. Y. Yerima, S. Sezer, I. Muttik ”Android malware detection using
parallel machine learning classfiers” In proc. 8th Int. Conf. on Next
aimed at binary classification. Future work could investigate Generation Mobile Apps, Services and Technolgies (NGMAST 2014),
extending the algorithms in the DroidFusion framework to Oxford, UK, Sept. 10-12, 2014, pp. 37-42
handle multi-class problems. [14] S. Y. Yerima, S. Sezer, and I. Muttik. High accuracy Android malware
detection using ensemble learning. IET Information Security, Vol 9, issue
6, 2015, pp. 313-320.
VI. CONCLUSION

In this paper, we proposed a novel general-purpose multilevel classifier fusion approach (DroidFusion) for Android malware detection. The DroidFusion framework is based on four proposed ranking-based algorithms that enable higher-level fusion using a computational approach rather than the traditional meta classifier training used, for example, in Stacked Generalization. We empirically evaluated DroidFusion using four separate datasets. The results presented demonstrate its effectiveness in improving performance with both non-ensemble and ensemble base classifiers. Furthermore, we showed that our proposed approach can outperform Stacked Generalization whilst utilizing only computational processes for model building rather than training a meta classifier at the higher level.
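The general two-level idea can be illustrated with a simplified sketch. This is not one of DroidFusion's four ranking-based algorithms; it only shows how weights derived computationally from base-classifier validation accuracies can replace a trained meta classifier (the dataset and base learners here are arbitrary illustrations).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Level 1: train the base classifiers on the training split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
bases = [
    DecisionTreeClassifier(random_state=0),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]
for clf in bases:
    clf.fit(X_train, y_train)

# Level 2: derive per-classifier weights from validation accuracy --
# a purely computational step; no meta classifier is ever trained.
weights = np.array([clf.score(X_val, y_val) for clf in bases])


def fused_predict(X_new):
    """Accuracy-weighted majority vote over the binary {0, 1} labels."""
    votes = np.array([clf.predict(X_new) for clf in bases])
    return (weights @ votes >= weights.sum() / 2).astype(int)


print("fused validation accuracy:", (fused_predict(X_val) == y_val).mean())
```

In contrast, Stacked Generalization would fit an additional learner on the base classifiers' outputs; here the combination rule is computed directly from their measured accuracies.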
ACKNOWLEDGMENT

This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/N508664/1, Centre for Secure Information Technologies (CSIT-2).
of one-class classifiers for network intrusion detection system In proc. Suleiman Y. Yerima (M’04) received the B.Eng.
Fourth international conference on information assurance and security, degree (first Class) in electrical and computer engi-
2008, IEEE, DOI 10.1109/IAS.2008.35 neering from the Federal University of Technology,
[33] L. D. Coronado-De-Alba, A. Rodriguez-Mota, P. J. Escamilla- Ambrosio Minna, Nigeria, the M.Sc, degree (with distinction)
Feature Selection and ensemble of classifiers for Android malware detec- in personal, mobile and satellite communications
tion In proc. 8th IEEE Latin-American Conference on Communications from the University of Bradford, Bradford, U.K.,
(LATINCOM 2016), 15-17 Nov. 2016. and the Ph.D. degree in mobile computing and
[34] M. K Alzaylaee, S. Y. Yerima, S. Sezer ”Improving Dynamic Analysis of communications from the University of South Wales,
Android Apps Using Hybrid Input Test Generation” In proc. Int. Conf. on Pontypridd, U.K. (formerly, the University of Glam-
CyberSecurity and Protection of Digital Services (Cyber Security 2017), organ) in 2009.
London, UK, June 19-20, 2017. He is currently a Senior Lecturer of Cyber Se-
[35] Y. Aafer, W. Du, and H. Yin, ”DroidAPIMiner: Mining API-level fea- curity in the Faculty of Technology, at De Montfort University, Leicester,
tures for robust malware detection in Android” In proc. 9th Int.Conference United Kingdom. He was previously a Research Fellow at the Centre for
on Security and Privacy in Communication Networks (SecureComm Secure Information Technologies (CSIT), Queens University Belfast, UK,
2013). Sydney, Australia, Sep. 25-27, 2013. where he led the mobile security research theme from 2012 until 2017. He
[36] T. Book, A. Pridgen, and D. S. Wallach, ”Longitudinal Analysis of was a member of the Mobile Computing Communications and Networking
Android Ad Library Permissions” In proc. Mobile Security Technologies (MoCoNet) Research group at Glamorgan from 2005 to 2009. From 2010 to
conference (MoST 13), San Fransisco, CA, May 2013. 2012, he was with the UK- India Advanced Technology Centre of excellence
[37] M. Hall, E. Frank, G. Holmes, B. Pfahriger, P. Reutermann and I. H. in Next Generation Networks, Systems and Services (IU-ATC), University of
Witten. The WEKA data mining software: an update. ACM SIGKDD Ulster, Coleraine, Northern Ireland .
Explorations, Vol.11, No.1. June 2009.pp 10-18. Dr. Yerima is a member of the IAENG, and (ISC)2 professional societies.
[38] T. M. Cover, J. A. Thomas, Elements of Information Theory, 2nd He is a Certified Information Systems Security Professional (CISSP) and
Edition, John Wiley & Sons, inc., Hoboken, New Jersey, 2006, pp. 41. a Certified Ethical Hacker (CEH). He was the recepient of the 2017 IET
Sakir Sezer (M'00) received the Dipl.-Ing. degree in electrical and electronic engineering from RWTH Aachen University, Germany, and the Ph.D. degree in 1999 from Queen's University Belfast, U.K. Prof. Sezer is currently Secure Digital Systems Research Director and Head of Network Security Research in the School of Electronics, Electrical Engineering and Computer Science at Queen's University Belfast. His research is leading major (patented) advances in the field of high-performance content processing and is currently commercialized by Titan IC Systems. He has co-authored over 120 conference and journal papers in the areas of high-performance networks, content processing, and System on Chip. Prof. Sezer has been awarded a number of prestigious awards including InvestNI, Enterprise Ireland and InterTrade Ireland innovation and enterprise awards, and the InvestNI Enterprise Fellowship. He is also co-founder and CTO of Titan IC Systems and a member of the IEEE International System-on-Chip Conference executive committee.