DroidFusion Accepted Version
Abstract—Android malware has continued to grow in volume and complexity, posing significant threats to the security of mobile devices and the services they enable. This has prompted increasing interest in employing machine learning to improve Android malware detection. In this paper, we present a novel classifier fusion approach based on a multilevel architecture that enables effective combination of machine learning algorithms for improved accuracy. The framework (called DroidFusion) generates a model by training base classifiers at a lower level and then applies a set of ranking-based algorithms on their predictive accuracies at the higher level in order to derive a final classifier. The induced multilevel DroidFusion model can then be utilized as an improved accuracy predictor for Android malware detection. We present experimental results on four separate datasets to demonstrate the effectiveness of our proposed approach. Furthermore, we demonstrate that the DroidFusion method can also effectively enable the fusion of ensemble learning algorithms for improved accuracy. Finally, we show that the prediction accuracy of DroidFusion, despite only utilizing a computational approach in the higher level, can outperform Stacked Generalization, a well-known classifier fusion method that employs a meta-classifier approach in its higher level.

Index Terms—Android Malware Detection, Mobile Security, Machine Learning, Classifier Fusion, Ensemble Learning, Stacked Generalization.

I. INTRODUCTION

Machine learning based methods are increasingly being applied to Android malware detection. However, classifier fusion approaches have not been extensively explored as they have been in other domains like network intrusion detection.

In this paper, we present and investigate a novel classifier fusion approach that utilizes a multilevel architecture to increase the predictive power of machine learning algorithms. The framework, called DroidFusion, is designed to induce a classification model for Android malware detection by training a number of base classifiers at the lower level. A set of ranking-based algorithms are then utilized to derive combination schemes at the higher level, one of which is selected to build a final model. The framework is capable of leveraging not only traditional singular learning algorithms like Decision Trees or Naive Bayes, but also ensemble learning algorithms like Random Forest, Random Subspace, Boosting, etc., for improved classification accuracy.

In order to demonstrate the effectiveness of the DroidFusion approach, we performed extensive experiments on four datasets derived by extracting features from two publicly available and widely used malware sample collections (i.e. the Android Malgenome project [3] and DREBIN [4]) and a collection of samples provided by Intel Security (formerly McAfee). The unique contributions of this paper can be summarized as follows:
Wang et al. [51] extracted 11 types of static features and employed multiple classifiers in a majority vote fusion approach. The classifiers include SVM, K-Nearest Neighbour, Naive Bayes, Classification and Regression Tree (CART) and Random Forest. Their experiments on 116,028 app samples showed more robustness with the majority voting ensemble than with the individual base classifiers.

Idrees et al. [55] utilize permissions and intents as features to train machine learning models and applied classifier fusion for improved performance. Their experiments were performed on 1,745 app samples, starting with a performance comparison between MLP, Decision Table, Decision Tree, Random Forest, Naive Bayes and Sequential Minimal Optimization classifiers. The Decision Table, MLP and Decision Tree classifiers were then combined using three schemes: average of probabilities, product of probabilities and majority voting. Coronado-de-Alba et al. [33] proposed and investigated a classifier fusion method based on Random Forest and Random Committee ensemble classifiers. Their approach embeds Random Forest within Random Committee to produce a meta-ensemble model. The meta-model outperformed the individual classifiers in experiments performed with 1,531 malware and 1,531 benign samples. Table I summarizes papers that have investigated classifier fusion for Android malware detection.

TABLE I: Overview of some of the papers that apply classifier fusion for Android malware detection. NB = Naive Bayes; SL = Simple Logistic; LR = Linear Regression; DT = Decision Tree; VP = Voted Perceptron. AveP = average of probabilities; ProdP = product of probabilities; MaxP = maximum probability.

    Paper/Year                           ML algorithms                           Fusion approach                     # samples
    Yerima et al. [13] (2014)            SVM, J48, PART, Ridor, NB, SL           Majority vote, ProdP, AveP, MaxP    6,863
    Coronado-de-Alba et al. [33] (2016)  Random Forest, Random Committee         Meta-ensembling: Random Forest      3,062
                                                                                 in Random Comm.
    Milosevic et al. [50] (2017)         SVM, C4.5, RT, DT, JRip, LR,            Majority vote                       387 / 368
                                         Random Forest
    Wang et al. [51] (2017)              SVM, KNN, NB, CART, Random Forest       Majority vote                       116,028
    Idrees et al. [55] (2017)            MLP, DT, Decision Table                 Majority vote, AveP, ProdP          1,745
    DroidFusion (this paper)             RT, J48, REPTree, VP, Random Forest,    Multilevel weighted                 3,799 / 15,036 / 36,183
                                         Random Comm., Random Sub., AdaBoost     ranking-based approach

In contrast to all of the existing Android malware detection works, this paper proposes a novel classifier fusion approach that utilizes four ranking-based algorithms within a multilevel framework (DroidFusion). We evaluated DroidFusion extensively and compared its performance to Stacking and other classifier fusion methods. Next, we present DroidFusion.

III. DROIDFUSION: GENERAL PURPOSE FRAMEWORK FOR CLASSIFIER FUSION

The DroidFusion framework consists of a multilevel architecture for classifier fusion. It is designed as a general-purpose classifier fusion system, so that it can be applied to both traditional singular classifiers and ensemble classifiers (which themselves usually employ a base classifier to produce different randomly induced models that are subsequently combined). At the lower level, the (DroidFusion) base classifiers are trained on a training set using a stratified N-fold cross validation technique to estimate their relative predictive accuracies. The outcomes are utilized by four different ranking-based algorithms (in the higher layer) that define certain criteria for the selection and subsequent combination of a subset (or all) of the applicable base classifiers. The outcomes of the ranking algorithms are combined in pairs in order to find the strongest pair, which is subsequently used to build the final DroidFusion model (after testing against an unweighted parallel combination of the base classifiers).

A. DroidFusion model construction

The model building (i.e. training) process is distinct from the prediction or testing phase, as the former utilizes a training-validation set to build a multilevel ensemble classifier, which is then evaluated on a separate test set in the latter phase. Figure 1 illustrates the 2-level architecture of DroidFusion. It shows the training paths (solid arrows) and the testing/prediction path (dashed arrows). First, at the lower level, each base classifier undergoes an N-fold cross validation based estimate of class performance accuracies. Let the N-fold cross-validated predictive accuracies for the K base classifiers be expressed by P_base, a K-tuple of the class accuracies of the K base classifiers:

    P_base = {[P_1m, P_1b], [P_2m, P_2b], ..., [P_Km, P_Kb]}                          (1)

The elements of P_base are applied to the ranking-based algorithms AAB, CDB, RAPC and RACD described later in Section III-B. Let X be the total number of instances, with M malware and B benign instances, where the M instances possess a label L=1 denoting malware and the B instances from X possess a label L=0 denoting benign. All X instances are also represented by feature vectors with f binary representations, where f is the number of features extracted from the given app. The features in the vectors take on 0 or 1, representing the absence or presence of the given feature. Additionally, after the N-fold cross validation process (as shown in Fig. 1), a K-tuple of class predictions is derived for every instance x, given by:

    V(x) = {v_1, v_2, ..., v_k},  ∀k ∈ {1, ..., K}                                    (2)

Note that v_1, v_2, ..., v_k could be crisp predictions or probability estimates from the base classifiers. Adding the original (known) class label, l, we obtain:

    V̇(x) = {v_1, v_2, ..., v_k, l},  ∀k ∈ {1, ..., K}, l ∈ {0, 1}                     (3)
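For illustration only, the level-1 procedure of Eqs. (1)-(3) can be sketched in Python with scikit-learn as below. This is not the authors' implementation (DroidFusion is built on WEKA, as noted in Section V); the chosen classifiers, the function name and the array layout are assumptions made for the example.

    import numpy as np
    from sklearn.base import clone
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

    def level1_cross_validation(classifiers, X, y, n_folds=10):
        """Sketch of level 1: estimate per-class accuracies (P_base, Eq. (1)) and
        collect out-of-fold predictions V(x) (Eq. (2)) for each base classifier."""
        K = len(classifiers)
        V = np.zeros((len(y), K), dtype=int)           # one column of crisp predictions per classifier
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
        for train_idx, val_idx in skf.split(X, y):
            for k, clf in enumerate(classifiers):
                model = clone(clf).fit(X[train_idx], y[train_idx])
                V[val_idx, k] = model.predict(X[val_idx])
        P_base = []                                    # [(P_km, P_kb)] per classifier, Eq. (1)
        for k in range(K):
            p_mal = np.mean(V[y == 1, k] == 1)         # malware-class accuracy (label 1)
            p_ben = np.mean(V[y == 0, k] == 0)         # benign-class accuracy (label 0)
            P_base.append((p_mal, p_ben))
        return P_base, V                               # appending y as a column gives V_dot(x), Eq. (3)

    # Stand-ins for the paper's WEKA learners (J48, REPTree, Random Tree, Voted Perceptron):
    # P_base, V = level1_cross_validation([DecisionTreeClassifier(), ExtraTreeClassifier()],
    #                                     X_trainval, y_trainval)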
P_base and V̇(x), ∀x ∈ X, will be utilized in the level-2 computation during the DroidFusion model construction. Let us denote the set of four ranking-based schemes by S = {S1, S2, S3, S4}. The pairwise combinations of the elements of S result in 6 possibilities:

    φ = {S1S2, S1S3, S1S4, S2S3, S2S4, S3S4}                                          (4)

Our goal is to select the best pair of ranking-based schemes from S, and if its performance exceeds that of an unweighted combination of the original base classifiers, it will be selected to construct the final DroidFusion model. In the event that the unweighted combination performance is greater, DroidFusion will be configured to apply a majority vote (or average of probabilities) of the base classifiers in the final constructed model. In order to estimate the accuracy performance of each scheme in S or each pairwise combination in the set φ, a re-classification of the X instances (in the training-validation set) is performed for each scheme or pair of schemes. The re-classification is accomplished using V̇(x), x ∈ X, based on the criteria defined by the schemes in S using P_base. Each scheme in S derives a set of Z weights that will be applied with V̇(x), x ∈ X, for every instance during the re-classification process.

Let ω_i, i ∈ {1, ..., Z}, Z ≤ K, be the set of weights derived for a particular scheme in S. Then, to reclassify an instance x according to the scheme's criterion, its class prediction is given by:

    C_Sj(x) = 1  if (Σ_{i=1}^{Z} ω_i v_i) / (Σ_{i=1}^{Z} ω_i) ≥ 0.5
              0  otherwise,                                        ∀j ∈ {1, 2, 3, 4}  (5)

Hence, the benign class accuracy performance for the given scheme is calculated from:

    P_Sj^ben = [ Σ_{x=1}^{X} (C_Sj(x) + 1) | C_Sj(x) = 0, l(x) = 0 ] / B              (6)

where B is the number of benign instances, while the malware accuracy performance is calculated from:

    P_Sj^mal = [ Σ_{x=1}^{X} C_Sj(x) | C_Sj(x) = 1, l(x) = 1 ] / (X − B)              (7)

Thus, the average performance accuracy is simply:

    Ṗ_Sj = ( B · P_Sj^ben + (X − B) · P_Sj^mal ) / X                                  (8)

Likewise, to determine the performance of each pairwise combination in φ: let ω_i, i ∈ {1, ..., Z}, Z ≤ K, be the set of weights derived for the first scheme in the pair, and let µ_i, i ∈ {1, ..., Z}, Z ≤ K, be those derived for the second scheme in the pair. Then, to reclassify the X instances in the training-validation set according to the combination pair, the class prediction of each instance x is given by:

    C_SjSn(x) = 1  if (Σ_{i=1}^{Z} ω_i v_i + Σ_{i=1}^{Z} µ_i v_i) / (Σ_{i=1}^{Z} ω_i + Σ_{i=1}^{Z} µ_i) ≥ 0.5
                0  otherwise,
                ∀j ∈ {1, 2, 3, 4}, ∀n ∈ {1, 2, 3, 4}, j ≠ n, SjSn ≡ SnSj              (9)
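As a minimal illustrative sketch (again, not the WEKA implementation), the weighted re-classification of Eqs. (5) and (9) and the accuracy computations of Eqs. (6)-(8) amount to the following, assuming crisp 0/1 prediction matrices for the Z selected base classifiers and weight vectors already produced by the ranking schemes:

    import numpy as np

    def reclassify_single(V, w):
        """Eq. (5): weighted vote over the Z classifiers selected by one scheme.
        V: (n_instances, Z) crisp 0/1 predictions; w: length-Z weight vector."""
        scores = V @ w / np.sum(w)
        return (scores >= 0.5).astype(int)

    def reclassify_pair(V1, w, V2, mu):
        """Eq. (9): combined weighted vote for a pair of schemes."""
        scores = (V1 @ w + V2 @ mu) / (np.sum(w) + np.sum(mu))
        return (scores >= 0.5).astype(int)

    def scheme_accuracies(pred, y):
        """Eqs. (6)-(8): benign accuracy, malware accuracy and their weighted average."""
        p_ben = np.mean(pred[y == 0] == 0)        # correctly predicted benign / B
        p_mal = np.mean(pred[y == 1] == 1)        # correctly predicted malware / (X - B)
        B, X = np.sum(y == 0), len(y)
        return p_ben, p_mal, (B * p_ben + (X - B) * p_mal) / X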
Therefore, computing benign class accuracy and malware class accuracy will utilize:
Eq. (20). Thus, the S2=CDB class prediction for an instance x is determined from Eq. (5). Whenever S2=CDB is used (in conjunction with another scheme) within a pair in the set expressed by Eq. (4), then Eq. (9) is used for the class prediction of the instance.

3) The Ranked Aggregate of Per Class accuracies (RAPC) based scheme: In the RAPC method, the ranking is directly proportional to the sum of the initial per-class rankings of the accuracies of the base classifiers. This method is more likely to assign a larger weight to a base classifier that performs very well in both classes. RAPC is summarized as follows.

With F̄ defined as the set of ordered rankings with cardinality K, and given the initial performance accuracies P_k,c of the K base classifiers:

    Pm ← P_k,c where c ≠ b
    Pb ← P_k,c where c ≠ m,             k ∈ {1, ..., K}, c ∈ {m, b}                   (23)

We then apply the ranking function Rank_desc(.) to both:

    P̄m ← Rank_desc(Pm)
    P̄b ← Rank_desc(Pb)                                                                (24)

The per-class rankings for each base classifier are aggregated and then ranked again:

    f_k ← P̄_k,m + P̄_k,b
    F ← f_k,                            ∀k ∈ {1, ..., K}                              (25)

    F̄ ← Rank_desc(F)                                                                  (26)

Finally, from the set F̄ comprising the K ordered values of F, we select the top Z rankings and use them to assign weights according to Eq. (20). Suppose the RAPC scheme is taken as S3; we can then determine the class prediction for an instance x from Eq. (5). If S3=RAPC is used (in conjunction with another scheme) within a pair in the set expressed by Eq. (4), then Eq. (9) is employed for the class prediction of the instance.
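A compact sketch of the RAPC computation in Eqs. (23)-(26) is shown below. It is an illustration under simplifying assumptions: ties are broken arbitrarily here, whereas the rankings reported later in Table III assign equal ranks to tied classifiers, and the weight assignment of Eq. (20) is not reproduced in this excerpt.

    import numpy as np

    def rank_desc(values):
        """Rank_desc(.): the largest value receives the highest rank K (cf. 5 = highest in Table III)."""
        return np.argsort(np.argsort(np.asarray(values))) + 1    # ranks 1..K; ties broken arbitrarily

    def rapc_ranking(P_base):
        """RAPC, Eqs. (23)-(26): rank each class accuracy, aggregate per classifier, rank again."""
        P_mal = np.array([p_km for p_km, _ in P_base])           # P_k,m
        P_ben = np.array([p_kb for _, p_kb in P_base])           # P_k,b
        F = rank_desc(P_mal) + rank_desc(P_ben)                  # Eq. (25): aggregate of per-class ranks
        return rank_desc(F)                                      # Eq. (26): final RAPC ranking

    # The top-Z classifiers under this ranking would then be weighted according to Eq. (20).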
4) The Ranked aggregate of Average accuracy and Class Differential (RACD) scheme: With RACD, the ranking is directly proportional to the sum of the initial rankings of the average performance accuracies and the initial rankings of the difference in performance between the classes. This method is designed to assign a larger weight to the base classifiers with good initial overall accuracy that also have a relatively smaller difference in performance between the classes. The algorithm is described as follows.

Suppose we take the RACD method as scheme S4 and define a set H̄ for ordered values with cardinality K. Given A, the set of computed average accuracies for each base classifier (determined in the AAB scheme), compute the class differential for each corresponding classifier as follows:

    g_k ← |P_k,m − P_k,b|,              k ∈ {1, ..., K}                               (27)

Define G ← g_k, ∀k ∈ {1, ..., K}, as the ordered set of g_k values, to which a ranking function Rank_ascen(.) is applied to rank g_k in ascending order of magnitude:

    Ḡ ← Rank_ascen(G)                                                                 (28)

Then, for each base classifier, aggregate the values and apply the ranking function Rank_desc(.):

    h_k ← A_k + G_k
    H ← h_k,                            A_k ∈ Ā, G_k ∈ Ḡ, ∀k ∈ {1, ..., K}            (29)

    H̄ ← Rank_desc(H)                                                                  (30)

Thus, H̄ is the set containing the ranked values of H in descending order of magnitude. The top Z rankings are then used according to Eq. (20) to assign the weights.
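Under the same simplifying assumptions as the RAPC sketch above, and again leaving the Eq. (20) weight assignment aside, the RACD aggregation of Eqs. (27)-(30) could be sketched as:

    import numpy as np

    def rank_ascen(values):
        """Rank_ascen(.): the smallest value receives the highest rank K."""
        return np.argsort(np.argsort(-np.asarray(values))) + 1

    def racd_ranking(P_base, A_ranks):
        """RACD, Eqs. (27)-(30). A_ranks: the AAB average-accuracy ranks (A_k in the paper)."""
        g = np.array([abs(p_km - p_kb) for p_km, p_kb in P_base])   # Eq. (27): class differential
        G_ranks = rank_ascen(g)                                     # Eq. (28): smaller gap -> higher rank
        H = np.asarray(A_ranks) + G_ranks                           # Eq. (29): aggregate the two rank sets
        return np.argsort(np.argsort(H)) + 1                        # Eq. (30): rank H, K = strongest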
C. Model complexity

As mentioned earlier, the base classifiers' initial accuracies are estimated using a stratified N-fold cross validation technique. This procedure is performed only once during training (on the training-validation set), and the preliminary predictions for all x instances in X for every base classifier are determined from the procedure. The configurations (weights) computed from each algorithm are applied together with these initial (base classifier) predictions to re-classify each instance accordingly. Since level-2 training prediction of instances requires only re-classification using V̇(x), ∀x ∈ X, the time complexity of utilizing R level-2 algorithms to predict the classes of X instances using Eq. (5) is O(RX). The pairwise class predictions also involve re-classification, thus the complexity of predicting the class of X instances using Eq. (9) is O(JX), where J = C(R, 2) is the number of pairwise combinations of the R algorithms (for example, J = 6 for R = 4). Likewise, for the unweighted majority vote the complexity is O(X), as re-classification is involved also. Since we utilize the unweighted majority vote and the pairwise combinations for final model building (Eq. (17)), the total level-2 training time complexity is therefore O(X) + O(JX) = O((J + 1)X), where J = C(R, 2) for the R level-2 ranking-based algorithms.

IV. INVESTIGATION METHODOLOGY

A. Automated static analyzer for feature extraction

The features used in the experimental evaluation of the DroidFusion system are obtained using an automated static analysis tool developed in Python. The tool enables us to extract permissions and intents from the application manifest file after decompiling it with AXMLPrinter2 (a library for decompiling Android manifest files). In addition, API calls are extracted by reverse engineering the .dex files with the Baksmali disassembler. The static analyzer also searches for dangerous Linux commands in the application files and checks for the presence of embedded .dex, .jar, .so, and .exe files within the application. Previous work [35] has shown that this set of static application attributes provides discriminative features for machine learning based Android malware detection; hence, we utilized them for the DroidFusion experiments. Furthermore, while extracting API calls, third-party libraries are excluded using the list of popular ad libraries obtained from [36].
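The authors' extraction tool is built around AXMLPrinter2 and Baksmali. Purely as a rough stand-alone illustration (not the authors' code), the embedded-file check and a permission dump using the Android SDK's aapt utility (assumed to be installed) might look like:

    import subprocess
    import zipfile

    EMBEDDED_EXTENSIONS = ('.dex', '.jar', '.so', '.exe')

    def embedded_file_flags(apk_path):
        """Flag the presence of embedded .dex/.jar/.so/.exe files inside the APK archive."""
        with zipfile.ZipFile(apk_path) as apk:
            names = [n.lower() for n in apk.namelist()]
        # classes.dex at the archive root is expected; any other matching entry is flagged
        return {ext: any(n.endswith(ext) and n != 'classes.dex' for n in names)
                for ext in EMBEDDED_EXTENSIONS}

    def permission_lines(apk_path):
        """List the 'uses-permission' lines reported by the Android SDK 'aapt' tool
        (assumed to be on PATH; the exact output format varies between aapt versions)."""
        out = subprocess.run(['aapt', 'dump', 'permissions', apk_path],
                             capture_output=True, text=True, check=True).stdout
        return [line.strip() for line in out.splitlines()
                if line.strip().startswith('uses-permission')]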
Fig. 2 shows an overview of the feature extraction process.
TABLE II: Datasets used for the DroidFusion evaluation experiments.

    Dataset        #samples  #malware  #benign  #features
    Malgenome-215  3799      1260      2539     215
    Drebin-215     15036     5560      9476     215
    McAfee-350     36183     13805     22378    350
    McAfee-100     36183     13805     22378    100

9,476 were benign samples while the remaining 5,560 were malware samples from the Drebin project [4]. The Drebin samples are also publicly available and widely used in the research community. Both the Drebin-215 and Malgenome-215 datasets are made available as supplementary material.

The final two datasets come from the same source of samples. These are McAfee-350 and McAfee-100 in the table. They both have 36,183 instances of feature vectors derived from 13,805 malware samples and 22,378 benign samples made available to us by Intel Security (formerly McAfee). Dataset #3 has 350 features, while Dataset #4 has the top 100 features with the largest information gain from the original 350 features in Dataset #3. In the experiments presented, Datasets #1, #2 and #3 are used to investigate DroidFusion with singular base classifiers, while Dataset #4 is used to study the fusion of ensemble base classifiers with DroidFusion. Note that all of the features were extracted using our static app analysis tool described in Section IV-A.

V. RESULTS AND DISCUSSIONS

In this section, we present and discuss the results of four sets of experiments performed to evaluate DroidFusion performance. We utilized the open source Waikato Environment for Knowledge Analysis (WEKA) toolkit [37] to implement and evaluate DroidFusion. Feature ranking and the reduction of Dataset #3 into Dataset #4 were also done with WEKA. In all the experiments we set K=5, i.e. five base classifiers are utilized. Also, we take N=10 and Z=3 for the cross validation and weight assignments respectively. In the first three sets of experiments, non-ensemble base classifiers were used: J48, REPTree, Voted Perceptron and Random Tree. The Random Tree learner was used to build two separate classifier models using different configurations, i.e. Random Tree-100 and Random Tree-9. With Random Trees, the number of variables selected at each split during tree construction is a configuration parameter which by default (in WEKA) is given by log2(f) + 1, where f is the number of features (# variables = 9 for f=350 with the McAfee-350 dataset). The same configuration is used in the Drebin-215 and Malgenome-215 experiments for consistency. Thus, selecting 100 and 9 variables for Random Tree-100 and Random Tree-9 respectively results in two different base classifier models. Random Tree, REPTree, J48 and Voted Perceptron were selected as example base classifiers (out of 12 base classifiers) because of their combined accuracy and training time performance as determined from preliminary investigation; a different set of learning algorithms can be used with DroidFusion, since it is designed to be general-purpose and not specific to a particular type of machine learning algorithm.

A. Performance of DroidFusion with the Malgenome-215 dataset

In order to evaluate DroidFusion on the Malgenome-215 dataset, we split the dataset into two parts, one for testing and another for training-validation. The ratio was training-validation: 80%, testing: 20%. The stratified 10-fold cross validation approach was used to construct the DroidFusion model using the training-validation set. Table III shows the per-class accuracies of each of the 5 base classifiers resulting from 10-fold cross-validation on the training-validation set. The subsequent rankings determined from AAB, CDB, RAPC and RACD are also presented. Each of the algorithms induced a different set of rankings from the base classifiers' accuracies. After applying Eq. (9) to the instances in the training-validation set and computing the accuracies with Eqs. (10)-(12), we obtained the performances of the pairwise combinations of the level-2 algorithms as shown in Table IV.

TABLE III: Malgenome-215 train-validation set results and Level-2 algorithm based rankings for the base classifiers (5 = highest rank, 1 = lowest).

    Classifier        TPR    TNR    AAB  CDB  RAPC  RACD
    J48               0.975  0.983  4    4    5     5
    REPTree           0.961  0.974  1    2    1     1
    Random Tree-100   0.972  0.982  3    3    3     2
    Random Tree-9     0.966  0.973  2    5    1     4
    Voted Perceptron  0.971  0.991  5    1    4     2

TABLE IV: Malgenome-215 train-validation set Level-2 combination schemes intermediate results.

    Combination  PrecM  RecalM  PrecB  RecalB  W-FM
    AAB+CDB      0.980  0.985   0.993  0.990   0.9883
    AAB+RAPC     0.984  0.984   0.992  0.992   0.9893
    AAB+RACD     0.982  0.984   0.992  0.991   0.9887
    CDB+RAPC     0.982  0.984   0.992  0.991   0.9887
    CDB+RACD     0.976  0.983   0.992  0.988   0.9864
    RAPC+RACD    0.982  0.984   0.992  0.991   0.9887

The results in Table IV clearly depict the overall performance improvement achieved by the level-2 combination schemes over the individual base classifiers. From Table III, J48 has the best malware recall of 0.975, but its recall for the benign class is 0.983. On the other hand, Voted Perceptron had the best recall of 0.991 for the benign class, but its recall for the malware class is 0.971 (on the training-validation set). On the training-validation set, the best combination is AAB+RAPC (i.e. the S1S3 pair), with 0.984 recall for the malware class, 0.992 recall for the benign class, and a weighted F-measure of 0.9893. J48 and Voted Perceptron had weighted F-measures of 0.9804 and 0.9843 respectively. These were below all of the weighted F-measures achieved by the combination schemes shown in Table IV. Hence, these intermediate training-validation set results already show the capability of the DroidFusion approach to produce stronger models from the weaker base classifiers.
TABLE VIII: Drebin-215: comparison of DroidFusion with base classifiers and traditional combination schemes on the test set.

    Classifier                PrecM  RecM   PrecB  RecB   W-FM    T(s)
    J48                       0.972  0.964  0.979  0.984  0.9766  0.03
    REPTree                   0.976  0.951  0.972  0.986  0.9730  0.04
    Random Tree-100           0.975  0.978  0.987  0.985  0.9824  0.04
    Random Tree-9             0.947  0.971  0.983  0.968  0.9692  0.04
    Voted Perceptron          0.969  0.950  0.971  0.982  0.9701  0.37
    Maj. voting               0.983  0.973  0.984  0.990  0.9837  0.32
    Average of Probabilities  0.983  0.973  0.984  0.990  0.9837  0.31
    Maximum Probability       0.908  0.996  0.998  0.941  0.9617  0.33
    MultiScheme               0.984  0.969  0.982  0.984  0.9784  0.05
    DroidFusion               0.981  0.984  0.991  0.989  0.9872  0.38

TABLE IX: McAfee-350 train-validation set results and Level-2 algorithm based rankings for the base classifiers (5 = highest rank, 1 = lowest).

    Classifier        TPR    TNR    AAB  CDB  RAPC  RACD
    J48               0.941  0.973  4    3    4     3
    REPTree           0.928  0.966  2    2    2     2
    Random Tree-100   0.948  0.968  5    5    4     5
    Random Tree-9     0.935  0.962  3    4    2     3
    Voted Perceptron  0.917  0.959  1    1    1     1

TABLE X: McAfee-350 train-validation set Level-2 combination schemes intermediate results.

    Combination  PrecM  RecalM  PrecB  RecalB  W-FM
    AAB+CDB      0.945  0.955   0.972  0.966   0.9618
    AAB+RAPC     0.970  0.956   0.973  0.982   0.9720
    AAB+RACD     0.969  0.956   0.973  0.981   0.9714
    CDB+RAPC     0.969  0.955   0.972  0.981   0.9710
    CDB+RACD     0.966  0.972   0.984  0.980   0.9771
    RAPC+RACD    0.970  0.957   0.974  0.982   0.9724

TABLE XI: McAfee-350: comparison of DroidFusion with base classifiers and traditional combination schemes on the test set.

    Classifier                PrecM  RecM   PrecB  RecB   W-FM    T(s)
    J48                       0.967  0.950  0.969  0.980  0.9685  0.11
    REPTree                   0.942  0.943  0.965  0.964  0.9560  0.11
    Random Tree-100           0.954  0.951  0.970  0.972  0.9640  0.11
    Random Tree-9             0.952  0.936  0.961  0.971  0.9576  0.12
    Voted Perceptron          0.928  0.917  0.949  0.956  0.9411  6.76
    Maj. voting               0.980  0.964  0.978  0.988  0.9788  6.76
    Average of Probabilities  0.980  0.964  0.978  0.988  0.9788  7.01
    Maximum Probability       0.874  0.990  0.993  0.912  0.9423  6.54
    MultiScheme               0.967  0.950  0.969  0.980  0.9685  0.12
    DroidFusion               0.980  0.964  0.978  0.988  0.9788  7.02

TABLE XII: McAfee-100 train-validation set results and Level-2 algorithm based rankings for the (ensemble) base classifiers (5 = highest rank, 1 = lowest).

    Classifier                  TPR    TNR    AAB  CDB  RAPC  RACD
    Random Forest               0.959  0.979  4    4    2     4
    Random Sub. (REPTree)       0.923  0.984  1    1    2     1
    AdaBoost (Random Tree)      0.949  0.979  2    3    1     2
    Random Sub. (Random Tree)   0.951  0.985  5    2    4     2
    Random Comm. (Random Tree)  0.961  0.980  3    5    4     5

TABLE XIII: McAfee-100 train-validation set Level-2 combination schemes intermediate results.

    Combination  PrecM  RecalM  PrecB  RecalB  W-FM
    AAB+CDB      0.980  0.955   0.973  0.988   0.9753
    AAB+RAPC     0.982  0.954   0.972  0.989   0.9756
    AAB+RACD     0.980  0.955   0.973  0.988   0.9753
    CDB+RAPC     0.980  0.955   0.973  0.988   0.9753
    CDB+RACD     0.977  0.957   0.974  0.986   0.9749
    RAPC+RACD    0.980  0.955   0.973  0.988   0.9753
TABLE XVI: Analysis of app processing time.

    Task                       Lowest (s)  Highest (s)  Average (s)
    Unzipping and disassembly  0.392       1.18         0.739
    Manifest analysis          0.0013      0.0088       0.0048
    Code analysis              3.428       15.47        6.4
    Total                                               7.145

seconds (for 759 instances), 0.38 seconds (for 1503 instances), 7.02 seconds (for 3618 instances), and 0.22 seconds (for 3618 instances) in the four sets of experimental results presented earlier. These figures clearly illustrate the scalability of the static feature based solution, with only an average of just over 7 seconds required to process an app and classify it using a trained DroidFusion model. Thus, it is feasible in practice to deploy the system for scenarios requiring large scale vetting of apps.

Note that although our study is based on specific static features, classifiers trained from other types of features can also be combined using DroidFusion. Basically, DroidFusion is agnostic to the feature engineering process.

G. Limitations of DroidFusion

Although the proposed general-purpose DroidFusion approach has been demonstrated empirically to enable improved accuracy performance by classifier fusion, there is scope for further improvement. The current DroidFusion design is aimed at binary classification. Future work could investigate extending the algorithms in the DroidFusion framework to handle multi-class problems.

VI. CONCLUSION

In this paper, we proposed a novel general purpose multilevel classifier fusion approach (DroidFusion) for Android malware detection. The DroidFusion framework is based on four proposed ranking-based algorithms that enable higher-level fusion using a computational approach rather than the traditional meta-classifier training that is used, for example, in Stacked Generalization. We empirically evaluated DroidFusion using four separate datasets. The results presented demonstrate its effectiveness for improving performance using both non-ensemble and ensemble base classifiers. Furthermore, we showed that our proposed approach can outperform Stacked Generalization whilst utilizing only computational processes for model building rather than training a meta-classifier at the higher level.

ACKNOWLEDGMENT

This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/N508664/1, Centre for Secure Information Technologies (CSIT-2).

REFERENCES

[1] Smartphone OS market share worldwide 2009-2015, Statista, Hamburg, Germany, 2017. [Online]. Available: https://www.statista.com/statistics/263453/global-market-share-held-by-smartphone-operating-systems
[2] McAfee Labs. McAfee Labs Threat Predictions Report. March 2016.
[3] Y. Zhou and X. Jiang, "Dissecting Android malware: Characterization and evolution," In Proc. 2012 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 20-23 May 2012, pp. 95-109.
[4] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, "Drebin: Efficient and explainable detection of Android malware in your pocket," In Proc. 20th Annual Network & Distributed System Security Symposium (NDSS), San Diego, CA, USA, 23-26 Feb. 2014.
[5] A. Apvrille and R. Nigam. Obfuscation in Android malware, and how to fight back. Virus Bulletin, July 2014. Available: https://www.virusbulletin.com/virusbulletin/2014/07/obfuscation-android-malware-and-how-fight-back [Accessed Sept. 2017]
[6] Y. Jing, Z. Zhao, G.-J. Ahn, and H. Hu, "Morpheus: Automatically generating heuristics to detect Android emulators," In Proc. 30th Annual Computer Security Applications Conference (ACSAC 2014), New Orleans, Louisiana, USA, Dec. 8-12, 2014, pp. 216-225.
[7] T. Vidas and N. Christin, "Evading Android runtime analysis via sandbox detection," In Proc. 9th ACM Symposium on Information, Computer and Communications Security, Kyoto, Japan, June 4-6, 2014, pp. 447-458.
[8] T. Petsas, G. Voyatzis, E. Athanasopoulos, M. Polychronakis, and S. Ioannidis, "Rage against the virtual machine: Hindering dynamic analysis of Android malware," In Proc. 7th European Workshop on System Security (EuroSec '14), Amsterdam, Netherlands, April 13, 2014.
[9] F. Matenaar and P. Schulz. Detecting Android sandboxes. http://dexlabs.org/blog/btdetect, August 2012. [Accessed: Sept. 2017]
[10] S. R. Choudhary, A. Gorla, and A. Orso, "Automated test input generation for Android: Are we there yet?" In Proc. 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015), Nov. 9-13, 2015, pp. 429-440.
[11] W. Dong-Jie, M. Ching-Hao, W. Te-En, L. Hahn-Ming, and W. Kuo-Ping, "DroidMat: Android malware detection through manifest and API calls tracing," In Proc. Seventh Asia Joint Conference on Information Security (Asia JCIS), 2012, pp. 62-69.
[12] S. Y. Yerima, S. Sezer, and I. Muttik, "Android malware detection: An eigenspace analysis approach," In Proc. Science and Information Conference (SAI 2015), London, UK, 28-30 July 2015, pp. 1236-1242.
[13] S. Y. Yerima, S. Sezer, and I. Muttik, "Android malware detection using parallel machine learning classifiers," In Proc. 8th Int. Conf. on Next Generation Mobile Apps, Services and Technologies (NGMAST 2014), Oxford, UK, Sept. 10-12, 2014, pp. 37-42.
[14] S. Y. Yerima, S. Sezer, and I. Muttik. High accuracy Android malware detection using ensemble learning. IET Information Security, Vol. 9, Issue 6, 2015, pp. 313-320.
[15] M. Varsha, P. Vinod, and K. Dhanya. Identification of malicious Android app using manifest and opcode features. Journal of Computer Virology and Hacking Techniques, 2016, pp. 1-14.
[16] A. Sharma and S. Dash, "Mining API calls and permissions for Android malware detection," in Cryptology and Network Security, Springer International Publishing, 2014, pp. 191-205.
[17] P. P. K. Chan and W.-K. Song, "Static detection of Android malware by using permissions and API calls," In Proc. 2014 International Conference on Machine Learning and Cybernetics, Lanzhou, July 13-16, 2014.
[18] W. Wang, X. Wang, D. Feng, J. Liu, Z. Han, and X. Zhang. Exploring permission-induced risk in Android applications for malicious application detection. IEEE Transactions on Information Forensics and Security, Vol. 9, No. 11, Nov. 2014, pp. 1869-1882.
[19] M. Fan, J. Liu, W. Wang, H. Li, Z. Tian, and T. Liu. DAPASA: Detecting Android piggybacked apps through sensitive subgraph analysis. IEEE Transactions on Information Forensics and Security, Vol. 12, Issue 8, March 2016, pp. 1772-1785.
[20] L. Cen, C. S. Gates, L. Si, and N. Li. A probabilistic discriminative model for Android malware detection with decompiled source code. IEEE Transactions on Dependable and Secure Computing, Vol. 12, No. 4, July/August 2015.
[21] Westyarian, Y. Rosmansyah, and B. Dabarsyan, "Malware detection on Android smartphones using API class and machine learning," 2015 International Conference on Electrical Engineering and Informatics (ICEEI 2015), 10-11 Aug. 2015.
[22] F. Idrees and M. Rajarajan, "Investigating the Android intents and permissions for malware detection," In Proc. 10th IEEE Int. Conf. on Wireless and Mobile Computing, Networking and Communications (WiMob), Oct. 2014, pp. 354-358.
[23] B. Kang, S. Y. Yerima, S. Sezer, and K. McLaughlin. N-gram opcode analysis for Android malware detection. International Journal of Cyber Situational Awareness, Vol. 1, No. 1, Nov. 2016.
[24] M. Zhao, F. Ge, T. Zhang, and Z. Yuan, "AntiMalDroid: An efficient SVM-based malware detection framework for Android," In C. Liu, J. Chang, and A. Yang, editors, ICICA (1), Volume 243 of Communications in Computer and Information Science, Springer, 2011, pp. 158-166.
[25] W.-C. Wu and S.-H. Hung, "DroidDolphin: A dynamic Android malware detection framework using big data and machine learning," In Proc. 2014 ACM Conf. on Research in Adaptive and Convergent Systems (RACS '14), NY, USA, pp. 247-252.
[26] V. M. Afonso, M. F. de Amorim, A. R. A. Gregio, G. B. Junquera, and P. L. de Geus. Identifying Android malware using dynamically obtained features. Journal of Computer Virology and Hacking Techniques, 2014.
[27] M. K. Alzaylaee, S. Y. Yerima, and S. Sezer, "EMULATOR vs REAL PHONE: Android malware detection using machine learning," 3rd ACM Int. Workshop on Security and Privacy Analytics (IWSPA '17), co-located with ACM CODASPY 2017, Scottsdale, AZ, USA, March 2017.
[28] M. Lindorfer, M. Neugschwandtner, and C. Platzer, "MARVIN: Efficient and comprehensive mobile app classification through static and dynamic analysis," In Proc. IEEE 39th Annual Computer Software and Applications Conference (COMPSAC), pp. 422-433.
[29] D. Gaikwad and R. Thool, "DAREnsemble: Decision tree and rule learner based ensemble for network intrusion detection system," In Proc. 1st Int. Conf. on Information and Communication Technology for Intelligent Systems, Springer, 2016, pp. 185-193.
[30] A. Balon-Perlin and B. Gamback. Ensemble of decision trees for network intrusion detection. International Journal on Advances in Security, Vol. 6, No. 1 and 2, 2013.
[31] M. Panda and M. R. Patra, "Ensembling rule based classifiers for detecting network intrusions," Int. Conf. on Advances in Recent Technologies in Communication and Computing, 2009, IEEE, DOI 10.1109/ARTCom.2009.121.
[32] A. Zainal, M. A. Maarof, S. M. Shamsuddin, and A. Abraham, "Ensemble of one-class classifiers for network intrusion detection system," In Proc. Fourth International Conference on Information Assurance and Security, 2008, IEEE, DOI 10.1109/IAS.2008.35.
[33] L. D. Coronado-De-Alba, A. Rodriguez-Mota, and P. J. Escamilla-Ambrosio, "Feature selection and ensemble of classifiers for Android malware detection," In Proc. 8th IEEE Latin-American Conference on Communications (LATINCOM 2016), 15-17 Nov. 2016.
[34] M. K. Alzaylaee, S. Y. Yerima, and S. Sezer, "Improving dynamic analysis of Android apps using hybrid input test generation," In Proc. Int. Conf. on Cyber Security and Protection of Digital Services (Cyber Security 2017), London, UK, June 19-20, 2017.
[35] Y. Aafer, W. Du, and H. Yin, "DroidAPIMiner: Mining API-level features for robust malware detection in Android," In Proc. 9th Int. Conference on Security and Privacy in Communication Networks (SecureComm 2013), Sydney, Australia, Sep. 25-27, 2013.
[36] T. Book, A. Pridgen, and D. S. Wallach, "Longitudinal analysis of Android ad library permissions," In Proc. Mobile Security Technologies Conference (MoST '13), San Francisco, CA, May 2013.
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. ACM SIGKDD Explorations, Vol. 11, No. 1, June 2009, pp. 10-18.
[38] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Edition, John Wiley & Sons, Inc., Hoboken, New Jersey, 2006, pp. 41.
[39] L. Breiman. Random forests. Machine Learning, 45, 2001, pp. 5-32.
[40] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," In Proc. 13th Int. Conf. on Machine Learning, San Francisco, 1996, pp. 148-156.
[41] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, 1998, pp. 832-844.
[42] T. Ho, "Random decision forests," In Proc. of the 3rd Int. Conf. on Document Analysis and Recognition, 1995, pp. 278-282.
[43] D. H. Wolpert. Stacked generalization. Neural Networks, 1992, pp. 241-259.
[44] K. M. Ting and I. H. Witten. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10, May 1999, pp. 271-289.
[45] T. Ban, T. Takahashi, and S. Guo, "Integration of multi-modal features for Android malware detection using linear SVM," In Proc. 11th Asia Joint Conference on Information Security, 2016.
[46] Z. Ni, M. Yang, Z. Ling, J. N. Wu, and J. Luo, "Real-time detection of malicious behavior in Android apps," In Proc. Int. Conf. on Advanced Cloud and Big Data (CBD), Chengdu, 2016, pp. 221-227.
[47] Z. Wang, J. Chai, S. Chen, and W. Li, "DroidDeepLearner: Identifying Android malware using deep learning," IEEE 37th Sarnoff Symposium, Newark, NJ, 2016, pp. 160-165.
[48] S. Wu, P. Wang, X. Li, and Y. Zhang. Effective detection of Android malware based on the usage of data flow APIs and machine learning. Information and Software Technology, Vol. 75, 2016, pp. 17-25, ISSN 0950-5849.
[49] M.-Y. Su, J.-Y. Chang, and K.-T. Fung, "Machine learning on merging static and dynamic features to identify malicious mobile apps," In Proc. 9th Int. Conf. on Ubiquitous and Future Networks (ICUFN 2017), Milan, Italy, 4-7 July 2017, pp. 863-867.
[50] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, "Machine learning aided Android malware classification," Computers & Electrical Engineering, Volume 61, July 2017, pp. 266-274.
[51] W. Wang, Y. Li, X. Wang, J. Liu, and X. Zhang, "Detecting Android malicious apps and categorizing benign apps with ensemble classifiers," Future Generation Computer Systems, 2017, ISSN 0167-739X.
[52] X. Wang, W. Wang, Y. He, J. Liu, Z. Han, and X. Zhang, "Characterizing Android apps' behaviour for effective detection of malapps at large scale," Future Generation Computer Systems, Volume 75, Oct. 2017, pp. 30-45.
[53] A. Mahindru and P. Singh, "Dynamic permissions based Android malware detection using machine learning techniques," In Proc. 10th Innovations in Software Engineering Conference, Jaipur, India, Feb. 5-7, 2017, pp. 202-210.
[54] M. Yang, S. Wang, Z. Ling, Y. Liu, and Z. Ni. Detection of malicious behaviour in Android apps through API calls and permission uses analysis. Concurrency and Computation: Practice and Experience, 2017, e4172. https://doi.org/10.1002/cpe.4172
[55] F. Idrees, M. Rajarajan, M. Conti, T. M. Chen, and Y. Rahulamathavan. PIndroid: A novel Android malware detection system using ensemble learning methods. Computers & Security, Vol. 68, July 2017, pp. 36-46.

Suleiman Y. Yerima (M'04) received the B.Eng. degree (first class) in electrical and computer engineering from the Federal University of Technology, Minna, Nigeria, the M.Sc. degree (with distinction) in personal, mobile and satellite communications from the University of Bradford, Bradford, U.K., and the Ph.D. degree in mobile computing and communications from the University of South Wales, Pontypridd, U.K. (formerly, the University of Glamorgan) in 2009.
He is currently a Senior Lecturer of Cyber Security in the Faculty of Technology at De Montfort University, Leicester, United Kingdom. He was previously a Research Fellow at the Centre for Secure Information Technologies (CSIT), Queen's University Belfast, UK, where he led the mobile security research theme from 2012 until 2017. He was a member of the Mobile Computing, Communications and Networking (MoCoNet) Research Group at Glamorgan from 2005 to 2009. From 2010 to 2012, he was with the UK-India Advanced Technology Centre of Excellence in Next Generation Networks, Systems and Services (IU-ATC), University of Ulster, Coleraine, Northern Ireland.
Dr. Yerima is a member of the IAENG and (ISC)2 professional societies. He is a Certified Information Systems Security Professional (CISSP) and a Certified Ethical Hacker (CEH). He was the recipient of the 2017 IET Information Security Premium (best paper) award.

Sakir Sezer (M'00) received the Dipl.-Ing. degree in electrical and electronic engineering from RWTH Aachen University, Germany, and the Ph.D. degree in 1999 from Queen's University Belfast, U.K. Prof. Sezer is currently Secure Digital Systems Research Director and Head of Network Security Research in the School of Electronics, Electrical Engineering and Computer Science at Queen's University Belfast. His research is leading major (patented) advances in the field of high-performance content processing and is currently commercialized by Titan IC Systems. He has co-authored over 120 conference and journal papers in the areas of high-performance networks, content processing, and System on Chip. Prof. Sezer has been awarded a number of prestigious awards including InvestNI, Enterprise Ireland and InterTrade Ireland innovation and enterprise awards, and the InvestNI Enterprise Fellowship. He is also co-founder and CTO of Titan IC Systems and a member of the IEEE International System-on-Chip Conference executive committee.