PlausMal-GAN Plausible Malware Training Based On Generative Adversarial Networks For Analogous Zero-Day Malware Detection
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2022.3170544, IEEE
Transactions on Emerging Topics in Computing
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, VOL. XX, NO. XX, XX 2022 1
Abstract—Zero-day malicious software (malware) refers to malware that exploits a previously unknown or newly discovered software
vulnerability. The fundamental objective of this paper is to enhance detection of analogous zero-day malware by learning efficiently
from plausible generated data. To detect zero-day malware, we propose a malware training framework based on analogous malware
data generated using generative adversarial networks (PlausMal-GAN). The PlausMal-GAN can produce analogous zero-day malware
images with high quality and high diversity from existing malware data. The discriminator, as a detector, learns various malware
features using both real and generated malware images. In terms of performance, the proposed framework showed higher and more
stable performance on the analogous zero-day malware images, which can be assumed to be analogous zero-day malware data. We
obtained reliable accuracy in the proposed PlausMal-GAN framework with representative GAN models (i.e., deep convolutional GAN,
least-squares GAN, Wasserstein GAN with gradient penalty, and evolutionary GAN). These results indicate that the proposed
framework is beneficial for the detection and prediction of numerous analogous zero-day malware data derived from known malware
when developing and updating malware detection systems.
Index Terms—Zero-day Malware, Analogous Malware Detection, Malware Augmentation, Malware Data, Generative Adversarial
Networks
1 INTRODUCTION
malware data and the generated malware data by the fixed generator. Ideally, the proposed framework can be applied to any kind of GAN model, so we evaluated its performance by applying the latest and representative GAN models. Moreover, we obtained stable performance on abundant analogous zero-day malware test data under conditions with relatively few training data.

2 BACKGROUND
2.1 Malware Detection
Owing to the increasing damage caused by malware and zero-day malware, research on malware detection methods has been continuously improving. We discuss two aspects of malware detection: malware detection and zero-day malware detection.
Several reported studies have dealt with malware detection [10], [13]–[17]. Nataraj et al. presented a visualization approach that differs from traditional approaches for malware detection [10], where they transformed the malware’s binary information into grayscale malware images. Ye et al. and Ndibanje et al. used Windows Audit Log and API Call for malware detection [18], [19]. Traditional machine learning algorithms such as hidden Markov models, support vector machines (SVMs), and random forests were also used for malware detection [20]–[23]. Singh et al. proposed a big data analysis framework based on random forests for malware detection [24]. Chen et al. attempted to detect malware by analyzing mobile network traffic with machine-learning methods [25]. Recently, many methods have used deep learning and generative adversarial networks (GANs) because the available computing power has increased [11], [12], [26]–[31]. Pascanu et al. used recurrent neural networks for time-series information in malware classification [26], [32]. Ye et al. presented a hetero-

Recently, some methods have been developed for zero-day malware detection [13], [14], [35], [36]. Venkatraman and Alazab used a similarity matrix of malware for visualization in order to detect zero-day malware [14]. This method can be used to visually observe that different malware families exhibit significantly different behavior patterns. Gupta and Rani proposed a big data framework to address the big data problem caused by the increase in malware [35]. They also attempted to detect zero-day malware using big data analysis techniques and machine-learning algorithms.
This method modeled a series of opcodes to detect zero-day malware. Owing to the increasing threat of malware in cyber-physical systems, Huda et al. proposed a detection method that uses methods such as SVM and K-means to detect unknown malware by extracting knowledge and essential structures from already available, cheap, unlabeled data [36]. In the aforementioned zero-day malware detection methods, certain rules are fixed, and zero-day malware that does not follow these rules cannot be detected. Recently, Kim et al. proposed the transferred deep-convolutional generative adversarial network (tDCGAN), which generates fake malware and learns to distinguish it from real malware [13]. This method not only obtained enhanced performance in malware detection but also showed promise in a zero-day attack experiment. However, because the method does not consider high diversity (e.g., plausible diversity) or quality in the generated zero-day malware, and neither was measured numerically (e.g., with the Fréchet inception distance), it is difficult to regard it as focused on zero-day malware detection. In contrast, we implemented an analogous zero-day malware classifier with GAN models that create new high-diversity and high-quality malware images for plausible malware augmentation. The generated data is used to create a robust detector for zero-day malware detection.
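The byte-to-image visualization attributed to Nataraj et al. in Section 2.1 can be illustrated with a minimal sketch. This is not the authors' preprocessing: the function name `bytes_to_grayscale`, the fixed row width of 64 pixels, and the zero padding of the last row are assumptions made only for this example.

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 64) -> np.ndarray:
    """Interpret each byte as one pixel intensity (0-255) and stack rows of `width` pixels.

    The row width and the zero padding of the final row are illustrative
    assumptions, not values taken from the paper.
    """
    buf = np.frombuffer(data, dtype=np.uint8)
    height = -(-len(buf) // width)            # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(buf)] = buf                   # copy the bytes, leave the tail as zeros
    return padded.reshape(height, width)

# Usage: a 10-byte "binary" rendered as a 3 x 4 grayscale image.
image = bytes_to_grayscale(bytes(range(10)), width=4)
print(image.shape)
```

In this sketch the image height simply follows from the file size, so binaries of different lengths yield images of different heights; a fixed-size GAN input would need an additional resizing step that is not shown here.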
[Figure 3 diagram, panel (a). Labels recoverable from the figure: Noise z, Class c, Generator G_θ, Real samples, Real or Fake, and Malware classes (Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY, Gatak).]
Fig. 3: The proposed PlausMal-GAN framework consists of two phases. (a) Training the generator and discriminator
based on a GAN with a malware classifier. (b) Training the discriminator as a zero-day malware detector using plausible
malware augmentation. For an intuitive explanation, the framework is shown using evolutionary GAN, which is one of the
representative GANs.
We also considered the standard GAN approach (minimax), the least-squares approach, the heuristic approach, and a combination of the preceding three approaches for the DCGAN, LSGAN, WGAN-GP, and E-GAN models in the proposed framework, respectively. In E-GAN, we considered an evolutionary step consisting of three sub-steps: variation, evaluation, and selection. In the variation step, we adopt as mutations three interpretable and complementary objectives proposed by Wang et al. [40]. As shown in Figure 4, the three objective functions are the minimax mutation, the heuristic mutation, and the least-squares mutation. In addition, we added a classification loss function to the existing mutation functions, because the generated data must not only be close to real data but also correspond to the given class. The minimax mutation is similar to the minimax objective function of the original GAN, which aims to minimize the log probability that the discriminator is correct. In the original GAN, gradient vanishing can occur when the discriminator produces a result close to zero (i.e., D(x̂) → 0). In other words, if the discriminator is confident that the generated malware data is fake malware data, the generator may not train well. However, we have been able to solve this problem to some extent by adding a classification loss. In contrast to its gentle early gradients, if the generated malware distribution is somewhat similar to the real malware distribution, the minimax mutation provides a steep gradient, which later allows stable learning.

$$\mathcal{M}_G^{\text{minimax}} = \mathbb{E}_{\hat{x} \sim p_{\text{gen}}}\left[\log\left(1 - D(\hat{x})\right) - \log p(c \mid \hat{x})\right]. \tag{3}$$

Rather than minimizing the log probability that the discriminator is correct, the heuristic mutation maximizes the log probability that the discriminator is mistaken. Using this mutation, the gradient is steep even though the discriminator is convinced that the generated malware data is fake. Thus, the heuristic mutation can avoid a vanishing gradient, unlike the minimax mutation, which suggests the possibility
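The contrast between the minimax and heuristic mutations can be made concrete with a small numerical sketch. The minimax function follows Eq. (3); the heuristic form used here, E[−log D(x̂) − log p(c|x̂)], is an assumption built by analogy with Eq. (3) and the description by Wang et al. [40], since its equation is not shown in this excerpt. The gradient helpers further assume a sigmoid-output discriminator, D = σ(a), and all function names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # numerical guard against log(0); an implementation detail of this sketch

def minimax_mutation(d_fake, p_class):
    """Eq. (3): mean of log(1 - D(x_hat)) - log p(c | x_hat) over generated samples."""
    d_fake, p_class = np.asarray(d_fake, float), np.asarray(p_class, float)
    return float(np.mean(np.log(1.0 - d_fake + EPS) - np.log(p_class + EPS)))

def heuristic_mutation(d_fake, p_class):
    """Assumed analogue of Eq. (3): mean of -log D(x_hat) - log p(c | x_hat)."""
    d_fake, p_class = np.asarray(d_fake, float), np.asarray(p_class, float)
    return float(np.mean(-np.log(d_fake + EPS) - np.log(p_class + EPS)))

def minimax_grad_wrt_logit(d):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a); shrinks to 0 as D -> 0."""
    return -np.asarray(d, float)

def heuristic_grad_wrt_logit(d):
    """d/da (-log sigmoid(a)) = -(1 - sigmoid(a)); stays near -1 as D -> 0."""
    return -(1.0 - np.asarray(d, float))

# Usage: a discriminator that is confident the samples are fake (D close to 0).
d = np.array([0.01, 0.02])
print(minimax_grad_wrt_logit(d), heuristic_grad_wrt_logit(d))
```

With D(x̂) near zero, the adversarial part of the minimax gradient is close to zero while the heuristic gradient stays close to −1, which is the vanishing-gradient behavior described in the text. The classification term −log p(c|x̂) contributes its own gradient in both cases and is not differentiated in this sketch.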
Fig. 5: Examples of (a) real malware images and (b) generated malware images in the proposed framework.
Model      DCGAN    LSGAN    WGAN-GP   E-GAN (r = 0.1)   E-GAN (r = 0.5)
Accuracy
FID        220.16   190.70   206.23    146.39            127.96
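For readers unfamiliar with the FID values reported above, the following is a minimal sketch of the Fréchet distance between two sets of feature vectors, ||μ_a − μ_b||² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). It is not the evaluation code used in the paper: the function name is hypothetical, the features are assumed to be precomputed (FID conventionally uses Inception network activations, a step omitted here), and the matrix square root trace is obtained from the eigenvalues of Σ_a Σ_b.

```python
import numpy as np

def frechet_distance(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n_samples, n_features) arrays.

    Assumes n_features >= 2 and precomputed features; the feature extractor
    itself is outside the scope of this sketch.
    """
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Usage with synthetic features: shifting every feature by a constant changes
# only the mean term of the distance.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
print(frechet_distance(real, real + 3.0))
```

A lower value indicates that the generated feature distribution is closer to the real one, which is how the FID row is conventionally read.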