INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 01, JANUARY 2020 ISSN 2277-8616
A Survey On Machine Learning For Cyber
Security
A. Lakshmanarao, M. Shashi
Abstract: Cyber crime is proliferating everywhere, exploiting every kind of vulnerability in the computing environment. Ethical hackers pay more attention
to assessing vulnerabilities and recommending mitigation methodologies. The development of effective techniques has been an urgent demand in
the cybersecurity community. Machine learning for cybersecurity has become an issue of great importance recently due to the effectiveness
of machine learning and deep learning in cybersecurity issues. Machine learning techniques have been applied to major challenges in cybersecurity
such as intrusion detection, malware classification and detection, spam detection and phishing detection. Although machine learning cannot automate
a complete cybersecurity system, it helps to identify cybersecurity threats more efficiently than other software-oriented methodologies, and thus reduces
the burden on security analysts. The ever-evolving nature of cyber threats continuously challenges researchers to explore the ideal
combination of deep expertise in cybersecurity and in data science. In this paper, we present the current state-of-the-art machine learning applications and
their potential for cybersecurity. An analysis of machine learning algorithms for the most common types of cybersecurity threats is presented.
Index Terms: Cybersecurity, Malware detection, Machine learning, Deep learning.
—————————— ——————————
A. Lakshmanarao, Assistant Professor, Department of CSE, Raghu Engineering College, Dakamarri, Visakhapatnam, A.P., India, Email: laxman1216@[Link]
M. Shashi, Professor, Department of CS & SE, College of Engineering, Andhra University, Visakhapatnam, Andhra Pradesh, India, Email: smogalla2000@[Link]

1 INTRODUCTION
Since the invention of internet technology, cyberspace has emerged as a central hub for the creation of cyberattacks. Advances in technology further help hackers to discover vulnerabilities and to create viruses and malware, continuously challenging the cybersecurity industry. Cybersecurity involves providing a secure computing and communication environment, with technologies and procedures intended to protect computers, networks, programs, and information from attack, unauthorized access, modification, or destruction. These frameworks are composed of network security and host security systems with firewalls, anti-virus software, intrusion detection systems, etc. Machine learning has proven capable of solving the most common problems in different domains such as image processing, health informatics applications, physical sciences, computational biology, robotics, financial prediction, audio processing, medical diagnostics, video processing, and text processing [1]. Specifically, machine learning techniques are also applied successfully in the field of cybersecurity to develop effective solutions. Machine learning has excellent potential for detecting various types of cyberattacks and has thus become an important tool for defenders. ESET conducted a survey on "usage of machine learning for cybersecurity", in which 80% of the participants believed that machine learning will help their organization to detect and respond faster to threats [9].

2 MACHINE LEARNING TECHNIQUES:

2.1 Regression:
In regression, the value of a dependent feature is estimated from the values of the independent features by learning from existing data about past events, and this knowledge is used to handle new events. In cybersecurity, fraud detection can be approached with regression: once a model is learnt from a database of past transactions, it flags fraudulent transactions based on the observed features of current transactions. Machine learning provides linear regression, polynomial regression, support vector machines, decision trees, random forests, and other methods for regression analysis. Venkatesh Jaganathan et al. [2] applied multiple regression techniques for predicting the impact of attacks. They took the overall CVSS (Common Vulnerability Scoring System) score as the dependent variable, with two independent variables: X1 (number of vulnerabilities) and X2 (average input network traffic). Daria Lavrova et al. [3] proposed a multiple regression model for the detection of security incidents in the IoT. With this technique, they were able to find known and unknown attacks.

2.2 Classification:
Classification is another extensively used supervised machine learning task. In cybersecurity, spam detection is successfully implemented by ML-based classifiers, which discriminate a given email message as spam or not spam. Such spam filter models are able to separate spam messages from non-spam messages. Machine learning techniques for classification include Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Naïve Bayes, Decision Tree, and Random Forest. With the availability of large collections of labeled past data, deep learning classification models that use Restricted Boltzmann Machines (RBM), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Long Short-Term Memory (LSTM) cells for feature extraction, followed by a densely connected neural network, have become more efficient in solving complex tasks. The applicability of the above supervised machine learning techniques depends on the availability of a large collection of labeled data.
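To make the spam-filtering use of classification concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier, one of the techniques listed above. It is not taken from any of the surveyed papers: the class name `NaiveBayesSpamFilter`, the whitespace tokenizer, the Laplace smoothing constant `alpha`, and the six toy training messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the message and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with Laplace smoothing over word counts."""

    LABELS = ("spam", "ham")

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed value)
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, messages, labels):
        """Accumulate per-class word counts from labeled messages."""
        for text, label in zip(messages, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def _log_score(self, tokens, label):
        """Log prior plus the sum of smoothed log likelihoods of known words."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are ignored
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
        return score

    def predict(self, text):
        """Return the label with the higher posterior log score."""
        tokens = tokenize(text)
        return max(self.LABELS, key=lambda label: self._log_score(tokens, label))


# Toy labeled data, invented for this example only.
train_messages = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting agenda for monday",
    "please review the project report",
    "lunch on monday with the team",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

spam_filter = NaiveBayesSpamFilter().fit(train_messages, train_labels)
print(spam_filter.predict("claim your free cash prize"))
print(spam_filter.predict("project meeting on monday"))
```

A real filter would need far more labeled data, which is the dependency on labeled collections that this section points out.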
2.3 Clustering:
Both regression and classification are supervised learning
models, for which labeled data is essential. Clustering is an
unsupervised learning model, which extracts general patterns from the data even when the data is not labeled. Groups of similar events constitute a cluster, as they share common features that define a specific (behavior) pattern. In cybersecurity, clustering can be used for forensic analysis, anomaly detection, malware analysis, etc. K-means, K-Medoids, DBSCAN, Gaussian Mixture Models, and agglomerative clustering are some of the ML clustering techniques used in cybersecurity. Neural network based Self-Organizing Maps (SOMs) can also be used for clustering.

3 CYBER SECURITY ISSUES:
The four major areas where machine learning algorithms play a crucial role are intrusion detection systems, malware analysis, mobile (Android) malware detection, and spam detection.

Figure 1: Cyber security issues

3.1 Intrusion Detection:
Intrusion Detection Systems (IDS) come into the picture whenever secure information is compromised by malicious software or policy violations. Detection of an intrusion can be done in several ways. Broadly, the methods are classified as either signature-based or anomaly-based. In the signature-based approach, all packets are compared with the signatures of known malicious threats. In the anomaly-based approach, network traffic is monitored against an established baseline of normality. Saroj Kr. Biswas [4] showed that machine learning based feature selection techniques play an important role in a good intrusion detection system, applying a combination of feature selection techniques and achieving good results. R. Vinayakumar et al. [5] proposed a scale-hybrid-IDS-AlertNet system that can analyze network and host-level activities. This model was created using Deep Neural Networks (DNNs). They developed a scalable framework based on big data approaches and the Apache Spark cluster computing platform. They conducted different experiments using DNNs with 1000 epochs and learning rates between 0.01 and 0.5 on various publicly available datasets such as CICIDS 2017, KDDCup 99, UNSW-NB15, NSL-KDD, and WSN-DS. They also applied traditional machine learning algorithms as baselines for comparison. Md. Zahangir Alom et al. [6] proposed Deep Belief Networks for intrusion detection and compared their model with the SVM. The features are extracted from the training set using a two-layer Restricted Boltzmann Machine (RBM). The deep belief network based IDS outperformed the SVM model and achieved an accuracy of 97.5%. J. Kim et al. [7] applied a specific type of Recurrent Neural Network, the LSTM model, for training the IDS using the KDD Cup 1999 dataset. They studied the impact of the learning rate and the number of neurons in the hidden layer on the attack Detection Rate. They conducted several experiments with different learning rates and hidden layer sizes and achieved a Detection Rate of 98.88%. Anna L. Buczak et al. [8] stated that data (pcap, NetFlow, or other network data) play a vital role in applying ML/DM approaches to intrusion detection systems. They also noted that there is a large gap in the availability of labeled data. N. Shone et al. [10] proposed a deep learning model for Network Intrusion Detection System operation that combines ML and DL methods. They proposed a non-symmetric deep autoencoder (NDAE) and applied it to the KDD Cup '99 and NSL-KDD datasets.

3.2 Malware Detection:
Malware, short for 'malicious software', is a specific type of cyber threat. It is used for illegal activities such as compromising a system by stealing data, bypassing access control, or causing harm to the host computer. The term malware is broadly used for various types of malicious programs such as viruses, Trojan horses, worms, bugs, adware, bots, rootkits, spyware, ransomware, keyloggers, and backdoors. Each of these malware types contains several families. For example, ransomware can be classified into the Charger, Jisut, Koler, Pletor, RansomBO, Svpeng, and Simplocker families, among others. These malicious programs can be embedded in different formats such as UNIX ELF (Executable and Linkable Format) files and Windows PE files (Portable Executables with .exe, .dll, and .efi extensions). Document-based malware programs can be embedded in .doc, .pdf, and .rtf files. Malware can also take the form of extensions and plugins for popular software platforms such as web browsers and web frameworks. Dolly Uppal et al. [11] proposed a malware classification and detection system based on the n-gram method. They applied a pre-modeled program for tracking the execution of the samples and captured the API calls. After generating the feature vector, they applied different machine learning algorithms and achieved the best results with the SVM. Mozammel Chowdhury et al. [12] proposed a neural network based approach for malware detection. They extracted features from PE headers using the n-gram method, conducted experiments with the extended set of features, and achieved 97% accuracy with an ANN. Bowen Sun et al. [13] proposed a malware classification model using static features from different perspectives. They extracted static features from three perspectives: PE features, bytecode features, and assembler code features. They compared the performance of eight classifiers, among which the best classifier achieved an F1-score of 93.56%. Mahmoud Kalash et al. [14] proposed a CNN for malware classification. They converted the binaries of 25 malware families to grayscale images and applied a CNN for classification. They conducted experiments with two well-known datasets, 'Malimg' and 'Microsoft malware', and reported accuracies of 98.52% and 99.97% on the two datasets respectively.
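Two of the feature representations surveyed for malware detection, n-grams over captured API calls and grayscale images built from raw bytes, can be sketched in a few lines. The sketch below is illustrative only and does not reproduce the pipelines of [11] or [14]: the function names `api_call_ngrams` and `bytes_to_grayscale`, the example call trace, and the fixed image width are assumptions.

```python
from collections import Counter


def api_call_ngrams(calls, n=2):
    """Count sliding-window n-grams over an ordered list of API call names.

    The resulting counts can serve as a sparse feature vector for a classifier.
    """
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def bytes_to_grayscale(data, width=8):
    """Lay raw bytes out as rows of `width` pixels with intensities 0-255.

    The last row is zero-padded so that every row has the same length.
    """
    padding = (-len(data)) % width
    padded = bytes(data) + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Hypothetical API call trace; the sequence is invented for this example.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
bigrams = api_call_ngrams(trace, n=2)

# First bytes of a made-up binary, arranged as a 4-pixel-wide image.
image = bytes_to_grayscale(b"MZ\x90\x00\x03", width=4)

print(bigrams.most_common(1))
print(image)
```

In practice the n-gram counts would be fed to a classifier such as an SVM, and the pixel rows would be resized to a fixed shape before being passed to a CNN.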
3.3 Android Malware Detection:
Android is the most widely used mobile platform and is hence highly targeted by mobile malware creators. As the number of Android malware types increases day by day, it has become more and more challenging to detect and classify mobile malware variants. Researchers have made a large number of attempts at mobile malware detection. DroidMat [15] applied k-means clustering and K-NN algorithms on static features from Android apps. Arp et al. [16], Varsha et al. [17], and Sharma and Dash [18] extracted static features from Android apps and achieved good results by applying machine learning algorithms such as SVM, Random Forest, K-NN, Naive Bayes, and Decision Trees. AntiMalDroid [19] and Droid Dolphin [20] applied Support Vector Machines on dynamic features extracted from malware apps (logged behavior sequences as features) and achieved good accuracy. Suleiman Y. Yerima et al. [21] proposed a multilevel classifier fusion method for Android malware detection. They proposed four ranking-based algorithms based on accuracy, recall, and precision rates. Based on their ranking algorithms, they combined four classifiers to achieve a better detection rate. They evaluated their model's performance on three datasets and achieved a good recall rate.

3.4 Spam Detection:
Spam detection is also one of the major challenges in cybersecurity. Spam is unsolicited bulk messaging, generally used for advertising. The term usually refers to email spam, but spam can also be a message on social networking sites and blogging platforms. Spam messages waste a lot of valuable time. Sometimes, users get spam emails disguised as authentic messages from a bank in order to trap them. Responding to such spam messages may lead to heavy financial losses. Machine learning techniques have been applied by many researchers for spam detection. Muhammad N. Marsono et al. [22] applied the Naïve Bayes classification technique for identification of spam messages among incoming email and achieved good results. James Clark et al. [23] applied the K-NN model to the automated email classification problem. S. Jancy Sickory Daisy [24] proposed a hybrid spam detection system based on Naive Bayes classification and the Markov Random Field method. They evaluated their model based on its accuracy and time consumption and claimed that the performance of the hybrid approach is better than that of the baseline methods. Sreekanth Madisetty et al. [25] proposed an ensemble model for spam classification on Twitter. They developed deep learning models based on CNNs. They applied various word embeddings to convert the textual input into numeric form before training the CNN. They used five CNNs (CNN + Twitter Glove, CNN + Google News, CNN + Edinburgh, CNN + H Spam, CNN + Random) for word embeddings and one feature-based model for spam detection. Mehul Gupta et al. [26] compared various machine learning and deep learning techniques for SMS spam detection on two different datasets. They compared the performance of eight different classifiers and showed that the CNN classifier achieved accuracies of 99.19% and 98.25% on the two datasets.

Figure 2 shows a summary of machine learning algorithms for solving various cybersecurity issues. Although most researchers applied many machine learning algorithms to all four cybersecurity issues, we summarize only the appropriate models for each specific cybersecurity issue. Intrusion detection can be solved by good feature selection techniques and deep learning models like Recurrent Neural Networks (RNNs). Malware detection (PC) can be solved effectively by ANNs and CNNs; malware samples are first converted to images and then CNNs are applied. Android malware detection can be addressed by shallow machine learning algorithms and various fusion models. Spam detection can be efficiently addressed by shallow machine learning models like Naïve Bayes and K-NN and by deep learning models like CNN.

Figure 2: Machine Learning for cyber security

4 CONCLUSION
Machine learning approaches are widely applied to solve various types of cybersecurity problems. Advances in the field of machine learning and deep learning offer promising solutions to cybersecurity issues, but it is important to identify which algorithm is suitable for which application. Multi-layered approaches are needed to keep the solution resilient against malware attacks and to achieve high detection rates. The selection of a particular model plays a vital role in solving cybersecurity issues. In this paper the authors explored the state-of-the-art mechanisms for cybersecurity problems. The autonomous capabilities of machine learning and deep learning algorithms must not be overestimated. The combination of human supervision and machine learning techniques results in achieving the desired goals of cybersecurity.

REFERENCES
[1] William G Hatcher, Wei Yu, "A Survey of Deep Learning: Platforms, Applications and Emerging Research Trends",
IEEE Access 2018, Volume: 6, DOI:10.1109/ACCESS.2018.2830661.
[2] Venkatesh Jaganathan, Premapriya Muthu Sivashanmugam, Priyesh Cherurveettil, "Using a Prediction Model to Manage Cyber Security Threats", Hindawi Publishing Corporation, The Scientific World Journal, Volume 2015, Article ID 703713, [Link]
[3] Daria Lavrova, Alexander Pechenkin, "Applying Correlation and Regression Analysis to Detect Security Incidents in the Internet of Things", International Journal of Communication Networks and Information Security (IJCNIS), Volume 7, No. 3, December 2015.
[4] Saroj Kr. Biswas, "Intrusion Detection Using Machine Learning: A Comparison Study", International Journal of Pure and Applied Mathematics, Volume 118, No. 19, 2018, 101-114.
[5] R. Vinayakumar, Mamoun Alazab, K. P. Soman, Prabaharan Poornachandran, Ameer Al-Nemrat, A.N. Venkatraman, "Deep Learning Approach for Intelligent Intrusion Detection System", IEEE Access, Volume 7, 2019, Digital Object Identifier 10.1109/ACCESS.2019.2895334.
[6] Md. Zahangir Alom, Venkata Ramesh Bontupalli, and Tarek M. Taha, "Intrusion Detection using Deep Belief Networks", 978-1-4673-7565-8/15/$31.00 ©2015 IEEE.
[7] J. Kim, L. T. Thu and H. Kim, "Long Short-Term Memory Recurrent Neural Network Classifier for Intrusion Detection", IEEE International Conference on Platform Technology and Service, 2016.
[8] Anna L. Buczak and Erhan Guven, "A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection", IEEE Communications Surveys and Tutorials, Volume 18, No. 2, 2nd Quarter 2016.
[9] Ondrej Kubovič (ESET Security Awareness Specialist), "Machine-Learning Era in Cybersecurity: A Step Towards A Safer World or The Brink of Chaos", Machine-Learning Era in Cybersecurity White Paper, February 2019.
[10] N. Shone, V. D. Phai, T. N. Ngoc, Q. Shi, "A deep learning approach to network intrusion detection", IEEE Transactions on Emerging Topics in Computational Intelligence, Feb. 2018, pp. 41-50.
[11] Dolly Uppal, Vinesh Jain, Rakhi Sinha and Vishakha Mehra, "Malware Detection and Classification Based on Extraction of API Sequences", 978-1-4799-3080-7/14/$31.00 ©2014 IEEE.
[12] Mozammel Chowdhury, Azizur Rahman, Rafiqul Islam, "Protecting Data from Malware Threats using Machine Learning Technique", 2017 12th IEEE Conference on Industrial Electronics and Applications (ICIEA).
[13] Bowen Sun, Qi Li, Yanhui Guo, Qiaokun Wen, Xiaoxi Lin, Wenhan Liu, "Malware Family Classification Method Based on Static Feature Extraction", 2017 3rd IEEE International Conference on Computer and Communications.
[14] Mahmoud Kalash, Mrigank Rochan, Noman Mohammed, Neil D. B. Bruce, Yang Wang, Farkhund Iqbal, "Malware Classification with Deep Convolutional Neural Networks", 978-1-5386-3662-6/18/$31.00 ©2018 IEEE.
[15] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu, "DroidMat: Android malware detection through manifest and API calls tracing", in Proc. 7th Asia Joint Conf. Inf. Security (Asia JCIS), 2012, pp. 62–69.
[16] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, "Drebin: Efficient and explainable detection of Android malware in your pocket", in Proc. 20th Annu. Netw. Distrib. Syst. Security Symp. (NDSS), San Diego, CA, USA, Feb. 2014, pp. 1–15.
[17] M. V. Varsha, P. Vinod, and K. A. Dhanya, "Identification of malicious Android app using manifest and opcode features", J. Comput. Virol. Hacking Tech., vol. 13, no. 2, pp. 125–138, 2017.
[18] A. Sharma and S. K. Dash, "Mining API calls and permissions for Android malware detection", in Cryptology and Network Security. Cham, Switzerland: Springer Int., 2014, pp. 191–205.
[19] M. Zhao, F. Ge, T. Zhang, and Z. Yuan, "An efficient SVM-based malware detection framework for Android", in Communications in Computer and Information Science, vol. 243, Springer, 2011, pp. 158–166.
[20] W.-C. Wu, S.-H. Hung, "A dynamic Android malware detection framework using big data and machine learning", in Proc. ACM Conf. Res. Adapt. Convergent Syst. (RACS), Towson, MD, USA, 2014, pp. 247–252.
[21] Suleiman Y. Yerima and Sakir Sezer, "Droid Fusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection", IEEE Transactions on Cybernetics, Vol. 49, No. 2, February 2019.
[22] Muhammad N. Marsono, M. Watheq El-Kharashi, Fayez Gebali, "Targeting spam control on middleboxes: Spam detection based on layer-3 e-mail content classification", Elsevier Computer Networks, 2009.
[23] James Clark, Irena Koprinska, Josiah Poon, "A Neural Network Based Approach to Automated E-mail Classification", Proceedings IEEE/WIC International Conference on Web Intelligence, 0-7695-1932-6, Oct. 2003.
[24] S. Jancy Sickory Daisy, [Link] Begum, "Hybrid Spam Filtration Method using Machine Learning Techniques", International Journal of Innovative Technology and Exploring Engineering, ISSN: 2278-3075, Volume-8, Issue-9, July 2019.
[25] Sreekanth Madisetty and Maunendra Sankar Desarkar, "A Neural Network-Based Ensemble Approach for Spam Detection in Twitter", IEEE Transactions on Computational Social Systems, Volume: 5, Issue: 4, Dec. 2018.
[26] Mehul Gupta, Aditya Bakliwal, Shubhangi Agarwal & Pulkit Mehndiratta, "A Comparative Study of Spam SMS Detection using Machine Learning Classifiers", Eleventh International Conference on Contemporary Computing (IC3), 2-4 August, 2018, Noida, India, 978-1-5386-6835-1/18, 2018 IEEE.