ML Techniques for Intrusion Detection

This document discusses machine learning techniques for intrusion detection. It begins by providing background on the increasing problem of hacking incidents and types of attacks researchers have considered for detection. It then describes common security solutions like firewalls and intrusion detection systems, focusing on intrusion detection. The detection mechanisms used by intrusion detection systems are described as misuse detection, anomaly detection, and hybrid detection. Misuse detection is discussed in more detail, including signature-based and machine learning-based techniques. The key limitations of signature-based intrusion detection are also noted.


686 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 21, NO. 1, FIRST QUARTER 2019

A Detailed Investigation and Analysis of Using Machine Learning Techniques for Intrusion Detection
Preeti Mishra, Member, IEEE, Vijay Varadharajan, Senior Member, IEEE, Uday Tupakula, Member, IEEE, and Emmanuel S. Pilli, Senior Member, IEEE

Abstract—Intrusion detection is one of the important security problems in today's cyber world. A significant number of techniques based on machine learning approaches have been developed. However, they are not very successful in identifying all types of intrusions. In this paper, a detailed investigation and analysis of various machine learning techniques have been carried out to find the cause of the problems these techniques face in detecting intrusive activities. Attack classification and a mapping of the attack features are provided for each attack. Issues related to detecting low-frequency attacks using network attack datasets are also discussed, and viable methods are suggested for improvement. Machine learning techniques have been analyzed and compared in terms of their capability to detect the various categories of attacks. Limitations associated with each category are also discussed. Various data mining tools for machine learning have also been included in the paper. At the end, future directions are provided for attack detection using machine learning techniques.

Index Terms—Machine learning, intrusion, attacks, security.

Manuscript received June 21, 2017; revised November 27, 2017 and April 2, 2018; accepted May 22, 2018. Date of publication June 15, 2018; date of current version February 22, 2019. (Corresponding author: Emmanuel S. Pilli.)
P. Mishra was with MNIT, Jaipur 302017, India. She is now with the Department of Computer Science and Engineering, Graphic Era (Deemed) University, Dehradun 248002, India (e-mail: [Link]@[Link]).
V. Varadharajan and U. Tupakula are with the Faculty of Engineering and Built Environment and Advanced Cyber Security Research Centre, University of Newcastle, Callaghan, NSW 2308, Australia (e-mail: [Link]@[Link]; [Link]@[Link]).
E. S. Pilli is with the Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur 302017, India (e-mail: [Link]@[Link]).
Digital Object Identifier 10.1109/COMST.2018.2847722
1553-877X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See [Link] for more information.

I. INTRODUCTION
HACKING incidents are increasing day by day as technology rolls out. A large number of hacking incidents are reported by companies each year. A Distributed Denial of Service (DDoS) attack was launched against Estonian websites in 2007, allegedly by Russia [1]. On June 17, 2008, Amazon [2] started receiving authenticated requests from multiple users at one of its locations. The requests increased significantly, causing the servers to slow down. In January 2013, the European Network and Information Security Agency (ENISA) [3] reported that Dropbox was attacked by DDoS and suffered a substantial loss of service for more than 15 hours, affecting all users across the globe. Facebook [4] was hit by a suspected distributed denial of service attack on Sept 28, 2014. Panjwani et al. [5] reported that some form of network scanning activity precedes 50% of the attacks against cyber systems. Attackers are not only launching flooding and probing attacks but also spreading malware in the form of viruses, worms and spam to exploit the vulnerabilities present in existing software, threatening the sensitive information of users stored on machines. The Cisco Annual Security Report [6] mentioned that spam related to the Boston Marathon bombing comprised 40% of all spam messages delivered worldwide on April 17, 2013. In a survey done by Cisco in 2017 [7], the Trojan was classified as one of the top five malware types used to gain initial access to users' computers and organizational networks. Hence, security in such a complex technological environment is a big challenge and needs to be tackled intelligently.

Researchers have considered different categories of attacks for intrusion detection. Examples based on the KDD'99 dataset [12] are Denial of Service (DoS) attacks (Bandwidth and Resource Depletion), Scanning attacks (Probe), Remote to Local (R2L) attacks and User to Root (U2R) attacks. A recent attack dataset (UNSW-NB [13]) classifies attacks into nine categories: Fuzzer, Analysis, Backdoor, Reconnaissance, ShellCode, Worm, Generic, DoS and Exploit. All these attacks are discussed in detail in Section III.

Current security solutions include the use of middle-boxes such as Firewalls, Antivirus and Intrusion Detection Systems (IDS). A firewall controls traffic that enters or leaves a network based on source or destination address. It alters the traffic according to the firewall rules. Firewalls are also limited by the amount of state available and by their knowledge of the hosts receiving the content. An IDS is a type of security tool that monitors network traffic, scans the system for suspicious activities and alerts the system or network administrator [14]. It is the main focus of this paper.

IDSes are mainly of two types: Host based and Network based. A Host based Intrusion Detection System (HIDS) [15] monitors an individual host or device and sends alerts to the user if suspicious activities are detected, such as modifying or deleting a system file, an unwanted sequence of system calls, or unwanted configuration changes. A Network based Intrusion Detection System (NIDS) [16] is usually placed at network points such as gateways and routers to check for intrusions in the network traffic.
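The HIDS behavior described above (alerting when a system file is modified or deleted) can be illustrated with a minimal file-integrity sketch. This is only an illustration under stated assumptions, not a method taken from the paper: hashing with SHA-256, the function names `snapshot`, `integrity_alerts` and `demo`, and the throwaway file names are all hypothetical choices.

```python
import hashlib
import os
import tempfile


def snapshot(paths):
    """Return {path: SHA-256 hex digest}, with None for files that do not exist."""
    result = {}
    for path in paths:
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                result[path] = hashlib.sha256(handle.read()).hexdigest()
        else:
            result[path] = None
    return result


def integrity_alerts(baseline, current):
    """Compare two snapshots and list (path, event) pairs for changed files."""
    alerts = []
    for path, old_digest in baseline.items():
        new_digest = current.get(path)
        if old_digest is not None and new_digest is None:
            alerts.append((path, "deleted"))
        elif old_digest != new_digest:
            # Covers changed content (and a file that appeared after the baseline).
            alerts.append((path, "modified"))
    return alerts


def demo():
    """Create two throwaway files, tamper with them, and return the alerts raised."""
    with tempfile.TemporaryDirectory() as workdir:
        kept = os.path.join(workdir, "hosts")      # hypothetical monitored file
        removed = os.path.join(workdir, "passwd")  # hypothetical monitored file
        for path in (kept, removed):
            with open(path, "w") as handle:
                handle.write("original\n")
        baseline = snapshot([kept, removed])
        with open(kept, "a") as handle:
            handle.write("tampered\n")
        os.remove(removed)
        current = snapshot([kept, removed])
        return [(os.path.basename(p), e) for p, e in integrity_alerts(baseline, current)]
```

Calling `demo()` takes a baseline, appends to one file, deletes the other, and reports one alert per file. A real HIDS would also watch system call sequences and configuration changes, which this sketch does not attempt.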
MISHRA et al.: DETAILED INVESTIGATION AND ANALYSIS OF USING MACHINE LEARNING TECHNIQUES FOR INTRUSION DETECTION 687

TABLE I
DIFFERENCE BETWEEN MISUSE DETECTION AND ANOMALY DETECTION

At a high level, the detection mechanisms used by these IDSes are of three types: misuse detection, anomaly detection, and hybrid detection. In the misuse detection approach, the IDS maintains a knowledge base (a set of rules) for detecting the known attack types. Misuse detection techniques can be broadly classified into knowledge based and machine learning based techniques. In the knowledge based technique, network traffic or host audit data (such as system call traces) are compared against predefined rules or attack patterns. Knowledge based techniques can be categorized into three types: (i) Signature matching (ii) State transition analysis and (iii) Rule based expert systems [17].

Signature matching based misuse detection techniques scan the incoming packets against fixed patterns. If any of the patterns match the packet header, the packet is flagged as anomalous. State transition analysis based approaches maintain a state transition model of the system for the known suspicious patterns. Different branches of the model lead to a final compromised state of the machine. Rule based expert systems maintain a database of rules for different intrusive scenarios. A knowledge based IDS requires regular, dynamic maintenance of the knowledge database and can fail to detect variants of attacks. Misuse detection can also be performed using supervised machine learning algorithms such as Back Propagation Artificial Neural Network (BP-ANN) [18], Decision Tree (DT) C4.5 [19] and Multi-class Support Vector Machine (SVM) [20].

A machine learning based IDS provides a learning based system to discover classes of attacks based on learned normal and attack behavior. The goal of a machine learning based IDS (based on supervised learning algorithms) is to generate a general representation of known attacks. Misuse detection techniques fail to detect unknown attacks. However, these techniques provide good detection accuracy for well-known attacks. These types of IDSes also require regular maintenance of the signature database, which increases the overhead on the user.

Misuse detection based IDSes, particularly signature based ones, are very popular and have achieved commercial success. The pros and cons associated with these approaches are shown in Table I. These IDSes maintain a database of known attack signatures. An attack signature describes the characteristics of an attack. It can be in the form of a code script, a sequence of system call patterns or a behavioral profile, etc. The IDS stores the attack signatures in a certain format. Let us consider the TCP-ping attack to illustrate a signature based misuse detection system (particularly SNORT [8]). If an attacker wants to know whether a machine is active or not, he/she scans the machine. An attacker sends ICMP ping packets. If the machine is set not to respond to ICMP ECHO REQUEST ping packets, an attacker may use the nmap tool to send TCP ping packets to port 80 with the ACK flag set and acknowledgment number 0. The characteristic of this attack is that the flag is set to the 'A' value and the acknowledgment is set to 0 [21]. As such packets are not acceptable at the victim side, on receiving the packets an RST packet is sent to the attacker's machine, which signals that the machine is alive. The rule for detecting a TCP ping attack, targeted against a victim machine residing in the network with IP [Link]/24, is as follows:

alert TCP any -> [Link]/24 any, (flags: A; ack: 0; msg: "TCP ping detected";)

The major limitation of a signature based IDS is that it requires regular updates of the system to add signature rules for up-to-date attacks. It generates more false alarms for new evolving attacks whose signatures are not defined. Later, anomaly detection approaches are used for detecting intrusions.

Anomaly detection based IDSes are based on the hypothesis that an attacker's behavior differs from a normal user's behavior [22]. This helps in detecting evolving attacks. Anomaly based IDSes model the normal behavior of the system and keep updating it over time. For example, each network connection is identified by a set of features such as protocol, service, number of login attempts, packets per flow, bytes per flow, source address, destination address, source port, destination port, etc. The behavioral statistics of these features are recorded over a period. Any abnormal deviation in the feature values for any connection flow will be marked as anomalous by the anomaly detection engine. Anomaly detection techniques are widely categorized into three types: Statistical techniques, Machine learning based techniques and Finite state machine (FSM) based techniques [23]. A finite state machine (FSM) produces a behavioral model which is composed of states, transitions, and actions. Kumar et al. [24] have proposed an IDS which makes use of a Hidden Markov Model to model the transitions of user behavior over a longer span of time. Anomaly detection can also be performed using semi-supervised and unsupervised
machine learning algorithms such as Self Organizing Map (SOM) Neural Network [25], clustering algorithms [26] and One class Support Vector Machine (SVM) [27]. A machine learning based IDS for anomaly detection provides a learning based system to discover zero-day attacks. A zero-day attack refers to the exploitation of a vulnerability that was not previously known. However, these techniques suffer from high false positive rates because of their limitations in differentiating attack behavior from evolving normal behavior. The difference between the misuse detection and anomaly detection approaches is shown in Table I. Hybrid detection approaches integrate the misuse and anomaly detection approaches for detecting attacks. The details of these approaches, with examples from existing literature, are presented in Section V. In general, some of the advantages of using a machine learning based IDS over a conventional signature based IDS are as follows:
• It is easy to bypass a signature based IDS by making slight variations in an attack pattern, whereas a machine learning based IDS based on supervised techniques can easily detect the attack variants as it learns the behavior of the traffic flow.
• The CPU load is low to moderate in a machine learning based IDS as it does not analyze all signatures of the signature database, as a signature based IDS does.
• Some machine learning based IDSes, particularly those based on unsupervised learning algorithms, can detect novel attacks.
• A machine learning based IDS can capture the complex properties of the attack behavior and improve detection accuracy and speed compared with a conventional signature based IDS.
• Different types of attacks keep on evolving. A signature based IDS requires the signature database to be maintained from time to time and kept up-to-date, whereas a machine learning based IDS based on clustering and outlier detection does not require such updates.

In this paper, we have mainly focused on the use of machine learning for anomaly, misuse or hybrid detection mechanisms, with their detailed analysis, and investigated their capability for attack detection. A detailed study of various machine learning approaches is helpful in exploring solutions for advanced cyber intrusion detection. The machine learning based intrusion detection approaches have been categorized into four types. These are as follows: (i) Single classifiers with all features of data set (ii) Single classifiers with limited features of data set (iii) Multiple classifiers with all features of data set and (iv) Multiple classifiers with limited features of data set. In a single classifier system, an individual classifier is used to detect intrusions. Multiple classifier is a broad term which covers a set of ML algorithms used at the time of learning and detecting intrusions. A set of classifiers is integrated to provide a common output for detecting intrusions. For example, Kim et al. [28] proposed a multiple classifier method which hierarchically integrates a misuse detection model with an anomaly detection model rather than just combining their results. DT C4.5 acts as the misuse detection module, and one class SVM acts as the anomaly detection module. Multiple classifier based approaches lower the false alarm rate and improve the detection rate. Each of these approaches learns from the available dataset, which is described by a set of connection features such as source/destination port number, source/destination IP address, source bytes and destination bytes, etc. Ensemble learning is one form of multiple learning algorithms in which predictions by a set of classifiers are combined in some way, discussed in detail in Section IV.

Our analysis reveals that the feature set for analyzing the behavior of one particular category of attack is different from the feature set of another category of attacks, since each of the attack categories possesses some unique characteristics. We discuss the standard feature selection methods used by researchers when the domain knowledge of the attacks is not known. The main goal of the paper is to perform a detailed investigation and critical analysis of using machine learning approaches for intrusion detection. The performance analysis of various categories of machine learning techniques is also carried out, and observations are provided concerning each category. Our paper mainly concentrates on intrusion detection in traditional wired networks.

A detailed discussion of intrusion detection applications in wireless networks can be found in [29]–[31]. Different types of machine learning based IDSes are available for mobile devices. For example, AmoxID [32] is based on SVM algorithms and implemented for iOS and Android OS. SMARTbot [33] is an off-device behavioral analysis framework based on the Artificial Neural Network back-propagation method for mobile botnet detection and achieves 99.49% accuracy. A light-weight Android malware detection system called Andromaly, which also uses machine learning algorithms, is proposed by Shabtai et al. [34]. Sikder et al. [35] propose a context-aware sensor based attack detector, called 6thSense, to detect attacks which bypass the flaws in the sensor management system. It makes use of Markov Chain, Naive Bayes, and Logistic Model Tree (LMT). A detailed survey of various types of machine learning based IDS for mobile devices, particularly mobile phones, can be found in [36]. A detailed survey on virtualization based attacks such as VM Escape, Side Channel Attacks, Hyperjacking, attacks on Guest-OS, etc. and their detection techniques in Cloud/Virtualization environments has been separately addressed in our recent work [37]. More specifically, a survey on cache based side channel attacks and prevention approaches has also been recently published by Anwar et al. [38].

The major contributions of our present research work are as follows:
• The classification of attacks based on their characteristics is presented. Various factors that make the detection of low-frequency attacks (like U2R and R2L, Worms, ShellCode etc.) difficult to achieve by machine learning techniques are discussed, and methods are suggested for improving their detection rate.
• A discussion of existing literature on intrusion detection is provided, highlighting the key characteristics, the detection mechanism, the feature selection employed, and the attack detection capability.
• The critical performance analysis of various intrusion detection techniques is provided with respect to their
attack detection capability. The limitations and comparison with other approaches are also discussed. Various suggestions are provided for improvement in each category of techniques.
• Future directions of machine learning are provided for intrusion detection applications.

The paper is organized into eleven sections. In Section II, a comparison with related surveys is given, highlighting our specific contributions. In Section III, a detailed description of different types of attacks with their characteristics is provided. In Section IV, the description of various machine learning techniques and their characteristics is presented, with a discussion on the importance of feature selection in machine learning. Section V provides a detailed and comprehensive summary of different machine learning approaches for intrusion detection, and Section VI classifies them based on their ability to detect an attack. Section VII discusses the performance analysis of some machine learning techniques for detecting different security attacks. The security issues associated with each category of machine learning techniques are discussed, and solutions are provided to overcome the security issues. Various useful measures are provided for improving their detection rate, followed by Section VIII, which describes the issues in detecting low-frequency attacks. In Section IX, various data mining tools for machine learning and deep learning are discussed. In Section X, future directions are provided to give a brief insight into the ongoing and future research work. Finally, in Section XI, concluding remarks are given with the scope of future work.

II. RELATED WORK
There are surveys on applying machine learning to intrusion detection. Some of them are discussed to highlight their contributions. The specific contributions which make our work different from others are also presented. Agrawal and Agrawal [39] have provided a survey on anomaly detection using data mining techniques for intrusion detection. They have categorized the anomaly detection approaches into three groups: clustering based approaches, classification based approaches and hybrid approaches. K-means, K-Medoids, EM clustering and outlier detection algorithms have been described under clustering based approaches. Naive Bayes Algorithm, Genetic Algorithm, Neural Networks and Support Vector Machine have been described under classification based approaches. Hybrid approaches describe the combination of machine learning approaches. They have provided a brief comparison of papers using the ensemble based approaches.

Haq et al. [40] provided a survey on the application of machine learning techniques in intrusion detection. They have broadly classified the techniques into three major categories: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, a classifier is trained on the labeled dataset. Unsupervised learning is used when we do not have the labeled dataset. In reinforcement learning, a domain expert can label the unlabeled instances. They have provided a brief description of various single classifier and ensemble algorithms and provided references to papers using machine learning for intrusion detection without giving any critical analysis or observations.

Ahmed et al. [22] provided a survey on network anomaly detection approaches. Attacks are classified into four categories: DoS, probe, U2R and R2L, based on the KDD'99 dataset [12]. Each category points to a specific type of anomaly. They have discussed different types of machine learning approaches, i.e., classification based, clustering based, statistical based and information theory based approaches. The application of these machine learning approaches in intrusion detection distinguishes the normal instances from anomalous instances. A brief summary of issues with various network intrusion detection datasets is given. Collaborative IDSes are suggested as a future research direction. However, a detailed in-depth description and analysis of various existing IDS proposals based on machine learning is lacking in their survey. The authors have also not provided future directions for machine learning algorithms.

A discussion on machine learning and data mining techniques for intrusion detection has been given by Buczak and Guven [41]. Their survey describes the application of machine learning and data mining techniques for misuse and anomaly detection. They have clarified the difference between machine learning (ML) and data mining (DM) and stated that ML is an older sibling of DM. Since both use the same methods for classification or knowledge discovery of data, they use the term ML/DM methods for the algorithms under study. In their survey, they have described various methods and related them to misuse, anomaly and hybrid detection techniques. The time complexity of the algorithms is also described in the paper. They have observed that KDD'99 and DARPA have been the most used data sets, as this makes the comparison relevant to authors. However, some researchers have used NetFlow and tcpdump datasets. They have recommended which ML/DM methods are suitable for misuse and anomaly detection individually.

In our survey, a detailed investigation and analysis of various machine learning techniques have been carried out to find the limitations of these techniques in detecting intrusive activities. The key factor which differentiates the present work from existing surveys is that it is based on the premise that no one particular intrusion detection technique, based on single/multiple classifier algorithms, can help in detecting all types of attacks. Hence, the use of a specific intrusion detection technique is recommended for detecting a specific set of attacks. The importance of various factors while selecting an algorithm is discussed. Attack classification is also provided with attack examples, and the specific attack features are mapped to each attack. Various issues in detecting low-frequency attacks are also mentioned, and methods are suggested for improvement.

A summary of various intrusion detection approaches is discussed, including the literature on diverse datasets. In agreement with Buczak and Guven [41], we also found that people have mostly used the KDD'99 and DARPA datasets. Existing intrusion detection approaches based on machine learning techniques have been thoroughly analyzed concerning individual attack categories. Limitations associated with approaches
for each category are discussed with viable solutions. After the exhaustive survey of literature and critical analysis using the comparison of results reported by researchers, various observations are reported and analyzed. Future directions are provided in the field of intrusion detection. Future directions specifically point towards the usage of deep learning and reinforcement learning techniques for intrusion detection. Various challenges associated with these approaches have also been discussed in the paper.

III. CLASSIFICATION OF ATTACKS WITH RELEVANT ATTACK FEATURES
Network and host based attacks have become pervasive in today's world. Attackers attempt to bypass the security of the network by exploiting the existing vulnerabilities in the network. They disturb the normal functioning of the network by causing network devices to malfunction, flooding the network with excessive packets, scanning the network, etc. This causes the unavailability of service to legitimate users and greatly reduces network throughput. Host based attacks attempt to bypass the security of a host machine. The attacker gains unprivileged access to a machine and tries to gain root access, which may lead to the destruction of important system files, modification of sensitive data, leakage of the user's private information, etc. Host based attacks can be launched as the next step after network attacks. Any machine on the network can be compromised by a hacker who has access to the network. In this case, the attacker first tries to establish a connection over the network to the target machine by exploiting weaknesses in the network protocols or security devices such as Firewall [42] and Intrusion Detection System [43], and then tries to copy malicious files over the network to the host machine. Once the user executes these files, the system is compromised and is under the control of the attacker. Now the attacker can perform any activity on these compromised hosts, such as executing malicious programs and damaging the system. Hackers exploit vulnerabilities present in the computer or network by using specialized tools such as Nmap, scapy, Metasploit, Armitage, Dsniff, Tcpdump, Net2pcap, Snoop, Ettercap, Nstreams, Argus, Karpski, Ethereal, Amap, Vmap, TTLscan and Paketto, etc. [44]–[47]. A detailed description of these tools can be found in [48]. In a secure environment, both network and host based security are important. In this Section, we have described attacks which are classified into three broad categories based on their characteristics, as shown in Figure 1 and Figure 3. In each category, we have also described the important attack features for each attack based on the KDD'99 dataset [12] and the UNSW-NB dataset [13]. All these features are described in detail in Table II and Table III.

Fig. 1. A taxonomy of various Attacks based on KDD'99 dataset.

A. Denial of Service Attacks (Resource Depletion and Bandwidth Depletion Attacks)
This category of attacks causes the unavailability of service to legitimate users and is hence also referred to as DoS (Denial of Service) attacks [49]. For example, let us take an attack scenario: An attacker can send multiple service requests either to register with the enterprise or to access any of the valid service instances running in the enterprise. In this case, the administrative server will be flooded with many service requests and will fail to provide services to other legitimate customers/users. There can be another attack scenario where multiple machines are used to launch a DoS attack: A large number of machines are connected to an organization or enterprise network. If an attacker has access to one or more machines of an organization/enterprise, the attacker can misuse this privilege and launch a DoS attack against the other machines in the same network subnet. Here, the attack surface is very broad: an attacker can occupy multiple machines (Zombies) and can
use them to launch DoS attacks. This kind of DoS is also called a Distributed Denial of Service (DDoS) attack. DoS attacks are classified into two types: Bandwidth Depletion and Resource Depletion attacks. In a Bandwidth Depletion attack, the attacker tries to overload the network with network packets. There are two classes of Bandwidth Depletion attacks: Flooding attacks and Amplification attacks. In Flooding attacks, the attacker tries to flood the network by sending excessive ICMP or UDP packets, overloading the network resources. In Amplification attacks, the attacker tries to exploit the IP address broadcast feature of most routers. This feature allows a sending system to specify a broadcast IP address as the destination address rather than a specific address. Examples of such attacks are the smurf and fraggle attacks [52]. In Resource Depletion attacks, the attacker ties up the resources of a victim system. This attack can be launched by exploiting the network protocol (e.g., neptune, mailbomb) or by forming malformed packets (e.g., Land, Apache2, Back, teardrop, ping of death, etc.) which are sent to the victim machine over the network.

TABLE II
TCP CONNECTION FEATURES IN KDD'99 [50]

A brief explanation of some of these attacks [53] is given below:

Land: In a Land attack, an attacker sends a spoofed SYN packet in which the source address is the same as the destination address. It is effective against some TCP/IP implementations.
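The amplification condition described in this subsection (a packet whose destination is a broadcast IP address rather than a specific host) can be sketched as a simple check. This is an illustration only, not a detector from the paper: the packet dictionaries, the function names, and the documentation address ranges 192.0.2.0/24 and 198.51.100.0/24 are assumptions made for the example, and restricting the check to ICMP and UDP reflects the usual smurf and fraggle traffic rather than anything stated in the text.

```python
import ipaddress


def is_directed_broadcast(dst_ip, network_cidr):
    """True when dst_ip is the broadcast address of the given IPv4 network."""
    network = ipaddress.ip_network(network_cidr)
    return ipaddress.ip_address(dst_ip) == network.broadcast_address


def amplification_candidates(packets, network_cidr):
    """Return the ICMP/UDP packets addressed to the broadcast address of network_cidr."""
    return [
        pkt for pkt in packets
        if pkt["proto"] in ("icmp", "udp")
        and is_directed_broadcast(pkt["dst"], network_cidr)
    ]


# Hypothetical packet summaries; field names are assumptions for this sketch.
PACKETS = [
    {"proto": "icmp", "src": "198.51.100.7", "dst": "192.0.2.255"},
    {"proto": "icmp", "src": "198.51.100.7", "dst": "192.0.2.10"},
    {"proto": "tcp", "src": "198.51.100.9", "dst": "192.0.2.255"},
    {"proto": "udp", "src": "198.51.100.7", "dst": "192.0.2.255"},
]

suspects = amplification_candidates(PACKETS, "192.0.2.0/24")
```

Here `suspects` holds the first and last packets. In practice the spoofed source address in such packets is the victim, so the source field identifies the target rather than the attacker.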

TABLE III
TCP CONNECTION FEATURES IN UNSW-NB

Attack Features: The attack can be detected by considering the feature ‘Land’. If the value of the feature ‘Land’ is 1, the source and destination addresses are identical. Hence this feature is the most important one in recognizing this attack.
Teardrop: In this attack, the attacker sends fragmented packets to a target machine and sets the fragment offsets in such a way that subsequent packets overlap with each other. If there is a bug in the IP fragmentation reassembly code of the receiving operating system, the machine crashes due to improper handling of the overlapping packets. Such attacks are successful on different operating systems such as Windows 3.1x, Windows 95, Windows NT and versions of the Linux kernel prior to 2.1.63.
Attack Features: The feature ‘Wrong Fragment’, which is the sum of bad checksum packets in a connection, provides some clue about malformed IP packets. Hence this feature is important in recognizing this attack.
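As a rough illustration of the two checks described above, the following sketch tests for the Land condition (identical source and destination address) and for Teardrop-style overlapping fragments. The `Fragment` type and both function names are hypothetical helpers introduced only for this example; they are not KDD'99 features or part of any packet-capture library, and a real detector would parse actual IP headers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical, simplified view of one IP fragment (not a real packet parser)."""
    offset: int  # byte offset of this fragment's payload in the original datagram
    length: int  # payload length in bytes


def is_land(src_ip: str, dst_ip: str) -> bool:
    """Mirror the idea of the 'Land' flag: True when source and destination match."""
    return src_ip == dst_ip


def has_overlapping_fragments(fragments: List[Fragment]) -> bool:
    """Return True if any fragment starts before the data seen so far ends."""
    covered_until = 0
    for frag in sorted(fragments, key=lambda f: f.offset):
        if frag.offset < covered_until:
            return True  # this fragment overlaps earlier payload bytes
        covered_until = max(covered_until, frag.offset + frag.length)
    return False
```

Well-formed fragments that sit back to back are not flagged, while a fragment whose offset falls inside an earlier fragment is reported as overlapping.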

Smurf: A Smurf attack is an amplification-based denial of service attack in which the attacker sends a large number of ICMP echo messages to a broadcast IP address, with the spoofed address of the victim’s machine as the source IP. On receiving the packet, each machine in the broadcast network replies to the victim’s machine, keeping its resources uselessly busy [54].
Attack Features: This attack can be easily detected at the victim machine by looking for a huge number of ICMP echo replies arriving at the victim machine without any ICMP echo request packets having been sent from it. Features such as ‘Service’ (ICMP), ‘Duration’, ‘Dst host same srv rate’ (the percentage of connections to the same service and to the same destination IP address coming from the attacker’s machines) and ‘Same srv rate’ (the percentage of connections to the same service and to the same destination IP address going from the victim machine) are useful in determining the total number of ICMP echo packets sent to the victim machine and the total number of ICMP reply packets sent from the victim machine within some duration of time.
Ping of Death: Ping of Death (PoD) is a denial of service (DoS) attack in which an attacker deliberately sends an IP packet larger than the 65,535 bytes allowed by the IP protocol. This maximum includes the packet header, which is typically 20 bytes long. The oversized packet causes the system to crash or freeze. Many operating systems are vulnerable to this attack [55].
Attack Features: An attempted Ping of Death can be identified by noting the size of all ICMP packets and flagging those that are longer than 65,535 bytes. The features ‘Dst bytes’ (total number of bytes received) and ‘Duration’ of a connection may provide some clue about a PoD attack: the total number of bytes received within a short duration of time can be compared with a threshold value (65,535).
Mailbomb: In a Mailbomb attack, unauthorized users send a large number of email messages with large attachments to a particular mail server, filling up disk space and denying email services to other users [56].
Attack Features: This attack can be identified by looking for thousands of mail messages coming from a particular user within a short period of time. Features such as ‘Destination IP’, ‘Dst bytes’ (total bytes received), ‘Service’ (SMTP/MIME), and ‘Dst host same src port rate’ (percentage of connections to the same port and to the same destination IP address) are important in detecting the behavior of this attack.
SYN Flood: A SYN flood exploits the TCP/IP implementation. An attacker sends a SYN request to the victim machine. The victim replies with a SYN-ACK and waits for the reply. The server adds the information of each half-open connection to the pending connection queue. The half-open connections on the victim server will eventually fill the queue, and the system will be unable to accept any new incoming connections [57].
Attack Features: A SYN flood attack can be distinguished from normal network traffic by looking for a number of simultaneous SYN packets destined for a particular machine that are coming from an unreachable host, or by setting a threshold on the duration of time a system has to wait for the reply. Hence features such as ‘Duration’, ‘Flag’ (S0: ‘Initial SYN but no further communication’, etc.) and ‘Dst host count’ (percentage of connections to the same destination IP (victim machine)) are very important in recognizing this attack. Noting those connections that raise the SYN flag with no connection established within a short duration of time is therefore useful in detecting the attack.

B. Scanning Attacks
Scanning activity is a growing cyber security concern because it is the primary stage of an intrusion attempt, used to locate the target systems in the network and subsequently exploit known vulnerabilities. An attacker sends a large number of scan packets to gain a detailed description of the machines, using scanning tools such as nmap, satan, saint, msscan etc. Bou-Harb et al. [58] provided a detailed discussion of scanning techniques. They classified the cyber scanning topic into three parts: Nature, Strategy and Approach. The nature of a scanning attack can be active or passive. The attack strategies could be remote to local, local to remote, local to local and remote to remote. They also classified 19 cyber scanning techniques with their pros and cons. At a high level, all 19 techniques are explained under five major categories: Open Scans [59], Half-Open Scans [60], Stealthy Scans [61], Sweep Scans [62] and Miscellaneous scans [63], [64]. For example, an open scan and a stealthy scan (specifically the SYN-ACK scan) are shown in Figure 2 [58]. An open scan uses the full TCP handshake, probing TCP ports with the SYN flag. A closed port replies with the RST flag set (line i), whereas an open port replies with the SYN and ACK flags set (line ii). The attacker can then reset the connection by sending RST and ACK. A firewall can detect such simple scans by looking at logs. Stealthy scans build on open scans and use other flags together with the SYN flag to avoid detection. In the stealthy SYN-ACK scan, an attacker sends a packet with the SYN and ACK flags set to the target; closed ports send back the RST flag (line iii), whereas open ports do not generate any response (line iv). It is a relatively fast method and does not require the three-way handshake or a solo SYN flag. Besides scanning scenarios, the authors also address IP version issues in cyber scanning activities. A separate literature review is provided for distributed detection techniques, which are classified by scanning activity into one to one, one to many, many to one and many to many approaches. These probes are useful in launching future attacks [65]. Some of the scanning attacks are described below.
Ipsweep: An Ipsweep is used to determine which hosts are listening on the network by sending many ping packets. If a target host replies, the reply reveals the target’s IP address to the attacker.
Attack Features: A Network Intrusion Detection System can examine the total number of ping packets arriving within a short duration of time. Features such as ‘Duration’, ‘Service’ (ICMP), ‘Dst host same srv rate’ (used to find the connections to the same service) and ‘Flag’ (used to find connection status) are important to find the total ping messages within short

duration of time and current state of connection to detect the ipsweep attack.

Fig. 2. Examples of scanning attacks: The Open Scan targeting (i) a closed and (ii) an open port; The SYN|ACK Scan targeting (iii) a closed and (iv) an open port.

Reset scan: In a Reset Scan, an attacker sends reset packets (RST flag up) to the victim machine to determine whether the machine is active. If the victim machine does not send any response to the reset packet, the machine is alive.
Attack Features: These scans can be detected by examining the RST packets arriving at a vulnerable machine for the same service within a short period of time. Features such as ‘Duration’, ‘Service’, ‘Flag’ and ‘Dst host count’ (used to find the sum of connections to the vulnerable machine) are important to find the sum of connections that have initiated an RST packet within a short duration of time with the same service protocol.
SYN scan: A SYN scan is a half-open scanning attack because the attacker does not make a complete TCP connection. The attacker sends a large number of SYN packets to different ports. Open ports respond with SYN-ACK, and closed ports respond with RST.
Attack Features: These scans can be detected by checking for a large number of half-open connections with Flag either REJ (connection rejected; initial SYN elicited a RST reply) or S1 (SYNs exchanged, nothing further seen) initiated by the attacker machine. Hence, features such as ‘Duration’, ‘Flag’ and ‘Dst host diff srv rate’ (percentage of connections to different ports and to the same destination IP) are important to detect this attack.

C. User to Root Attacks
User to Root (U2R) attack refers to the group of exploits which are used by an unprivileged local user to gain root access to a machine. These exploits are used in different ways to gain root access to the machine. For example, in a buffer overflow attack, the attacker exploits the vulnerability of a user program which copies too much data into a static buffer without checking that the data will fit. The attacker manipulates the data that overflows the buffer and causes arbitrary commands to be executed by the operating system. In the Ffbconfig attack [66], the attacker exploits the ffbconfig program distributed with some operating systems. The attacker overwrites the internal stack space of the ffbconfig program, which does not perform sufficient bounds checking on its arguments. The ffbconfig program is a part of the FFB (Fast Frame Buffer) Graphics Accelerator. The loadmodule attack attempts to exploit a vulnerability present in some operating systems [67]. The loadmodule program loads two dynamically loadable kernel drivers into the currently running system and creates special devices in the /dev directory. An attacker exploits the bug present in the loadmodule program to gain root access to the machine. The Perl attack exploits a bug in suidperl, the version of Perl that supports saved set-user-ID and set-group-ID scripts. In this version, the interpreter does not properly relinquish its root privileges when changing effective user and group IDs. Another example is rootkit attacks [68]. Rootkits are stealthy programs which are used to install a backdoor, a hidden entry way into the system for the attacker that bypasses the root privileges of the machine. Rootkits allow the attacker to hide many suspicious processes from the machine and to install additional software such as a sniffer or keylogger to compromise the resources of the machine [69].
Attack Features: KDD’99 features are not sufficient to observe the behavior of these attacks. In fact, it is very difficult to distinguish these attacks from each other by considering the KDD’99 features. However, a few features present in KDD’99 such as ‘Num failed login’, ‘Su attempted’, ‘Is hot login’, ‘Num shells’, ‘Root Shell’, ‘Num root’, ‘Duration’ and ‘Service’ provide some hint about abnormal behavior of the root user and are hence helpful in detecting U2R attacks.

D. Remote to User Attacks
Remote to User (R2L) attack refers to the group of exploits which are used to gain local access to the vulnerable machine, provided the attacker can send packets to the victim machine over a network. There are various ways to launch attacks to gain such an illegitimate privilege on a machine. In a Dictionary/Guess Password attack, an attacker makes repeated guesses of possible usernames and passwords. The attack can be attempted using many services which provide a login facility, such as telnet, ftp, pop, rlogin and imap. In the FTPwrite attack, an attacker exploits an ftp misconfiguration [70]. If the ftp root directory or its subdirectories are not write protected and are in the same group as the ftp account, an attacker can add files such as rhost files to these directories and gain access to the machine. In the Imap attack, an attacker exploits a buffer overflow in the Imap server which exists in the authentication code of the login transaction [71]. The attacker sends carefully crafted text to execute arbitrary instructions. In the Xlock attack, attackers exploit the unprotected X console of the user to gain access to the machine. Attacker displays the modified xlock program to the

user and waits till the user enters the password in that display. The password is sent back to the attacker by the trojan version of the xlock program [72]. In the warezmaster attack, an attacker exploits a bug present in the FTP server. If the FTP server has given write permissions to the guest account, an attacker can log in to the guest account in the public domain of FTP servers and upload ‘warez’ (copies of illegal software) to the server. Users can later download these files [73]. The Warezclient attack is launched by a legal user during an FTP connection after the execution of warezmaster by an attacker. Users download the files (illegal software copies) from the server that were previously created by warezmaster [74].
Attack Features: Network connection features are not sufficient to observe the behavior of the R2L attacks. In fact, it is very difficult to distinguish these attacks from each other by considering the network connection features. However, a few features present in KDD’99 such as ‘Duration’, ‘Service’, ‘Src bytes’, ‘Dst bytes’, ‘Num failed login’, ‘Is guest login’, ‘Num compromised’, ‘Num File creation’, ‘Count’, ‘Dst host count’ and ‘Dst host srv count’ may provide some hint about abnormal behavior of a user in a local connection and are hence helpful in detecting R2L attacks.
There is a small difference between R2L and U2R attacks. In U2R attacks, it is assumed that the user already has local privilege on the victim machine (obtained via an R2L attack). The attacker tries to attain root privileges after accessing the machine. Hence, in the case of U2R, the values of the traffic features will be similar to those of a normal connection, and these features are the least important to consider. The basic and content features are important in this case, whereas in R2L attacks the attacker tries to obtain local access to a remote machine. In R2L, all features are important. In DoS and Probe attacks, traffic features are very important together with other features. In Section VIII, we have described the difficulties in detecting these attacks using network attack data sets.
On the basis of the UNSW-NB attack dataset, attacks have been categorized into 9 types as shown in Figure 3. DoS is described earlier in detail. Other attacks are described below.

Fig. 3. A taxonomy of various Attacks based on UNSW-NB dataset.

E. Fuzzers
In a fuzzer attack, the attacker sends a large amount of randomly generated input, either from the command line or in the form of protocol packets. The attacker tries to discover security loopholes in the OS, program or network and to suspend these resources for a time period, or even crash them.
Attack Features: If a source sends a large number of packets continuously using the same service protocol and/or to the same destination port number over some duration of time, it could be an indication of a Fuzzer. In fact, some other features such as source to destination bytes, source to destination packet count and a lot of variation (or ‘jitter’) can be an indication of problems. The features which are very helpful in this category of attack are: dur, service, sbytes, spkts, srcjitter, synack, ct_srv_src, ct_src_dport_ltm, described in Table III.

F. Analysis
This category of attack refers to various intrusions that penetrate Web applications by various means such as port scanning, malicious Web scripting (like HTML file penetration) and sending spam emails.
Attack Features: The attack characteristics of various port scanning attacks and the important features for detecting those attacks are discussed in Section III-B. Mail service providers provide anti-spam filters to filter such emails coming from unauthorized sources. Spam emails can bypass such filters. Hence, in addition to the source IP address, the overall network performance can be analyzed by considering various possible features as listed in Table III.
In particular, Web application attacks can be detected by performing HTML header analysis, email header analysis or code analysis (scripting codes) [75].

G. Backdoor
In a backdoor attack, the attacker can bypass normal authentication and obtain unauthorized remote access to a system. The attacker tries to locate the data by carrying out fraudulent activities that bypass the security of the system. The hacker uses backdoor programs to install malicious files, modify code or gain access to the system or data.
Attack Features: Some of the important features that must be present in the feature set are as follows: {sport, dsport, dur, sbytes, service, ackdat, sjit, djit, ct_flw_http_mthd, is_ftp_login, ct_srv_src, ct_dst_ltm}. It won’t be easy to get exact information about a backdoor attempt at the victim machine. However, by analyzing network features, one can get some clue about unauthorized network attempts.

H. Exploits
The Exploits category refers to intrusions that exploit a software vulnerability, bug or glitch within the operating system or software. Attackers use their knowledge of the software to launch exploits with the intention of causing harm to the system.
Attack Features: Various important features which are crucial for detecting attempts to launch exploits at the monitored machine are as follows: {srcip, dstip, sport, dsport, sinpkt, synack, is_sm_ips_ports, ct_ftp_cmd, res_bdy_len, ct_src_ltm, ct_src_ltm} (refer Table III). These features may provide some hint about an attempt to launch exploits. However, exploits can be more appropriately detected by monitoring the operating system behavior using dynamic analysis techniques. One can refer to our work for the same [76], [77].
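To make the per-category feature sets above easier to apply, one option is to keep them in a lookup table and project each flow record onto the subset for the category being examined. The sketch below is only an illustration: the names `ATTACK_FEATURES` and `project` are hypothetical, the record is a plain dictionary rather than any particular dataset loader, and only the Backdoor and Exploits lists are copied from the text.

```python
from typing import Dict, List

# Feature names as listed in the text for two UNSW-NB categories.
# The Exploits list in the text repeats ct_src_ltm; it is kept once here.
ATTACK_FEATURES: Dict[str, List[str]] = {
    "Backdoor": [
        "sport", "dsport", "dur", "sbytes", "service", "ackdat", "sjit",
        "djit", "ct_flw_http_mthd", "is_ftp_login", "ct_srv_src", "ct_dst_ltm",
    ],
    "Exploits": [
        "srcip", "dstip", "sport", "dsport", "sinpkt", "synack",
        "is_sm_ips_ports", "ct_ftp_cmd", "res_bdy_len", "ct_src_ltm",
    ],
}


def project(record: Dict[str, object], category: str) -> Dict[str, object]:
    """Keep only the fields of a flow record that are listed for the category."""
    wanted = ATTACK_FEATURES[category]
    return {name: record[name] for name in wanted if name in record}
```

Fields that the record does not contain are skipped rather than filled in, so a partially populated record still yields a usable, smaller feature vector.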

I. Generic
A Generic attack against a cryptographic system tries to break the key of the security system. It is independent of the implementation details of the cryptographic system. The structure of the block cipher is not considered. For example, the birthday attack is a Generic attack which treats the hash function as a black box.
Attack Features: It would be good to take all possible network features into consideration for Generic attacks. The accuracy of the system may not be very good if only network features are considered. One can also perform dynamic analysis of code to check the behavior of code running on the victim machine. UNSW-NB does not provide system-specific features such as root_login, su_attempted, Hot, Num_Shell etc. as specified by KDD’99.

J. Reconnaissance
Reconnaissance refers to attacks that gather information about the target computer network in order to bypass its security controls. It can be defined as a probe which is a preliminary step towards launching further attacks. Attackers use port scanning, OS scanning, nslookup, dig, whois, etc. to gather information about the system. Depending on the TCP responses collected for each crafted packet, one can make an intelligent guess of the operating system. After collecting sufficient information, attacks such as DDoS, worms, buffer-overflow exploits etc. can be launched.
Attack Features: Various important network features to detect such attacks are: {sport, dsport, srcip, dstip, dur, spkts, sinpkt, service, synack, ct_srv_src, ct_src_ltm, ct_dst_ltm}. All these features provide key network information about the source and destination systems. The details of the various port scanning attacks, the corresponding features and the attack characteristics are described earlier in Section III-B.

K. Shellcode
A shellcode is used as a payload which is executed on the target machine to exploit a software vulnerability. It is called shellcode because it starts a command shell which is under the control of the attacker. Local shellcode tries to exploit the vulnerability of a highly privileged process on a local machine, for example through a buffer overflow. Remote shellcode targets a vulnerable process running on a remote system. On successful execution, an attacker gains remote access to the target machine. For example, a bindshell connects the attacker to a certain port of the victim machine.
Attack Features: Some of the features which are important for attack analysis are: {sport, dsport, srcip, dstip, dur, service, sbytes, dbytes, state, res_bdy_len, synack, is_ftp_login} (refer Table III). Network features may be helpful for detecting remote shellcode. However, in order to provide lower false alarms and good accuracy, shellcode can be detected by behavior analysis of the programs. These types of attacks fall under the category of low-frequency attacks and can be launched easily at remote machines in a few attempts at making a network connection to the remote machine.

L. Worms
Worms are malicious programs or malware that replicate themselves and spread to other computers. They use the network to spread the attack. Most worms are designed to replicate and do not try to change the system files. However, they can cause disruption to services by increasing network traffic.
Attack Features: The important network features could be as follows: srcip, dstip, sport, dsport, proto, spkts, dpkts, tcprtt, stcpb, dtcpb, ct_srv_src, ct_flw_http_mthd, is_ftp_login etc. (refer Table III), which could help in analyzing the spread of packets from the same source address using a particular service and Internet Protocol (IP) over a period of time.
Attacks are intentional attempts to destroy or gain unauthorized access to a machine, or to access a user’s data in an unauthorized way. Attacks target a computer network and/or a computer and harm the resources. Various attacks have been discussed in our study. Each attack is launched in some way and carries some unique characteristics which we have discussed. The network features which are essential for detection of a particular category of attacks have also been mapped to the specific attack category. KDD’99 has been used by most researchers. Hence, we have considered it for our attack study. However, as it is very old, we have also considered a very recent IDS attack dataset, i.e., UNSW-NB [13], which contains ten categories of attacks. The ISCX-IDS attack dataset [78] is not publicly available. We obtained this dataset from the University of New Brunswick (UNB) in the form of PCAP files on request. The attack features and their descriptions are not provided by the authors. Hence, KDD’99 and UNSW-NB have been considered for the study.

IV. MACHINE LEARNING: TECHNIQUES AND FEATURE SELECTION
In this Section, we discuss the most popular machine learning techniques used for detecting intrusions. These techniques have different characteristics and provide different results for detecting intrusions. Here, we describe the working of these techniques along with their characteristics. We further describe the various feature selection approaches with their pros and cons and provide the optimal feature set for each attack.

A. Techniques Used in Machine Learning
Machine learning techniques work in two phases: training and testing. In the training phase, they perform mathematical calculations over the training dataset and learn the behavior of traffic over a period. In the testing phase, a test instance is classified as normal or intrusive based on the learned behavior. Various popular machine learning techniques are described below.
1) Decision Tree: Decision tree learning methods use a branching method to illustrate every possible outcome of a decision. They can work with discrete-valued attributes and continuous-valued attributes as well. The learned trees are then represented in the form of if-then rules. The three basic elements of the tree are the decision node, branch and leaf node, as shown in Figure 4. A decision node specifies a test over some attribute.
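As a minimal sketch of how a decision node might pick the attribute to test, the code below assumes the entropy and information-gain criteria used by ID3-style algorithms. The function names and the four toy connection records are hypothetical and chosen only for illustration; they are not taken from KDD'99.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows: List[Dict[str, str]], labels: Sequence[str],
                     attribute: str) -> float:
    """Entropy reduction obtained by splitting the rows on one attribute."""
    partitions: Dict[str, List[str]] = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows: List[Dict[str, str]], labels: Sequence[str],
                   attributes: Sequence[str]) -> str:
    """Attribute with the highest information gain (the decision node's test)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy records (assumed values): 'flag' separates the classes, 'service' does not.
TOY_ROWS = [
    {"flag": "S0", "service": "http"},
    {"flag": "S0", "service": "smtp"},
    {"flag": "SF", "service": "http"},
    {"flag": "SF", "service": "smtp"},
]
TOY_LABELS = ["attack", "attack", "normal", "normal"]
```

On the toy records, splitting on `flag` removes all impurity while splitting on `service` removes none, so `best_attribute` selects `flag` as the test for the root node.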

Each branch represents one of the possible values for this attribute. Finally, a leaf node represents the class to which the object belongs. There exist various decision tree algorithms. Some of the important decision tree algorithms are ID3 [79], C4.5 [80], CART [81], LMT, etc. ID3 is the very first DT algorithm developed by Quinlan. The ID3 algorithm uses a greedy search approach. The tests are selected using the information gain criterion. In ID3, data may be overfitted and overclassified. ID3 does not handle missing values and numeric attributes. C4.5 is an improved version of ID3, also given by Quinlan. It accepts both discrete and continuous values and splits the tree based on the gain ratio. It also addresses the over-fitting problem by using an error-based pruning technique. J48 is an open source implementation of C4.5 in Weka. It reduces the chances of overfitting. However, for noisy data, overfitting may happen. The CART algorithm splits the tree based on the twoing criterion. It also handles both categorical and numerical values. It uses cost-complexity based pruning and handles missing values. Logistic Model Tree (LMT) uses a decision tree with logistic regression models at its leaves.

Fig. 4. Graphical representation of Decision Tree.

Most of these algorithms operate from root to leaf to arrive at some decision. The following measures are used for choosing the best attribute during classification: Entropy and Information gain. Entropy characterizes the impurity of an arbitrary collection of examples, whereas Information gain measures how well a given attribute separates the training examples according to their target classification. A decision tree is suitable for problems where (a) Instances can be represented by attribute-value pairs. Each attribute can have a disjoint set of possible values. (b) The target function has a discrete output value (e.g., yes or no). (c) The training data may have errors. Decision trees are robust to errors. (d) Training data may contain missing attribute values [82].
We have analyzed the performance of decision trees in Section VII. Decision trees perform better than other single classifiers as they implicitly perform feature screening or feature selection based on the two parameters: Entropy and Information gain. The more Information gain a feature has, the more capable the feature is of discriminating the output classes. The topmost nodes over which the tree splits are the most important features of a dataset. Decision trees are not sensitive to outliers. However, computing probabilities of different possible branches, determining the best split of each node, and selecting optimal combining weights to prune algorithms contained in the decision tree are complicated tasks and involve high computational cost. Changing variables, excluding duplicate information, or altering the sequence midway can lead to major changes.
2) Artificial Neural Network: Neural Network learning methods provide a robust approach for approximating real-valued, discrete-valued and vector-valued target functions. The Multi-layered Perceptron Back Propagation Algorithm [83], Adaptive Resonance Theory based [84], Radial Basis Function based [85], Hopfield Networks [86] and Neural Tree [87] are some examples of classification algorithms using Neural Networks. An ANN consists of three main elements: input nodes, hidden nodes (processing elements in hidden layers) and output nodes, as shown in Figure 5. Training a multi-layer perceptron (MLP) neural network by back-propagation learning (BPL) consists of two stages: feed forward and back propagation. In the feedforward stage, input data are fed to every node of the hidden layer. Each hidden node and output node calculates its activation value. The difference between the network output and the desired target value is used to generate an error. In the back-propagation stage, the error is propagated back from the output layer to the input layer, and the weights between output nodes and hidden nodes are adjusted. The gradient descent method is used to update the weights. The weights are updated till a predefined threshold is reached [88]. Neural Networks are suitable for problems where a) Instances are represented by many attribute-value pairs. These values can be highly correlated or independent of each other. b) The target function output may be discrete-valued, real-valued or a vector of real or discrete values. c) Training samples may contain errors. ANN is robust to noise. d) The learned function is typically difficult for humans to understand, and the ability of humans to understand the learned target function is not important [83].

Fig. 5. Input, Hidden and Output layers in Neural Network.

An Artificial Neural Network is a nonlinear model that is easy to use. BPL Neural Networks easily get stuck in a local minimum, and thus their stability is lower. Especially for low-frequency attacks, the detection precision is low. It takes a longer time to train the neural network because of its nonlinear mapping of global approximation. A Neural Network cannot detect temporally dispersed and collaborative attacks because of its inability to retain past events. It is difficult to find the right number of hidden layers and number of neurons. The classifier’s performance also depends on the choice of the activation function. It requires a larger dataset, and the output performance depends on the trained parameters and on the dataset used for training.
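The feed-forward and back-propagation stages described for the MLP can be sketched with a tiny one-hidden-layer network. This is an illustration under stated assumptions, not the configuration evaluated in this survey: the sigmoid activation, the squared-error delta rule, the class name `TinyMLP`, and the fixed starting weights in the usage note are all hypothetical choices.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Logistic activation (an assumed choice of activation function)."""
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, one output node; weights are passed in explicitly."""

    def __init__(self, w_hidden: List[List[float]], b_hidden: List[float],
                 w_out: List[float], b_out: float) -> None:
        self.w_hidden = [list(ws) for ws in w_hidden]
        self.b_hidden = list(b_hidden)
        self.w_out = list(w_out)
        self.b_out = b_out

    def forward(self, x: List[float]) -> Tuple[List[float], float]:
        """Feed-forward stage: activations of the hidden nodes and the output."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w_hidden, self.b_hidden)]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                         + self.b_out)
        return hidden, output

    def train_step(self, x: List[float], target: float, lr: float = 0.5) -> float:
        """Back-propagation stage: one gradient-descent update; returns the error."""
        hidden, output = self.forward(x)
        error = target - output
        delta_out = error * output * (1.0 - output)
        # Hidden deltas use the output weights from before this update.
        delta_hidden = [delta_out * w * h * (1.0 - h)
                        for w, h in zip(self.w_out, hidden)]
        self.w_out = [w + lr * delta_out * h
                      for w, h in zip(self.w_out, hidden)]
        self.b_out += lr * delta_out
        for j, dh in enumerate(delta_hidden):
            self.w_hidden[j] = [w + lr * dh * xi
                                for w, xi in zip(self.w_hidden[j], x)]
            self.b_hidden[j] += lr * dh
        return error
```

For example, a network built with `TinyMLP([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.2, -0.1], 0.0)` and trained for one step on input `[1.0, 0.0]` with target `1.0` moves its output closer to the target. In practice the step would be repeated over the whole training set until the error falls below a chosen threshold.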
Fig. 6. Multi-class Linear SVM.

3) Naive Bayes Classifier: The Naive Bayes classifier is based on the Bayesian learning method and has been found useful in many applications. It is called “naive” because it is based on the simplifying assumption that attribute values are conditionally independent of each other given the target value. It is applied to learning tasks where each instance x can be described by a conjunction of attributes and where the target function f(x) can take any value from some finite set V (a set of target values). In the learning step, the various $P(v_j)$ and $P(a_i \mid v_j)$ are estimated, given training data $\{a_1, a_2, a_3, \ldots, a_i\}$ of i attributes. The classifier estimates the posterior probabilities of observing a class label from a set of normal class and anomaly class labels. For a given test instance, the class label with the largest posterior is chosen as the predicted class [82]. It is suitable for the problems where a) Target function should have discrete output value (for
ex. yes or no). b) Attribute-value pairs can represent instances. the data to a high dimensional space called ‘feature space’
c) The independent assumption of Naive Bayes is acceptable. and define separating hyperplane there. The kernel function
There are three Naive Bayes (NB) algorithms: Gaussian Naive is used to map the data into a new feature space for clas-
Bayes [89], Bernoulli Naive Bayes [90] and Multinomial sification. The choice of a kernel function is very important
Naive Bayes [91]. Gaussian NB is used for continuous data here [94]. Radial basis kernel (RBF) [95] can be used to learn
values which are distributed according to a Gaussian distri- the complex region.
bution. Bernoulli NB is a binomial model used for binary Therefore, SVM algorithms can be categorized into two
feature vectors such as Bag of words model. Multinomial NB types based on the type of kernel function: Linear SVM and
is used for discrete values in which feature vectors represent Non-linear SVM. In Linear SVM, the training data is sepa-
the frequencies in which certain events occur. The probability rated by hyperplane by the linear kernel function. If data is
calculation is different in each of the three NB algorithms. not linearly separable, nonlinear SVM classifier gives poor
Naive Bayes classifier achieves a fast speed of detection results [96]. Hence, nonlinear kernel maps the input data to
and is simpler than other classifiers. However, it makes an a higher dimensional feature space to find the linear plane.
assumption that features are independent of each other. This Based on the type of detection (misuse/anomaly), SVM can
independent relation assumption may not hold true in detecting be categorized into two types: Multi-class SVM and one-class
various types of attacks. For example, in the publicly available SVM. Multi-class SVM is used for supervised learning algo-
KDD’99 intrusion detection dataset, the features are highly rithm. Multi-class classification using SVM can be done in
dependent on each other. For example (refer Table II for fea- two ways: one versus all (the traditional way) and one versus
tures), feature P29 (same srv rate) is dependent on P23 (count). one. In one versus one, a set of binary SVM classifiers are
P23 refers to the sum of connections to the same destination built and the class is selected that is predicted by most of the
IP address. P29 refers to the percentage of connections that classifiers. One class SVM is unsupervised machine learning
were to the same service (tcp, http, icmp, etc.) among the con- algorithm used for novelty detection [97].
nections aggregated in P23 (count). Similarly, feature P28 (srv SVM suffers from the drawback of extensive memory
error rate) is dependent on P23 (count). P27 (rerror rate) is also requirement and algorithmic complexity. The performance also
dependent on P23 etc. Such an assumption may not give desir- depends on the choice of the kernel function and choosing
able results for all types of attacks. Hidden Naive Bayes [92] the parameters of kernel functions. Linear SVM produces
is an extension of Naive Bayes and relaxes this assumption. less accurate results and produces overfitting. Training time
It achieves an accuracy of (99.6%) for DoS attack detection. of SVM is also very high which is not desirable in IDS
4) Support Vector Machine: Support Vector Machine is one where retraining a model is required time to time since user’s
of the most successful machine learning technique in Intrusion behavior keeps on changing. Although it is robust to Noise.
detection when applied with other classifiers. SVM [93] is 5) Genetic Algorithms: Genetic Algorithms (GA) are
based on the notion of the margin-either side of the hyperplane search algorithms that find an approximate solution based
that separates two data classes as shown in Figure 6. The gen- on the principles of natural selection and genetics [98]. The
eralization error can be reduced by maximizing the margin and four operators used in this process are initialization, selection,
creating the largest possible distance between the separating crossover, and mutation as shown in Figure 7. GA evolves
hyperplane and instances on either side of it. Data points that to a high-quality population of individuals starting from the
lie on the margin of optimum separating hyperplane are known arbitrary selected initial population. Each is called a chromo-
as support vector points and solution is represented as a linear some and is composed of a predefined number of genes. The
combination of these points. If the data contains misclassi- quality of genes is measured by its fitness function and quanti-
fied instances, SVM may not be able to find the separating tative representation of each rule’s adaptation [99]. During this
hyperplanes. One of the solutions to this problem is mapping process, the initially selected population is evolved for some
MISHRA et al.: DETAILED INVESTIGATION AND ANALYSIS OF USING MACHINE LEARNING TECHNIQUES FOR INTRUSION DETECTION 699

Fig. 7. Execution Flow of Genetic Algorithm.
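The flow named in Fig. 7 (initialization, then repeated selection, crossover and mutation) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions only: the bit-string encoding, the OneMax fitness (count of 1-genes), tournament selection, single-point crossover and every parameter value are placeholders chosen here, not taken from the surveyed works. In an IDS setting the chromosome would more plausibly encode a detection rule or a feature subset, with a fitness that reflects detection quality.

```python
import random

def fitness(chromosome):
    # Placeholder fitness (assumption): number of 1-genes in the chromosome.
    return sum(chromosome)

def tournament_select(population, rng, k=3):
    # Selection: keep the fittest of k randomly drawn chromosomes.
    return max(rng.sample(population, k), key=fitness)

def crossover(parent_a, parent_b, rng, p_crossover=0.9):
    # Single-point crossover, applied with probability p_crossover.
    if rng.random() < p_crossover:
        point = rng.randrange(1, len(parent_a))
        return parent_a[:point] + parent_b[point:]
    return parent_a[:]

def mutate(chromosome, rng, p_mutation=0.02):
    # Bit-flip mutation, applied independently to each gene.
    return [1 - g if rng.random() < p_mutation else g for g in chromosome]

def run_ga(n_genes=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    # Initialization: an arbitrary (random) starting population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # One generation: selection, crossover and mutation in sequence.
        population = [
            mutate(crossover(tournament_select(population, rng),
                             tournament_select(population, rng), rng), rng)
            for _ in range(pop_size)
        ]
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best

best = run_ga()
print(fitness(best), best)
```

Because the best chromosome seen so far is tracked separately, the returned solution never gets worse as the number of generations grows, although, as with any GA, reaching the global optimum is not guaranteed.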

Fig. 8. Execution Flow of K-means Clustering.
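The loop named in Fig. 8 (pick K centroids, assign points, recompute centroids, stop when nothing changes) can be sketched in a few lines of Python. This is a hedged illustration rather than any surveyed system: the Euclidean distance, the use of the first K points as initial centroids, the toy 2-D data and the use of distance to the closest centroid as an anomaly score are all assumptions made for the example.

```python
import math

def closest(point, centroids):
    # Return (index, distance) of the centroid nearest to point.
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]

def kmeans(points, k, max_iter=100):
    # Any K data points may serve as initial centroids; the first K are
    # used here (assumption) so that the run is repeatable.
    centroids = list(points[:k])
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            clusters[closest(p, centroids)[0]].append(p)
        new_centroids = [                     # update step: per-cluster mean
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:        # stopping criterion: no change
            break
        centroids = new_centroids
    return centroids

# Toy data (hypothetical): two dense groups plus one isolated point.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4), (0.3, 0.3),
          (10.0, 10.0), (10.2, 9.8), (9.9, 10.3), (10.1, 10.1),
          (5.0, 30.0)]
centroids = kmeans(points, k=2)

def anomaly_score(p):
    # Distance to the closest centroid; larger means more anomalous.
    return closest(p, centroids)[1]

ranked = sorted(points, key=anomaly_score, reverse=True)
print(ranked[0], round(anomaly_score(ranked[0]), 2))
```

In this toy run the isolated point ends up far from both centroids and is ranked first. With a different initialization it could instead become a cluster of its own and receive a score of zero, which is the case in which anomalies form clusters by themselves.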


generations. In each iteration, the three operators (selection, crossover, and mutation) are sequentially applied to each individual with certain probabilities. Thus, only the fittest genes survive and reproduce. Some of the characteristics of GA are: a) They are intrinsically parallel and can explore the solution space in multiple directions at once. b) They are well suited to problems where the space of potential solutions is truly large. c) A GA-based system is adaptable and can be easily retrained, which provides the possibility of generating new rules for intrusion detection. d) No gradient information is required. e) They do not require complex mathematics to execute. f) They work with a population of solutions rather than a single solution [100].

The three important things in intrusion detection are speed, accuracy and adaptability. GA has been observed to perform well when applied with other classifiers to optimize the parameters of the classification process and to select features in Intrusion Detection Systems. However, GAs fall short in a few aspects: there is no absolute assurance that a genetic algorithm will find a global optimum. Moreover, representing a problem space in a genetic algorithm is complex, and a large number of fitness function evaluations is needed.

6) K-Means Clustering: The K-means algorithm is a clustering-based anomaly detection algorithm. It is based on the assumption that normal data instances lie close to their closest cluster centroid while anomalies lie far away from their closest cluster centroid [101]. In the first step, the data is clustered into K clusters by taking any K data points as the centroids of the different clusters. The other data points are assigned to the clusters based on their distance to the closest centroid. In the next step, the centroid of each cluster is recalculated as the average of the data points in that cluster. The process is repeated until some stopping criterion is reached, such as no further change in the centroids, as shown in Figure 8. K-means clustering has been widely adopted in combination with other classifiers by researchers working on intrusion detection techniques. Some researchers have used it as a classifier to separate anomalies from normal data instances, while others have used it as a data compaction technique to separate outliers from the training data and provide a refined training dataset to the classifier. In both cases the detection results improved [102].

The techniques fail if the anomalies in the data form clusters by themselves. In this case, this category of techniques will not be able to detect intrusions.

7) K-Nearest Neighbor Approach: K-NN is an instance-based learning algorithm. It belongs to the nonparametric lazy learning algorithms [82]. Nonparametric means that it does not make any assumptions about the underlying data distribution. Lazy means that it delays generalization until classification is performed. The training phase is much faster than for other classifiers, but more computational time is involved during the classification process. It is based on the assumption that instances in a dataset exist in close proximity to other instances that have similar properties. It also assumes that normal data instances occur in dense neighborhoods while anomalies occur far from their closest neighbors. The label of an unclassified instance can be determined by looking at the class labels of its neighboring instances. Nearest-neighbor based anomaly detection techniques can be grouped into two broad categories: (1) techniques in which the distance between a data point and its kth neighbor is used as the anomaly score, and (2) techniques in which the relative density of each data point is calculated as the anomaly score [103]. The choice of k affects the performance of kNN [104]. Some important characteristics of kNN are: a) It is unsupervised and does not make any assumption regarding the generative distribution of the data. b) It is sensitive to the choice of the similarity function used to compare instances. c) It requires large storage. d) It is computationally expensive. e) It is not robust to noise and can misclassify instances if noise is present [94].

Distance-based kNN is used by most researchers in IDS to do the initial refinement of anomalies in the training dataset. However, the performance greatly depends on the distance measure defined between a pair of data instances. It can be challenging to define a distance measure for complex data. kNN fails to label instances correctly if the normal data points do not have enough close neighbors while the anomalies do.

8) Fuzzy Logic: Fuzzy logic is a form of many-valued logic that deals with approximate rather than fixed and exact
reasoning. Fuzzy logic offers the rigor of formal methods without requiring undue precision. It also offers alternative methods to handle policy preferences and conflicts [105]. Fuzzy set theory is defined in terms of fuzzy logic. The semantics of fuzzy operators are understood by using a geometric model. Fuzzy logic can interpret the properties of a neural network, and a precise description of its performance can be obtained. Neuro-fuzzy is very popular in the area of intrusion detection and has been applied by many researchers, as discussed in Section V. A fuzzy set A in X is characterized by a membership function fA(x) which associates with each point in X a real number in the interval [0, 1], with the value of fA(x) at x representing the “grade of membership” of x in A. Thus, the nearer the value of fA(x) to unity, the higher the grade of membership of x in A [106].

Fuzzy logic alone is not enough to detect all types of attacks. It performs well when it is integrated with other classifiers. Fuzzy logic techniques have been used in correlation with intrusion detection systems [107], [108]. The key characteristics of fuzzy logic are as follows [109]: (a) Fuzzy rules allow the construction of if-then rules which can be easily modified based on security applications. (b) They can combine the input from varying sources. (d) The quantitative measures used by IDS, such as connection interval, CPU usage time, etc., are fuzzy in nature. (e) A numerical value can belong to multiple fuzzy sets at the same time, i.e., a numerical value does not have to be fuzzified using only one membership function. (f) The degree of alert that can be produced by an IDS is often fuzzy. The disadvantages of fuzzy rules are as follows: (i) They consider all the factors to be combined as equally important. (ii) A fuzzy system requires more fine tuning and simulation before it is operational. (iii) It is hard to develop a model from a fuzzy system in comparison to other machine learning solutions, due to the complexity involved in building the fuzzy model.

9) Hidden Markov Model: A Markov Model produces a behavioral model which is composed of states, transitions and actions. Both Hidden Markov Models (HMM) and Markov Chains come under the category of Markov Models. In a Markov Chain, the transition probabilities, which determine the topology of the model, are known. In an HMM, the system being modeled is represented by a Markov process with unknown parameters. An HMM can be defined as a tool for representing the probability distribution of a sequence. In an HMM, an observation Xt at time t is generated by a stochastic process. However, the state Zt of the process cannot be directly observed (it is hidden). An HMM satisfies the Markov property, where the state Zt depends only on the previous state Zt-1 at time t-1 [110]. An HMM maintains a K × K transition matrix. Each element Aij of the matrix describes the probability of the transition from Zt-1 to Zt, which can be written as: $A_{ij} = P(Z_{t,j} = 1 \mid Z_{t-1,i} = 1)$.

Ariu and Giacinto [111] proposed an HMM-based IDS architecture known as HMMpayl, which is an anomaly-based IDS. The main goal of HMMpayl is to protect the Web server and the applications hosted by the server from attackers. A subset of n-grams is randomly selected from the payload sequence and passed to the HMM for further analysis. The advantage of HMMpayl is that it reduces the computational cost in comparison to other n-gram models where all the possible sequences are analyzed. The authors simulated the model using the HTTP traffic of the DARPA’99 dataset. It trains k distinct HMMs over randomly generated sub-samples of the sequences. The outputs produced by the individual HMMs are finally combined to produce the detection output. The ensemble of HMMs was found to perform better than a single HMM classifier.

There are some advantages of HMM: (i) An HMM with well-tuned parameters provides better compression than a simple Markov Model. (ii) The model is a fairly readable probabilistic graphical model. (iii) HMM captures the dependencies between consecutive sequences very well. (iv) An ensemble of HMMs is found to perform well for recognizing the structure of sequences. There are some disadvantages of HMM: (i) A fully connected HMM can lead to an overfitting problem, which happens when the model is trained with a dataset having a large parameter space. (ii) HMM, when implemented with the Viterbi algorithm, becomes expensive in terms of both memory and compute time. However, Churbanov and Winters-Hilt [112] applied EM clustering with Viterbi to provide a linear memory requirement.

10) Swarm Intelligence: A swarm can be considered as a group of cooperating agents which work together to achieve some purpose or task. Swarm optimization is an advanced machine learning approach based on evolutionary computation. Kolias et al. [113] provided a survey on swarm intelligence (SI) approaches for intrusion detection. They provided a detailed comparison of various SI-based IDS, pointing out their advantages and disadvantages. The core SI-based techniques used for supervised classification are described. Most of the IDS described by the authors are anomaly detection IDS. The SI-based IDS approaches make use of multiple agents which collaborate with each other to solve a problem and provide the optimal solution. An agent can be used to find the classification rules for misuse detection or to find the clusters for anomaly detection. They mainly categorized the SI-based approaches into three types: (i) Ant Colony Optimization (ACO) based IDS, (ii) Particle Swarm Optimization (PSO) based IDS, and (iii) Ant Colony Clustering (ACC) based IDS. ACO algorithms are motivated by the behavior of ants finding the shortest paths from their nest to food. AntNag [114] is the first ACO algorithm for intrusion detection and is based on building directed graphs for attacks. PSO algorithms are motivated by the coordinated movement dynamics of animal groups. ACC algorithms are motivated by the clustering and sorting behavior of ants working autonomously. A detailed study can be found in their survey. There are some advantages of swarm optimization: (i) SI-based systems are adaptable and can adjust to new stimuli. (ii) These systems are scalable, since the same control architecture can be applied to a group of agents. (iii) They are flexible, as agents can be easily removed or added without affecting the architecture, etc. There are also some disadvantages of SI-based systems: (i) The complexity associated with swarms produces unpredictable results. (ii) A rich hierarchical swarm-based system takes time to shift states. (iii) There is no central control, which makes the system redundant and uncontrollable, etc. [115]
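As an illustration of the particle swarm idea described in this subsection, the following minimal Python sketch lets a group of particles search for the minimum of a function by combining each particle's own best position with the best position found by the whole swarm. It is a sketch under stated assumptions: the sphere objective, the inertia weight w, the acceleration coefficients c1 and c2, the search bounds and the swarm size are illustrative choices made here and are not taken from the surveyed IDS proposals, where the objective would more plausibly be a classifier's error rate or a clustering criterion.

```python
import random

def sphere(x):
    # Toy objective to minimize (assumption); its minimum is 0 at the origin.
    return sum(v * v for v in x)

def pso(objective, dim=2, n_particles=20, iterations=100,
        w=0.7, c1=1.5, c2=1.5, bound=5.0, seed=0):
    rng = random.Random(seed)
    # Each agent (particle) starts at a random position with zero velocity.
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends inertia, attraction to the particle's own
                # best, and attraction to the best found by the swarm.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

best_pos, best_val = pso(sphere)
print(best_pos, best_val)
```

No particle is in charge: the only shared state is the best position found so far, which reflects the absence of central control noted above.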
11) Ensemble of Classifiers/Ensemble Learning: Ensemble learning makes use of multiple learners and combines the predictions made by a set of classifiers called base learners. The use of multiple machine learning algorithms helps in generating a set of hypotheses for a problem. The ensemble of classifiers integrates the hypotheses to generate a common result. The ensemble of classifiers provides a stronger generalization capability than individual base learners [116]. Each base learner is generated by using a machine learning algorithm such as Decision Tree, Naive Bayes, Neural Network, Support Vector Machine, etc. Some ensemble methods make use of homogeneous base learners, in which multiple instances of the same machine learning algorithm are used to generate a set of hypotheses over different sub-samples of the same training dataset. For example, Random Forest is one of the popular ensemble classifiers, which combines the predictions made by multiple decision trees. Other ensemble methods make use of heterogeneous base learners, in which different machine learning algorithms are used as base learners to generate a set of hypotheses. For example, Neural Network, SVM and DT can be trained over a training dataset and their predictions can be combined to generate common predictions. There are many ways to combine the predictions made by multiple base learners. Some of the popular methods are bagging, boosting, majority voting and stacking [117].

The first effective method of ensemble learning was Bagging, also called bootstrap aggregation. Bagging is used to reduce the variance of machine learning algorithms that have high variance. Variance can be described as the amount of change in the prediction of the target function across different training datasets. Ideally, the predictions should not change too much from one training dataset to another. Bagging creates different instances of the same training dataset by selecting bootstrap samples. A bootstrap sample is created by selecting a subset of samples from the training dataset. Given a training dataset of size n, Bagging generates m new training sets of size k by sampling with replacement. In sampling with replacement, an observation may be repeated multiple times in a set. Each of the newly generated training sets is used to train a model. The outputs (predictions) of the models can be combined by either voting (in the case of classification) or averaging (regression).

In boosting, a random sample of the training dataset is drawn without replacement (no observation can be repeated in the subset) and used to train a weak learner. In the second iteration, another sample subset is extracted in the same way; however, 50% samples which were previously misclassified are added to the new training set, and another weak learner is trained. In the next iteration, a new training set is formed from the samples which are classified by both the previous learners. After some such iterations, the predictions made by the weak learners are combined by the majority voting scheme. Both Boosting and Bagging combine the predictions based on the majority vote or average rule [118]. Stacking makes use of a combiner algorithm to combine the predictions. Stacking is a meta-algorithm which trains n base classifiers/learners on the given dataset. Each of the trained classifiers generates predictions for the given input data. The output of each of the classifiers is then learned by a combiner or meta-algorithm to make the final prediction. An ensemble classifier has some advantages over single classifiers: (i) The generalization ability of the ensemble is much better than that of single classifiers. (ii) It may not be possible to select a particular learner based on the available training dataset. The search process may take longer. (iii) An ensemble often reduces the overfitting problem of single classifiers and reduces the prediction error, etc. The disadvantages of ensemble learning are as follows: (i) The complexity of the ensemble affects the training time. (ii) Sometimes the learned concepts become difficult to understand. (iii) It requires more memory than single classifiers, etc.

In this subsection, we discussed various machine learning techniques. Some of them are supervised, such as the various DT algorithms, multi-class SVM, MLP BP-ANN, NB and KNN. Supervised machine learning algorithms can detect known attack patterns. They require a labeled attack dataset. Other ML algorithms, such as one-class SVM, K-means clustering, Self Organizing Map (SOM), DBSCAN, etc., are examples of unsupervised machine learning algorithms. Unsupervised learning is helpful in analyzing an unlabeled attack dataset and finding the outliers. The outliers can be noise, or they can be anomalies which are rarely found in normal scenarios. The outliers are further explored statistically, and useful information is extracted from them which can help to find distinct characteristics in the data. Fuzzy logic can be applied in classification and clustering algorithms to improve their learning capability and attack detection rate. For example, Neuro-Fuzzy and Fuzzy c-means clustering are popular applications of fuzzy logic in different ML techniques.

Swarm intelligence techniques such as Particle Swarm Optimization (PSO) are helpful in nonlinear optimization problems. HMM can be used to capture the dependencies between sequences using a probabilistic approach. Both swarm intelligence and HMM can be used for supervised and unsupervised learning problems. Different classification algorithms have different characteristics. Each of them has some pros and cons, as discussed before. Ensemble learning combines the same or different supervised and unsupervised algorithms to solve the target problem together. It often reduces the over-fitting problem of single classifiers and improves the classification rate.

B. Feature Selection in Machine Learning

The two main things that highly affect the performance of a classifier are the classifier’s technique and the selected feature subset. Researchers have proposed various combinations of classifiers and feature selection methods (discussed in detail in Section IV). The goal of feature selection is to select the most important and optimal subset of features. Feature selection improves the generalization performance, reduces the computational cost of the classifier, makes the classifier faster at detecting unseen data, and simplifies the understanding of data processing. There are various drawbacks of considering all features in the detection technique
such as: (i) It increases the computational overhead of the system and makes training and testing slower. (ii) It leads to a larger storage requirement, since the more features a database contains, the more space it requires to store each feature. (iii) It limits the generalization capability of a classifier which uses data mining techniques for detecting intrusions. (iv) It increases the error rate of the classifier, since irrelevant features diminish the discriminating power of relevant features.

Feature selection methods are categorized into three types [50]: (i) filter methods, (ii) wrapper methods, and (iii) hybrid/embedded methods. Filter methods are independent of the classifier. They compute the intrinsic properties of the data. Filters are fast compared to the other methods. They are relatively robust against overfitting. The major drawback is that they do not consider the classifier’s performance over the selected features. Hence, they can fail to provide the best feature subset for classification. Wrapper-based methods use a combination of a feature subset search algorithm and a classifier algorithm. The performance is measured by the classification rate. The feature subset with a good classification rate is chosen at the end. A classification rate threshold is taken into consideration as a stopping criterion for feature selection. Thus wrapper-based methods are classifier dependent. The major drawback of this method is that successive learning of the classifier may result in an overfitting problem. Computational time is usually high, as it involves successive iterations of the subset selection algorithm and the classifier. Hybrid methods are combined with the classifier’s design in the training phase. Data exploitation is optimized, which reduces the number of retrainings of the classifier for each new subset. Hybrid methods have a higher computational cost than filter-based methods. The importance of various feature selection algorithms for intrusion detection applications has been addressed in detail in [119].

Let us take the example of the features of the KDD’99 network attack dataset. There are 41 features. Table II (shown in Section III) categorizes all features into four main categories:
• Basic Features (1-9): These are the basic features of an individual TCP connection, such as P3 (‘Service’), P1 (‘Duration’), P4 (‘Flag’), etc.
• Content Features (10-22): Content features are extracted from the data portions of the packets, such as P11 (num of failed logins), P14 (Root Shell), P10 (‘Hot’) and P13 (‘Num Compromised’). These features are important for detecting low-frequency attacks such as U2R and R2L. This is because DoS and Probe normally involve a large number of connections within a short period, whereas R2L and U2R normally involve a single connection and are embedded in the data portion of a packet.
• Traffic Features (23-31): Traffic features are computed using a 2s time window, such as P23 (‘Count’), P24 (‘Srv count’), P29 (‘Same srv rate’), etc. These are very important for detecting high-frequency attacks (DoS and Probe).
• Traffic Features (32-41): Traffic features are computed using a 2s time window from destination to host, such as P32 (‘Dst host count’), P33 (‘Dst host srv count’), P34 (‘Dst host same srv rate’) and P39 (‘Dst host srv serror rate’), etc. These are very important for detecting high-frequency attacks (DoS, Probe).

We have provided an overview of different types of machine learning algorithms, i.e., rule based, probability based, clustering based, ensemble learning, genetics based, swarm-intelligence based, etc. The key characteristics, advantages and disadvantages of various machine learning algorithms have also been discussed. It would be inappropriate to say that one type of machine learning algorithm will work best on all types of dataset. A classifier’s accuracy is not the only important factor. Many factors affect the selection of an appropriate classification model, such as: Is our data composed of categorical values only, numeric values only, or both? What is the size of the dataset? Do we need to retrain the classifier often? Do we need a model that is quick to deploy? Do we have labeled or unlabeled data? What is the complexity of the data? Hence, the key characteristics of machine learning algorithms must be known before selecting a set of classifiers for performance analysis.

The importance of feature selection and the types of feature selection methods that can be used with machine learning algorithms are also presented. Feature selection helps in identifying important, non-redundant and relevant attributes that contribute to the accuracy of the predictive model. Also, considering fewer, more important features speeds up the classifiers.

V. SUMMARY OF IDS BASED ON SINGLE/MULTIPLE CLASSIFIER BASED MACHINE LEARNING APPROACHES

Machine learning techniques have been used in different ways for detecting intrusions using publicly available datasets such as KDD’99 [12] and DARPA 1998 [53]. These datasets have 41 features, as shown in Table II in Section III. Recent attack datasets such as ISCX 2012 [78] and UNSW-NB15 [13] have been used in some of the approaches for validation. The summary of various IDS proposals is shown in Table IV and their performance results are shown in Table V. The acronyms for evaluation metrics are shown in Table VI.

Initially, single classifier techniques were used as standalone entities to classify intrusions, but they lacked performance in correctly distinguishing intrusions from normal data instances. Later, feature selection methods were proposed to improve the detection rate and to reduce the computational cost. However, there was no significant improvement in classification rate. Afterwards, some researchers combined single classifiers to improve the detection rate of intrusions using the 41 features of the dataset. There are some limitations with this technique, such as a low detection rate and high computational cost, especially for low-frequency attacks (those that occur rarely in the dataset). In some approaches, feature selection has been used with multiple classifiers, which greatly improved the detection rate for all types of attacks. However, there was no significant improvement in computational cost due to the multiple processing of classifier modules. By the gradual evolution of machine
TABLE IV
SUMMARY OF EXISTING IDS APPROACHES BASED ON MACHINE LEARNING
Rows, by reference: [120]; [121]; [123]; [124]; [122]; [125]; [126]; [127]; [128]; [129]; [130]; [131]; [132]; [61]; [133]; [134]; [135]; [136]; [26]; [137]; [138]; [139]; [140]; [141]; [142]; [143]; [144]; [145]; [146]; [147]; [148]; [149]; [150]; [151]; [152]; [153]; [154]; [155]

learning techniques in intrusion detection, we have classified them into four categories: (i) Single classifiers with all features. (ii) Single classifiers with limited features. (iii) Multiple classifiers with all features. (iv) Multiple classifiers with limited features. These techniques are described in detail below.
TABLE V
PERFORMANCE RESULTS OF EXISTING IDS APPROACHES
Rows, by reference: [120]; [121]; [123]; [124]; [122]; [125]; [147]; CT SVM [126]; TCM-KNN [127]; [128]; Bayesian clustering, DT C4.5 [129]; [130]; [131]; [132]; Decision Tree C4.5 [61]; [134]; [135]; [136]; [26]; [137]; [138]; [156]; [140]; [28]; [141]; [142]; [143]; [144]; [157]; [158]; [104]

A. Single Classifier With All Features

Incorporating single classifiers in IDS was the first step towards intrusion detection using machine learning techniques. Kim and Park [120] proposed a misuse detection approach which applies Support Vector Machine (SVM) for network intrusion detection using the KDD’99 dataset. In the initial step, it collects the training and test datasets from KDD’99. The datasets are pre-processed to be used by the SVM classifier. The SVM is trained over the training dataset and, as a result, a decision model is generated. This decision model corresponds to hyperplanes
TABLE VI
GLOSSARY OF ACRONYMS USED IN PERFORMANCE MEASURES
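The performance measures referred to in this section, such as detection rate, misclassification rate and false alarms, are usually derived from the four confusion-matrix counts. The short Python sketch below shows one common way to compute them. The formulas are the conventional definitions and are an assumption here, since the exact definitions used by each surveyed work may differ; the function name and the example counts are hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute common IDS measures from confusion-matrix counts.

    tp: attacks flagged as attacks      fn: attacks missed
    fp: normal traffic flagged          tn: normal traffic passed
    """
    total = tp + fp + tn + fn
    return {
        # Share of actual attacks that were detected.
        "detection_rate": tp / (tp + fn),
        # Share of normal instances wrongly raised as alarms.
        "false_alarm_rate": fp / (fp + tn),
        # Share of all instances given the wrong label.
        "misclassification_rate": (fp + fn) / total,
        # Share of all instances given the right label.
        "accuracy": (tp + tn) / total,
    }

# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting the false alarm rate alongside the detection rate matters because a classifier can reach a high detection rate simply by flagging most traffic as malicious.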

in feature space with some support vectors and weight vector values. In the learning process, C values of 1, 500 and 1000 are tried together with linear, 2-poly and Radial Basis Function (RBF) kernels. The system tunes C and the kernel function to determine which kernel is effective and efficient. After learning, a validation process is carried out in which test instances are passed to the learned classifier to check the validity of the classification. Performance is compared in terms of detection rate and misclassification rate. The classifier does not produce good results for scanning attacks and low-frequency attacks. The approach achieves a detection rate of 91.6% for DoS, 36.65% for Probe, 12% for U2R and 22% for R2L attacks. The researchers have not reported results for false alarms.

Amor et al. [121] performed intrusion detection using two misuse detection approaches, Naive Bayes (NB) and a Decision Tree (DT) classifier, applied separately, and compared their performance. The KDD’99 dataset is used for training and testing. The Decision Tree algorithm builds the tree from the dataset values. Each non-leaf node corresponds to a test attribute, whereas each branch represents an outcome of the test attribute. An appropriate test attribute is chosen based on entropy and information gain. A leaf node represents the final class of the object. The decision tree produces rules by traversing from root to leaf. In Naive Bayes, conditional probabilities are calculated for each attribute of a test instance with respect to each class label. The product of these probabilities determines the final class. After the learning phase, the test instances are passed to the classifier to check the correctness of classification in terms of detection rate. Both classifiers perform very poorly in detecting the low-frequency attacks (U2R and R2L). NB achieves a detection rate of 96.65% for DoS and 88.33% for Probe, whereas DT achieves 97.24% for DoS and 77.92% for Probe. The detection rate for the low-frequency attacks is low (0.53%-11.84%) for both approaches, as shown in Table V.

Bouzida and Cuppens [122] performed intrusion detection using a Back Propagation Neural Network (BPL NN) classifier and a Decision Tree separately for misuse detection and compared their performance. They performed experiments on the KDD’99 dataset for training and testing. Before the Neural Network is trained, the number of neurons in the input layer is set to the number of input variables. The number of neurons in the output layer is equal to the total number of classes. They consider only one hidden layer. The Neural Network performs well for detecting DoS and Probe attacks, but it fails to detect the low-frequency attacks, since the number of records for these attacks is very small in comparison to the other attacks (DoS and Probe). To improve on this, an enhanced Decision Tree C4.5 is proposed. In the enhanced algorithm, the default condition of the original algorithm is treated as a new class, whereas earlier the default was treated as the normal class. Thus any new instance which does not match a rule is treated as suspicious. The modified C4.5 algorithm improves detection of low-frequency attacks too. DT provides a detection rate of 99.99% for DoS, 99.78% for Probe, 90.39% for U2R and 98.93% for R2L attacks.

Tajbakhsh et al. [131] proposed a misuse detection approach based on fuzzy association rules using the KDD’99 dataset. In the training phase, a membership function is used to perform the feature-to-item transformation. Each attribute-value pair is called an item. The fuzzy membership function is based on the Fuzzy C-Means (FCM) clustering algorithm. In the next phase of training, the produced items are reduced. KDD’99 contains 189 items; rule generation over 189 items is not possible. The rules are generated based on minimum support and confidence values. The fuzzy association rules are used to build the classifier. An instance is assigned a label by the rule sets. In the testing phase, each feature of a test instance is transformed to an item using the membership function. These transformed records are then passed through the learned classifier, which classifies the instance. The training data has been sampled into five sets (Normal, DoS, Probe, U2R and R2L). Rules are produced for each class. The total execution time of this classifier is 500 s. The technique does not perform well for any of the attack classes. It provides 78.9% DR for DoS, 88.5% DR for Probe, 68.6% DR for U2R and 6.2% DR for R2L attacks. The overall detection rate is 70%-90% with 2% false positives.

Kumar and Yadav [140] proposed the simplest model of a misuse detection system, which is based on a Neural Network.
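The entropy and information-gain criterion that the decision tree classifiers in this subsection ([121], [122]) use to choose a test attribute can be illustrated with a small sketch. This is not the code used in any of the surveyed papers; the toy records, the attribute names `protocol` and `flag`, and the helper names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in label entropy obtained by splitting on one attribute."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Weighted entropy of the partitions created by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy connection records (not taken from KDD'99).
rows = [
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "icmp", "flag": "SF"},
    {"protocol": "udp", "flag": "SF"},
]
labels = ["dos", "dos", "normal", "normal", "normal", "normal"]

# The attribute with the highest gain would become the test at the tree node.
best = max(["protocol", "flag"], key=lambda a: information_gain(rows, labels, a))
print(best)
```

In this toy data, splitting on `flag` separates the two classes completely, so its gain equals the full label entropy and it is chosen over `protocol`.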

In the first phase, a network data set is selected and prepared for training and testing. The selected data is then pre-processed to make it compatible with the Neural Network by converting all symbolic values into numeric values, e.g., ICMP=1, TCP=2 and UDP=3. After this, a normalization step is performed in which Z-score values are calculated for each feature value. In the next phase, the Neural Network is trained over the transformed data set. It consists of three layers (input, hidden and output) with 41, 29 and 5 neurons respectively. The learned classifier is tested over the testing data set. The Neural Network is found to perform well for detecting intrusions except for low-frequency intrusions (U2R and R2L).

Amoli et al. [145] proposed an unsupervised clustering based anomaly detection approach to detect and classify DoS, DDoS and Probe attacks. The model is composed of two detection engines which monitor and inspect the behavior of the network in normal or encrypted communications. The first engine calculates a self-adaptive threshold value to detect the network traffic changes that are caused by attacks such as DoS, DDoS, scanning and worms. The clustering is done in two steps. While the network traffic does not pass the threshold, the engine clusters the attack-free traffic according to the DBSCAN algorithm. The clustering algorithm calculates the acceptable distance of the network instances and puts the points into clusters. Once the traffic passes the threshold value, clusters are created again for outliers. The points that cross the acceptable distances are treated as outliers. The second engine aims to detect the botmaster. The first engine sends the IP addresses with attack details to the second engine, which then correlates the packets to find the main system controlling the DoS. They have used the ISCX dataset to validate the approach. It achieves 98.39% accuracy, 100% recall, 98.12% precision, 96.39% TNR and 3.61% FPR, and outperforms K-means outlier detection.

Bhamare et al. [152] presented the use of machine learning for detecting attacks in the cyber network. They have executed various machine learning algorithms using two new network datasets other than KDD’99, i.e., the UNSW-NB15 and ISOT datasets. These are dynamically generated new datasets which provide real attack statistics. Various misuse detection algorithms are evaluated: DT, NB, LR and SVM with three different kernels (RBF, polynomial and linear). DT provides an accuracy of 88.67%, NB 73.8%, SVM with RBF kernel 70.15%, SVM with polynomial kernel 68.06%, SVM with linear kernel 69.54%, and LR 89.26%. DT provides 6.9% FPR, SVM with RBF kernel 4.1%, SVM with polynomial kernel 53.3%, SVM with linear kernel 50.7%, NB 7.3% and LR 4.3%. Among all of these, logistic regression provides the better results with a low FPR. However, the results obtained with such simple ML methods are not very good.

Jazi et al. [157] presented a novel approach for detecting application-layer DDoS attacks which uses the non-parametric CUSUM algorithm. The authors investigated thirteen sampling techniques to perform filtering on data. Of these, they observed that selective flow sampling and sketch-guided sampling perform better and are the most effective sampling techniques. Selective sampling provides the best performance, with 100% accuracy and no false positive alerts at sampling rates above 20%. A generic sketch-guided sampling also provides good results for detecting application-layer attacks. Sketch-guided sampling provides a 92% detection rate at a 40% sampling rate. The authors generated various DoS attack traces using different tools and intermixed the attack traffic with the attack-free traffic from the ISCX dataset.

Wang et al. [158] proposed an intrusion detection framework which uses a support vector machine (SVM) integrated with a data transformation method. Feature dependencies are incorporated in the algorithm. The logarithm marginal density ratios transformation (LMDRT) method is used to perform the data transformation. The augmented features, which are of better quality, are supplied to the SVM. The LMDRT method is motivated by Naive Bayes theory for classification, as it considers the marginal density ratio. The proposed framework has been evaluated using the NSL-KDD 99 dataset. It achieves 99.31% accuracy, a 99.20% detection rate and a 0.60% false alarm rate.

Various intrusion detection techniques based on a single classifier have been discussed. Single classifiers are simple and easy to understand. However, the limitations of single classifier algorithms, such as sensitivity to the choice of input parameters, the choice of kernel function, the number of training variables and overfitting, reduce the chances of getting good evaluation results. Secondly, if the dataset has too many attributes, it becomes difficult for the classifier to provide timely results. Single classifier algorithms, when combined with feature selection algorithms, reduce the computational cost, as discussed in the next section.

B. Single Classifier With Limited Features
Feature selection techniques are used with single classifier approaches to improve their performance. Sangkatsanee et al. [61] proposed a real-time Intrusion Detection System (RT-IDS) based on decision tree C4.5 to detect two types of network intrusions, Denial of Service (DoS) and Probe. The approach is applied in the context of misuse detection. The framework consists of three phases: data preprocessing, classification and post-processing. In the preprocessing phase, a packet sniffer built on the Jpcap library uses network information to extract the IP header, TCP header, UDP header, ICMP header, etc. The packet information is considered between the source and destination IP of each packet. The authors used information gain to extract the 12 features shown in Table VII. Next is the classification phase, in which the classifier is trained over the training dataset of known labels. In the testing phase, the learned model is used to classify a test instance as normal or intrusive based on the learned behavior. Post-processing is used to reduce false alarms. In this phase, the network data between source and destination is divided into groups of five records. In each group, if there exist 3-5 records which are reported as the same attack, that group is considered as

TABLE VII
LIST OF IDS TECHNIQUES WITH FEATURE SET
[Table body not recoverable. Surviving row labels: [120], [121], [123], [124], [122], [126], [125], [147], [127], [128], [129], [130], [131], [132], [61], [133], [134], [135], [136], [26], [137], [138], [156], [140], [28], [141], [142], [143], [144], [159], [160]]

an attack type. The technique does not detect low-frequency attacks. The detection rate for other attacks is higher than 98% with only two seconds of computation time. It detects DoS attacks with 99.434% DR and 0.73% false alarms, and Probe attacks with 98.868% DR and 0.9% false alarms.

Amiri et al. [133] proposed a Modified Mutual Information based feature selection approach (MMIFS) and used it with a Support Vector Machine (SVM) to detect different types of attacks, mainly low-frequency attacks. They have used the training and testing datasets of KDD’99. In the initial phase, data normalization and reduction is carried out by dividing every attribute value by its own maximum value. In the next phase, feature selection is carried out over the imported training data. MMIFS initially sets the feature set to empty. It computes the mutual information of the features with the class output and selects the first feature that has the maximum mutual information (MI) value. In the next step, the MI is calculated between features, and those features are selected which meet particular criteria explained by the authors in their work. This step is repeated until the desired number of features has been selected. The final set is provided as output to the user. The final set contains 8 features for DoS, 12 features for Probe, 14 features for U2R and 12 features for R2L. The features are shown in Table VII. In the training phase, five SVMs are trained over five different datasets (Normal, DoS, Probe, U2R and R2L). In the testing phase, the attack samples are supplied to each of the trained classifiers, which classify them into one of the five classes. The method does not achieve a detection rate greater than 90% for any of the attack classes. Moreover, the detection rate for U2R attacks is the lowest (30.70%).

Lin et al. [144] proposed a new distance based feature extraction approach (CANN), applied with an anomaly detection approach based on the k-NN algorithm to detect intrusions. In the first phase, a clustering algorithm is used to form clusters of the training data, and then two distances are used to determine the new feature value: the first is between a specific data point and its cluster center, and the second is between a specific data point and its nearest neighbor. The new one-dimensional distance based feature value now represents each data point in the training data. In the next phase, principal component analysis (PCA) is used to select the relevant features. Only 6 features are selected, as shown in Table VII. The next phase is the classification phase, in which the k-NN classifier is trained on the new training data set. In the testing phase, the CANN process

is again carried out for the test instances, and the testing data is also represented in the one-dimensional feature space. The k-NN classifier performs the classification over the test data instances. The classifier is not able to detect low-frequency attacks. The overall detection rate is 99.99%, with accuracy 99.76% and false alarms 0.003%. It achieves approximately 99% accuracy for both DoS and Probe attack detection.

Koc et al. [143] proposed the Hidden Naive Bayes classifier, which is an extension of the Naive Bayes classifier, for misuse detection. In the first phase of the proposed framework, the attribute values are converted to discrete values using entropy minimization discretization and proportional k-interval discretization. In the next phase, feature selection is carried out using three methods: correlation based (CFS), consistency based (CONS) and INTERACT. The authors tested combinations of the discretization and feature selection methods with the classifier to find the best method with high detection accuracy. The next phase is the classification phase, in which the Hidden Naive Bayes classifier is used to learn the behavior of attack samples from the training data. The Naive Bayes (NB) classifier is based on the assumption that the attributes are independent. Hidden Naive Bayes (HNB) relaxes this assumption and extends the posterior probability calculation formula so that it also considers the mutual information of attributes. In the testing phase, a test instance is classified based on the learned behavior. In intrusion detection, the attribute values are very much dependent on each other. For example, suppose we want to check the number of failed logins over a period. Here, the content feature (number of failed logins) and the basic feature (duration) both affect the output value and also depend on each other's values in determining the output. Hence, HNB improves on NB performance for detecting intrusions. The authors reported a DoS detection rate of 99.60% and an overall detection rate of 93.72% using the KDD’99 dataset.

Gharaee and Hosseinvand [153] proposed a new feature selection based intrusion detection model (GF-SVM) to detect intrusions in the network. A feature selection approach is proposed in which a Genetic Algorithm (GA) and SVM are integrated to provide an optimal set of features. The authors made slight modifications to the fitness function of the GA. Instead of using the accuracy and number of features (NumF) as parameters of the fitness function, they used three parameters: TPR, FPR and NumF. Each parameter is multiplied by a weight chosen by the user. In each iteration of the GA, each chromosome is evaluated and the chromosomes with the highest classification accuracy (using SVM) are selected. The optimal features are used to filter the dataset, and a Least Squares Support Vector Machine (LSSVM) is trained and tested on the dataset with the selected features. They have considered 7 features for normal traffic and 6-14 features for different types of attacks. The results using the UNSW-NB15 dataset are as follows. It achieves an accuracy of 97.45% with 98.47% TPR and 0.04% FPR for detecting normal traffic. It achieves an accuracy of 79.19%-99.45% with TPR 67.31%-100% and FPR 0.01%-0.09% for detecting various types of attacks.

Bamakan et al. [161] proposed an intrusion detection framework which integrates time varying chaos particle swarm optimization (TVCPSO) with multiple criteria linear programming (MCLP) and with SVM, individually, for parameter tuning and feature selection. In order to speed up the PSO search for the optimum and to avoid local optima, the authors introduced the chaotic concept into PSO. The framework has been implemented using NSL-KDD’99. With feature selection, TVCPSO-MCLP provides an overall detection rate of 97.23%, a false alarm rate of 2.41% and an accuracy of 96.88%. With feature selection, TVCPSO-SVM provides 97.84% accuracy, a 97.03% detection rate and a 0.87% false alarm rate. Of the two algorithms, TVCPSO-SVM gives better results for DoS, Normal and Probe, with detection rates of 98.84%, 99.13% and 89.29% respectively, whereas TVCPSO-MCLP provides better detection rates for R2L and U2R, at 75.08% and 59.62% respectively.

Akashdeep et al. [159] provided an intrusion detection approach which uses an ANN and combines it with a proposed feature reduction method. All the features of the original dataset are first ranked according to the Information Gain method and the correlation method individually. Three feature subsets are built using each method, i.e., IG-1, IG-2, IG-3 and CR-1, CR-2, CR-3. The first subset under each category contains the features ranked 1-10, the second subset contains the features ranked 11-30, and the third subset contains the remaining features. The first and second subsets of each category are combined using the union operation, and the second subsets of each category are combined using intersection. The remaining subsets are ignored. A final subset of features (25 in total) is obtained by taking the union of the selected subsets, as shown in Figure 9. The reduced KDD’99 dataset with 25 features is used to train the ANN classifier. It provides an 86.6% detection rate (DR) for U2R, 93.8% DR for DoS, 91.9% DR for R2L and 89.8% DR for Probe attacks.

Ambusaidi et al. [160] proposed a flexible mutual information (MI) based feature selection (FMIFS) algorithm which can handle linear and nonlinear features efficiently. FMIFS is used to select features and remove the most irrelevant ones. The filtered dataset is then evaluated by a machine-learning based network intrusion detection technique, namely a Least Squares Support Vector Machine based IDS (LSSVM-IDS). The performance is measured over the KDD’99 dataset. It provides 99.46% DR, 0.13% FPR and 99.79% accuracy. When evaluated with NSL-KDD’99, it provides 98.76% DR, 0.28% FPR and 99.91% accuracy. It outperforms other methods such as MIFS and Flexible Linear Correlation Coefficient Based Feature Selection (FLCFS).

Various intrusion detection techniques based on a single classifier with a feature selection algorithm have been discussed. Applying feature selection improves the performance of classification. However, one needs to take care over which combination of feature selection and machine learning algorithm provides the best results. Feature selection also makes the classifier faster, since it operates over a selected set of attributes. However, the improvement in the classification results is small or moderate. The drawbacks associated with one classifier algorithm can be overcome by combining it with other classifier algorithm(s) to improve the classification result, as discussed in detail in the next section.
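To make the idea behind the mutual-information-based selectors in this subsection (MMIFS [133], FMIFS [160]) more concrete, the following is a minimal sketch of a generic greedy MI selection loop. It is an assumption-laden illustration, not the authors' algorithms: the scoring rule (relevance to the class minus `beta` times the mean redundancy with already selected features) is a common MIFS-style criterion swapped in here, and the function names, the `beta` parameter and the toy discretised columns are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def greedy_mi_selection(columns, labels, k, beta=1.0):
    """Greedily pick k feature names, starting from an empty set.

    Score = relevance to the class label minus beta * mean redundancy with
    the features already selected (a generic MIFS-style rule, assumed here).
    """
    selected = []
    remaining = set(columns)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(columns[name], labels)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(columns[name], columns[s]) for s in selected
            ) / len(selected)
            return relevance - beta * redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretised toy features (not KDD'99 data).
columns = {
    "flag":      [0, 0, 1, 1, 0, 1, 0, 1],
    "flag_copy": [0, 0, 1, 1, 0, 1, 0, 1],  # exact duplicate of "flag"
    "service":   [0, 1, 0, 1, 0, 1, 0, 1],
    "noise":     [0, 0, 0, 0, 1, 1, 1, 1],
}
labels = [0, 1, 2, 3, 0, 3, 0, 3]

chosen = greedy_mi_selection(columns, labels, k=2)
print(chosen)
```

In this toy example the duplicate column is as relevant as `flag` on its own, but the redundancy penalty removes its advantage once `flag` has been selected, so `service` is picked second. The selected columns would then be passed to whichever classifier is being evaluated.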

Fig. 9. Feature Reduction technique integrated with ANN.

C. Multiple Classifier With All Features
Kumar and Kumar [146] proposed a multi-objective genetic algorithm (MOGA) for misuse detection. It generates multiple base classifiers and later chooses a subset of classifiers to produce an ensemble. It executes in three phases. In the first phase, the approach aims to find the optimal Pareto front of non-dominated solutions. A set of base classifiers is produced as candidate solutions. The type of base classifier determines the values of the chromosomes and their size. Optimized real values of the classifiers are produced as the output of phase 1 and are used for the ensemble. In phase 2, the Pareto front of non-dominated solutions (of phase 1) is combined, instead of combining the entire population given as input by the user. In phase 3, MOGA combines the predictions given by the base classifiers to get the final output of the ensemble model. They have implemented the proposed approach using the KDD’99 and ISCX 2012 datasets. MLP is used as the base classifier. MLP trained using the proposed approach (AMGA2-MLP) is compared with MLP trained by the back propagation method, bagged MLP and boosted MLP. AMGA2-MLP provides better performance than the others. For the ISCX 2012 dataset, it achieves an average detection rate of 97.0% with 2.4% average FPR. It provides a 7% improvement in detection rate and a 96% reduction in FPR over MLP and its ensemble using bagging. The proposed approach can enhance the detection rate by 28% and reduce FPR by 51% over the results of MLP trained by the back propagation method using the KDD Cup dataset.

The next innovation was to integrate many classifiers together to improve the performance of intrusion detection techniques, applying all features of the data set.

Mukkamala et al. [124] proposed an ensemble approach which integrates an Artificial Neural Network (ANN), a Support Vector Machine (SVM) and Multivariate Adaptive Regression Splines (MARS). The Multi-Layer Feed Forward algorithm is very robust in detecting anomalies. However, ANN struggles to detect low-frequency attacks, which are scarce in the training data set, because ANN requires a large dataset for training. SVM can perform well even on a small data set, but it is much slower than other classifiers. MARS is a mathematical process for finding the optimal variable transformations and interactions. It can also find the complex data structures that often hide in high dimensional space. The ensemble of these classifiers is an attempt to reduce the mean squared error and increase the accuracy of classification. In the initial phase, a data preprocessor obtains data from DARPA 1998. In the next phase, each classifier is trained over the data set and individual learned classifiers are formed. In the final phase, a majority voting scheme is applied to make the final decision on a test instance. The detected class is the one on which the majority of the classifiers agree. The approach provides good accuracy for each category of attacks.

Zhang et al. [123] proposed a hierarchical hybrid (misuse and anomaly detection) framework based on multiple Radial Basis Function Neural Networks (RBF NN) and a clustering algorithm, arranged in serial and in parallel order. In the Serial Hierarchical Intrusion Detection System (SHIDS), the classifier is trained on the training dataset of KDD’99. In the next phase, traffic is passed through the classifier. Normal packets are verified and passed on, but intrusion packets are detected and saved in the database. A clustering algorithm is used to cluster the attacks based on their statistical distribution. When the number of records in a cluster exceeds some predefined threshold, a new classifier is trained on the new attack group and added to the last level of classifiers. SHIDS suffers from the problem of a single point of failure, since the classifiers depend on each other. The errors accumulate in this process. To overcome this problem, the authors proposed another approach, named the Parallel Intrusion Detection System (PHIDS). In PHIDS, an anomaly detection classifier is trained in the first level. In the second level, a misuse detection classifier is used to separate the intrusions into four groups based on their signature. In the third level, the output of the second level is passed to one of four classifiers, which are trained over separate attack data sets (DoS, Probe, U2R and R2L). These classifiers run in parallel. The detected records are stored in the database. If the number of attacks exceeds the threshold, the corresponding classifier is retrained for the new attacks. The parallel framework with multiple Neural Network classifiers works better than single classifiers and achieves around a 99% detection rate. RBF can model any nonlinear function using a single hidden layer and requires less computation time in comparison to BPL (Linear Back Propagation) NN.

Toosi and Kahani [147] proposed a Neural Fuzzy classifier combining fuzzy logic with a neural network, without using any feature selection technique. The proposed misuse detection system consists of two layers. In the first layer, five ANFIS (Adaptive Neuro-Fuzzy Inference System) modules

are trained to observe the intrusive activity. Each classifier is trained over a different sample of the training data set. Each training sample represents exactly one class (Normal, DoS, Probe, U2R or R2L). Each classifier acts as a signature based classifier. In the next phase, the output of the ANFIS modules is supplied to the fuzzy inference module, which maps the output from the Neuro-Fuzzy classifiers to the final output space and makes the final decision on whether the instance is normal or intrusive. The output is the classifier label of the first layer (the neuro-fuzzy output layer) whose output is closest to the value of the inference module. The authors also applied a Genetic Algorithm (GA) to optimize the structure of the fuzzy decision-making system. The authors report an overall detection rate of 95.3% with a false positive rate of 1.9% over the KDD’99 dataset. The classifier has adaptive learning capability. However, it does not work well for low-frequency attacks.

Khan et al. [126] proposed a hybrid detection approach, named CT SVM, which integrates the Support Vector Machine (SVM) with hierarchical clustering, using DARPA 1998. Hierarchical clustering is used to reduce the training time of the SVM and improve its efficiency of attack detection. SVM uses a hypothesis space of linear functions in a higher dimensional feature space. The trained model corresponds to the hyperplanes. The data points closest to the hyperplane are called support vectors. In the proposed model, hierarchical clustering is performed in the first phase over the training data set using the DGSOT algorithm. In each iteration, new nodes are added to the tree based on the learning process. After each iteration, an SVM is trained over the nodes of the tree to reduce the computational overhead. In the next iteration, the generated support vectors are passed to the clustering algorithm to control the tree growth. In this way, only support vector nodes grow. The process continues until some stopping criterion is met. The stopping criterion could be based on tree size, tree level or accuracy level. The training time of a single SVM is 17.34 hr; when it is integrated with the clustering algorithm, the time is reduced to 13.18 hr. The system does not work well for detecting low-frequency attacks.

Tong et al. [130] presented an Intrusion Detection System which integrates the Radial Basis Function Neural Network (RBF NN) with the Elman Network, using DARPA 1998. A Neural Network fails to remember past events. The Elman Network overcomes this weakness and provides this capability. It helps in allowing occasional misuse behavior and in detecting temporally co-located intrusions and collaborative events. The Elman Network has a set of context nodes. Each node takes its input from a hidden node and forwards its output to each hidden node of its hidden layer. Therefore, in the hybrid network, both input and hidden nodes activate the activation node, and the hidden nodes feed forward to activate the output nodes. The memory functionality introduced by the context nodes helps in remembering past events and correlating sequences of events. The value of the context nodes increases in each iteration and slowly decreases back to zero based on the threshold. The hybrid model is easily adaptable to new intrusions and requires less retraining time. This technique is very helpful in detecting temporally dispersed and collaborative attacks. The technique does not detect low-frequency attacks, and it achieves a 100% detection rate for Probe attacks.

Wang et al. [132] proposed an integrated hybrid intrusion detection approach, named FC-ANN, which consists of a Neural Network and a Fuzzy Clustering algorithm. In the first stage, the data set is divided into training and testing sets. Different training subsets are constructed using the fuzzy clustering approach. In the second stage, an Artificial Neural Network is trained for each training subset, and different ANN classifiers are formulated. The results are combined and a new ANN is trained over the results to reduce the errors. Fuzzy clustering is performed using the fuzzy c-means algorithm, which clusters the data points based on the membership grade. The cluster centers and membership grades are updated in each iteration to produce the optimized results. The ANN is used as the classifier to classify the traffic based on the learned behavior of attacks. The sigmoid activation function takes as input the product of weight and input value and passes the result to other neurons. The output value is compared with the target and the error is calculated. The backpropagation algorithm is used to update the system for new values. The technique produces an overall detection rate of 91.32% using KDD’99. The technique improves the detection rate of ANN for low-frequency attacks.

Feng et al. [142] proposed an intrusion detection system (CSVAC) that takes advantage of both misuse and anomaly detection algorithms, specifically the Support Vector Machine (SVM) and Clustering based Self-Organized Ant Colony Networks (CSOACN). SVM can learn from a small volume of data. CSOACN provides the power of adaptability. In a real-time Intrusion Detection System, it is essential that whenever new data points are added, the old model is updated immediately. This saves significant retraining time. In the first phase, the training data is normalized to remove the bias of some features over others. The next phase is the training phase, in which the SVM is trained over several training data subsets repeatedly. In the first iteration, initial hyperplanes are generated randomly. After that, in each iteration, the CSOACN clustering algorithm selects the points around the generated support vectors and forms clusters. These data points are used for training the SVM in the next iteration. Here, CSOACN not only learns the data points and identifies the outliers but also performs the data selection for training the SVM. At the end of the training phase, there are two learned classifiers: SVM and CSOACN. In the testing phase, the test instances are passed through the classifiers. An instance is flagged as anomalous only if both classifiers classify it as anomalous. CSOACN is used to determine the subclass of the attack. If the results differ from each other, the data item is labeled as “amphibolous”. It can be further used for analyzing the behavior of normal or intrusive data. The classifier requires 3.388 s of training time, but it does not perform well for Probe and U2R attack detection.

Elhag et al. [148] combined the genetic fuzzy system (GFS) with the pairwise learning (one to one mapping: OVO) architecture. The use of fuzzy sets creates a smoother borderline between rule sets, and pairwise learning improves the precision for the rare attack patterns. The combined misuse

detection approach provides high performance and is compared with a decision tree by the authors. In GFS, fuzzy association rules (FARC-HD) form the base classifier. Internally, FARC-HD applies a genetic algorithm to optimize its membership values and obtain compact rules. The OVO method converts the multiclass classification problem into binary sub-problems by forming all possible pairs of classes. A binary classifier is then trained for each pair on the corresponding subsample of data, ignoring the samples that do not belong to either class of the pair. Each of the trained models processes an instance, and the predictions of all classifiers are combined to obtain the output. They have used preference relations solved by a non-dominance criterion to combine the results. They have selected 5 labels per variable for the fuzzy sets. The minimum support considered is 0.05 and the minimum confidence is 0.8. The approach achieves 99% overall accuracy with a 97.77% attack detection rate and 0.191% false alarms. The accuracy is 98.05% for DoS attacks, 95.83% for Probe attacks, 87.54% for R2L and 65.38% for U2R. The approach achieves high accuracy for R2L and U2R and outperforms the layered approach discussed above.

Yassin et al. [149] proposed a hybrid detection approach that integrates K-Means clustering and the Naive Bayes classifier. The combination of anomaly detection and misuse detection is used to detect attacks that can bypass a system with only one type of detection mechanism. In the first phase, K-Means clustering is used as a pre-classification module to form the clusters. Each cluster represents a group of similar data. The entire data is labeled with the Kth cluster set. Afterwards, the Naive Bayes algorithm is used as a classification module to classify the data instances of the labeled clusters into attack and normal. The approach is implemented using the ISCX 2012 attack dataset. It achieves an accuracy of 99.8% with a 95.4% detection rate and 0.13% false alarms. The integration improves on Naive Bayes alone, which provides an 82.8% detection rate and 17.6% false alarms.

Various intrusion detection approaches based on multiclassifier algorithms have been discussed. The classifiers are trained to learn all features of the training dataset. There is an improvement in the accuracy of the system. However, the computational cost and complexity of the system are high. The detection rate is improved by multiple classifier algorithms, especially for low-frequency attacks such as U2R and R2L. Combining a bunch of classifiers may not always work better. The various possible combinations of multiple classifiers need to be cross-validated. The performance and speed can be improved by integrating the multiple classifier techniques with a suitable feature selection approach, discussed in the next section.

D. Multiple Classifier With Limited Features

Feature selection techniques are further used with Hybrid Classifiers for further improvement, especially for low-frequency attack detection. Chen et al. [128] proposed a misuse detection approach based on the Flexible Neural Tree (FNT) technique using DARPA 1998. The parameters of the FNT are optimized using the particle swarm optimization technique. A Genetic Algorithm is used to select the features for the algorithm. 12 features for DoS, 12 features for Probe, 8 features for U2R and 10 features for R2L are selected from the 41-variable input data set. The flexible neural tree works like a neural network except that it is flexible to expand. The tree expands based on the fitness function. The Particle Swarm Optimization (PSO) technique is used to optimize the parameters during this process. PSO conducts a search using a population of particles, each of which corresponds to an individual in an evolutionary algorithm. The algorithm works well for all categories of attack detection.

Xiang et al. [129] proposed a hybrid detection algorithm based on Decision Tree C4.5 and the Bayesian (AutoClass) clustering algorithm using the KDD’99 dataset. The technique operates in four stages. In the first stage, Decision Tree C4.5 is used to classify the training data set into three categories (DoS, Probe and others). The Decision Tree fails to separate the U2R and R2L attacks. In the second stage, Bayesian Clustering is used to separate the normal connections from the U2R and R2L connections. The clustering algorithm performs better than supervised algorithms in detecting low-frequency attacks. In this stage, four features are used for clustering: duration, service, src bytes and dst bytes. In this stage, 178 clusters are formed and 31 are declared as attacks. In the third stage, Decision Tree C4.5 is used again to separate the U2R from the R2L attacks. This is easier since normal connections are filtered out in the second stage. In this stage, only 41 features are used, as shown in Table II (Section III) and Table VII (Specific features). The last stage further identifies individual U2R and R2L attacks based on the given training data. This classification is effective for known attacks, and the results depend on the availability of sufficient labeled data. This technique performs poorly for R2L attack detection.

Lin et al. [136] proposed an algorithm that uses multiple machine learning algorithms, namely Support Vector Machine (SVM), Decision Tree (DT) and Simulated Annealing (SA), to detect intrusions in the context of misuse detection. It takes advantage of all three: SVM performs well for classifying intrusions, DT can produce rules and SA can converge toward global optima. In the first phase, the KDD’99 dataset is prepared for training and testing. In the next stage, SVM and SA are combined to select the best features. SVM maps the training data into a high-dimensional feature space. SVM is trained and tested with different possible feature sets, and finally the feature set with maximum accuracy is selected. During this process, SA is used to optimize two of the parameters (C and λ) used by SVM with the Gaussian radial basis function kernel. Here C is the parameter for the soft margin cost function, which controls the influence of each support vector. λ is the free parameter of the Gaussian radial basis function. In the next phase, DT is used to produce the rules for classification using the features selected in the previous phase. Information gain and entropy are two important measures used by this algorithm while building the decision tree. Here, SA is again used to optimize the two parameters of DT: pruning confidence factor (CF) and minimum cases (M). This technique performs well for detecting all types of attacks. In particular, the detection rate for DoS is 100%.
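The SA-based search over the two SVM parameters (C and λ) described for [136] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring function `toy_accuracy`, the search box `BOUNDS`, the starting point and the cooling schedule are all hypothetical choices made here so that the example runs on its own. In practice the score would be the validation accuracy of an RBF-kernel SVM trained with the candidate parameters (many SVM libraries call the kernel parameter `gamma` rather than λ).

```python
import math
import random

# Search box in log10 space: (min, max) for log10(C) and log10(lambda).
# These ranges are illustrative assumptions, not values from the paper.
BOUNDS = ((-3.0, 3.0), (-5.0, 1.0))


def clamp(value, low, high):
    return max(low, min(high, value))


def toy_accuracy(log_c, log_lam):
    """Hypothetical stand-in for the validation accuracy of an RBF SVM.

    It peaks at C = 10**1 and lambda = 10**-2. A real run would train and
    validate an SVM here instead.
    """
    return 1.0 - 0.02 * ((log_c - 1.0) ** 2 + (log_lam + 2.0) ** 2)


def anneal(score, start, n_iter=2000, t0=0.1, cooling=0.995, step=0.5, seed=0):
    """Maximize score(log_c, log_lam) with simulated annealing."""
    rng = random.Random(seed)
    current, current_score = start, score(*start)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(n_iter):
        # Propose a neighbour by perturbing both coordinates, kept in bounds.
        candidate = tuple(
            clamp(x + rng.gauss(0.0, step), low, high)
            for x, (low, high) in zip(current, BOUNDS)
        )
        candidate_score = score(*candidate)
        delta = candidate_score - current_score
        # Improvements are always accepted; a worse move is accepted with
        # probability exp(delta / T), so the search can leave local optima.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling  # geometric cooling schedule
    return best, best_score


(best_log_c, best_log_lam), best_acc = anneal(toy_accuracy, start=(-2.0, 0.0))
print(f"C = {10 ** best_log_c:.3g}, lambda = {10 ** best_log_lam:.3g}, "
      f"score = {best_acc:.3f}")
```

The search runs in log10 space because C and the kernel parameter usually span several orders of magnitude. The same loop could, under the same assumptions, be reused for the two decision tree parameters (CF and M) by swapping in a different scoring function and bounds.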

Casas et al. [26] proposed the Unsupervised Network Intrusion Detection System (UNIDS), which uses three algorithms, Sub-Space Clustering, DBSCAN and Evidence Accumulation for Ranking Outliers (EA4RO), for anomaly detection. Sub-Space Clustering is used to project X (a feature vector with n dimensions) into N k-dimensional sub-spaces (Xi). Each sub-space Xi is then partitioned using the DBSCAN clustering algorithm. The DBSCAN algorithm generates clusters of various densities. Anomalies are present in low-density areas. It outputs the set of clusters along with a set of outliers. Clustering in a low-dimensional space is much faster and more efficient. Outliers represent the different IP flows. The EA4RO algorithm is used to rank these flows based on their degree of abnormality (dissimilarity). The degree of dissimilarity is measured as the distance from the outlier to the centroid of the biggest cluster. Here, the Mahalanobis distance is used to measure the dissimilarity. The flows are ranked according to their dissimilarity measure. IP flows whose dissimilarity is greater than a predefined threshold are marked as anomalous. The algorithm works well for detecting all attacks. In particular, for Probe attacks the accuracy is 100% when tested with the KDD’99 dataset. It uses 9 features, as shown in Table VII.

Li et al. [138] proposed a hybrid approach based on Support Vector Machine (SVM), the ant colony algorithm and a clustering algorithm. There are three phases of classification. In the first phase, the training data set (KDD’99) is pre-processed by deleting the repeated data in the database. Data compaction is further achieved by using K-means clustering, which groups data items into different clusters based on their similarity measure. The intersection of the original data and the clusters remains. This reduces the size of the data and makes it more efficient for classification. SVM takes a very long time to process a huge database. In the next phase, training subsets are selected using the ant colony algorithm. The effectiveness of each subset is evaluated using the SVM classifier. In the third phase, feature selection is performed using the proposed Gradual Feature Removal (GFR) method. In the GFR method, features are removed from the feature set one by one and the accuracy of the classifier is examined for the resulting feature set. The process is carried out for each feature in the feature set and the influence of each feature is noted. The most balanced feature set is selected based on the accuracy of the classifier. In the next phase, SVM learns the behavior of the attack data supplied to it based on the selected feature set. The proposed technique does not perform well for detecting U2R attacks. However, the detection rate is greater than 90% for other attacks.

Kuang et al. [141] proposed a misuse detection system based on Kernel Principal Component Analysis (KPCA), Support Vector Machine (SVM) and Genetic Algorithm (GA). SVM provides good generalization capability over small-sample training data. GA performs the feature selection for classification. In the first phase, data preprocessing is carried out over the training dataset (KDD’99) to transform all data values into numeric form and normalize all values. The training data is divided into five subclasses based on the category. In the next phase, KPCA transforms the high-dimensional feature space into a low-dimensional eigenspace. Then GA performs the feature selection by iteratively selecting the populations and applying selection, crossover, and mutation to find the optimal subset until the stopping criterion is reached. In each iteration, the effectiveness of the feature set is determined by the classifier’s accuracy. Further, different SVM classifiers are trained with the selected parameters over the five training data subsets. The overall detection rate is 92.268% with a 1.025% false positive rate. However, the technique does not work well for low-frequency attacks.

Chandrasekhar and Raghuveer [139] proposed a hybrid model which uses the power of clustering (K-means), Fuzzy Neural Network (neuro-fuzzy) and Support Vector Machine (SVM) to identify intrusions. Processing a huge chunk of data introduces errors and affects the efficiency of the classifier. Hence, in the initial phase, the proposed framework divides the training data set into small subsets based on the similarity of the data items by using the K-means clustering algorithm. This reduces the sparsity in the data and makes it more suitable for the classifier. In the next phase, five neuro-fuzzy classifiers are trained over the five training subsets. It is difficult to determine the number of neurons and hidden layers in a Neural Network. The problem is overcome by introducing fuzzy logic into the neural network. It can manage imprecise, partial and vague information. It uses the backpropagation algorithm to find the input membership function. Each Neuro-Fuzzy Network outputs the set of features with the membership value. In the next phase, SVM is trained using the selected features for each of the training samples and support vectors are generated. In the testing phase, SVM classifies the test instances based on the generated hyperplanes. The algorithm performs well for detecting all types of attacks.

Horng et al. [134] proposed a hybrid framework based on hierarchical clustering and Support Vector Machine (SVM) using the KDD’99 dataset. To improve the detection rate of SVM and to reduce its training time, clustering is integrated with SVM. In the first phase, data transformation and scaling are performed to convert the non-continuous values into continuous form and normalize the values to the range [0, 1]. In the next phase, a Clustering Feature (CF) tree is constructed for each category of attack. A CF tree is a height-balanced tree with two parameters: branching factor (B) and radius threshold (T). It is a compact representation of the dataset. Insertion and deletion of a data point are the same as in a B+ tree. One entry in a leaf node represents one cluster. The hierarchical clustering organizes the training data in tree form, making the data more balanced. In the next phase, feature selection is carried out using the gradual feature removal method, in which each attribute is taken out once and the accuracy of the classifier is noted to examine the influence of the feature on the output. The output of this phase is 19, 17, 24 and 24 features for DoS, Probe, U2R and R2L respectively. In the next phase, SVM is trained over the selected features using the RBF (Radial Basis Function) kernel. In testing, a test instance is classified as per the learned behavior of the classifier. The system achieves 95.72% accuracy with a 0.7% false positive rate. The performance is good for DoS and Probe and poor for low-frequency attacks.

Gupta et al. [150] proposed a layered misuse detection approach for achieving high accuracy and high efficiency. The attack accuracy is achieved by using the Conditional Random

Fields (CRF) and high efficiency is achieved by using the layered approach. CRFs are probabilistic models which are used to model the conditional distribution over a set of random variables. CRFs have some advantages over Markov models, as they are undirected models used for sequence tagging, which makes them free from the observation bias and label bias. They have considered four attack groups, namely DoS, Probe, U2R and R2L. They have selected features separately for each class based upon the attack characteristics, without using a standard feature selection algorithm. Four different training sets are created, one for each of the layers, each of which comprises one attack class and the normal class. Each layer is trained for a specific attack category using the specific feature set for that attack class. In each iteration of the algorithm in the training phase, a CRF model is trained for each layer for a specific class. After training, there are four CRF models which are plugged in sequentially in such a way that connections labeled as normal are passed to the next layer; otherwise the connection is detected as the attack class (corresponding to the layer) and is blocked. The layered approach achieves a 98.60% detection rate for Probe with 0.91% false alarms, a 97.40% detection rate for DoS attacks with 0.07% false alarms, an 86.3% detection rate for U2R with 0.05 false alarms and a 29.60% detection rate for R2L with 0.350% false alarms.

Mamun et al. [151] proposed a deep packet inspection technique based on the use of Shannon’s entropy to identify the application flows that are part of encrypted traffic. The feature set is composed of the following features: the entropy of the entire payload, the sliding window or n-gram length, the entropy of the encoded payload and Bi-Entropy, considered for both the binary payload and the encoded payload. A logarithmic function is used to transform all the metrics, as it was found to improve the accuracy of the classifier. All the features are further pre-processed using a genetic algorithm. The genetic algorithm is integrated with the Least Square SVM (LSSVM), which is used as the training algorithm. In each iteration, the genetic algorithm is used to select the features based on the fitness function. The fitness function is calculated by weighting the true positive rate, the false positive rate and the total number of selected features. In total, 10 features are considered using GF-LSSVM. LSSVM is trained with the selected features to classify the traffic. It achieves a detection rate of 96.7% for encrypted traffic and 96.6% for unencrypted traffic, with a similar false alarm rate of ∼0.03%, using the ISCX dataset.

Chowdhury et al. [154] presented the use of machine learning for network intrusion detection. They have applied the combination of simulated annealing (SA) [162] and Support Vector Machine (SVM) [20] to improve the detection accuracy and reduce the false alarms. Misuse detection algorithms have the power to classify the normal and abnormal classes, given the attack behavior. In the proposed misuse detection algorithm, n features are first selected using the SA algorithm from a set of K features. The dataset N with the n selected features is then used to train the SVM. The trained model is used to classify future test instances. The experiments have been performed on the UNSW-NB dataset. From the dataset, 150,000 samples are selected randomly, comprising 75,000 normal and 75,000 anomaly samples. 70% of the total dataset is used for training and 30% is used for testing. The normal SVM provides an accuracy of 88.03%, whereas the proposed scheme provides 98.76% accuracy with three randomly selected features using the SA approach. It provides 0.09% FPR and 1.35% FNR, which are reasonably low.

Moustafa and Slay [155] proposed a hybrid feature selection approach to remove the irrelevant features and to integrate it with other machine learning algorithms for intrusion detection. The proposed NIDS architecture makes use of both anomaly and misuse detection approaches for intrusion detection. The NIDS first takes its input from the dataset (UNSW-NB15 or NSL KDD’99). It then calculates the center points of the attribute values. A center point, or mode, is the most frequent value of the attribute. The center points for all the features are used as input to the association rule mining algorithm (Apriori) to reduce its processing time. The association rule mining finds the correlation between two or more features/attributes and identifies the highly ranked features. The dataset is filtered based on the selected features and given as input to the detection engine. Here, three algorithms, namely Expectation-Maximization (EM) clustering, Naive Bayes (NB) and Logistic Regression (LR), have been used. The results using UNSW-NB15 are as follows: EM provides an accuracy of 77.2% with 13.1% FAR, LR provides an accuracy of 83.0% with 14.2% FAR and NB provides 79.5% accuracy with 23.5% FAR.

Aburomman and Reaz [104] proposed an ensemble ML-based intrusion detection approach in which six k-NN and six SVM classifiers are trained over KDD’99. The results of all 12 trained models are combined using three different approaches. In the first, PSO generates weights which are used by the Weighted Majority Voting algorithm (WMA) to combine the results of the trained models. In the second, the behavior parameters of PSO are tuned using Local Unimodal Sampling (LUS) and the rest is the same as in the first approach. In the third, WMA is used to fuse the results of the classifiers. In all three ways, the results are produced and compared. LUS-PSO-WMA (the second way of combining results) provides the best accuracy among all. It provides 83.6878% accuracy for detecting normal traces, 96.8576% accuracy for Probe attacks, 98.8534% for DoS attacks, 99.8029% for U2R attacks and 84.7615% for R2L attacks, as shown in Table V.

Various intrusion detection techniques based on multiple classifier algorithms integrated with a suitable feature selection approach have been discussed. Applying feature selection improves the speed of classification and, in some cases, also improves the detection rate. However, the time complexity is not much reduced. The overall complexity is still high, as these techniques consist of multiple classifier and feature selection algorithms. This class of algorithms can make use of parallel programming techniques to reduce the training time. Also, obtaining good classification results for low-frequency attacks is still a challenge.

VI. CLASSIFICATION OF TECHNIQUES FOR A SPECIFIC ATTACK DETECTION

In this Section, we provide the classification of various techniques for different attacks based on their performance. It helps the readers in choosing a particular technique for

TABLE VIII
CLASSIFICATION OF TECHNIQUES FOR DoS ATTACKS
Cited techniques: [136], [121], [144], [124], [132], [134], [122], [61], [128], [147], [121], [143], [129], [120], [123], [133], [126], [26], [131], [142], [138], [130], [139], [135], [141]

detecting the specific attack. The detailed description of these techniques is presented earlier in Section V, and the features used with the techniques are also shown in Table VII. Most of the approaches use KDD’99 as a common dataset for evaluation. Hence, these are considered for comparison.

A. Detection of Denial of Service Attacks

The single classifier approach is much easier to implement. Decision Tree (DT) performs well in this category with a detection rate of 97.24%. However, the detection rate improves to 99.43% when applied with 12 features. CANN [144] achieves 99.99% accuracy with 6 features. However, it performs very well mainly in detecting the Land attack; for other DoS attacks, it needs to improve its feature set (the reason is explained in the summary of observations in Section VII-B). ANN, when combined with SVM and MARS [124], improves its performance and provides a 99.97% classification rate. The performance of ANN [140] is also improved when combined with Fuzzy Clustering (FC-ANN) [132], with a 99.91% detection rate. Fuzzy Logic with ANN (FC-ANN) provides a 99.5% detection rate. As for SVM [120], its detection rate also improves when it is combined with a clustering approach (CT SVM) [126]: the detection rate was 91.6% earlier and improved to 97.35%. There is also an improvement if we combine SVM with ant colony networks (CSVAC) [142], where the detection rate improves to 94.84%. All the above-mentioned hybrid techniques consider 41 features, which affects the computation time. SVM in combination with DT and SA [136] provides the highest detection rate of 100% with 23 features, and when combined with hierarchical clustering it achieves 99.5% detection with 19 features. The techniques mainly use the KDD’99 data set. The classification of various other techniques with their detection rates for DoS attack detection is shown in Table VIII.

B. Detection of Scanning (Probe) Attacks

Single classifiers with all 41 features do not perform well for Probe attack detection. None of the classifiers achieves 90% in this category. Decision Tree [121] with 41 features provides a detection rate of just 77.92%, but when applied with selected features (12 features) [61], the detection rate improves to 98.868%. Similarly, SVM [120] provides a very poor detection rate of 36.65%, but when the MMIFS [133] feature selection technique is applied, which uses only 8 features, its detection rate improves to 86.46%. Further, integrating the single classifiers provides a significant improvement in detection. ANN [122] achieves a 71.63% detection rate, but when it is integrated with the Elman network [130] it provides 100% detection. The integration with SVM and MARS [124] also achieves a 99.85% detection rate. Results are further improved in terms of detection rate and computational time by hybrid classifiers with feature selection. Here the combination of Subspace clustering, DBSCAN and EA4RO [26] achieves a 100% detection rate with 9 features. The FNT technique with PSO and GA [128] also achieves a 98.39% detection rate with 12 features. SVM integrated with DT and SA [136] performs far better, with a detection rate of 98.35% with 23 features, in comparison to the SVM-only technique (36.65%) with 41 features. The performance of the techniques is shown in Table IX.

C. Detection of User to Root Attacks

We have classified techniques based on their detection rate for U2R attacks. We have discussed those techniques which perform well.

In the case of U2R attacks, the performance of single classifiers is very poor. However, integration with other classifiers improves their performance. The FNT approach with PSO and GA [128] performs very well and achieves the highest detection rate of 99.7% with 12 features (refer to Table VII in Section V for the features). The intrusion detection technique which uses multiple Neural Network classifiers [123] provides a 99.7% detection rate for a specific type of U2R attack, i.e., guess password. There is a significant improvement in Neural Network [140] performance, as its detection rate improves to 93.18% when it is combined with the fuzzy clustering

TABLE IX
CLASSIFICATION OF TECHNIQUES FOR SCANNING ATTACKS
Cited techniques: [121], [144], [130], [26], [131], [61], [124], [141], [128], [121], [143], [123], [136], [126], [122], [133], [134], [120], [147], [142], [129], [132], [138], [139], [135]

TABLE X
CLASSIFICATION OF TECHNIQUES FOR U2R ATTACKS
Cited techniques: [131], [133], [130], [120], [123], [144], [26], [121], [132], [61], [129], [121], [135], [143], [124], [122], [141], [142], [136], [147], [138], [126], [134], [130], [139]

approach (FC-ANN) with no change in the set of features. Earlier, the detection rate of ANN was 0% for U2R attacks. The integration of ANN with SVM and MARS [124] also shows an improvement with a detection rate of 76%, but this is not acceptable. There is very little improvement in SVM [120] (12%) when it is integrated with a clustering approach (CT SVM, 17.23%) [126], which is also not acceptable. Hybrid techniques with feature selection provide a good detection rate for U2R attacks. The performance of other techniques is shown in Table X.

D. Detection of Remote to User Attacks

We have classified techniques based on their detection rate for R2L attacks. The techniques which perform well are discussed.

In the case of R2L attacks, the performance of single classifiers is very poor. However, integration with other classifiers improves their performance. Here, the ensemble of SVM, ANN and MARS [124] provides a 100% detection rate as claimed by the authors, which is a significant improvement over ANN [122] (26.68% detection rate). However, they have not mentioned which specific R2L attacks are detected. Hybrid classifiers with feature selection also perform very well. Here, the FNT approach with PSO and GA [128] provides a detection rate of 99.09% with 12 features. FPSO achieves a detection rate of 97.22% with 16 features. The integration of SVM, DT and SA [136] achieves a detection rate of 90.67% with 23 features. The performance of other techniques is shown in Table XI.

Multiple classifiers with feature selection perform better for R2L than for U2R attack detection. This is because (1) most of the hybrid classifiers filter the data before training the classifier by performing clustering on the data, which groups data items into groups

TABLE XI
CLASSIFICATION OF TECHNIQUES FOR R2L ATTACKS
Cited techniques: [128], [122], [133], [123], [135], [120], [144], [136], [123], [121], [61], [138], [142], [131], [143], [26], [132], [121], [129], [126], [134], [147], [141], [130], [139]

based on their similarity measure. This brings balance to the data, which is crucial for machine learning algorithms. (2) Multiple classifiers are more adaptable than single classifiers and hence can learn new attack behavior efficiently. However, there are still various difficulties associated with detecting low-frequency attacks, as discussed in Section VIII.

Existing intrusion detection approaches based on machine learning have been thoroughly analyzed with respect to individual attack categories. Limitations associated with the approaches for each category are discussed along with viable solutions. No single machine learning algorithm can help in detecting all types of attacks. Hence, the use of a specific algorithm (misuse, anomaly or hybrid) is recommended for detecting a specific set of attacks.

VII. PERFORMANCE ANALYSIS OF DIFFERENT MACHINE LEARNING ALGORITHMS IN INTRUSIONS DETECTION

We have carried out a critical performance analysis of various machine learning techniques in detecting all types of attacks. The analysis is carried out with respect to each category of machine learning techniques described earlier in Section V. The results reported by researchers have been analyzed and compared. Based on our observation, we found that most of the work has been validated using KDD’99. Hence, the performance analysis has been carried out for techniques based on KDD’99. The best performing techniques have been discussed for each attack category, and the limitations of each category of techniques and the solutions applied to overcome the limitations are also provided. We provide the overall conclusion at the end of each category. This gives readers a clear view of the limitations of intrusion detection techniques and of why integration and feature selection are applied.

A. Performance Analysis of Single Classifier Algorithms With 41 Features

Among all four classifications, single classifiers with all features perform very poorly for the detection of various attacks. We have mainly considered five standard machine learning classifiers: Decision Tree (C4.5) [121], Neural Network [140], Naive Bayes [121], Support Vector Machine [120] and Fuzzy Association rules [131], as shown in Figure 10. Decision Tree [121] gives a better detection rate (97.24%) for DoS attacks than the other four classifiers. Neural Network [140] achieves the highest detection rate (90.95%) for Probe attacks, whereas Fuzzy Association rules [131] provide far better results for U2R attacks with a detection rate of 68.6%, as they use a data reduction technique. However, Fuzzy rules do not work well for DoS attack detection, reaching only a 78.9% detection rate. The SVM [120] classifier has a highest detection rate of around 22% for U2R attacks, which is not acceptable in an environment where security is an important aspect. Although KDD’99 contains a large number of DoS and Probe connection records, even then single classifiers are not able to detect the behavior of these attacks. The reasons for the degraded performance can be explained in terms of two major aspects.

(i) Low Detection Rate & High Computational Cost: The classification task becomes time-consuming and hectic when considering all features of the data. It increases the computation power, storage and error rate, and it badly affects the performance of the classifier. The classifier suffers from the problem of the “Curse of Dimensionality”. The problem can be explained as follows: when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. This sparsity could be problematic for machine learning algorithms because of the statistical significance of the training dataset. In high-dimensional data, all objects appear to be sparse and dissimilar in many ways; data organizing strategies may not work efficiently

in such a sparse data organization. Hence there is a dire need to find the optimal features for analyzing the behavior of an attack and making a classifier learn that behavior efficiently and accurately. How many features the algorithm should employ is one of the most important aspects that a researcher must take into consideration before performing the classification.

(ii) Algorithmic Drawbacks: The algorithmic drawbacks of each classifier are discussed in detail in Section IV-A.

The above limitations of single classifiers with all features make them less suitable for attack detection. Some of the problems are computational complexity, low adaptability, sensitivity to input change, sensitivity to the choice of the kernel function and its parameters, the number of training variables, algorithmic complexity and overfitting. This causes an average detection rate for high-frequency attacks (DoS and Probe) and a poor detection rate for low-frequency attacks (U2R, R2L).

Fig. 10. Performance comparison of single classifiers with 41 features for different Attacks.

Various solutions for overcoming the limitations of single classifiers are as follows:
i) Incorporating a Feature Selection/Feature Extraction method before classification, using machine learning algorithms such as Genetic Algorithms, SVM and Clustering, or statistical methods such as Particle Swarm Optimization, Principal Component Analysis, Gradual Feature Removal, and Mutual Information based Feature Selection.
ii) Integrating multiple classifiers and combining their results based on some criterion such as majority voting (Ensemble of classifiers).
iii) Integrating multiple classifiers where the output of one classifier is fed as input to another classifier to refine the previous classification results (Stacking of Classifiers). This also reduces the false alarms generated by a single classifier.
iv) Integrating feature selection/feature extraction approaches with multiple classifiers.

Observations based on analysis are as follows:
(i) Single classifiers without feature selection do not give good performance for attack detection.
(ii) No particular classifier provides good results for detecting all categories of attacks. Hence we can’t say that a particular classifier is the best for attack detection.
(iii) The data reduction technique works well for refining the results for low-frequency attacks, as can be seen from the results of fuzzy rules for U2R attacks. The same may not work for all other attacks. This means that many other factors affect the detection accuracy of a classifier for a particular attack, such as the method of detection, sufficient features for learning the behavior of an attack, the training dataset, data refinement and feature extraction.

B. Performance Analysis of Single Classifier Algorithms With Limited Features

Fig. 11. Performance comparison of single classifiers with feature selection.

There is an improvement after feature selection with single classifiers as shown in Figure 11. For example, Decision tree [61] (C4.5) performs better with 12 features for DoS and Probe attack detection. However, because of its inability to detect low-frequency attacks, the author has experimented only with DoS and Probe attacks, obtaining detection rates of 99.43% and 98.868% respectively. The 12 features are listed in Table VII in Section V. Hidden Naive Bayes [143] provides similar results for DoS attacks with a detection rate of 99.6%. It relaxes the assumption of independence between variables and provides a significant improvement over Naive Bayes [121]: it achieves a detection rate of 99.6% for DoS attacks compared with 96.65% for Naive Bayes. The author has not provided separate detection results for other attacks, so it is hard to say how it performs for them. MMIFS with SVM [133] provides average results for all four types of attacks. However, it provides better results for U2R and R2L attacks, with detection rates of 30.7% and 84.85%. The features employed by the MMIFS technique differ between attacks: it considers only 8 features for DoS, 12 features for Probe, 8 features for U2R and 10 features for R2L attacks. The features are listed in Table VII in Section V. The reasons for the degrading performance can be explained in terms of two major aspects.

(i) Low Detection Rate: Even after selecting features, a technique may not perform well for attack detection since the selected features are not good enough to learn the behavior of an attack. Selecting and applying a suitable feature selection method is still an open challenge. The selection of an appropriate feature set is another aspect of improving the performance of machine learning algorithms. For example,

MMIFS [133] does not consider feature P7 (land), which is very important for detecting land (DoS) attacks. It also does not take features such as P10 (hot), P11 (num failed logins), P14 (root shell) and P16 (num root) into consideration, which are very important for the detection of U2R and R2L attacks.

(ii) Algorithmic Drawback: Even after employing feature selection, the detection results may not be good. This could be because of the algorithmic drawbacks of a classifier. The drawbacks of single classifiers, along with their characteristics, are mentioned in Section IV.

The limitations of single classifiers with feature selection can be addressed by considering the following factors:
(i) Each feature in the feature set should be relevant enough, and dissimilar enough from the others, to learn the behavior of an attack. For example, the P11 (num failed login), P14 (root shell) and P10 (hot) features are very important in the detection of U2R attacks and should be in the feature set for U2R attack detection. There are various feature selection methods such as Gradual Feature Removal, Information Gain and Chi-Square. They may not provide accurate features for all attacks, so the researcher should not depend solely on the output of these methods.
(ii) Detection results and training time can be improved by integrating classifiers so that one classifier can overcome the drawbacks of another. For example, Clustering with SVM (CT SVM) [126] improves the training time and detection rate of the SVM classifier [120], as shown in Figure 10 and Figure 11. Clustering preprocesses the data and groups them into clusters based on their characteristics; the SVM classifier is then trained for one cluster. This improves the training time and detection accuracy of the SVM classifier.

Observations based on analysis are as follows:
(i) Employing feature selection does not necessarily improve the detection results. The researcher should also focus on the appropriateness of the method employed for feature selection. Feature selection will, however, reduce the computational time of a classifier and its storage overhead. For example, MMIFS [133] with SVM with 8 features provides a detection rate of 78.69% for DoS attacks, whereas SVM [120] without feature selection achieves a detection rate of 91.6% for DoS attacks.
(ii) A feature set that works well for analyzing the behavior of a particular attack may not work well for other attacks. There is a need to identify the behavior of each attack and to design a feature set for each attack accordingly.
(iii) If a classifier is said to achieve 99.99% accuracy for DoS attacks, or for any other category, that figure should be the average over all types of DoS attacks such as Backdoor, Land, neptune, Pod, Smurf, teardrop, etc. If a classifier achieves 99.99% accuracy for only one or two types of DoS attacks, it would be inappropriate to say that it achieves 99.99% accuracy for that category of attack. For example, in the Cluster Center and Nearest Neighbor (CANN) [144] approach, a one-dimensional distance based feature is used to represent each data sample. The distance is the sum of two distances: the first is the distance between the data sample and its cluster center, and the second is the distance between the data sample and its neighbor in the same cluster. Data samples are classified using a k-Nearest Neighbor classifier (k-NN). This approach falls in the category of single classifier with limited features in our categorization of machine learning algorithms. The approach claims to achieve 99.99% accuracy for DoS attacks and 99.98% accuracy for Probe attacks. It utilizes only 6 features, namely P7 (land), P9 (urgent), P11 (num failed logins), P18 (num shells), P21 (is host login) and P20 (num outbound cmds).

As per our analysis of various research papers, out of these 6 features only feature 7 (land) is the most relevant feature for detecting the Land attack. For this attack, CANN may achieve 99.99% accuracy. However, for other attacks such as teardrop, feature 8 (wrong fragment) is the most relevant. To detect the Back attack, features (P5: src bytes, P6: dst bytes) [163] and features (P10: hot, P13: num compromised) [164] are the most relevant, calculated based on Rough Set theory and the Information Gain measure respectively, and hence should be included in the feature set. Olusola et al. [163] claim that feature P20 (outbound command count for FTP session) and feature P21 (is host login) are the least relevant features for intrusion detection. The claim is based on the Dependency Ratio calculated for each feature using rough set theory, which is 0.000 for both features (P20, P21) for all categories of DoS attacks. This suggests that CANN may work very well for the Land attack but has to improve its feature set for detecting other attacks.

C. Performance Analysis of Multiple Classifier Algorithms With All Features

We observed that multiple classifiers perform better than single classifiers in terms of detection rates, as shown in Figure 10 and Figure 12. Among multiple classifiers without feature selection, the Ensemble of ANN, SVM and MARS [124] achieves the highest detection rate of 99.97% for DoS attacks. FC-ANN [132] achieves a good detection rate of 99.91% for DoS attacks and 93.18% for U2R attacks using the KDD’99 dataset. Neuro-Fuzzy [147] also achieves a good detection rate for DoS (99.5%) using the same KDD’99 dataset. ANN with Elman network [130], evaluated over DARPA 98, achieves the highest detection rate of 100% for Probe attacks. FC-ANN [132] provides the highest detection rate of 93.18% for U2R attacks over the KDD’99 dataset. CSVAC [142] achieves a good detection rate of 87% over the KDD’99 dataset for R2L attacks. Multiple NN [123] achieves a 99.7% detection rate for a specific U2R attack, i.e., the Guess password attack. However, it has not been considered in the comparison as it does not provide categorical results. The comparisons are performed with other techniques in this category. On average, multiple classifiers with all features give good or average results for high-frequency attacks (DoS and Probe), but they yield average or poor detection rates for low-frequency attacks (U2R and R2L). They also suffer from problems such as slow response time, high error rate and larger storage requirements. For example, the Neuro-Fuzzy [147] technique uses the KDD’99 dataset and provides a 99.5% detection rate for DoS, 84.1% for Probe, 41.1% for U2R and 31.5% for R2L attacks.
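To make the combination step of a multiple classifier concrete, the following minimal Python sketch assumes the simplest rule named in Section VII-A, majority voting; the cited ensembles [124], [132] may combine their members differently. The function names majority_vote and detection_rate and the toy labels are illustrative assumptions, not taken from any of the surveyed works, and the detection rate is taken here to mean the fraction of records of a category that are labeled as that category.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label lists from several classifiers, one vote per classifier."""
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every classifier must label the same records")
    combined = []
    for votes in zip(*predictions):
        # most_common(1) yields the label with the most votes; on a tie,
        # Counter keeps the label that was encountered first.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def detection_rate(y_true, y_pred, category):
    """Fraction of records of `category` that were labeled as `category`."""
    total = sum(1 for t in y_true if t == category)
    if total == 0:
        return None  # category absent, so the rate is undefined
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p == category)
    return hits / total


# Hypothetical toy labels (not KDD'99 results): ground truth and three classifiers.
y_true = ["dos", "dos", "probe", "u2r", "normal"]
clf_a = ["dos", "dos", "probe", "normal", "normal"]
clf_b = ["dos", "probe", "probe", "u2r", "normal"]
clf_c = ["dos", "dos", "normal", "u2r", "normal"]

ensemble = majority_vote([clf_a, clf_b, clf_c])
for category in ("dos", "probe", "u2r"):
    print(category, detection_rate(y_true, ensemble, category))
```

In this toy case each member mislabels one record while the vote recovers all of them; on real data the gain depends on how independent the members' errors are, and reporting the rate per category keeps a weak U2R or R2L result from being hidden behind a strong DoS result.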

Fig. 12. Performance comparison of multiple classifiers with all features.

The limitations of this category of classifiers can be explained as follows.

(i) High Computational Cost: There are two reasons for the high computational cost. (a) The number of features is 41, which increases the computational cost of the classifier. It also causes the curse of dimensionality, which brings sparsity into the data as described in Section VII-A. This may lower the performance of the classifier for the detection of U2R and R2L attacks, since records of these attacks are already limited in the KDD’99 training and test datasets. For DoS and Probe attack detection most of the multiple classifiers achieve 90-99%, but the performance is not good for U2R and R2L attacks. (b) Multiple classifiers are slower than single classifiers in terms of training time, because the data has to be processed by multiple classifiers to arrive at a common conclusion, and sometimes the serial execution of classifiers, where the output of one classifier becomes the input to the next, makes them slower. For example, the multiple classifier approach that combines Bayesian Clustering and Decision Tree (C4.5) [129] works in three stages. In the first stage, it classifies DoS, Probe and others (normal, U2R and R2L) using Decision Tree C4.5. In the second stage, it separates normal connections from U2R and R2L using Bayesian Clustering, and in the third stage it separates U2R from R2L, again using Decision Tree C4.5. This serial processing makes it much slower when processing huge amounts of data.

(ii) Complexity: Integrating many classifiers together makes the system complex. For example, in ANN with Elman Network [130], context nodes are additional nodes added to the neural network to keep a memory of past events. This increases the processing time of the neural network and brings complexity into the network. An activation node of the neural network receives input from the output of the previous hidden nodes or input variables, and from the context node (Elman network), which provides the data of past events. The integration improves the detection rate and reduces retraining time, but it increases complexity and does not overcome the inability of neural networks to detect low-frequency attacks such as U2R and R2L, as shown in Figure 12.

(iii) Average or Poor Detection Results for Low-Frequency Attacks: Some of the techniques provide average performance, whereas most of them perform poorly for the detection of low-frequency attacks. The reasons are described in detail in Section VIII.

The performance of multiple classifiers with all features can be improved by:
(i) Parallel processing of the multiple classifier modules: Parallel processing of modules will reduce their computational time and increase their efficiency of attack detection. For example, in the FC-ANN [132] approach, training data subsets are trained by different ANN classifiers, which makes it very slow since the training data subsets are processed one by one and then the fuzzy aggregation module integrates the results. The training time of the classifier is 86625s (24h), which is quite high. Another example is the multiple classifier approach that uses unsupervised techniques (subspace clustering approach, DBSCAN and EA4RO algorithm) [26] and partitions the feature space into small independent subspaces. DBSCAN is performed over each subspace to detect outliers one by one. This improves the detection rate but incurs a very high computational cost.
(ii) Integrating feature selection techniques with the multiple classifiers, which will increase their accuracy of attack detection and reduce computational time.

Observations based on analysis are as follows:
(i) The detection rate of single classifiers is improved for most of the attacks by integrating them with other machine learning techniques. For example, SVM [120] provides a 91.6% detection rate for DoS attacks, 36.65% for Probe attacks, 12% for U2R attacks and 22% for R2L attacks. When it is integrated with Ant Colony networks (used for clustering), in the approach named CSVAC [142], it achieves detection rates of 94.84% for DoS attacks, 53.25% for Probe attacks, 44.23% for U2R attacks and 87% for R2L attacks. Both techniques are evaluated over the KDD’99 dataset with 41 features. Hence multiple classifiers perform better than single classifiers.
(ii) Most of the multiple classifiers significantly improve on the inefficiency of single classifiers in detecting U2R and R2L attacks. This is because those multiple classifiers filter the data before sending the training data to the classifier. Clustering is one of the most popular techniques for sampling the data based on their behavior. It decreases the sparsity in the data, as similar data points are grouped in the same cluster. Each classifier is trained on a particular cluster to learn the behavior of the data points in that cluster. A cluster may refer to normal, DoS, Probe, U2R or R2L data records.
(iii) Integrating multiple classifiers may increase the computational time and complexity of a classifier, making it more difficult to understand and design.
(iv) Integrating a classifier with another classifier does not mean that the detection rate will improve for all attacks. For example, ANN with all features [122] achieves a 71.63% detection rate for Probe attacks; when it is integrated with Fuzzy Clustering (FC-ANN) [132], its detection rate goes down to 48.12% for the same attack (Probe). Thus Fuzzy Clustering with ANN does not work well for Probe attack detection. However, for other attacks, the detection rate increases, as can be seen from Figure 10 and Figure 12 for FC-ANN. The results have been obtained over the KDD’99 dataset. This is because FC-ANN divides the training data into smaller subsets by using Fuzzy

Clustering and then trains the ANN classifier on each individual subset. This may result in detection errors for those attacks where the training subset does not contain a sufficient number of attack connections.

D. Performance Analysis of Multiple Classifier Algorithms With Limited Features

Fig. 13. Performance comparison of multiple classifiers with feature selection.

We observed that there is an improvement after feature selection with multiple classifiers as shown in Figure 13. For example, let us consider approach A (CSVAC) [142], which combines SVM with an ant colony network and considers all 41 features over KDD’99. Another approach, B [138], also combines SVM with an ant colony network algorithm and uses 19 features obtained using GFS (Gradual Feature Removal) over KDD’99. Approach B, however, also refines the data using k-means in the first stage. Approach B achieves a 97.67% detection rate for DoS attacks, 91.45% for Probe attacks, 53.84% for U2R attacks and 90.34% for R2L attacks, whereas approach A achieves a 94.84% detection rate for DoS attacks, 53.25% for Probe attacks, 44.23% for U2R attacks and 87% for R2L attacks. Approach B, with feature selection and data refinement, performs better than approach A without feature selection. Most researchers have incorporated a feature selection strategy, i.e., filter-based, wrapper-based or hybrid methods, with multiple classifiers. These give good or average results for detecting both high-frequency attacks (DoS and Probe) and low-frequency attacks (U2R and R2L). For example, the technique based on SVM, DT and SA with 23 features (listed in Table VII) uses the KDD’99 dataset and achieves a 100% detection rate for DoS attacks, 98.35% for Probe attacks, 80% for U2R and 90.67% for R2L attacks. The limitations of this category of classifiers can be explained as follows.

(i) High computational time: Although feature selection saves significant computation time, the improvement is still limited because the multiple classifier modules are executed in serial order.

(ii) Complexity: These are much more complex than single classifiers in design and operation.

(iii) Average or good detection results for Low-Frequency Attacks: A few of the techniques provide good performance, whereas most of them yield average performance. The detection results are not improved for all techniques when they are integrated with other classifiers. Low-frequency attacks such as User to Root (U2R) and Remote to Local (R2L) have characteristics similar to normal data. Moreover, there are various other reasons for the difficulty in detecting these attacks, which are explained in detail in Section VIII.

The performance of multiple classifiers with feature selection can be improved by:
(i) Parallel processing of multiple modules, which is necessary in order to provide the detection results on time. This is very important in the case of a real-time Intrusion Detection System, where malicious activity has to be notified as early as possible, before it causes much damage to the system. For example, applications such as online banking, online shopping and cloud applications are very sensitive to malicious activity, and a small malicious action can have drastic results. This is explained with an example of a classifier in Section VII-C.
(ii) Selecting features that are sufficient to analyze the behavior of an attack. Feature selection methods provide good results, but these results should be further improved by researchers based on the signature of an attack. We have discussed the important features for each attack in Section III.

Observations based on analysis are as follows:
(i) Data refinement and dividing the data into clusters improve the accuracy of detection. Approach B [138] performs extensive data refinement using k-means and ACO before passing the training data to SVM. Approach A performs data sampling. It selects random data points in the first iteration and creates support vectors for the selected data points. In the next iterations, only those data points that are closer to the generated support vectors are selected, and the classifier is retrained on the new data points.
(ii) If a technique achieves the highest detection rate for a particular attack, it may not achieve the same performance for detecting other attacks. For example, the Hierarchical Clustering with SVM technique [134] considers 19 features and uses the KDD’99 dataset for detecting all attacks (DoS, Probe, U2R and R2L), and achieves a detection rate of 99.5% for DoS, 97.5% for Probe, 19.7% for U2R and 28.8% for R2L. This means we should not use the same technique to detect all attacks, i.e., if a technique is good for DoS and Probe, it may not be good for U2R and R2L attacks.
(iii) If an optimal subset of features is good for detecting one attack, it may not be good for detecting other attacks. Most authors consider the same optimal subset of features for detecting all four categories of attacks, and their techniques show major variation in detection rate across different attacks. Since the behavior of one attack is different from that of other attacks, there is a need to train the classifier for different categories of attacks using

the different optimal subset chosen for each attack. For example, FPSO [135] uses the same subset of 16 features {P1, P5, P6, P8, P9, P10, P11, P13, P16, P17, P18, P19, P23, P24, P32, P33} for detecting DoS, Probe, U2R and R2L. However, it yields good detection results for DoS (97.22%) and R2L (97.22%) and average results for Probe (77.77%) and U2R (69.44%). Here features P3 (service), P7 (land) and P14 (root shell), which are among the most important features for attack detection, are not considered.
(iv) Multiple classifiers with selected features perform better than the other categories of classifiers.

An exhaustive literature study of intrusion detection techniques with respect to each category of classification techniques is carried out and critically analyzed. Various inferences are drawn by comparing results reported by researchers. The observations are analyzed thoroughly. The limitations associated with each category of techniques are discussed and viable solutions to overcome the limitations are provided. At the end, conclusions drawn with respect to each classification are mentioned to provide scope for further improvement.

VIII. ISSUES IN DETECTING LOW-FREQUENCY ATTACKS

Machine learning algorithms work on the statistics of the data obtained from the attack dataset. DoS and Probe attacks can be detected easily by careful examination of the statistics of the connections at the vulnerable host machine, whereas it is hard to detect low-frequency attacks such as U2R and R2L even by careful examination of the connection statistics in the KDD’99 dataset. This is because of the following reasons:

(i) The connection statistics of low-frequency attacks are very similar to those of normal connections.

(ii) There is similarity in the behavior of U2R and R2L connection records. Hence, it is also difficult to differentiate U2R and R2L attacks from each other. In fact, a U2R attack is one of the variations of an R2L attack. In an R2L attack, the attacker does not have local access to the machine. To obtain root privileges, he first has to gain access to a normal user’s account by using various account hijacking exploits. Then, after logging in as a normal user, he can launch further exploits to gain root privileges. In a U2R attack, by contrast, the attacker already has unprivileged local access to the victim machine.

(iii) Low-frequency attacks can be launched in a single connection. The information provided by the KDD’99 dataset about the connection is not sufficient. Although some content features are present in the KDD’99 dataset (refer to Table II in Section III), such as num failed logins (P11), root shell (P14) and num compromised (P13), they are not enough for attack identification. For example, the loadmodule attack (U2R) loads two dynamically loadable kernel drivers of the currently running system and creates special devices in the /dev directory to use those modules. Because of a bug in the way loadmodule sanitizes its environment, an unauthorized user can gain root access on the local machine. The attack can be detected by keyword spotting in the user’s session to find the strings ’set $IFS=‘V’ and ‘loadmodule’ [53]. This kind of keyword spotting is difficult to achieve with machine learning algorithms using the KDD’99 dataset.

(iv) The number of U2R and R2L samples present in the training and testing datasets of KDD’99 is very small compared to DoS and Probe attacks. The insufficient learning of such attacks makes the classifier less suitable for detecting them. Moreover, the imbalanced distribution of data makes the classifier treat such attacks as normal connections.

(v) Activities performed by these attacks may be similar in terms of the number of file creations, root shell logins, the number of operations performed as root, etc. In such a case identifying low-frequency attacks becomes more difficult. However, careful examination of the system call traces for the presence of specific modules or processes, suspicious sequences of system calls, invocation of specific commands, etc. may provide some clues to identify the attack activity in the system. For example, the Ffbconfig attack (U2R) exploits a buffer overflow. The ffbconfig program configures the Creator Fast Frame Buffer (FFB) Graphics Accelerator, which is a part of the FFB Configuration software package, SUNWffbcf. The attack can be detected by examining the system call traces for the presence of the ‘/usr/sbin/ffbconfig/’ command with an oversized argument for the ‘-dev’ parameter [53].

(vi) A very high accuracy (around 90-99%) is achieved by some approaches in detecting these attacks, but we can’t say that these techniques will achieve the same accuracy for detecting unseen or newly generated U2R or R2L attacks, since the techniques are validated over the test database of KDD’99, which contains feature values of the attacks that may be sufficient to separate them from DoS and Probe. For example, a dictionary attack is an R2L attack in which the attacker makes repeated guesses of usernames and passwords to gain access to some machine remotely. The attack can be detected by examining two features over a period: the session protocol of every service (P2) and the number of failed login attempts (P11). However, the feature values may not provide sufficient information, which may happen if the victim’s password is not strong enough and the attacker accesses the victim machine in one or two guesses, such as by entering the victim’s phone number or school name. The feature values for this attack will then be similar to those of a normal connection. In this case, a machine learning algorithm will not work efficiently to detect these attacks.

It is difficult to detect low-frequency attacks just by examination of network features. The issues in detecting low-frequency attacks have been identified and discussed. Possible viable solutions to detect such attacks, such as buffer overflow, password cracking, dictionary attack, virus, etc., have been discussed. One can refer to our recent work [76], [165] for detecting these attacks using system call analysis. In another recent work [77], we have considered both system call and network features for the detection of low-frequency attacks.

IX. DATA MINING TOOLS FOR MACHINE LEARNING

There exist many tools that support the implementation of various machine learning methodologies. Some of them are described below.

A. Weka

Waikato Environment for Knowledge Analysis (Weka) [166] is a machine learning software tool, developed by the University of Waikato, New Zealand, in 1993, although many changes have been incorporated since then. It is an open source tool which is freely available under the GNU General Public License and is written in Java. Weka supports various data analysis tasks such as data pre-processing, feature selection, classification, clustering, regression and visualization. The tool takes as input a set of records in a flat file (.ARFF files), where a set of attributes describes each record. It is easy to use due to the Graphical User Interfaces (GUIs) provided by the tool.

B. Scikit-Learn

Scikit-learn is an open source machine learning library, developed as a Google Summer of Code project by David Cournapeau in 2007 [167]. It is written in Python and incorporates Python numerical and scientific libraries like NumPy and SciPy and other libraries like pandas, matplotlib, etc. It provides various efficient tools for machine learning such as classification, clustering and regression algorithms. It also supports methods for feature extraction and provides learning tutorials to understand the concepts.

E. RapidMiner

RapidMiner [170] is software developed by the RapidMiner company for data mining applications. It provides an integrated environment for data pre-processing, machine learning, deep learning, text mining, etc. It is available in both commercial and free editions. The free edition is named RapidMiner Studio, which is limited to 1 logical processor and 10,000 rows and can be obtained under the AGPL license. It uses a client/server model where the server is basically hosted on a cloud platform. It provides a template based framework and removes the need for coding. It supports text mining, image mining, video mining, and social network analysis. The import formats it supports are SQL, TXT, XML, XLS, etc. It performs data extraction, transformation, analysis and visualization.

F. Environment for Developing KDD-Applications Supported by Index-Structure (ELKI)

ELKI [171] is open source software for data mining applications, developed at the Ludwig Maximilian University of Munich, Germany. It emphasizes unsupervised machine learning methods such as k-means clustering, k-medians clustering, DBSCAN, OPTICS, Expectation-maximization, Hierarchical clustering, Canopy clustering, etc. It also provides data index structures like R-tree, R*-tree, M-tree and k-d tree. The advanced mining algorithms and their interaction with database index structures are evaluated using many measures like ROC, histogram, scatterplot, etc.

C. TensorFlow
TensorFlow [168] is an open source software library for
machine learning applications, developed by Google Brain G. Massive Online Analysis (MOA)
Team. It was released under Apache 2.0 open source license MOA [172] is a popular open source data stream mining
on Nov 2015. Version 1.0.0 is recently released in Feb 2017. tool to perform big data streaming in real time. It consists of
TensorFlow is a very useful tool for deep learning as it pro- various machine learning algorithms for classification, clus-
vides support for building and training neural networks. Data tering, regression and outlier detection, etc. It is written in
flow graphs are used to create machine learning models and Java and can be extended for newer algorithms, streams and
perform computations. Data arrays are edges between nodes of evaluation methods. It provides storable settings for real and
graphs, called as tensors. TensorFlow supports multiple APIs synthetic data stream for conducting repeatable experiments.
such as python, C++, Go, Java, Haskell and Rust APIs. Third We discussed some of the important data mining tools
party packages are available for Julia, Scala and R. It can used for data analysis using machine learning algorithms.
run multiple CPUs and GPUs and supports different 64 bits Some of them also support deep learning algorithms. Most
OS such as Linux, Windows, MacOS, Android and iOS, etc. of these tools provide easy to use GUI interface that can be
TensorFlow Lite is a recent release for Android. easily used by researchers in their research domain. Some
other machine learning libraries are Apache SAMOA [173]
D. KMINE and MLlib (Spark) [174], etc. those can also be used by
researchers.
Konstanz Information Miner (KNIME [169] is an open
source tool for data analytics, developed at University of
Konstanz, released under a dual licensing scheme. It uses X. F UTURE D IRECTIONS
the data pipelining concepts to integrate the components of Deep learning is an advancement of the neural network.
machine learning and data mining. It provides GUI to per- Deep learning uses the subsequent layers of information-
form various tasks such as data loading, transformation, feature processing in some hierarchy for classification or feature
extraction, modeling and visualization. It is written in Java representation. It makes use of the deep networks having
but provides a wrapper to run other codes like python, perl. multiple layers of processing. It consists of input tier provid-
It provides the processing of large data volumes. For exam- ing the basic data and followed by consecutive hidden layers
ple, it can analyze 300 million customer addresses, 10 million which analyze data and output is produced. It has gained pop-
molecular structure and 20 million cell images. It integrates ularity in recent years. The existing IDS can be improved by
open source projects such as ML algorithms from Weka, R embracing this latest technique. Deng and Yu [175] provided
packages, LibSVM, JFreeChart, ImageJ, etc. using plugins. the categorization of deep learning methods based on their

architecture into following types: generative (unsupervised), discriminative (supervised) and hybrid. Unsupervised deep learning, or generative architectures, make use of the following methods: Auto Encoder (AE) and Boltzmann Machine (BM). Similar to ANN, AE makes use of hidden layers; however, it has only three hidden layers. The nodes in the input layer and the output layer are the same. The hidden nodes are used to reduce the feature dimensionality and provide the new feature set [176]. A different set of features is learned at each cascaded depth to train the network more precisely. BM takes stochastic decisions using a neuron structure of binary units. Deep BM (DBM) has a cascaded structure, whereas Restricted BM has no connections among the hidden units. Multiple layers stacked one on another form a deep belief network (DBN). Supervised learning is used to distinguish some parts of data and has been used for pattern classification. Convolutional Neural Network (CNN) is an example of supervised learning which provides fast learning. CNN relies on three ideas: local receptive fields, shared weights and pooling. The hybrid approach makes use of both methods. An example of hybrid architecture is the Deep Neural Network (DNN). DNN provides fully connected hidden layers forming cascaded multilayer networks.

The use of deep learning for image classification is quite popular. However, the challenge lies in adopting deep learning for attack detection in network traffic. In recent years, deep learning has been applied to intrusion detection, as shown in Table XII. Seok and Kim [188] have employed deep learning for attack detection based on converting malware code into images and applying a CNN which takes these images as input to learn attack features. Also, most of the work on deep learning based IDS uses this approach for reducing the dimensionality of features. It has many advantages. Combining the supervised and unsupervised approaches of deep learning improves the detection results of traditional approaches [189]. It helps in developing new methods for network security which are more certain than traditional machine learning approaches. Deep learning is adaptable to the changing context of data as it performs exhaustive data analysis. However, the use of deep learning for attack analysis is still a challenging and open area for researchers to work on. The resources required for training the network are also quite large. Deep learning is suitable when it is difficult to find the correlation between the raw input and the target class. A reliable intrusion detection system should be able to handle noisy inputs and large discrete or continuous data.

TABLE XII
RECENT WORKS BASED ON APPLICATION OF DEEP LEARNING FOR INTRUSION DETECTION
References covered: [177]–[188]

Reinforcement learning (RL) is another interesting area of research. RL is a machine learning approach in which multiple agents and machines interact to learn the behavior within a particular context and improve the performance of attack detection. Sensors or agents sense the environment at discrete time intervals, and the input is mapped to the state information. Once the RL agents execute an action, feedback is observed from the environment. Correct actions of agents are rewarded by the environment; this reward is called the reinforcement signal. Agents then leverage the rewards to improve their knowledge about the environment and select the next action. Some researchers have applied RL to detect distributed denial of service attacks [190].

Servin and Kudenko [191] proposed a hierarchical distributed architecture for intrusion detection in which multiple network agents learn to capture the local state information. All the sensors communicate with the agent above them in the hierarchy. The topmost agent decides when to fire and generate an alarm. Each agent uses a slightly modified version of Q-learning and a simple exploration/exploitation strategy to learn which action to execute in a particular state. Random selection is carried out to choose between normal and abnormal states, and the global state of the network is simulated. The approach achieves an accuracy of 98.9% with a 1.1% error rate using two sensor agents on a self-generated dataset.

To improve efficiency and computational performance, Xu and Xie [192] applied RL to host-based intrusion detection. They applied a Markov reward process model to model the behavior of the host. The RL prediction method makes use of the temporal difference learning algorithm [193] for learning the behavior of the processes. It helps in detecting abnormal process behavior of the applications running on the host. The use of RL for predicting the behavior of normal system call sequences has helped in improving the accuracy. They obtained a 100% detection rate with 20 sec of training time using the MIT lpr dataset [194]. The computational cost is very low in comparison to traditional ML algorithms such as HMM, RIPPER, etc. Some of the crucial research challenges associated with deep learning and reinforcement learning approaches are summarized as follows:
• One of the primary challenges in using deep algorithms is to generate or obtain a large amount of data for training the classification algorithm. For example, in order to make an IDS learn the various attack scenarios, researchers supply terabytes of data to train the deep learning classifier. Availability of sufficient data specific to the problem domain is one of the crucial challenges.
• It will be challenging to adopt deep learning algorithms for real-time classification because of the level of complexity involved in training on huge amounts of data. Most of the existing work applies deep learning for feature extraction and dimensionality reduction only [176].
• Most of the existing deep learning methods are suitable for pattern and image recognition. However, how to apply deep learning to classify the network traffic and/or system

logs properly, is still a challenge. Some deep learning algorithms, such as Convolutional Neural Network (CNN) and Deep Belief Network (DBN), have proven to be good classifiers. However, experimental work is in progress to determine the efficiency and reliability of these learning algorithms for detecting attacks [195].
• Another challenge with deep learning algorithms is the requirement of high-performance hardware to process the huge training data. Machines need to have sufficient processing power to solve real world problems. To reduce the training time and improve efficiency, researchers will require multi-core, high-performing GPUs, which are very costly and consume more power. For example, Monte Carlo Tree search integrated with a deep neural network requires 48 CPUs and 8 GPUs for conducting a multi-threaded search with 40 threads [183].
• The growth in computer memory and computational power is possible through parallel and distributed computing. Researchers can work in this direction to cope with the issues related to communication and computation management in order to scale deep learning algorithms to huge datasets.
• Reinforcement learning (RL) is a growing field, and research on applying it to attack detection is still going on. The slow learning speed of RL affects the feasibility of multi-agent learning in the real world. How to speed up the performance of a multi-agent classifier is another important research challenge [196].
• In a multi-agent RL system, there has to be proper coordination among the RL agents. Therefore, designing a fast and effective way for them to communicate is another important research concern. Researchers are still working on how to apply RL for filtering the network traffic more accurately.

Deep Learning and Reinforcement Learning are the future research directions in the field of intrusion detection. Deep Reinforcement Learning [197] is the next step in this direction, to make learning effective for huge collections of data. It has been applied for pattern classification and resource control in past years. Researchers can apply it to intrusion detection applications.

XI. CONCLUSION
The increasing rate of intrusions in network and host machines has badly affected the security and privacy of users. Researchers have worked extensively on various solutions to detect intrusions. The security aspects of intrusion detection using the machine learning approach have been considered in our paper. We have described various types of attacks in network and host systems with a brief description of their attack features. The analysis reveals that a technique that performs well for detecting one attack may not perform equally well for detecting other attacks. Hence, the relevance of a technique for specific attacks has been presented by classifying various machine learning techniques for each type of attack.

The critical performance analysis of various machine learning algorithms has been done in an evolutionary way. The comparison has been carried out with single classifier approaches and multiple classifier approaches. We have analyzed not only the influence of one classifier on another but also the influence of a feature subset on the classifier. We have shown that even if an optimal feature set is sufficient for analyzing the behavior of one attack, it is not good for analyzing the behavior of other attacks. Hence, there is a need to define the optimal feature subset and a suitable technique for each type of attack, as the behavior of attacks varies from one to another.

The difficulties associated with detecting low-frequency attacks using machine learning techniques over network datasets have been described. This motivates researchers to work on other solutions to detect low-frequency attacks. Future research directions are provided to help researchers explore more efficient solutions for attack detection.

To generalize our observations, we have described existing literature that is based on similar techniques and uses most of the datasets that are popular to date. Not all of the techniques have been implemented to evaluate their performance and to ensure that the results are reproducible. This remains a limitation of our paper, and we are very keen to improve this in future work. In future, we would also like to propose an attack detection model, especially for improving the detection of low-frequency attacks, by exploring deep learning approaches. Later, we will focus on the issues that arise when IDS techniques are applied to dynamic and changing network environments such as Cloud Computing.

REFERENCES
[1] BBC News. (2008). Estonia Fines Man for ‘Cyber War’. [Online]. Available: [Link]
[2] L. Dignan. (2008). Amazon Exploits Its S3 Outage. [Online]. Available: [Link]
[3] M. Dekker, D. Liveri, and M. Lakka, “Cloud security incident reporting: Framework for reporting about major cloud security incidents,” ENSIA, St. Paul, MN, USA, Rep. TP-04-13-105-EN-N, 2013.
[4] DDoS. (2014). Ello Social Network Hit by Suspected Bloody DDoS Attack. [Online]. Available: [Link]social-network-hit-by-suspected-bloody-dDoS-attack/
[5] S. Panjwani, S. Tan, K. M. Jarrin, and M. Cukier, “An experimental evaluation to determine if port scans are precursors to an attack,” in Proc. IEEE Int. Conf. Depend. Syst. Netw. DSN, 2005, pp. 602–611.
[6] CISCO. (2014). Cisco Annual Report. [Online]. Available: [Link]/web/offer/gist-ty2-asset/[Link]
[7] “Cisco annual cyber security report,” CISCO, San Jose, CA, USA, Rep., 2017.
[8] SNORT. (2017). Snort [Link]. [Online]. Available: [Link]
[9] OISF. (2018). Suricata 4.0.4. [Online]. Available: [Link][Link]/about/
[10] T. F. Lunt, “Ides: An intelligent system for detecting intruders,” in Proc. Symp. Comput. Security Threat Countermeasures, 1990, pp. 30–45.
[11] L. Ertöz et al., “MINDS-minnesota intrusion detection system,” in Next Generation Data Mining. Cambridge, MA, USA: MIT Press, 2004, pp. 199–218.
[12] KDD. (1999). KDD Cup 1999 Data. [Online]. Available: [Link]
[13] N. Moustafa and J. Slay, “UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set),” in Proc. Mil. Commun. Inf. Syst. Conf. (MilCIS), Canberra, ACT, Australia, 2015, pp. 1–6.
[14] E. Vasilomanolakis, S. Karuppayah, M. Mühlhäuser, and M. Fischer, “Taxonomy and survey of collaborative intrusion detection,” ACM Comput. Surveys, vol. 47, no. 4, p. 55, 2015.

[15] A. Torkaman, G. Javadzadeh, and M. Bahrololum, “A hybrid intelligent HIDS model using two-layer genetic algorithm and neural network,” in Proc. IEEE 5th Conf. Inf. Knowl. Technol. (IKT), 2013, pp. 92–96.
[16] R. Puzis, M. D. Klippel, Y. Elovici, and S. Dolev, “Optimization of NIDS placement for protection of intercommunicating critical infrastructures,” in Proc. IEEE Int. Conf. Intell. Security Informat., Taipei, Taiwan, 2008, pp. 191–203.
[17] A.-S. K. Pathan, The State of the Art in Intrusion Prevention and Detection. Boca Raton, FL, USA: CRC Press, 2014.
[18] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Proc. IEEE Int. Joint Conf. Neural Netw., Washington, DC, USA, 1989, pp. 593–605.
[19] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann, 1993.
[20] C. Cortes and V. Vapnik, “Support vector machine,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[21] S. Kumar, Survey of Current Network Intrusion Detection Techniques, Washington Univ., St. Louis, MO, USA, pp. 1–18, 2007.
[22] M. Ahmed, A. N. Mahmood, and J. Hu, “A survey of network anomaly detection techniques,” J. Netw. Comput. Appl., vol. 60, pp. 19–31, Jan. 2016.
[23] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, “Anomaly-based network intrusion detection: Techniques, systems and challenges,” Comput. Security, vol. 28, nos. 1–2, pp. 18–28, 2009.
[24] P. Kumar et al., “A novel approach for security in cloud computing using hidden Markov model and clustering,” in Proc. IEEE World Congr. Inf. Commun. Technol. (WICT), 2011, pp. 810–815.
[25] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[26] P. Casas, J. Mazel, and P. Owezarski, “Unsupervised network intrusion detection systems: Detecting the unknown without knowledge,” Comput. Commun., vol. 35, no. 7, pp. 772–783, 2012.
[27] J. Yang, T. Deng, and R. Sui, “An adaptive weighted one-class SVM for robust outlier detection,” in Proc. Chin. Intell. Syst. Conf., 2016, pp. 475–484.
[28] G. Kim, S. Lee, and S. Kim, “A novel hybrid intrusion detection method integrating anomaly detection with misuse detection,” Exp. Syst. Appl., vol. 41, no. 4, pp. 1690–1700, 2014.
[29] C. Kolias, G. Kambourakis, A. Stavrou, and S. Gritzalis, “Intrusion detection in 802.11 networks: Empirical evaluation of threats and a public dataset,” IEEE Commun. Surveys Tuts., vol. 18, no. 1, pp. 184–208, 1st Quart., 2015.
[30] Y. Zhang, W. Lee, and Y.-A. Huang, “Intrusion detection techniques for mobile wireless networks,” Wireless Netw., vol. 9, no. 5, pp. 545–556, 2003.
[31] C. Kolias, V. Kolias, and G. Kambourakis, “TermID: A distributed swarm intelligence-based approach for wireless intrusion detection,” Int. J. Inf. Security, vol. 16, no. 4, pp. 401–416, 2016.
[32] M. Halilovic and A. Subasi, “Intrusion detection on smartphones,” arXiv e-print 1211.6610, pp. 1–8, Nov. 2012.
[33] A. Karim, R. Salleh, and M. K. Khan, “SMARTbot: A behavioral analysis framework augmented with machine learning to identify mobile botnet applications,” PLoS ONE, vol. 11, no. 3, pp. 1–35, 2016.
[34] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss, “‘Andromaly’: A behavioral malware detection framework for android devices,” J. Intell. Inf. Syst., vol. 38, no. 1, pp. 161–190, 2012.
[35] A. K. Sikder, H. Aksu, and A. S. Uluagac, “6thSense: A context-aware sensor-based attack detector for smart devices,” in Proc. 26th USENIX Security Symp., 2017, pp. 397–414.
[36] P. Faruki et al., “Android security: A survey of issues, malware penetration, and defenses,” IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 998–1022, 2nd Quart., 2015.
[37] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “Intrusion detection techniques in cloud environment: A survey,” J. Netw. Comput. Appl., vol. 77, pp. 18–47, Jan. 2017.
[38] S. Anwar et al., “Cross-VM cache-based side channel attacks and proposed prevention mechanisms: A survey,” J. Netw. Comput. Appl., vol. 93, pp. 259–279, Sep. 2017.
[39] S. Agrawal and J. Agrawal, “Survey on anomaly detection using data mining techniques,” Procedia Comput. Sci., vol. 60, pp. 708–713, Dec. 2015.
[40] N. F. Haq et al., “Application of machine learning approaches in intrusion detection system: A survey,” Int. J. Adv. Res. Artif. Intell., vol. 4, no. 3, pp. 9–18, 2015.
[41] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., 2015.
[42] D. Csubak and A. Kiss, “OpenStack firewall as a service rule analyser,” in Proc. Int. Conf. Human Aspects Inf. Security Privacy Trust, 2016, pp. 212–220.
[43] P. S. Kenkre, A. Pai, and L. Colaco, “Real time intrusion detection and prevention system,” in Proc. 3rd Int. Conf. Front. Intell. Comput. Theory Appl. (FICTA), 2015, pp. 405–411.
[44] P. Deshpande, A. Aggarwal, S. Sharma, P. S. Kumar, and A. Abraham, “Distributed port-scan attack in cloud environment,” in Proc. 5th Int. Conf. Comput. Aspects Soc. Netw. (CASoN), Fargo, ND, USA, 2013, pp. 27–31.
[45] M. J. Schoelles and W. D. Gray, “Argus: A suite of tools for research in complex cognition,” Behav. Res. Methods Instrum. Comput., vol. 33, no. 2, pp. 130–140, 2001.
[46] A. Crenshaw. (2008). OSfuscate: Change Your Windows OS TCP/IP Fingerprint to Confuse P0f, NetworkMiner, Ettercap, Nmap and Other OS Detection Tools. [Online]. Available: [Link]security/osfuscate-change-your-windows-os-tcp-ip-fingerprint-to-confu[Link]
[47] D. Norton, An Ettercap Primer. Singapore: SANS Inst. InfoSec Reading Room, 2004.
[48] N. Hoque, M. H. Bhuyan, R. C. Baishya, D. Bhattacharyya, and J. K. Kalita, “Network attacks: Taxonomy, tools and systems,” J. Netw. Comput. Appl., vol. 40, pp. 307–324, Apr. 2014.
[49] G. Mantas, N. Stakhanova, H. Gonzalez, H. H. Jazi, and A. A. Ghorbani, “Application-layer denial of service attacks: Taxonomy and survey,” Int. J. Inf. Comput. Security, vol. 7, nos. 2–4, pp. 216–239, 2015.
[50] F. Iglesias and T. Zseby, “Analysis of network traffic features for anomaly detection,” Mach. Learn., vol. 101, nos. 1–3, pp. 59–84, 2014.
[51] E. Guillén, J. Rodríguez, R. Paez, and A. Rodriguez, “Detection of non-content based attacks using GA with extended KDD features,” in Proc. World Congr. Eng. Comput. Sci., 2012, pp. 30–35.
[52] D. Kumar, “DDoS attacks and their types,” in Network Security Attacks and Countermeasures. Hershey, PA, USA: Inf. Sci. Ref., 2016, p. 197.
[53] MIT. (1999). Darpa Intrusion Detection Attacks Database. [Online]. Available: [Link]
[54] M. Malekzadeh, M. Ashrostaghi, and M. S. Abadi, “Amplification-based attack models for discontinuance of conventional network transmissions,” Int. J. Inf. Eng. Electron. Bus., vol. 7, no. 6, p. 15, 2015.
[55] S. Maiti, C. Garai, and R. Dasgupta, “A detection mechanism of DoS attack using adaptive NSA algorithm in cloud environment,” in Proc. IEEE Int. Conf. Comput. Commun. Security (ICCCS), 2015, pp. 1–7.
[56] T. Bass, A. Freyre, D. Gruber, and G. Watt, “E-mail bombs and countermeasures: Cyber attacks on availability and brand integrity,” IEEE Netw., vol. 12, no. 2, pp. 10–17, Mar./Apr. 1998.
[57] T. Halagan, T. Kováčik, P. Trúchly, and A. Binder, “Syn flood attack detection and type distinguishing mechanism based on counting bloom filter,” in Proc. Inf. Commun. Technol. EurAsia Conf., 2015, pp. 30–39.
[58] E. Bou-Harb, M. Debbabi, and C. Assi, “Cyber scanning: A comprehensive survey,” IEEE Commun. Surveys Tuts., vol. 16, no. 3, pp. 1496–1519, 3rd Quart., 2014.
[59] A. K. Kaushik, E. S. Pilli, and R. C. Joshi, “Network forensic system for port scanning attack,” in Proc. IEEE 2nd Int. Adv. Comput. Conf. (IACC), 2010, pp. 310–315.
[60] L. Aniello, G. Lodi, and R. Baldoni, “Inter-domain stealthy port scan detection through complex event processing,” in Proc. 13th Eur. Workshop Depend. Comput., 2011, pp. 67–72.
[61] P. Sangkatsanee, N. Wattanapongsakorn, and C. Charnsripinyo, “Practical real-time intrusion detection using machine learning approaches,” Comput. Commun., vol. 34, no. 18, pp. 2227–2235, 2011.
[62] D. Mankins, R. Krishnan, C. Boyd, J. Zao, and M. Frentz, “Mitigating distributed denial of service attacks with dynamic resource pricing,” in Proc. 17th Annu. Comput. Security Appl. Conf. (ACSAC), 2001, pp. 411–421.
[63] G. Helmer et al., “A software fault tree approach to requirements analysis of an intrusion detection system,” Requirements Eng., vol. 7, no. 4, pp. 207–220, 2002.
[64] A. Sridharan, T. Ye, and S. Bhattacharyya, “Connectionless port scan detection on the backbone,” in Proc. 25th IEEE Int. Perform. Comput. Commun. Conf. (IPCCC), Phoenix, AZ, USA, 2006, p. 10.

[65] E. Raftopoulos, E. Glatz, X. Dimitropoulos, and A. Dainotti, “How [91] A. M. Kibriya, E. Frank, B. Pfahringer, and G. Holmes, “Multinomial
dangerous is Internet scanning?” in Proc. Int. Workshop Traffic Monitor. naive Bayes for text categorization revisited,” in Proc. Aust. Conf. Artif.
Anal., 2015, pp. 158–172. Intell., Cairns, QLD, Australia, 2004, pp. 488–499.
[66] M. Rostamipour and B. Sadeghiyan, “Network attack origin forensics [92] L. Jiang, H. Zhang, and Z. Cai, “A novel Bayes model: Hidden naive
with fuzzy logic,” in Proc. IEEE 5th Int. Conf. Comput. Knowl. Eng. Bayes,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 10, pp. 1361–1371,
(ICCKE), 2015, pp. 67–72. Oct. 2009.
[67] S. Bahl and S. K. Sharma, “A minimal subset of features using cor- [93] W.-H. Chen, S.-H. Hsu, and H.-P. Shen, “Application of SVM and
relation feature selection model for intrusion detection system,” in ANN for intrusion detection,” Comput. Oper. Res., vol. 32, no. 10,
Proc. 2nd Int. Conf. Comput. Commun. Technol., 2016, pp. 337–346. pp. 2617–2634, 2005.
[68] C. Edge, W. Barker, B. Hunter, and G. Sullivan, “Malware security: [94] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, “Machine learning:
Combating viruses, worms, and root kits,” in Enterprise Mac Security, A review of classification and combining techniques,” Artif. Intell. Rev.,
Berkeley, CA, USA: Apress, 2016, pp. 221–242. vol. 26, no. 3, pp. 159–190, 2006.
[69] D. G. Johnson and T. M. Powers, “Computer systems and responsibil- [95] I. Ahmad, A. Abdullah, A. Alghamdi, and M. Hussain, “Optimized
ity: A normative look at technological complexity,” Ethics Inf. Technol., intrusion detection mechanism using soft computing techniques,”
vol. 7, no. 2, pp. 99–107, 2005. Telecommun. Syst., vol. 52, no. 4, pp. 2187–2195, 2013.
[70] A. A. Ghorbani, W. Lu, and M. Tavallaee, “Network attacks,” in [96] H.-Y. Huang and C.-J. Lin, “Linear and kernel classification: When to
Network Intrusion Detection and Prevention. Cham, Switzerland: use which?” in Proc. SIAM Int. Conf. Data Min., 2016, pp. 216–224.
Springer Int., 2010, pp. 1–25. [97] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and
[71] P. K. Manadhata and J. M. Wing, “An attack surface metric,” IEEE J. C. Platt, “Support vector method for novelty detection,” in Proc.
Trans. Softw. Eng., vol. 37, no. 3, pp. 371–386, May/Jun. 2011. Adv. Neural Inf. Process. Syst., Denver, CO, USA, 2000, pp. 582–588.
[72] S. Singh and S. Silakari, “A survey of cyber attack detection systems,” [98] S. Owais, V. Snasel, P. Kromer, and A. Abraham, “Survey: Using
Int. J. Comput. Sci. Netw. Security, vol. 9, no. 5, pp. 1–10, 2009. genetic algorithm approach in intrusion detection systems techniques,”
[73] M. K. Sabhnani and G. Serpen, “KDD feature set complaint heuris- in Proc. 7th IEEE Comput. Inf. Syst. Ind. Manag. Appl. (CISIM),
tic rules for R2L attack detection,” in Proc. Security Manag., 2003, Ostrava, Czech Republic, 2008, pp. 300–307.
pp. 310–316. [99] S. Selvakani and R. S. Rajesh, “Genetic algorithm for framing rules for
[74] K. S. Wutyi and M. M. S. Thwin, “Heuristic rules for attack detec- intrusion detection,” Int. J. Comput. Sci. Netw. Security, vol. 7, no. 11,
tion charged by NSL KDD dataset,” in Genetic and Evolutionary pp. 285–290, 2007.
Computing. Cham, Switzerland: Springer Int., 2016, pp. 137–153. [100] P. Gupta and S. K. Shinde, “Genetic algorithm technique used to detect
[75] P. Mishra, E. S. Pilli, and R. C. Joshi, “Forensic analysis of e-mail date intrusion detection,” in Proc. 1st Int. Conf. Adv. Comput. Inf. Technol.,
and time spoofing,” in Proc. IEEE 3rd Int. Conf. Comput. Commun. Technol. (ICCCT), 2012, pp. 309–314.
Chennai, India, 2011, pp. 122–131.
[76] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “VAED: VMI-assisted evasion detection approach for infrastructure as a service cloud,” Concurrency Comput. Pract. Exp., vol. 29, no. 12, pp. 1–21, 2017.
[77] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “PSI-NetVisor: Program semantic aware intrusion detection at network and hypervisor layer in cloud,” J. Intell. Fuzzy Syst., vol. 32, no. 4, pp. 2909–2921, 2017.
[78] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Comput. Security, vol. 31, no. 3, pp. 357–374, 2012.
[79] V. H. Garcia, R. Monroy, and M. Quintana, “Web attack detection using ID3,” in Professional Practice in Artificial Intelligence. Cham, Switzerland: Springer Int., 2006, pp. 323–332.
[80] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann, 1993.
[81] A. L. Prodromidis and S. J. Stolfo, “Cost complexity-based pruning of ensemble classifiers,” Knowl. Inf. Syst., vol. 3, no. 4, pp. 449–469, 2001.
[82] T. M. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, 1997.
[83] M. F. Augusteijn and B. A. Folkert, “Neural network classification and novelty detection,” Int. J. Remote Sens., vol. 23, no. 14, pp. 2891–2902, 2002.
[84] M. M. Moya, M. W. Koch, and L. D. Hostetler, “One-class classifier networks for target recognition applications,” Sandia Nat. Labs., Albuquerque, NM, USA, Rep. SAND-93-0084C, 1993.
[85] S. Albrecht, J. Busch, M. Kloppenburg, F. Metze, and P. Tavan, “Generalized radial basis function networks for classification and novelty detection: Self-organization of optimal Bayesian decision,” Neural Netw., vol. 13, no. 10, pp. 1075–1093, 2000.
[86] A. Jagota, “Novelty detection on a very large number of memories stored in a hopfield-style network,” in Proc. IEEE Seattle Int. Joint Conf. Neural Netw. (IJCNN), vol. 2. Seattle, WA, USA, 1991, p. 905.
[87] D. Martinez, “Neural tree density estimation for novelty detection,” IEEE Trans. Neural Netw., vol. 9, no. 2, pp. 330–338, Mar. 1998.
[88] A. Bivens et al., “Network-based intrusion detection using neural networks,” Intell. Eng. Syst. Artif. Neural Netw., vol. 12, no. 1, pp. 579–584, 2002.
[89] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Proc. 11th Conf. Uncertainty Artif. Intell., Montreal, QC, Canada, 1995, pp. 338–345.
[90] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” in Proc. AAAI Workshop Learn. Text Categorization, vol. 752. Madison, WI, USA, 1998, pp. 41–48.
[101] O. Depren, M. Topallar, E. Anarim, and M. K. Ciliz, “An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks,” Expert Syst. Appl., vol. 29, no. 4, pp. 713–722, 2005.
[102] A. Ahmad and L. Dey, “A k-mean clustering algorithm for mixed numeric and categorical data,” Data Knowl. Eng., vol. 63, no. 2, pp. 503–527, 2007.
[103] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surveys, vol. 41, no. 3, p. 15, 2009.
[104] A. A. Aburomman and M. B. I. Reaz, “A novel SVM-KNN-PSO ensemble method for intrusion detection system,” Appl. Soft Comput., vol. 38, pp. 360–372, Jan. 2016.
[105] H. H. Hosmer, “Security is fuzzy! Applying the fuzzy logic paradigm to the multipolicy paradigm,” in Proc. ACM Workshop New Security Paradigms, Little Compton, RI, USA, 1993, pp. 175–184.
[106] L. A. Zadeh, “Fuzzy sets,” Inf. Control, vol. 8, no. 3, pp. 338–353, 1965.
[107] C. Tang, Y. Xiang, Y. Wang, J. Qian, and B. Qiang, “Detection and classification of anomaly intrusion using hierarchy clustering and SVM,” Security Commun. Netw., vol. 9, no. 16, pp. 3401–3411, 2016.
[108] S. Raja and S. Ramaiah, “An efficient fuzzy-based hybrid system to cloud intrusion detection,” Int. J. Fuzzy Syst., vol. 19, no. 1, pp. 62–77, 2017.
[109] M. Gyanchandani, J. Rana, and R. Yadav, “Taxonomy of anomaly based intrusion detection system: A review,” Int. J. Sci. Res. Publ., vol. 2, no. 12, pp. 1–13, 2012.
[110] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov models,” IEEE ASSP Mag., vol. 3, no. 1, pp. 4–16, Jan. 1986.
[111] D. Ariu and G. Giacinto, “HMMPayl: An application of HMM to the analysis of the HTTP payload,” in Proc. WAPA, 2010, pp. 81–87.
[112] A. Churbanov and S. Winters-Hilt, “Implementing EM and Viterbi algorithms for hidden Markov model in linear memory,” BMC Bioinformat., vol. 9, no. 1, p. 224, 2008.
[113] C. Kolias, G. Kambourakis, and M. Maragoudakis, “Swarm intelligence in intrusion detection: A survey,” Comput. Security, vol. 30, no. 8, pp. 625–642, 2011.
[114] M. Abadi and S. Jalili, “An ant colony optimization algorithm for network vulnerability analysis,” Iran. J. Elect. Elect. Eng., vol. 2, no. 3, pp. 106–120, 2006.
[115] C. Blum and X. Li, “Swarm intelligence in optimization,” in Swarm Intelligence. Cham, Switzerland: Springer, 2008, pp. 43–85.
[116] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms. Boca Raton, FL, USA: CRC Press, 2012.
[117] M. Sewell, “Ensemble learning,” Res. Note, vol. 11, no. 2, pp. 1–18, 2008.
[118] Y. Freund, “Boosting a weak learning algorithm by majority,” Inf. Comput., vol. 121, no. 2, pp. 256–285, 1995.
[119] K. Anusha and E. Sathiyamoorthy, “Comparative study for feature selection algorithms in intrusion detection system,” Autom. Control Comput. Sci., vol. 50, no. 1, pp. 1–9, 2016.
[120] D. S. Kim and J. S. Park, “Network-based intrusion detection with support vector machines,” in Information Networking. ICOIN 2003 (LNCS 2662), H. K. Kahng. Heidelberg, Germany: Springer, 2003, pp. 747–756.
[121] N. B. Amor, S. Benferhat, and Z. Elouedi, “Naive Bayes vs decision trees in intrusion detection systems,” in Proc. ACM Symp. Appl. Comput., Nicosia, Cyprus, 2004, pp. 420–424.
[122] Y. Bouzida and F. Cuppens, “Neural networks vs. decision trees for intrusion detection,” in Proc. IEEE/IST Workshop Monitor. Attack Detection Mitigation (MonAM), vol. 28. Tübingen, Germany, 2006, p. 29.
[123] C. Zhang, J. Jiang, and M. Kamel, “Intrusion detection using hierarchical neural networks,” Pattern Recognit. Lett., vol. 26, no. 6, pp. 779–791, 2005.
[124] S. Mukkamala, A. H. Sung, and A. Abraham, “Intrusion detection using an ensemble of intelligent paradigms,” J. Netw. Comput. Appl., vol. 28, no. 2, pp. 167–182, 2005.
[125] W. Wang and R. Battiti, “Identifying intrusions in computer networks with principal component analysis,” in Proc. IEEE 1st Int. Conf. Availability Rel. Security (ARES), 2006, p. 8.
[126] L. Khan, M. Awad, and B. Thuraisingham, “A new intrusion detection system using support vector machines and hierarchical clustering,” J. Int. J. Very Large Data Bases, vol. 16, no. 4, pp. 507–521, 2007.
[127] Y. Li, B.-X. Fang, L. Guo, and Y. Chen, “TCM-KNN algorithm for supervised network intrusion detection,” in Proc. Pac. Asia Conf. Intell. Security Informat., Chengdu, China, 2007, pp. 141–151.
[128] Y. Chen, A. Abraham, and B. Yang, “Hybrid flexible neural-tree-based intrusion detection systems,” Int. J. Intell. Syst., vol. 22, no. 4, pp. 337–352, 2007.
[129] C. Xiang, P. C. Yong, and L. S. Meng, “Design of multiple-level hybrid classifier for intrusion detection system using Bayesian clustering and decision trees,” Pattern Recognit. Lett., vol. 29, no. 7, pp. 918–924, 2008.
[130] X. Tong, Z. Wang, and H. Yu, “A research using hybrid RBF/Elman neural networks for intrusion detection system secure model,” Comput. Phys. Commun., vol. 180, no. 10, pp. 1795–1801, 2009.
[131] A. Tajbakhsh, M. Rahmati, and A. Mirzaei, “Intrusion detection using fuzzy association rules,” Appl. Soft Comput., vol. 9, no. 2, pp. 462–469, 2009.
[132] G. Wang, J. Hao, J. Ma, and L. Huang, “A new approach to intrusion detection using artificial neural networks and fuzzy clustering,” Expert Syst. Appl., vol. 37, no. 9, pp. 6225–6232, 2010.
[133] F. Amiri, M. R. Yousefi, C. Lucas, A. Shakery, and N. Yazdani, “Mutual information-based feature selection for intrusion detection systems,” J. Netw. Comput. Appl., vol. 34, no. 4, pp. 1184–1199, 2011.
[134] S.-J. Horng et al., “A novel intrusion detection system based on hierarchical clustering and support vector machines,” Expert Syst. Appl., vol. 38, no. 1, pp. 306–313, 2011.
[135] D. Boughaci, M. D. E. Kadi, and M. Kada, “Fuzzy particle swarm optimization for intrusion detection,” in Proc. Int. Conf. Neural Inf. Process., Doha, Qatar, 2012, pp. 541–548.
[136] S.-W. Lin, K. C. Ying, C.-Y. Lee, and Z.-J. Lee, “An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection,” Appl. Soft Comput., vol. 12, no. 10, pp. 3285–3290, 2012.
[137] S. S. S. Sindhu, S. Geetha, and A. Kannan, “Decision tree based light weight intrusion detection using a wrapper approach,” Expert Syst. Appl., vol. 39, no. 1, pp. 129–141, 2012.
[138] Y. Li et al., “An efficient intrusion detection system based on support vector machines and gradually feature removal method,” Expert Syst. Appl., vol. 39, no. 1, pp. 424–430, 2012.
[139] A. Chandrasekhar and K. Raghuveer, “Intrusion detection technique by using k-means, fuzzy neural network and SVM classifiers,” in Proc. IEEE Int. Conf. Comput. Commun. Informat. (ICCCI), Coimbatore, India, 2013, pp. 1–7.
[140] S. Kumar and A. Yadav, “Increasing performance of intrusion detection system using neural network,” in Proc. Int. Conf. Adv. Commun. Control Comput. Technol. (ICACCCT), 2014, pp. 546–550.
[141] F. Kuang, W. Xu, and S. Zhang, “A novel hybrid KPCA and SVM with GA model for intrusion detection,” Appl. Soft Comput., vol. 18, pp. 178–184, May 2014.
[142] W. Feng, Q. Zhang, G. Hu, and J. X. Huang, “Mining network data for intrusion detection through combining SVMs with ant colony networks,” Future Gener. Comput. Syst., vol. 37, pp. 127–140, Jul. 2014.
[143] L. Koc, T. A. Mazzuchi, and S. Sarkani, “A network intrusion detection system based on a hidden Naïve Bayes multiclass classifier,” Expert Syst. Appl., vol. 39, no. 18, pp. 13492–13500, 2012.
[144] W.-C. Lin, S.-W. Ke, and C.-F. Tsai, “Cann: An intrusion detection system based on combining cluster centers and nearest neighbors,” Knowl. Based Syst., vol. 78, pp. 13–21, Apr. 2015.
[145] P. V. Amoli, T. Hamalainen, G. David, M. Zolotukhin, and M. Mirzamohammad, “Unsupervised network intrusion detection systems for zero-day fast-spreading attacks and botnets,” Int. J. Digit. Content Technol. Its Appl., vol. 10, no. 2, pp. 1–13, 2016.
[146] G. Kumar and K. Kumar, “A multi-objective genetic algorithm based approach for effective intrusion detection using neural networks,” in Intelligent Methods for Cyber Warfare (Studies in Computational Intelligence), vol. 563, R. Yager, M. Reformat, and N. Alajlan, Eds. Cham, Switzerland: Springer, 2015, pp. 173–200.
[147] A. N. Toosi and M. Kahani, “A new approach to intrusion detection based on an evolutionary soft computing model using neuro-fuzzy classifiers,” Comput. Commun., vol. 30, no. 10, pp. 2201–2212, 2007.
[148] S. Elhag, A. Fernández, A. Bawakid, S. Alshomrani, and F. Herrera, “On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems,” Expert Syst. Appl., vol. 42, no. 1, pp. 193–202, 2015.
[149] W. Yassin, N. I. Udzir, Z. Muda, and M. N. Sulaiman, “Anomaly-based intrusion detection through k-means clustering and Naives Bayes classification,” in Proc. 4th Int. Conf. Comput. Informat. (ICOCI), 2013, pp. 298–303.
[150] K. K. Gupta, B. Nath, and R. Kotagiri, “Layered approach using conditional random fields for intrusion detection,” IEEE Trans. Depend. Secure Comput., vol. 7, no. 1, pp. 35–49, Jan./Mar. 2010.
[151] M. S. I. Mamun, A. A. Ghorbani, and N. Stakhanova, “An entropy based encrypted traffic classifier,” in Proc. Int. Conf. Inf. Commun. Security, Beijing, China, 2015, pp. 282–294.
[152] D. Bhamare, T. Salman, M. Samaka, A. Erbad, and R. Jain, “Feasibility of supervised machine learning for cloud security,” in Proc. IEEE Int. Conf. Inf. Sci. Security (ICISS), Pattaya, Thailand, 2016, pp. 1–5.
[153] H. Gharaee and H. Hosseinvand, “A new feature selection IDS based on genetic algorithm and SVM,” in Proc. 8th Int. Symp. Telecommun. (IST), Tehran, Iran, 2016, pp. 139–144.
[154] M. N. Chowdhury, K. Ferens, and M. Ferens, “Network intrusion detection using machine learning,” in Proc. Int. Conf. Security Manag. (SAM), 2016, pp. 1–7.
[155] N. Moustafa and J. Slay, “A hybrid feature selection for network intrusion detection systems: Central points,” in Proc. 16th Aust. Inf. Warfare Conf., 2015, pp. 1–10.
[156] A. M. Chandrasekhar and K. Raghuveer, “Intrusion detection technique by using k-means, fuzzy neural network and SVM classifiers,” in Proc. IEEE Int. Conf. Comput. Commun. Informat. (ICCCI), Coimbatore, India, 2013, pp. 1–7.
[157] H. H. Jazi, H. Gonzalez, N. Stakhanova, and A. A. Ghorbani, “Detecting HTTP-based application layer DoS attacks on Web servers in the presence of sampling,” Comput. Netw., vol. 121, pp. 25–36, Jul. 2017.
[158] H. Wang, J. Gu, and S. Wang, “An effective intrusion detection framework based on SVM with feature augmentation,” Knowl. Based Syst., vol. 136, pp. 130–139, Nov. 2017.
[159] Akashdeep, I. Manzoor, and N. Kumar, “A feature reduced intrusion detection system using ANN classifier,” Expert Syst. Appl., vol. 88, pp. 249–257, Dec. 2017.
[160] M. A. Ambusaidi, X. He, P. Nanda, and Z. Tan, “Building an intrusion detection system using a filter-based feature selection algorithm,” IEEE Trans. Comput., vol. 65, no. 10, pp. 2986–2998, Oct. 2016.
[161] S. M. H. Bamakan, H. Wang, T. Yingjie, and Y. Shi, “An effective intrusion detection framework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization,” Neurocomputing, vol. 199, pp. 90–102, Jul. 2016.
[162] P. J. Van Laarhoven and E. H. Aarts, “Simulated annealing,” in Simulated Annealing: Theory and Applications (Mathematics and Its Applications), vol. 37, P. J. van Laarhoven and E. H. Aarts, Eds. Dordrecht, The Netherlands: Springer, 1987, pp. 7–15.
[163] A. A. Olusola, A. S. Oladele, and D. O. Abosede, “Analysis of KDD ’99 intrusion detection dataset for selection of relevance features,” in Proc. World Congr. Eng. Comput. Sci., vol. 1, 2010, pp. 20–22.
[164] H. G. Kayacik, A. N. Zincir-Heywood, and M. I. Heywood, “Selecting features for intrusion detection: A feature relevance analysis on KDD 99 intrusion detection datasets,” in Proc. 3rd Annu. Conf. Privacy Security Trust, 2005, pp. 1–6.
[165] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “Securing virtual machines from anomalies using program-behavior analysis in cloud environment,” in Proc. IEEE 18th Int. Conf. High Perform. Comput. Commun., 2016, pp. 991–998.
[166] (2016). Weka 3.8.1: Data Mining Software in Java. [Online]. Available: [Link]
[167] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Oct. 2011.
[168] Google. (2017). Installing Tensorflow. [Online]. Available: [Link]
[169] [Link]. (2017). Knime 3.4.1: Download Knime Analytics Platform & SDK. [Online]. Available: [Link] downloads
[170] RapidMiner. (2017). Real Data Science, Fast and Simple (Stable Release 7.5). [Online]. Available: [Link]
[171] E. Achtert, H.-P. Kriegel, and A. Zimek, “ELKI: A software system for evaluation of subspace clustering algorithms,” in Scientific and Statistical Database Management (LNCS 5069), B. Ludäscher and N. Mamoulis, Eds. Heidelberg, Germany: Springer, 2008, pp. 580–585.
[172] University of Waikato. (2014). MOA (Massive Online Analysis). [Online]. Available: [Link]
[173] A. Bifet and G. D. F. Morales, “Big data stream learning with SAMOA,” in Proc. IEEE Int. Conf. Data Min. Workshop (ICDMW), Shenzhen, China, 2014, pp. 1199–1202.
[174] X. Meng et al., “MLlib: Machine learning in apache spark,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 1235–1241, 2016.
[175] L. Deng and D. Yu, “Deep learning: Methods and applications,” Found. Trends® Signal Process., vol. 7, no. 3–4, pp. 197–387, 2014.
[176] E. Aminanto and K. Kim, “Deep learning in intrusion detection system: An overview,” in Proc. Int. Res. Conf. Eng. Technol., 2016, pp. 1–12.
[177] J. Saxe and K. Berlin, “Deep neural network based malware detection using two dimensional binary program features,” in Proc. 10th IEEE Int. Conf. Malicious Unwanted Softw. (MALWARE), 2015, pp. 11–20.
[178] Z. Wang, “The applications of deep learning on traffic identification,” presented at the BlackHat, Las Vegas, NV, USA, 2015, pp. 1–10.
[179] Y. Li, R. Ma, and R. Jiao, “A hybrid malicious code detection method based on deep learning,” Int. J. Softw. Eng. Appl., vol. 9, no. 5, pp. 205–216, 2015.
[180] W. Yan and L. Yu, “On accurate and reliable anomaly detection for gas turbine combustors: A deep learning approach,” in Proc. Annu. Conf. Prognostics Health Manag. Soc., 2015, pp. 1–8.
[181] N. Gao, L. Gao, Q. Gao, and H. Wang, “An intrusion detection model based on deep belief networks,” in Proc. IEEE 2nd Int. Conf. Adv. Cloud Big Data (CBD), Huangshan, China, 2014, pp. 247–252.
[182] W. Jung, S. Kim, and S. Choi, “Poster: Deep learning for zero-day flash malware detection,” in Proc. 36th IEEE Symp. Security Privacy, 2015, pp. 1–2.
[183] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[184] U. Fiore, F. Palmieri, A. Castiglione, and A. De Santis, “Network anomaly detection with the restricted Boltzmann machine,” Neurocomputing, vol. 122, pp. 13–23, Dec. 2013.
[185] Z. Yuan, Y. Lu, and Y. Xue, “Droiddetector: Android malware characterization and detection using deep learning,” Tsinghua Sci. Technol., vol. 21, no. 1, pp. 114–123, 2016.
[186] Y. Wang, W.-D. Cai, and P.-C. Wei, “A deep learning approach for detecting malicious JavaScript code,” Security Commun. Netw., vol. 9, no. 11, pp. 1520–1534, 2016.
[187] M. A. Salama, H. F. Eid, R. A. Ramadan, A. Darwish, and A. E. Hassanien, “Hybrid intelligent intrusion detection scheme,” in Soft Computing in Industrial Applications (Advances in Intelligent and Soft Computing), vol. 96, A. Gaspar-Cunha, R. Takahashi, G. Schaefer, and L. Costa, Eds. Heidelberg, Germany: Springer, 2011, pp. 293–303.
[188] S. Seok and H. Kim, “Visualized malware classification based-on convolutional neural network,” J. Korea Inst. Inf. Security Cryptol., vol. 26, no. 1, pp. 197–208, 2016.
[189] B. Dong and X. Wang, “Comparison deep learning method to traditional methods using for network intrusion detection,” in Proc. 8th IEEE Int. Conf. Commun. Softw. Netw. (ICCSN), Beijing, China, 2016, pp. 581–585.
[190] K. Malialis and D. Kudenko, “Distributed response to network intrusions using multiagent reinforcement learning,” Eng. Appl. Artif. Intell., vol. 41, pp. 270–284, 2015.
[191] A. Servin and D. Kudenko, “Multi-agent reinforcement learning for intrusion detection,” in Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning (LNCS 4865), K. Tuyls, A. Nowe, Z. Guessoum, and D. Kudenko, Eds. Heidelberg, Germany: Springer, 2008, pp. 211–223.
[192] X. Xu and T. Xie, “A reinforcement learning approach for host-based intrusion detection using sequences of system calls,” in Advances in Intelligent Computing. ICIC 2005 (LNCS 3644). Heidelberg, Germany: Springer, 2005, pp. 995–1003.
[193] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.
[194] UNM. (1998). UNM Dataset. [Online]. Available: [Link] [Link]/immsec/[Link]
[195] E. Hodo, X. J. A. Bellekens, A. Hamilton, C. Tachtatzis, and R. C. Atkinson, “Shallow and deep networks intrusion detection system: A taxonomy and survey,” ACM Survey, 2017. [Online]. Available: [Link]
[196] W. Qiang and Z. Zhongli, “Reinforcement learning model, algorithms and its application,” in Proc. IEEE Int. Conf. Mechatronic Sci. Elect. Eng. Comput. (MEC), 2011, pp. 1143–1146.
[197] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI, 2016, pp. 2094–2100.

Preeti Mishra (M’14) received the Ph.D. degree in computer science and engineering from the Malaviya National Institute of Technology Jaipur, India, in 2017, under the supervision of Dr. Emmanuel S. Pilli and Prof. V. Varadharajan. She is currently an Associate Professor with Graphic Era University, Dehradun, India. She has been a Visiting Scholar at Macquarie University, Sydney, NSW, Australia, in 2015. Her area of interest includes Cloud security, E-mail security, Network security, and IoT.

Vijay Varadharajan is the Global Innovation Chair Professor with the University of Newcastle, Australia and the Director of the Advanced Cyber Security Research Centre. He has published over 380 papers in international journals and conferences, ten books on information technology, security, networks, and distributed systems, and has held three patents. He has been/is on the Editorial Board of several journals including ACM Transactions on Information and System Security, the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, and the IEEE TRANSACTIONS ON CLOUD COMPUTING.

Uday Tupakula (M’12) received the Ph.D. degree in 2016, under the supervision of Prof. Varadharajan. He is a Senior Lecturer with the University of Newcastle, Australia. He has 75 publications in different research areas such as network security, DDoS attacks, MANET security, and secure virtual systems. He is a member of BCS and ACM.

Emmanuel S. Pilli (SM’16) received the Ph.D. degree from IIT Roorkee, Roorkee, in 2012. He is currently an Associate Professor with the Malaviya National Institute of Technology, Jaipur, India. He has 20 years of teaching, research, and administrative experience. His areas of interest include Security and Forensics, Cloud computing, Big data, and IoT. He is also a Senior Member of ACM and CSI and actively involved in Cloud Computing Innovation Council of India, NIST Cloud Forensic Workgroup.
