Developing Realistic Distributed Denial of Service
(DDoS) Attack Dataset and Taxonomy
Iman Sharafaldin, Arash Habibi Lashkari, Saqib Hakak and Ali A. Ghorbani
(isharafa; a.habibi.l; saqib.hakak; ghorbani)@unb.ca
Canadian Institute for Cybersecurity (CIC), Faculty of Computer Science,
University of New Brunswick (UNB), Fredericton, NB, Canada
Abstract—Distributed Denial of Service (DDoS) attack is a menace to network security that aims at exhausting the target networks with malicious traffic. Although many statistical methods have been designed for DDoS attack detection, designing a real-time detector with low computational overhead is still one of the main concerns. On the other hand, the evaluation of new detection algorithms and techniques heavily relies on the existence of well-designed datasets. In this paper, first, we review the existing datasets comprehensively and propose a new taxonomy for DDoS attacks. Secondly, we generate a new dataset, namely CICDDoS2019, which remedies all current shortcomings. Thirdly, using the generated dataset, we propose a new detection and family classification approach based on a set of network flow features. Finally, we provide the most important feature sets to detect different types of DDoS attacks with their corresponding weights.

Index Terms—DDoS, IDS, DDoS Dataset, DDoS taxonomy, Network Traffic

978-1-7281-1576-4/19/$31.00 © 2019 IEEE

I. INTRODUCTION

Internet security is one of the most important challenges, especially when the demand for IT services increases daily. Among the many existing threats, the Distributed Denial of Service (DDoS) attack is a relatively simple but very powerful technique to attack intranet and Internet resources. Usually, in this attack, legitimate users are deprived of web-based services by a large number of compromised machines. DDoS attacks can be implemented in the network, transport and application layers using different protocols, such as TCP, UDP, ICMP and HTTP.

Many researchers have been working with different techniques, such as Machine Learning (ML), knowledge-based, and statistical methods, to propose detection and defense mechanisms to combat the problem. On the one hand, each proposed method has different problems and shortcomings. For example, statistical methods are not able to determine with certainty the normal network packet distribution. ML techniques are good as they do not require any prior known data distribution, but defining the best feature set is one of the main concerns for them [1].

On the other hand, since 2007, Subbulakshmi et al. [2], Prasad and Rao [3], CAIDA UCSD [4], DARPA 2000 [5], Brown et al. [6], Singh and De [7], Yu et al. [8], and Shiravi et al. [9] tried to develop DDoS datasets. But due to many shortcomings and problems, such as incomplete traffic, anonymized data, and out-dated attack scenarios, researchers still struggle to find comprehensive and valid datasets to test and evaluate their proposed detection and defense models. So, having a suitable dataset is a significant challenge itself [2].

Based on these two main concerns, our contributions in this paper are twofold. Firstly, we analyze the existing datasets to find their main shortcomings and limitations. Then, we present our approach for generating a new DDoS dataset called CICDDoS2019, which remedies the shortcomings and limitations of previous datasets. The dataset is completely labelled, and 80 network traffic features have been extracted and calculated for all benign and denial of service flows using the CICFlowMeter software, which is publicly available on the Canadian Institute for Cybersecurity website [10]. Secondly, the paper analyzes the generated dataset to propose the best feature sets to detect different types of DDoS attacks, including reflective DDoS (such as DNS, LDAP, MSSQL, and TFTP), UDP, UDP-Lag and SYN. Also, we build our models to capture the patterns in the training data using four common machine learning algorithms, namely ID3, random forest, Naïve Bayes, and logistic regression. We test them using the testing component. The remaining part of this paper is organized as follows: Section II explains the available datasets, Section III presents our proposed taxonomy, Section IV describes the experiments, Section V reports the dataset, Section VI illustrates the analysis, and Section VII is the conclusion.

II. AVAILABLE DATASETS

In this section, we evaluate several publicly available DDoS attack datasets spanning 2007 to 2018 and explain the need for a comprehensive and reliable dataset to test and validate DDoS attack detection systems.

The CAIDA “DDoS Attack 2007” Dataset [4] contains an hour of traffic traces. They are in the pcap format, and detail attack traffic to the victim, as well as responses to the attack from the victim. The traces are anonymized using CryptoPAn prefix-preserving anonymization with a single key. The payload has been removed from all packets. They note that tracebacks are difficult to gather organically due to the ease of IP spoofing, the stateless nature of IP routing, link layer or MAC address spoofing and modern attack tools, featuring easy implementation of intelligent attack techniques.

In [3] a survey of different types of attacks and techniques of
DDoS attacks and their countermeasures is conducted. They outline the two essential types of DDoS attacks as: vulnerability attacks, where attackers send malformed packets to confuse a protocol or an application; and flooding attacks, where either network, transport-level, or application-level flooding interrupts a legitimate user’s connectivity or services. Many defense techniques are analyzed, with their respective advantages and drawbacks and with varying deployment locations. Traceback mechanisms are also surveyed, including the IP traceback for flooding attacks on Internet Threat Monitors using Honeypots. This method balances comparatively low overhead and no direct server damage with processing delays and costs. They conclude that a practical defense for a real-time network is difficult to design, and perfect detection is not possible. Various performance parameters need to be considered and balanced.

The DARPA dataset [5] consists of three unique datasets, each produced as per consensus from the Wisconsin Re-Think meeting and the July 2000 Hawaii PI meeting. The first dataset, LLDOS 1.0, includes a DDoS attack run by a novice attacker against a naive defender. The second dataset, LLDOS 2.0.2, includes a DDoS attack run by a more stealthy attacker, although still a novice, against a naive defender. The third dataset is a Windows NT Attack Data Set, which includes NT auditing of one day’s traffic and attack impinging on the NT machine. Although this is a universal dataset, it is produced by a naive attacker, and does not explore the techniques of experienced attackers.

Brown et al. [6] analyze the 1999 synthetic database commissioned by Lincoln Laboratory and DARPA to serve as a benchmark for evaluation of intrusion detection systems, and determined that the dataset has little bearing on real world attacks. The authors developed NetADHICT, a tool for understanding the structure of network traffic.

Singh et al. [7] focus on application layer DoS and DDoS attacks. By studying and using application layer attack tools such as RUDY and slowloris, they were able to extract parameters from packet captures that serve as flags for DDoS attacks. They use a small number of items (79) for calculating the confusion matrix. They use Weka 3.6 for the classification of the dataset and computing the threshold. They determine that it is not possible to achieve a complete defense against these attacks in a single stage. They also suggest using a blacklist of IP addresses to prevent DDoS attacks. They propose looking at larger scale attacks with more classifiers in the future.

Subbulakshmi et al. [2] generate a DDoS dataset in a testbed of LAN-connected systems, which collects 14 attributes from 10 types of DDoS attack classes. They then use Enhanced Multi Class Support Vector Machines (EMCSVM) to compare their dataset with the KDD Cup 99 dataset, which has only six types of attacks. The attacks are generated artificially using attack generation scripts, and include 10 types of flooding attacks. Again, as this is not real traffic, their results can only be as accurate as their simulation. They propose deriving more attributes from a larger number of attacks in the future.

Yu et al. [8] define a flow correlation coefficient, which they use to detect flash crowd DDoS attacks. This type of defense is based on the principle that the flow standard deviation is lower for an attack than for legitimate traffic. Testing their method on traffic generated by the DDoS tool Mstream and a couple of legitimate captured flash crowds, as well as simulated traffic, they determined that they could detect flash crowd flooding attacks as well as distinguish them from legitimate flash crowds. This is based on botnets and attack tools available in 2012. They propose looking into the possibility of super botnets capable of having a one-to-one bot to legitimate user relationship when conducting mimicking attacks. Taking into account that the current conditions do not reflect those of the past, we must explore the attacker’s actions since this study, and develop new methods of detection based on current attacker behavior.

Shiravi et al. [9] advocate for a dynamic dataset to gain real-world insight into what types of attacks are being orchestrated on a specific network. They then go on to establish a set of guidelines to obtain a valid dataset in terms of realism, evaluation capabilities, total capture, completeness, and malicious activity. The goal is to establish profiles for malicious activities which can be used in future network intrusion detection systems. They created a testbed network consisting of 21 interconnected Windows workstations, three servers and devices running Tcpdump, Snort, QRadar, OSSIM, and Ntop. With this setup they were able to capture virtually all communications, and in part of their dataset they covered DoS attacks.

However, none of the mentioned datasets have executed and captured modern reflective DDoS attacks such as NTP, NetBIOS, SSDP, UDP-Lag, and TFTP. Also, most of them anonymized the traffic and removed the payloads, which are one of the important parts of analyzing network packets. To put it briefly, most of the above datasets fall short in terms of dataset origin (synthetic instead of real-world), complete traffic, attack diversity, data source heterogeneity, complete interaction, and complete capture.

III. DDOS ATTACKS TAXONOMY

There are a number of survey studies that have proposed taxonomies with respect to DDoS attacks. Mirkovic and Reiher [11] presented taxonomies for classification of DDoS attacks and possible defense mechanisms. The attacks were categorized by: automation, vulnerability, source address validity, attack rate dynamics, characterization, persistence of agents, victim, and impact on the victim. In automation-based methods, the attacker searches for a vulnerable machine manually or automatically. The authors explored the classification of DDoS defense mechanisms based on activity level, cooperation degree (performing defensive measures either alone or in cooperation with other entities in the Internet), and deployment location.

Asosheh and Ramezani [12] proposed a taxonomy based on known potential attacks and categorized attacks based on eight features, namely architecture, degree of automation, impact, vulnerability, attack rate dynamics, scanning strategy,
propagation strategy, and packet content. The authors also proposed a taxonomy for defense mechanisms and categorized defense strategies into two groups, i.e. prevention and detection. The authors claimed that the best strategy to prevent/detect DDoS attacks is by focusing on deployment from where an attack originated, i.e. target network (original source of attack) and intermediate network (secondary targets). The authors concluded the article by proposing a framework that can detect DDoS attacks automatically using a cluster-based algorithm such as k-nearest neighbor. However, no experiments were conducted to validate the proposed classification.

Bhardwaj et al. [13] focused on DDoS attacks in the cloud computing paradigm. The authors surveyed the articles published from the year 2009 to 2015. The authors propose a taxonomy for the various potential DDoS attacks with four categories: degree of automation, vulnerability, attack rate dynamics, and attack impact. Although a similar classification has been proposed by [11], the difference in this article lies in the analysis of parameters for effective DDoS detection. A few key DDoS detection parameters that have been identified include real-time response, throughput, request, response time and zero-day attack detection ability.

The work carried out by Masdari and Jalali [14] focused on an in-depth analysis of DDoS attacks in cloud computing. The authors illustrate the major types of DDoS attacks by identifying the vulnerabilities that lead to these attacks and finally classify the DDoS attacks based on cloud components, i.e. virtual machines, cloud scheduler, hypervisor, web services, cloud customers, IaaS and SaaS-based attacks. The major DDoS attacks in cloud computing have been identified as bandwidth attacks, connectivity attacks, resource exhaustion, limitation exploitation, process disruption, data corruption, and physical disruption. The study concludes that the severity of DDoS attacks is greater on cloud computing due to more available resources compared to traditional networks.

In another study conducted by Singh et al. [17], the authors have provided a comprehensive analysis of HTTP-GET flood DDoS attacks. The authors have carried out a systematic study that provides a brief overview of HTTP-GET flood attacks including their operation and attack strategies. The articles used for the systematic survey have been taken from six standard electronic databases which include ACM digital library, IEEE Xplore, ScienceDirect, Wiley, and Google scholar. The authors have categorized attack strategies into high rate and low rate. High rate includes those kinds of attacks where bots utilize their full capacity to attack the victim, while low rate includes attacks with low request rates by bots. The high rate attacks are further classified into server load and target webpage attacks, while low rate attacks are divided into symmetric and asymmetric attacks.

The primary features of the above-mentioned works are shown in Table I. Although all the mentioned studies have done commendable work in proposing new taxonomies, the scope of attacks has so far been limited. There is a need to identify new attacks and come up with new taxonomies. Hence, we have analyzed new attacks that can be carried out using TCP/UDP based protocols at the application layer and proposed a new taxonomy. The rest of this section explains the detailed taxonomy of DDoS attacks, illustrated in Figure 1, in terms of reflection-based and exploitation-based attacks.

Reflection-based DDoS: These are attacks in which the identity of the attacker remains hidden by utilizing a legitimate third-party component. The packets are sent to reflector servers by attackers with the source IP address set to the target victim’s IP address to overwhelm the victim with response packets. These attacks can be carried out through application layer protocols using transport layer protocols, i.e. Transmission Control Protocol (TCP), User Datagram Protocol (UDP) or through a combination of both. As Figure 1 shows, in this category, TCP-based attacks include MSSQL and SSDP, while UDP-based attacks include CharGen, NTP and TFTP. There are certain attacks that can be carried out using either TCP or UDP, like DNS, LDAP, NETBIOS, and SNMP [15], [16].

Exploitation-based attacks: These are attacks in which the identity of the attacker remains hidden by utilizing a legitimate third-party component. The packets are sent to reflector servers by attackers with the source IP address set to the target victim’s IP address to overwhelm the victim with response packets. These attacks can also be carried out through application layer protocols using transport layer protocols, e.g. TCP and UDP. TCP-based exploitation attacks include SYN flood, and UDP-based attacks include UDP flood and UDP-Lag.

A UDP flood attack is initiated on the remote host by sending a large number of UDP packets. These UDP packets are sent to random ports on the target machine at a very high rate. As a result, the available bandwidth of the network gets exhausted, the system crashes and performance degrades. On the other hand, SYN flood also consumes server resources by exploiting the TCP three-way handshake. This attack is initiated by sending repeated SYN packets to the target machine until the server crashes or malfunctions.

The UDP-Lag attack is a kind of attack that disrupts the connection between the client and the server. This attack is mostly used in online gaming where the players want to slow down or interrupt the movement of other players to outmaneuver them. This attack can be carried out in two ways, i.e. using a hardware switch known as a lag switch or by a software program that runs on the network and hogs the bandwidth of other users.

IV. EXPERIMENTS

To create a comprehensive testbed, we have designed and implemented two networks, namely Attack-Network and Victim-Network. The Victim-Network is a highly secure infrastructure with firewall, router, switches, and several common operating systems along with an agent that provides the benign behaviors on each PC. The Attack-Network is a completely separated third party infrastructure that executes different
types of DDoS attacks. The following sections discuss the infrastructure, benign profile agent and attack scenarios.

Figure 1: DDoS Attack Taxonomy

Table I: Primary features of the related works
Authors | OSI-Layer | Network-based Environment | Known attacks/potential threats | Defense mechanism
[11] | ✗ | ✗ | ✓ | ✓
[12] | ✗ | ✗ | ✓ | ✓
[13] | ✗ | Cloud computing | ✓ | ✓
[14] | Application, Network, Transport | Cloud computing | ✓ | ✓
[17] | Application | ✗ | ✓ | ✗

A. Testbed Architecture

As Figure 2 shows, the testbed consists of two completely separated networks. Unlike the previous datasets, in the Victim-Network, we employ all commonly used and necessary equipment including router, firewall, and switch, along with the different versions of the commonly used operating systems. Table II shows the list of servers, firewall and workstations, with their operating systems and related public and private IPs in the training and testing days. A third party has executed the attack families in the training and testing days (Attack-Network). The Victim-Network consists of one server (Web server), one firewall, two switches and four PCs. Also, one port in the main switch of the Victim-Network has been configured as the mirror port and completely captures all traffic sent to and received from the network.

Table II: Victim-Network Operating Systems and IPs
Machine | OS | IPs
Server | Ubuntu 16 (Web Server) | 192.168.50.01 (Train), 192.168.50.04 (Test)
Firewall | Fortinet | 205.174.165.81
PCs (Training day) | Win 7 Pro | 192.168.50.8
PCs (Training day) | Win Vista | 192.168.50.5
PCs (Training day) | Win 8.1 | 192.168.50.6
PCs (Training day) | Win 10 (Pro 32) | 192.168.50.7
PCs (Testing day) | Win 7 Pro | 192.168.50.9
PCs (Testing day) | Win Vista | 192.168.50.6
PCs (Testing day) | Win 8.1 | 192.168.50.7
PCs (Testing day) | Win 10 (Pro 32) | 192.168.50.8

Figure 2: Testbed Architecture

B. Benign Profile Agent

Generating realistic background traffic is one of the highest priorities of this work. For this dataset, we used our proposed B-Profile approach [18], which is responsible for profiling the abstract behavior of human interactions and generating naturalistic benign background traffic. Our B-Profile for this dataset extracts the abstract behavior of 25 users based on the HTTP, HTTPS, FTP, SSH, and email protocols. At first, it encapsulates network events produced by users with machine learning and statistical analysis techniques. The encapsulated features are distributions of packet sizes of a protocol, the number of packets per flow, certain patterns in the payload, the size of the payload, and request time distribution of protocols. Then, after deriving the B-Profiles from users, an agent which has been developed in Java is used to generate realistic benign events and simultaneously simulate B-Profile behavior on the Victim-Network for the predefined five protocols.

C. Attack Profiles

Since our proposed dataset is intended for testing DDoS attack detection techniques, it should cover a diverse set of DDoS attack techniques and scenarios. In this dataset, we created 11 different DDoS attack profiles listed in Table III. These
attacks are based on the proposed taxonomy (Section III) and were executed using related tools and packages available from third parties. As Figure 1 shows, we categorize these attacks into reflection-based and exploitation-based attacks from the transport and application layers.

Table III: Daily Label of Dataset
Days | Attacks | Attack times
Testing Set | PortMap | 09:43 - 09:51
Testing Set | NetBIOS | 10:00 - 10:09
Testing Set | LDAP | 10:21 - 10:30
Testing Set | MSSQL | 10:33 - 10:42
Testing Set | UDP | 10:53 - 11:03
Testing Set | UDP-Lag | 11:14 - 11:24
Testing Set | SYN | 11:28 - 17:35
Training Set | NTP | 10:35 - 10:45
Training Set | DNS | 10:52 - 11:05
Training Set | LDAP | 11:22 - 11:32
Training Set | MSSQL | 11:36 - 11:45
Training Set | NetBIOS | 11:50 - 12:00
Training Set | SNMP | 12:12 - 12:23
Training Set | SSDP | 12:27 - 12:37
Training Set | UDP | 12:45 - 13:09
Training Set | UDP-Lag | 13:11 - 13:15
Training Set | WebDDoS (ARME) | 13:18 - 13:29
Training Set | SYN | 13:29 - 13:34
Training Set | TFTP | 13:35 - 17:15

V. DATASET

The capturing period for the training day on January 12th started at 10:30 and ended at 17:15, and for the testing day on March 11th started at 09:40 and ended at 17:35. Attacks were subsequently executed during this period. As Table III shows, we executed 12 DDoS attacks, including NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN and TFTP, on the training day and 7 attacks, including PortScan, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag and SYN, on the testing day. PortScan was executed only on the testing day and will be unknown for evaluating the proposed model. (The dataset is publicly available at http://www.unb.ca/cic/datasets/CICDDoS2019)

VI. ANALYSIS

At first we extract the 80 traffic features from the dataset using CICFlowMeter [19], [10]. Afterwards, to select the best detection feature set for each DDoS attack, we test these extracted features using RandomForestRegressor. Then, we examine the performance and accuracy of the selected features with four common machine learning algorithms based on training and testing data.

For extracting the network traffic features, we used the CICFlowMeter [19], [10], which is a flow-based feature extractor and can extract 80 features from a pcap file. The flow label in this application includes source IP, source port, destination IP, destination port, protocol and time stamp. Then we labeled the generated flows based on the attack schedule (timestamp) that is explained in Section V. All 80 extracted features are defined and explained on the CICFlowMeter webpage [19]. We used the RandomForestRegressor class of scikit-learn [20]. First, we calculate the importance of each feature in the whole dataset; then we obtain the final result by multiplying the average standardized mean value of each feature, split on each class, by the corresponding feature importance value.

Table IV shows the list of the best selected features and the corresponding weight of each section. Also, we depict Radviz diagrams for different kinds of network traffic. Through Radviz, an N-dimensional dataset is projected into a 2D space wherein each dimension is represented in relation to the influence of all dimensions. We can discover interesting characteristics of different DDoS attacks from these diagrams. As we can see in Figure 3, ‘packet Length Std’ is one of the most influential features for benign traffic. One of the reasons is that we have more variation in the size of packets in benign traffic in comparison to different DDoS attacks, because DDoS attacks are conducted by automated tools and botnets, which usually produce fixed-size or similar packets. Also, as shown in Figure 12, the two most influential features are ‘ACK Flag Count’ and ‘Flow Duration’, because this attack works by not responding to the server with the expected ACK code (it exploits a
TCP protocol weakness). Moreover, as shown in Figure 6, the two main influential features are ‘Protocol’ and ‘Fwd Packets/s’, which makes sense, because the attacker abuses the Microsoft SQL Server Resolution Protocol (MC-SQLR) and sends millions of packets to the victim. Furthermore, IAT related features are considered influential features in many attacks such as DrDoS-TFTP, DrDoS-UDP-lag and DrDoS-NTP. One of the reasons is that many DDoS attacks show bursty behavior in sending packets to the victims, unlike benign traffic, which usually does not show any bursty behavior. The bursty behavior affects the arrival rate, and so it affects IAT related features, and this can be a reason why they are influential features for detecting DDoS attacks.

On the other hand, when TCP accepts data from a data stream, it first divides it into chunks and then adds a TCP header that finally will create a TCP segment. As we know, the resource battle between the victim’s machine and the attacker’s machine is one of the key features of DDoS attacks. It means that, to make a successful DDoS attack, an attacker needs to send more packets than the victim can handle. Also, attackers use different types of packets, such as SYN or ICMP packets, to send many malicious packets, all of which are similar in size and small because of the low cost of computing resources, but are not like the packets in a benign flow. So, the minimum segment size of the packets in a malicious flow would be less than that of the packets in a benign flow.

Figure 3: Benign Figure 4: DrDoS-DNS Figure 5: DrDoS-LDAP Figure 6: DrDoS-MSSQL
Figure 7: DrDoS-NTP Figure 8: DrDoS-NetBIOS Figure 9: DrDoS-SNMP Figure 10: DrDoS-SSDP
Figure 11: DrDoS-UDP Figure 12: SYN Figure 13: DrDoS-TFTP Figure 14: UDP-lag

For the next step of our analysis, we have used four common machine learning algorithms, namely ID3, Random Forest (RF), Naïve Bayes, and logistic regression, along with three common machine learning evaluation metrics:

• Precision (Pr) or Positive Predictive Value: the ratio of correctly classified attack flows (TP) to all flows classified as attacks (TP+FP).
• Recall (Rc) or Sensitivity: the ratio of correctly classified attack flows (TP) to all generated attack flows (TP+FN).
• F-Measure (F1): the harmonic combination of precision and recall into a single measure.

Pr = \frac{TP}{TP + FP}, \quad Rc = \frac{TP}{TP + FN}, \quad F1 = \frac{2}{\frac{1}{Pr} + \frac{1}{Rc}}

ID3 is an algorithm designed by Ross Quinlan [21] to generate a decision tree from a training dataset. ID3 uses the concept of entropy (or information gain) to find the best attributes on which to split the dataset recursively and build the decision tree. Entropy is a measure of the amount of uncertainty in the set S:

H(S) = -\sum_{x \in X} p(x) \log p(x)    (1)

where X is the set of classes in S and p(x) represents the proportion of the number of elements in class x to the number of elements in the set S. Also, H(S) = 0 means that all elements in S have the same label.
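As a rough illustration of Eq. (1), the following Python sketch computes the entropy of a list of class labels. The function name `entropy`, the toy labels, and the base-2 logarithm are assumptions made for this example; the text above does not fix the base of the logarithm.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a collection of class labels (illustrative helper).

    Assumption: base-2 logarithm, so the result is measured in bits.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # p(x) = n / total is the share of elements of S carrying label x.
    # -p * log2(p) is written as p * log2(1 / p), so a pure set gives 0.0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# A set with a single label is perfectly pure.
pure = entropy(["SYN", "SYN", "SYN"])
# Two equally frequent labels carry one bit of uncertainty.
mixed = entropy(["SYN", "BENIGN", "SYN", "BENIGN"])
print(pure, mixed)
```

A pure set gives zero, which matches the remark that H(S) = 0 means all elements share one label, while an even two-class split gives the maximum for two classes.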
Table IV: Testing Dataset based on Training Part Moreover, in order to have a measure to find the difference in
entropy from before to after the set S is split on an attribute
Name Feature Weight Mean
ACK Flag Count 0.125438 0.86545908 A, we can use information gain I(S, A) that can be calculated
Init Win bytes forward 0.002093 5061.324632 by the following formula:
UDP-lag min seg size forward 0.000795 -3967113.958
Fwd IAT Mean 0.000612 1109423.738 I(S, A) = H(S) − H(S|A) (2)
Fwd IAT Max 0.000471 3310515.822
Fwd IAT Mean 0.000207 541101.4798 Where H(t) is the entropy of subset t.
min seg size forward 0.000198 -34648586.17
TFTP Fwd IAT Max 0.000151 1562350.615
Random Forest (RF) [22] is a machine learning algorithm
Flow IAT Max 0.000129 1562493.187 that combine two ideas of decision tree and ensemble learning.
Flow IAT Mean 0.000124 540958.2437 The forest contains many decision trees that use randomly
ACK Flag Count 0.043991 0.330296128 picked data attributes as their input. The forest has a collection
Init Win bytes forward 0.009357 21747.29613
WebDDoS Fwd Packet Length Std 0.002881 62.69322463 of trees with controlled variance. Finally, the result of a
Packet Length Std 0.002068 108.2461488 classification can be decided by majority voting or weighted
min seg size forward 0.000872 32.00 voting. One of the advantages of random forest is that the
Max Packet Length 1.139858 1378.802657
Fwd Packet Length Max 0.127708 1378.773093 variance of the model decreases as the number of trees in
DNS Fwd Packet Length Min 0.007794 1378.522451 the forest increases, while the bias remains the same. Also,
Average Packet Size 0.005849 2067.363444 random forests has many other advantages such as low number
Min Packet Length 0.003487 1378.521706
ACK Flag Count 0.020021 0.172801294 of parameters and resistance to over-fitting.
Flow IAT Min 0.016769 11817.55752 Naı̈ve Bayes is a probabilistic classifier based on Bayes The-
Benign Init Win bytes forward 0.003182 7560.598157 orem with strong independence assumptions between features.
Fwd Packet Length Std 0.001786 39.79290701
Packet Length Std 0.001678 88.27355725 We can decompose the conditional property by using Bayes’
MSSQL
Fwd Packets/s 0.000204 1676967.356 theorem as following:
Protocol 4.60E-05 N/A
Max Packet Length 1.278323 1463.73827 P (X | Ck ) P (Ck )
Fwd Packet Length Max 0.143219 1463.728043
P (Ck | X) = (3)
P (X)
LDAP Fwd Packet Length Min 0.008736 1463.717027
Average Packet Size 0.006532 2194.906258 Where X = (x1 , ..., xn ) represents a vector of n independent
features and Ck represents each class. The assumption that features are uncorrelated with each other does not hold in many problems, and violating it can adversely affect the accuracy of the classifier. The main advantage of Naïve Bayes is that it is an online algorithm and its training can be completed in linear time.

Multinomial Logistic Regression is a classification method that extends the main idea of logistic regression to multiclass problems. Like other regression analyses, logistic regression is a predictive analysis; it can describe data and explain the relationship between features and classes.

Table V shows the performance examination results in terms of the weighted average of our evaluation metrics for the four selected common machine learning algorithms, derived from the generated dataset. We used five-fold cross validation for our experiments. In our experiments, ID3 took a few minutes to be trained and to classify the testing set. Random forest with 100 trees took more than 15 hours for the same process, and multinomial logistic regression took more than 2 days to be trained and to classify the testing set.

In addition, according to the weighted average of the three evaluation metrics (Pr, Rc, F1), the highest scores belong to the random forest and ID3 algorithms. In terms of recall, ID3 ranks first by a clear margin. Logistic regression achieves the worst result overall. Considering both the execution time and the evaluation metrics, ID3 is the best algorithm, with the shortest execution time and the highest scores.

          Min Packet Length         0.003909    1463.716847
NetBIOS   Fwd Packets/s             0.000172    1580350.263
          min seg size forward      7.20E-05    -41140208.12
          Protocol                  4.60E-05    N/A
          Fwd Header Length         3.50E-05    -82335716.29
          Fwd Header Length.1       3.20E-05    -82335716.29
NTP       Subflow Fwd Bytes         0.106481    27903.20181
          Length of Fwd Packets     0.058022    27903.20181
          Fwd Packet Length Std     0.001081    25.03898396
          min seg size forward      0.000707    -8471708.317
          Flow IAT Min              0.000573    438.699766
SSDP      Destination Port          0.000671    33266.62516
          Fwd Packet Length Std     0.000597    14.90294147
          Packet Length Std         0.000232    14.2504337
          Protocol                  4.60E-05    N/A
          min seg size forward      1.20E-05    -44212168.85
SNMP      Max Packet Length         1.152048    1386.280102
          Fwd Packet Length Max     0.129074    1386.258583
          Fwd Packet Length Min     0.007879    1386.226774
          Average Packet Size       0.005912    2079.140846
          Min Packet Length         0.003526    1386.226619
Syn       ACK Flag Count            0.145834    0.999478603
          Init Win bytes forward    0.002432    5837.969261
          min seg size forward      0.000872    20.00076092
          Fwd IAT Total             0.000571    8086250.716
          Flow Duration             0.000409    8086316.455
UDP       Destination Port          0.000699    33284.8418
          Fwd Packet Length Std     0.000615    15.28026139
          Packet Length Std         0.000239    14.61817097
          min seg size forward      9.80E-05    -39820342.76
          Protocol                  4.50E-05    N/A
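
The weighted-average metrics used in the comparison above can be illustrated with a short sketch. It assumes that "weighted average" means the support-weighted mean of the per-class precision, recall, and F1 (the same definition as the `average="weighted"` option in scikit-learn); the paper does not spell out the exact formula, so this is an assumption. The function name `weighted_prf` and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class precision, recall and F1.

    Assumption: each class is weighted by its share of the true labels.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)      # number of true instances per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    pr = rc = f1 = 0.0
    for cls, count in support.items():
        tp = correct[cls]
        # Precision is taken as 0 when the class is never predicted.
        p = tp / predicted[cls] if predicted[cls] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n
        pr += weight * p
        rc += weight * r
        f1 += weight * f
    return pr, rc, f1


if __name__ == "__main__":
    # Toy labels for illustration only (not real CICDDoS2019 flows).
    y_true = ["Benign", "Benign", "Syn", "Syn", "UDP", "UDP"]
    y_pred = ["Benign", "Syn", "Syn", "Syn", "UDP", "Benign"]
    pr, rc, f1 = weighted_prf(y_true, y_pred)
    print(f"Pr={pr:.2f} Rc={rc:.2f} F1={f1:.2f}")
```

Under this definition the weighted recall equals the overall fraction of correctly classified flows, so a low Rc means that most flows are assigned to the wrong class. In a five-fold setting, the function would be applied to the predictions of each held-out fold and the results averaged.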
Table V: The Performance Examination Results

Algorithm             Pr     Rc     F1
ID3                   0.78   0.65   0.69
RF                    0.77   0.56   0.62
Naïve Bayes           0.41   0.11   0.05
Logistic regression   0.25   0.02   0.04

VII. CONCLUSION

The main contribution of this paper is a new dataset, namely CICDDoS2019, for the evaluation of IDS algorithms and systems on DDoS attacks. We studied several DDoS attack categories and families to propose a new DDoS taxonomy for the application layer. We also reviewed the most popular available DDoS datasets and listed their common shortcomings and weaknesses. In response to these shortcomings, we generated CICDDoS2019, which includes 11 DDoS attacks, for the evaluation of IDS/IPS algorithms and systems. We also provided the most important features for detecting different DDoS attacks. Furthermore, based on the 12 RadViz diagrams of the most influential features for each type of network traffic, we provided a detailed analysis of each of them.

REFERENCES

[1] S. Jin and D. S. Yeung, “A covariance analysis model for ddos attack detection,” in 2004 IEEE International Conference on Communications, vol. 4, pp. 1882–1886 Vol.4, 2004.
[2] T. Subbulakshmi, K. BalaKrishnan, S. M. Shalinie, D. AnandKumar, V. GanapathiSubramanian, and K. Kannathal, “Detection of ddos attacks using enhanced support vector machines with real time generated dataset,” in Third International Conference on Advanced Computing, pp. 17–22, 2011.
[3] A. M. R. K. Munivara Prasad and K. Rao, “Dos and ddos attacks: Defense, detection and traceback mechanisms - a survey,” Global Journal of Computer Science and Technology, vol. 14, 2014.
[4] The CAIDA UCSD “DDoS Attack 2007” Dataset, accessed by Jan 2018. http://www.caida.org/data/passive/ddos-20070804_dataset.xml.
[5] DARPA 2000 Intrusion Detection Scenario Specific Data Sets, accessed by Jan 2018. https://www.ll.mit.edu/r-d/datasets/2000-darpa-intrusion-detection-scenario-specific-data-sets.
[6] A. H. Carson Brown, Alex Cowperthwaite and A. Somayaji, “Analysis of the 1999 darpa/lincoln laboratory ids evaluation data with netadhict,” in Proceedings of the Second IEEE International Conference on Computational Intelligence for Security and Defense Applications, CISDA’09, 2009.
[7] K. J. Singh and T. De, “An approach of ddos attack detection using classifiers,” Emerging Research in Computing, Information, Communication and Applications, 2015.
[8] S. Yu, W. Zhou, W. Jia, S. Guo, Y. Xiang, and F. Tang, “Discriminating ddos attacks from flash crowds using flow correlation coefficient,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 6, pp. 1073–1080, 2012.
[9] M. T. Ali Shiravi, Hadi Shiravi and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers and Security, vol. 31, pp. 357–374, 2012.
[10] M. M. Arash Habibi Lashkari, Gerard Draper Gil and A. Ghorbani, “Characterization of tor traffic using time based features,” in Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP), pp. 253–262, 2017.
[11] J. Mirkovic and P. Reiher, “A taxonomy of ddos attack and ddos defense mechanisms,” ACM SIGCOMM Computer Communication Review, vol. 34, no. 2, pp. 39–53, 2004.
[12] A. Asosheh and N. Ramezani, “A comprehensive taxonomy of ddos attacks and defense mechanism applying in a smart classification,” WSEAS Transactions on Computers, vol. 7, no. 4, pp. 281–290, 2008.
[13] A. Bhardwaj, G. Subrahmanyam, V. Avasthi, H. Sastry, and S. Goundar, “Ddos attacks, new ddos taxonomy and mitigation solutions - a survey,” in 2016 International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES), pp. 793–798, IEEE, 2016.
[14] M. Masdari and M. Jalali, “A survey and taxonomy of dos attacks in cloud computing,” Security and Communication Networks, vol. 9, no. 16, pp. 3724–3751, 2016.
[15] “Request for comments: 1001 (rfc1001),” in PROTOCOL STANDARD FOR A NetBIOS SERVICE ON A TCP/UDP TRANSPORT: CONCEPTS AND METHODS, 1987.
[16] “Request for comments: 7766 (rfc7766),” in DNS Transport over TCP - Implementation Requirements, 2016.
[17] K. Singh, P. Singh, and K. Kumar, “Application layer http-get flood ddos attacks: Research landscape and challenges,” Computers & Security, vol. 65, pp. 344–372, 2017.
[18] A. H. L. Iman Sharafaldin, Amirhossein Gharib and A. Ghorbani, “Towards a reliable intrusion detection benchmark dataset,” Software Networking, pp. 177–200, 2017.
[19] CICFlowMeter, 2017. https://github.com/ISCX/CICFlowMeter.
[20] A. G. F. Pedregosa, G. Varoquaux and E. Duchesnay, Scikit-learn: Machine learning in Python, 2011.
[21] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, no. 1, pp. 81–106, 1986.
[22] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.