C4.5 based Sequential Attack Detection and Identification Model
Radhika Kumar, Anjali Sardana, R. C. Joshi
Information Security Laboratory
Department of Electronics and Computer Engineering
Indian Institute of Technology
Roorkee – 247667
{Anjlsfec, radhsdec, rcjosfec}@iitr.ernet.in
Introduction

Internet was designed for openness and functionality.
Failures can be accidental or intentional.
Examples:
  Denial of Service (DoS)
  Distributed Denial of Service (DDoS)
  Domain Name System attack
  IP Spoofing
  Sequence Number Hijacking

CERT Statistics:
[Figure: The number of total vulnerabilities catalogued from 1995 to 2006, rising to about 8,000 per year]
[Figure: The number of Internet security incidents reported from 1988 to 2003, growing from 6 to 153,140]
Service denied to legitimate users
Packets drop due to queue overflow

[Figure: Packets drop under DDoS attack; legitimate packets and attack packets converge from the transit domain onto the bottleneck link, overflowing the buffer at the edge router of the victim's stub domain]
Motivation

Existing approaches to defend against attacks:
  Before the attack: Prevention
  During the attack: Detection and Characterization
  After the attack: Response and Mitigation
All of these suffer from various constraints.
Sequential Multi-Level Classification Model

The objective is to find the natural hierarchy in the network traffic and to exploit the generic and differentiating characteristics of different attacks to build a more secure environment.
A differential approach is used to detect one kind of attack at a time from the network traffic.
A sequential model with a different binary classifier at each level, categorizing attacks in a step-by-step manner, is used (see the sketch below).
Rules are also generated at different levels of abstraction.
The KDD99 dataset is used for evaluation.

[Figure: Sequential model; Node 1 splits off Class 1 and passes the rest to Node 2, Node 2 splits off Class 2 and passes the rest to Node 3, Node 3 separates Class 3 from Class 4]
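To make the sequential idea concrete, here is a minimal sketch, assuming numpy arrays and using scikit-learn's DecisionTreeClassifier with the entropy criterion as a C4.5-like stand-in (the original work used C4.5 itself, via Weka [8]); SequentialCascade and its parameters are illustrative names, not part of the original model.

```python
# Minimal sketch of the sequential multi-level model: each level is a
# binary classifier that peels one class off the traffic, and only the
# records it leaves undecided reach the next level.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SequentialCascade:
    def __init__(self, level_targets, residual):
        self.targets = list(level_targets)   # class peeled off at each level
        self.residual = residual             # label for whatever is left
        self.trees = [DecisionTreeClassifier(criterion="entropy")
                      for _ in self.targets]

    def fit(self, X, y):
        """X: numeric feature matrix, y: array of class labels."""
        Xk, yk = np.asarray(X), np.asarray(y)
        for tree, target in zip(self.trees, self.targets):
            tree.fit(Xk, np.where(yk == target, target, "other"))
            keep = yk != target              # only non-target rows go deeper
            Xk, yk = Xk[keep], yk[keep]
        return self

    def predict(self, X):
        X = np.asarray(X)
        out = np.full(len(X), self.residual, dtype=object)
        todo = np.arange(len(X))
        for tree, target in zip(self.trees, self.targets):
            if todo.size == 0:
                break
            hit = tree.predict(X[todo]) == target
            out[todo[hit]] = target
            todo = todo[~hit]                # undecided rows fall through
        return out

# e.g. SequentialCascade(["normal", "dos", "probe", "u2r"], residual="r2l")
```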
Mathematical Model

Traffic Feature Distribution

$X = \{n_i,\ i = 1, 2, 3, \ldots, N\}$

where X is a random process in which item i occurs $n_i$ times.

Item i is defined by one traffic feature from the packet header, or a combination of them:
  Source IP address
  Destination IP address
  Source port
  Destination port
  Layer-3 protocol type

Flow Id (i)    Number of packets (n_i)
1              n_1
2              n_2
3              n_3
:              :
N              n_N
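As a small illustration of the distribution above, here is a sketch, assuming packets are plain dicts of header fields (the field names are illustrative):

```python
# Sketch: building the distribution X = {n_i} for one traffic feature.
# Counting how often each source IP occurs yields the n_i per flow id.
from collections import Counter

packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 80},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "dst_port": 80},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 22},
]

dist = Counter(p["src_ip"] for p in packets)   # flow id i -> n_i
print(dist)   # Counter({'10.0.0.1': 2, '10.0.0.2': 1})
```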
Mathematical Model: Basis of C4.5

1. Traffic feature measurement: entropy

$H(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$, where $p_i = n_i / S$ and $S = \sum_{i=1}^{N} n_i$

A measure of dispersal or concentration of a distribution.
The range is 0 to $\log_2 N$: H(X) = 0 when all observations are the same, and $H(X) = \log_2 N$ if $n_1 = n_2 = \cdots = n_N$.

2. Sampling

$\{X(t),\ t = j\Delta,\ j \in n\}$, where $\Delta$ is a constant time window, n is the set of positive integers, and $X(t)$ represents the number of packet arrivals for a flow in $(t - \Delta,\ t]$.

Each row of the sample matrix holds the per-flow counts for one window and its entropy:

Window    Flow 1      Flow 2      Flow 3      ...   Flow N      Entropy
Δ         X(Δ,1)      X(Δ,2)      X(Δ,3)      ...   X(Δ,N)      H(Δ)
2Δ        X(2Δ,1)     X(2Δ,2)     X(2Δ,3)     ...   X(2Δ,N)     H(2Δ)
3Δ        X(3Δ,1)     X(3Δ,2)     X(3Δ,3)     ...   X(3Δ,N)     H(3Δ)
:         :           :           :                 :           :
nΔ        X(nΔ,1)     X(nΔ,2)     X(nΔ,3)     ...   X(nΔ,N)     H(nΔ)

3. Traffic feature selection
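A sketch of the two steps above, entropy of a count distribution and per-window sampling, assuming time-sorted (timestamp, header) pairs; windowed_entropy is an illustrative helper, not from the paper:

```python
# Sketch: sample entropy of a feature distribution per time window Δ.
# H(X) = -Σ p_i log2 p_i with p_i = n_i / S, so H = 0 when one flow
# dominates completely and H = log2(N) when all N flows are equal.
import math
from collections import Counter

def entropy(counts):
    s = sum(counts)
    return -sum((n / s) * math.log2(n / s) for n in counts if n > 0)

def windowed_entropy(packets, key, delta):
    """packets: time-sorted iterable of (timestamp, header-dict);
    key: feature name; delta: window length in seconds."""
    window, start = Counter(), None
    for t, hdr in packets:
        if start is None:
            start = t
        if t - start >= delta:          # window closed: emit H and reset
            yield entropy(window.values())
            window, start = Counter(), t
        window[hdr[key]] += 1
    if window:
        yield entropy(window.values())
```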
C4.5: The Classification Algorithm

The sequential nature of the proposed multi-level architecture needs a binary classification at each level.
C4.5 gives the highest overall accuracy as a single-level classifier compared to other single-level classifiers (Tavallaee, 2009; classifiers were tested on the KDD99 dataset).
C4.5 uses the concept of entropy to measure the impurity of data items:

$I(S) = -\sum_{i=1}^{x} RF(C_i, S) \log(RF(C_i, S))$

where $RF(C_i, S)$ is the relative frequency of class $C_i$ in S and x is the number of classes.

Information gain of a test B that partitions S into subsets $S_1, \ldots, S_t$:

$G(S, B) = I(S) - \sum_{i=1}^{t} (|S_i| / |S|)\, I(S_i)$

Gain ratio denominator (split information):

$P(S, B) = -\sum_{i=1}^{t} (|S_i| / |S|) \log(|S_i| / |S|)$

The test B that maximizes $G(S, B) / P(S, B)$ is then chosen as the current partitioning attribute.
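The split criterion can be written down directly; a sketch on plain label lists (class_entropy and gain_ratio are illustrative names):

```python
# Sketch: the C4.5 split criterion. I(S) is class entropy, G is the
# information gain of a partition, P is the split information, and
# G/P is the gain ratio that C4.5 maximizes.
import math
from collections import Counter

def class_entropy(labels):                        # I(S)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def gain_ratio(labels, partition):
    """partition: list of label-lists S_1..S_t produced by a test B."""
    n = len(labels)
    gain = class_entropy(labels) - sum(
        len(s) / n * class_entropy(s) for s in partition)    # G(S, B)
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in partition if s)                # P(S, B)
    return gain / split_info if split_info else 0.0

# e.g. gain_ratio(["dos", "dos", "normal", "normal"],
#                 [["dos", "dos"], ["normal", "normal"]])  -> 1.0
```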
KDD'99 Dataset

KDD attacks fall into four main categories:
  DoS: Denial of Service attack
  Probe: Probing attack
  U2R: User to Root attack
  R2L: Remote to Local attack

The KDD'99 dataset has 41 features classified into three groups:
  Basic features
  Traffic features: time-based and host-based traffic features
  Content features
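For reproduction, scikit-learn mirrors this dataset; a sketch that pulls the same 10% subset and collapses the raw labels into the four categories (the label map here is abbreviated, the full dataset has more attack names):

```python
# Sketch: loading 10% KDD'99 and mapping raw labels to categories.
import pandas as pd
from sklearn.datasets import fetch_kddcup99

CATEGORY = {  # abbreviated map
    b"normal.": "normal",
    b"smurf.": "dos", b"neptune.": "dos", b"back.": "dos",
    b"satan.": "probe", b"ipsweep.": "probe", b"portsweep.": "probe",
    b"buffer_overflow.": "u2r", b"rootkit.": "u2r",
    b"guess_passwd.": "r2l", b"warezclient.": "r2l",
}

kdd = fetch_kddcup99(percent10=True)            # the same 10% subset
y = pd.Series(kdd.target).map(lambda s: CATEGORY.get(s, "other"))
print(y.value_counts())                         # DoS dominates, as noted
```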
Sequential Classification

Some observations:
  There are more DoS attack instances in the training data than Probe, U2R, and R2L attacks combined, so DoS is the most common type of attack.
  The DoS attack is by nature characterized by time-based traffic features.
  The Probe attack is defined by host-based features.
  U2R and R2L attacks are detected by studying the content features of the data.
  Finally, they are all attacks, so they must share some common characteristics that differ from normal traffic.

First stage: separation of attack data from normal traffic on the basis of characteristics common to all attack traffic.
Second stage: separation of the most common attack, the DoS attack, from the other three kinds using time-based features.
Third stage: separation of Probe attacks from the other two kinds using host-based traffic features.
Fourth stage: separation of U2R and R2L attacks using content features (a sketch of the per-stage label recoding follows).
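A minimal sketch of that recoding, assuming category labels 'normal'/'dos'/'probe'/'u2r'/'r2l'; stage_labels is an illustrative helper:

```python
# Sketch: per-stage binary label recoding for the four levels. Each
# later stage is trained only on the records the previous stage passed
# on, with one category split off at a time (stage 4 is U2R vs. R2L).
def stage_labels(y, stage):
    if stage == 1:
        return ["normal" if c == "normal" else "attack" for c in y]
    if stage == 2:
        return ["dos" if c == "dos" else "other" for c in y]
    if stage == 3:
        return ["probe" if c == "probe" else "other" for c in y]
    return list(y)                     # stage 4: U2R vs. R2L directly
```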
Snapshot of Level 4 Classifier (trained on U2R and R2L attack data)

Training results:
  Correctly Classified Instances: 98.9813%
  Incorrectly Classified Instances: 1.0187%
Evaluation Matrix

                Classified as
Actual Class    Normal                 Attack
Normal          True Negative (TN)     False Positive (FP)
Attack          False Negative (FN)    True Positive (TP)

Precision: the proportion of predicted positives/negatives that are actual positives/negatives.
  True Alarm Ratio: TP / (TP + FP)
  False Alarm Ratio: FP / (FP + TP)
Recall: the proportion of actual positives/negatives that are predicted positive/negative.
  Sensitivity, detection rate, alarm rate: TP / (TP + FN)
  False positive rate, false alarm rate: FP / (FP + TN)
  False negative rate: FN / (FN + TP)
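These rates follow mechanically from the four confusion-matrix cells; a small sketch (rates is an illustrative helper):

```python
# Sketch: the slide's rates computed from a binary confusion matrix.
def rates(tn, fp, fn, tp):
    return {
        "true_alarm_ratio":    tp / (tp + fp),   # precision on attacks
        "false_alarm_ratio":   fp / (fp + tp),
        "detection_rate":      tp / (tp + fn),   # recall / sensitivity
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Level 1 test matrix from the next slide:
print(rates(tn=60253, fp=340, fn=22684, tp=227752))
```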
Testing Results

[Figure: Sequential classification tree. Level 1 splits all data into Normal and Attack; Level 2 splits Attack into DoS and Other attacks; Level 3 splits Other into Probe and Other; Level 4 splits the remainder into U2R and R2L]

Level 1 classifier (all data):
                 Classified as
Actual Class     Normal    Attack
Normal           60253     340
Attack           22684     227752
Correctly Classified Instances: 92.5975%

Level 2 classifier:
                 Classified as
Actual Class     Normal    DoS Attack    Other Attacks
Normal           0         83            257
DoS Attack       0         222524        795
Other Attacks    0         435           3998
Correctly Classified Instances: 99.3117%

Level 3 classifier:
                 Classified as
Actual Class     Normal    DoS    Probe    Others
Normal           0         0      253      7
DoS Attack       0         0      358      471
Probe Attack     0         0      3086     0
Other Attacks    0         0      347      527
Correctly Classified Instances: 71.5587%

Level 4 classifier:
                 Classified as
Actual Class     Normal    DoS    Probe    U2R    R2L
Normal           0         0      0        1      6
DoS              0         0      0        0      471
Probe            0         0      0        0      0
U2R              0         0      0        9      8
R2L              0         0      0        2      508
Correctly Classified Instances: 51.4428%
Improvements in Training Dataset

KDD99 10% training dataset and testing dataset distribution:

Class     Training Set    Testing Set
Normal    19.69%          19.48%
Probe     0.83%           1.34%
DoS       79.24%          73.90%
U2R       0.01%           0.07%
R2L       0.23%           5.20%

The 10% KDD99 training dataset has a huge number of similar records for the DoS attack and normal traffic compared to the Probe, U2R, and R2L attacks.
The Level 1 classifier gets biased towards the normal class.
Testing result: a high false negative rate of 9.95%.
Improvements

New dataset: the U2R, R2L, and Probe data were duplicated 5 times (a sketch of the rebalancing follows), and the Level 1 classifier was trained using this new dataset.

Testing results of the Level 1 classifier on the earlier test dataset:
  Attack detection rate increased from 90.942% to 92.2515%.
  Accuracy percentage increased from 92.5975% to 93.5974%.
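The rebalancing itself is short; a sketch assuming the training data sits in a pandas DataFrame with a 'category' column, and reading "duplicated 5 times" as five copies in total:

```python
# Sketch: oversampling the rare classes in the KDD99 10% training set.
import pandas as pd

def duplicate_minority(df, classes=("u2r", "r2l", "probe"), times=5):
    minority = df[df["category"].isin(classes)]
    # original frame plus (times - 1) extra copies of the rare classes
    return pd.concat([df] + [minority] * (times - 1), ignore_index=True)
```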
Results after Improvements in Training Data

Confusion matrix of the Level 1 classifier after data duplication:
                 Classified as
Actual Class     Normal    Attack
Normal           60099     494
Attack           19405     231031
Correctly Classified Instances: 93.5974%

Misuse and anomaly detection rates of the Level 1 classifier before and after data duplication:

True Positives                        Known Attacks         New Attacks
In test dataset                       220,525               29,911
Detected by Level 1 classifier        219,827 (99.6835%)    7,905 (26.4618%)
(trained on original dataset)
Detected by Level 1 classifier        220,525 (99.9832%)    10,543 (35.2479%)
(trained on new dataset)

The data duplication improved the misuse and anomaly detection rates from 99.6835% and 26.4618% to 99.9832% and 35.2479%, respectively.
Descriptive Modeling

The advantage of the multi-level sequential approach is that we get small and easily interpretable trees.
Rules can be derived from these decision trees at different levels of abstraction.
These rules are in terms of the 41 features of the KDD dataset.

Example rule derived from the second-level classifier:

If (% of connections to different services for the same host over the last 1000 connections < 0.1) and
   (% of connections to different hosts for the same service over the past 1000 connections < 0.01) and
   (number of connections to the same host in the past two seconds > 2)
=> DoS attack
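Such a rule is directly executable; a sketch in which the mapping of the slide's phrases onto KDD'99 feature names (dst_host_diff_srv_rate, dst_host_srv_diff_host_rate, count) is my assumption:

```python
# Sketch: the derived rule as a check on a single connection record.
def rule_dos(conn):
    return (conn["dst_host_diff_srv_rate"] < 0.1            # % diff services, same host
            and conn["dst_host_srv_diff_host_rate"] < 0.01  # % diff hosts, same service
            and conn["count"] > 2)                          # conns to same host, past 2 s
```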
Conclusion

The model has a low false alarm ratio of 0.15%.
Individual attack detection rates of 99.644% for DoS and 100% for Probe are achievable.
The percentage accuracy for classification between U2R and R2L is as high as 98.1024%.
The new dataset gives better results: a misuse detection rate of 99.9832% and an anomaly detection rate of 35.2479%.
The trees generated are small, and rules are easy to derive at different levels of abstraction.
References

[1] S. Axelsson, "The Base-Rate Fallacy and the Difficulty of Intrusion Detection," ACM Transactions on Information and System Security, 2000.
[2] V. Corey et al., "Network Forensics Analysis," IEEE Internet Computing, vol. 6, no. 6, pp. 60-66, 2002.
[3] R. J. Henery, "Classification," in Machine Learning, Neural and Statistical Classification, D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds., Ellis Horwood, New York, 1994.
[4] E. Bloedorn, L. Talbot, C. Skorupka, A. Christiansen, W. Hill, and J. Tivel, "Data Mining Applied to Intrusion Detection: MITRE Experiences," in Proc. IEEE International Conference on Data Mining, 2001.
[5] Y. Ma, D. Choi, and S. Ata, Eds., Application of Data Mining to Network Intrusion Detection: Classifier Selection Model, ser. Lecture Notes in Computer Science, vol. 5297. Berlin Heidelberg, Germany: Springer-Verlag, 2008.
[6] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A Detailed Analysis of the KDD CUP 99 Data Set," in Proc. IEEE Symposium CISDA'09, 2009.
[7] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, California: Morgan Kaufmann, 1993.
[8] Weka – Data Mining Machine Learning Software. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
[9] KDD Cup 1999 Data. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[10] M. Sabhnani and G. Serpen, "Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Detection Dataset," Intelligent Data Analysis, vol. 6, June 2004.
[11] K. Kendall, "A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems," M.Eng. Thesis, Massachusetts Institute of Technology, Massachusetts, United States, 1999.
Thank You