Data Mining
Classification: Alternative Techniques
Imbalanced Class Problem
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
Class Imbalance Problem
Many classification problems have skewed classes
(many more records from one class than from
another):
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
– COVID-19 test results on a random sample
Key Challenge:
– Evaluation measures such as accuracy are not well-suited
for imbalanced classes
2/15/2021, Introduction to Data Mining, 2nd Edition
Confusion Matrix

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  a           b
CLASS     Class=No   c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Accuracy
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  a (TP)      b (FN)
CLASS     Class=No   c (FP)      d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
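As a minimal sketch (illustrative, not from the text), accuracy can be computed directly from the four confusion-matrix counts:

```python
# Accuracy from confusion-matrix counts: a = TP, b = FN, c = FP, d = TN.
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

# A trivial "predict everything NO" model on 10 YES / 990 NO records:
print(accuracy(tp=0, fn=10, fp=0, tn=990))  # 0.99
```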
Problem with Accuracy
Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
If a model predicts everything to be class NO, accuracy is
990/1000 = 99%
– This is misleading because this trivial model does not detect any class
YES example
– Detecting the rare class is usually more interesting (e.g., frauds,
intrusions, defects, etc.)

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  0           10
CLASS     Class=No   0           990
Which model is better?
Model A:
                     PREDICTED
                     Class=Yes   Class=No
ACTUAL    Class=Yes  0           10
CLASS     Class=No   0           990
Accuracy: 99%

Model B:
                     PREDICTED
                     Class=Yes   Class=No
ACTUAL    Class=Yes  10          0
CLASS     Class=No   500         490
Accuracy: 50%
Which model is better?
Model A:
                     PREDICTED
                     Class=Yes   Class=No
ACTUAL    Class=Yes  5           5
CLASS     Class=No   0           990

Model B:
                     PREDICTED
                     Class=Yes   Class=No
ACTUAL    Class=Yes  10          0
CLASS     Class=No   500         490
Alternative Measures
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  a           b
CLASS     Class=No   c           d

Precision (p) = a / (a + c)

Recall (r) = a / (a + b)

F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
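These definitions translate directly into code; a minimal sketch (function names are illustrative), using the confusion-matrix counts a = TP, b = FN, c = FP:

```python
# Precision, recall, and F-measure from confusion-matrix counts.
def precision(a, b, c):
    return a / (a + c)

def recall(a, b, c):
    return a / (a + b)

def f_measure(a, b, c):
    # Equivalent to 2rp/(r+p) after substituting the definitions above.
    return 2 * a / (2 * a + b + c)

# Example: a=10, b=0, c=10 gives p=0.5, r=1.0, F=2/3.
print(precision(10, 0, 10), recall(10, 0, 10), f_measure(10, 0, 10))
```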
Alternative Measures
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  10          0
CLASS     Class=No   10          980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
Accuracy = 990 / 1000 = 0.99
Alternative Measures
Classifier 1:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  10          0
CLASS     Class=No   10          980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
Accuracy = 990 / 1000 = 0.99

Classifier 2:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  1           9
CLASS     Class=No   0           990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) ≈ 0.18
Accuracy = 991 / 1000 = 0.991
Which of these classifiers is better?
Model A:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  40          10
CLASS     Class=No   10          40

Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8

Model B:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  40          10
CLASS     Class=No   1000        4000

Precision (p) ≈ 0.04, Recall (r) = 0.8, F-measure (F) ≈ 0.07, Accuracy = 0.8
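A quick sketch (illustrative, not from the text) that reproduces both tables' metrics and shows why accuracy alone cannot separate the two models:

```python
# Compare models A and B: identical accuracy and recall, but B's precision
# and F-measure collapse because of its 1000 false positives.
def metrics(tp, fn, fp, tn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    acc = (tp + tn) / (tp + fn + fp + tn)
    return p, r, f, acc

for name, counts in [("A", (40, 10, 10, 40)), ("B", (40, 10, 1000, 4000))]:
    p, r, f, acc = metrics(*counts)
    print(f"{name}: p={p:.3f} r={r:.3f} F={f:.3f} acc={acc:.3f}")
```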
Measures of Classification Performance
                  PREDICTED CLASS
                  Yes   No
ACTUAL    Yes     TP    FN
CLASS     No      FP    TN

α is the probability that we reject
the null hypothesis when it is
true. This is a Type I error or a
false positive (FP).

β is the probability that we
accept the null hypothesis when
it is false. This is a Type II error
or a false negative (FN).
Alternative Measures
Model A:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  40          10
CLASS     Class=No   10          40

TPR/FPR = 0.8/0.2 = 4

Model B:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  40          10
CLASS     Class=No   1000        4000

TPR/FPR = 0.8/0.2 = 4
Precision (p) ≈ 0.038
Which of these classifiers is better?
Model A:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  10          40
CLASS     Class=No   10          40

Precision (p) = 0.5, TPR (Recall, r) = 0.2, FPR = 0.2, F-measure = 0.28

Model B:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  25          25
CLASS     Class=No   25          25

Precision (p) = 0.5, TPR (Recall, r) = 0.5, FPR = 0.5, F-measure = 0.5

Model C:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  40          10
CLASS     Class=No   40          10

Precision (p) = 0.5, TPR (Recall, r) = 0.8, FPR = 0.8, F-measure = 0.61
ROC (Receiver Operating Characteristic)
A graphical approach for displaying trade-off
between detection rate and false alarm rate
Developed in 1950s for signal detection theory to
analyze noisy signals
ROC curve plots TPR (y-axis) against FPR (x-axis)
– The performance of a model is represented as a point on the
ROC curve
ROC Curve
(TPR, FPR):
– (0,0): declare everything to be negative class
– (1,1): declare everything to be positive class
– (1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class
ROC (Receiver Operating Characteristic)
To draw ROC curve, classifier must produce
continuous-valued output
– Outputs are used to rank test records, from the most likely
positive class record to the least likely positive class record
– By using different thresholds on this value, we can create
different variations of the classifier with TPR/FPR tradeoffs
Many classifiers produce only discrete outputs (i.e., the
predicted class)
– How to get continuous-valued outputs? Many classifiers can
be adapted to produce such a score: decision trees, rule-based
classifiers, neural networks, Bayesian classifiers, k-nearest
neighbors, SVM
Example: Decision Trees
Decision Tree:
[Figure: decision tree with splits x2 < 12.63, x1 < 13.29, x2 < 17.35,
x1 < 6.56, x1 < 2.15, x1 < 7.24, x2 < 8.64, x1 < 12.11, x2 < 1.38,
x1 < 18.88]

Continuous-valued outputs:
[Figure: the same tree with each leaf labeled by a score: 0.059, 0.220,
0.071, 0.107, 0.164, 0.727, 0.143, 0.669, 0.271, 0.654, 0]
ROC Curve Example
[Figure: the decision tree above with its leaf scores, and the
corresponding ROC curve obtained by thresholding those scores]
ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
How to Construct an ROC curve
• Use a classifier that produces a continuous-valued score for
each instance
• The more likely it is for the instance to be in the + class, the
higher the score
• Sort the instances in decreasing order according to the score
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP/(TP + FN)
• FPR = FP/(FP + TN)

Instance   Score   True Class
1          0.95    +
2          0.93    +
3          0.87    -
4          0.85    -
5          0.85    -
6          0.85    +
7          0.76    -
8          0.53    +
9          0.43    -
10         0.25    +
How to construct an ROC curve
Class         +     -     +     -     -     -     +     -     +     +
Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
ROC Curve:
[Figure: ROC curve plotted from the (FPR, TPR) pairs in the table]
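The procedure above can be sketched in code (scores and labels reproduced from the instance table; variable names are illustrative):

```python
# Build ROC points by thresholding the score at each unique value, plus 1.00
# as the "predict nothing positive" end point.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

n_pos = labels.count("+")
n_neg = labels.count("-")

roc_points = []
for t in sorted(set(scores)) + [1.00]:
    tp = sum(s >= t and y == "+" for s, y in zip(scores, labels))
    fp = sum(s >= t and y == "-" for s, y in zip(scores, labels))
    roc_points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)

print(roc_points)
```

The lowest threshold (0.25) predicts everything positive, giving (FPR, TPR) = (1, 1); the threshold 1.00 predicts nothing positive, giving (0, 0).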
Using ROC for Model Comparison
[Figure: ROC curves of two models, M1 and M2]

No model consistently outperforms the other:
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5
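AUC can be estimated from the ROC points with the trapezoidal rule; a minimal sketch, assuming the points are (FPR, TPR) pairs as constructed above:

```python
# Trapezoidal-rule AUC over (FPR, TPR) points, sorted by FPR.
def auc(points):
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

print(auc([(0, 0), (0.5, 0.5), (1, 1)]))  # diagonal (random guess): 0.5
print(auc([(0, 0), (0, 1), (1, 1)]))      # ideal classifier: 1.0
```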
Dealing with Imbalanced Classes - Summary
Many measures exist, but none of them may be ideal in
all situations
– Random classifiers can have high value for many of these measures
– TPR/FPR provides important information but may not be sufficient by
itself in many practical scenarios
– Given two classifiers, sometimes you can tell that one of them is
strictly better than the other
C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or same
TPR and better FPR, and vice versa)
– Even if C1 is strictly better than C2, C1’s F-value can be worse than
C2’s if they are evaluated on data sets with different imbalances
– Classifier C1 can be better or worse than C2 depending on the scenario
at hand (class imbalance, importance of TP vs FP, cost/time tradeoffs)
Which Classifier is better?

T1:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  50          50
CLASS     Class=No   1           99

Precision (p) = 0.98, TPR (Recall) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.66

T2:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  99          1
CLASS     Class=No   10          90

Precision (p) = 0.9, TPR (Recall) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.94

T3:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  99          1
CLASS     Class=No   1           99

Precision (p) = 0.99, TPR (Recall) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.99
Which Classifier is better? Medium Skew case

T1:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  50          50
CLASS     Class=No   10          990

Precision (p) = 0.83, TPR (Recall) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.62

T2:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  99          1
CLASS     Class=No   100         900

Precision (p) = 0.5, TPR (Recall) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.66

T3:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  99          1
CLASS     Class=No   10          990

Precision (p) = 0.9, TPR (Recall) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.94
Which Classifier is better? High Skew case

T1:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  50          50
CLASS     Class=No   100         9900

Precision (p) ≈ 0.33, TPR (Recall) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.4

T2:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  99          1
CLASS     Class=No   1000        9000

Precision (p) ≈ 0.09, TPR (Recall) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure ≈ 0.165

T3:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL    Class=Yes  99          1
CLASS     Class=No   100         9900

Precision (p) ≈ 0.5, TPR (Recall) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure ≈ 0.66
Building Classifiers with Imbalanced Training Set
Modify the distribution of training data so that rare
class is well-represented in training set
– Undersample the majority class
– Oversample the rare class
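A minimal sketch of both resampling options (assuming each training record is a (features, label) pair; the function names are illustrative, not from the text):

```python
import random

# Undersample the majority class: keep all rare-class records and draw an
# equal-sized random subset of the majority class (without replacement).
def undersample_majority(records, majority_label, seed=0):
    rng = random.Random(seed)
    majority = [r for r in records if r[1] == majority_label]
    minority = [r for r in records if r[1] != majority_label]
    return minority + rng.sample(majority, len(minority))

# Oversample the rare class: keep the majority class and draw rare-class
# records with replacement until the two classes are balanced.
def oversample_rare(records, rare_label, seed=0):
    rng = random.Random(seed)
    rare = [r for r in records if r[1] == rare_label]
    rest = [r for r in records if r[1] != rare_label]
    return rest + [rng.choice(rare) for _ in range(len(rest))]

# Skewed training set: 990 "no" records, 10 "yes" records.
data = [((i,), "no") for i in range(990)] + [((i,), "yes") for i in range(10)]
print(len(undersample_majority(data, "no")))  # 20 records, 10 per class
print(len(oversample_rare(data, "yes")))      # 1980 records, 990 per class
```

Undersampling discards potentially useful majority-class records, while oversampling duplicates rare-class records and can encourage overfitting; which trade-off is acceptable depends on the application.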