Lec09: Classifier Evaluation
CS60050 Machine Learning

Classifier/Hypothesis Evaluation
Classifier performance evaluation and comparison
Outline
1. Introduction
2. Scores
3. Estimation Methods
4. Hypothesis Testing
1. Introduction
Classification Problem
We start from a data set of samples; an expert provides the class label of each sample.
Supervised Classification
From the expert-labeled data set, a classification model is learned.

Classification Model
The classifier labels new data whose class value is unknown.
Many classification paradigms
The same data set can be modeled with many paradigms (Naive Bayes, neural networks, decision trees, ...). Which one should we choose?

Many parameter configurations
Each paradigm (e.g., Naive Bayes) can in turn be run under many different parameter configurations.
Honest Evaluation
- We need to know the goodness of a classifier
- We need a methodology to compare classifiers
- We need to assess the validity of the evaluation/comparison

2. Scores
Motivation
We need some way to measure the classification performance!

Score
A function that provides a quality measure for a classifier when solving a classification problem.

There are different kinds of scores.
Scores
- Recall, Precision, F-Score → Information Retrieval
- Specificity → Medical Domains
Confusion Matrix

Two-Class Problem
                 Prediction
                 c+     c-     Total
Actual   c+      TP     FN     N+
         c-      FP     TN     N-
         Total   N̂+     N̂-     N

Several-Class Problem
For n classes c1, c2, ..., cn the confusion matrix is an n x n table whose rows are the actual classes and whose columns are the predicted classes.
[Figure: scatter plot of a two-class data set in the (X1, X2) plane, omitted.]
Example
                 Prediction
                 c+     c-     Total
Actual   c+      10      2      12
         c-       2      8      10
         Total   12     10      22

[Figure: scatter plot of the 22 samples with the learned decision boundary, omitted.]
Accuracy / Classification Error

Definition
Fraction of data samples classified correctly (accuracy) or incorrectly (classification error).

For the example above:

ε = (FP + FN) / N = (2 + 2) / 22 ≈ 0.182
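A minimal sketch (not from the slides) of how the confusion-matrix cells and the accuracy/error of the 22-sample example can be computed; the label strings and the helper name are illustrative assumptions.

```python
# Sketch: count confusion-matrix cells and compute accuracy / classification error.
def confusion_counts(actual, predicted, positive="c+"):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

# The 22-sample example above: 10 TP, 2 FN, 2 FP, 8 TN.
actual    = ["c+"] * 12 + ["c-"] * 10
predicted = ["c+"] * 10 + ["c-"] * 2 + ["c+"] * 2 + ["c-"] * 8
tp, fn, fp, tn = confusion_counts(actual, predicted)
n = tp + fn + fp + tn
print((fp + fn) / n)   # error    = 4 / 22 ≈ 0.182
print((tp + tn) / n)   # accuracy = 18 / 22 ≈ 0.818
```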
Skewed Data
[Figure: scatter plot of a heavily imbalanced data set with very few c+ samples and many c- samples, omitted.]
- 28 -
Classifier performance evaluation and
comparison Scores
6
Prediction
5 c+ c— Total
4 c+ 0 5 5
-
Actual
3
c
c— 7 993 1000
+
2 c
X2
0
7+
ϵ =5 =
-1
1005
-2
0,012
Very low
-3
-3 -2 -1 0 1
X1
2 3 4 5 6 ϵ!!
- 29 -
A classifier that labels every sample as c-:

                 Prediction
                 c+     c-     Total
Actual   c+       0      5       5
         c-       0   1000    1000
         Total    0   1005    1005

ε = (0 + 5) / 1005 ≈ 0.005

Better? It never detects a positive sample.
Positive-Unlabeled Data
Only some positive samples are labeled; all the remaining samples are unlabeled, and many of them may also be positive. In this setting the classification error is useless.
[Figure: scatter plot with a few labeled positives and many unlabeled ("?") samples, omitted.]
Recall

Definition
Fraction of positive class samples correctly classified.
Other names: true positive rate, sensitivity.

r(φ) = TP / (TP + FN) = TP / P

Definition based on probabilities:
r(φ) = p(φ(x) = c+ | C = c+)
For the skewed example (TP = 0, FN = 5):

r(φ) = 0 / (0 + 5) = 0

Very bad recall!
In the positive-unlabeled setting, recall can still be computed because it only involves the labeled positive samples: it is possible to calculate recall in positive-unlabeled problems.
[Figure: scatter plot of the positive-unlabeled example with its confusion matrix, omitted.]
Precision

Definition
Fraction of the data samples classified as c+ which are actually c+.

pr(φ) = TP / (TP + FP) = TP / P̂

Definition based on probabilities:
pr(φ) = p(C = c+ | φ(x) = c+)
For the skewed example (TP = 0, FP = 7):

pr(φ) = 0 / (0 + 7) = 0

Very bad precision!
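A small sketch (assuming the counts of the skewed confusion matrix above) of how recall and precision expose the degenerate classifier even though its plain accuracy looks good; the helper names are illustrative.

```python
# Sketch: recall and precision for TP=0, FN=5, FP=7, TN=993.
def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

tp, fn, fp, tn = 0, 5, 7, 993
print(recall(tp, fn))                    # 0.0 -> very bad recall
print(precision(tp, fp))                 # 0.0 -> very bad precision
print((tp + tn) / (tp + fn + fp + tn))   # ≈ 0.988 -> accuracy still looks good
```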
Precision is not a good score for positive-unlabeled data: not all the positive samples are labeled, so samples predicted as c+ may be unlabeled positives that are counted as errors.
[Figure: scatter plot of the positive-unlabeled example, omitted.]
Spam Filtering
Decide whether an email is spam or not.
- Precision: proportion of real spam among the messages in the spam-box.
- Recall: proportion of the total spam messages identified by the system.

Sentiment Analysis
Classify opinions about specific products given by users in blogs, websites, forums, etc.
- Precision: proportion of opinions classified as positive that are actually positive.
- Recall: proportion of positive opinions identified as positive.
Specificity

Definition
Fraction of negative class samples correctly identified.
Specificity = 1 - False Positive Rate.

sp(φ) = TN / (TN + FP) = TN / N

Definition based on probabilities:
sp(φ) = p(φ(x) = c- | C = c-)
For the skewed example (TN = 993, FP = 7):

sp(φ) = 993 / (993 + 7) = 0.993
For the classifier that labels every sample as c- (TN = 1000, FP = 0):

                 Prediction
                 c+     c-     Total
Actual   c+       0      5       5
         c-       0   1000    1000
         Total    0   1005    1005

sp(φ) = 1000 / (1000 + 0) = 1.00
Balanced Scores

Balanced accuracy rate
Bal. acc = (1/2) (TP/P + TN/N) = (recall + specificity) / 2

Balanced error rate
Bal. ε = (1/2) (FN/P + FP/N)

Skewed data example:

                 Prediction
                 c+     c-     Total
Actual   c+       0      5       5
         c-       7    993    1000
         Total    7    998    1005

Bal. acc = (1/2) (0/5 + 993/1000) ≈ 0.5
Bal. ε   = (1/2) (5/5 + 7/1000) ≈ 0.5
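A short sketch of the balanced scores on the same skewed counts; the variable names are illustrative.

```python
# Sketch: balanced accuracy / balanced error vs. plain error for TP=0, FN=5, FP=7, TN=993.
tp, fn, fp, tn = 0, 5, 7, 993
p, n = tp + fn, fp + tn                 # number of actual positives / negatives

bal_acc = 0.5 * (tp / p + tn / n)       # (0/5 + 993/1000) / 2 ≈ 0.497
bal_err = 0.5 * (fn / p + fp / n)       # (5/5 + 7/1000) / 2 ≈ 0.504
plain_err = (fp + fn) / (p + n)         # 12 / 1005 ≈ 0.012

print(round(bal_acc, 3), round(bal_err, 3), round(plain_err, 3))
```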
F-Score

F-Score = (β² + 1) · Precision · Recall / (β² · Precision + Recall)

F1-Score = 2 · Precision · Recall / (Precision + Recall) → harmonic mean

The harmonic mean rewards balanced score components, whereas Bal. acc is the arithmetic mean of TPR and TNR.
[Figure: harmonic vs. arithmetic mean of the two score components, omitted.]
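A hedged sketch of the F_β computation from the formula above; f_beta is an illustrative helper, not part of the slides.

```python
# Sketch: F_beta = (beta^2 + 1) * precision * recall / (beta^2 * precision + recall).
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.8, 0.5), 3))           # F1 ≈ 0.615 (harmonic mean)
print(round(f_beta(0.8, 0.5, beta=2), 3))   # F2 ≈ 0.541 (weights recall more)
```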
Classification Cost
It may be of interest to minimize the expected cost instead of the classification error.

Loss Function
Associate an economic/utility cost to each classification outcome.
The typical loss function in classification is the 0/1 loss:

                 Prediction
                 c+     c-
Actual   c+       0      1
         c-       1      0

More generally, a cost matrix assigns a cost to each cell of the confusion matrix:

                 Prediction
                 c+          c-
Actual   c+      Cost_TP     Cost_FN
         c-      Cost_FP     Cost_TN
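A sketch of how an expected cost could be computed from confusion-matrix counts and a cost matrix; the cost values below are illustrative assumptions.

```python
# Sketch: expected classification cost; the 0/1 loss reduces to the classification error.
def expected_cost(tp, fn, fp, tn, cost_tp=0.0, cost_fn=1.0, cost_fp=1.0, cost_tn=0.0):
    total = tp + fn + fp + tn
    return (tp * cost_tp + fn * cost_fn + fp * cost_fp + tn * cost_tn) / total

tp, fn, fp, tn = 10, 2, 2, 8
print(round(expected_cost(tp, fn, fp, tn), 3))                 # 0/1 loss -> 0.182
print(round(expected_cost(tp, fn, fp, tn, cost_fn=10.0), 3))   # false negatives 10x costlier
```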
ROC Space
A coordinate system used for visualizing classifier performance, where the TPR is plotted on the Y axis and the FPR on the X axis.
[Figure: several classifiers (1: kNN, 2: Neural network, 5: Linear regression, 6: Decision tree, ...) plotted as points in ROC space, omitted.]
ROC Curve
For a probabilistic/fuzzy classifier, a ROC curve is a plot of the TPR vs. the FPR as its discrimination threshold T is varied.
[Figure: ROC curve together with a table of posterior probabilities p(c+|x) and the labels assigned at thresholds T = 0.2, 0.5 and 0.8, omitted.]
For a crisp classifier, a ROC curve can be obtained by interpolation from a single point.
[Figure: ROC curve interpolated from the single (FPR, TPR) point of a crisp classifier, omitted.]
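A sketch of how the ROC points and the AUC could be obtained by sweeping the threshold over the posterior scores; the scores and labels are illustrative, and tied scores are not treated specially.

```python
# Sketch: ROC points by thresholding p(c+|x) at every observed score, AUC by the trapezoid rule.
import numpy as np

def roc_points(labels, scores):
    """labels: 1 for c+, 0 for c-. Returns (FPR, TPR) arrays as the threshold decreases."""
    order = np.argsort(-np.asarray(scores))        # sort by decreasing score
    y = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (1 - y).sum()))
    return fpr, tpr

labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.99, 0.90, 0.85, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20]
fpr, tpr = roc_points(labels, scores)
print(np.round(fpr, 2), np.round(tpr, 2), round(np.trapz(tpr, fpr), 3))  # last value: AUC
```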
ROC Curve Properties
- Insensitive to skewed class distributions
- Insensitive to misclassification costs

Dominance Relationship
A ROC curve A dominates another ROC curve B if A is always above and to the left of B in the plot.
Dominance
A dominates B throughout the whole range of the threshold T.
[Figure: ROC curve A lying above ROC curve B everywhere, omitted.]
No-Dominance
The dominance relationship may not be so clear: no model is the best under all possible scenarios.
[Figure: two crossing ROC curves A and B, omitted.]
AUC (Area Under the ROC Curve)
- If A dominates B, then AUC(A) ≥ AUC(B).
- If A does not dominate B, the AUC alone cannot identify the best classifier.
[Figure: crossing ROC curves compared by their areas, omitted.]
Generalization to the Multi-Class Case
A two-class score can be generalized by computing it for each class c_i (taken as the positive class) and averaging with the class probabilities:

score_TOT = Σ_{i=1}^{n} score_i · p(c_i)
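A minimal sketch of the weighted multi-class aggregation above, with illustrative per-class recalls and class priors.

```python
# Sketch: score_TOT = sum_i score_i * p(c_i).
per_class_recall = {"c1": 0.90, "c2": 0.75, "c3": 0.60}   # illustrative per-class scores
class_prior      = {"c1": 0.50, "c2": 0.30, "c3": 0.20}   # illustrative p(c_i)

score_tot = sum(per_class_recall[c] * class_prior[c] for c in per_class_recall)
print(round(score_tot, 3))   # 0.90*0.5 + 0.75*0.3 + 0.60*0.2 = 0.795
```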
3. Estimation Methods

Introduction
Estimation
- Select a score to measure the quality
- Calculate the true value of the score
- Only limited information is available: the physical process is observed through a finite data set
The classification model is learned from the data set, but we want to know how it behaves on the underlying physical process.

True Value ε_N
The expected value of the score for a set of N data samples sampled from ρ(C, X).
Bias
The difference between the estimate of the score and its true value: E_ρ(ε̂ - ε_N).

Variance
The deviation of the estimated value from its expected value: var(ε̂ - ε_N).
Resubstitution
The classifier is learned on the whole data set and the score is estimated on that same data set; the resulting estimate is optimistically biased.
Hold-Out
The data set is split into a training set of N1 samples, used to learn the classifier, and a test set of N2 samples, used to estimate the score.

Classification Error Estimation
- Unbiased estimator of ε_{N1}
- Biased estimator of ε_N
- Large bias (pessimistic estimation of the true classification error)
- The bias is related to N1 and N2
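A sketch of a hold-out estimate, using scikit-learn purely as an illustration; the synthetic data, the classifier and the 70/30 split are assumptions, not part of the slides.

```python
# Sketch: hold-out error vs. the (optimistic) training error of the same model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
holdout_error = 1.0 - clf.score(X_te, y_te)   # unbiased for eps_{N1}, pessimistic for eps_N
train_error = 1.0 - clf.score(X_tr, y_tr)     # training error of the same model (optimistic)
print(round(holdout_error, 3), round(train_error, 3))
```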
k-Fold Cross-Validation
The data set is split into k folds; each fold is used once as the test set while the classifier is learned on the remaining k - 1 folds, and the k estimates are averaged.

Classification Error Estimation
- Unbiased estimator of ε_{N - N/k}
- Biased estimator of ε_N
- Smaller bias than Hold-Out

Leave-One-Out
- Special case of k-fold cross-validation (k = N)
- Quasi-unbiased estimation of ε_N
- Improves the bias with respect to CV
- Increases the variance → more unstable
- Higher computational cost
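A sketch of k-fold cross-validation and leave-one-out estimates using scikit-learn utilities; the data set and classifier are illustrative.

```python
# Sketch: 10-fold CV error vs. leave-one-out error (k = N, costlier and more variable).
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=100, random_state=0)
clf = GaussianNB()

cv10 = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(round(1.0 - cv10.mean(), 3))   # 10-fold CV error estimate
print(round(1.0 - loo.mean(), 3))    # leave-one-out error estimate
```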
Bootstrap
Resamples of size N are drawn from the data set with replacement; the classifier is learned on each resample D_i* and evaluated on the samples left out of it.

Leave-One-Out Bootstrap
- Mimics cross-validation
- Each classifier φ_i is tested on D \ D_i*

Bias-corrected estimators
- Hold-Out/Cross-Validation: several proposals improve the bias estimation, but they are surprisingly not very widespread
- Bootstrap: well-established bias-corrected methods exist
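A sketch of the leave-one-out bootstrap idea: train on each bootstrap resample and test only on the samples it leaves out; the data, classifier and number of resamples are illustrative.

```python
# Sketch: leave-one-out bootstrap error estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, random_state=0)
N, B = len(y), 50
errors = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)            # bootstrap resample (with replacement)
    out = np.setdiff1d(np.arange(N), idx)       # samples not drawn -> test set
    if out.size == 0:
        continue
    clf = GaussianNB().fit(X[idx], y[idx])
    errors.append(1.0 - clf.score(X[out], y[out]))
print(round(float(np.mean(errors)), 3))
```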
Corrected Hold-Out (ε̂⁺_ho) (Burman, 1989)

ε̂⁺_ho = ε̂_ho + ε̂_res - ε̂_{ho-N}

where
- ε̂_ho is the standard Hold-Out estimator,
- ε̂_res is the resubstitution error,
- ε̂_{ho-N} is the error of the classifier learned on the Hold-Out learning set but tested on the whole data set D.
Corrected Cross-Validation (ε̂⁺_cv)

Bias(ε̂⁺_cv) ≈ Const / ((k - 1) · N²)
Corrected Bootstrap

γ̂ = Σ_{i=1}^{N} Σ_{j=1}^{N} δ(c_i, φ(x_j)) / N²

R̂ = (ε̂_loo-boot - ε̂_res) / (γ̂ - ε̂_res)
Stratification
Keeps the proportion of each class in the train/test data.
- Hold-Out: stratified splitting
- Cross-Validation: stratified splitting
- Bootstrap: stratified sampling

Repeated Methods
- Applicable to Hold-Out and Cross-Validation (Bootstrap already includes the sampling)
- Repeated Hold-Out/Cross-Validation: repeat the estimation process t times and take a simple average over the results
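A sketch combining stratification and repetition via scikit-learn's RepeatedStratifiedKFold; the imbalanced synthetic data set is an assumption for illustration.

```python
# Sketch: repeated, stratified 10-fold cross-validation averaged over 5 repetitions.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print(round(1.0 - scores.mean(), 3))   # averaged error estimate
```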
4. Hypothesis Testing
Motivation: Basic Concepts
- Hypothesis tests form the basis of scientific reasoning in experimental sciences
- They are used to establish scientific statements
- A hypothesis H0, called the null hypothesis, is tested against another hypothesis H1, called the alternative
- The two hypotheses are not at the same level: rejecting H0 does not mean accepting H1
- The objective is to know whether the observed differences are due to randomness or not
Hypothesis Testing

                     H0 TRUE              H0 FALSE
Decision: ACCEPT     correct              Type II error (β)
Decision: REJECT     Type I error (α)     correct
β = P_{H1}(X ∈ A.R.), e.g. P_{H1}(X > 55) for an acceptance region {X > 55}

p-value = P_{H0}(X ≤ x)
Power: (1 - β)
Depending on the hypotheses, the type II error (β) cannot be calculated, e.g.

H0: µ = 60
H1: µ ≠ 60
Scenarios
- Two classifiers (algorithms) vs. more than two
- One dataset vs. more than one dataset
- The score being used
- Score estimation method known vs. unknown
- Whether the classifiers are trained and tested on the same datasets
- ...
The General Approach

Comparing two classifiers:
H0: one classifier has the same score value as the other in p(x, c)
H1: they have different values

Comparing two algorithms:
H0: one algorithm has the same average score value as the other in p(x, c)
H1: they have different values
Assume the estimated errors are approximately Gaussian,

ε̂_i ~ N(score(φ_i), s_i),   i = 1, 2.

Therefore, under the null hypothesis,

Z = (ε̂_1 - ε̂_2) / sqrt(s_1² + s_2²) ~ N(0, 1)
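A sketch of this normal-approximation comparison, assuming s_1 and s_2 are the standard deviations of the two error estimates; the numbers are illustrative.

```python
# Sketch: two-sided z-test for the difference of two estimated errors.
from math import erf, sqrt

def z_test(e1, s1, e2, s2):
    z = (e1 - e2) / sqrt(s1**2 + s2**2)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))   # two-sided p-value under N(0,1)
    return z, p

z, p = z_test(0.12, 0.02, 0.17, 0.03)
print(round(z, 3), round(p, 3))
```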
When both classifiers are tested on the same samples, count how often each one is right or wrong:

                        classifier 2 error    classifier 2 ok
classifier 1 error      n00                   n01
classifier 1 ok         n10                   n11

Under H0 we have n10 ≈ n01, and the statistic

(|n01 - n10| - 1)² / (n01 + n10)

approximately follows a χ² distribution with 1 degree of freedom (McNemar's test).
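A sketch of an exact (binomial) version of this disagreement test; the counts n01 and n10 are illustrative.

```python
# Sketch: exact two-sided test on the disagreements, Binomial(n01 + n10, 1/2) under H0.
from math import comb

def mcnemar_exact_p(n01, n10):
    n, k = n01 + n10, min(n01, n10)
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2**n
    return min(1.0, 2.0 * tail)

print(round(mcnemar_exact_p(n01=12, n10=4), 3))
```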
Initial Approaches: Averaging Over Datasets

Paired t-test
With c_i = c_i^1 - c_i^2 and d̄ = (1/N) Σ_{i=1}^{N} c_i, the statistic d̄ / σ̂_d̄ follows a t distribution with N - 1 degrees of freedom.

Problems
- Commensurability of the scores across datasets
- Outlier susceptibility
- Gaussian assumption (t-test)

Wilcoxon Signed-Rank Test
With T the smaller of the sums of ranks of the positive and negative differences (see the example below), for large N

z = (T - N(N+1)/4) / sqrt(N(N+1)(2N+1)/24) ~ N(0, 1)
Example (Wilcoxon signed-rank test on 10 datasets)

             Classifier 1   Classifier 2   diff     rank
Dataset1     0.763          0.598          -0.165   10
Dataset2     0.599          0.591          -0.008    6
Dataset3     0.954          0.971          +0.017    7
Dataset4     0.628          0.661          +0.033    8
Dataset5     0.882          0.888          +0.006    4
Dataset6     0.936          0.931          -0.005    3
Dataset7     0.661          0.668          +0.007    5
Dataset8     0.583          0.583           0.000    1.5
Dataset9     0.775          0.838          +0.063    9
Dataset10    1.000          1.000           0.000    1.5

R+ = 7 + 8 + 4 + 5 + 9 + (1/2)(1.5 + 1.5) = 34.5
R- = 10 + 6 + 3 + (1/2)(1.5 + 1.5) = 20.5
T = min(R+, R-) = 20.5
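The same comparison can be reproduced with scipy (a sketch; the zero-difference handling must be set to "zsplit" to mirror the rank-splitting convention used above).

```python
# Sketch: Wilcoxon signed-rank test on the 10-dataset example.
from scipy.stats import wilcoxon

c1 = [0.763, 0.599, 0.954, 0.628, 0.882, 0.936, 0.661, 0.583, 0.775, 1.000]
c2 = [0.598, 0.591, 0.971, 0.661, 0.888, 0.931, 0.668, 0.583, 0.838, 1.000]
stat, p = wilcoxon(c1, c2, zero_method="zsplit")   # stat = min(R+, R-) for the two-sided test
print(stat, round(p, 3))
```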
Sign Test
- A non-parametric test that counts the number of losses, ties and wins
- Under the null hypothesis, the number of wins follows a binomial distribution B(N, 1/2)
- For large values of N, the number of wins follows N(N/2, √N/2) under the null
- This test does not make any distributional assumptions
- It is weaker than the Wilcoxon test
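A sketch of the two-sided sign test from the binomial distribution; the win/loss counts are illustrative and ties are assumed to be dropped.

```python
# Sketch: sign test; under H0 the number of wins ~ Binomial(N, 1/2).
from math import comb

def sign_test_p(wins, losses):
    n, k = wins + losses, max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2.0 * tail)     # two-sided p-value

print(round(sign_test_p(wins=5, losses=3), 3))
```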
Example data (Demšar, 2006): scores of four classifiers C1-C4 on six datasets

          C1      C2      C3      C4
D1        0.84    0.79    0.89    0.43
D2        0.57    0.78    0.78    0.93
D3        0.62    0.87    0.88    0.71
D4        0.95    0.55    0.49    0.72
D5        0.84    0.67    0.89    0.89
D6        0.51    0.63    0.98    0.55
Multiple Hypothesis Testing
Testing all possible pairs of hypotheses µ_i = µ_j, ∀ i, j, is a multiple hypothesis testing problem.

ANOVA vs. Friedman
- Repeated-measures ANOVA: assumes Gaussianity and sphericity
- Friedman: non-parametric test
Friedman Test
1. Rank the algorithms for each dataset separately (1 = best); in case of ties assign average ranks
2. Calculate the average rank R_j of each algorithm j
3. The statistic

χ²_F = (12N / (k(k+1))) [ Σ_j R_j² - k(k+1)²/4 ]

follows a χ² distribution with k - 1 degrees of freedom (N > 10, k > 5)
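A sketch of the Friedman statistic on the Demšar example table above, computed both from the formula and with scipy's built-in test (which additionally applies a tie correction).

```python
# Sketch: Friedman test for 4 classifiers on 6 datasets.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

scores = np.array([[0.84, 0.79, 0.89, 0.43],
                   [0.57, 0.78, 0.78, 0.93],
                   [0.62, 0.87, 0.88, 0.71],
                   [0.95, 0.55, 0.49, 0.72],
                   [0.84, 0.67, 0.89, 0.89],
                   [0.51, 0.63, 0.98, 0.55]])

ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)   # rank 1 = best per dataset
R = ranks.mean(axis=0)                                               # average rank per algorithm
N, k = scores.shape
chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R**2) - k * (k + 1)**2 / 4)
print(np.round(R, 2), round(chi2_F, 3))

stat, p = friedmanchisquare(*scores.T)    # scipy version, tie-corrected
print(round(stat, 3), round(p, 3))
```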
Friedman Test: Example
The test is applied to the four classifiers C1-C4 on the six datasets of the example table above.
Post-hoc Tests
After the decision on the null hypothesis, in case of rejection, post-hoc tests are used to:
1. Compare all pairs of classifiers
2. Compare all classifiers with a control
If each of n independent tests accepts its true null hypothesis with probability 1 - α,

P(accept all) = P(accept H1) × P(accept H2) × ... × P(accept Hn) = (1 - α)^n

and therefore the family-wise error is

FWE = 1 - (1 - α)^n ≈ 1 - (1 - αn) = αn
Bonferroni-Dunn Test
- A one-step method
- Modify α to take into account the number of comparisons: use α / (k - 1)

Holm Method
- A step-down procedure
- With the p-values sorted in increasing order, start from p_1 and find the first i = 1, ..., k - 1 such that p_i > α / (k - i)
- The hypotheses H_1, ..., H_{i-1} are rejected; the rest of the hypotheses are kept
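A sketch of the Holm step-down rule as stated above, applied to illustrative p-values for the k - 1 comparisons against a control.

```python
# Sketch: Holm procedure; reject while p_(i) <= alpha / (k - i), stop at the first failure.
def holm(p_values, alpha=0.05):
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    k = len(p_values) + 1                      # k - 1 comparisons -> k classifiers
    rejected = [False] * len(p_values)
    for step, idx in enumerate(order, start=1):
        if p_values[idx] > alpha / (k - step):
            break                              # this and all larger p-values are kept
        rejected[idx] = True
    return rejected

print(holm([0.001, 0.020, 0.049, 0.200]))      # -> [True, False, False, False]
```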
For the example, the pairwise statistics z_12, z_13, z_14, z_23, z_24, z_34 are computed and compared against the adjusted significance levels of each procedure.
[Table of pairwise z-values and the resulting accept/reject decisions, omitted.]
Hochberg Method
- A step-up procedure
- Starting with p_{k-1}, find the first i = k - 1, ..., 1 such that p_i < α / (k - i)
- The hypotheses H_1, ..., H_i are rejected; the rest of the hypotheses are kept

Hommel Method
- Find the largest j such that p_{n-j+k} > kα / j for all k = 1, ..., j
- Reject all hypotheses H_i such that p_i ≤ α / j
In the example, the post-hoc comparisons find C1 equal to C3.

Remarks
Adjusted p-values can be reported for the multiple comparisons instead of fixed significance levels.

Conclusions

Thank You!