
Lec09 Classifier Evaluation

The document discusses classifier performance evaluation and comparison in machine learning, outlining key concepts such as score functions, estimation methods, and hypothesis testing. It emphasizes the importance of honest evaluation to determine the best classification paradigm and parameter configurations. Various scoring metrics, including accuracy, recall, and precision, are introduced to assess classifier quality.


Machine Learning (CS60050)
Classifier / Hypothesis Evaluation

Classifier Performance Evaluation and Comparison

Jose A. Lozano, Guzmán Santafé, Iñaki Inza
Intelligent Systems Group, The University of the Basque Country
International Conference on Machine Learning and Applications (ICMLA 2010), December 12-14, 2010
Outline of the Tutorial

1. Introduction
2. Scores
3. Estimation Methods
4. Hypothesis Testing
1. Introduction
Classification Problem

A physical process (usually unknown) generates data; an expert labels the samples, producing a data set.
Supervised Classification

Learning from experience: "automate the work of the expert". A classification model is learned from the data set and tries to model ρ(X, C), the joint distribution behind the (usually unknown) physical process.

The classification model (classifier) then labels new data whose class value is unknown.
Motivation for Honest Evaluation

Many classification paradigms can be learned from the same data set: Naive Bayes, neural networks, decision trees, ...
Which is the best paradigm for a given classification problem?
Motivation for Honest Evaluation

Within a single paradigm (e.g. Naive Bayes) there are also many possible parameter configurations.
Which is the best parameter configuration for a given classification problem?
Motivation for Honest Evaluation

Honest evaluation requires:
knowing the goodness of a classifier,
a methodology to compare classifiers,
assessing the validity of the evaluation/comparison.

Steps for honest evaluation:
Scores: quality measures.
Estimation methods: estimate the value of a score.
Statistical tests: comparison among different solutions.
2. Scores
Motivation

How do we compare classification models? We need some way to measure classification performance.

Score
A function that provides a quality measure for a classifier when solving a classification problem.
Motivation

What does "best quality" mean? It depends on what we are interested in and what we want to optimize, on the characteristics of the problem and on the characteristics of the data set. This leads to different kinds of scores.
Scores

Based on the confusion matrix:
Accuracy / classification error → classification
Recall, specificity, precision, F-score → information retrieval

Based on Receiver Operating Characteristics (ROC):
Area under the ROC curve (AUC) → medical domains
Confusion Matrix

Two-Class Problem

                  Prediction
                  c+    c-    Total
Actual   c+       TP    FN    N+
         c-       FP    TN    N-
         Total    N̂+    N̂-    N
Confusion Matrix

Several-Class Problem

                  Prediction
                  c1     c2     c3     ...    cn     Total
Actual   c1       TP1    FN12   FN13   ...    FN1n   N1
         c2       FN21   TP2    FN23   ...    FN2n   N2
         c3       FN31   FN32   TP3    ...    FN3n   N3
         ...      ...    ...    ...    ...    ...    ...
         cn       FNn1   FNn2   FNn3   ...    TPn    Nn
         Total    N̂1     N̂2     N̂3     ...    N̂n     N
Two-Class Problem - Example

[Scatter plot of a two-class data set; each sample has features X1, X2 and class c+ or c-, e.g. (3.1, 2.4, c+), (1.7, 1.8, c-), (2.6, 1.7, c+), (1.8, 2.9, c-), ...]
Two-Class Problem - Example

                  Prediction
                  c+    c-    Total
Actual   c+       10    2     12
         c-       2     8     10
         Total    12    10    22
Accuracy / Classification Error

Definition
Fraction of data samples classified correctly / incorrectly (here ψ denotes the classifier):

ε(ψ) = p(ψ(X) ≠ C) = E_ρ(x,c)[1 - δ(c, ψ(x))]
Accuracy / Classification Error - Example

For the confusion matrix above:

ε = (FP + FN) / N = (2 + 2) / 22 ≈ 0.182
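As a quick sanity check of these definitions, the counts and the error above can be reproduced with scikit-learn. This is a minimal sketch; the label vectors are hypothetical stand-ins that reproduce the 22-sample confusion matrix of the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels reproducing the example: 10 TP, 2 FN, 2 FP, 8 TN
# (c+ encoded as 1, c- as 0).
y_true = np.array([1]*12 + [0]*10)
y_pred = np.array([1]*10 + [0]*2 + [1]*2 + [0]*8)

# Rows = actual class, columns = predicted class, ordered (c+, c-).
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
acc = accuracy_score(y_true, y_pred)
print(cm)        # [[10  2], [ 2  8]]
print(1 - acc)   # classification error ~ 0.182
```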
Skew Data

[Scatter plot of a highly imbalanced two-class data set: a few c+ samples, e.g. (0.8, 2.2), (0.47, 2.3), (2.4, 2.9), among many c- samples.]
Skew Data - Classification Error

                  Prediction
                  c+    c-    Total
Actual   c+       0     5     5
         c-       7     993   1000
         Total    7     998   1005

ε = (7 + 5) / 1005 ≈ 0.012, a very low error!
Skew Data - Classification Error

                  Prediction
                  c+    c-    Total
Actual   c+       0     5     5
         c-       0     1000  1000
         Total    0     1005  1005

ε = (0 + 5) / 1005 ≈ 0.005. Better? A classifier that always predicts c- obtains an even lower error.
Positive Unlabeled Learning

Only a few positive samples are labeled; the remaining samples are unlabeled and may be either positive or negative.

In this setting the classification error is useless.
Recall

Definition
Fraction of positive class samples correctly classified.
Other names: true positive rate, sensitivity.

r(ψ) = TP / (TP + FN) = TP / P

Definition based on probabilities: r(ψ) = p(ψ(x) = c+ | C = c+)
Skew Data - Recall

For the skewed example above:

r(ψ) = 0 / (0 + 5) = 0, a very bad recall!
Positive Unlabeled Learning - Recall

Recall only involves the labeled positive samples, so it can still be calculated in positive-unlabeled problems: count how many of the labeled positives are predicted as c+.
Precision

Definition
Fraction of data samples classified as c+ which are actually c+.

pr(ψ) = TP / (TP + FP) = TP / N̂+

Definition based on probabilities: pr(ψ) = p(C = c+ | ψ(x) = c+) = E_ρ(x | ψ(x)=c+)[δ(C, c+)]
Skew Data - Precision

For the skewed example above:

pr(ψ) = 0 / (0 + 7) = 0, a very bad precision!
Positive Unlabeled Learning - Precision

Precision is not a good score for positive-unlabeled data: since not all the positive samples are labeled, samples predicted as c+ cannot be checked reliably against the labels.
Precision & Recall - Application Domains

Spam filtering: decide whether an email is spam or not.
Precision: proportion of real spam in the spam-box.
Recall: proportion of all spam messages identified by the system.

Sentiment analysis: classify opinions about specific products given by users in blogs, websites, forums, etc.
Precision: proportion of opinions classified as positive that are actually positive.
Recall: proportion of positive opinions identified as positive.
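A short sketch of how recall and precision expose the pathology of the skewed example (the degenerate classifier that never predicts c+); the arrays are hypothetical reconstructions of the 1005-sample counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical skewed example: 5 positives, 1000 negatives, and a
# degenerate classifier that always predicts the negative class.
y_true = np.array([1]*5 + [0]*1000)
y_pred = np.zeros_like(y_true)

print(1 - accuracy_score(y_true, y_pred))                # error ~ 0.005, looks great
print(recall_score(y_true, y_pred, zero_division=0))     # recall = 0.0
print(precision_score(y_true, y_pred, zero_division=0))  # precision = 0.0 (no positive predictions)
```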
Specificity

Definition
Fraction of negative class samples correctly identified.
Specificity = 1 - FalsePositiveRate

sp(ψ) = TN / (TN + FP) = TN / N-

Definition based on probabilities: sp(ψ) = p(ψ(x) = c- | C = c-)
Skew Data - Specificity

For the skewed example above:

sp(ψ) = 993 / (993 + 7) ≈ 0.99
Skew Data - Specificity

For the degenerate classifier that always predicts c-:

sp(ψ) = 1000 / (1000 + 0) = 1.00
Balanced Scores

Balanced accuracy rate:
Bal. acc = (1/2)(TP/P + TN/N) = (recall + specificity) / 2

Balanced error rate:
Bal. ε = (1/2)(FP/N + FN/P)

For the skewed example above:
Bal. acc = (1/2)(0/5 + 993/1000) ≈ 0.5
Bal. ε = (1/2)(7/1000 + 5/5) ≈ 0.5
Balanced Scores

F-Score = (β² + 1) · Precision · Recall / (β² · Precision + Recall)
F1-Score = 2 · Precision · Recall / (Precision + Recall), the harmonic mean of precision and recall.

The harmonic mean is maximized when its components are balanced, whereas the balanced accuracy is the arithmetic mean of TPR and TNR.

[Plot comparing the harmonic mean of TPR and TNR with the balanced accuracy.]
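The following sketch computes the balanced scores for the same hypothetical skewed example; balanced_accuracy_score, f1_score and fbeta_score are scikit-learn's implementations of the formulas above.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, fbeta_score

# Same hypothetical skewed example: 5 positives, 1000 negatives; the classifier
# predicts c+ for none of the positives and for 7 of the negatives.
y_true = np.array([1]*5 + [0]*1000)
y_pred = np.array([0]*5 + [1]*7 + [0]*993)

print(balanced_accuracy_score(y_true, y_pred))                # ~ 0.4965, close to 0.5
print(f1_score(y_true, y_pred, zero_division=0))              # 0.0: recall and precision are 0
print(fbeta_score(y_true, y_pred, beta=2, zero_division=0))   # F-beta score with beta = 2
```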
Classification Cost

Not all misclassifications can be considered equal.

E.g. in a medical diagnosis problem, diagnosing a healthy patient as ill does not have the same cost as diagnosing an ill patient as healthy.

For the classification model it may therefore be of interest to minimize the expected cost instead of the classification error.
Dealing with Classification Cost

Loss Function
Associates an economic/utility cost to each classification outcome.
The typical loss function in classification is the 0/1 loss:

                  Prediction
                  c+    c-
Actual   c+       0     1
         c-       1     0

More generally, a cost matrix specifies the cost associated with each outcome:

                  Prediction
                  c+        c-
Actual   c+       Cost_TP   Cost_FN
         c-       Cost_FP   Cost_TN

It is usually not easy to assign such costs.
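A minimal sketch of how a cost matrix can be combined with the confusion matrix to obtain an expected cost per sample; the cost values and label vectors below are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical skewed example again (1 = c+, 0 = c-).
y_true = np.array([1]*5 + [0]*1000)
y_pred = np.array([0]*5 + [1]*7 + [0]*993)

# Cost matrix indexed as [actual, predicted]; missing an ill patient (FN)
# is assumed to be 10 times as costly as a false alarm (FP).
cost = np.array([[0.0, 10.0],    # actual c+: Cost_TP, Cost_FN
                 [1.0,  0.0]])   # actual c-: Cost_FP, Cost_TN

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])   # rows/cols ordered c+, c-
expected_cost = (cm * cost).sum() / cm.sum()
print(expected_cost)   # (5*10 + 7*1) / 1005 ~ 0.0567
```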
Receiver Operating Characteristics (ROC)

ROC Space
A coordinate system used for visualizing classifier performance, where the TPR is plotted on the Y axis and the FPR on the X axis.

[ROC-space plot with one point per classifier: 1: kNN, 2: neural network, 3: Naive Bayes, 4: SVM, 5: linear regression, 6: decision tree.]
Receiver Operating Characteristics (ROC)

ROC Curve
For a probabilistic/fuzzy classifier, the ROC curve is a plot of the TPR vs. the FPR as its discrimination threshold is varied.

[Example: a ranked list of posterior probabilities p(c+|x) (0.99, 0.90, 0.85, ..., 0.05), the predicted label at thresholds T = 0.2, 0.5, 0.8, the true class C, and the resulting ROC curve.]
Receiver Operating Characteristics (ROC)

ROC Curve
For a crisp classifier, a ROC curve can be obtained by interpolation from its single (FPR, TPR) point.
Receiver Operating Characteristics (ROC)

ROC Curve
Insensitive to skewed class distributions.
Insensitive to misclassification costs.

Dominance Relationship
A ROC curve A dominates another ROC curve B if A is always above and to the left of B in the plot.
Receiver Operating Characteristics (ROC)

Dominance
A dominates B throughout the whole range of the threshold T: A has a better predictive performance under any condition of cost and class distribution.

[ROC plot of a curve A dominating a curve B.]
Receiver Operating Characteristics (ROC)

No-Dominance
The dominance relationship may not be so clear: when the curves cross, no model is the best under all possible scenarios.

[ROC plot of two crossing curves A and B.]
Receiver Operating Characteristics (ROC)

Area Under the ROC Curve (AUC)
Equivalent to the Wilcoxon rank-sum (Mann-Whitney) statistic.
If A dominates B, then AUC(A) ≥ AUC(B).
If A does not dominate B, the AUC alone cannot identify the best classifier.
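A sketch of how a ROC curve and the AUC are obtained from classifier scores with scikit-learn; the posterior probabilities and labels below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical posterior probabilities p(c+|x) and true labels (1 = c+, 0 = c-).
y_true  = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.99, 0.90, 0.85, 0.80, 0.78, 0.60, 0.45, 0.40, 0.20, 0.05])

# fpr/tpr are computed at every distinct threshold of y_score.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(list(zip(fpr, tpr)))
print(auc)
```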
Generalization to Multi-Class Problems

Most of the presented scores are defined for binary classification, but a generalization to the multi-class case is possible, e.g. with a one-vs-all approach: for each class ci the multi-class confusion matrix is collapsed into a binary one (ci vs. all the other classes), from which TP, TN, FP and FN, and hence score_i, are obtained.

                  Prediction
                  c1     c2     c3     ...    cn     Total
Actual   c1       TP1    FN12   FN13   ...    FN1n   P1
         c2       FN21   TP2    FN23   ...    FN2n   P2
         c3       FN31   FN32   TP3    ...    FN3n   P3
         ...      ...    ...    ...    ...    ...    ...
         cn       FNn1   FNn2   FNn3   ...    TPn    Pn
         Total    P̂1     P̂2     P̂3     ...    P̂n     N

The per-class scores are then combined, weighting each one by its class probability:

score_TOT = Σ_{i=1}^{n} score_i · p(ci)
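A sketch of the one-vs-all weighting described above, using per-class recall as the score; the 3-class label vectors are hypothetical and the class priors are estimated from them.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical 3-class example (classes 0, 1, 2).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

classes, counts = np.unique(y_true, return_counts=True)
priors = counts / counts.sum()

# One-vs-all: recall of each class against all the others.
per_class = recall_score(y_true, y_pred, labels=classes, average=None)
score_tot = np.sum(per_class * priors)

print(per_class)   # score_i for each class
print(score_tot)   # sum_i score_i * p(c_i)
# For recall, this prior-weighted combination coincides with
# recall_score(y_true, y_pred, average='weighted').
```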
Scores

The choice of a specific score depends on:
the application domain,
the characteristics of the problem,
the characteristics of the data set,
our interest when solving the problem, etc.
3. Estimation Methods
Introduction

Estimation
Select a score to measure the quality of the classification model.
We would like the true value of the score under the physical process, but only limited information is available: a finite data set.
The quality measures computed from it (error, recall, precision, ...) are therefore random variables.
Introduction

True Value - εN
The expected value of the score for a classifier built from N data samples drawn from ρ(C, X).
Since ρ(C, X) is unknown, we can only obtain a point estimation of the score, ε̂.
Introduction

Bias: the difference between the estimation of the score and its true value, E_ρ(ε̂ - εN).
Variance: the deviation of the estimated value from its expected value, var(ε̂ - εN).
Both depend on the estimation method, so a trade-off between bias and variance is needed.
Introduction

A finite data set is used to estimate the score; several estimation methods exist depending on how this data set is dealt with.
Resubstitution

The same data set is used both for training the classifier and for testing it.

Classification Error Estimation
The simplest estimation method.
A biased estimator of εN with smaller variance, but too optimistic (overfitting problem).
A bad estimator of the true classification error.
Hold-Out

The data set is split into two disjoint parts: a training set of size N1, used to learn the classifier, and a test set of size N2, used to estimate the score.

Classification Error Estimation
An unbiased estimator of εN1, but a biased estimator of εN.
Large bias: a pessimistic estimation of the true classification error.
The bias is related to N1 and N2.
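A sketch of a hold-out estimation with scikit-learn; the classifier and the data set are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold-out: 2/3 for training (N1), 1/3 for testing (N2), stratified split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = GaussianNB().fit(X_tr, y_tr)
error_hat = 1 - accuracy_score(y_te, clf.predict(X_te))
print(error_hat)   # hold-out estimate of the classification error
```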
k-Fold Cross-Validation

The data set is split into k folds. In turn, each fold is used as the test set while the remaining k - 1 folds are used for training; the k score estimates are averaged.

Classification Error Estimation
An unbiased estimator of ε_{N - N/k}, but a biased estimation of εN; smaller bias than Hold-Out.

Leaving-One-Out
The special case of k-fold Cross-Validation with k = N.
A quasi-unbiased estimation for N samples: improves the bias with respect to CV.
Increases the variance (more unstable) and has a higher computational cost.
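A sketch of a k-fold cross-validation estimate with scikit-learn; leave-one-out is obtained by letting the number of folds equal N. Data set and classifier are again arbitrary illustrations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
clf = GaussianNB()

# 10-fold (stratified) cross-validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
print(1 - acc.mean())          # CV estimate of the classification error

# Leave-one-out (k = N): smaller bias, higher variance and cost.
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring='accuracy')
print(1 - loo_acc.mean())
```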
Bootstrap

Several bootstrap data sets are generated by sampling N instances with replacement from the original data set; a classifier is learned and evaluated on each of them.

Classification Error Estimation
A biased estimation of the classification error.
The variance is improved because of the resampling.
Part of the data used for learning is also used for testing ("similar to resubstitution"), so it suffers from the overfitting problem.
Leaving-One-Out Bootstrap

Mimics Cross-Validation: each classifier ψi is tested on D \ D*i, the samples not drawn into its bootstrap data set.
Tries to avoid the overfitting problem: the expected number of distinct samples in a bootstrap data set is ≈ 0.632 N, so it is similar to a repeated Hold-Out.
Biased upwards: it tends to be a pessimistic estimation.
Improving the Estimation - Bias

Bias correction terms can be used for error estimation.

Hold-Out / Cross-Validation: several proposals improve the bias of the estimation; surprisingly, they are not widely used.

Bootstrap: well-established methods improve the bias of the estimation.
Improving the Estimation - Bias

Corrected Hold-Out (ε̂⁺_ho) - (Burman, 1989)

ε̂⁺_ho = ε̂_ho + ε̂_res - ε̂_ho-N

where ε̂_ho is the standard Hold-Out estimator, ε̂_res is the resubstitution error, and ε̂_ho-N is the error of the classifier learned on the Hold-Out training set but tested on the whole data set D.

Improvement (N1 = training size, N2 = test size):
Bias(ε̂_ho) ≈ Const0 · N2 / (N1 · N)
Bias(ε̂⁺_ho) ≈ Const1 · N2 / (N1 · N²)

Corrected Cross-Validation (ε̂⁺_cv) - (Burman, 1989)

ε̂⁺_cv = ε̂_cv + ε̂_res - ε̂_cv-N

Improvement:
Bias(ε̂_cv) ≈ Const0 / ((k - 1) · N)
Bias(ε̂⁺_cv) ≈ Const1 / ((k - 1)² · N)

0.632 Bootstrap (ε̂^0.632_boot)

ε̂^0.632_boot = 0.368 · ε̂_res + 0.632 · ε̂_loo-boot

Tries to balance optimism (resubstitution) and pessimism (loo-bootstrap).
Works well with "light-fitting" classifiers; with overfitting classifiers ε̂^0.632_boot is still too optimistic.

0.632+ Bootstrap (ε̂^0.632+_boot) - (Efron & Tibshirani, 1997)

Corrects the bias when there is a great amount of overfitting.
Based on the no-information error rate γ̂, the error obtained when predictions and class labels are paired at random:

γ̂ = Σ_{i=1}^{N} Σ_{j=1}^{N} [1 - δ(ci, ψ(xj))] / N²

The relative overfitting rate is used to correct the bias:

R̂ = (ε̂_loo-boot - ε̂_res) / (γ̂ - ε̂_res)

ε̂^0.632+_boot = (1 - ŵ) · ε̂_res + ŵ · ε̂_loo-boot,   with   ŵ = 0.632 / (1 - 0.368 · R̂)

Improving the Estimation - Variance

Stratification
Keeps the proportion of each class in the training/test data.
Hold-Out and Cross-Validation: stratified splitting; Bootstrap: stratified sampling.
It may improve the variance of the estimation.

Repeated Methods
Applicable to Hold-Out and Cross-Validation (Bootstrap already includes resampling).
Repeated Hold-Out / Cross-Validation: repeat the estimation process t times and average the results.
Same bias as the standard estimation methods, but reduced variance with respect to them.

Estimation Methods

Which estimation method is better? It may depend on many aspects: the size of the data set, the classification paradigm used, the stability of the learning algorithm, the characteristics of the classification problem, the bias / variance / computational cost trade-off, ...

Large data sets: Hold-Out may be a good choice; it is computationally cheap, and although its bias is larger, the bias depends on the data set size.

Smaller data sets: repeated Cross-Validation or the 0.632 Bootstrap.

Small data sets: Bootstrap and repeated Cross-Validation may not be informative. Permutation tests (Ojala & Garriga, 2010) can be used to assess the validity of the estimation, and confidence intervals (Isaksson et al., 2008) may provide more reliable information about it.
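A minimal sketch of the leave-one-out bootstrap and the 0.632 estimator described above; the data set, the classifier and the number of replicates B are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
N, B = len(y), 100

# Resubstitution error: train and test on the whole data set.
clf = GaussianNB().fit(X, y)
err_res = np.mean(clf.predict(X) != y)

# Leave-one-out bootstrap: test each replicate on the samples left out of it.
errs = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)          # sample N indices with replacement
    out = np.setdiff1d(np.arange(N), idx)     # samples not drawn (~ 0.368 N of them)
    clf = GaussianNB().fit(X[idx], y[idx])
    errs.append(np.mean(clf.predict(X[out]) != y[out]))
err_loo_boot = np.mean(errs)

# 0.632 bootstrap estimator.
err_632 = 0.368 * err_res + 0.632 * err_loo_boot
print(err_res, err_loo_boot, err_632)
```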
4. Hypothesis Testing
Motivation

Basic Concepts
Hypothesis tests form the basis of scientific reasoning in experimental sciences; they are used to establish scientific statements.
A hypothesis H0, called the null hypothesis, is tested against another hypothesis H1, called the alternative.
The two hypotheses are not at the same level: rejecting H0 does not mean accepting H1.
The objective is to know whether the differences observed under H0 are due to randomness or not.
Hypothesis Testing

Possible Outcomes of a Test
Given a sample, a decision is taken about the null hypothesis (H0); the decision is taken under uncertainty.

                      H0 TRUE             H0 FALSE
Decision: ACCEPT      correct             Type II error (β)
Decision: REJECT      Type I error (α)    correct
Hypothesis Testing: An Example

A Simple Hypothesis Test
A natural process follows a Gaussian distribution N(µ, σ²).
We have a sample {x1, ..., xn} of this process and a decision must be taken about the following hypotheses:

H0: µ = 60
H1: µ = 50

A statistic (a function of the sample) is used to take the decision; in our example, the sample mean X̄ = (1/n) Σ_{i=1}^{n} xi.
Hypothesis Testing: An Example

Accept and Reject Regions
The possible values of the statistic are divided into an accept region and a reject region:

A.R. = {(x1, ..., xn) | X̄ > 55}
R.R. = {(x1, ..., xn) | X̄ ≤ 55}

Assuming a probability distribution for the statistic X̄ (it depends on the distribution of {x1, ..., xn}), the probability of each error type can be calculated:

α = P_H0(X̄ ∈ R.R.) = P_H0(X̄ ≤ 55)
β = P_H1(X̄ ∈ A.R.) = P_H1(X̄ > 55)
Hypothesis Testing: An Example

Accept and Reject Regions
The A.R. and R.R. can be modified in order to obtain a particular value of α:

0.10 = α = P_H0(X̄ ∈ R.R.) = P_H0(X̄ ≤ 51)
0.05 = α = P_H0(X̄ ∈ R.R.) = P_H0(X̄ ≤ 50.3)

p-value: given a sample and the specific value x of the test statistic for that sample,

p-value = P_H0(X̄ ≤ x)
Hypothesis Testing: Remarks

Power: (1 - β)
Depending on the hypotheses, the type II error (β) cannot be calculated:

H0: µ = 60
H1: µ ≠ 60

In this case we do not know the value of µ under H1, so we cannot calculate the power (1 - β).
A good hypothesis test maximizes the power (1 - β) for a given α.
Parametric tests vs. non-parametric tests.
Hypothesis Testing in Supervised Classification

Scenarios
Two classifiers (algorithms) vs. more than two.
One data set vs. more than one data set.
The score used.
Score estimation method known vs. unknown.
Whether the classifiers are trained and tested on the same data sets.
...
Testing Two Algorithms in a Dataset

The General Approach

H0: classifier ψ has the same score value as classifier ψ' in p(x, c)
H1: they have different values

or, at the algorithm level:

H0: algorithm A has the same average score value as algorithm A' in p(x, c)
H1: they have different values
Testing Two Algorithms in a Dataset

An Ideal Context: We Can Sample p(x, c)
1. Sample 2n i.i.d. data sets from p(x, c).
2. Learn 2n classifiers ψ1_i, ψ2_i for i = 1, ..., n.
3. For each classifier obtain enough i.i.d. samples {(x1, c1), ..., (xN, cN)} from p(x, c) as a test set.
4. For each data set calculate the error of each algorithm on the test set:

ε1_i = (1/N) Σ_{j=1}^{N} error1_i(xj)        ε2_i = (1/N) Σ_{j=1}^{N} error2_i(xj)

5. Calculate the average values over the n training data sets:

ε1 = (1/n) Σ_{i=1}^{n} ε1_i        ε2 = (1/n) Σ_{i=1}^{n} ε2_i
Testing Two Algorithms in a Dataset

An Ideal Context: We Can Sample p(x, c)
Our test rejects the null hypothesis if |ε1 - ε2| (the statistic) is big.
Fortunately, by the central limit theorem:

εi ~ N(score(ψi), s²_i),   i = 1, 2

Therefore, under the null hypothesis:

Ẑ = (ε1 - ε2) / sqrt((s²_1 + s²_2) / n) ~ N(0, 1)

... and finally we reject H0 when |Ẑ| > z_{1-α/2}.
Testing Two Algorithms in a Dataset

Properties of Our Ideal Framework
The training data sets are independent.
The testing data sets are independent.

The Sad Reality
We cannot get i.i.d. training samples from p(x, c).
We cannot get i.i.d. testing samples from p(x, c).
We have only one sample from p(x, c).
Testing Two Algorithms in a Dataset

McNemar Test (non-parametric)
Compares two classifiers on a data set after a Hold-Out process; it is a paired non-parametric test.

                  ψ2 ok     ψ2 error
ψ1 ok             n00       n01
ψ1 error          n10       n11

Under H0 we have n01 ≈ n10, and the statistic

(|n01 - n10| - 1)² / (n01 + n10)

follows a χ² distribution with 1 degree of freedom.
When n01 + n10 is small (< 25), the binomial distribution can be used instead.
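A sketch of the McNemar statistic exactly as defined above, given the hold-out predictions of two classifiers; the label and prediction vectors are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred1, pred2):
    """Paired McNemar test from two prediction vectors on the same test set."""
    ok1 = (pred1 == y_true)
    ok2 = (pred2 == y_true)
    n01 = np.sum(ok1 & ~ok2)   # classifier 1 right, classifier 2 wrong
    n10 = np.sum(~ok1 & ok2)   # classifier 1 wrong, classifier 2 right
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    p_value = chi2.sf(stat, df=1)   # chi-squared with 1 degree of freedom
    return stat, p_value

# Hypothetical test-set labels and predictions of two classifiers.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
pred1 = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)   # ~85% accurate
pred2 = np.where(rng.random(200) < 0.75, y_true, 1 - y_true)   # ~75% accurate

print(mcnemar_test(y_true, pred1, pred2))
```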
Testing Two Algorithms in a Dataset

Tests Based on Resampling: Resampled t-test (parametric)
The data set is randomly divided n times into training and test sets.
Let p̂i be the difference between the performances of both algorithms in run i, and p̄ their average. When the p̂i are assumed Gaussian and independent, under the null hypothesis

t = p̄ · sqrt(n) / sqrt(Σ_{i=1}^{n} (p̂i - p̄)² / (n - 1))

follows a Student's t distribution with n - 1 degrees of freedom.

Caution:
the p̂i are not Gaussian, since p̂1_i and p̂2_i are not;
the p̂i are not independent (overlap in training and testing).
Testing Two Algorithms in a Dataset

Resampled t-test Improved (Nadeau & Bengio, 2003)
The variance estimate of the resampled t-test is too optimistic. Two alternatives:
Corrected resampled t-test: the variance σ² is scaled by (1/n + n2/n1) instead of 1/n, where n1 and n2 are the training and test sizes.
Conservative Z: an overestimation of the variance.
-
Classifier performance evaluation and
comparison Hypothesis Testing

Testing Two Algorithms in a Dataset

t-test for k-fold Cross-validation


It is similar to t -test for resampling
In this case the testing datasets are
independent The training datasets are still
dependent

- 124
-
Testing Two Algorithms in a Dataset

5x2-fold Cross-Validation (Dietterich 1998, Alpaydin 1999)
Each Cross-Validation process has independent training and testing data sets.
The statistic

f = Σ_{i=1}^{5} Σ_{j=1}^{2} (p_i^(j))² / (2 Σ_{i=1}^{5} s²_i)

follows an F distribution with 10 and 5 degrees of freedom under the null hypothesis.
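A sketch of the 5x2-cv F statistic above for two scikit-learn classifiers; the data set and the classifiers are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = GaussianNB(), DecisionTreeClassifier(random_state=0)

p = np.zeros((5, 2))                 # performance differences p_i^(j)
for i in range(5):                   # 5 replications of 2-fold CV
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
    for j, (tr, te) in enumerate(cv.split(X, y)):
        acc_a = clf_a.fit(X[tr], y[tr]).score(X[te], y[te])
        acc_b = clf_b.fit(X[tr], y[tr]).score(X[te], y[te])
        p[i, j] = acc_a - acc_b

s2 = np.sum((p - p.mean(axis=1, keepdims=True)) ** 2, axis=1)  # per-replication variance
f_stat = np.sum(p ** 2) / (2 * np.sum(s2))
p_value = f_dist.sf(f_stat, 10, 5)
print(f_stat, p_value)
```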
Testing Two Algorithms in Several Datasets

Initial Approaches
Averaging over data sets.
Paired t-test: with di = c1_i - c2_i the per-data-set score differences and d̄ = (1/N) Σ_{i=1}^{N} di their average, the statistic d̄ / σ_d̄ follows a t distribution with N - 1 degrees of freedom.

Problems
Commensurability of the scores across data sets.
Susceptibility to outliers (t-test).
Gaussian assumption.
Testing Two Algorithms in Several Datasets

Wilcoxon Signed-Ranks Test
A non-parametric test that works as follows:
1. Rank the absolute values of the performance differences between both algorithms.
2. Calculate the sums of ranks R+ and R- where the first (resp. the second) algorithm outperforms the other.
3. Calculate T = min(R+, R-).
For N ≤ 25 there are tables with critical values; for N > 25,

z = (T - N(N + 1)/4) / sqrt(N(N + 1)(2N + 1)/24) ~ N(0, 1)
Wilcoxon Signed-Ranks Test: Example

              Alg. 1   Alg. 2   diff     rank
Dataset1      0.763    0.598    -0.165   10
Dataset2      0.599    0.591    -0.008   6
Dataset3      0.954    0.971    +0.017   7
Dataset4      0.628    0.661    +0.033   8
Dataset5      0.882    0.888    +0.006   4
Dataset6      0.936    0.931    -0.005   3
Dataset7      0.661    0.668    +0.007   5
Dataset8      0.583    0.583    0.000    1.5
Dataset9      0.775    0.838    +0.063   9
Dataset10     1.000    1.000    0.000    1.5

(diff = Alg. 2 - Alg. 1; tied zero differences share their ranks between R+ and R-.)

R+ = 7 + 8 + 4 + 5 + 9 + (1/2)(1.5 + 1.5) = 34.5
R- = 10 + 6 + 3 + (1/2)(1.5 + 1.5) = 20.5
T = min(R+, R-) = 20.5
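The same test is available in SciPy; a sketch reproducing the example. The zero_method='zsplit' option splits the ranks of zero differences between R+ and R-, as done above.

```python
import numpy as np
from scipy.stats import wilcoxon

# Scores of the two algorithms on the 10 data sets of the example.
alg1 = np.array([0.763, 0.599, 0.954, 0.628, 0.882, 0.936, 0.661, 0.583, 0.775, 1.000])
alg2 = np.array([0.598, 0.591, 0.971, 0.661, 0.888, 0.931, 0.668, 0.583, 0.838, 1.000])

# Paired, two-sided Wilcoxon signed-ranks test.
stat, p_value = wilcoxon(alg1, alg2, zero_method='zsplit')
print(stat, p_value)   # statistic ~ min(R+, R-) = 20.5
```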
Testing Two Algorithms in Several Datasets

Wilcoxon Signed-Ranks Test
It also suffers from the commensurability problem, but only qualitatively.
When the assumptions of the t-test are met, Wilcoxon is less powerful than the t-test.
Testing Two Algorithms in Several Datasets

Sign Test
A non-parametric test that counts the number of wins, ties and losses.
Under the null hypothesis the number of wins follows a binomial distribution B(N, 1/2).
For large values of N, the number of wins follows N(N/2, sqrt(N)/2) under the null hypothesis.
This test makes no assumptions, but it is weaker than Wilcoxon.
Testing Several Algorithms in Several Datasets

Example data (Demšar, 2006):

           Alg. 1   Alg. 2   Alg. 3   Alg. 4
D1         0.84     0.79     0.89     0.43
D2         0.57     0.78     0.78     0.93
D3         0.62     0.87     0.88     0.71
D4         0.95     0.55     0.49     0.72
D5         0.84     0.67     0.89     0.89
D6         0.51     0.63     0.98     0.55
Testing Several Algorithms in Several Datasets

Multiple Hypothesis Testing
Testing all possible pairs of hypotheses µi = µj for all i, j: multiple hypothesis testing.
Testing the single hypothesis µ1 = µ2 = ... = µk.

ANOVA vs. Friedman
Repeated-measures ANOVA: assumes Gaussianity and sphericity.
Friedman: a non-parametric test.
Testing Several Algorithms in Several Datasets

Friedman Test
1. Rank the algorithms on each data set separately (1 = best); in case of ties, assign average ranks.
2. Calculate the average rank Rj of each algorithm j.
3. The statistic

χ²_F = (12N / (k(k + 1))) · [Σ_j R²_j - k(k + 1)² / 4]

follows a χ² distribution with k - 1 degrees of freedom (for N > 10, k > 5).
Testing Several Algorithms in Several Datasets

Friedman Test: Example (ranks in parentheses)

           Alg. 1     Alg. 2      Alg. 3      Alg. 4
D1         0.84 (2)   0.79 (3)    0.89 (1)    0.43 (4)
D2         0.57 (4)   0.78 (2.5)  0.78 (2.5)  0.93 (1)
D3         0.62 (4)   0.87 (2)    0.88 (1)    0.71 (3)
D4         0.95 (1)   0.55 (3)    0.49 (4)    0.72 (2)
D5         0.84 (3)   0.67 (4)    0.89 (1.5)  0.89 (1.5)
D6         0.51 (4)   0.63 (2)    0.98 (1)    0.55 (3)

avg. rank  3          2.75        1.83        2.41

χ²_F = (12N / (k(k + 1))) · [Σ_j R²_j - k(k + 1)² / 4] ≈ 2.59
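SciPy implements the Friedman test directly; a sketch on the example scores above. Note that SciPy adds a correction for tied ranks, so its value is slightly larger than the plain formula (2.75 with unrounded average ranks; the slide's 2.59 uses rounded ranks).

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Scores of 4 algorithms (columns) on 6 data sets (rows), from the example.
scores = np.array([
    [0.84, 0.79, 0.89, 0.43],
    [0.57, 0.78, 0.78, 0.93],
    [0.62, 0.87, 0.88, 0.71],
    [0.95, 0.55, 0.49, 0.72],
    [0.84, 0.67, 0.89, 0.89],
    [0.51, 0.63, 0.98, 0.55],
])

# friedmanchisquare takes one sample per algorithm (each of length N data sets).
stat, p_value = friedmanchisquare(*scores.T)
print(stat, p_value)   # ~ 2.84 with SciPy's tie correction
```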
Testing Several Algorithms in Several Datasets

Iman & Davenport, 1980
An improvement of the Friedman test:

F_F = (N - 1) · χ²_F / (N(k - 1) - χ²_F)

follows an F distribution with k - 1 and (k - 1)(N - 1) degrees of freedom.
Testing Several Algorithms in Several Datasets

Post-hoc Tests
After the decision on the null hypothesis: in case of rejection, post-hoc tests are used to
1. compare all pairs of classifiers, or
2. compare all classifiers with a control.
Testing Several Algorithms in Several Datasets

Multiple Hypothesis Testing
Several related hypotheses H1, ..., Hn are tested simultaneously; each individual decision can incur a Type I error (α) or a Type II error (β).

Family-wise error: the probability of rejecting at least one hypothesis assuming that ALL of them are true.
False discovery rate: an alternative error measure for multiple testing.
Testing Several Algorithms in Several Datasets

Designing a Multiple Hypothesis Test: Controlling the Family-Wise Error
If each individual test Hi has a type I error α, then over n independent tests

P(accept H1 ∩ accept H2 ∩ ... ∩ accept Hn) = P(accept H1) × P(accept H2) × ... × P(accept Hn) = (1 - α)^n

and therefore

FWE = 1 - (1 - α)^n ≈ 1 - (1 - αn) = αn
Testing Several Algorithms in Several Datasets

Comparing with a Control
The statistic for comparing algorithms i and j is:

z = (Ri - Rj) / sqrt(k(k + 1) / (6N)) ~ N(0, 1)

Bonferroni-Dunn Test
A one-step method: modify α to account for the number of comparisons, using α / (k - 1).
Testing Several Algorithms in Several Datasets

Comparing with a Control: Methods Based on Ordered p-values
The p-values are ordered p1 ≤ p2 ≤ ... ≤ p(k-1).

Holm Method
A step-down procedure: starting from p1, find the first i = 1, ..., k - 1 such that pi > α/(k - i).
The hypotheses H1, ..., H(i-1) are rejected; the rest of the hypotheses are kept.
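A minimal sketch of the Holm step-down procedure as described above, applied to an arbitrary (illustrative) list of p-values from comparisons with a control.

```python
import numpy as np

def holm(p_values, alpha=0.05):
    """Holm step-down procedure: returns a boolean rejection decision per hypothesis."""
    p = np.asarray(p_values)
    order = np.argsort(p)                 # p_(1) <= p_(2) <= ...
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):    # rank = i - 1 in the slide's notation
        if p[idx] > alpha / (m - rank):   # first i with p_i > alpha / (k - i)
            break
        reject[idx] = True                # reject H_1, ..., H_(i-1)
    return reject

# Illustrative p-values of three comparisons with a control.
print(holm([0.009, 0.031, 0.125], alpha=0.05))   # [ True False False]
```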
Testing Several Algorithms in Several Datasets

Friedman Test: Example (α = 0.05)
Using the average ranks of the example above (3, 2.75, 1.83, 2.41), the pairwise statistics are computed as

z = (Ri - Rj) / sqrt(k(k + 1) / (6N))
Testing Several Algorithms in Several Datasets

Friedman Test: Example (α = 0.05)

        z          p-value    Bonferroni (α/6)    Holm (α/(7 - i))
z12     0.3354     0.259      0.008
z13     2.1569     0.031      0.008               0.012
z14     0.7915     0.125      0.008
z23     1.9843     0.009      0.008               0.010
z24     0.4561     0.221      0.008
z34     -2.7781    0.007      0.008               0.008

With Bonferroni every comparison is tested against α/6 = 0.008, so only the comparison with p = 0.007 is rejected. With Holm the ordered p-values are tested against α/(7 - i): 0.007 ≤ 0.008 and 0.009 ≤ 0.010 are rejected, while 0.031 > 0.0125 stops the procedure and the remaining hypotheses are kept.
Testing Several Algorithms in Several Datasets

Hochberg Method
A step-up procedure: starting with p(k-1), find the first i = k - 1, ..., 1 such that pi < α/(k - i).
The hypotheses H1, ..., Hi are rejected; the rest are kept.

Hommel Method
Find the largest j such that p(n-j+k) > kα/j for all k = 1, ..., j.
Reject all hypotheses Hi such that pi ≤ α/j.
Testing Several Algorithms in Several Datasets

Comments on the Tests
Holm, Hochberg and Hommel are more powerful than Bonferroni.
Hochberg and Hommel are based on Simes' conjecture and can have a FWE higher than α.
In practice Holm obtains results very similar to the others.
Testing Several Algorithms in Several Datasets

All Pairwise Comparisons
Difference with comparing against a control: the pairwise hypotheses are logically related, so not all combinations of true and false hypotheses are possible.
E.g. it cannot happen that C1 is better than C2, C2 is better than C3, and at the same time C1 is equal to C3.
Testing Several Algorithms in Several Datasets

Shaffer Static Procedure
A modification of Holm's procedure.
Starting from p1, find the first i = 1, ..., k(k - 1)/2 such that pi > α/ti.
The hypotheses H1, ..., H(i-1) are rejected; the rest are kept.
ti is the maximum number of hypotheses that can be true given that (i - 1) are false.
It is a static procedure: ti is determined from the set of hypotheses, independently of the p-values.
Testing Several Algorithms in Several Datasets

Shaffer Dynamic Procedure
Similar to the previous procedure, but ti is replaced by ti*.
ti* considers the maximum number of hypotheses that can be true given that the previous (i - 1) hypotheses are false.
It is a dynamic procedure, since ti* depends on the hypotheses already rejected.
It is more powerful than the Shaffer Static Procedure.
Testing Several Algorithms in Several Datasets

Bergmann & Hommel
A more powerful alternative than the Shaffer Dynamic Procedure, but difficult to implement.

Remarks
Adjusted p-values can be reported instead of fixed significance levels.
Conclusions

Two Classifiers in a Dataset
The complexity of the estimation of the scores makes it difficult to carry out good statistical testing.

Two Classifiers in Several Datasets
The Wilcoxon Signed-Ranks Test is a good choice.
With many data sets, and to avoid the commensurability problem, the Sign test can be used.
Conclusions

Several Classifiers in Several Datasets
The Friedman or Iman & Davenport tests are required.
Post-hoc tests more powerful than Bonferroni:
comparison with a control: the Holm method;
all-to-all comparison: the Shaffer Static method.

An Idea for Future Work
To consider the variability of the score in each classifier and data set.
Classifier Performance Evaluation and Comparison

Jose A. Lozano, Guzmán Santafé, Iñaki Inza
Intelligent Systems Group, The University of the Basque Country
International Conference on Machine Learning and Applications (ICMLA 2010), December 12-14, 2010

Thank You!