
Lec09 Classifier Evaluation

The document discusses classifier performance evaluation and comparison in machine learning, outlining key concepts such as score functions, estimation methods, and hypothesis testing. It emphasizes the importance of honest evaluation to determine the best classification paradigm and parameter configurations. Various scoring metrics, including accuracy, recall, and precision, are introduced to assess classifier quality.


Machine Learning (CS60050)
Classifier / Hypothesis Evaluation

Classifier Performance Evaluation and Comparison

Jose A. Lozano, Guzmán Santafé, Iñaki Inza
Intelligent Systems Group, The University of the Basque Country
International Conference on Machine Learning and Applications (ICMLA 2010), December 12-14, 2010
Outline of the Tutorial

1. Introduction
2. Scores
3. Estimation Methods
4. Hypothesis Testing
1. Introduction
Classification Problem

A physical process (usually unknown) generates data; an expert labels the samples, producing a data set.
Supervised Classification

Learning from experience: "automate the work of the expert". A classification model is learned from the data set and tries to model ρ(X, C), the joint distribution behind the (usually unknown) physical process.

The classification model (classifier) then labels new data whose class value is unknown.
Motivation for Honest Evaluation

Many classification paradigms can be learned from the same data set: Naive Bayes, neural networks, decision trees, ...
Which is the best paradigm for a given classification problem?
Motivation for Honest Evaluation

Within a single paradigm (e.g. Naive Bayes) there are also many possible parameter configurations.
Which is the best parameter configuration for a given classification problem?
Motivation for Honest Evaluation

Honest evaluation requires:
knowing the goodness of a classifier,
a methodology to compare classifiers,
assessing the validity of the evaluation/comparison.

Steps for honest evaluation:
Scores: quality measures.
Estimation methods: estimate the value of a score.
Statistical tests: comparison among different solutions.
2. Scores
Motivation

How do we compare classification models? We need some way to measure classification performance.

Score
A function that provides a quality measure for a classifier when solving a classification problem.
Motivation

What does "best quality" mean? It depends on what we are interested in and what we want to optimize, on the characteristics of the problem and on the characteristics of the data set. This leads to different kinds of scores.
Scores

Based on the confusion matrix:
Accuracy / classification error → classification
Recall, specificity, precision, F-score → information retrieval

Based on Receiver Operating Characteristics (ROC):
Area under the ROC curve (AUC) → medical domains
Confusion Matrix

Two-Class Problem

                  Prediction
                  c+    c-    Total
Actual   c+       TP    FN    N+
         c-       FP    TN    N-
         Total    N̂+    N̂-    N
Confusion Matrix

Several-Class Problem

                  Prediction
                  c1     c2     c3     ...    cn     Total
Actual   c1       TP1    FN12   FN13   ...    FN1n   N1
         c2       FN21   TP2    FN23   ...    FN2n   N2
         c3       FN31   FN32   TP3    ...    FN3n   N3
         ...      ...    ...    ...    ...    ...    ...
         cn       FNn1   FNn2   FNn3   ...    TPn    Nn
         Total    N̂1     N̂2     N̂3     ...    N̂n     N
Two-Class Problem - Example

[Scatter plot of a two-class data set; each sample has features X1, X2 and class c+ or c-, e.g. (3.1, 2.4, c+), (1.7, 1.8, c-), (2.6, 1.7, c+), (1.8, 2.9, c-), ...]
Two-Class Problem - Example

                  Prediction
                  c+    c-    Total
Actual   c+       10    2     12
         c-       2     8     10
         Total    12    10    22
Accuracy / Classification Error

Definition
Fraction of data samples classified correctly / incorrectly (here ψ denotes the classifier):

ε(ψ) = p(ψ(X) ≠ C) = E_ρ(x,c)[1 - δ(c, ψ(x))]
Accuracy / Classification Error - Example

For the confusion matrix above:

ε = (FP + FN) / N = (2 + 2) / 22 ≈ 0.182
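As a quick sanity check of these definitions, the counts and the error above can be reproduced with scikit-learn. This is a minimal sketch; the label vectors are hypothetical stand-ins that reproduce the 22-sample confusion matrix of the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels reproducing the example: 10 TP, 2 FN, 2 FP, 8 TN
# (c+ encoded as 1, c- as 0).
y_true = np.array([1]*12 + [0]*10)
y_pred = np.array([1]*10 + [0]*2 + [1]*2 + [0]*8)

# Rows = actual class, columns = predicted class, ordered (c+, c-).
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
acc = accuracy_score(y_true, y_pred)
print(cm)        # [[10  2], [ 2  8]]
print(1 - acc)   # classification error ~ 0.182
```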
Skew Data

[Scatter plot of a highly imbalanced two-class data set: a few c+ samples, e.g. (0.8, 2.2), (0.47, 2.3), (2.4, 2.9), among many c- samples.]
Skew Data - Classification Error

                  Prediction
                  c+    c-    Total
Actual   c+       0     5     5
         c-       7     993   1000
         Total    7     998   1005

ε = (7 + 5) / 1005 ≈ 0.012, a very low error!
Skew Data - Classification Error

                  Prediction
                  c+    c-    Total
Actual   c+       0     5     5
         c-       0     1000  1000
         Total    0     1005  1005

ε = (0 + 5) / 1005 ≈ 0.005. Better? A classifier that always predicts c- obtains an even lower error.
Positive Unlabeled Learning

Only a few positive samples are labeled; the remaining samples are unlabeled and may be either positive or negative.

In this setting the classification error is useless.
Recall

Definition
Fraction of positive class samples correctly classified.
Other names: true positive rate, sensitivity.

r(ψ) = TP / (TP + FN) = TP / P

Definition based on probabilities: r(ψ) = p(ψ(x) = c+ | C = c+)
Skew Data - Recall

For the skewed example above:

r(ψ) = 0 / (0 + 5) = 0, a very bad recall!
Positive Unlabeled Learning - Recall

Recall only involves the labeled positive samples, so it can still be calculated in positive-unlabeled problems: count how many of the labeled positives are predicted as c+.
Precision

Definition
Fraction of data samples classified as c+ which are actually c+.

pr(ψ) = TP / (TP + FP) = TP / N̂+

Definition based on probabilities: pr(ψ) = p(C = c+ | ψ(x) = c+) = E_ρ(x | ψ(x)=c+)[δ(C, c+)]
Skew Data - Precision

For the skewed example above:

pr(ψ) = 0 / (0 + 7) = 0, a very bad precision!
Positive Unlabeled Learning - Precision

Precision is not a good score for positive-unlabeled data: since not all the positive samples are labeled, samples predicted as c+ cannot be checked reliably against the labels.
Precision & Recall - Application Domains

Spam filtering: decide whether an email is spam or not.
Precision: proportion of real spam in the spam-box.
Recall: proportion of all spam messages identified by the system.

Sentiment analysis: classify opinions about specific products given by users in blogs, websites, forums, etc.
Precision: proportion of opinions classified as positive that are actually positive.
Recall: proportion of positive opinions identified as positive.
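A short sketch of how recall and precision expose the pathology of the skewed example (the degenerate classifier that never predicts c+); the arrays are hypothetical reconstructions of the 1005-sample counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical skewed example: 5 positives, 1000 negatives, and a
# degenerate classifier that always predicts the negative class.
y_true = np.array([1]*5 + [0]*1000)
y_pred = np.zeros_like(y_true)

print(1 - accuracy_score(y_true, y_pred))                # error ~ 0.005, looks great
print(recall_score(y_true, y_pred, zero_division=0))     # recall = 0.0
print(precision_score(y_true, y_pred, zero_division=0))  # precision = 0.0 (no positive predictions)
```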
Specificity

Definition
Fraction of negative class samples correctly identified.
Specificity = 1 - FalsePositiveRate

sp(ψ) = TN / (TN + FP) = TN / N-

Definition based on probabilities: sp(ψ) = p(ψ(x) = c- | C = c-)
Skew Data - Specificity

For the skewed example above:

sp(ψ) = 993 / (993 + 7) ≈ 0.99
Skew Data - Specificity

For the degenerate classifier that always predicts c-:

sp(ψ) = 1000 / (1000 + 0) = 1.00
Balanced Scores

Balanced accuracy rate:
Bal. acc = (1/2)(TP/P + TN/N) = (recall + specificity) / 2

Balanced error rate:
Bal. ε = (1/2)(FP/N + FN/P)

For the skewed example above:
Bal. acc = (1/2)(0/5 + 993/1000) ≈ 0.5
Bal. ε = (1/2)(7/1000 + 5/5) ≈ 0.5
Balanced Scores

F-Score = (β² + 1) · Precision · Recall / (β² · Precision + Recall)
F1-Score = 2 · Precision · Recall / (Precision + Recall), the harmonic mean of precision and recall.

The harmonic mean is maximized when its components are balanced, whereas the balanced accuracy is the arithmetic mean of TPR and TNR.

[Plot comparing the harmonic mean of TPR and TNR with the balanced accuracy.]
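The following sketch computes the balanced scores for the same hypothetical skewed example; balanced_accuracy_score, f1_score and fbeta_score are scikit-learn's implementations of the formulas above.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, fbeta_score

# Same hypothetical skewed example: 5 positives, 1000 negatives; the classifier
# predicts c+ for none of the positives and for 7 of the negatives.
y_true = np.array([1]*5 + [0]*1000)
y_pred = np.array([0]*5 + [1]*7 + [0]*993)

print(balanced_accuracy_score(y_true, y_pred))                # ~ 0.4965, close to 0.5
print(f1_score(y_true, y_pred, zero_division=0))              # 0.0: recall and precision are 0
print(fbeta_score(y_true, y_pred, beta=2, zero_division=0))   # F-beta score with beta = 2
```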
Classification Cost

Not all misclassifications can be considered equal.

E.g. in a medical diagnosis problem, diagnosing a healthy patient as ill does not have the same cost as diagnosing an ill patient as healthy.

For the classification model it may therefore be of interest to minimize the expected cost instead of the classification error.
Dealing with Classification Cost

Loss Function
Associates an economic/utility cost to each classification outcome.
The typical loss function in classification is the 0/1 loss:

                  Prediction
                  c+    c-
Actual   c+       0     1
         c-       1     0

More generally, a cost matrix specifies the cost associated with each outcome:

                  Prediction
                  c+        c-
Actual   c+       Cost_TP   Cost_FN
         c-       Cost_FP   Cost_TN

It is usually not easy to assign such costs.
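A minimal sketch of how a cost matrix can be combined with the confusion matrix to obtain an expected cost per sample; the cost values and label vectors below are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical skewed example again (1 = c+, 0 = c-).
y_true = np.array([1]*5 + [0]*1000)
y_pred = np.array([0]*5 + [1]*7 + [0]*993)

# Cost matrix indexed as [actual, predicted]; missing an ill patient (FN)
# is assumed to be 10 times as costly as a false alarm (FP).
cost = np.array([[0.0, 10.0],    # actual c+: Cost_TP, Cost_FN
                 [1.0,  0.0]])   # actual c-: Cost_FP, Cost_TN

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])   # rows/cols ordered c+, c-
expected_cost = (cm * cost).sum() / cm.sum()
print(expected_cost)   # (5*10 + 7*1) / 1005 ~ 0.0567
```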
Receiver Operating Characteristics (ROC)

ROC Space
A coordinate system used for visualizing classifier performance, where the TPR is plotted on the Y axis and the FPR on the X axis.

[ROC-space plot with one point per classifier: 1: kNN, 2: neural network, 3: Naive Bayes, 4: SVM, 5: linear regression, 6: decision tree.]
Receiver Operating Characteristics (ROC)

ROC Curve
For a probabilistic/fuzzy classifier, the ROC curve is a plot of the TPR vs. the FPR as its discrimination threshold is varied.

[Example: a ranked list of posterior probabilities p(c+|x) (0.99, 0.90, 0.85, ..., 0.05), the predicted label at thresholds T = 0.2, 0.5, 0.8, the true class C, and the resulting ROC curve.]
Receiver Operating Characteristics (ROC)

ROC Curve
For a crisp classifier, a ROC curve can be obtained by interpolation from its single (FPR, TPR) point.
Receiver Operating Characteristics (ROC)

ROC Curve
Insensitive to skewed class distributions.
Insensitive to misclassification costs.

Dominance Relationship
A ROC curve A dominates another ROC curve B if A is always above and to the left of B in the plot.
Receiver Operating Characteristics (ROC)

Dominance
A dominates B throughout the whole range of the threshold T: A has a better predictive performance under any condition of cost and class distribution.

[ROC plot of a curve A dominating a curve B.]
Receiver Operating Characteristics (ROC)

No-Dominance
The dominance relationship may not be so clear: when the curves cross, no model is the best under all possible scenarios.

[ROC plot of two crossing curves A and B.]
Receiver Operating Characteristics (ROC)

Area Under the ROC Curve (AUC)
Equivalent to the Wilcoxon rank-sum (Mann-Whitney) statistic.
If A dominates B, then AUC(A) ≥ AUC(B).
If A does not dominate B, the AUC alone cannot identify the best classifier.
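A sketch of how a ROC curve and the AUC are obtained from classifier scores with scikit-learn; the posterior probabilities and labels below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical posterior probabilities p(c+|x) and true labels (1 = c+, 0 = c-).
y_true  = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.99, 0.90, 0.85, 0.80, 0.78, 0.60, 0.45, 0.40, 0.20, 0.05])

# fpr/tpr are computed at every distinct threshold of y_score.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(list(zip(fpr, tpr)))
print(auc)
```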
Generalization to Multi-Class Problems

Most of the presented scores are defined for binary classification, but a generalization to the multi-class case is possible, e.g. with a one-vs-all approach: for each class ci the multi-class confusion matrix is collapsed into a binary one (ci vs. all the other classes), from which TP, TN, FP and FN, and hence score_i, are obtained.

                  Prediction
                  c1     c2     c3     ...    cn     Total
Actual   c1       TP1    FN12   FN13   ...    FN1n   P1
         c2       FN21   TP2    FN23   ...    FN2n   P2
         c3       FN31   FN32   TP3    ...    FN3n   P3
         ...      ...    ...    ...    ...    ...    ...
         cn       FNn1   FNn2   FNn3   ...    TPn    Pn
         Total    P̂1     P̂2     P̂3     ...    P̂n     N

The per-class scores are then combined, weighting each one by its class probability:

score_TOT = Σ_{i=1}^{n} score_i · p(ci)
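A sketch of the one-vs-all weighting described above, using per-class recall as the score; the 3-class label vectors are hypothetical and the class priors are estimated from them.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical 3-class example (classes 0, 1, 2).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

classes, counts = np.unique(y_true, return_counts=True)
priors = counts / counts.sum()

# One-vs-all: recall of each class against all the others.
per_class = recall_score(y_true, y_pred, labels=classes, average=None)
score_tot = np.sum(per_class * priors)

print(per_class)   # score_i for each class
print(score_tot)   # sum_i score_i * p(c_i)
# For recall, this prior-weighted combination coincides with
# recall_score(y_true, y_pred, average='weighted').
```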
Scores

The choice of a specific score depends on:
the application domain,
the characteristics of the problem,
the characteristics of the data set,
our interest when solving the problem, etc.
3. Estimation Methods
Introduction

Estimation
Select a score to measure the quality of the classification model.
We would like the true value of the score under the physical process, but only limited information is available: a finite data set.
The quality measures computed from it (error, recall, precision, ...) are therefore random variables.
Introduction

True Value - εN
The expected value of the score for a classifier built from N data samples drawn from ρ(C, X).
Since ρ(C, X) is unknown, we can only obtain a point estimation of the score, ε̂.
Introduction

Bias: the difference between the estimation of the score and its true value, E_ρ(ε̂ - εN).
Variance: the deviation of the estimated value from its expected value, var(ε̂ - εN).
Both depend on the estimation method, so a trade-off between bias and variance is needed.
Introduction

A finite data set is used to estimate the score; several estimation methods exist depending on how this data set is dealt with.
Resubstitution

The same data set is used both for training the classifier and for testing it.

Classification Error Estimation
The simplest estimation method.
A biased estimator of εN with smaller variance, but too optimistic (overfitting problem).
A bad estimator of the true classification error.
Hold-Out

The data set is split into two disjoint parts: a training set of size N1, used to learn the classifier, and a test set of size N2, used to estimate the score.

Classification Error Estimation
An unbiased estimator of εN1, but a biased estimator of εN.
Large bias: a pessimistic estimation of the true classification error.
The bias is related to N1 and N2.
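A sketch of a hold-out estimation with scikit-learn; the classifier and the data set are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold-out: 2/3 for training (N1), 1/3 for testing (N2), stratified split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = GaussianNB().fit(X_tr, y_tr)
error_hat = 1 - accuracy_score(y_te, clf.predict(X_te))
print(error_hat)   # hold-out estimate of the classification error
```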
k-Fold Cross-Validation

The data set is split into k folds. In turn, each fold is used as the test set while the remaining k - 1 folds are used for training; the k score estimates are averaged.

Classification Error Estimation
An unbiased estimator of ε_{N - N/k}, but a biased estimation of εN; smaller bias than Hold-Out.

Leaving-One-Out
The special case of k-fold Cross-Validation with k = N.
A quasi-unbiased estimation for N samples: improves the bias with respect to CV.
Increases the variance (more unstable) and has a higher computational cost.
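A sketch of a k-fold cross-validation estimate with scikit-learn; leave-one-out is obtained by letting the number of folds equal N. Data set and classifier are again arbitrary illustrations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
clf = GaussianNB()

# 10-fold (stratified) cross-validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
print(1 - acc.mean())          # CV estimate of the classification error

# Leave-one-out (k = N): smaller bias, higher variance and cost.
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring='accuracy')
print(1 - loo_acc.mean())
```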
Bootstrap

Several bootstrap data sets are generated by sampling N instances with replacement from the original data set; a classifier is learned and evaluated on each of them.

Classification Error Estimation
A biased estimation of the classification error.
The variance is improved because of the resampling.
Part of the data used for learning is also used for testing ("similar to resubstitution"), so it suffers from the overfitting problem.
Leaving-One-Out Bootstrap

Mimics Cross-Validation: each classifier ψi is tested on D \ D*i, the samples not drawn into its bootstrap data set.
Tries to avoid the overfitting problem: the expected number of distinct samples in a bootstrap data set is ≈ 0.632 N, so it is similar to a repeated Hold-Out.
Biased upwards: it tends to be a pessimistic estimation.
Improving the Estimation - Bias

Bias correction terms can be used for error estimation.

Hold-Out / Cross-Validation: several proposals improve the bias of the estimation; surprisingly, they are not widely used.

Bootstrap: well-established methods improve the bias of the estimation.
Improving the Estimation - Bias

Corrected Hold-Out (ε̂⁺_ho) - (Burman, 1989)

ε̂⁺_ho = ε̂_ho + ε̂_res - ε̂_ho-N

where ε̂_ho is the standard Hold-Out estimator, ε̂_res is the resubstitution error, and ε̂_ho-N is the error of the classifier learned on the Hold-Out training set but tested on the whole data set D.

Improvement (N1 = training size, N2 = test size):
Bias(ε̂_ho) ≈ Const0 · N2 / (N1 · N)
Bias(ε̂⁺_ho) ≈ Const1 · N2 / (N1 · N²)

Corrected Cross-Validation (ε̂⁺_cv) - (Burman, 1989)

ε̂⁺_cv = ε̂_cv + ε̂_res - ε̂_cv-N

Improvement:
Bias(ε̂_cv) ≈ Const0 / ((k - 1) · N)
Bias(ε̂⁺_cv) ≈ Const1 / ((k - 1)² · N)

0.632 Bootstrap (ε̂^0.632_boot)

ε̂^0.632_boot = 0.368 · ε̂_res + 0.632 · ε̂_loo-boot

Tries to balance optimism (resubstitution) and pessimism (loo-bootstrap).
Works well with "light-fitting" classifiers; with overfitting classifiers ε̂^0.632_boot is still too optimistic.

0.632+ Bootstrap (ε̂^0.632+_boot) - (Efron & Tibshirani, 1997)

Corrects the bias when there is a great amount of overfitting.
Based on the no-information error rate γ̂, the error obtained when predictions and class labels are paired at random:

γ̂ = Σ_{i=1}^{N} Σ_{j=1}^{N} [1 - δ(ci, ψ(xj))] / N²

The relative overfitting rate is used to correct the bias:

R̂ = (ε̂_loo-boot - ε̂_res) / (γ̂ - ε̂_res)

ε̂^0.632+_boot = (1 - ŵ) · ε̂_res + ŵ · ε̂_loo-boot,   with   ŵ = 0.632 / (1 - 0.368 · R̂)

Improving the Estimation - Variance

Stratification
Keeps the proportion of each class in the training/test data.
Hold-Out and Cross-Validation: stratified splitting; Bootstrap: stratified sampling.
It may improve the variance of the estimation.

Repeated Methods
Applicable to Hold-Out and Cross-Validation (Bootstrap already includes resampling).
Repeated Hold-Out / Cross-Validation: repeat the estimation process t times and average the results.
Same bias as the standard estimation methods, but reduced variance with respect to them.

Estimation Methods

Which estimation method is better? It may depend on many aspects: the size of the data set, the classification paradigm used, the stability of the learning algorithm, the characteristics of the classification problem, the bias / variance / computational cost trade-off, ...

Large data sets: Hold-Out may be a good choice; it is computationally cheap, and although its bias is larger, the bias depends on the data set size.

Smaller data sets: repeated Cross-Validation or the 0.632 Bootstrap.

Small data sets: Bootstrap and repeated Cross-Validation may not be informative. Permutation tests (Ojala & Garriga, 2010) can be used to assess the validity of the estimation, and confidence intervals (Isaksson et al., 2008) may provide more reliable information about it.
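A minimal sketch of the leave-one-out bootstrap and the 0.632 estimator described above; the data set, the classifier and the number of replicates B are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
N, B = len(y), 100

# Resubstitution error: train and test on the whole data set.
clf = GaussianNB().fit(X, y)
err_res = np.mean(clf.predict(X) != y)

# Leave-one-out bootstrap: test each replicate on the samples left out of it.
errs = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)          # sample N indices with replacement
    out = np.setdiff1d(np.arange(N), idx)     # samples not drawn (~ 0.368 N of them)
    clf = GaussianNB().fit(X[idx], y[idx])
    errs.append(np.mean(clf.predict(X[out]) != y[out]))
err_loo_boot = np.mean(errs)

# 0.632 bootstrap estimator.
err_632 = 0.368 * err_res + 0.632 * err_loo_boot
print(err_res, err_loo_boot, err_632)
```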
4. Hypothesis Testing
Motivation

Basic Concepts
Hypothesis tests form the basis of scientific reasoning in experimental sciences; they are used to establish scientific statements.
A hypothesis H0, called the null hypothesis, is tested against another hypothesis H1, called the alternative.
The two hypotheses are not at the same level: rejecting H0 does not mean accepting H1.
The objective is to know whether the differences observed under H0 are due to randomness or not.
Hypothesis Testing

Possible Outcomes of a Test
Given a sample, a decision is taken about the null hypothesis (H0); the decision is taken under uncertainty.

                      H0 TRUE             H0 FALSE
Decision: ACCEPT      correct             Type II error (β)
Decision: REJECT      Type I error (α)    correct
Hypothesis Testing: An Example

A Simple Hypothesis Test
A natural process follows a Gaussian distribution N(µ, σ²).
We have a sample {x1, ..., xn} of this process and a decision must be taken about the following hypotheses:

H0: µ = 60
H1: µ = 50

A statistic (a function of the sample) is used to take the decision; in our example, the sample mean X̄ = (1/n) Σ_{i=1}^{n} xi.
Hypothesis Testing: An Example

Accept and Reject Regions
The possible values of the statistic are divided into an accept region and a reject region:

A.R. = {(x1, ..., xn) | X̄ > 55}
R.R. = {(x1, ..., xn) | X̄ ≤ 55}

Assuming a probability distribution for the statistic X̄ (it depends on the distribution of {x1, ..., xn}), the probability of each error type can be calculated:

α = P_H0(X̄ ∈ R.R.) = P_H0(X̄ ≤ 55)
β = P_H1(X̄ ∈ A.R.) = P_H1(X̄ > 55)
Hypothesis Testing: An Example

Accept and Reject Regions
The A.R. and R.R. can be modified in order to obtain a particular value of α:

0.10 = α = P_H0(X̄ ∈ R.R.) = P_H0(X̄ ≤ 51)
0.05 = α = P_H0(X̄ ∈ R.R.) = P_H0(X̄ ≤ 50.3)

p-value: given a sample and the specific value x of the test statistic for that sample,

p-value = P_H0(X̄ ≤ x)
Hypothesis Testing: Remarks

Power: (1 - β)
Depending on the hypotheses, the type II error (β) cannot be calculated:

H0: µ = 60
H1: µ ≠ 60

In this case we do not know the value of µ under H1, so we cannot calculate the power (1 - β).
A good hypothesis test maximizes the power (1 - β) for a given α.
Parametric tests vs. non-parametric tests.
Hypothesis Testing in Supervised Classification

Scenarios
Two classifiers (algorithms) vs. more than two.
One data set vs. more than one data set.
The score used.
Score estimation method known vs. unknown.
Whether the classifiers are trained and tested on the same data sets.
...
Testing Two Algorithms in a Dataset

The General Approach

H0: classifier ψ has the same score value as classifier ψ' in p(x, c)
H1: they have different values

or, at the algorithm level:

H0: algorithm A has the same average score value as algorithm A' in p(x, c)
H1: they have different values
Testing Two Algorithms in a Dataset

An Ideal Context: We Can Sample p(x, c)
1. Sample 2n i.i.d. data sets from p(x, c).
2. Learn 2n classifiers ψ1_i, ψ2_i for i = 1, ..., n.
3. For each classifier obtain enough i.i.d. samples {(x1, c1), ..., (xN, cN)} from p(x, c) as a test set.
4. For each data set calculate the error of each algorithm on the test set:

ε1_i = (1/N) Σ_{j=1}^{N} error1_i(xj)        ε2_i = (1/N) Σ_{j=1}^{N} error2_i(xj)

5. Calculate the average values over the n training data sets:

ε1 = (1/n) Σ_{i=1}^{n} ε1_i        ε2 = (1/n) Σ_{i=1}^{n} ε2_i
Testing Two Algorithms in a Dataset

An Ideal Context: We Can Sample p(x, c)
Our test rejects the null hypothesis if |ε1 - ε2| (the statistic) is big.
Fortunately, by the central limit theorem:

εi ~ N(score(ψi), s²_i),   i = 1, 2

Therefore, under the null hypothesis:

Ẑ = (ε1 - ε2) / sqrt((s²_1 + s²_2) / n) ~ N(0, 1)

... and finally we reject H0 when |Ẑ| > z_{1-α/2}.
Testing Two Algorithms in a Dataset

Properties of Our Ideal Framework
The training data sets are independent.
The testing data sets are independent.

The Sad Reality
We cannot get i.i.d. training samples from p(x, c).
We cannot get i.i.d. testing samples from p(x, c).
We have only one sample from p(x, c).
Testing Two Algorithms in a Dataset

McNemar Test (non-parametric)
Compares two classifiers on a data set after a Hold-Out process; it is a paired non-parametric test.

                  ψ2 ok     ψ2 error
ψ1 ok             n00       n01
ψ1 error          n10       n11

Under H0 we have n01 ≈ n10, and the statistic

(|n01 - n10| - 1)² / (n01 + n10)

follows a χ² distribution with 1 degree of freedom.
When n01 + n10 is small (< 25), the binomial distribution can be used instead.
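A sketch of the McNemar statistic exactly as defined above, given the hold-out predictions of two classifiers; the label and prediction vectors are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred1, pred2):
    """Paired McNemar test from two prediction vectors on the same test set."""
    ok1 = (pred1 == y_true)
    ok2 = (pred2 == y_true)
    n01 = np.sum(ok1 & ~ok2)   # classifier 1 right, classifier 2 wrong
    n10 = np.sum(~ok1 & ok2)   # classifier 1 wrong, classifier 2 right
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    p_value = chi2.sf(stat, df=1)   # chi-squared with 1 degree of freedom
    return stat, p_value

# Hypothetical test-set labels and predictions of two classifiers.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
pred1 = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)   # ~85% accurate
pred2 = np.where(rng.random(200) < 0.75, y_true, 1 - y_true)   # ~75% accurate

print(mcnemar_test(y_true, pred1, pred2))
```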
Testing Two Algorithms in a Dataset

Tests Based on Resampling: Resampled t-test (parametric)
The data set is randomly divided n times into training and test sets.
Let p̂i be the difference between the performances of both algorithms in run i, and p̄ their average. When the p̂i are assumed Gaussian and independent, under the null hypothesis

t = p̄ · sqrt(n) / sqrt(Σ_{i=1}^{n} (p̂i - p̄)² / (n - 1))

follows a Student's t distribution with n - 1 degrees of freedom.

Caution:
the p̂i are not Gaussian, since p̂1_i and p̂2_i are not;
the p̂i are not independent (overlap in training and testing).
Testing Two Algorithms in a Dataset

Resampled t-test Improved (Nadeau & Bengio, 2003)
The variance estimate of the resampled t-test is too optimistic. Two alternatives:
Corrected resampled t-test: the variance σ² is scaled by (1/n + n2/n1) instead of 1/n, where n1 and n2 are the training and test sizes.
Conservative Z: an overestimation of the variance.
-
Classifier performance evaluation and
comparison Hypothesis Testing

Testing Two Algorithms in a Dataset

t-test for k-fold Cross-validation


It is similar to t -test for resampling
In this case the testing datasets are
independent The training datasets are still
dependent

- 124
-
Testing Two Algorithms in a Dataset

5x2-fold Cross-Validation (Dietterich 1998, Alpaydin 1999)
Each Cross-Validation process has independent training and testing data sets.
The statistic

f = Σ_{i=1}^{5} Σ_{j=1}^{2} (p_i^(j))² / (2 Σ_{i=1}^{5} s²_i)

follows an F distribution with 10 and 5 degrees of freedom under the null hypothesis.
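A sketch of the 5x2-cv F statistic above for two scikit-learn classifiers; the data set and the classifiers are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = GaussianNB(), DecisionTreeClassifier(random_state=0)

p = np.zeros((5, 2))                 # performance differences p_i^(j)
for i in range(5):                   # 5 replications of 2-fold CV
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
    for j, (tr, te) in enumerate(cv.split(X, y)):
        acc_a = clf_a.fit(X[tr], y[tr]).score(X[te], y[te])
        acc_b = clf_b.fit(X[tr], y[tr]).score(X[te], y[te])
        p[i, j] = acc_a - acc_b

s2 = np.sum((p - p.mean(axis=1, keepdims=True)) ** 2, axis=1)  # per-replication variance
f_stat = np.sum(p ** 2) / (2 * np.sum(s2))
p_value = f_dist.sf(f_stat, 10, 5)
print(f_stat, p_value)
```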
Testing Two Algorithms in Several Datasets

Initial Approaches
Averaging over data sets.
Paired t-test: with di = c1_i - c2_i the per-data-set score differences and d̄ = (1/N) Σ_{i=1}^{N} di their average, the statistic d̄ / σ_d̄ follows a t distribution with N - 1 degrees of freedom.

Problems
Commensurability of the scores across data sets.
Susceptibility to outliers (t-test).
Gaussian assumption.
Testing Two Algorithms in Several Datasets

Wilcoxon Signed-Ranks Test
A non-parametric test that works as follows:
1. Rank the absolute values of the performance differences between both algorithms.
2. Calculate the sums of ranks R+ and R- where the first (resp. the second) algorithm outperforms the other.
3. Calculate T = min(R+, R-).
For N ≤ 25 there are tables with critical values; for N > 25,

z = (T - N(N + 1)/4) / sqrt(N(N + 1)(2N + 1)/24) ~ N(0, 1)
Wilcoxon Signed-Ranks Test: Example

              Alg. 1   Alg. 2   diff     rank
Dataset1      0.763    0.598    -0.165   10
Dataset2      0.599    0.591    -0.008   6
Dataset3      0.954    0.971    +0.017   7
Dataset4      0.628    0.661    +0.033   8
Dataset5      0.882    0.888    +0.006   4
Dataset6      0.936    0.931    -0.005   3
Dataset7      0.661    0.668    +0.007   5
Dataset8      0.583    0.583    0.000    1.5
Dataset9      0.775    0.838    +0.063   9
Dataset10     1.000    1.000    0.000    1.5

(diff = Alg. 2 - Alg. 1; tied zero differences share their ranks between R+ and R-.)

R+ = 7 + 8 + 4 + 5 + 9 + (1/2)(1.5 + 1.5) = 34.5
R- = 10 + 6 + 3 + (1/2)(1.5 + 1.5) = 20.5
T = min(R+, R-) = 20.5
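The same test is available in SciPy; a sketch reproducing the example. The zero_method='zsplit' option splits the ranks of zero differences between R+ and R-, as done above.

```python
import numpy as np
from scipy.stats import wilcoxon

# Scores of the two algorithms on the 10 data sets of the example.
alg1 = np.array([0.763, 0.599, 0.954, 0.628, 0.882, 0.936, 0.661, 0.583, 0.775, 1.000])
alg2 = np.array([0.598, 0.591, 0.971, 0.661, 0.888, 0.931, 0.668, 0.583, 0.838, 1.000])

# Paired, two-sided Wilcoxon signed-ranks test.
stat, p_value = wilcoxon(alg1, alg2, zero_method='zsplit')
print(stat, p_value)   # statistic ~ min(R+, R-) = 20.5
```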
Testing Two Algorithms in Several Datasets

Wilcoxon Signed-Ranks Test
It also suffers from the commensurability problem, but only qualitatively.
When the assumptions of the t-test are met, Wilcoxon is less powerful than the t-test.
Testing Two Algorithms in Several Datasets

Sign Test
A non-parametric test that counts the number of wins, ties and losses.
Under the null hypothesis the number of wins follows a binomial distribution B(N, 1/2).
For large values of N, the number of wins follows N(N/2, sqrt(N)/2) under the null hypothesis.
This test makes no assumptions, but it is weaker than Wilcoxon.
Testing Several Algorithms in Several Datasets

Example data (Demšar, 2006):

           Alg. 1   Alg. 2   Alg. 3   Alg. 4
D1         0.84     0.79     0.89     0.43
D2         0.57     0.78     0.78     0.93
D3         0.62     0.87     0.88     0.71
D4         0.95     0.55     0.49     0.72
D5         0.84     0.67     0.89     0.89
D6         0.51     0.63     0.98     0.55
Testing Several Algorithms in Several Datasets

Multiple Hypothesis Testing
Testing all possible pairs of hypotheses µi = µj for all i, j: multiple hypothesis testing.
Testing the single hypothesis µ1 = µ2 = ... = µk.

ANOVA vs. Friedman
Repeated-measures ANOVA: assumes Gaussianity and sphericity.
Friedman: a non-parametric test.
Testing Several Algorithms in Several Datasets

Friedman Test
1. Rank the algorithms on each data set separately (1 = best); in case of ties, assign average ranks.
2. Calculate the average rank Rj of each algorithm j.
3. The statistic

χ²_F = (12N / (k(k + 1))) · [Σ_j R²_j - k(k + 1)² / 4]

follows a χ² distribution with k - 1 degrees of freedom (for N > 10, k > 5).
Testing Several Algorithms in Several Datasets

Friedman Test: Example (ranks in parentheses)

           Alg. 1     Alg. 2      Alg. 3      Alg. 4
D1         0.84 (2)   0.79 (3)    0.89 (1)    0.43 (4)
D2         0.57 (4)   0.78 (2.5)  0.78 (2.5)  0.93 (1)
D3         0.62 (4)   0.87 (2)    0.88 (1)    0.71 (3)
D4         0.95 (1)   0.55 (3)    0.49 (4)    0.72 (2)
D5         0.84 (3)   0.67 (4)    0.89 (1.5)  0.89 (1.5)
D6         0.51 (4)   0.63 (2)    0.98 (1)    0.55 (3)

avg. rank  3          2.75        1.83        2.41

χ²_F = (12N / (k(k + 1))) · [Σ_j R²_j - k(k + 1)² / 4] ≈ 2.59
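SciPy implements the Friedman test directly; a sketch on the example scores above. Note that SciPy adds a correction for tied ranks, so its value is slightly larger than the plain formula (2.75 with unrounded average ranks; the slide's 2.59 uses rounded ranks).

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Scores of 4 algorithms (columns) on 6 data sets (rows), from the example.
scores = np.array([
    [0.84, 0.79, 0.89, 0.43],
    [0.57, 0.78, 0.78, 0.93],
    [0.62, 0.87, 0.88, 0.71],
    [0.95, 0.55, 0.49, 0.72],
    [0.84, 0.67, 0.89, 0.89],
    [0.51, 0.63, 0.98, 0.55],
])

# friedmanchisquare takes one sample per algorithm (each of length N data sets).
stat, p_value = friedmanchisquare(*scores.T)
print(stat, p_value)   # ~ 2.84 with SciPy's tie correction
```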
Testing Several Algorithms in Several Datasets

Iman & Davenport, 1980
An improvement of the Friedman test:

F_F = (N - 1) · χ²_F / (N(k - 1) - χ²_F)

follows an F distribution with k - 1 and (k - 1)(N - 1) degrees of freedom.
Testing Several Algorithms in Several Datasets

Post-hoc Tests
After the decision on the null hypothesis: in case of rejection, post-hoc tests are used to
1. compare all pairs of classifiers, or
2. compare all classifiers with a control.
Testing Several Algorithms in Several Datasets

Multiple Hypothesis Testing
Several related hypotheses H1, ..., Hn are tested simultaneously; each individual decision can incur a Type I error (α) or a Type II error (β).

Family-wise error: the probability of rejecting at least one hypothesis assuming that ALL of them are true.
False discovery rate: an alternative error measure for multiple testing.
Testing Several Algorithms in Several Datasets

Designing a Multiple Hypothesis Test: Controlling the Family-Wise Error
If each individual test Hi has a type I error α, then over n independent tests

P(accept H1 ∩ accept H2 ∩ ... ∩ accept Hn) = P(accept H1) × P(accept H2) × ... × P(accept Hn) = (1 - α)^n

and therefore

FWE = 1 - (1 - α)^n ≈ 1 - (1 - αn) = αn
Testing Several Algorithms in Several Datasets

Comparing with a Control
The statistic for comparing algorithms i and j is:

z = (Ri - Rj) / sqrt(k(k + 1) / (6N)) ~ N(0, 1)

Bonferroni-Dunn Test
A one-step method: modify α to account for the number of comparisons, using α / (k - 1).
Testing Several Algorithms in Several Datasets

Comparing with a Control: Methods Based on Ordered p-values
The p-values are ordered p1 ≤ p2 ≤ ... ≤ p(k-1).

Holm Method
A step-down procedure: starting from p1, find the first i = 1, ..., k - 1 such that pi > α/(k - i).
The hypotheses H1, ..., H(i-1) are rejected; the rest of the hypotheses are kept.
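A minimal sketch of the Holm step-down procedure as described above, applied to an arbitrary (illustrative) list of p-values from comparisons with a control.

```python
import numpy as np

def holm(p_values, alpha=0.05):
    """Holm step-down procedure: returns a boolean rejection decision per hypothesis."""
    p = np.asarray(p_values)
    order = np.argsort(p)                 # p_(1) <= p_(2) <= ...
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):    # rank = i - 1 in the slide's notation
        if p[idx] > alpha / (m - rank):   # first i with p_i > alpha / (k - i)
            break
        reject[idx] = True                # reject H_1, ..., H_(i-1)
    return reject

# Illustrative p-values of three comparisons with a control.
print(holm([0.009, 0.031, 0.125], alpha=0.05))   # [ True False False]
```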
Testing Several Algorithms in Several Datasets

Friedman Test: Example (α = 0.05)
Using the average ranks of the example above (3, 2.75, 1.83, 2.41), the pairwise statistics are computed as

z = (Ri - Rj) / sqrt(k(k + 1) / (6N))
Testing Several Algorithms in Several Datasets

Friedman Test: Example (α = 0.05)

        z          p-value    Bonferroni (α/6)    Holm (α/(7 - i))
z12     0.3354     0.259      0.008
z13     2.1569     0.031      0.008               0.012
z14     0.7915     0.125      0.008
z23     1.9843     0.009      0.008               0.010
z24     0.4561     0.221      0.008
z34     -2.7781    0.007      0.008               0.008

With Bonferroni every comparison is tested against α/6 = 0.008, so only the comparison with p = 0.007 is rejected. With Holm the ordered p-values are tested against α/(7 - i): 0.007 ≤ 0.008 and 0.009 ≤ 0.010 are rejected, while 0.031 > 0.0125 stops the procedure and the remaining hypotheses are kept.
Testing Several Algorithms in Several Datasets

Hochberg Method
A step-up procedure: starting with p(k-1), find the first i = k - 1, ..., 1 such that pi < α/(k - i).
The hypotheses H1, ..., Hi are rejected; the rest are kept.

Hommel Method
Find the largest j such that p(n-j+k) > kα/j for all k = 1, ..., j.
Reject all hypotheses Hi such that pi ≤ α/j.
Testing Several Algorithms in Several Datasets

Comments on the Tests
Holm, Hochberg and Hommel are more powerful than Bonferroni.
Hochberg and Hommel are based on Simes' conjecture and can have a FWE higher than α.
In practice Holm obtains results very similar to the others.
Testing Several Algorithms in Several Datasets

All Pairwise Comparisons
Difference with comparing against a control: the pairwise hypotheses are logically related, so not all combinations of true and false hypotheses are possible.
E.g. it cannot happen that C1 is better than C2, C2 is better than C3, and at the same time C1 is equal to C3.
Testing Several Algorithms in Several Datasets

Shaffer Static Procedure
A modification of Holm's procedure.
Starting from p1, find the first i = 1, ..., k(k - 1)/2 such that pi > α/ti.
The hypotheses H1, ..., H(i-1) are rejected; the rest are kept.
ti is the maximum number of hypotheses that can be true given that (i - 1) are false.
It is a static procedure: ti is determined from the set of hypotheses, independently of the p-values.
Testing Several Algorithms in Several Datasets

Shaffer Dynamic Procedure
Similar to the previous procedure, but ti is replaced by ti*.
ti* considers the maximum number of hypotheses that can be true given that the previous (i - 1) hypotheses are false.
It is a dynamic procedure, since ti* depends on the hypotheses already rejected.
It is more powerful than the Shaffer Static Procedure.
Testing Several Algorithms in Several Datasets

Bergmann & Hommel
A more powerful alternative than the Shaffer Dynamic Procedure, but difficult to implement.

Remarks
Adjusted p-values can be reported instead of fixed significance levels.
Conclusions

Two Classifiers in a Dataset
The complexity of the estimation of the scores makes it difficult to carry out good statistical testing.

Two Classifiers in Several Datasets
The Wilcoxon Signed-Ranks Test is a good choice.
With many data sets, and to avoid the commensurability problem, the Sign test can be used.
Conclusions

Several Classifiers in Several Datasets
The Friedman or Iman & Davenport tests are required.
Post-hoc tests more powerful than Bonferroni:
comparison with a control: the Holm method;
all-to-all comparison: the Shaffer Static method.

An Idea for Future Work
To consider the variability of the score in each classifier and data set.
Classifier Performance Evaluation and Comparison

Jose A. Lozano, Guzmán Santafé, Iñaki Inza
Intelligent Systems Group, The University of the Basque Country
International Conference on Machine Learning and Applications (ICMLA 2010), December 12-14, 2010

Thank You!