Evaluation of Classifiers
ROC Curves
Reject Curves
Precision-Recall Curves
Statistical Tests
– Estimating the error rate of a classifier
– Comparing two classifiers
– Estimating the error rate of a learning algorithm
– Comparing two algorithms
Cost-Sensitive Learning
In most applications, false positive and false
negative errors are not equally important. We
therefore want to adjust the tradeoff between
them. Many learning algorithms provide a way
to do this:
– probabilistic classifiers: combine cost matrix with
decision theory to make classification decisions
– discriminant functions: adjust the threshold for
classifying into the positive class
– ensembles: adjust the number of votes required to
classify as positive
Example: 30 decision trees constructed by bagging.
Classify as positive if at least K of the 30 trees predict positive, and vary K (see the sketch below).
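A minimal Python sketch of this example (not from the lecture): it trains a bagged ensemble of 30 trees with scikit-learn and sweeps the vote threshold K, counting false positives and false negatives at each setting. The dataset, class weights, and all names are illustrative assumptions.

    # Sketch: vary the vote threshold K of a 30-tree bagged ensemble to trade
    # off false positives against false negatives. Dataset is synthetic.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                                 random_state=0).fit(X_tr, y_tr)

    # Count positive votes among the 30 trees for each test example.
    votes = sum(tree.predict(X_te) for tree in ensemble.estimators_)

    for K in range(1, 31):
        y_hat = (votes >= K).astype(int)          # positive iff at least K trees agree
        fp = int(np.sum((y_hat == 1) & (y_te == 0)))
        fn = int(np.sum((y_hat == 0) & (y_te == 1)))
        print(f"K={K:2d}  FP={fp:4d}  FN={fn:4d}")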
Directly Visualizing the Tradeoff
We can plot the false positives versus false negatives directly. If
L(0,1) = R · L(1,0) (i.e., a FN is R times more expensive than a FP),
then the best operating point will be tangent to a line with a slope of
–R
[Figure: false positives plotted against false negatives as the threshold varies. If R = 1, the best operating point is at threshold 10; if R = 10, it is at threshold 29.]
Receiver Operating Characteristic
(ROC) Curve
It is traditional to plot this same information in a
normalized form with 1 – False Negative Rate
plotted against the False Positive Rate.
The optimal
operating point is
tangent to a line with
a slope of R
Generating ROC Curves
Linear Threshold Units, Sigmoid Units, Neural
Networks
– adjust the classification threshold between 0 and 1
K nearest neighbor
– adjust number of votes (between 0 and k) required to
classify positive
Naïve Bayes, Logistic Regression, etc.
– vary the probability threshold for classifying as
positive
Support vector machines
– require different margins for positive and negative
examples
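A minimal sketch of this recipe for one of the model families above, assuming scikit-learn: train a scorer, then sweep the classification threshold over its scores. The logistic regression model and synthetic dataset are placeholders; roc_curve performs the threshold sweep.

    # Sketch: generate an ROC curve by sweeping the probability threshold of a
    # scoring classifier. Any model producing a real-valued score works the same way.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
    print("AUC =", roc_auc_score(y_te, scores))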
SVM: Asymmetric Margins
Minimize $\|w\|^2 + C \sum_i \xi_i$
Subject to
    $w \cdot x_i + \xi_i \ge R$   (positive examples)
    $-w \cdot x_i + \xi_i \ge 1$   (negative examples)
ROC Convex Hull
If we have two classifiers h1 and h2 with (fp1,fn1)
and (fp2,fn2), then we can construct a stochastic
classifier that interpolates between them. Given
a new data point x, we use classifier h1 with
probability p and h2 with probability (1-p). The
resulting classifier has an expected false positive
level of p fp1 + (1 – p) fp2 and an expected false
negative level of p fn1 + (1 – p) fn2.
This means that we can create a classifier that
matches any point on the convex hull of the
ROC curve
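A minimal sketch of such a stochastic classifier; the component classifiers h1 and h2 and the mixing probability p are illustrative placeholders.

    # Sketch: a stochastic classifier that uses h1 with probability p and h2
    # otherwise. Its expected (FP, FN) rates interpolate linearly between the
    # two operating points, tracing the segment between them on the ROC plot.
    import numpy as np

    def stochastic_classifier(h1, h2, p, rng=np.random.default_rng(0)):
        def h(x):
            return h1(x) if rng.random() < p else h2(x)
        return h

    # Two trivial threshold classifiers on a 1-D score, for illustration:
    h1 = lambda x: int(x > 0.3)   # aggressive: more FPs, fewer FNs
    h2 = lambda x: int(x > 0.7)   # conservative: fewer FPs, more FNs
    h = stochastic_classifier(h1, h2, p=0.25)
    print([h(x) for x in (0.2, 0.5, 0.9)])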
ROC Convex Hull
[Figure: the ROC convex hull overlaid on the original ROC curve.]
Maximizing AUC
At learning time, we may not know the cost ratio
R. In such cases, we can maximize the Area
Under the ROC Curve (AUC)
Efficient computation of AUC
– Assume h(x) returns a real quantity (larger values =>
class 1)
– Sort xi according to h(xi). Number the sorted points
from 1 to N such that r(i) = the rank of data point xi
– AUC = probability that a randomly chosen example
from class 1 ranks above a randomly chosen example
from class 0 = the Wilcoxon-Mann-Whitney statistic
Computing AUC
Let S1 = sum of r(i) for yi = 1 (sum of the
ranks of the positive examples)
$$\widehat{AUC} = \frac{S_1 - N_1(N_1 + 1)/2}{N_0 N_1}$$
where N0 is the number of negative
examples and N1 is the number of positive
examples
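A sketch of this rank-based computation in Python, assuming SciPy's rankdata (tied scores receive averaged ranks); the label and score vectors are illustrative.

    # Sketch: compute AUC from ranks (Wilcoxon-Mann-Whitney), following the
    # formula above.
    import numpy as np
    from scipy.stats import rankdata

    def auc_from_ranks(y, scores):
        ranks = rankdata(scores)              # ranks 1..N, ties get average rank
        n1 = np.sum(y == 1)                   # number of positive examples
        n0 = np.sum(y == 0)                   # number of negative examples
        s1 = ranks[y == 1].sum()              # sum of ranks of positive examples
        return (s1 - n1 * (n1 + 1) / 2) / (n0 * n1)

    y = np.array([0, 0, 1, 1, 0, 1])
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])
    print(auc_from_ranks(y, scores))          # 8/9 for this toy example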
Optimizing AUC
A hot topic in machine learning right now
is developing algorithms for optimizing
AUC
RankBoost: A modification of AdaBoost.
The main idea is to define a “ranking loss”
function and then penalize a training
example x by the number of examples of
the other class that are misranked (relative
to x)
Rejection Curves
In most learning algorithms, we can
specify a threshold for making a rejection
decision
– Probabilistic classifiers: adjust cost of
rejecting versus cost of FP and FN
– Decision-boundary method: if a test point x is
within θ of the decision boundary, then reject
Equivalent to requiring that the “activation” of the
best class be larger than that of the second-best class
by at least θ
Rejection Curves (2)
Vary θ and plot fraction correct versus fraction
rejected
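A sketch of one way to trace such a curve for a probabilistic binary classifier, assuming the rejection rule "reject x if its predicted probability is within θ of 0.5"; the model, dataset, and grid of θ values are illustrative.

    # Sketch: rejection curve. For each theta, reject test points near the
    # decision boundary and report accuracy on the accepted points.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=2)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

    proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)
    margin = np.abs(proba[:, 1] - 0.5)        # distance from the decision boundary
    y_hat = (proba[:, 1] >= 0.5).astype(int)

    for theta in np.linspace(0.0, 0.45, 10):
        accepted = margin >= theta
        frac_rejected = 1 - accepted.mean()
        acc = (y_hat[accepted] == y_te[accepted]).mean() if accepted.any() else float("nan")
        print(f"theta={theta:.2f}  rejected={frac_rejected:.2f}  accuracy={acc:.3f}")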
Precision versus Recall
Information Retrieval:
– y = 1: document is relevant to query
– y = 0: document is irrelevant to query
– K: number of documents retrieved
Precision:
– fraction of the K retrieved documents (ŷ=1) that are
actually relevant (y=1)
– TP / (TP + FP)
Recall:
– fraction of all relevant documents that are retrieved
– TP / (TP + FN) = true positive rate
Precision Recall Graph
Plot recall on horizontal axis; precision on
vertical axis; and vary the threshold for making
positive predictions (or vary K)
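A minimal sketch of this threshold sweep, assuming scikit-learn's precision_recall_curve; the model and dataset are placeholders.

    # Sketch: trace a precision-recall curve by varying the threshold on a
    # scoring classifier trained on an imbalanced synthetic dataset.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=3)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_te, scores)
    # Plot recall (x axis) against precision (y axis); each point is one threshold.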
The F1 Measure
Figure of merit that combines precision
and recall.
$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$
where P = precision and R = recall. This is the
harmonic mean of P and R.
We can plot F1 as a function of the
classification threshold θ
Summarizing a Single Operating
Point
WEKA and many other systems normally report
various measures for a single operating point
(e.g., θ = 0.5). Here is example output from
WEKA:
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
0.854 0.1 0.899 0.854 0.876 0
0.9 0.146 0.854 0.9 0.876 1
Visualizing ROC and P/R Curves in
WEKA
Right-click on the result list and choose
“Visualize Threshold Curve”. Select “1” from the
popup window.
ROC:
– Plot False Positive Rate on X axis
– Plot True Positive Rate on Y axis
– WEKA will display the AUC also
Precision/Recall:
– Plot Recall on X axis
– Plot Precision on Y axis
WEKA does not support rejection curves
Sensitivity and Selectivity
In medical testing, the terms “sensitivity” and
“specificity” are used
– Sensitivity = TP/(TP + FN) = true positive rate = recall
– Specificity = TN/(FP + TN) = true negative rate = recall for the negative class = 1 – the false positive rate
The sensitivity versus specificity tradeoff is
identical to the ROC curve tradeoff
Estimating the Error Rate of a
Classifier
Compute the error rate on hold-out data
– suppose a classifier makes k errors on n holdout data
points
– the estimated error rate is ê = k / n.
Compute a confidence interval on this estimate
– the standard error of this estimate is
  $$SE = \sqrt{\frac{\hat{\epsilon}\,(1 - \hat{\epsilon})}{n}}$$
– A 1 – α confidence interval on the true error ε is
  $$\hat{\epsilon} - z_{\alpha/2}\,SE \le \epsilon \le \hat{\epsilon} + z_{\alpha/2}\,SE$$
– For a 95% confidence interval, $z_{0.025} = 1.96$, so we use
  $$\hat{\epsilon} - 1.96\,SE \le \epsilon \le \hat{\epsilon} + 1.96\,SE$$
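A minimal numeric sketch of these formulas; the counts k and n are made up.

    # Sketch: hold-out error estimate and its 95% confidence interval.
    import math

    k, n = 83, 1000                      # errors and hold-out size (illustrative)
    eps_hat = k / n                      # estimated error rate
    se = math.sqrt(eps_hat * (1 - eps_hat) / n)
    z = 1.96                             # z_{alpha/2} for a 95% interval
    print(f"error = {eps_hat:.3f}, "
          f"95% CI = [{eps_hat - z*se:.3f}, {eps_hat + z*se:.3f}]")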
Comparing Two Classifiers
Goal: decide which of two classifiers h1 and h2 has lower
error rate
Method: Run them both on the same test data set and
record the following information:
– n00: the number of examples correctly classified by both classifiers
– n01: the number of examples correctly classified by h1 but misclassified by h2
– n10: the number of examples misclassified by h1 but correctly classified by h2
– n11: the number of examples misclassified by both h1 and h2

      n00   n01
      n10   n11
McNemar’s Test
$$M = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}} > \chi^2_{1,\alpha}$$
M is distributed approximately as $\chi^2$ with 1
degree of freedom. For a 95% confidence
test, $\chi^2_{1,0.95} = 3.84$. So if M is larger than
3.84, then with 95% confidence we can
reject the null hypothesis that the two
classifiers have the same error rate
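A minimal numeric sketch of the test; the disagreement counts n01 and n10 are made up.

    # Sketch: McNemar's test from the two disagreement counts, using the
    # continuity-corrected statistic above.
    n01, n10 = 15, 30                    # illustrative disagreement counts
    M = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    print("M =", M, "-> reject H0 at 95%" if M > 3.84 else "-> cannot reject H0")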
Confidence Interval on the
Difference Between Two Classifiers
Let pij = nij/n be the 2x2 contingency table
converted to probabilities
$$SE = \sqrt{\frac{p_{01} + p_{10} - (p_{01} - p_{10})^2}{n}}$$
$p_A = p_{10} + p_{11}$   (error rate of h1)
$p_B = p_{01} + p_{11}$   (error rate of h2)
A 95% confidence interval on the difference in
the true error between the two classifiers is
$$p_A - p_B - 1.96\left(SE + \frac{1}{2n}\right) \le \epsilon_A - \epsilon_B \le p_A - p_B + 1.96\left(SE + \frac{1}{2n}\right)$$
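A minimal numeric sketch of this interval; the 2x2 counts are made up, and the SE formula is the one given above.

    # Sketch: 95% confidence interval on the difference in error rates of two
    # classifiers from their 2x2 disagreement table.
    import math

    n00, n01, n10, n11 = 850, 15, 30, 105
    n = n00 + n01 + n10 + n11
    p01, p10, p11 = n01 / n, n10 / n, n11 / n
    pA, pB = p10 + p11, p01 + p11                      # error rates of h1 and h2
    se = math.sqrt((p01 + p10 - (p01 - p10) ** 2) / n)
    half = 1.96 * (se + 1 / (2 * n))
    print(f"diff = {pA - pB:.4f}, "
          f"95% CI = [{pA - pB - half:.4f}, {pA - pB + half:.4f}]")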
Cost-Sensitive Comparison of Two
Classifiers
Suppose we have a non-0/1 loss matrix L(ŷ,y) and we
have two classifiers h1 and h2. Goal: determine which
classifier has lower expected loss.
A method that does not work well:
– For each algorithm a and each test example (xi,yi) compute ℓa,i =
L(ha(xi),yi).
– Let δi = ℓ1,i – ℓ2,i
– Treat the δ’s as normally distributed and compute a normal
confidence interval
The problem is that there are only a finite number of
different possible values for δi. They are not normally
distributed, and the resulting confidence intervals are too
wide
A Better Method: BDeltaCost
Let $\Delta = \{\delta_i\}_{i=1}^N$ be the set of $\delta_i$'s computed as
above
For b from 1 to 1000 do
– Let Tb be a bootstrap replicate of ∆
– Let sb = average of the δ’s in Tb
Sort the sb’s and identify the 26th and 975th
items. These form a 95% confidence interval on
the average difference between the loss from h1
and the loss from h2.
The bootstrap confidence interval quantifies the
uncertainty due to the size of the test set. It
does not allow us to compare algorithms, only
classifiers.
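A sketch of this bootstrap procedure; the per-example loss differences δi are synthetic stand-ins.

    # Sketch of BDeltaCost: resample the loss differences, average each bootstrap
    # replicate, and read off the 2.5% and 97.5% points of the sorted averages.
    import numpy as np

    rng = np.random.default_rng(0)
    deltas = rng.choice([-1.0, 0.0, 0.0, 2.0], size=500)   # illustrative loss differences

    means = []
    for _ in range(1000):
        replicate = rng.choice(deltas, size=len(deltas), replace=True)  # bootstrap replicate
        means.append(replicate.mean())

    means.sort()
    lo, hi = means[25], means[974]       # 26th and 975th of the 1000 sorted averages
    print(f"95% CI on mean loss difference: [{lo:.3f}, {hi:.3f}]")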
Estimating the Error Rate of a
Learning Algorithm
Under the PAC model, training examples x are drawn
from an underlying distribution D and labeled according
to an unknown function f to give (x,y) pairs where y =
f(x).
The error rate of a classifier h is
error(h) = P_D(h(x) ≠ f(x))
Define the error rate of a learning algorithm A for sample
size m and distribution D as
error(A, m, D) = E_S[error(A(S))]
This is the expected error rate of h = A(S) for training
sets S of size m drawn according to D.
We could estimate this if we had several training sets S1,
…, SL all drawn from D. We could compute A(S1), A(S2),
…, A(SL), measure their error rates, and average them.
Unfortunately, we don’t have enough data to do this!
Two Practical Methods
k-fold Cross Validation
– This provides an unbiased estimate of error(A, (1 –
1/k)m, D) for training sets of size (1 – 1/k)m
Bootstrap error estimate (out-of-bag estimate)
– Construct L bootstrap replicates of S_train
– Train A on each of them
– Evaluate on the examples that did not appear in the
bootstrap replicate
– Average the resulting error rates
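A sketch of the k-fold cross-validation estimate, assuming scikit-learn; the learning algorithm (a decision tree) and the dataset are placeholders.

    # Sketch: k-fold cross-validation estimate of error(A, (1 - 1/k)m, D).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=4)
    errors = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                               random_state=4).split(X, y):
        h = DecisionTreeClassifier(random_state=4).fit(X[train_idx], y[train_idx])
        errors.append(np.mean(h.predict(X[test_idx]) != y[test_idx]))
    print("estimated error(A, 0.9*m, D) =", np.mean(errors))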
Estimating the Difference Between
Two Algorithms: the 5x2CV F test
for i from 1 to 5 do
    perform a 2-fold cross-validation:
    split S evenly and randomly into S1 and S2
    for j from 1 to 2 do
        Train algorithm A on S_j, measure error rate $p_A^{(i,j)}$
        Train algorithm B on S_j, measure error rate $p_B^{(i,j)}$
        $p_i^{(j)} := p_A^{(i,j)} - p_B^{(i,j)}$   (difference in error rates on fold j)
    end /* for j */
    $\bar{p}_i := \frac{p_i^{(1)} + p_i^{(2)}}{2}$   (average difference in error rates in iteration i)
    $s_i^2 := \left(p_i^{(1)} - \bar{p}_i\right)^2 + \left(p_i^{(2)} - \bar{p}_i\right)^2$   (variance in the difference, for iteration i)
end /* for i */
$$F := \frac{\sum_i \sum_j \left(p_i^{(j)}\right)^2}{2 \sum_i s_i^2}$$
5x2cv F test
    iteration i   fold j   error rates                 difference    per-iteration stats
    1             1        p_A^(1,1)   p_B^(1,1)       p_1^(1)       p̄_1,  s_1^2
    1             2        p_A^(1,2)   p_B^(1,2)       p_1^(2)
    2             1        p_A^(2,1)   p_B^(2,1)       p_2^(1)       p̄_2,  s_2^2
    2             2        p_A^(2,2)   p_B^(2,2)       p_2^(2)
    3             1        p_A^(3,1)   p_B^(3,1)       p_3^(1)       p̄_3,  s_3^2
    3             2        p_A^(3,2)   p_B^(3,2)       p_3^(2)
    4             1        p_A^(4,1)   p_B^(4,1)       p_4^(1)       p̄_4,  s_4^2
    4             2        p_A^(4,2)   p_B^(4,2)       p_4^(2)
    5             1        p_A^(5,1)   p_B^(5,1)       p_5^(1)       p̄_5,  s_5^2
    5             2        p_A^(5,2)   p_B^(5,2)       p_5^(2)
5x2CV F test
If F > 4.47, then with 95% confidence, we
can reject the null hypothesis that
algorithms A and B have the same error
rate when trained on data sets of size m/2.
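A sketch of the whole 5x2cv F test, assuming scikit-learn; the two algorithms (a decision tree and logistic regression) and the dataset are illustrative placeholders, and 4.47 is the threshold quoted above.

    # Sketch: 5x2cv F test comparing two learning algorithms A and B.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    def error_rate(algo, X_tr, y_tr, X_te, y_te):
        return np.mean(algo().fit(X_tr, y_tr).predict(X_te) != y_te)

    X, y = make_classification(n_samples=1000, flip_y=0.2, random_state=5)
    A = lambda: DecisionTreeClassifier(random_state=0)
    B = lambda: LogisticRegression(max_iter=1000)

    sum_p_sq, sum_s2 = 0.0, 0.0
    for i in range(5):
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
        p = []
        for train_idx, test_idx in cv.split(X, y):
            pA = error_rate(A, X[train_idx], y[train_idx], X[test_idx], y[test_idx])
            pB = error_rate(B, X[train_idx], y[train_idx], X[test_idx], y[test_idx])
            p.append(pA - pB)                       # p_i^(j)
        p_bar = (p[0] + p[1]) / 2                   # average difference in iteration i
        sum_s2 += (p[0] - p_bar) ** 2 + (p[1] - p_bar) ** 2
        sum_p_sq += p[0] ** 2 + p[1] ** 2

    F = sum_p_sq / (2 * sum_s2)
    print("F =", F, "-> reject H0" if F > 4.47 else "-> cannot reject H0")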
Summary
ROC Curves
Reject Curves
Precision-Recall Curves
Statistical Tests
– Estimating error rate of classifier
– Comparing two classifiers
– Estimating error rate of a learning algorithm
– Comparing two algorithms