
Pattern Recognition, Vol. 30, No. 7, pp. 1145-1159, 1997
© 1997 Pattern Recognition Society. Published by Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0031-3203/97 $17.00+.00

PII: S0031-3203(96)00142-2

THE USE OF THE AREA UNDER THE ROC CURVE IN THE EVALUATION OF MACHINE LEARNING ALGORITHMS

ANDREW P. BRADLEY*
Cooperative Research Centre for Sensor Signal and Information Processing, Department of Electrical and Computer Engineering, The University of Queensland, QLD 4072, Australia

(Received 15 April 1996; in revised form 29 July 1996; received for publication 10 September 1996)

Abstract: In this paper we investigate the use of the area under the receiver operating characteristic (ROC) curve (AUC) as a performance measure for machine learning algorithms. As a case study we evaluate six machine learning algorithms (C4.5, Multiscale Classifier, Perceptron, Multi-layer Perceptron, k-Nearest Neighbours, and a Quadratic Discriminant Function) on six "real world" medical diagnostics data sets. We compare and discuss the use of AUC against the more conventional overall accuracy and find that AUC exhibits a number of desirable properties when compared to overall accuracy: increased sensitivity in Analysis of Variance (ANOVA) tests; a standard error that decreases as both AUC and the number of test samples increase; independence of the decision threshold; and invariance to a priori class probabilities. The paper concludes with the recommendation that AUC be used in preference to overall accuracy for "single number" evaluation of machine learning algorithms. © 1997 Pattern Recognition Society. Published by Elsevier Science Ltd.

Key words: The ROC curve; The area under the ROC curve (AUC); Accuracy measures; Cross-validation; Wilcoxon statistic; Standard error.

1. INTRODUCTION

The Receiver Operating Characteristic (ROC) curve has long been used, in conjunction with the Neyman-Pearson method, in signal detection theory.(1,2) As such, it is a good way of visualising a classifier's performance in order to select a suitable operating point, or decision threshold. However, when comparing a number of different classification schemes it is often desirable to obtain a single figure as a measure of the classifier's performance. Often this figure is a cross-validated estimate of the classifier's overall accuracy [probability of a correct response, P(C)]. In this paper we discuss the use of the area under the ROC curve (AUC) as a measure of a classifier's performance.

This paper addresses the generic problem of how to accurately evaluate the performance of a system that learns by being shown labelled examples. As a case study, we look at the performance of six different classification schemes on six "real world" medical data sets. These data sets are chosen to characterize those typically found in medical diagnostics: they have primarily continuous input attributes and have overlapping output classes. When comparing the performance of the classification schemes, Analysis of Variance (ANOVA) is used to test the statistical significance of any differences in the accuracy and AUC measures. Duncan's multiple range test(3) is then used to obtain the significant subgroups for both these performance measures. Results are presented in the form of ROC curves and ranked estimates of each classification scheme's overall accuracy and AUC. Discussion is then focused both on the performance of the different classification schemes and on the methodology used to compare them.

The paper is structured in the following way: Section 2 details some commonly used performance measures and describes the use of the ROC curve and, in particular, AUC as a performance measure; Section 3 briefly describes the six data sets to be used in the experimental study; Section 4 details the implementations of the six learning algorithms used and describes how they are modified so that the decision threshold can be varied and a ROC curve produced; Section 5 describes the experimental methodology used, outlines three types of experimental bias, and describes how this bias can be avoided; Section 6 gives specific details of the performance measures and Section 7 the statistical techniques used to compare these measures. Section 8 presents a summary of the results, which are then discussed in detail in the remaining sections of the paper.

2. AUC AS A PERFORMANCE MEASURE

The "raw data" produced by a classification scheme during testing are counts of the correct and incorrect classifications from each class. This information is then normally displayed in a confusion matrix. A confusion matrix is a form of contingency table showing the differences between the true and predicted classes for a set of labelled examples, as shown in Table 1. In Table 1, Tp and Tn are the numbers of true positives and true negatives respectively, and Fp and Fn are the numbers of false positives and false negatives respectively.

* Present address: Department of Computing Science, 615 General Services Building, University of Alberta, Edmonton, Canada T6G 2H1. E-mail: [email protected].


Table 1. A confusion matrix

                          Predicted class
    True class           -ve       +ve
    -ve                   Tn        Fp       Cn
    +ve                   Fn        Tp       Cp
                          Rn        Rp       N

The row totals, Cn and Cp, are the numbers of truly negative and positive examples, and the column totals, Rn and Rp, are the numbers of predicted negative and positive examples, N being the total number of examples (N = Cn + Cp = Rn + Rp). Although the confusion matrix shows all of the information about the classifier's performance, more meaningful measures can be extracted from it to illustrate certain performance criteria, for example:

    Accuracy (1 - Error) = (Tp + Tn) / (Cp + Cn) = P(C),    (1)

    Sensitivity (1 - β) = Tp / Cp = P(Tp),    (2)

    Specificity (1 - α) = Tn / Cn = P(Tn),    (3)

    Positive predictive value = Tp / Rp,    (4)

    Negative predictive value = Tn / Rn.    (5)

All of these measures of performance are valid only for one particular operating point, an operating point normally being chosen so as to minimise the probability of error. However, in general it is not misclassification rate we want to minimise, but rather misclassification cost. Misclassification cost is normally defined as follows:

    Cost = Fp·C_Fp + Fn·C_Fn.    (6)

Unfortunately, we rarely know what the individual misclassification costs actually are (here, the cost of a false positive, C_Fp, and the cost of a false negative, C_Fn) and so system performance is often specified in terms of the required false positive and false negative rates, P(Fp) and P(Fn). This is then equivalent to the Neyman-Pearson method,(1,2) where P(Fn) is specified and P(Fp) is minimised with that constraint, or vice versa. Often, the only way of doing the constrained minimisation required for the Neyman-Pearson method is to plot P(Tp) against P(Fp) as the decision threshold is varied, selecting the operating point (decision threshold) that most closely meets the requirements for P(Fn) and P(Fp). The plot of P(Tp) against P(Fp) as the decision threshold is varied is called a Receiver Operating Characteristic (ROC) curve.

There is still, however, a problem with specifying performance in terms of a single operating point [usually a P(Tp), P(Tn) pair], in that there is no indication as to how these two measures vary as the decision threshold is varied. They may represent an operating point where sensitivity [P(Tp)] can be increased with little loss in specificity [P(Tn)], or they may not. This means that the comparison of two systems can become ambiguous. Therefore, there is a need for a single measure of classifier performance [often termed accuracy, but not to be confused with P(C)] that is invariant to the decision criterion selected and to prior probabilities, and is easily extended to include cost/benefit analysis. This paper describes the results of an experimental study to investigate the use of the area under the ROC curve (AUC) as such a measure of classifier performance.

When the decision threshold is varied and a number of points on the ROC curve [P(Fp) = α, P(Tp) = 1 - β] have been obtained, the simplest way to calculate the area under the ROC curve is to use trapezoidal integration,

    AUC = Σ_i { (1 - β_i)·Δα + ½ [Δ(1 - β)·Δα] },    (7)

where

    Δ(1 - β) = (1 - β_i) - (1 - β_{i-1}),    (8)

    Δα = α_i - α_{i-1}.    (9)

It is also possible to calculate the AUC by assuming that the underlying probabilities of predicting negative or positive are Gaussian. The ROC curve will then have an exponential form and can be fitted either: directly, using an iterative Maximum Likelihood (ML) estimation,(4) giving the difference in means and the ratio of the variances of the positive and negative distributions; or, if the ROC curve is plotted on double probability paper, a straight line can be fitted to the points on the ROC curve.(5) The slope and intercept of this fitted line are then used to obtain an estimate of the AUC.

As noted in reference (6), the trapezoidal approach systematically underestimates the AUC. This is because all of the points on the ROC curve are connected with straight lines rather than smooth concave curves. However, providing there are a reasonable number of points on the ROC curve the underestimation of the area should not be too severe. In this experiment we obtain at least seven points from which to estimate the AUC and in most cases there are 15 points. The trapezoidal approach also does not rely on any assumptions as to the underlying distributions of the positive and negative examples and, as will be elaborated on in Section 9.3, measures exactly the same quantity as the Wilcoxon test of ranks.

The standard error of the AUC, SE(θ̂),(6) is of importance if we wish to test the significance of one classification scheme producing a higher AUC than another. Conventionally there have been three ways of calculating this variability associated with the AUC:(7)

1. from the confidence intervals associated with the maximum likelihood estimate of the AUC, θ̂;
2. from the standard error of the Wilcoxon statistic, SE(W); and
3. from an approximation to the Wilcoxon statistic that assumes that the underlying positive and negative distributions are exponential in type.(6) This assumption has been shown to be conservative; it slightly overestimates the standard error when compared to assuming a Gaussian-based ROC curve (as in the ML method).
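As a concrete illustration of the trapezoidal integration of equations (7)-(9), the following Python sketch (not from the original paper; the toy scores, labels and function names are illustrative assumptions) builds ROC points by sweeping a decision threshold over classifier scores and then integrates them:

```python
# Sketch: ROC points by threshold sweep and trapezoidal AUC, following eqs (7)-(9).
# Illustrative only; scores/labels and function names are not from the paper.

def roc_points(scores, labels, thresholds):
    """Return (alpha, 1 - beta) = (P(Fp), P(Tp)) pairs, one per threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    points = []
    for t in thresholds:
        tp = sum(s >= t for s in pos)          # true positives at this threshold
        fp = sum(s >= t for s in neg)          # false positives at this threshold
        points.append((fp / len(neg), tp / len(pos)))
    return sorted(points)                      # order by increasing alpha

def trapezoidal_auc(points):
    """Trapezoidal integration of eqs (7)-(9) over ROC points (alpha, 1 - beta)."""
    auc = 0.0
    for (a0, s0), (a1, s1) in zip(points[:-1], points[1:]):
        d_alpha = a1 - a0                      # eq (9)
        d_sens = s1 - s0                       # eq (8)
        auc += s0 * d_alpha + 0.5 * d_sens * d_alpha   # one trapezoid of eq (7)
    return auc

# Toy usage: seven thresholds, the minimum number of ROC points used in the paper.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    0,   1,   0]
pts = [(0.0, 0.0)] + roc_points(scores, labels, [i / 6 for i in range(7)]) + [(1.0, 1.0)]
print(trapezoidal_auc(pts))
```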

The standard error, SE(W), is given by

    SE(W) = sqrt{ [ θ(1 - θ) + (Cp - 1)(Q1 - θ²) + (Cn - 1)(Q2 - θ²) ] / (Cn·Cp) },    (10)

where Cn and Cp are the numbers of negative and positive examples respectively, and

    Q1 = θ / (2 - θ),    (11)

    Q2 = 2θ² / (1 + θ).    (12)

In this paper we shall calculate AUC using trapezoidal integration and estimate the standard deviation, SD(θ̂), using both SE(W) and cross-validation, details of which are given in Sections 5 and 6. Next, we shall present the details of the data sets, classification algorithms, and methodology chosen for this experimental study.

3. THE DATA

The data sets used in this experiment all have two output classes and have between four and 13, primarily continuous, input variables. Except for the algorithms C4.5 and the Multiscale Classifier, which automatically handle categorical inputs, any categorical input variables were made continuous by producing dummy variables.(8)

The six data sets chosen for use in this experiment were:

1. Cervical cell nuclear texture analysis (Texture);(9)
2. Post-operative bleeding after cardiopulmonary bypass surgery (Heart);(10)
3. Breast cancer diagnosis (Breast);(11)
4. Pima Indians diabetes prediction (Pima);(12)
5. Heart disease diagnosis:(13,14)
   (a) Hungarian data set (Hungarian);
   (b) Cleveland data set (Cleveland).

All input variables were scaled to the range [0,1] using a linear transformation making the minimum value zero and the maximum value 1. This is a requirement for the Multiscale Classifier,(15)¹ but was done for all of the learning algorithms for consistency (with no loss of generality). Also, all examples in the data sets that had any missing input variables were removed; this was less than 1% of the available data in most of the data sets.

3.1. Cervical cell nuclear texture

These data were gathered by Ross Walker as part of a study into the use of nuclear texture analysis for the diagnosis of cervical cancer.(9) The data set consisted of 117 segmented images of normal and abnormal cervical cell nuclei. Using Grey Level Co-occurrence Matrix (GLCM) techniques, 56 texture features were extracted from each of these images. The six most discriminatory features were then selected using sequential forward selection (SFS) with the Bhattacharyya distance measure,(16,17) giving 117 examples (58 normal, 59 abnormal) each with six features:

1. Inertia at distance one;
2. Correlation at distance one;
3. Cluster prominence at distance one;
4. Entropy at distance 15;
5. Inverse Difference Moment (IDM) at distance 11;
6. Cluster prominence at distance three.

3.2. Post-operative bleeding

The data were gathered independently as part of a study into post-operative bleeding undertaken at the Prince Charles Hospital in Brisbane.(10) Over 200 parameters have been recorded for each of 134 patients. However, due to the limited size of the data set, only the four routinely measured parameters with the highest statistical correlation to blood loss were used.² The four parameters were:

1. WBAGCOL: Aggregation with collagen (pre-operative);
2. POAGCOL: Aggregation with collagen (post-operative);
3. POSTPLT: Platelet count (post-operative);
4. DILNPLAS: Plasma dilution (post-operative).

Of the original data set of 134 patient records, only 113 contained all four of the required input parameters. All of the input parameters are continuous-valued with a lowest possible value of zero. These parameters are then used to predict the total blood loss, in the three hours post-operative, expressed as a ratio of body surface area. The blood loss is then quantised into two classes, normal and excessive bleeding. Here, a prediction of excessive bleeding is defined as a total blood loss, in the 3 h post-operative, of greater than 16.4 ml/m². This defines 25% of all patients to have bled excessively and is an arbitrary definition that includes patients not clinically assessed as bleeding excessively. It was necessary to associate this absolute binary classification with the blood loss to make the data set consistent with the others used in this paper, and as part of this preliminary study, this simplistic model was thought to be sufficient.

¹ It is also recommended for methods such as k-nearest neighbours.(16)
² They were not highly correlated to the other features selected.

However, most of the classification algorithms detailed in Section 4 have been used for regression, where the actual amount of blood loss would be predicted quantitatively.

The remaining data sets were obtained from the University of California, Irvine, machine learning repository, ftp://ics.uci.edu:pub/machine-learning-databases.

3.3. Breast cancer diagnosis

Collected by Wolberg(11) at the University of Wisconsin, this domain contains some noise and residual variation in its 683 data points, the 16 examples with missing attributes being removed. There are nine integer inputs, each with a value between 1 and 10. The two output classes, benign and malignant, are non-evenly distributed (65.5% and 34.5% respectively).

3.4. Pima Indians diabetes

The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e. if the 2 h post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, U.S.A. There are eight continuously valued inputs with some noise and residual variation.(12) The two non-uniformly distributed output classes (65.1% and 34.9%) are tested negative or positive for diabetes. There is a total of 768 data points.

3.5. Heart disease diagnosis

The goal of this data set is to predict the presence of coronary artery disease from a number of demographic, observed, and measured patient features. Here, we used two of the available data sets (the ones with the most instances); both data sets have the same instance format but were collected at different hospitals.

3.5.1. Cleveland data. These data were collected by Robert Detrano, M.D., Ph.D. at the V.A. Medical Center and The Cleveland Clinic Foundation. The data were originally collected with 76 raw attributes; however, in previous studies(13,14) only 14 attributes were actually used. The data set contains 297 examples, there being six examples removed because they had missing values. Class distributions are 54% heart disease absent, 46% heart disease present.

3.5.2. Hungarian data. These data were collected by Andras Janosi, M.D. at the Hungarian Institute of Cardiology, Budapest. The data are in exactly the same format as the Cleveland data, except three attributes were removed due to a large percentage of missing values. There are 261 examples, 34 examples being removed because they had missing values. Class distributions are 62.5% heart disease absent, 37.5% heart disease present.

4. THE LEARNING ALGORITHMS

The learning algorithms chosen for this experimental comparison were:

• Quadratic Discriminant Function(18) (Bayes);³
• k-Nearest Neighbours(19) (KNN);
• C4.5(20) (C4.5);
• Multiscale Classifier(15) (MSC);
• Perceptron(21) (PTRON);
• Multi-layer Perceptron(22) (MLP).

We chose a cross-section of popular machine learning techniques together with one algorithm developed in association with the author. There were two statistical methods (KNN and Bayes), two neural networks (PTRON and MLP), and two decision trees (C4.5 and MSC). The following should be noted about the implementations of each of the methods.

Quadratic discriminant function (Bayes). The training data are used to estimate the prior probabilities, P(wj), mean, mj, and covariance, Cj, of the two class distributions. The Bayes decision function for class wj of an example x is then given by

    dj(x) = ln P(wj) - ½ ln|Cj| - ½ [(x - mj)ᵀ Cj⁻¹ (x - mj)].    (13)

This decision function is then a hyper-quadric, the class of an example being selected as the minimum distance class. Misclassification costs, cj, are then applied to these distances, dj, so as to weight the decision function and minimise the Bayes risk of misclassification. For these experiments misclassification costs were used in the range [0,1] in steps of 1/14.

k-Nearest Neighbours. For each test example, the five nearest neighbours (calculated in terms of the sum of the squared differences of each input attribute) in the training set are found. Then, if greater than L, where L = [0, 1, 2, 3, 4, 5], of the nearest neighbours are of class 1, the test sample is assigned to class 1; if not, it is assigned to class 0.

C4.5. Release 5 of the C4.5 decision tree generator(20) was used with the following modification: when pruning a decision tree (in file prune.c), the local class distributions are weighted with the misclassification costs for each class. The default values for all parameters were used on all the data sets. Relative misclassification costs of [0.0:1.0, 0.015625:1.0, 0.03125:1.0, 0.0625:1.0, 0.125:1.0, 0.25:1.0, 0.5:1.0] were used for both classes on all the data sets.

³ We shall refer to this method as "Bayes" even though it is not a truly Bayesian method. It would only be a Bayesian method, i.e. optimal, if the true distributions of the input variables were Gaussian.

The Multiscale Classifier. Version 1.2b1 of the Multiscale Classifier was used on each data set. The MSC was first trained for 10 epochs, or until 100% classification was achieved on the training set; then both pessimistic (MSCP) and minimum error (MSCM) pruning were used on the decision trees produced on each training set. The default pruning parameters of cf=1% and m=8 were used on all data sets for pessimistic and minimum error pruning respectively. Relative misclassification costs of [1.0:1.0, 1.25:1.0, 1.5:1.0, 2.0:1.0, 4.0:1.0, 8.0:1.0, 16.0:1.0, 32.0:1.0] were used for both of the classes on all data sets.

The Perceptron. This consists of one neuron with a threshold activation function. The number of inputs (and weights) to the neuron is equal to the number of input attributes for the problem, plus a bias. The network was trained using the Perceptron learning algorithm(23) for 1000 epochs. The weights learnt were then tested using a neuron with a linear activation function, scaled to give an output in the range [0,1]. The output of this linear neuron was then thresholded at values [0, 0.1, 0.2, 0.3, ..., 1.0] to simulate different misclassification costs.(24)

The Multi-layer Perceptron. Three network architectures were implemented, each with a different number of hidden units. Their network architecture was as follows: an input layer consisting of a number of units equal to the number of input attributes for the problem domain; a hidden layer consisting of 2, 4 and then 8 units; and finally one output unit (MLP2, MLP4, and MLP8 respectively). All of the neurons were fully connected, with log-sigmoid activation functions, i.e. their outputs were in the range [0,1]. All three networks were trained using back-propagation with a learning rate of 0.01 and a momentum of 0.2. Initial values for the weights in the networks were set using the Nguyen-Widrow method,(25) and the networks were trained for 20,000 epochs. Again, during the testing phase the output neuron was thresholded at values [0, 0.1, 0.2, 0.3, ..., 1.0] to simulate different misclassification costs.(24)

5. THE TRAINING METHODOLOGY

It is known that single train and test partitions are not reliable estimators of the true error rate of a classification scheme on a limited data set.(26,27) Therefore, it was decided that a random sub-sampling scheme should be used in this experiment to minimise any estimation bias. A leave-one-out classification scheme was thought computationally too expensive⁴ and so, in accordance with the recommendations in reference (26), 10-fold cross-validation was used on all of the data sets. For consistency, exactly the same data were used to train and test all of the nine classification schemes; this is often called a paired experimental design.(7) The 10-fold cross-validation scheme has been extensively tested and has been shown to provide an adequate and accurate estimate of the true error rate.(27) The cross-validation sampling technique used was random but ensured that the approximate proportions of examples of each class remained 90% in the training set and 10% in the test set. This slight adjustment to maintain the prevalence of each class does not bias the error estimates and is supported in the research literature.(26)

As pointed out by Friedman,(28) no classification method is universally better than any other, each method having a class of target functions for which it is best suited. These experiments, then, are an attempt to investigate which learning algorithms should be used on a particular subset of problems. This subset of "medical diagnostic" problems is characterized by the six data sets presented. Our conclusions are therefore targeted towards this subset of problems and should not be extrapolated beyond the scope of this class of problem. We have tried to minimise any bias in the selection of the problem domains, whilst tightly defining the subset of problems (selection bias). We have selected problems with a wide range of inputs (4-13), which would represent a typical number of features measured, or feature subset selected, for medical diagnostic problems. The binary output classes are, as would be typically expected, overlapping. This is due to varying amounts of noise and residual variation in the measured features, and so a 100% correct classification would not, in general, be possible.

We have tried to minimise the effect of any expert bias by not attempting to tune any of the algorithms to the specific problem domains. Wherever possible, default values of learning parameters were used. These parameters include the pruning parameters for the decision trees, the value of k for the nearest neighbour algorithm, and the learning parameters (learning rate, momentum, and initial conditions) for the neural networks. This naive approach undoubtedly results in lower estimates of the true error rate, but it is a bias that affects all of the learning algorithms equally. If we had attempted to tune the performance of each algorithm on each data set, then our different expertise with each method would have advantaged some algorithms but disadvantaged others. The experimentation time would also have increased dramatically as we evaluated different input representations, input transformations, network architectures, learning parameters, pruning parameters, or identified outlying examples in the training set. Also, in domains with a limited availability of data the introduction of an evaluation set (extracted from the training set) could actually reduce the overall accuracy of the algorithms.

⁴ Particularly for the Multi-layer Perceptron.
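The stratified random sub-sampling described in Section 5 can be sketched as follows (a minimal illustration, not the authors' code; the round-robin fold assignment and the fixed seed are assumptions):

```python
# Sketch of stratified 10-fold cross-validation partitions; illustrative only.
import random

def stratified_folds(labels, k=10, seed=0):
    """Return k folds of example indices, keeping class prevalence roughly
    equal in every fold (so ~90% train / ~10% test per iteration)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):          # deal this class's examples round-robin
            folds[j % k].append(i)
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for the k iterations."""
    folds = stratified_folds(labels, k, seed)
    for t in range(k):
        test = folds[t]
        train = [i for j, f in enumerate(folds) if j != t for i in f]
        yield train, test
```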

6. THE PERFORMANCE MEASURES

For each learning algorithm (nine in total) on each data set (six in total), 10 sets of results (one for each of the 10-fold cross-validation partitions) were stored. The raw data were stored in the form of a confusion matrix and for each of the 10 test partitions the decision thresholds were varied (to produce the ROC curves), giving between 7 and 15 sets of results for each test partition. In order to evaluate the performance of the different learning algorithms on each of the data sets, a number of measures have to be extracted from this raw data (over 6000 sets of results).

Overall accuracy, P(C). For the default (conventional) decision thresholds, with equal misclassification costs, the value of the estimate of the true error rate [equation (1)] was calculated for the 10 cross-validation partitions.

The ROC curve. On each test partition the decision thresholds were effectively varied (by varying misclassification costs, as described in Section 4) to give a set of values for P(Tp) and P(Fp). The "average" ROC curves for each classification scheme are shown in Section 8.

The area under the ROC curve (AUC). As the misclassification costs were varied, as described in Section 4, each successive point on the ROC curve was used in the trapezoidal integration to calculate AUC. The AUC was calculated for each learning algorithm on each of the 10 test partitions. This is in effect using a jackknife estimate to calculate the standard error of the AUC(29) and will be discussed in more detail shortly.

Remark. It should be noted that there are two distinct possibilities when it comes to combining the ROC curves from the different test partitions:(30)

1. Pooling. In pooling, the raw data (the frequencies of true positives and false positives) are averaged. In this way one average, or group, ROC curve is produced from the pooled estimates of each point on the curve. In this case we have 10 estimates of P(Tp) and P(Fp) for each point on the ROC curve. The assumption made when pooling the raw data is that each of the classifiers produced on each of the training partitions comes from the same population. Although the assumption that they come from the same population may be true in terms of their overall discrimination capacity (accuracy), the assumption that for each partition they are all estimating the same points on the ROC curve is less palatable. This can be seen from the fact that pooling the data in this way depresses the combined index of accuracy, AUC.(30)

2. Averaging. The alternative approach is to average the accuracy index extracted from each of the ROC curves on the 10 train and test partitions. So, AUC is calculated for the 10 ROC curves and then averaged, giving an estimate of the true area and an estimate of its standard error, calculated from the standard deviation of the 10 areas. The only problem with this approach is that it does not result in an average ROC curve, only an average AUC. For this reason the pooled responses are used when actually visually showing the whole ROC curves, as in Section 8.

The standard deviation of AUC, SD(θ̂). In order to validate our estimate of the standard deviation of AUC obtained using averaging, SD(θ̂), SE(W) was also calculated using the approximation to the Wilcoxon method, given in equation (10).

7. THE COMPARATIVE TECHNIQUES

7.1. Analysis of variance

In this paper we will use Analysis of Variance (ANOVA) techniques to test the hypothesis of equal means over a number of learning algorithms (populations) simultaneously.(3) The experimental design allows us to compare, on each data set, the mean performance for each learning algorithm and for each train and test partition. This is called two-way classification and effectively tests two hypotheses simultaneously:

1. H1, that all the means are equal over the different train and test partitions;
2. H2, that all the means are equal over the different learning algorithms.

Of these two hypotheses we are only really interested in the second, H2, and we could have used a one-way ANOVA to test this hypothesis alone. However, a one-way ANOVA assumes that all the populations are independent, and can often be a less sensitive test than a two-way ANOVA, which uses the train and test partitions as a blocking factor.(31) The f ratio calculated from this ANOVA table is insensitive to departures from the assumption of equal variances when the sample sizes are equal, as in this case.(3) For this reason a test for the equality of the variances was not done.

7.2. Duncan's multiple range test

When the analysis of variance test on an accuracy measure produces evidence to reject the null hypotheses, H1 and H2, we can accept the alternative hypothesis that all of the mean accuracies are not equal. However, we still do not know which of the means are significantly different from which other means, so we will use Duncan's multiple range test to separate significantly different means into subsets of homogeneous means.

For the difference between two subsets of means to be significant it must exceed a certain value. This value is called the least significant range for the p means, Rp, and is given by

    Rp = rp · sqrt(s² / r),    (14)

where the sample variance, s², is estimated by the error mean square from the analysis of variance, r is the number of observations (rows), and rp is the least significant studentized range for a given level of significance (we chose α = 0.05) and degrees of freedom [(r - 1)(c - 1) = 72]. Tables 2-7 show the subsets of adjacent means that are not significantly different, this being indicated by drawing a line under the subset.
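As a small illustration of the "averaging" approach of Section 6, which is also the source of the SD(θ̂) used later, the sketch below computes per-partition AUCs and their mean and standard deviation (illustrative only; the fold curves shown are hypothetical):

```python
# Sketch: per-fold AUCs averaged to give the AUC estimate and SD(theta_hat).
# Illustrative only; fold_curves stands in for the ROC points of the 10 test folds.
from statistics import mean, stdev

def trap_auc(points):
    """Trapezoidal area under ROC points (alpha, sensitivity), sorted by alpha."""
    pts = sorted(points)
    return sum((a1 - a0) * (s0 + s1) / 2.0
               for (a0, s0), (a1, s1) in zip(pts[:-1], pts[1:]))

# Ten hypothetical fold curves (each already anchored at (0,0) and (1,1)).
fold_curves = [
    [(0.0, 0.0), (0.1, 0.55 + 0.01 * i), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]
    for i in range(10)
]

fold_aucs = [trap_auc(c) for c in fold_curves]
auc_hat = mean(fold_aucs)        # averaged AUC over the 10 partitions
sd_auc = stdev(fold_aucs)        # SD(theta_hat): spread of the 10 fold AUCs
print(auc_hat, sd_auc)
```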

8. RESULTS

In this section we give a summary of the results.

• Nuclear Texture: See Table 2 and Figs 1 and 2.
• Post-operative Heart Bleeding: See Table 3 and Figs 3 and 4.
• Breast Cancer: See Table 4 and Figs 5 and 6.
• Pima Indians Diabetes: See Table 5 and Figs 7 and 8.
• Cleveland Heart Disease: See Table 6 and Figs 9 and 10.
• Hungarian Heart Disease: See Table 7 and Figs 11 and 12.

Table 2. Rank ordered significant subgroups from Duncan's multiple range test on the nuclear texture data

Classifier: PTRON MSCM MSCP C4.5 KNN BAYES MLP8 MLP4 MLP2
Accuracy: 85.0 85.0 85.0 89.2 89.2 89.2 90.0 90.0 91.7

Classifier: MSCP MSCM C4.5 KNN BAYES PTRON MLP4 MLP8 MLP2
AUC: 88.1 88.7 92.1 96.2 96.7 97.8 98.3 98.5 98.6

Table 3. Rank ordered significant subgroups from Duncan's multiple range test on the heart bleeding data

Classifier: MSCM MSCP C4.5 PTRON KNN MLP8 MLP4 MLP2 BAYES
Accuracy: 69.2 70.8 71.7 72.5 74.2 75.0 76.7 78.3 79.1

Classifier: C4.5 KNN MLP4 MLP8 MLP2 PTRON MSCM MSCP BAYES
AUC: 48.7 60.9 65.5 65.7 66.1 69.8 70.0 70.5 73.3

Table 4. Rank ordered significant subgroups from Duncan's multiple range test on the breast cancer data

Classifier: PTRON C4.5 MSCM MSCP MLP8 MLP4 MLP2 KNN BAYES
Accuracy: 72.2 90.7 90.9 91.2 92.7 93.3 93.5 93.6 94.2

Classifier: C4.5 MSCM MSCP PTRON MLP4 MLP8 MLP2 KNN BAYES
AUC: 93.7 94.4 94.4 94.5 95.2 96.2 96.5 97.0 98.2

Table 5. Rank ordered significant subgroups from Duncan's multiple range test on the Pima diabetes data

Classifier: MSCM MSCP C4.5 PTRON KNN BAYES MLP8 MLP4 MLP2
Accuracy: 68.1 68.2 71.7 73.6 74.8 75.9 77.0 77.1 78.4

Classifier: MSCM MSCP BAYES KNN C4.5 MLP8 MLP4 PTRON MLP2
AUC: 74.1 74.4 76.3 79.4 80.2 82.3 83.4 84.7 85.3

Table 6. Rank ordered significant subgroups from Duncan's multiple range test on the Cleveland heart disease data

Classifier: MSCM MSCP PTRON C4.5 MLP8 MLP4 MLP2 KNN BAYES
Accuracy: 68.7 68.7 75.0 77.7 81.0 81.0 81.3 82.7 86.3

Classifier: MSCP MSCM C4.5 MLP8 MLP2 MLP4 KNN BAYES PTRON
AUC: 73.7 73.8 84.2 84.4 85.9 86.1 86.9 90.8 91.2

Table 7. Rank ordered significant subgroups from Duncan's multiple range test on the Hungarian heart disease data

Classifier: MSCM MSCP C4.5 KNN MLP4 PTRON MLP8 BAYES MLP2
Accuracy: 71.5 71.5 73.0 74.1 75.5 76.7 77.4 78.9 79.3

Classifier: MSCM MSCP C4.5 KNN MLP8 MLP4 BAYES MLP2 PTRON
AUC: 70.1 70.2 79.2 82.0 82.1 82.3 83.8 84.7 87.8
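For reference, the least significant range of equation (14) that determines the subgroupings in Tables 2-7 can be computed as below (a sketch only; the studentized-range value rp must be looked up in published tables, and the numbers used here are placeholders, not values from the paper):

```python
# Sketch of eq. (14): R_p = r_p * sqrt(s^2 / r); illustrative only.
import math

def least_significant_range(r_p, error_mean_square, n_rows):
    """s^2 is the ANOVA error mean square, n_rows the number of train/test
    partitions (rows), and r_p the tabulated studentized-range value."""
    return r_p * math.sqrt(error_mean_square / n_rows)

# Placeholder usage: r_p for alpha = 0.05 and 72 degrees of freedom (hypothetical).
print(least_significant_range(r_p=2.83, error_mean_square=4.0, n_rows=10))
```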

Fig. 1. ROC curve for Bayes, KNN, and MLP on the nuclear texture data.

Fig. 2. ROC curve for C4.5, MSC, and Perceptron on the nuclear texture data.

9. DISCUSSION

In this section we discuss only the second hypothesis tested by the two-way analysis of variance (ANOVA), H2. This is the variance due to the different learning algorithms (column effects). The reason for this is that the train and test partitions are being used as what is called a blocking factor. We would hope for a significant effect due to the train and test partitions,⁵ not because this variance is of any scientific interest, but because it is necessary for the two-way ANOVA to be more efficient than the one-way ANOVA.

9.1. Overall accuracy

All of the data sets showed some difference in average accuracy for each of the learning algorithms. However, the ANOVA showed that on one of these data sets (Nuclear Texture) there was no significant evidence (p < 0.05) for the mean accuracies to be actually different.

⁵ So that we can reject H1.

Fig. 3. ROC curve for Bayes, KNN, and MLP on the heart bleeding data.

Fig. 4. ROC curve for C4.5, MSC, and Perceptron on the heart bleeding data.

Fig. 5. ROC curve for Bayes, KNN, and MLP on the breast cancer data.

Fig. 6. ROC curve for C4.5, MSC, and Perceptron on the breast cancer data.

On the other five data sets (where there was significant evidence to reject the null hypothesis, H2) Duncan's multiple range test was used to find the significant subgroups.

The Post-operative heart bleeding data set shows only two significant subgroups. Table 3 also shows that there is only a significant difference between the two decision tree methods (MSC and C4.5) and the MLP with two and four hidden units and Bayes.

Table 4 shows that for the Breast Cancer data set there are three significant subgroups: one subgroup containing only the Perceptron; one containing the two decision trees (MSC and C4.5); and the other learning algorithms in the third. There is also an overlap between the last two groups as the number of hidden units in the MLP is increased above 2. The fact that the Perceptron is in the lowest subgroup on its own would indicate that this problem is not linearly separable and so the Perceptron lacks the representational power to achieve a high overall accuracy. In addition, the lower performance observed using the decision tree methods may indicate that the optimal decision surface is smooth in nature.

The Pima Indians diabetes data set (Table 5) shows three significant subgroups under overall accuracy. The lowest accuracy group contains the decision trees (MSC and C4.5), though only Bayes and the Multi-layer Perceptrons (MLP) have a significantly (p < 0.05) higher overall accuracy. The poor performance of the decision trees may indicate that the smooth decision hyperplanes are perhaps better suited to this problem, especially with the limited training data available.

Fig. 7. ROC curve for Bayes, KNN, and MLP on the Pima diabetes data.

Fig. 8. ROC curve for C4.5, MSC, and Perceptron on the Pima diabetes data.

Fig. 9. ROC curve for Bayes, KNN, and MLP on the Cleveland heart disease data.

Fig. 10. ROC curve for C4.5, MSC, and Perceptron on the Cleveland heart disease data.

The relative success of the MLPs over the Bayesian method would indicate that the input features are not Normally distributed and so the covariance matrix is not being reliably calculated.

From Table 6, it can be seen that the Cleveland heart disease data set has four significant subgroups under overall accuracy. However, due to the large amount of subgroup overlap, there seems to be little discrimination due to the classification method. Perhaps of note, though, is the fact that on this problem the Bayes and KNN methods obtained the highest overall accuracies. This was surprising because the number of input features is 13, it being considered that when you have more than 10 input features the curse of dimensionality will start having a major effect.(8) Of all the learning algorithms used in this experiment, one would expect the Bayes and KNN to be the most severely affected by the curse of dimensionality. However, on this domain, this was obviously not the case.

Table 7 shows two significant subgroups for overall accuracy on the Hungarian heart disease data set. However, both of these subgroups are widely overlapping, the only significant differences being between the MSC and both the Bayes and the MLP (with two hidden units).

In general, when performance is measured in terms of overall accuracy, the hyper-plane (Bayes and MLP) and exemplar (KNN) based methods seemed to have a better performance when compared to the decision trees (MSC and C4.5).

Fig. 11. ROC curve for Bayes, KNN, and MLP on the Hungarian heart disease data.

Fig. 12. ROC curve for C4.5, MSC, and Perceptron on the Hungarian heart disease data.

This result confirms what, from previous discussion, might be expected for data sets of this type, where the optimal decision boundary is a smooth hyper-plane. For the decision tree methods to accurately estimate this type of decision boundary they would require a lot more training data to adequately populate decision nodes deep in the tree.

9.2. The ROC curve

The ROC curves for each learning algorithm on each data set are shown in Figs 1-12. These curves are the pooled ROC curves over the 10 train and test partitions. Curves for the MLPs with four and eight hidden units are not shown because of their similarity to the MLP with two hidden units (MLP2); also, for the same reason, only the curves for MSC with minimum error pruning are shown. It is perhaps worth making a couple of general comments as to the visual shape of the ROC curves presented in Figs 1-12.

• Decision trees (MSC and C4.5) do not appear to be producing ROC curves that conform to any Gaussian underlying distributions for the negative and positive classes, i.e. they do not form smooth exponential curves. This confirms our choice of trapezoidal integration over Maximum Likelihood estimation to calculate AUC. The dips and peaks seen in the ROC curves for the decision trees are probably due to the discrete way in which the decision trees are pruned, i.e. when the decision tree is pruned, a sub-tree is reduced to a single leaf of the class with the minimum error; this single leaf can then subsequently lead to a number of misclassifications and so the error rises in a discrete step.

• Though the ROC curves often appear to be producing a similar AUC, one curve may be preferable because it may have a lower P(Fp) or P(Fn) at a specific operating point. This reiterates the fact that for one particular application, the best way to select a classifier, and its operational point, is to use the Neyman-Pearson method.(1,2) Here, we select the required sensitivity and then maximise the specificity with this constraint (or vice versa).

The ROC curve is mainly of use when visualizing the performance of a classification algorithm as the decision threshold is varied. Any one point on the curve is a possible operational point for the classifier and so can be evaluated in the same manner as accuracy, P(C), as above. However, in order to evaluate the whole curve we need to extract some distinguishing feature from it. The feature we have chosen to measure and evaluate is the area under the ROC curve (AUC).

9.3. The area under the ROC curve

As was the case for overall accuracy, all of the data sets showed some difference in average AUC for each of the learning algorithms. However, for the AUC the analysis of variance showed that on all of the data sets there were significant (p < 0.01) differences in mean AUCs. In addition, on all but one data set (Breast Cancer) the computed f values were greater for the AUC ANOVA test than for the overall accuracy ANOVA. These larger f values led to a higher level of significance (p < 0.01 rather than p < 0.05) on two of the data sets (Post-operative bleeding and Hungarian heart disease). This indicates that the AUC is a more sensitive test than overall accuracy. The variance associated with the AUC, especially on the data sets with either high accuracy or ample test data, was less than that associated with P(C). Again, Duncan's multiple range test was carried out on all six data sets to determine the significant subgroups.

On the nuclear texture data set, three significant subgroups were obtained, as shown in Table 2. The decision trees (MSC and C4.5) are in a lower performance subgroup of their own, with C4.5 in a second subgroup with KNN and Bayes; the third, highest performance group now includes the Perceptron and Multi-layer Perceptrons but excludes the decision trees (C4.5 and MSC). The poor performance obtained using the decision tree methods can be attributed to the fact that there are limited data with which to construct and prune the trees and that smooth decision hyper-planes are probably more suitable than hyper-rectangles in this problem domain. Of note is the fact that the Perceptron and MSC obtained the same accuracy, P(C), but the Perceptron now has a significantly higher (p < 0.05) AUC. With that exception there is an extremely good correlation between the rankings given from P(C) and those given from AUC. However, AUC produced significant differences between the mean performances, whereas P(C) did not.

There are two significant subgroups for the post-operative bleeding data set, as shown in Table 3. The lowest performance subgroup contains C4.5 only, the other subgroup containing all of the other methods. The low performance of C4.5 when measured using AUC can also be seen visually in the ROC curves of Figs 3 and 4. In this data set there are patients who have bled excessively due to a surgically related complication (a technical error). Some of the training data have therefore effectively been misclassified because the excessive bleeding was not related to any of the features measured, but was a consequence of the technical error. These cases should randomly affect the data and therefore become isolated examples in feature space. We would hope that they would have little effect on the classifier during training, but this will be dependent on the classification algorithm used. The effect of these points on the MLP, Perceptron, and Bayes methods is to bias the position of the decision boundary(s); however, if, as is thought for this case, the number of misclassified points is not too large, this effect should be minimal. KNN will be affected dependent upon the amount of smoothing built into the classification, i.e. upon the choice of K. For the decision tree methods (C4.5 and MSC) these points will cause the formation of erroneous decision nodes in the tree. However, it should then be possible to prune these examples from the tree to eliminate their effect, as they will be nodes that have seen very few training points, i.e. they have a low confidence level. However, because of the lack of data in this domain it is very difficult to determine with certainty which data points are due to a technical error and therefore should be pruned and which data points are due to the underlying problem. This can be seen in Fig. 4, particularly in the case of the decision tree C4.5, where the pruning has reduced the structure of the tree too much and hence reduced the sensitivity. This means that C4.5 is very rarely predicting any cases as being positive; this "over caution" leads to what appears to be an acceptable accuracy, but a very low AUC. This means that the decision tree is actually doing very little work. In previous experiments(32) we found that the MSC obtained a higher accuracy (76%) when no pruning was done on the tree. This is an example of a problem domain where the algorithm has been biased by the decision tree pruning.(33)

There are three significant subgroups shown for the Breast Cancer data set in Table 4. There is a large amount of overlap in these subgroups and so no real identifiable groups seem to exist. However, there is an indication of a general increase in performance from the decision trees through the Perceptron on to the MLPs and then up to the KNN and Bayes methods. Again, with the exception of the Perceptron, which again obtained a higher ranking of performance under AUC than it did under P(C), there is good agreement in the ranking between the two performance measures.

Table 5 shows that for the Pima Indians Diabetes data set there are four significant subgroups (as compared to three for overall accuracy). This again would indicate the increased sensitivity of AUC over P(C) as a measure of classifier performance. In fact, it may well be worth going to a higher level of significance (say p=0.01) to reduce the number of subgroups and reveal a more general underlying trend. In addition, it can be seen from the ROC curve for the Bayes classifier (Fig. 7) that there are only really three points from which to estimate the AUC. This means that the AUC calculated for the Bayes classifier on this data set will be pessimistically biased. To avoid this effect it may be possible to implement a systematic way of varying the decision threshold when producing the ROC curves, rather than using linear steps.(34)

The Cleveland heart disease data set has three significant subgroups of performance under AUC (see Table 6). The MSC is in a subgroup of its own, the other two groups being fairly overlapping, and so no meaningful subgroups can be identified. Again, the Perceptron obtained a higher ranking under AUC than it did under overall accuracy. With this exception, there is a good level of agreement in the ranking of the performance of the classification algorithms under accuracy and AUC.

Where accuracy found two broad significant subgroups, Table 7 shows that AUC has produced three subgroups on the Hungarian Heart Disease data set. The MSC is in the lowest performance subgroup (on its own) while the remaining two subgroups are broadly overlapping, with only a significant difference between the AUC for C4.5 (lowest) and the Perceptron (highest). As was the case for the Cleveland heart disease data set, the Perceptron performed better under AUC than it did under overall accuracy, but otherwise accuracy and AUC produced similar rankings of performance.

9.3.1. The meaning of AUC. It may seem that extracting the area under the ROC curve is an arbitrary feature to extract. However, it has been known for some time that this area actually represents the probability that a randomly chosen positive example is correctly rated (ranked) with greater suspicion than a randomly chosen negative example.(6) Moreover, this probability of correct ranking is the same quantity estimated by the Wilcoxon statistic.(6,35)

Fig. 13. Scatter plot of the standard error of the Wilcoxon statistic versus the standard deviation of the AUC. There are nine learning algorithms, each data set being shown with a different tick mark.

Fig. 14. Linear relationship between the standard error of the Wilcoxon statistic and the standard deviation of the AUC.

The Wilcoxon statistic, W, is usually used to test the hypothesis that the distribution of some variable, x, from one population (p) is equal to that of a second population (n), H0: xp = xn.(3) If this (null) hypothesis is rejected then we can calculate the probability, p, that xp > xn, xp < xn, or xp ≠ xn. In our case, where we want good discrimination between the populations p and n, we require P(xp > xn) to be as close to unity as possible. The Wilcoxon test makes no assumptions about the distributions of the underlying populations and can work on continuous, quantitative, or qualitative variables.

As already discussed, AUC effectively measures P(xp > xn). In the same situation, given one normal example and one positive example,⁶ a classifier with decision threshold t will get both examples correct with a probability

    P(C) = P(xp > t)·P(xn < t).    (15)

P(C) is dependent on the location of the decision threshold t and is therefore not a general measure of classifier performance.

9.3.2. The standard error of AUC. The AUC, θ̂, is an excellent way to measure P(xp > xn) for binary classifiers, and the direct relationship between W and θ̂ can be used to estimate the standard error of the AUC, using SE(W) in equation (10).

Figures 13 and 14 show how the standard error of the Wilcoxon statistic, SE(W), is related to the standard deviation of the averaged AUC, SD(θ̂), calculated using 10-fold cross-validation. The correlation coefficient between SE(W) and SD(θ̂) is 0.9608, which indicates that there is a very strong linear relationship between SE(W) and SD(θ̂). Over all six data sets, SE(W) has a mean value of 0.0770 and a standard deviation of 0.0482, whilst SD(θ̂) has mean 0.0771 and standard deviation 0.0614. This again would indicate that although SD(θ̂) has a higher variance it is a very good estimator of SE(W). The straight line fitted (in a least squares sense) to SE(W) and SD(θ̂) in Fig. 14 again reiterates the quality of SD(θ̂) as an estimate of SE(W).

The larger variance observed for SD(θ̂) can be explained when you consider the fact that SD(θ̂) has two sources of variance. The first source of variance, which is also the variance estimated by SE(W), is due to the variation of the test data. That is, in each of the 10 iterations of cross-validation there is a different 10% of the data in each test partition. These different sets of test data therefore produce different ROC curves, and AUC varies accordingly. The second source of variance is due to variation of the training data in each cross-validation partition. The variation in the training data used in each cross-validation partition also affects the ROC curves produced and this causes AUC to vary. However, because only one-ninth of the training data vary with each subsequent training partition, this second source of variance is small and therefore, as was shown, SD(θ̂) is a good estimator of SE(W).

⁶ Often referred to as a two-alternative forced choice (2AFC) experiment.
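The rank-based reading of AUC described in Section 9.3.1 and the Wilcoxon-based standard error of equations (10)-(12) can be sketched together as follows (illustrative only; the toy scores and function names are assumptions, not code from the paper):

```python
# Sketch: AUC as P(x_p > x_n) via pairwise ranking, and SE(W) per eqs (10)-(12).
import math

def auc_by_ranking(pos_scores, neg_scores):
    """Estimate AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half (the Wilcoxon/Mann-Whitney estimate)."""
    wins = 0.0
    for xp in pos_scores:
        for xn in neg_scores:
            if xp > xn:
                wins += 1.0
            elif xp == xn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def se_wilcoxon(theta, c_p, c_n):
    """Standard error of the AUC estimate theta with c_p positive and
    c_n negative test examples, per eqs (10)-(12)."""
    q1 = theta / (2.0 - theta)                       # eq (11)
    q2 = 2.0 * theta ** 2 / (1.0 + theta)            # eq (12)
    num = (theta * (1.0 - theta)
           + (c_p - 1.0) * (q1 - theta ** 2)
           + (c_n - 1.0) * (q2 - theta ** 2))
    return math.sqrt(num / (c_p * c_n))              # eq (10)

# Toy usage with hypothetical classifier scores.
pos = [0.9, 0.8, 0.7, 0.55]
neg = [0.6, 0.4, 0.35, 0.2, 0.1]
theta = auc_by_ranking(pos, neg)
print(theta, se_wilcoxon(theta, len(pos), len(neg)))
```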

0.12
• It gives an indication of how well separated the
-- AUC = 095 negative and positive classes are for the decision index,
0.1
- - A U C = 0.85
e(xp > Xn);
• A U C = 0.75
• It is invariant to prior class probabilities•
~ 0 . 0 8
• I t g i v e s a n i n d i c a t i o n o f t h e a m o u n t of "work done" by
a classification scheme, giving low scores to the ran-
dom or "one class only" classifiers•
t~ 0.06
x~
However, there was good agreement between accuracy
and AUC as to the ranking of the performance of the
0.04
classification algorithms. It was also found that the
::_::: ...................... standard deviation of the averaged AUCs from the 10-
0.02 fold cross-validation can be used as an estimate of the
standard error of the AUC calculated using the approx-
i i i imation to the Wilcoxon statistic.
°0 2'0 4'0 0'0 ,0 1~0 ,~0 1~0 180 180 200
Number of Abnormal Examples, N a ( = Nn) The results quoted for the all the algorithms are only
valid for the particular architecture or parameter settings
Fig. 15. Variation of the standard error of the Wilcoxon
statistic with AUC and the number of test examples, assuming tested, there may be other architectures that offer better
c.=cp. performance. However, care should be taken when
choosing parameters so as not to optimistically bias
the results• Using a training, evaluation, and test set
methodology should prevent this. Finally, for one parti-
The two trends to notice are:

1. As the number of test samples increases, the standard error decreases, SE(W) being inversely proportional to √N, where N is the number of test samples.
2. SE(W) is inversely proportional to AUC. There is a high variance associated with small values of AUC (< 0.8), and the variance becomes very small as the AUC gets close to 1. This effect can also be seen in Fig. 13; the "x" points represent the standard error and deviation estimated for the heart bleeding domain. On this domain the AUC was quite low (≈0.66) and so the variation is noticeably higher.

There are also methods to reduce the standard error estimate for classifiers tested on the same data,(7) with its own significance test (to compare two AUCs). There are other measures of performance, such as output signal-to-noise ratio or the deflection criterion,(36) but the AUC seems to be the only one that is independent of the decision threshold and not biased by prior probabilities.
10. CONCLUSIONS

In general there was not a great difference in the accuracies obtained from each of the learning algorithms over all the data sets. Generally, the hyperplane (Bayes, MLP) and exemplar (KNN) based methods performed better than the decision trees (C4.5, MSC) in terms of overall accuracy and AUC. However, this is due, in part, to the type of problems we have analysed, being primarily continuous inputs with overlapping classes; the models used by these methods are particularly well suited to this type of problem.

The area under the ROC curve (AUC) has been shown to exhibit a number of desirable properties as a classification performance measure when compared to overall accuracy:

• Increased sensitivity in the Analysis of Variance (ANOVA) tests;
• It is not dependent on the decision threshold chosen;
• It gives an indication of how well separated the negative and positive classes are for the decision index, P(xp > xn);
• It is invariant to prior class probabilities;
• It gives an indication of the amount of "work done" by a classification scheme, giving low scores to the random or "one class only" classifiers.

However, there was good agreement between accuracy and AUC as to the ranking of the performance of the classification algorithms. It was also found that the standard deviation of the averaged AUCs from the 10-fold cross-validation can be used as an estimate of the standard error of the AUC calculated using the approximation to the Wilcoxon statistic.
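A minimal sketch of that cross-validation estimate is given below, using a synthetic two-class data set and a simple nearest-class-mean score as the decision index (both are illustrative assumptions, not the data sets or classifiers evaluated in the paper): the AUC is computed on each of the ten validation folds, and the standard deviation of the per-fold values is reported alongside their mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class problem: two overlapping Gaussian clouds (illustrative only).
n = 400
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)),
               rng.normal(1.0, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def auc(scores, labels):
    """AUC as the normalised Wilcoxon statistic over all (positive, negative) pairs."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    pairs = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return pairs.mean()

# 10-fold cross-validation with a nearest-class-mean decision index.
order = rng.permutation(n)
fold_aucs = []
for fold in np.array_split(order, 10):
    train = np.setdiff1d(order, fold)
    mu0 = X[train][y[train] == 0].mean(axis=0)
    mu1 = X[train][y[train] == 1].mean(axis=0)
    # Higher score = closer to the positive-class mean than to the negative-class mean.
    scores = np.linalg.norm(X[fold] - mu0, axis=1) - np.linalg.norm(X[fold] - mu1, axis=1)
    fold_aucs.append(auc(scores, y[fold]))

print("mean AUC over folds:", np.mean(fold_aucs))
print("SD of per-fold AUCs (standard error estimate):", np.std(fold_aucs, ddof=1))
```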
The results quoted for all the algorithms are only valid for the particular architecture or parameter settings tested; there may be other architectures that offer better performance. However, care should be taken when choosing parameters so as not to optimistically bias the results. Using a training, evaluation, and test set methodology should prevent this. Finally, for one particular application, the best way to select a classifier and its operational point is to use the Neyman-Pearson method of selecting the required sensitivity and then maximising the specificity with this constraint (or vice versa). The AUC, however, appears to be one of the best ways to evaluate a classifier's performance on a data set when a "single number" evaluation is required or an operational point has not yet been determined.

Acknowledgements--The Author is grateful to Geoffrey Hawson and Michael Ray of the Prince Charles Hospital in Brisbane for allowing access to the post-operative heart bleeding data set used in this study. The work of Michael Ray and Geoffrey Hawson is kindly supported by the Prince Charles Hospital Private Practice Study, Education, and Research Trust Fund. Thanks are also due to Gary Glonek, Brian Lovell, Dennis Longstaff, and the anonymous referees for helpful comments on earlier drafts of this paper.

REFERENCES

1. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edn. Academic Press, San Diego, California (1990).
2. C. W. Therrien, Decision Estimation and Classification: An Introduction to Pattern Recognition and Related Topics. Wiley, New York (1989).
3. R. E. Walpole and R. H. Myers, Probability and Statistics for Engineers and Scientists. Macmillan, New York (1990).
4. D. D. Dorfman and E. Alf, Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals--rating method data, J. Math. Psychology 6, 487-496 (1969).
5. J. A. Swets, ROC analysis applied to the evaluation of medical imaging techniques, Invest. Radiol. 14, 109-121 (1979).
6. J. A. Hanley and B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143, 29-36 (1982).
7. J. A. Hanley and B. J. McNeil, A method of comparing the areas under receiver operating characteristic curves derived from the same cases, Radiology 148, 839-843 (1983).

8. J. H. Friedman, An overview of predictive learning and function approximation, From Statistics to Neural Networks: Theory and Pattern Recognition Applications, V. Cherkassky, J. H. Friedman and H. Wechsler, eds, NATO ASI Series: Computer and Systems Sciences, Vol. 136, pp. 1-61. Springer, Berlin (1993).
9. R. Walker, P. Jackway, B. Lovell and D. Longstaff, Classification of cervical cell nuclei using morphological segmentation and textural feature extraction, Proc. 2nd Australian and New Zealand Conf. on Intelligent Information Systems, pp. 297-301 (1994).
10. M. J. Ray, G. A. T. Hawson, S. J. E. Just, G. McLachlan and M. O'Brien, Relationship of platelet aggregation to bleeding after cardiopulmonary bypass, Ann. Thoracic Surgery 57, 981-986 (1994).
11. W. H. Wolberg and O. L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proc. Nat. Acad. Sci. U.S.A. 87, 9193-9196 (1990).
12. J. W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler and R. S. Johannes, Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, Proc. Symp. on Computer Applications and Medical Care, pp. 261-265. IEEE Computer Society Press, Silver Spring, Maryland (1988).
13. R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee and V. Froelicher, International application of a new probability algorithm for the diagnosis of coronary artery disease, Am. J. Cardiol. 64, 304-310 (1989).
14. J. H. Gennari, P. Langley and D. Fisher, Models of incremental concept formation, Artif. Intell. 40, 11-61 (1989).
15. B. C. Lovell and A. P. Bradley, The multiscale classifier, IEEE Trans. Pattern Analysis Mach. Intell. 18, 124-137 (1996).
16. P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice-Hall, London (1982).
17. T. W. Rauber, M. M. Barata and A. S. Steiger-Garcao, A Toolbox for the Analysis and Visualisation of Sensor Data in Supervision, Intelligent Robots Group Technical Report, Universidade Nova de Lisboa, Portugal (1993).
18. J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Addison-Wesley, Reading, Massachusetts (1981).
19. D. J. Hand, Discrimination and Classification. Wiley, New York (1981).
20. J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993).
21. M. L. Minsky and S. A. Papert, Perceptrons. MIT Press, Cambridge, Massachusetts (1969).
22. D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, Massachusetts (1986).
23. F. Rosenblatt, Principles of Neurodynamics. Spartan Press, Washington D.C. (1961).
24. J. M. Twomey and A. E. Smith, Power curves for pattern classification networks, Proc. IEEE Int. Conf. on Neural Networks, San Francisco, California, pp. 950-955 (1993).
25. D. Nguyen and B. Widrow, Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights, Proc. Int. Joint Conf. on Neural Networks 3, 21-26 (1990).
26. S. M. Weiss and C. A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Networks, Machine Learning and Expert Systems. Morgan Kaufmann, San Mateo (1991).
27. L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees. Wadsworth, Belmont (1984).
28. J. H. Friedman, Introduction to computational learning and statistical prediction, Tutorial, Twelfth Int. Conf. on Machine Learning, Lake Tahoe, California (1995).
29. B. Efron, Bootstrap methods: Another look at the jackknife, Ann. Statist. 7, 1-26 (1979).
30. J. A. Swets and R. M. Pickett, Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press, New York (1982).
31. C. R. Hicks, Fundamental Concepts in the Design of Experiments. Saunders College Publishing, London (1993).
32. A. P. Bradley, B. C. Lovell, M. Ray and G. Hawson, On the methodology for comparing learning algorithms: A case study, Proc. Second Australian and New Zealand Conf. on Intelligent Information Systems, pp. 37-41. IEEE Publications, Brisbane, Australia (1994).
33. C. Schaffer, Overfitting avoidance as bias, Machine Learning 10, 153-178 (1993).
34. R. F. Raubertas, L. E. Rodewald, S. G. Humiston and P. G. Szilagyi, ROC curves for classification trees, Med. Decision Making 14, 169-174 (1994).
35. D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics. Wiley, New York (1966, 1974).
36. B. Picinbono, On deflection as a performance criterion in detection, IEEE Trans. Aerospace Electronic Systems 31, 1072-1081 (1995).

About the Author--ANDREW BRADLEY received his B.Eng. Honours degree in Electrical and Electronic Engineering in 1989 from the University of Plymouth, U.K. After working in industry for a number of years he has recently completed a Ph.D. at the Department of Electrical and Computer Engineering at The University of Queensland, Australia. His research interests include machine learning and pattern recognition for medical diagnostics, image analysis and scale-space methods.
