Cite this paper as:
Stąpor K. (2018) Evaluating and Comparing Classifiers: Review,
Some Recommendations and Limitations. In: Kurzynski M., Wozniak
M., Burduk R. (eds) Proceedings of the 10th International Conference
on Computer Recognition Systems CORES 2017. CORES 2017.
Advances in Intelligent Systems and Computing, vol 578. Springer,
Cham
PREPRINT
Evaluating and comparing classifiers:
review, some recommendations and limitations
Katarzyna Stąpor
Institute of Computer Science, Silesian University of Technology, Gliwice
Abstract
Evaluating the predictive performance of a supervised classification learning method on independent data is a central task in machine learning. It is also almost unthinkable to carry out any research work without comparing a newly proposed classifier with other, already existing ones. This paper reviews the most important aspects of the classifier evaluation process, including the choice of evaluation metrics (scores) as well as the statistical comparison of classifiers. A critical view, recommendations and limitations of the reviewed methods are presented. The article provides a quick guide to the complexity of the classifier evaluation process and tries to warn the reader about common bad habits.
Key words: supervised classification, classifier evaluation, performance metrics, statistical classifier comparison
1. INTRODUCTION
In a supervised classification problem one aims to learn a classifier from a
dataset U = {(x(1), t(1)), ..., (x(n), t(n))} of n labeled data instances, where each
instance x(i) is characterized by d predictive variables/features, X = (X1, ..., Xd),
and a class T to which it belongs. This dataset is obtained from a physical process
described by an unknown probability distribution f(X, T). The learned classifier,
after its quality has been evaluated (usually on a test dataset), can be used to
classify new samples, i.e. to obtain their unknown class labels. We do not
distinguish here between a classifier (a function that maps an input feature space
to a set of class labels) and a classification learning algorithm, which is a general
methodology that can be used, given a specific dataset, to learn a specific
classifier. Theoretical background on the supervised classification problem, as
well as a complete description of the classifier construction process, can be found
in many books on machine learning and pattern recognition (see for example 2, 8,
31, 33, 34, 44, 47, 49).
Usually, the problem of evaluating a new classifier is tackled by using a
score that tries to summarize the specific conditions of interest. Classification
error and accuracy are the most widely used scores in classification problems. In
practice, the classification error must be estimated from the available samples;
k-fold cross-validation, for example, is one of the most frequently used estimation
methods. The question is then whether the new, proposed classifier (or an
enhancement of an existing one) yields an improved score over the competitor
classifier (or classifiers) or the state of the art. It is almost impossible now to do
any research work without an experimental section where the score of a new
classifier is tested and compared with the scores of existing ones. This last step
also requires the selection of datasets on which the compared classifiers are
learned and evaluated. The purpose of the dataset selection step should not be to
demonstrate one classifier's superiority over another in all cases, but rather to
identify its areas of strength with respect to domain characteristics.
This paper focuses only on the supervised classification problem as
defined above. Other types of classification, such as classification from data
streams or multi-label classification, are not addressed here, since they may
impose specific conditions on the calculation of the score (for an important
reference on evaluating data stream learning algorithms, see for example (15)).
The whole evaluation process of a classifier should include the following
steps:
1) choosing an evaluation metric (i.e. a score) according to the
properties of the classifier,
2) deciding on the score estimation method to be used,
3) checking whether the assumptions made in 1) and 2) are fulfilled,
4) running the evaluation method and interpreting the results with
respect to the domain,
5) comparing the new classifier with existing ones selected according
to different criteria, for example problem-dependent ones; this step
requires the selection of datasets.
The main purpose of this paper is to provide the reader with a better
understanding of the overall classifier evaluation process. As there is no fixed,
concrete recipe for the classifier evaluation procedure, we believe that this paper
will help researchers in the machine learning area decide which alternative to
choose in each specific case.
The paper is set up as follows. Section 2 describes measures of classifier
quality, while Section 3 gives a short overview of their estimation methods.
Section 4 focuses on statistical methods for classifier quality comparison. Finally,
in Section 5 we conclude and give some recommendations.
2. MEASURES OF CLASSIFIER QUALITY
Usually the problem of evaluating a new classifier (i.e. measuring its quality)
is tackled by using a score that tries to summarize the specific conditions of
interest when evaluating the classifier. There may be many scores, according to
how we aim to quantify the classifier's behavior. In this section, we present only
some of the most widely used ones.
Typical scores for measuring the performance of a classifier are accuracy and
classification error, which for a two-class problem can be easily derived from a
2x2 confusion matrix such as that given in Table 1. These scores are computed as:

Acc = (TP + TN) / (TP + FN + TN + FP)
Err = (FP + FN) / (TP + FN + TN + FP)
Sometimes, accuracy and classification error are selected without considering
in depth whether they are the most appropriate scores for the classification
problem at hand. When both class labels are relevant and the proportion of data
samples in each class is very similar, these scores are a good choice.
Unfortunately, equal class proportions are quite rare in real problems. This
situation is known as the imbalance problem (29). Empirical evidence shows that
accuracy and error rate are biased with respect to data imbalance: the use of these
scores might produce misleading conclusions since they do not take into account
misclassification costs, are strongly biased in favor of the majority class, and are
sensitive to class skew.
In some application domains, we may be interested in how our classifier classifies
only a part of the data. Examples of such measures are: True positive rate (Recall or
Sensitivity), TPrate = TP/(TP+FN); True negative rate (Specificity), TNrate =
TN/(TN+FP); False positive rate, FPrate = FP/(TN+FP); and Precision (positive
predictive value), Precision = TP/(TP+FP).
Table 1. Confusion matrix for a two-class problem
Predicted positive Predicted negative
Positive class True Positive (TP) False Negative (FN)
Negative class False Positive (FP) True Negative (TN)
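
The scores above follow directly from the four cells of Table 1. A minimal Python sketch, using purely hypothetical counts for an imbalanced test set, could look as follows:

def basic_scores(tp, fn, fp, tn):
    # Two-class scores derived from the confusion matrix of Table 1.
    total = tp + fn + fp + tn
    return {
        "Acc": (tp + tn) / total,
        "Err": (fp + fn) / total,
        "TPrate (Recall, Sensitivity)": tp / (tp + fn),
        "TNrate (Specificity)": tn / (tn + fp),
        "FPrate": fp / (tn + fp),
        "Precision": tp / (tp + fp),
    }

# Hypothetical counts: 10 positive and 90 negative test samples.
print(basic_scores(tp=5, fn=5, fp=10, tn=80))
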
Shortcomings of the accuracy and error rate have motivated the search for new
measures which aim at a trade-off between the evaluation of the classification
ability on positive and on negative data samples. Some straightforward examples
of such alternative scores are the harmonic mean of Recall and Precision:

F-measure = 2 × TPrate × Precision / (TPrate + Precision)

and the geometric mean of the accuracies measured separately on each class:

G-mean = √(TPrate × TNrate)

Harmonic and geometric means are symmetric functions that give the same
relevance to both components. There are other proposals that try to enhance one of
the two components of the mean, for instance the adjusted geometric mean (1), the
optimized precision OP from (37), computed as

OP = Acc − |TNrate − TPrate| / (TNrate + TPrate)

and the F-score (30):

F-score = (β² + 1) × Precision × TPrate / (β² × Precision + TPrate)

The parameter β can be tuned to obtain different trade-offs between the two
components.
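
A short Python sketch of these composite scores (assuming the rates and the accuracy have already been computed as above; the values in the example call are hypothetical):

from math import sqrt

def f_score(precision, tp_rate, beta=1.0):
    # F-score; beta = 1 gives the F-measure (harmonic mean of Recall and Precision).
    return (beta**2 + 1) * precision * tp_rate / (beta**2 * precision + tp_rate)

def g_mean(tp_rate, tn_rate):
    # Geometric mean of the accuracies on the positive and negative class.
    return sqrt(tp_rate * tn_rate)

def optimized_precision(acc, tp_rate, tn_rate):
    # OP from (37): accuracy penalized by the disagreement between the class rates.
    return acc - abs(tn_rate - tp_rate) / (tn_rate + tp_rate)

print(f_score(precision=0.33, tp_rate=0.50),
      g_mean(tp_rate=0.50, tn_rate=0.89),
      optimized_precision(acc=0.85, tp_rate=0.50, tn_rate=0.89))
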
When a classifier assigns an instance to a wrong class, a loss is incurred.
Cost-sensitive learning (10) aims to minimize the loss incurred by the classifier.
The scores introduced above use the 0/1 loss function, i.e. they treat all types of
misclassification as equally severe. A cost matrix can be used if the severity of
misclassifications can be quantified in terms of costs. Unfortunately, in real
applications, specific costs are difficult to obtain. In such situations, however, the
scores described above may still be useful, since they can also be used to put more
weight on the costliest misclassifications: minimizing the cost may be equivalent
to finding an optimal trade-off between Recall and Specificity (7).
When the classification costs cannot be accessed, another widely used
technique for the evaluation of classifiers is the ROC curve (4, 11), a graphical
representation of Recall versus FPrate (1 − Specificity). The information about
classification performance contained in the ROC curve can be summarized into a
score known as AUC (Area Under the ROC Curve), which is less sensitive to
skewness in the class distribution since it reflects a trade-off between Recall and
Specificity (43). However, recent studies have shown that AUC is a fundamentally
incoherent measure, since it treats the costs of misclassification differently for
each classifier. This is undesirable because the cost must be a property of the
problem, not of the classification method. In (21, 22), the H measure is proposed
as an alternative to AUC.
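
In practice the ROC curve and the AUC are rarely computed by hand. A minimal sketch, assuming scikit-learn is available and using toy labels and scores, could be:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data: true class labels and the classifier's scores (e.g. posterior probabilities).
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points (FPrate, Recall) of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under that curve
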
While all of the scores described above are appropriate for two-class
imbalanced learning problems, some of them can be modified to accommodate
multi-class imbalanced learning problems (23). For example, (46) extends the
G-mean definition to the geometric mean of the Recall values of every class.
Similarly, (12) defines a mean F-measure for the multi-class imbalance problem.
The major advantage of this measure is that it is insensitive to class distribution
and error costs. However, it is still an open question whether such extended scores
are appropriate in scenarios where there are multiple minority and multiple
majority classes (40). In (20), the M measure is proposed, a generalization that
aggregates all pairs of classes based on the inherent characteristics of the AUC.
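
As an illustration of the extension in (46), the sketch below computes the geometric mean of the per-class Recall values from a K×K confusion matrix; the matrix shown is hypothetical, with one minority class.

import numpy as np

def multiclass_g_mean(C):
    # C[i, j] counts class-i samples predicted as class j; the recall of class i
    # is the diagonal entry divided by the row sum.
    C = np.asarray(C, dtype=float)
    recalls = np.diag(C) / C.sum(axis=1)
    return float(recalls.prod() ** (1.0 / len(recalls)))

C = [[80, 15, 5],
     [10, 85, 5],
     [ 2,  3, 5]]
print(multiclass_g_mean(C))
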
In this paper, we focus on scores since they are a popular way to measure
classification quality. These measures do not capture all the information about the
quality of classification methods that some graphical methods may convey;
however, the use of quantitative measures of quality makes the comparison among
classifiers easier (for more information on graphical methods see for example 30,
9, 36). The presented list of scores is by no means exhaustive. The described
scores focus only on evaluating the predictive performance of a classifier; other
important aspects of classification, such as robustness to noise, scalability or
stability under data shifts, are not addressed here.
3. QUALITY ESTIMATION METHODS
Various methods are commonly used to estimate the classification error and the
other classifier scores described above (a review of estimation methods can also
be found in the machine learning literature mentioned earlier). The holdout
method of estimating the classification error randomly divides the available
dataset into independent training and testing subsets, which are then used for
learning and evaluating a classifier. This method gives a pessimistically biased
error estimate (calculated as the ratio of misclassified test samples to the size of
the test subset); moreover, it depends on the particular partitioning of the dataset.
These limitations are overcome with a family of resampling methods: cross-
validation (random subsampling, k-fold cross-validation, leave-one-out) and the
bootstrap. Random subsampling performs k random splits of the entire dataset
into training and testing subsets. For each data split, we retrain the classifier and
then estimate the error on the test samples. The error estimate is the average of the
separate errors obtained from the k splits. The k-fold cross-validation creates a
k-fold partition of the entire dataset once; then, for each of k experiments, it uses
(k−1) folds for training and the remaining fold for testing. The classification error
is estimated as the average of the separate errors obtained from the k experiments.
This estimate is approximately unbiased, although at the expense of an increase in
its variance. Leave-one-out is the degenerate case of k-fold cross-validation, where
k is chosen as the total number of samples. This results in an unbiased error
estimate, but one with large variance. In bootstrap estimation, we randomly select
samples with replacement and use this set for training; the remaining samples that
were not selected for training are used for testing. We repeat this procedure k
times. The error is estimated as the average error on the test samples over the k
repetitions. The benefit of this method is its ability to obtain accurate measures of
both the bias and variance of the classification error estimate.
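
For illustration, a minimal sketch of k-fold cross-validation error estimation, assuming scikit-learn is available; the classifier and dataset are arbitrary placeholders:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

errors = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf.fit(X[train_idx], y[train_idx])                                # retrain on k-1 folds
    errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))    # error on the held-out fold

print(np.mean(errors))   # cross-validated estimate of the classification error
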
4. STATISTICAL COMPARISON OF CLASSIFIERS
The comparison of the scores obtained by two or more classifiers on a set of
problems is a central task in machine learning; it is almost impossible to do any
research work without an experimental section where the score of a new classifier
is tested and compared with the scores of existing ones. When the differences are
very clear (e.g., when one classifier is the best on all the problems considered), a
direct comparison of the scores may be enough. But in most situations, a direct
comparison may be misleading and not enough to draw sound conclusions. In
such situations, a statistical assessment of the scores, such as hypothesis testing, is
required. Statistical tests aim to answer this question by providing a more precise
assessment of the obtained scores, analyzing them to decide whether the observed
differences between the classifiers are real or due to chance. However, although
statistical tests have become established as a basic part of the classifier
comparison task, they are not a definitive tool, and we have to be aware of their
limitations and misuses.
The statistical tests for comparing classifiers are usually bound to a specific
method for estimating the classifier score. Therefore, the selection of a statistical
test is also conditioned by this estimation method.
For the comparison of two classifiers on one dataset, a situation which is
very common in machine learning problems, the corrected resampled t test has
been suggested in the literature (35). This test is associated with a repeated
estimation method (for example holdout): in the i-th of m iterations, a random data
partition is performed and the values of the scores A_k1(i) and A_k2(i) of the
compared classifiers k1 and k2 are obtained. The statistic is:

t = Ā / √( (1/m + N_test/N_train) · Σ_{i=1..m} (A(i) − Ā)² / (m − 1) )

where Ā = (1/m) Σ_{i=1..m} A(i), A(i) = A_k1(i) − A_k2(i), and N_test, N_train are the
numbers of samples in the test and train partitions. A second parametric test that
can be used in this scenario, whose behavior, however, has not been studied as
thoroughly as that of the previous one, is the corrected t test for repeated
cross-validation (3). These tests assume that the data follow a normal distribution,
which should first be checked using a suitable normality test. A non-parametric
alternative for comparing two classifiers suggested in the literature is McNemar's
test (26).
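
A minimal Python sketch of the corrected resampled t test, assuming the scores of the two classifiers have been collected from m repeated holdout partitions (the accuracy values below are hypothetical):

import numpy as np
from scipy import stats

def corrected_resampled_t(scores_k1, scores_k2, n_train, n_test):
    d = np.asarray(scores_k1) - np.asarray(scores_k2)     # per-repetition differences A(i)
    m = len(d)
    var = np.sum((d - d.mean()) ** 2) / (m - 1)           # variance term of the statistic
    t = d.mean() / np.sqrt((1.0 / m + n_test / n_train) * var)
    p = 2 * stats.t.sf(abs(t), df=m - 1)                  # two-sided p-value, m-1 degrees of freedom
    return t, p

acc_k1 = [0.91, 0.89, 0.92, 0.90, 0.93, 0.88, 0.91, 0.92, 0.90, 0.89]
acc_k2 = [0.88, 0.87, 0.90, 0.89, 0.90, 0.86, 0.89, 0.90, 0.88, 0.87]
print(corrected_resampled_t(acc_k1, acc_k2, n_train=200, n_test=100))
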
For the comparison of two classifiers on multiple datasets the Wilcoxon
signed-ranks test (26) is widely recommended. It ranks the differences
d_i = A_k1(i) − A_k2(i) between the scores of two classifiers k1 and k2 obtained on
the i-th of N datasets, ignoring the signs. The test statistic is:

T = min(R+, R−)

where

R+ = Σ_{d_i > 0} rank(d_i) + (1/2) Σ_{d_i = 0} rank(d_i)
R− = Σ_{d_i < 0} rank(d_i) + (1/2) Σ_{d_i = 0} rank(d_i)

are the sums of ranks for the datasets on which classifier k1 outperforms k2 and on
which k2 outperforms k1, respectively (assuming a larger score is better). Ranks
for d_i = 0 are split evenly between the two sums. Another test that can be used is
the sign test, but it is much weaker than the Wilcoxon signed-ranks test.
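
The Wilcoxon signed-ranks test is readily available in standard libraries; a minimal sketch with hypothetical per-dataset scores of the two classifiers, assuming SciPy is installed:

from scipy.stats import wilcoxon

scores_k1 = [0.85, 0.80, 0.91, 0.78, 0.88, 0.93, 0.75, 0.84, 0.90, 0.79]
scores_k2 = [0.83, 0.79, 0.90, 0.80, 0.84, 0.92, 0.73, 0.82, 0.88, 0.76]

stat, p_value = wilcoxon(scores_k1, scores_k2)   # stat corresponds to T = min(R+, R-)
print(stat, p_value)
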
Comparisons among multiple classifiers on multiple datasets arise in machine
learning when a newly proposed classifier is compared with the state of the art.
For this situation, the generally recommended methodology is as follows.
First, we apply an omnibus test to detect whether at least one of the classifiers
performs differently from the others. The Friedman nonparametric test (14) with
the Iman-Davenport extension (28) is probably the most popular omnibus test. It is
a good choice when comparing more than five different classifiers. Let R_ij be the
rank of the j-th of K classifiers on the i-th of N datasets, and let

R_j = (1/N) Σ_{i=1..N} R_ij

be the mean rank of the j-th classifier. The test compares the mean ranks of the
classifiers and is based on the statistic:

χF² = (12N / (K(K+1))) · ( Σ_{j=1..K} R_j² − K(K+1)²/4 )

FF = (N − 1) χF² / ( N(K − 1) − χF² )

which follows the F distribution with (K − 1) and (K − 1)(N − 1) degrees of
freedom. For the comparison of five or fewer classifiers, the Friedman aligned
ranks test (17) or the Quade test (38) are more powerful alternatives.
Second, if we find such a significant difference, we apply pair-wise tests with
the corresponding post-hoc correction for multiple comparisons. For the Friedman
test described above, the pair-wise comparison of the r-th and s-th classifiers is
based on the mean ranks and uses the statistic:

z = (R_r − R_s) / √( K(K+1) / (6N) )

The z value is used to find the corresponding probability from the table of the
normal distribution, which is then compared with an appropriate significance level
α. As performing pair-wise comparisons involves a set or family of hypotheses,
the value of α must be adjusted to control the family-wise error (42). There are
multiple proposals in the literature to adjust the significance level α: Holm (27),
Hochberg (24), Finner (13).
The results of pair-wise comparisons often do not give disjoint groups of
classifiers. In order to identify disjoint, homogeneous groups, (19) applies a
special cluster analysis approach. Their method divides the K classifiers into
groups in such a way that classifiers belonging to the same group do not differ
significantly with respect to the chosen distance.
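
A minimal sketch of the omnibus-plus-post-hoc workflow described above, assuming SciPy is available; the score matrix (N hypothetical datasets × K classifiers) and the significance level are placeholders:

import numpy as np
from scipy.stats import friedmanchisquare, rankdata, norm

scores = np.array([   # rows: datasets, columns: classifiers (hypothetical values)
    [0.85, 0.82, 0.88, 0.80],
    [0.78, 0.75, 0.79, 0.74],
    [0.91, 0.89, 0.93, 0.88],
    [0.66, 0.64, 0.70, 0.61],
    [0.88, 0.85, 0.90, 0.84],
    [0.73, 0.71, 0.74, 0.69],
])
N, K = scores.shape

# Omnibus test (SciPy implements the chi-square form of the Friedman statistic).
print(friedmanchisquare(*scores.T))

# Mean ranks (rank 1 = best score on a dataset) and pair-wise z statistics.
ranks = np.vstack([rankdata(-row) for row in scores])
mean_ranks = ranks.mean(axis=0)
se = np.sqrt(K * (K + 1) / (6.0 * N))
pairs = [(r, s) for r in range(K) for s in range(r + 1, K)]
pvals = [2 * norm.sf(abs((mean_ranks[r] - mean_ranks[s]) / se)) for r, s in pairs]

# Holm step-down adjustment of the significance level alpha.
alpha, reject = 0.05, True
for i, idx in enumerate(np.argsort(pvals)):
    reject = reject and pvals[idx] < alpha / (len(pairs) - i)
    print(pairs[idx], round(pvals[idx], 4), "reject" if reject else "retain")
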
5. RECOMMENDATIONS AND CONCLUSIONS
This paper covers the basic steps of the classifier evaluation process, focusing
mainly on the evaluation metrics and the conditions for their proper usage, as well
as the statistical comparison of classifiers.
The evaluation of classification performance is very important for the
construction and selection of classifiers. The vast majority of published articles
use the accuracy (or classification error) as the score in the classifier evaluation
process. But these two scores may be appropriate only when the datasets are
balanced and the misclassification costs are the same for false positives and false
negatives. In the case of skewed datasets, which is the rather typical situation, the
accuracy/error rate is questionable and other scores such as Recall, Specificity,
Precision, Optimized Precision, F-score, the geometric or harmonic means, or the
H and M measures are more appropriate.
The comparison of two classifiers on a single dataset is generally unsafe due to
the lack of independence between the obtained score values. Thus, the corrected
versions of the resampled t test or of the t test for repeated cross-validation are
more appropriate. McNemar's test, being non-parametric, does not make
assumptions about the distribution of the scores (unlike the two previous tests),
but it does not directly measure the variability due to the choice of the training set,
nor the internal randomness of the learning algorithm. When comparing two
classifiers on multiple datasets (especially from different sources), the measured
scores are hardly commensurable; therefore, the Wilcoxon signed-rank test is more
appropriate. Regarding the comparison of multiple classifiers on multiple datasets,
if the number of classifiers involved is higher than five, the use of the Friedman
test with the Iman and Davenport extension is recommended. When this number is
low, four or five, the Friedman aligned ranks and the Quade test are more useful.
If the null hypothesis has been rejected, we should proceed with a post-hoc test to
check the statistical differences between pairs of classifiers.
The last but not least conclusion follows from the no free lunch theorem (48),
which states that for any two classifiers there are as many classification problems
for which the first classifier performs better than the second as vice versa. Thus, it
does not make sense to demonstrate that one classifier is, on average, better than
the others. Instead, we should focus our attention on exploring the conditions of
the classification problems which make our classifier perform better or worse than
others. We must carefully choose the datasets included in the evaluation process
to reflect the specific conditions, for example class imbalance, classification cost,
dataset size, application domain, etc. In other words, the choice of datasets should
be guided by the goal of identifying the specific conditions that make a classifier
perform better than others.
Summarizing, this review tries to provide the reader with a better
understanding of the overall comparison process in order to decide which
alternative to choose in each specific case. We believe that this review can
improve the way in which researchers and practitioners in machine learning
contrast the results achieved in their experimental studies using statistical
methods.
REFERENCES
1. Batuwita R., Palade V. (2009). A new performance measure for class imbalance
learning: application to bioinformatics problem. Proc. 26th Int. Conf. Machine
Learning and Applications, p. 545-550.
2. Bishop Ch. (2006). Pattern recognition and machine learning. Springer, New
York.
3. Bouckaert R. (2004). Estimating replicability of classifier learning experiments.
Proc. 21st Conf. ICML, AAAI Press.
4. Bradley P. (1997). The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern recognition, 30, p. 1145-1159.
5. Dietterich T. (1998). Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Computation, 10, pp. 1895-1924.
6. Demsar J. (2006). Statistical comparison of classifiers over multiple data sets.
Journal of Machine Learning Research, 7, pp. 1-30.
7. Dmochowski J. et al. (2010). Maximum likelihood in cost-sensitive learning:
model specification, approximation and upper bounds. J. Mach. Learn. Res., 11,
p. 3313-332.
8. Duda R., Hart P., Stork D. (2000). Pattern classification and scene analysis. John
Wiley&Sons, New York.
9. Drummond C., Holte R. (2006). Cost curves: an improved method for
visualizing classifier performance. Machine Learning, 65, 1, p. 95-130.
10. Elkan C. (2001). The foundation of cost-sensitive learning. In: Proc. 4th Int.
Conf. Artificial Intelligence, v. 17, p. 973-978.
11. Fawcett T. (2006). An introduction to ROC analysis. Pattern Recognition
Letters. 27, 8, p. 861-874.
12. Ferri C. et al. (2009). An experimental comparison of performance measures for
classification. Pattern recognition Letters, 30, 1, p. 27-38.
13. Finner H. (1993). On a monotonicity problem in step-down multiple test
procedures. Journal of the American Statistical Association, 88, p. 920-923.
14. Friedman M. (1940). A comparison of alternative tests of significance for the
problem of m rankings. Annals of Mathematical Statistics, 11, p. 86-92.
15. Gama J., et. al. (2013). On evaluating stream learning algorithms. Machine
Learning, p. 1-30.
16. Garcia S., Herrera F. (2008). An extension on statistical comparison of
classifiers over multiple datasets for all pair-wise comparisons. Journal of
Machine Learning Research, 9(12), p. 2677-2694.
17. Garcia S., Fernandez A., Luengo J., Herrera F. (2010). Advanced nonparametric
tests for multiple comparisons in the design of experiments in the computational
intelligence and data mining: experimental analysis of power. Inf. Sci., 180(10),
p. 2044-2064.
18. Garcia V. et. al. (2009). Index of balanced accuracy: a performance measure for
skewed class distributions. 4th IbPRIA, p. 441-448.
19. Górecki T., Krzyśko M. (2015). Regression methods for combining multiple
classifiers, Communications in Statistics—Simulation and Computation, 44, pp.
739–755.
20. Hand D., Till R. (2001). A simple generalization of the area under the ROC
curve for multiple class classification problems. Machine Learning, 45, 171-186.
21. Hand D. (2009). Measuring classifier performance: a coherent alternative to the
area under the ROC curve. Machine Learning, 77, p. 103-123.
22. Hand D., Anagnostopoulos C. (2014) A better beta for the H measure of
classification performance. Pattern Recognition Letters, 40, p. 41-46.
23. He H., Garcia E. (2009). Learning from imbalanced data. IEEE Trans. On Data
and Knowledge Engineering, 21, 9, p. 1263-1284.
24. Hochberg Y. (1988). A sharper Bonferroni procedure for multiple tests of
significance. Biometrika, 75, p. 800-802.
25. Hodges J.L., Lehmann E.L. (1962). Ranks methods for combination of
independent experiments in analysis of variance. Annals of Math. Statistics, 33,
p. 482-487.
26. Hollander M., Wolfe D. (2013). Nonparametric statistical methods. John Wiley
& Sons.
27. Holm S. (1979). A simple sequentially rejective multiple test procedure.
Scandinavian Journal of Statistics, 6, p. 65-70.
28. Iman R., Davenport J. (1980). Approximations of the critical region of the
Friedman statistic. Communications in Statistics, p. 571-595.
29. Japkowicz N. Stephen N. (2002). The class imbalance problem: a systematic
study. Intell. Data Analysis, 6,5, p. 40-49.
30. Japkowicz N., Shah M. (2011). Evaluating learning algorithms: a classification
perspective. Cambridge University Press, Cambridge.
31. Krzyśko M., Wołyński W., Górecki T., Skorzybut M. (2008). Learning Systems.
WNT, Warszawa (in Polish).
32. Kubat M., Matwin S. (1997). Addressing the curse of imbalanced training sets:
one-sided selection. Proc. 14th ICML, 179-186.
33. Kurzyński M. (1997). Pattern Recognition. Statistical approach. Wrocław Univ.
Tech. Press, Wrocław (in Polish).
34. Malina W., Śmiatacz M. (2010). Pattern Recognition. EXIT Press, Warszawa (in
Polish).
35. Nadeau C., Bengio Y. (2003), Inference for the generalization error. Mach.
Learn, 52(3), p. 239-281.
36. Prati R. et al. (2011). A survey on graphical methods for classification predictive
performance evaluation. IEEE Trans. Knowl. Data Eng., 23(11), p. 1601-1618.
37. Ranavana R., Palade V. (2006). Optimized precision: a new measure for
classifier performance evaluation. Proc. 23rd IEEE Int. Conf. on Evolutionary
Computation, 2254-2261.
38. Quade D. (1979). Using weighted rankings in the analysis of complete blocks
with additive block effects. J. American Statistical Association, 74, p. 680-683.
39. Salzberg S. (1997). On comparing classifiers: pitfalls to avoid and recommended
approach. Data Mining and Knowledge Discovery, 1, p. 317-328.
40. Sanchez-Crisostomo J., et al. (2014). Empirical analysis of assessments metrics
for multi-class imbalance learning on the back-propagation context. In: LNCS,
8795, Y. Tan et al. (eds), p. 17-23.
41. Santafe G. et. al. (2015). Dealing with the evaluation of supervised classification
algorithms. Artif. Intell. Rev. 44, p. 467-508.
42. Shaffer J.P. (1995). Multiple hypothesis testing. Annual Review of Psychology,
46, p. 561-584.
43. Sokolova M., Lapalme G. (2009). A systematic analysis of performance
measures for classification tasks. Inf. Proc. and Manag., 45, p. 427-437.
44. Stąpor K. (2011). Classification methods in computer vision. PWN, Warszawa
(in Polish).
45. Sun Y. et. al (2009). Classification of imbalanced data: a review. Int. J. Pattern
Recognit. Artif. Intell. 23, 4, p. 687-719.
46. Sun Y. et. al. (2006). Boosting for Learning Multiple Classes with Imbalanced
Class Distribution. Proc. Int'l Conf. Data Mining, p. 592-602.
47. Tadeusiewicz R., Flasiński M. (1991). Pattern Recognition. PWN, Warszawa (in
Polish).
48. Wolpert D. (1996). The lack of a priori distinctions between learning algorithms.
Neural Comput. 8(7), p.1341-1390.
49. Woźniak M. (2017). Hybrid classifiers. Methods of data, knowledge and
classifier combination. Studies in Computational Intelligence, 519, Springer.