Cite this paper as:
Stąpor K. (2018) Evaluating and Comparing Classifiers: Review,
Some Recommendations and Limitations. In: Kurzynski M., Wozniak
M., Burduk R. (eds) Proceedings of the 10th International Conference
on Computer Recognition Systems CORES 2017. CORES 2017.
Advances in Intelligent Systems and Computing, vol 578. Springer,
Cham
PREPRINT
Evaluating and comparing classifiers:
review, some recommendations and limitations
Katarzyna Stąpor
Institute of Computer Science, Silesian University of Technology, Gliwice
Abstract
Evaluating the predictive performance of a supervised classification learning method on independent data is a central task in machine learning. It is also almost unthinkable to carry out any research work without comparing a newly proposed classifier with other, already existing ones. This paper reviews the most important aspects of the classifier evaluation process, including the choice of evaluation metrics (scores) as well as the statistical comparison of classifiers. A critical view, recommendations and limitations of the reviewed methods are presented. The article provides a quick guide to the complexity of the classifier evaluation process and tries to warn the reader about common bad habits.
Key words: supervised classification, classifier evaluation, performance metrics, statistical classifier comparison
1. INTRODUCTION
In a supervised classification problem one aims to learn a classifier from a
dataset U = {(x(1), t(1)), ..., (x(n), t(n))} of n labeled data instances, where each
instance x(i) is characterized by d predictive variables/features, X = (X1, ..., Xd),
and a class T to which it belongs. This dataset is obtained from a physical process
described by an unknown probability distribution f(X, T). The learned classifier,
after its quality has been evaluated (usually on a test dataset), can be used to
classify new samples, i.e. to obtain their unknown class labels. We do not
distinguish here between a classifier (a function that maps an input feature space
to a set of class labels) and a classification learning algorithm, which is a general
methodology that can be used, given a specific dataset, to learn a specific
classifier. Theoretical background on the supervised classification problem, as
well as a complete description of the classifier construction process, can be found
in many books on machine learning and pattern recognition (see for example 2, 8,
31, 33, 34, 44, 47, 49).
Usually, the problem of evaluating a new classifier is tackled by using a
score that tries to summarize the specific conditions of interest. Classification
error and accuracy are the most widely used scores in classification problems. In
practice, the classification error must be estimated from the available samples;
k-fold cross-validation, for example, is one of the most frequently used estimation
methods. The question is then whether the new, proposed classifier (or an
enhancement of an existing one) yields an improved score over the competitor
classifier (or classifiers) or the state of the art. It is almost impossible now to do
any research work without an experimental section where the score of a new
classifier is tested and compared with the scores of existing ones. This last step
also requires the selection of datasets on which the compared classifiers are
learned and evaluated. The purpose of the dataset selection step should not be to
demonstrate one classifier's superiority over another in all cases, but rather to
identify its areas of strength with respect to domain characteristics.
This paper focuses only on the supervised classification problem as
defined above. Other types of classification, such as classification from data
streams or multi-label classification, are not addressed here, since they may
impose specific conditions on the calculation of the score (for an important
reference on evaluating data stream learning algorithms, see for example (15)).
The whole evaluation process of a classifier should include the following
steps:
1) choosing an evaluation metric (i.e. a score) according to the
properties of the classifier,
2) deciding on the score estimation method to be used,
3) checking whether the assumptions made in 1) and 2) are fulfilled,
4) running the evaluation method and interpreting the results with
respect to the domain,
5) comparing the new classifier with existing ones selected according
to different criteria, for example problem-dependent ones; this step
requires the selection of datasets.
The main purpose of this paper is to provide the reader with a better
understanding of the overall classifier evaluation process. As there is no fixed,
concrete recipe for the classifier evaluation procedure, we believe that this paper
will help researchers in the machine learning area decide which alternative to
choose in each specific case.
The paper is set up as follows. Section 2 describes measures of classifier
quality, while Section 3 gives a short overview of their estimation methods.
Section 4 focuses on statistical methods for classifier quality comparison. Finally,
in Section 5 we conclude and give some recommendations.
2. MEASURES OF CLASSIFIER QUALITY
Usually the problem of evaluating a new classifier (i.e. measuring its quality)
is tackled by using a score that tries to summarize the specific conditions of
interest when evaluating the classifier. There may be many scores, according to
how we aim to quantify the classifier's behavior. In this section, we present only
some of the most widely used ones.
Typical scores for measuring the performance of a classifier are accuracy and
classification error, which for a two-class problem can be easily derived from a
2x2 confusion matrix such as that given in Table 1. These scores are computed as:

Acc = (TP + TN) / (TP + FN + TN + FP)
Err = (FP + FN) / (TP + FN + TN + FP)
Sometimes, accuracy and classification error are selected without considering
in depth whether they are the most appropriate scores for the classification
problem at hand. When both class labels are relevant and the proportion of data
samples in each class is very similar, these scores are a good choice.
Unfortunately, equal class proportions are quite rare in real problems. This
situation is known as the imbalance problem (29). Empirical evidence shows that
accuracy and error rate are biased with respect to data imbalance: the use of these
scores might produce misleading conclusions since they do not take into account
misclassification costs, are strongly biased in favor of the majority class, and are
sensitive to class skew.
In some application domains, we may be interested in how our classifier classifies
only a part of the data. Examples of such measures are: True positive rate (Recall or
Sensitivity), TPrate = TP/(TP+FN); True negative rate (Specificity), TNrate =
TN/(TN+FP); False positive rate, FPrate = FP/(TN+FP); and Precision (positive
predictive value), Precision = TP/(TP+FP).
Table 1. Confusion matrix for a two-class problem
Predicted positive Predicted negative
Positive class True Positive (TP) False Negative (FN)
Negative class False Positive (FP) True Negative (TN)
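
The scores above follow directly from the four cells of Table 1. A minimal Python sketch, using purely hypothetical counts for an imbalanced test set, could look as follows:

def basic_scores(tp, fn, fp, tn):
    # Two-class scores derived from the confusion matrix of Table 1.
    total = tp + fn + fp + tn
    return {
        "Acc": (tp + tn) / total,
        "Err": (fp + fn) / total,
        "TPrate (Recall, Sensitivity)": tp / (tp + fn),
        "TNrate (Specificity)": tn / (tn + fp),
        "FPrate": fp / (tn + fp),
        "Precision": tp / (tp + fp),
    }

# Hypothetical counts: 10 positive and 90 negative test samples.
print(basic_scores(tp=5, fn=5, fp=10, tn=80))
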
Shortcomings of the accuracy and error rate have motivated the search for new
measures which aim at a trade-off between the evaluation of the classification
ability on positive and on negative data samples. Some straightforward examples
of such alternative scores are the harmonic mean of Recall and Precision:

F-measure = 2 × TPrate × Precision / (TPrate + Precision)

and the geometric mean of the accuracies measured separately on each class:

G-mean = √(TPrate × TNrate)

Harmonic and geometric means are symmetric functions that give the same
relevance to both components. There are other proposals that try to enhance one of
the two components of the mean, for instance the adjusted geometric mean (1), the
optimized precision OP from (37), computed as

OP = Acc − |TNrate − TPrate| / (TNrate + TPrate)

and the F-score (30):

F-score = (β² + 1) × Precision × TPrate / (β² × Precision + TPrate)

The parameter β can be tuned to obtain different trade-offs between the two
components.
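
A short Python sketch of these composite scores (assuming the rates and the accuracy have already been computed as above; the values in the example call are hypothetical):

from math import sqrt

def f_score(precision, tp_rate, beta=1.0):
    # F-score; beta = 1 gives the F-measure (harmonic mean of Recall and Precision).
    return (beta**2 + 1) * precision * tp_rate / (beta**2 * precision + tp_rate)

def g_mean(tp_rate, tn_rate):
    # Geometric mean of the accuracies on the positive and negative class.
    return sqrt(tp_rate * tn_rate)

def optimized_precision(acc, tp_rate, tn_rate):
    # OP from (37): accuracy penalized by the disagreement between the class rates.
    return acc - abs(tn_rate - tp_rate) / (tn_rate + tp_rate)

print(f_score(precision=0.33, tp_rate=0.50),
      g_mean(tp_rate=0.50, tn_rate=0.89),
      optimized_precision(acc=0.85, tp_rate=0.50, tn_rate=0.89))
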
When a classifier assigns an instance to a wrong class, a loss is incurred.
Cost-sensitive learning (10) aims to minimize the loss incurred by the classifier.
The scores introduced above use the 0/1 loss function, i.e. they treat all types of
misclassification as equally severe. A cost matrix can be used if the severity of
misclassifications can be quantified in terms of costs. Unfortunately, in real
applications, specific costs are difficult to obtain. In such situations, however, the
scores described above may still be useful, since they can also be used to put more
weight on the costliest misclassifications: minimizing the cost may be equivalent
to finding an optimal trade-off between Recall and Specificity (7).
When the classification costs cannot be accessed, another widely used
technique for the evaluation of classifiers is the ROC curve (4, 11), a graphical
representation of Recall versus FPrate (1 − Specificity). The information about
classification performance contained in the ROC curve can be summarized into a
score known as AUC (Area Under the ROC Curve), which is less sensitive to
skewness in the class distribution since it reflects a trade-off between Recall and
Specificity (43). However, recent studies have shown that AUC is a fundamentally
incoherent measure, since it treats the costs of misclassification differently for
each classifier. This is undesirable because the cost must be a property of the
problem, not of the classification method. In (21, 22), the H measure is proposed
as an alternative to AUC.
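
In practice the ROC curve and the AUC are rarely computed by hand. A minimal sketch, assuming scikit-learn is available and using toy labels and scores, could be:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data: true class labels and the classifier's scores (e.g. posterior probabilities).
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points (FPrate, Recall) of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under that curve
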
While all of the scores described above are appropriate for two-class
imbalanced learning problems, some of them can be modified to accommodate
multi-class imbalanced learning problems (23). For example, (46) extends the
G-mean definition to the geometric mean of the Recall values of every class.
Similarly, (12) defines a mean F-measure for the multi-class imbalance problem.
The major advantage of this measure is that it is insensitive to class distribution
and error costs. However, it is still an open question whether such extended scores
are appropriate in scenarios where there are multiple minority and multiple
majority classes (40). In (20), the M measure is proposed, a generalization that
aggregates all pairs of classes based on the inherent characteristics of the AUC.
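
As an illustration of the extension in (46), the sketch below computes the geometric mean of the per-class Recall values from a K×K confusion matrix; the matrix shown is hypothetical, with one minority class.

import numpy as np

def multiclass_g_mean(C):
    # C[i, j] counts class-i samples predicted as class j; the recall of class i
    # is the diagonal entry divided by the row sum.
    C = np.asarray(C, dtype=float)
    recalls = np.diag(C) / C.sum(axis=1)
    return float(recalls.prod() ** (1.0 / len(recalls)))

C = [[80, 15, 5],
     [10, 85, 5],
     [ 2,  3, 5]]
print(multiclass_g_mean(C))
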
In this paper, we focus on scores since they are a popular way to measure
classification quality. These measures do not capture all the information about the
quality of classification methods that some graphical methods may convey;
however, the use of quantitative measures of quality makes the comparison among
classifiers easier (for more information on graphical methods see for example 30,
9, 36). The presented list of scores is by no means exhaustive. The described
scores focus only on evaluating the predictive performance of a classifier; other
important aspects of classification, such as robustness to noise, scalability or
stability under data shifts, are not addressed here.
3. QUALITY ESTIMATION METHODS
Various methods are commonly used to estimate the classification error and the
other classifier scores described above (a review of estimation methods can also
be found in the machine learning literature mentioned earlier). The holdout
method of estimating the classification error randomly divides the available
dataset into independent training and testing subsets, which are then used for
learning and evaluating a classifier. This method gives a pessimistically biased
error estimate (calculated as the ratio of misclassified test samples to the size of
the test subset); moreover, it depends on the particular partitioning of the dataset.
These limitations are overcome with a family of resampling methods: cross-
validation (random subsampling, k-fold cross-validation, leave-one-out) and the
bootstrap. Random subsampling performs k random splits of the entire dataset
into training and testing subsets. For each data split, we retrain the classifier and
then estimate the error on the test samples. The error estimate is the average of the
separate errors obtained from the k splits. The k-fold cross-validation creates a
k-fold partition of the entire dataset once; then, for each of k experiments, it uses
(k−1) folds for training and the remaining fold for testing. The classification error
is estimated as the average of the separate errors obtained from the k experiments.
This estimate is approximately unbiased, although at the expense of an increase in
its variance. Leave-one-out is the degenerate case of k-fold cross-validation, where
k is chosen as the total number of samples. This results in an unbiased error
estimate, but one with large variance. In bootstrap estimation, we randomly select
samples with replacement and use this set for training; the remaining samples that
were not selected for training are used for testing. We repeat this procedure k
times. The error is estimated as the average error on the test samples over the k
repetitions. The benefit of this method is its ability to obtain accurate measures of
both the bias and variance of the classification error estimate.
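
For illustration, a minimal sketch of k-fold cross-validation error estimation, assuming scikit-learn is available; the classifier and dataset are arbitrary placeholders:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

errors = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf.fit(X[train_idx], y[train_idx])                                # retrain on k-1 folds
    errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))    # error on the held-out fold

print(np.mean(errors))   # cross-validated estimate of the classification error
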
4. STATISTICAL COMPARISON OF CLASSIFIERS
The comparison of the scores obtained by two or more classifiers on a set of
problems is a central task in machine learning; it is almost impossible to do any
research work without an experimental section where the score of a new classifier
is tested and compared with the scores of existing ones. When the differences are
very clear (e.g., when one classifier is the best on all the problems considered), a
direct comparison of the scores may be enough. But in most situations, a direct
comparison may be misleading and not enough to draw sound conclusions. In
such situations, a statistical assessment of the scores, such as hypothesis testing, is
required. Statistical tests aim to answer this question by providing a more precise
assessment of the obtained scores, analyzing them to decide whether the observed
differences between the classifiers are real or due to chance. However, although
statistical tests have become established as a basic part of the classifier
comparison task, they are not a definitive tool, and we have to be aware of their
limitations and misuses.
The statistical tests for comparing classifiers are usually bound to a specific
method for estimating the classifier score. Therefore, the selection of a statistical
test is also conditioned by this estimation method.
For the comparison of two classifiers on one dataset, a situation which is
very common in machine learning problems, the corrected resampled t test has
been suggested in the literature (35). This test is associated with a repeated
estimation method (for example holdout): in the i-th of m iterations, a random data
partition is performed and the values of the scores A_k1(i) and A_k2(i) of the
compared classifiers k1 and k2 are obtained. The statistic is:

t = Ā / √( (1/m + N_test/N_train) · Σ_{i=1..m} (A(i) − Ā)² / (m − 1) )

where Ā = (1/m) Σ_{i=1..m} A(i), A(i) = A_k1(i) − A_k2(i), and N_test, N_train are the
numbers of samples in the test and train partitions. A second parametric test that
can be used in this scenario, whose behavior, however, has not been studied as
thoroughly as that of the previous one, is the corrected t test for repeated
cross-validation (3). These tests assume that the data follow a normal distribution,
which should first be checked using a suitable normality test. A non-parametric
alternative for comparing two classifiers suggested in the literature is McNemar's
test (26).
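
A minimal Python sketch of the corrected resampled t test, assuming the scores of the two classifiers have been collected from m repeated holdout partitions (the accuracy values below are hypothetical):

import numpy as np
from scipy import stats

def corrected_resampled_t(scores_k1, scores_k2, n_train, n_test):
    d = np.asarray(scores_k1) - np.asarray(scores_k2)     # per-repetition differences A(i)
    m = len(d)
    var = np.sum((d - d.mean()) ** 2) / (m - 1)           # variance term of the statistic
    t = d.mean() / np.sqrt((1.0 / m + n_test / n_train) * var)
    p = 2 * stats.t.sf(abs(t), df=m - 1)                  # two-sided p-value, m-1 degrees of freedom
    return t, p

acc_k1 = [0.91, 0.89, 0.92, 0.90, 0.93, 0.88, 0.91, 0.92, 0.90, 0.89]
acc_k2 = [0.88, 0.87, 0.90, 0.89, 0.90, 0.86, 0.89, 0.90, 0.88, 0.87]
print(corrected_resampled_t(acc_k1, acc_k2, n_train=200, n_test=100))
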
For the comparison of two classifiers on multiple datasets the Wilcoxon
signed-ranks test (26) is widely recommended. It ranks the differences
d_i = A_k1(i) − A_k2(i) between the scores of two classifiers k1 and k2 obtained on
the i-th of N datasets, ignoring the signs. The test statistic is:

T = min(R+, R−)

where

R+ = Σ_{d_i > 0} rank(d_i) + (1/2) Σ_{d_i = 0} rank(d_i)
R− = Σ_{d_i < 0} rank(d_i) + (1/2) Σ_{d_i = 0} rank(d_i)

are the sums of ranks for the datasets on which classifier k1 outperforms k2 and on
which k2 outperforms k1, respectively (assuming a larger score is better). Ranks
for d_i = 0 are split evenly between the two sums. Another test that can be used is
the sign test, but it is much weaker than the Wilcoxon signed-ranks test.
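
The Wilcoxon signed-ranks test is readily available in standard libraries; a minimal sketch with hypothetical per-dataset scores of the two classifiers, assuming SciPy is installed:

from scipy.stats import wilcoxon

scores_k1 = [0.85, 0.80, 0.91, 0.78, 0.88, 0.93, 0.75, 0.84, 0.90, 0.79]
scores_k2 = [0.83, 0.79, 0.90, 0.80, 0.84, 0.92, 0.73, 0.82, 0.88, 0.76]

stat, p_value = wilcoxon(scores_k1, scores_k2)   # stat corresponds to T = min(R+, R-)
print(stat, p_value)
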
Comparisons among multiple classifiers on multiple datasets arise in machine
learning when a newly proposed classifier is compared with the state of the art.
For this situation, the generally recommended methodology is as follows.
First, we apply an omnibus test to detect whether at least one of the classifiers
performs differently from the others. The Friedman nonparametric test (14) with
the Iman-Davenport extension (28) is probably the most popular omnibus test. It is
a good choice when comparing more than five different classifiers. Let R_ij be the
rank of the j-th of K classifiers on the i-th of N datasets, and let

R_j = (1/N) Σ_{i=1..N} R_ij

be the mean rank of the j-th classifier. The test compares the mean ranks of the
classifiers and is based on the statistic:

χF² = (12N / (K(K+1))) · ( Σ_{j=1..K} R_j² − K(K+1)²/4 )

FF = (N − 1) χF² / ( N(K − 1) − χF² )

which follows the F distribution with (K − 1) and (K − 1)(N − 1) degrees of
freedom. For the comparison of five or fewer classifiers, the Friedman aligned
ranks test (17) or the Quade test (38) are more powerful alternatives.
Second, if we find such a significant difference, we apply pair-wise tests with
the corresponding post-hoc correction for multiple comparisons. For the Friedman
test described above, the pair-wise comparison of the r-th and s-th classifiers is
based on the mean ranks and uses the statistic:

z = (R_r − R_s) / √( K(K+1) / (6N) )

The z value is used to find the corresponding probability from the table of the
normal distribution, which is then compared with an appropriate significance level
α. As performing pair-wise comparisons involves a set or family of hypotheses,
the value of α must be adjusted to control the family-wise error (42). There are
multiple proposals in the literature to adjust the significance level α: Holm (27),
Hochberg (24), Finner (13).
The results of pair-wise comparisons often do not give disjoint groups of
classifiers. In order to identify disjoint, homogeneous groups, (19) applies a
special cluster analysis approach. Their method divides the K classifiers into
groups in such a way that classifiers belonging to the same group do not differ
significantly with respect to the chosen distance.
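
A minimal sketch of the omnibus-plus-post-hoc workflow described above, assuming SciPy is available; the score matrix (N hypothetical datasets × K classifiers) and the significance level are placeholders:

import numpy as np
from scipy.stats import friedmanchisquare, rankdata, norm

scores = np.array([   # rows: datasets, columns: classifiers (hypothetical values)
    [0.85, 0.82, 0.88, 0.80],
    [0.78, 0.75, 0.79, 0.74],
    [0.91, 0.89, 0.93, 0.88],
    [0.66, 0.64, 0.70, 0.61],
    [0.88, 0.85, 0.90, 0.84],
    [0.73, 0.71, 0.74, 0.69],
])
N, K = scores.shape

# Omnibus test (SciPy implements the chi-square form of the Friedman statistic).
print(friedmanchisquare(*scores.T))

# Mean ranks (rank 1 = best score on a dataset) and pair-wise z statistics.
ranks = np.vstack([rankdata(-row) for row in scores])
mean_ranks = ranks.mean(axis=0)
se = np.sqrt(K * (K + 1) / (6.0 * N))
pairs = [(r, s) for r in range(K) for s in range(r + 1, K)]
pvals = [2 * norm.sf(abs((mean_ranks[r] - mean_ranks[s]) / se)) for r, s in pairs]

# Holm step-down adjustment of the significance level alpha.
alpha, reject = 0.05, True
for i, idx in enumerate(np.argsort(pvals)):
    reject = reject and pvals[idx] < alpha / (len(pairs) - i)
    print(pairs[idx], round(pvals[idx], 4), "reject" if reject else "retain")
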
5. RECOMMENDATIONS AND CONCLUSIONS
This paper covers the basic steps of the classifier evaluation process, focusing
mainly on the evaluation metrics and the conditions for their proper usage, as well
as the statistical comparison of classifiers.
The evaluation of classification performance is very important for the
construction and selection of classifiers. The vast majority of published articles
use the accuracy (or classification error) as the score in the classifier evaluation
process. But these two scores may be appropriate only when the datasets are
balanced and the misclassification costs are the same for false positives and false
negatives. In the case of skewed datasets, which is the rather typical situation, the
accuracy/error rate is questionable and other scores such as Recall, Specificity,
Precision, Optimized Precision, F-score, the geometric or harmonic means, or the
H and M measures are more appropriate.
The comparison of two classifiers on a single dataset is generally unsafe due to
the lack of independence between the obtained score values. Thus, the corrected
versions of the resampled t test or of the t test for repeated cross-validation are
more appropriate. McNemar's test, being non-parametric, does not make
assumptions about the distribution of the scores (unlike the two previous tests),
but it does not directly measure the variability due to the choice of the training set,
nor the internal randomness of the learning algorithm. When comparing two
classifiers on multiple datasets (especially from different sources), the measured
scores are hardly commensurable; therefore, the Wilcoxon signed-rank test is more
appropriate. Regarding the comparison of multiple classifiers on multiple datasets,
if the number of classifiers involved is higher than five, the use of the Friedman
test with the Iman and Davenport extension is recommended. When this number is
low, four or five, the Friedman aligned ranks and the Quade test are more useful.
If the null hypothesis has been rejected, we should proceed with a post-hoc test to
check the statistical differences between pairs of classifiers.
The last but not least conclusion follows from the no free lunch theorem (48),
which states that for any two classifiers there are as many classification problems
for which the first classifier performs better than the second as vice versa. Thus, it
does not make sense to demonstrate that one classifier is, on average, better than
the others. Instead, we should focus our attention on exploring the conditions of
the classification problems which make our classifier perform better or worse than
others. We must carefully choose the datasets included in the evaluation process
to reflect the specific conditions, for example class imbalance, classification cost,
dataset size, application domain, etc. In other words, the choice of datasets should
be guided by the goal of identifying the specific conditions that make a classifier
perform better than others.
Summarizing, this review tries to provide the reader with a better
understanding of the overall comparison process in order to decide which
alternative to choose in each specific case. We believe that this review can
improve the way in which researchers and practitioners in machine learning
contrast the results achieved in their experimental studies using statistical
methods.
REFERENCES
1. Batuwita R., Palade V. (2009). A new performance measure for class imbalance
learning: application to bioinformatics problem. Proc. 26th Int. Conf. Machine
Learning and Applications, p. 545-550.
2. Bishop Ch. (2006). Pattern recognition and machine learning. Springer, New
York.
3. Bouckaert R. (2004). Estimating replicability of classifier learning experiments.
Proc. 21st Conf. ICML, AAAI Press.
4. Bradley P. (1997). The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern recognition, 30, p. 1145-1159.
5. Dietterich T. (1998). Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Computation, 10, pp. 1895-1924.
6. Demsar J. (2006). Statistical comparison of classifiers over multiple data sets.
Journal of Machine Learning Research, 7, pp. 1-30.
7. Dmochowski J. et al. (2010). Maximum likelihood in cost-sensitive learning:
model specification, approximation and upper bounds. J. Mach. Learn. Res., 11,
p. 3313-332.
8. Duda R., Hart P., Stork D. (2000). Pattern classification and scene analysis. John
Wiley&Sons, New York.
9. Drummond C., Holte R. (2006). Cost curves: an improved method for
visualizing classifier performance. Machine Learning, 65, 1, p. 95-130.
10. Elkan C. (2001). The foundation of cost-sensitive learning. In: Proc. 4th Int.
Conf. Artificial Intelligence, v. 17, p. 973-978.
11. Fawcett T. (2006). An introduction to ROC analysis. Pattern Recognition
Letters. 27, 8, p. 861-874.
12. Ferri C. et al. (2009). An experimental comparison of performance measures for
classification. Pattern recognition Letters, 30, 1, p. 27-38.
13. Finner H. (1993). On a monotonicity problem in step-down multiple test
procedures. Journal of the American Statistical Association, 88, p. 920-923.
14. Friedman M. (1940). A comparison of alternative tests of significance for the
problem of m rankings. Annals of Mathematical Statistics, 11, p. 86-92.
15. Gama J., et. al. (2013). On evaluating stream learning algorithms. Machine
Learning, p. 1-30.
16. Garcia S., Herrera F. (2008). An extension on statistical comparison of
classifiers over multiple datasets for all pair-wise comparisons. Journal of
Machine Learning Research, 9(12), p. 2677-2694.
17. Garcia S., Fernandez A., Luengo J., Herrera F. (2010). Advanced nonparametric
tests for multiple comparisons in the design of experiments in the computational
intelligence and data mining: experimental analysis of power. Inf. Sci., 180(10),
p. 2044-2064.
18. Garcia V. et. al. (2009). Index of balanced accuracy: a performance measure for
skewed class distributions. 4th IbPRIA, p. 441-448.
19. Górecki T., Krzyśko M. (2015). Regression methods for combining multiple
classifiers, Communications in Statistics—Simulation and Computation, 44, pp.
739–755.
20. Hand D., Till R. (2001). A simple generalization of the area under the ROC
curve for multiple class classification problems. Machine Learning, 45, 171-186.
21. Hand D. (2009). Measuring classifier performance: a coherent alternative to the
area under the ROC curve. Machine Learning, 77, p. 103-123.
22. Hand D., Anagnostopoulos C. (2014) A better beta for the H measure of
classification performance. Pattern Recognition Letters, 40, p. 41-46.
23. He H., Garcia E. (2009). Learning from imbalanced data. IEEE Trans. On Data
and Knowledge Engineering, 21, 9, p. 1263-1284.
24. Hochberg Y. (1988). A sharper Bonferroni procedure for multiple tests of
significance. Biometrika, 75, p. 800-802.
25. Hodges J.L., Lehmann E.L. (1962). Ranks methods for combination of
independent experiments in analysis of variance. Annals of Math. Statistics, 33,
p. 482-487.
26. Hollander M., Wolfe D. (2013). Nonparametric statistical methods. John Wiley
& Sons.
27. Holm S. (1979). A simple sequentially rejective multiple test procedure.
Scandinavian Journal of Statistics, 6, p. 65-70.
28. Iman R., Davenport J. (1980). Approximations of the critical region of the
Friedman statistic. Communications in Statistics, p. 571-595.
29. Japkowicz N. Stephen N. (2002). The class imbalance problem: a systematic
study. Intell. Data Analysis, 6,5, p. 40-49.
30. Japkowicz N., Shah M. (2011). Evaluating learning algorithms: a classification
perspective. Cambridge University Press, Cambridge.
31. Krzyśko M., Wołyński W., Górecki T., Skorzybut M. (2008). Learning Systems.
WNT, Warszawa (in Polish).
32. Kubat M., Matwin S. (1997). Addressing the curse of imbalanced training sets:
one-sided selection. Proc. 14th ICML, 179-186.
33. Kurzyński M. (1997). Pattern Recognition. Statistical approach. Wrocław Univ.
Tech. Press, Wrocław (in Polish).
34. Malina W., Śmiatacz M. (2010). Pattern Recognition. EXIT Press, Warszawa (in
Polish).
35. Nadeau C., Bengio Y. (2003), Inference for the generalization error. Mach.
Learn, 52(3), p. 239-281.
36. Prati R. et al. (2011). A survey on graphical methods for classification predictive
performance evaluation. IEEE Trans. Knowl. Data Eng., 23(11), p. 1601-1618.
37. Ranavana R., Palade V. (2006). Optimized precision: a new measure for
classifier performance evaluation. Proc. 23rd IEEE Int. Conf. on Evolutionary
Computation, 2254-2261.
38. Quade D. (1979). Using weighted rankings in the analysis of complete blocks
with additive block effects. J. American Statistical Association, 74, p. 680-683.
39. Salzberg S. (1997). On comparing classifiers: pitfalls to avoid and recommended
approach. Data Mining and Knowledge Discovery, 1, p. 317-328.
40. Sanchez-Crisostomo J., et al. (2014). Empirical analysis of assessments metrics
for multi-class imbalance learning on the back-propagation context. In: LNCS,
8795, Y. Tan et al. (eds), p. 17-23.
41. Santafe G. et. al. (2015). Dealing with the evaluation of supervised classification
algorithms. Artif. Intell. Rev. 44, p. 467-508.
42. Shaffer J.P. (1995). Multiple hypothesis testing. Annual Review of Psychology,
46, p. 561-584.
43. Sokolova M., Lapalme G. (2009). A systematic analysis of performance
measures for classification tasks. Inf. Proc. and Manag., 45, p. 427-437.
44. Stąpor K. (2011). Classification methods in computer vision. PWN, Warszawa
(in Polish).
45. Sun Y. et. al (2009). Classification of imbalanced data: a review. Int. J. Pattern
Recognit. Artif. Intell. 23, 4, p. 687-719.
46. Sun Y. et. al. (2006). Boosting for Learning Multiple Classes with Imbalanced
Class Distribution. Proc. Int'l Conf. Data Mining, p. 592-602.
47. Tadeusiewicz R., Flasiński M. (1991). Pattern Recognition. PWN, Warszawa (in
Polish).
48. Wolpert D. (1996). The lack of a priori distinctions between learning algorithms.
Neural Comput. 8(7), p.1341-1390.
49. Woźniak M. (2017). Hybrid classifiers. Methods of data, knowledge and
classifier combination. Studies in Computational Intelligence, 519, Springer.