Reassessing class imbalance’s effect on classification model performance metrics: a simulation study

Jing Li¹,*

¹ 1407 W Gregory Dr, Department of Political Science, University of Illinois at Urbana-Champaign, USA
* Corresponding author. Email: [email protected].

Abstract
This paper examines a critical question in binary classification tasks, namely the use of perfor-
mance metrics for model evaluation when data are at different levels of class-imbalance. More
specifically, it investigates how sample data class imbalance as measured in prevalence level af-
fects the values of various confusion matrix model performance metrics including true positive
rate, true negative rate, positive predictive value, negative predictive value, balanced accuracy,
bookmaker informedness, F1 score and Matthew’s correlation coefficient as well as the commonly
used accuracy and Area Under the ROC (receiver operating characteristic) Curve. The results
indicate that the accuracy measure is dominated by the majority class as data become more
imbalanced. The balanced accuracy (as well as bookmaker informedness) and Matthew’s correla-
tion coefficient, meanwhile, are dominated by the minority class and take model performance on
both classes into account. The F1 score has a monotonically increasing relationship with preva-
lence level. The Area Under the ROC Curve is not sensitive to change in prevalence level. These
patterns of the confusion matrix performance metrics hold regardless of which specific type of
classification model is used. The results have significant implications for application of binary
classification machine learning models in various fields.
Keywords Prevalence; True Positive Rate; True Negative Rate; Model Performance Metrics

1 Introduction
More than 20 years after Breiman’s articulation of the two statistical cultures, algorithmic
versus data modeling (Breiman, 2001), the algorithmic approach exemplified by machine learning
models is increasingly being applied in many fields, including the social sciences, for example in
law enforcement (Dressel and Farid, 2018; Lin et al., 2020; Bansak, 2019) and conflict prediction
(Muchlinski et al., 2016; Neunhoeffer and Sternberg, 2019; Wang, 2019; Blair et al., 2017; Blair
and Sambanis, 2020; Beger et al., 2021; Blair and Sambanis, 2021).
While most scholarly attention has focused on relative performance of different models, this
paper looks instead at how data itself can affect the values of metrics used to measure model
prediction performance regardless of which specific type of model is being used.
The choice of performance metrics is a critical issue in applying machine learning models across
scientific fields (García V. and R., 2012; Lever et al., 2016; Luque et al., 2019; Jadhav, 2020; Zhu,
2020; Chicco and Jurman, 2020; Chicco et al., 2021; De Diego et al., 2022; Hicks et al., 2022;
Lavazza and Morasca, 2023; Chicco and Jurman, 2023). The evaluation metric is an essential
component of the modeling process and directly affects model performance evaluation and model
selection. Using an appropriate performance evaluation metric is critical for
the proper application of machine learning models. Yet despite their obvious significance,
practitioners have often paid insufficient attention to performance measures, resulting in the
inappropriate use of model evaluation metrics and disputable claims about actual model
performance.
In particular, there has not been sufficient research explicitly examining how class imbalance
affects the values of different performance metrics for binary classification tasks. As many
phenomena studied across scientific fields, such as civil conflicts in the social sciences and
diseases in the biomedical sciences, are binary outcomes where the positive and negative classes
are likely to be imbalanced, it is essential to assess the impact of class imbalance on values of
model performance metrics irrespective of which specific type of model is being used. The com-
monly used performance metrics including accuracy and Area Under the ROC Curve (AUC) are
ill-suited for model evaluation when data is class-imbalanced. For example, a classifier with 95
percent accuracy could fail to predict the occurrence of any civil conflict or disease as the data
sample is overwhelmingly dominated by the negative class of no civil conflict or no disease. A high
accuracy classifier will not necessarily have satisfactory or desirable classification performance.
Meanwhile, a random forest model with 0.97 Area Under the ROC Curve (AUC) can nonetheless
produce a large number of false positives. While existing research (Chicco and Jurman, 2020;
Chicco et al., 2021; Chicco and Jurman, 2023) has pointed out the issue with accuracy and AUC
when classes are imbalanced and has advocated for the use of Matthew’s correlation coefficient
in place of AUC, it does not show the dynamics of how performance metric values change as
class imbalance changes and in what ways. To the best of my knowledge, this is among the first
studies that explicitly investigate the effect of prevalence level change on binary classification
evaluation metrics using statistical simulation.
The paper is organized as follows. In Section 2, we provide background information and
notation on model evaluation metrics for binary classification tasks. In Section 3, we look at
two prominent cases of applying machine learning models in the social sciences to illustrate how
model performance metric values can more or less truthfully reflect model performance when class
imbalance in the data changes. In Section 4, we present a simulation study on the relationships
between prevalence (the proportion of positive cases in a sample) and various model performance
metrics, explicitly varying the prevalence level in the data sample while keeping the sample
size constant and the relationships between the predictors and the outcome largely unchanged.
In Section 5, we offer concluding remarks and directions for future research.

2 What to know about model performance metrics


2.1 Background and Basic Notations
Performance metrics are statistics used to evaluate model performance, especially for out-of-
sample prediction tasks, as is commonly the case for machine learning applications. Below
we compare several performance metrics and look at how accurate and reliable they are at
capturing model prediction performance for data of different prevalence levels. We focus on
binary classification tasks in this study. First, let us introduce a few concepts. For every binary
prediction/classification task, the result falls into one of four categories:
• true positives (TP), positive cases correctly predicted as positive;
• false negatives (FN), positive cases incorrectly predicted as negative;
• true negatives (TN), negative cases correctly predicted as negative;
• false positives (FP), negative cases incorrectly predicted as positive.

The partition of these four categories can be represented in a 2 × 2 tabular format called a
confusion matrix:
\[
\begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix}.
\]
Table 1 below gives the number of observations for each type of case in the confusion matrix.

Table 1: Confusion Matrix

                         Prediction
                    Yes        No         Total
Actual   Yes        n11        n10        n1.
         No         n01        n00        n0.
         Total      n.1        n.0        n

It can be observed that TP + FN = n1., TN + FP = n0., TP + FP = n.1, and TN + FN = n.0.

The first prediction performance metric we look at is the most commonly used accuracy
measure, which has the following form:
\[
\text{accuracy} = \frac{TP + TN}{n_{1.} + n_{0.}} = \frac{TP + TN}{TP + FN + TN + FP}.
\]
Then we have the true positive rate (TPR), or recall:
\[
\text{True Positive Rate} = \frac{TP}{TP + FN}.
\]
And correspondingly we have the true negative rate (TNR):
\[
\text{True Negative Rate} = \frac{TN}{TN + FP}.
\]
Then there is the positive predictive value (PPV), or precision:
\[
\text{PPV} = \frac{TP}{TP + FP}.
\]
And correspondingly there is the negative predictive value (NPV):
\[
\text{NPV} = \frac{TN}{TN + FN}.
\]
There are also a few metrics that combine the true positive rate and true negative rate in certain
ways.
Balanced accuracy:
\[
\text{Balanced Accuracy (BA)} = \frac{\text{TPR} + \text{TNR}}{2}.
\]
Bookmaker informedness:
\[
\text{Bookmaker Informedness (BI)} = \text{TPR} + \text{TNR} - 1.
\]
The next performance metric we look at is the F1 score:
\[
\text{F1 score} = \frac{2}{\text{recall}^{-1} + \text{precision}^{-1}} = \frac{2TP}{2TP + FP + FN}.
\]
And the final model prediction performance metric we will examine is Matthew’s correlation
coefficient (MCC):
\[
\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
(minimum: −1, maximum: +1).
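To make these definitions concrete, the short Python sketch below (an added illustration, not code from the paper) computes all of the metrics above from the four confusion matrix counts; the example numbers are the GBM counts from Table 3 in Section 3.2.

```python
# Sketch: computing the Section 2.1 metrics from the counts TP, FN, TN, FP.
import math

def confusion_matrix_metrics(tp, fn, tn, fp):
    """Return the confusion matrix metrics of Section 2.1 as a dictionary."""
    n = tp + fn + tn + fp
    tpr = tp / (tp + fn)              # recall / sensitivity
    tnr = tn / (tn + fp)              # specificity
    ppv = tp / (tp + fp)              # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / n
    ba = (tpr + tnr) / 2              # balanced accuracy
    bi = tpr + tnr - 1                # bookmaker informedness
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv,
            "Accuracy": accuracy, "BA": ba, "BI": bi, "F1": f1, "MCC": mcc}

# Example: the GBM column of Table 3 (TP=64, FN=52, TN=7018, FP=6)
print(confusion_matrix_metrics(64, 52, 7018, 6))
```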

2.2 Alternative Derivation of confusion matrix metrics


As mentioned in the literature (Kruschke, 2015; Luque et al., 2019), all entries of a confusion
matrix and confusion matrix metrics can be derived from the four quantities of sample size n,
prevalence, TPR and TNR. To show that this is indeed the case, below we redefine the confusion
matrix metrics introduced above in terms of these four quantities and investigate how changes
in one or more of these four quantities can change values of the confusion matrix performance
metrics. As balanced accuracy and bookmaker informedness are already defined in terms of
TPR and TNR, we do not need to derive another expression for these two. We use ϕ to represent
prevalence, with
\[
\phi = \frac{n_{1.}}{n} = \frac{TP + FN}{n}.
\]

\[
\text{accuracy} = \frac{TP + TN}{n} = \text{TPR} \cdot \phi + \text{TNR} \cdot (1 - \phi). \tag{1}
\]
And
\[
\text{F1 score} = \frac{2TP}{2TP + FP + FN}
= \frac{2\,\text{TPR}\cdot\phi\cdot n}{2\,\text{TPR}\cdot\phi\cdot n + (1-\text{TNR})(1-\phi)\,n + (1-\text{TPR})\,\phi\, n}
= \frac{2}{2 + \frac{1-\text{TNR}}{\text{TPR}}\cdot\frac{1-\phi}{\phi} + \frac{1-\text{TPR}}{\text{TPR}}}.
\]
Therefore,
\[
\text{F1 score} = \frac{2}{2 + \frac{1}{\text{TPR}}\left((1-\text{TNR})\frac{1-\phi}{\phi} + 1 - \text{TPR}\right)}. \tag{2}
\]

An alternative expression of Matthew’s correlation coefficient is
\[
\text{MCC} = \sqrt{\text{PPV}\times\text{TPR}\times\text{TNR}\times\text{NPV}} - \sqrt{\text{FDR}\times\text{FNR}\times\text{FPR}\times\text{FOR}},
\]
where FOR (false omission rate) = 1 − NPV, FDR (false discovery rate) = 1 − PPV,
FPR (false positive rate) = 1 − TNR, and FNR (false negative rate) = 1 − TPR.
Then we have
\[
\text{PPV} = \frac{\text{TPR}\cdot\phi\cdot n}{\text{TPR}\cdot\phi\cdot n + (1-\text{TNR})(1-\phi)\,n}
= \frac{\text{TPR}\cdot\phi}{\text{TPR}\cdot\phi + (1-\text{TNR})(1-\phi)}, \tag{3}
\]
\[
\text{NPV} = \frac{\text{TNR}(1-\phi)\cdot n}{\text{TNR}(1-\phi)\cdot n + (1-\text{TPR})\,\phi\, n}
= \frac{\text{TNR}(1-\phi)}{\text{TNR}(1-\phi) + (1-\text{TPR})\,\phi}. \tag{4}
\]
\[
\Rightarrow \text{MCC} = \sqrt{\frac{\text{TPR}^2\cdot\phi}{\text{TPR}\cdot\phi + (1-\text{TNR})(1-\phi)} \cdot \frac{\text{TNR}^2(1-\phi)}{\text{TNR}(1-\phi) + (1-\text{TPR})\,\phi}}
- \sqrt{\frac{(1-\text{TNR})^2(1-\phi)}{\text{TPR}\cdot\phi + (1-\text{TNR})(1-\phi)} \cdot \frac{(1-\text{TPR})^2\,\phi}{\text{TNR}(1-\phi) + (1-\text{TPR})\,\phi}}
\]
\[
= \frac{\text{TPR} + \text{TNR} - 1}{\sqrt{\left(\text{TPR}\cdot\frac{\phi}{1-\phi} + 1 - \text{TNR}\right)\left(\text{TNR}\cdot\frac{1-\phi}{\phi} + 1 - \text{TPR}\right)}}. \tag{5}
\]
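The expressions above are easy to check numerically. The following sketch (an added illustration with hypothetical counts, not code from the paper) confirms that the accuracy, F1 score and MCC computed from prevalence, TPR and TNR via Equations (1), (2) and (5) match the values computed directly from the confusion matrix counts.

```python
# Sketch: verifying Equations (1), (2) and (5) against direct confusion matrix values.
import math

def metrics_from_prevalence(phi, tpr, tnr):
    accuracy = tpr * phi + tnr * (1 - phi)                                  # Eq. (1)
    f1 = 2 / (2 + (1 / tpr) * ((1 - tnr) * (1 - phi) / phi + 1 - tpr))      # Eq. (2)
    mcc = (tpr + tnr - 1) / math.sqrt(
        (tpr * phi / (1 - phi) + 1 - tnr) * (tnr * (1 - phi) / phi + 1 - tpr)
    )                                                                       # Eq. (5)
    return accuracy, f1, mcc

# Hypothetical counts: TP=80, FN=20, TN=180, FP=20 -> n=300, phi=1/3, TPR=0.8, TNR=0.9
tp, fn, tn, fp = 80, 20, 180, 20
n = tp + fn + tn + fp
phi, tpr, tnr = (tp + fn) / n, tp / (tp + fn), tn / (tn + fp)
print(metrics_from_prevalence(phi, tpr, tnr))

# Direct computation from the counts for comparison
acc = (tp + tn) / n
f1 = 2 * tp / (2 * tp + fp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print((acc, f1, mcc))
```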

2.3 Relationships between prevalence and model evaluation metrics


To investigate the change in the values of confusion matrix performance metrics as a result of
change in prevalence level, we start from the simplest case. If n, TPR and TNR are all kept
constant, then changing prevalence will not change the values of balanced accuracy and book-
maker informedness as these two metrics only depend on the sum of TPR and TNR. However,
the values of accuracy, F1 score and Matthew’s correlation coefficient could change, as each of
these three metrics depends on TPR and TNR as well as prevalence, as detailed below.
As accuracy = TPR · ϕ + TNR · (1 − ϕ), if TPR is greater than TNR, then accuracy is
monotonically increasing with prevalence, otherwise accuracy is monotonically decreasing with
prevalence.
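As a hypothetical numerical illustration (values chosen for clarity, not taken from the data): holding TPR = 0.6 and TNR = 0.8 fixed, Equation (1) gives
\[
\text{accuracy} = 0.6 \cdot 0.5 + 0.8 \cdot 0.5 = 0.70 \ \text{at}\ \phi = 0.5,
\qquad
\text{accuracy} = 0.6 \cdot 0.2 + 0.8 \cdot 0.8 = 0.76 \ \text{at}\ \phi = 0.2,
\]
so with TNR greater than TPR, accuracy is higher at the lower prevalence level, consistent with accuracy being monotonically decreasing in prevalence in this case.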
For the F1 score, as is shown in Appendix A, it is monotonically increasing with prevalence
when n, TPR, TNR are kept constant. This also turns out to be true when TPR and TNR are
not constant, as can be observed in the subsequent empirical analysis.
Also as is shown in Appendix B, when prevalence increases while n, TPR, TNR are kept
constant, we cannot determine the direction of change for Matthew’s correlation coefficient be-
cause the two quantities PPV and NPV (and correspondingly FDR and FOR) change in opposite
directions.
In reality, when prevalence, namely the proportion of positive cases in a data sample, changes,
both TPR and TNR are likely to change, and the direction of change for TPR and TNR is not
guaranteed to be either increasing or decreasing. This is the case because a machine
learning model’s ability to accurately predict either the positive class (true positive rate
(TPR)) or the negative class (true negative rate (TNR)) is determined by the relationships
between the various predictors and the outcome of interest. It is not straightforward to determine
and quantify how a change in prevalence could influence these relationships and in what ways.
Assuming the sample size n is constant, as prevalence increases, TPR and TNR could
change in either direction. Detailing the implications for model performance metrics as a result of
a change in prevalence likely requires analysis that takes into account the specific applied setting;
below we show one way to overcome this challenge by conducting a simulation study.
One general rule that is expected to hold regardless of applied settings is that as prevalence
goes to extremes, for example prevalence → 0, the number of positive cases in the sample becomes
too small to sufficiently train the classifier to detect the positive cases, especially for an out-of-
sample test set. The same is true when prevalence → 1: the proportion of negative cases becomes
so small that the classifier cannot be trained with enough data variation on the predictors and
on the relationships between the predictors and the outcome to accurately predict the negative
outcome.
Meanwhile, from the definitions of the performance metrics listed above, it is also easy to
observe that Matthew’s correlation coefficient (MCC) is the only metric that considers all four
possible prediction outcomes of the confusion matrix in its numerator; therefore, it appears to be
the only metric that fully informs the performance of a binary classifier. In essence, Matthew’s
correlation coefficient represents the correlation between the actual labels in the data and the
predicted labels for both positive and negative cases. In fact, Matthew’s correlation coefficient is
the Pearson’s correlation coefficient between two binary variables as is shown in Appendix D.
Previous research indicates that when faced with data having imbalanced proportion of pos-
itive or negative cases, such as predicting the occurrence of civil war onset or a rare disease, only
Matthew’s correlation coefficient can truthfully inform how a model performs while other perfor-
mance metrics could give misleading information about model prediction performance (Chicco
and Jurman, 2020). In the following, we further investigate the use of model performance metrics
listed above as well as the commonly used Area Under the ROC Curve (AUC) for balanced and
unbalanced data. We also display the results of a simulation study that showcases the dynamics
of changing prevalence level and its impact on values of binary classification metrics.

3 Empirical Analysis
3.1 Case 1: Crime Recidivism
For the case with more or less balanced positive and negative classes, we leverage the Broward
County data set used in Dressel and Farid (2018) and Bansak (2019). The data set consists of
individuals arrested in Broward County, Florida between 2013 and 2014. The sample size is 6214,
with 2775 positive cases (individuals who reoffended) and 3439 negative cases (individuals who
did not reoffend).²
The predictors for the classifier include seven features of the defendants: gender, age, number of
juvenile misdemeanors, number of juvenile felonies, number of prior (nonjuvenile) crimes, crime
degree, and crime charge (Dressel and Farid, 2018). In the following, we train six commonly used
machine learning models and see how the different performance metrics listed above capture the
prediction performance of these six different models: logistic regression (Logit), random forest
(RF), k-nearest neighbors (KNN), linear discriminant analysis (LDA), gradient boosting machine
(GBM), support vector machine with radial basis function kernel (SVM). The sample data are
randomly split into 80 percent training data and 20 percent test data, and 10-fold cross-validation
is used in the model training process.
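The models were trained with standard tooling; as a rough illustration (the paper does not provide its code, so the library, functions and settings below are assumptions), a scikit-learn version of the split, cross-validated training, and test-set evaluation might look like the following, with synthetic data standing in for the Broward County features.

```python
# Hedged sketch (assumed, not the authors' code): 80/20 split, 10-fold CV during training,
# and test-set metrics at the default 0.5 cutoff.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

# Synthetic stand-in: 6214 cases, 7 features, roughly 45 percent positive prevalence.
X, y = make_classification(n_samples=6214, n_features=7, weights=[0.55], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier()                           # one of the six models (GBM)
cv_scores = cross_val_score(model, X_train, y_train, cv=10)    # 10-fold cross-validation
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP, FN, TN, FP:", tp, fn, tn, fp)
print("MCC:", matthews_corrcoef(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```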
Table 2 below lists the confusion matrix entries for the six different model predictions (first
four rows). It also lists the values of the model performance metrics discussed above. The results
indicate that the true positive rates on average are about 20 percent lower than the true negative
rates as the proportion of positive cases is about 10 percent lower than the proportion of negative
cases. The differences in prediction performance across the six models are small regardless of the
performance metrics used. As TPR, TNR, PPV, NPV concern positive or negative cases only
for either actual or predicted labels, we focus more on accuracy, BA, BI, F1 score and MCC as
they care about model performance on both positive and negative cases. Based on accuracy, BA,
BI, F1 score and MCC, the two best performing models are GBM and random forest, the worst
performing one is KNN, and the other models perform somewhere in between. This is the case
across all five metrics. The AUC values indicate a similar pattern for the prediction performance
of the six models; the main difference is that the SVM model performs much worse according
to AUC.³

² We use the replication data in Bansak (2019).
³ AUC calculations are based on various classification threshold values, while all other performance metric calculations are based on the 0.5 cutoff.

Table 2: Crime Recidivism Prediction Results


Metrics Logit RF KNN LDA GBM SVM
TP 292 311 296 286 339 295
FN 270 251 266 276 223 267
TN 537 543 521 543 517 532
FP 144 138 160 138 164 149
TPR 0.52 0.55 0.53 0.51 0.60 0.52
TNR 0.79 0.8 0.77 0.8 0.76 0.78
PPV 0.67 0.69 0.65 0.67 0.67 0.66
NPV 0.67 0.68 0.66 0.66 0.7 0.67
Accuracy 0.67 0.69 0.66 0.67 0.69 0.67
BA 0.65 0.68 0.65 0.65 0.68 0.65
BI 0.31 0.35 0.29 0.31 0.36 0.31
F1 Score 0.59 0.62 0.58 0.58 0.64 0.59
MCC 0.32 0.36 0.3 0.32 0.37 0.32
AUC 0.72 0.73 0.69 0.72 0.75 0.61

3.2 Case 2: Civil War Onset Prediction


Recent debates on civil conflict prediction using machine learning models are closely related to
the question of performance metrics, in particular the use of Area under the ROC Curve (AUC)
(Muchlinski et al., 2016; Wang, 2019; Blair and Sambanis, 2020; Beger et al., 2021; Blair and
Sambanis, 2021). Therefore, for the unbalanced case, we use the Civil War Data (CWD), which
record civil conflict cases annually for all recognized countries in the world from 1945 to 2000
(Hegre and Sambanis, 2006; Muchlinski et al., 2016). The data set has 7024 peace instances and
116 civil war instances; the outcome of interest is thus heavily imbalanced, with positive cases
constituting only about 1.6 percent of the data sample. As is the case with the
balanced data in 3.1 above, we again train six different models for predicting civil war onset:
logistic regression (Logit), random forest (RF), k-nearest neighbors (KNN), linear discriminant
analysis (LDA), gradient boosting machine (GBM), support vector machine with radial basis
function kernel (SVM). In the model training process, we again use 10-fold cross-validation. As
there are 90 predictors used in the models, we do not list them here.
Table 3 below lists the confusion matrix entries for the six different model predictions.
It also lists the corresponding values of the model performance metrics. It can be observed
that all models show very high prediction accuracy of 0.98 or higher except the random forest
model. The random forest model shows a prediction accuracy level of about 76 percent, which
is significantly lower than the other five models. However, the random forest model is the only
model which correctly predicts all instances of civil war onset, while among the other five models,
the GBM model is the only one that can successfully predict over half of the positive cases. This
in one way shows how misleading the accuracy measure can be when data is highly imbalanced.
It is overwhelmingly dominated by the majority class of no civil war and cannot reflect model
classification performance on the positive case of civil war actually happening.
The commonly used Area Under the ROC Curve (AUC) also gives misleading information
about model classification performance. The LDA model has the lowest AUC of 0.88 among
the six models. Yet the LDA model actually correctly predicts more positive cases of civil war
happening than the logistic regression model, the KNN model and the SVM model, though all
four models correctly predict less than 20 percent of the positive cases. Yet all four models have
fairly high AUC values. In fact, all six models have high AUC values even though they have
excellent classification accuracy on either the positive or the negative class but not on both at the same time.
The random forest model has a perfect record for the positive class but produces excessively large
number of false positives. In short, AUC values cannot reflect model classification performance
on both positive and negative classes.

Table 3: Civil War Onset Prediction Results


Metrics Logit RF KNN LDA GBM SVM
TP 12 116 4 20 64 19
FN 104 0 112 96 52 97
TN 7019 5281 7020 6987 7018 7024
FP 5 1743 4 37 6 0
TPR 0.1 1 0.03 0.17 0.55 0.16
TNR 1 0.75 1 0.99 1 1
PPV 0.71 0.06 0.5 0.35 0.91 1
NPV 0.99 1 0.98 0.99 0.99 0.99
Accuracy 0.98 0.76 0.98 0.98 0.99 0.99
BA 0.55 0.88 0.52 0.58 0.78 0.58
BI 0.1 0.75 0.03 0.17 0.55 0.16
F1 Score 0.18 0.12 0.06 0.23 0.69 0.28
MCC 0.27 0.22 0.13 0.24 0.71 0.4
AUC 0.92 0.97 0.96 0.88 0.995 0.99

By contrast, Matthew’s correlation coefficient gives a more truthful and complete
indication of model prediction performance for both positive and negative cases. For instance,
the GBM model correctly predicts 55 percent of positive cases and 99.9 percent of negative cases,
resulting in a Matthew’s correlation coefficient value of about 0.71, much higher than that of
the random forest model, reflecting the fact that the GBM model performs comparatively well
for both positive and negative cases. The Matthew’s correlation coefficient value of 0.22 for the
random forest model reflects its poor performance on the negative class (only 75 percent accuracy,
given that over 98 percent of the data is negative).
One thing worth noting is that for all the model predictions presented above, we are using
50 percent as the classification decision boundary or cutoff (the AUC calculations, by contrast,
are based on various classification threshold values); that is, a case is classified as positive
if the predicted probability of the positive class is above 50 percent. If we lower the cutoff value
to 34 percent, that is, if a case is predicted as positive when the probability of civil war onset exceeds
34 percent, then the GBM model can correctly predict 80 true positive cases, reaching a true
positive rate of 67 percent while maintaining a true negative rate of 99.8 percent, resulting in
a Matthew’s correlation coefficient value of 0.767. Meanwhile, the same change results in the
random forest model continuing to have perfect prediction for positive cases but now having
an even larger number of 2822 false positive cases and an even lower Matthew’s correlation
coefficient value of 0.154. In other words, instead of producing a huge number of false positives,
an alternative strategy to improve model prediction of civil war instances could be to lower the
decision cutoff or alarm threshold.
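As a brief illustration of this alternative strategy (an added sketch with placeholder names, not the authors’ code), applying a different cutoff only requires thresholding the predicted probability of the positive class rather than using the default 0.5 decision rule.

```python
# Hedged sketch: evaluating a classifier at a custom probability cutoff (e.g. 0.34).
# `model`, `X_test`, `y_test` are placeholders for a fitted classifier with predict_proba
# and a held-out test set; they are not defined in the paper.
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def evaluate_at_cutoff(model, X_test, y_test, cutoff=0.34):
    proba_pos = model.predict_proba(X_test)[:, 1]        # probability of the positive class
    y_pred = (proba_pos >= cutoff).astype(int)           # threshold at the chosen cutoff
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "MCC": matthews_corrcoef(y_test, y_pred)}
```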

4 Simulation study
While we have shown above how Matthew’s correlation coefficient can produce a more truthful
and complete measure of model prediction performance than accuracy and Area Under the ROC
Curve (AUC) in binary classification tasks especially when the outcome of interest is unbalanced,
we are still unsure exactly how change in the prevalence level of a data sample affects the
different performance metrics. As aforementioned, all confusion matrix performance metrics can
be derived from the four basic quantities of sample size n, prevalence, TPR, and TNR. Here,
through simulation, we show how change in prevalence could affect TPR and TNR, which in
turn affect values of the other performance metrics, while the sample size n is kept constant and
thus we are able to limit the influence of factors associated with sample size.
The following simulation is done with the Broward County data set (Dressel and Farid,
2018; Bansak, 2019) used above. The original data set has 2775 positive cases and 3439 negative
cases, which equals a prevalence level of about 0.45.⁵
with changing prevalence levels, first, we iteratively drop 30 randomly sampled positive cases
while adding 30 randomly sampled negative cases at the same time, thus incrementally reducing
prevalence level while keeping the sample size n constant. In total, we run 76 such iterations
in this first phase, which along with the original data sample give us 77 instances for each
performance metric and each model. The 77th data sample has 495 positive cases and 5719
negative cases, equaling a prevalence level of about 0.08. We stop at this iteration as the true
positive rate decreases to almost 0 while the true negative rate increases to almost 100 percent.
Second, we do the opposite. From the original data sample, we iteratively drop 30 randomly
sampled negative cases while simultaneously adding 30 randomly sampled positive cases, thus
incrementally increasing prevalence level and keeping the sample size constant at the same time.
In this second phase, we run 79 iterations and stop when the true positive rate increases to
almost 100 percent while the true negative rate decreases to almost 0. The 79th data sample has
5145 positive cases and 1069 negative cases, equaling a prevalence level of about 0.83.
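A rough sketch of this resampling scheme is given below. It is an added illustration rather than the paper’s own code, and it assumes that “adding” cases means resampling rows with replacement from the existing class (the paper does not specify the source of the added cases); the data frame and column names are placeholders.

```python
# Hedged sketch of one direction of the resampling scheme described above.
import numpy as np
import pandas as pd

def shift_prevalence_down(df, label_col="label", step=30, iterations=76, seed=0):
    """Yield data samples with progressively lower prevalence but constant sample size n."""
    rng = np.random.default_rng(seed)
    current = df.copy()
    for _ in range(iterations):
        pos_idx = current.index[current[label_col] == 1]
        neg_idx = current.index[current[label_col] == 0]
        dropped = rng.choice(pos_idx, size=step, replace=False)   # remove 30 positive cases
        added = rng.choice(neg_idx, size=step, replace=True)      # duplicate 30 negative cases
        current = pd.concat([current.drop(index=dropped), current.loc[added]],
                            ignore_index=True)
        yield current
# The opposite direction (increasing prevalence) swaps the roles of the two classes.
```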
As units are randomly dropped from or added to the positive and negative classes, the
simulation steps described above are able to keep the relationships among the predictors and
between the predictors and the outcome largely unchanged. In fact, the bivariate correlations
among all 8 variables change only minimally. The correlation heat maps presented in the
Supplementary Information show the bivariate correlations among the predictors and the
outcome for the original data sample as well as 79 of the simulated data samples (every other
one selected from all the data samples). It is clear that the correlations among all the
variables largely stay the same across all the simulated data samples.⁶

⁵ Here prevalence level is the prevalence of the sample data as a whole. However, all the above definitions of
performance metrics in 2.1 refer to the prevalence level of the test set. In this study, as the test sets are randomly
sampled from the whole sample data, the prevalence levels of the test sets are the same as those of the whole sample
data (differences are negligible). Prevalence levels of the simulation results presented below are for the test sets. It
will be interesting to investigate the implications when the prevalence levels of the training set and the out-of-sample
test set are different.
⁶ The other 76 simulated data samples have been verified to show similar patterns.


4.1 Simulation Results: Prevalence, TPR and TNR


Figure 1 below shows the relationships between changing prevalence level and true positive rate
and true negative rate for the six different models. The curves are LOESS-smoothed ones. It can
be observed that across the six different models, decreasing prevalence level corresponds with
decreasing level of true positive rate and increasing level of true negative rate. And the two rates
intersect at around the prevalence level of 0.5. This shows that when the prevalence level is 50
percent, the true positive rate and the true negative rate are similar in size, and in this example,
their values at the intersection point are about 0.6 across the 6 different models. The rate of
increase in true negative rate is higher compared with the rate of decrease in true positive rate
before the intersection point as the prevalence level goes down, while the opposite is true after
the intersection point.
Figure 1: Prevalence, TPR and TNR. [Figure: six panels (Logistic Regression, Random Forest, KNN, LDA, GBM, svmRadial) plotting TPR and TNR against prevalence.]

This shows that, keeping the overall sample size n constant, the sample size of each class
plays a dominant role in model performance for that class in terms of true positive rate and true
negative rate. As the proportion of positive cases decreases, the models’ ability to correctly predict
positive cases decreases. Meanwhile, as the proportion of negative cases increases, the models’ ability
to correctly predict negative cases increases, and the rates of change for true positive rate and true
negative rate move in opposite directions after the intersection point of 0.5 prevalence level.

4.2 Simulation Results: Prevalence, PPV and NPV


Figure 2 below shows the relationships between changing prevalence level and positive
predictive value (PPV) and negative predictive value (NPV) for the six different models. It can
be observed that the general trend is that PPV decreases and NPV increases as prevalence level
decreases. For all models, there are significant fluctuations in PPV values when prevalence level
is below 0.4, while the NPV values show most fluctuations when prevalence level is above 0.6.
The two curves for PPV and NPV again intersect at around the 0.5 prevalence level, indicating
that 0.5 prevalence level is a critical point for PPV and NPV just as it is for TPR and TNR.
Figure 2: Prevalence, PPV and NPV. [Figure: six panels (Logistic Regression, Random Forest, KNN, LDA, GBM, svmRadial) plotting PPV and NPV against prevalence.]

Mathematically, from the derivations of the confusion matrix metrics in 2.2, we know that
\[
\text{PPV} = \frac{\text{TPR}\cdot\phi}{\text{TPR}\cdot\phi + (1-\text{TNR})(1-\phi)}
= \frac{1}{1 + \frac{1-\text{TNR}}{\text{TPR}}\cdot\frac{1-\phi}{\phi}}.
\]
As the rate of change in TPR is similar to the rate of change in TNR and to the rate of change in
prevalence itself across the whole spectrum of prevalence, and as prevalence decreases
$(1-\text{TNR})\downarrow$, $\frac{1}{\text{TPR}}\uparrow$, $\frac{1-\phi}{\phi}\uparrow$, we know that as the prevalence level decreases,
\[
1 + \frac{1-\text{TNR}}{\text{TPR}}\cdot\frac{1-\phi}{\phi} \uparrow \;\Rightarrow\; \text{PPV} \downarrow.
\]
Similarly,
\[
\text{NPV} = \frac{\text{TNR}(1-\phi)}{\text{TNR}(1-\phi) + (1-\text{TPR})\,\phi}
= \frac{1}{1 + \frac{1-\text{TPR}}{\text{TNR}}\cdot\frac{\phi}{1-\phi}},
\]
and $(1-\text{TPR})\uparrow$, $\frac{1}{\text{TNR}}\downarrow$, $\frac{\phi}{1-\phi}\downarrow$
$\Rightarrow \frac{1-\text{TPR}}{\text{TNR}}\cdot\frac{\phi}{1-\phi}\downarrow \Rightarrow \text{NPV}\uparrow$.
The observed patterns of PPV and NPV in Figure 2 largely correspond with their mathe-
matically expected behavior.
It is also noticeable that for changing levels of prevalence, the values of TPR, TNR are more
stable than those of PPV and NPV especially at extreme levels of prevalence. The likely reason is
that denominators of both TPR and TNR are known before predictions are made, therefore, only
the numerators of TPR and TNR are changing as a result of change in prevalence. For PPV and
NPV, things are different. Both the denominator and the numerator of PPV and NPV are only
known after predictions are made. Therefore, both denominator and numerator are changing as a
result of change in prevalence level for PPV and NPV. In other words, PPV and NPV vary more
because of a particular prediction task while TPR and TNR vary less and are proportionately
more affected by the independent influence of prevalence level. Existing research (for example,
Chicco and Jurman, 2023) often ignores this simple yet important difference between TPR and
TNR on the one hand and PPV and NPV on the other in terms of their stability under changes
in prevalence level. As is shown in
Figure 1 and Figure 2, PPV and NPV are less stable than TPR and TNR, and are more likely
to be subject to idiosyncratic noise of a particular prediction task especially at extreme levels of
prevalence.

4.3 Simulation Results: Prevalence and Performance Metrics


Figure 3 shows the relationships between changing prevalence level and accuracy, balanced ac-
curacy and bookmaker informedness for the six different models. It can be observed that the
accuracy measure shows a convex curve, similar in shape to a parabola, with the lowest point
corresponding to a prevalence level of about 0.5. Before this midpoint, accuracy goes down as the
prevalence level goes down, while after the 0.5 midpoint, accuracy goes up as the prevalence level
goes down. Taking the patterns of true positive rate and true negative
rate in Figure 1 above into consideration, it can be understood that above the midpoint of 0.5
prevalence level, the accuracy measure is dominated by the positive class and decreasing level
of true positive rate while below the 0.5 prevalence level, the accuracy measure is dominated by
the negative class and increasing level of true negative rate. As accuracy is a direct account of
proportion of correctly classified cases of either positive or negative class, each observation has
an equal weight of 1, and so accuracy is dominated by whichever class is the majority class.
Both the balanced accuracy and bookmaker informedness can be straightforwardly derived
from the sum of TPR and TNR; as a result, these two measures show similar patterns but different
values in Figure 3 as prevalence level decreases. The balanced accuracy shows a concave curve,
which looks like a mirror image of the accuracy measure curve along the approximately 0.65 value
line, the lowest value for the accuracy measure and the highest value for the balanced accuracy
measure. Therefore, in direct contrast to the accuracy measure, the balanced accuracy measure
is dominated by the negative class and increasing true negative rate when prevalence is above
0.5, while it is dominated by the positive class and decreasing true positive rate when prevalence
is below 0.5. Therefore, it appears that balanced accuracy as well as bookmaker informedness
are dominated by model classification accuracy for the minority class and by whichever one of
TPR and TNR has the larger rate of change.

Figure 3: Prevalence and Performance Metrics. [Figure: six panels (Logistic Regression, Random Forest, KNN, LDA, GBM, svmRadial) plotting Accuracy, BA and BI against prevalence.]

Figure 4 below shows the relationships between changing prevalence and F1 score, AUC and
MCC for the six different models. The values of F1 score show a continuously downward trend as
the prevalence level goes down. This pattern also corresponds with the result of a mathematical
derivation for F1 score as is shown in Appendix C. The rate of decrease in F1 score is smaller
when prevalence level is above the midpoint of 0.5 while the decrease is steeper when prevalence
level is below 0.5. This is understandable as the rate of decrease in TPR is smaller than the rate
of increase in TNR before the 0.5 point but larger after the 0.5 point.
Similar to balanced accuracy, Matthew’s correlation coefficient also shows a concave
curve with the highest data point at around the 0.5 prevalence level. The curvature of the distribution
of data points is less obvious for the random forest model when the prevalence level is above 0.5.
We can understand the behavior of Matthew’s correlation coefficient from the behavior of
TPR, TNR, PPV and NPV. From the patterns shown in Figures 1 and 2 and the mathematical
derivations above, we know that TPR and TNR move in different directions, and the same is
true for PPV and NPV. In addition, we have no knowledge about the relative difference in rate
of change for PPV and NPV. Therefore, based on the given information, we cannot analytically
derive the direction of change for Matthew’s correlation coefficient as prevalence level decreases.
From the concave trend of the MCC data points in Figure 4 and the alternative expression
\[
\text{MCC} = \sqrt{\text{PPV}{\downarrow}\times\text{TPR}{\downarrow}\times\text{TNR}{\uparrow}\times\text{NPV}{\uparrow}}
- \sqrt{\text{FDR}{\uparrow}\times\text{FNR}{\uparrow}\times\text{FPR}{\downarrow}\times\text{FOR}{\downarrow}},
\]
however,
we can sense that above the prevalence level of 0.5, the increasing TNR and NPV dominate the
values of MCC while below the prevalence level of 0.5, the decreasing TPR and PPV dominate the
values of MCC. Therefore, similar to the balanced accuracy and bookmaker informedness, MCC
is dominated by the minority class. In addition, MCC shows relatively smaller values compared
with balanced accuracy in Figure 3, though both measures take model classification performance

on both classes into account.


Figure 4: Prevalence and Performance Metrics. [Figure: six panels (Logistic Regression, Random Forest, KNN, LDA, GBM, svmRadial) plotting F1 score, AUC and MCC against prevalence.]

As the AUC value shows the performance of a classifier at all possible classification thresholds,
it cannot be constructed from a single confusion matrix based on one classification threshold and,
therefore, does not allow a direct comparison with the other performance metrics presented above.
However, AUC values show the most obvious pattern in Figure 4 compared with the other metrics:
AUC values are mostly flat regardless of the change in prevalence level, except for small
drop-offs at certain small intervals of prevalence for the SVM model, the random forest
model and the LDA model. Because ROC curves and the corresponding AUC values try to strike
an optimal balance between a high true positive rate and a low false positive rate (1 − TNR), the
AUC is a measure that takes into account both TPR and TNR at all possible threshold levels but
does not take into account prevalence level or class imbalance. Therefore, for highly imbalanced
data such as the CWD data shown above, AUC values could give a misleading and overly optimistic
evaluation of model classification performance, as is shown in Table 3 above. Existing research
has suggested that precision-recall curves could be a better alternative to ROC curves in low
prevalence level cases (Ozenne et al., 2015).
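To illustrate this suggestion, the sketch below (an added illustration on synthetic data, not from the paper) compares the ROC AUC with the precision-recall summary (average precision) on a low-prevalence sample, where the two can diverge sharply.

```python
# Hedged illustration: on a highly imbalanced synthetic sample, ROC AUC can remain
# fairly high while average precision reveals weaker performance on the rare positive class.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(10000) < 0.02).astype(int)        # roughly 2 percent prevalence
scores = rng.normal(loc=y * 1.5, scale=1.0)       # positives score higher on average

print("ROC AUC:          ", roc_auc_score(y, scores))
print("Average precision:", average_precision_score(y, scores))
```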

Figure 5: Prevalence, Cutoff, and True Positive Rate. [Figure: six heat-map panels (Logistic Regression, Random Forest, KNN, LDA, GBM, svmRadial) of true positive rate, binned as (0,0.25], (0.25,0.5], (0.5,0.75], (0.75,1], over classification cutoff (x-axis) and prevalence (y-axis).]

Figure 6: Prevalence, Cutoff, and False Positive Rate. [Figure: six heat-map panels (Logistic Regression, Random Forest, KNN, LDA, GBM, svmRadial) of false positive rate, binned as (0,0.25], (0.25,0.5], (0.5,0.75], (0.75,1], over classification cutoff and prevalence.]

To further investigate the flat trend of AUC values shown in Figure 4, we look at how the
true positive rate and false positive rate change as a result of changes in both prevalence level and
classification cutoff value. Continuing to use the 156 data samples described above, we now make
test set model predictions based on a sequence of threshold values, incrementally from 0.25 to 0.75
with a step size of 0.01 (a total of 51 different threshold values). In Figures 5, 6, and 7, we present
heat maps of the true positive rate, the false positive rate, as well as the true positive rate minus
the false positive rate for different prevalence and classification cutoff values. It can be observed
that the false positive rate achieves high values in similar regions as the true positive rate, though
the regions where the true positive rate achieves high values (from 0.75 to 1) are significantly larger
than the corresponding regions for the false positive rate. In addition, true positive rate minus false
positive rate almost always achieves higher values around the 45 degree diagonal region in the
two-dimensional space of prevalence level and cutoff value. This helps explain why we see the
flat trend for AUC values in Figure 4 above, as AUC values are insensitive to prevalence level
change. The patterns hold for all six models.
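A rough sketch of this cutoff sweep is shown below (an added illustration with placeholder inputs, not the paper’s code): for each simulated test set’s predicted probabilities, TPR and FPR are computed over the grid of cutoff values.

```python
# Hedged sketch of the cutoff sweep behind Figures 5-7. `samples` is assumed to be a list
# of (y_true, proba_pos) pairs, one per simulated test set, where proba_pos holds the
# predicted probabilities of the positive class.
import numpy as np

def tpr_fpr_grid(samples, cutoffs=np.arange(0.25, 0.7501, 0.01)):
    """Return rows of (prevalence, cutoff, TPR, FPR, TPR - FPR)."""
    rows = []
    for y_true, proba_pos in samples:
        y_true = np.asarray(y_true)
        prevalence = y_true.mean()
        for c in cutoffs:
            y_pred = (np.asarray(proba_pos) >= c).astype(int)
            tp = np.sum((y_pred == 1) & (y_true == 1))
            fn = np.sum((y_pred == 0) & (y_true == 1))
            tn = np.sum((y_pred == 0) & (y_true == 0))
            fp = np.sum((y_pred == 1) & (y_true == 0))
            tpr, fpr = tp / (tp + fn), fp / (fp + tn)
            rows.append((prevalence, c, tpr, fpr, tpr - fpr))
    return rows
```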
Figure 7: Prevalence, Cutoff, and TPR minus FPR. [Figure: six heat-map panels (Logistic Regression, Random Forest, KNN, LDA, GBM, svmRadial) of TPR minus FPR, binned as (−0.1,0], (0,0.1], (0.1,0.2], (0.2,0.3], (0.3,0.4], over classification cutoff and prevalence.]

5 Discussion and Conclusion


The analysis presented above demonstrates that class imbalance significantly affects binary clas-
sification performance metric values regardless of which specific type of model is being used. At
different prevalence levels, different confusion matrix metrics are dominated by different classes
of data. Keeping the sample size n constant and the relationships among the predictors and
the outcome largely unchanged by iteratively dropping a certain number of randomly sampled
positive cases while adding the same number of randomly sampled negative cases simultaneously
(or vice versa), the general trend in the results presented above shows that as the number of
positive cases decreases, TPR decreases and TNR increases. In addition, PPV decreases and NPV
increases as prevalence level decreases. Moreover, PPV and NPV are less stable measures of
model performance than TPR and TNR under changes in prevalence.
The accuracy measure, a direct account of proportion of correctly classified cases regardless
of class, actually has higher values when the prevalence level is farther away from the 0.5 point,
that is when the data is more unbalanced. The accuracy measure therefore is dominated by the
majority class. The balanced accuracy and Matthew’s correlation coefficient, on the contrary,
have higher values when the sample data are closer to being balanced. Therefore, these two perfor-
mance metrics take model performance on both classes of cases into account and have lower
values when a prediction model has skewed performance, that is, excellent prediction ac-
curacy on the majority class but poor prediction accuracy on the minority class. The F1 score,
meanwhile, always decreases as the prevalence level decreases; that is, it appears to
have a monotonically increasing relationship with TPR and prevalence.
While practitioners using binary classification models would certainly favor models that cor-
rectly predict both the majority of positive instances and the majority of negative instances, in
practice, model performance is often not ideal on all classes of data. When faced with unsatisfac-
tory model performance, the first question applied machine learning researchers ask might
be whether the right model or the right model specification was picked. The analysis above shows
that the nature of the sample data, and more specifically the balance of classes as measured
by prevalence level, also has a fundamental impact on all confusion matrix performance met-
rics regardless of which specific type of model is being used. And when faced with imbalanced
data, the commonly used accuracy and AUC measures could both give misleading evaluation
of model performance, and other measures such as Matthew’s correlation coefficient that take
model performance on both classes of data into account could provide more truthful indication
of classification performance. For example, the random forest model used for the Civil War Data
above has an area under the ROC curve (AUC) value of 0.97. However, its very skewed perfor-
mance, 100 percent prediction accuracy for the positive class and about 75 percent accuracy for
the negative class (producing a large number of false positives), is not captured by the AUC
value. Its Matthew’s correlation coefficient value of 0.22, meanwhile, more truthfully reflects the
classification performance on both positive and negative classes.
The simulation study above is based on one data set only; therefore, whether the results
shown above can be generalized to other settings is worth further exploration. The general
patterns reflected in the results above, however, are expected to hold in other settings. Most
obvious is the pattern that decreasing number of cases of a certain class is associated with
decreasing level of classification accuracy for that class. The general rule in machine learning is
that a prediction model needs enough observations to train in order to pick up the signal rather
than the noise in the data. The results presented above show that we not only need enough data
to train a classifier to classify all cases but also need enough cases for each class to make accurate
classification for that particular class. And when data is class-imbalanced, accuracy and AUC
can give misleading indication of classification performance while alternative measures such as
Matthew’s correlation coefficient are more reliable measures of model performance.
In addition, in the simulation study above there are only seven predictor variables; therefore,
the models are still quite simple regardless of which specific type of classifier is being used.
Having more predictor variables and more complicated relationships among the predictors and
between the predictors and the outcome could certainly bring more complications to the analysis
results. Moreover, in this study we only consider binary classification tasks; similar analysis for
data sets with a multi-class outcome is one worthwhile avenue for future research. Furthermore, as
the four basic quantities of sample size n, prevalence, TPR, and TNR define all confusion matrix
performance metrics, developing novel classification measures that take these four quantities into
account is also a viable path to deepen our understanding of this research topic.
To conclude, regardless of which specific machine learning model is used for binary classifica-
tion tasks, the effect of class imbalance on model performance metrics should always be assessed
seriously before making any conclusions about model performance.

Appendices
A F1 score is monotonically increasing with prevalence when n, TPR, TNR are constants

\[
\text{F1 score} = \frac{2TP}{2TP + FP + FN} = \frac{2}{2 + \frac{FP}{TP} + \frac{FN}{TP}}.
\]
As TPR and TNR are constants, we have
\[
\text{TPR} = \frac{TP}{TP + FN} = a \;\Rightarrow\; TP = \frac{a}{1-a}\,FN, \qquad
\text{TNR} = \frac{TN}{TN + FP} = b \;\Rightarrow\; FP = \left(\frac{1}{b} - 1\right)TN, \qquad
n = TP + FN + TN + FP.
\]
If TPR = TP/(TP + FN) = a is a constant, then FN/TP is also a constant. Suppose prevalence ↑ and
the sample size n is constant; then TP ↑, FN ↑, TN ↓, FP ↓, therefore FP/TP ↓ ⇒ FP/TP + FN/TP ↓ ⇒
F1 score ↑. That is, the F1 score is monotonically increasing with prevalence when n, TPR, TNR are
constants.

B Matthew’s correlation coefficient as a result of prevalence change when n, TPR, TNR are constants

As is the case with the F1 score, suppose prevalence ↑ while n, TPR, TNR are constants; then
TP ↑, FN ↑, TN ↓, FP ↓, and PPV = TP/(TP + FP) ↑, NPV = TN/(TN + FN) ↓, FDR ↓, FOR ↑. Therefore,
\[
\text{MCC} = \sqrt{\text{PPV}\times\text{TPR}\times\text{TNR}\times\text{NPV}} - \sqrt{\text{FDR}\times\text{FNR}\times\text{FPR}\times\text{FOR}}
= \sqrt{\text{PPV}{\uparrow}\times a \times b \times \text{NPV}{\downarrow}} - \sqrt{\text{FDR}{\downarrow}\times(1-a)\times(1-b)\times\text{FOR}{\uparrow}}.
\]
Therefore, when prevalence ↑ while n, TPR, TNR are constants, we cannot determine the direction
of change for Matthew’s correlation coefficient, because the two quantities PPV and NPV (and
correspondingly FDR and FOR) change in opposite directions.

C F1 score as a result of prevalence change when only sample size n is constant

From the alternative definition of the F1 score above, we know that
\[
\text{F1 score} = \frac{2}{2 + \frac{1}{\text{TPR}}\left((1-\text{TNR})\frac{1-\phi}{\phi} + 1 - \text{TPR}\right)};
\]
in addition, ϕ ↓, TPR ↓, TNR ↑ ⇒ 1/TPR ↑, (1 − TNR) ↓, (1 − ϕ)/ϕ ↑, (1 − TPR) ↑.
Across the whole spectrum of prevalence, the rate of change in TPR is similar to the rate
of change in both TNR and prevalence itself, and as only (1 − TNR) goes down while all other
terms in the denominator of the F1 score go up, we have that ((1 − TNR)(1 − ϕ)/ϕ + 1 − TPR) ↑
and F1 score ↓ as the prevalence level decreases.

D Matthew’s correlation coefficient is the Pearson correlation coefficient of two binary variables

Using the notation of the confusion matrix entries in Table 1 above, let x be the binary variable of
actual labels and y the binary variable of predicted labels, so that
\[
\sum_{i=1}^{n} x_i y_i = n_{11}, \qquad \sum_{i=1}^{n} x_i = n_{1.}, \qquad \sum_{i=1}^{n} y_i = n_{.1},
\qquad \bar{x} = \frac{n_{1.}}{n}, \qquad \bar{y} = \frac{n_{.1}}{n}.
\]
For two binary variables x, y, the Pearson correlation coefficient is
\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}.
\]
For the numerator,
\[
\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
= n_{11} - \frac{n_{1.}\,n_{.1}}{n} = \frac{n\,n_{11} - n_{1.}\,n_{.1}}{n}.
\]
For the denominator, since $x_i^2 = x_i$ and $y_i^2 = y_i$ for binary variables,
\[
\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = n_{1.} - \frac{n_{1.}^2}{n} = \frac{n_{1.}(n - n_{1.})}{n},
\qquad
\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{n_{.1}(n - n_{.1})}{n},
\]
so that
\[
\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{n_{1.}\,n_{.1}(n - n_{1.})(n - n_{.1})}{n^2}.
\]
We therefore have
\[
r = \frac{n\,n_{11} - n_{1.}\,n_{.1}}{n} \cdot \frac{n}{\sqrt{n_{1.}\,n_{.1}(n - n_{1.})(n - n_{.1})}}
= \frac{n\,n_{11} - n_{1.}\,n_{.1}}{\sqrt{n_{1.}\,n_{.1}(n - n_{1.})(n - n_{.1})}},
\]
which is Matthew’s correlation coefficient. Substituting n11 = TP, n1. = TP + FN, n.1 = TP + FP,
n − n1. = FP + TN, and n − n.1 = FN + TN recovers the expression for MCC given in Section 2.1.
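This identity is easy to verify numerically. The following sketch (an added illustration, not part of the paper) compares scikit-learn’s matthews_corrcoef with the Pearson correlation of two binary label vectors.

```python
# Sketch: MCC equals the Pearson correlation between binary actual and predicted labels.
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                 # binary actual labels
y_pred = np.where(rng.random(1000) < 0.8, y_true,      # predictions agree 80% of the time
                  1 - y_true)

mcc = matthews_corrcoef(y_true, y_pred)
pearson = np.corrcoef(y_true, y_pred)[0, 1]
print(mcc, pearson)   # the two values coincide up to floating point error
```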

References
Bansak K (2019). Can nonexperts really emulate statistical learning methods? a comment on
“the accuracy, fairness, and limits of predicting recidivism”. Political Analysis, 27(3): 370–380.
Beger A, Morgan RK, Ward MD (2021). Reassessing the role of theory and machine learning in
forecasting civil conflict. Journal of Conflict Resolution, 65(7).
Blair RA, Blattman C, Hartman A (2017). Predicting local violence: Evidence from a panel
survey in liberia. Journal of Peace Research, 54(2): 298–312.
Blair RA, Sambanis N (2020). Forecasting civil wars: Theory and structure in an age of “big
data” and machine learning. Journal of Conflict Resolution, 64(10): 1885–1915.
Breiman L (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by
the author). Statistical Science, 16(3): 199 – 231.
Chicco D, Jurman G (2020). The advantages of the matthews correlation coefficient (MCC) over
f1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1): 6.
Chicco D, Tötsch N, Jurman G (2021). The matthews correlation coefficient (MCC) is more reli-
able than balanced accuracy, bookmaker informedness, and markedness in two-class confusion
matrix evaluation. BioData Mining, 14(1): 13.
De Diego IM, Redondo AR, Fernández RR, Navarro J, Moguerza JM (2022). General performance
score for classification problems. Appl Intell, 52(10): 12049–12063.
Dressel J, Farid H (2018). The accuracy, fairness, and limits of predicting recidivism. Science
Advances, 4(1): eaao5580.
García V, Sánchez JS, Mollineda RA (2012). On the suitability of numerical performance measures for class
imbalance problems. Proceedings of the 1st International Conference on Pattern Recognition
Applications and Methods, 310–313.
Hegre H, Sambanis N (2006). Sensitivity analysis of empirical results on civil war onset. The
Journal of Conflict Resolution, 50(4): 508–535.
Hicks SA, Strümke I, Thambawita V, Hammou M, Riegler MA, Halvorsen P, et al. (2022). On
evaluation metrics for medical applications of artificial intelligence. Sci Rep, 12(1): 5979.
Jadhav AS (2020). A novel weighted TPR-TNR measure to assess performance of the classifiers.
Expert Systems with Applications, 152: 113391.
Kruschke J (2015). Bayes’ rule: Doing Bayesian Data Analysis. Elsevier, Amsterdam.
Lavazza L, Morasca S (2023). Common problems with the usage of f-measure and accuracy
metrics in medical research. IEEE Access, 11: 51515–51526.
Lever J, Krzywinski M, Altman N (2016). Classification evaluation. Nature Methods, 13(8):
603–604.
Lin Z, Jung J, Goel S, Skeem J (2020). The limits of human predictions of recidivism. Science
Advances, 6(7): eaaz0652.
Luque A, Carrasco A, Martín A, De Las Heras A (2019). The impact of class imbalance in
classification performance metrics based on the binary confusion matrix. Pattern Recognition,
91: 216–231.
Muchlinski D, Siroky D, He J, Kocher M (2016). Comparing random forest with logistic regression
for predicting class-imbalanced civil war onset data. Political Analysis, 24(1): 87–103.
Neunhoeffer M, Sternberg S (2019). How cross-validation can go wrong and what to do about
it. Political Analysis, 27(1): 101–106.
Ozenne B, Subtil F, Maucort-Boulch D (2015). The precision–recall curve overcame the optimism
of the receiver operating characteristic curve in rare diseases. Journal of Clinical Epidemiology,
68(8): 855–859.
Blair RA, Sambanis N (2021). Is theory useful for conflict prediction? a response to beger,
morgan, and ward. Journal of Conflict Resolution, 65(7): 1427–1453.
Chicco D, Jurman G (2023). The matthews correlation coefficient (MCC) should replace the
ROC AUC as the standard metric for assessing binary classification. BioData Mining, 16(1).
Wang Y (2019). Comparing random forest with logistic regression for predicting class-imbalanced
civil war onset data: A comment. Political Analysis, 27(1): 107–110.
Zhu Q (2020). On the performance of matthews correlation coefficient (MCC) for imbalanced
dataset. Pattern Recognition Letters, 136: 71–80.
