Department of Statistics
Uppsala University
2020
ABSTRACT
In this thesis, the performance of two over-sampling techniques,
SMOTE and ADASYN, is compared. The comparison is done on three
imbalanced data sets using three different classification models and
evaluation metrics, while varying the way the data is pre-processed.
The results show that both SMOTE and ADASYN improve the
performance of the classifiers in most cases. It is also found that SVM
in conjunction with SMOTE performs better than with ADASYN as
the degree of class imbalance increases. Furthermore, both SMOTE
and ADASYN increase the relative performance of the Random forest
as the degree of class imbalance grows. However, no pre-processing
method consistently outperforms the other in its contribution to better
performance as the degree of class imbalance varies.
Contents
1 Introduction
   1.1 Background
   1.2 Purpose and research question
2 Data
   2.1 Description of data
   2.2 Data washing and dimensioning
3 Theory
   3.1 Over-sampling techniques
      3.1.1 Synthetic Minority Over-sampling Technique
      3.1.2 Adaptive Synthetic sampling approach
   3.2 Learning models
      3.2.1 Logistic regression
      3.2.2 Random forest classifier
      3.2.3 Support vector machines
4 Method
   4.1 Training and test sets
   4.2 Over-sampling rates
   4.3 Model selection
   4.4 Evaluation metrics
      4.4.1 Confusion Matrix
      4.4.2 Sensitivity
      4.4.3 F-measure
      4.4.4 Matthews correlation coefficient
   4.5 Bootstrap estimation of standard deviation
5 Results
   5.1 Churn Modeling Data Set
   5.2 Home Credit Group Data Set
   5.3 Credit Card Fraud Data Set
6 Discussion
7 Conclusion
Reference list
Appendix
1 Introduction
1.1 Background
Classification concerns the prediction of qualitative responses and is used in a wide array of areas. In the binary setting, the training data consist of observations (𝑥𝑖, 𝑦𝑖), i = 1, ..., n, with 𝑥𝑖 ∈ R𝑑 and 𝑦𝑖 ∈ {−1, 1}, where 𝑥𝑖 denotes the explanatory variables and 𝑦𝑖 denotes the
response variables. In the training data, the values of the response variables are used
to learn a statistical model and to predict the classes of new observations. Different
statistical learning models possess varying degrees of flexibility and are thus different
in their prediction performance, although this may come with a trade-off in model interpretability. Many statistical learning models used for classification perform worse when the training data is imbalanced, and in many real-
world applications the proportion of one class is much larger than the proportion of
other classes. This results in the trained model being biased towards the majority group
(Krawczyk, 2016). This is not necessarily due to an error in the sampling procedure,
but rather an intrinsic property of the population of interest. For example, some
illnesses occur for only a small minority of the population, and oftentimes only a small fraction of the collected observations belongs to the class of interest. A common setting is that of binary classification where one class, the majority class, is much larger than the minority class (Herrera et al., 2016). In these settings, the interest typically lies in correctly classifying the observations of the minority class.
The class imbalance problem is present in a wide array of areas, and various approaches
have been proposed to overcome the issue in the past decade, see Kaur et al. (2019)
for a review. The methods of dealing with this issue can be partitioned into three subgroups: data level-methods, algorithm level-methods and hybrid methods. Data level-methods operate on the data space by reducing the imbalance between the majority and the minority
class. Algorithm level-methods are based around the idea of upgrading existing learning
algorithms or creating new ones that address imbalanced data. Hybrid methods combine the two approaches. Data level-methods balance the data by either removing observations from the majority class, which is called under-sampling, or by adding observations to the minority class, which is called over-sampling. The simplest data level-methods are random over-sampling, in which the minority class is augmented by making exact copies of minority class observations, and random under-sampling, in which majority class observations are removed at random (Han et al., 2005). While random under-sampling results in data that is more balanced, its drawback lies in the risk of losing informative observations of the majority class.
Random over-sampling on the other hand introduces the issue of over-fitting, since
this method replicates existing observations of the minority class. The data-level
methods that have been proposed over the last two decades focus on addressing these
issues. Kaur et al. (2019) and Ganganwar (2012) present reviews of several variations
of these methods.
One of the earliest and most popular algorithms in over-sampling is the Synthetic Minority Over-sampling Technique (SMOTE). It was proposed by Chawla et al. (2002) and adds synthetic minority class observations based on the k-nearest
neighbors’ algorithm for the minority class observations. Thus, more observations of
the minority class are added to the data to even out the imbalance between the classes.
However, SMOTE generates the same number of synthetic observations based on all the minority class observations, regardless of their position relative to the majority class. Therefore, the class
boundaries between the majority class and the minority class after applying SMOTE
can look very different from the original data and may not reflect the underlying
distribution of the minority class. As a solution, He et al. (2008) proposed the Adaptive Synthetic sampling approach (ADASYN), in which more synthetic observations are generated for observations of the minority class that are relatively harder to learn.
Although ADASYN was designed to tackle some of the problems of SMOTE, the
literature of comparisons between them does not unanimously favor either of the two.
For example, Taneja et al. (2019) compared SMOTE and ADASYN, among other pre-
processing methods, in conjunction with Random forest, Light Gradient Boosting and
Extreme Gradient boosting on a single data set with a high degree of imbalance. Their
results showed that the metrics from all models in conjunction with SMOTE
outperformed ADASYN. Another analysis on a single data set with a high degree of
imbalance was done by Barros et al. (2019), where SMOTE and ADASYN were used together with several learning models and the results did not clearly favor either method. In a similar study by Davagdorj et al. (2020), where SMOTE and ADASYN were used with seven different learning models on a single data set, the results were once again varied with respect to which pre-processing method performed best.
Although the authors of the papers cited above have been able to conclude which model and pre-processing method best suit their particular data, the results have been more mixed in studies that have included more than one data set.
For example, when He et al. (2008) compared the performance between SMOTE and
ADASYN on five data sets with varying degrees of imbalance, it was found that
ADASYN outperformed SMOTE in almost all performance metrics for all sets of data.
Although this pointed towards ADASYN performing better than SMOTE, the only
learning model that was used was Decision trees. A more general result may have been
obtained if more learning models had been included in the analysis. Furthermore, the results did not speak for either ADASYN or SMOTE being better with respect to the degree of imbalance. When comparing SMOTE and ADASYN, among other pre-processing methods, Gosain & Sardana (2017) used more than one
learning model when comparing the performance between the pre-processing methods.
The models were the Support vector machine, the Naïve Bayes classifier and the
Nearest Neighbors classifier, and the comparison included six different data sets with varying degrees of imbalance. Their results showed that SMOTE outperformed ADASYN in a majority of the performance metrics for all models and all data sets. However, the differences were minor in many cases. Although the data sets had different degrees of imbalance, the authors did not relate the performance of the pre-processing methods to the degree of imbalance of the data sets. As with the
former study, Amin et al. (2016) also included several sets of data with varying degrees
of imbalance as well as four different learning models. Their results did not show either SMOTE or ADASYN consistently outperforming the other.
1.2 Purpose and research question

There is a lack of comparative studies with focus on SMOTE and ADASYN, and the existing experimental results do not consistently favor one method over another
in all instances. In the cited studies where ADASYN and SMOTE have been compared,
some authors have included a single data set while others have included multiple data
sets. For example, Gosain & Sardana (2017) included a variety of learning models and
data sets with varying degrees of imbalance, but the difference in degrees of imbalance
between the data sets is not particularly large (ranging from 44 % to 30 % of the
observations belonging to the minority class). He et al. (2008) on the other hand
included data sets where the degrees of imbalance had a greater difference (ranging
from 35 % to 5 % of the observations belonging to the minority class), but only a single
learning model was included. Lastly, Amin et al. (2016) included a variety of learning
models and multiple data sets with varying degrees of imbalance (ranging from around
27 % to 7 % of the observations belonging to the minority class), and the results did
not point towards either SMOTE or ADASYN being consistently better than the other.
Since it remains unclear how the two methods compare as the degree of imbalance is varied, using data sets with larger differences in degrees of
imbalance could help shed light on this issue. None of the authors have focused on the
performance of SMOTE and ADASYN with respect to the degree of imbalance. Thus,
a comparison between the pre-processing methods on data sets with very different
degrees of imbalance together with more than a single model could be a relevant
addition to the current literature on this issue. This could help gain a better
understanding of whether SMOTE or ADASYN works relatively better than the other as the degree of imbalance increases. Furthermore, none of the cited articles have focused on whether specific learning models benefit more from SMOTE or ADASYN than other models. It could be that some learning models benefit more from using one of the methods than from the other. This motivates a comparison of the two pre-processing methods in conjunction with some canonical and often-used statistical learning models.
Thus, the purpose of this thesis is to compare suitable evaluation metrics for models
of imbalanced data that have been pre-processed using SMOTE and ADASYN. Three
different data sets have been selected that reflect different degrees of class imbalance.
The processed data sets will be classified by a set of canonical and common statistical
learning models: the SVM, the random forest classifier, and the logistic regression model. This leads to the following research question: Are there any differences in performance of learning models after the data has been pre-processed by SMOTE or ADASYN? If so, are they mainly reflected by the choice of learning model or by the degree of class imbalance in the data?

The rest of the paper is organized as follows. Section 2 introduces and explains the data sets. In Section 3, the theory behind the over-sampling methods and the classifiers is presented. Section 4 explains the method and chosen evaluation metrics. The results are presented in Section 5. Section 6 discusses the results, and Section 7 concludes.
2 Data
In this section, the data sets and the removal of observations and variables are
described. The description of the data sets includes the number of observations, the number of variables and the degree of class imbalance.

2.1 Description of data

The data consists of three data sets, all with binary response variables. They were
retrieved from Kaggle.com and have different degrees of class imbalance. The first data
set is called Predicting Churn for Bank Customers (Kumar, 2018), with the response
variable being whether customers leave a bank. In this data set the minority class
makes up 20.37 percent of all the observations. The second data set is called Home
Credit Default Risk (Home Credit Group, 2018), with the response variable being
whether a customer defaults on payments. In this data set the minority class makes
up 8.07 percent of all the observations. The third data set is called Credit Card Fraud
Detection (Machine Learning Group, 2018), with the response variable being whether
a credit card transaction is fraudulent. In this data set the minority class makes up 0.17 percent of all the observations.

2.2 Data washing and dimensioning

The Home Credit Default Risk and the Credit Card Fraud Detection data sets have
observations with missing values which were deleted prior to the pre-processing and
the modeling. No analysis of missing values in the data has been done since interest
lies in comparing models with unprocessed and processed data and not necessarily in
making inference about the population. Thus, the comparison between them will be unaffected by the deletion of observations with missing values. Furthermore, the Home Credit Default Risk and the Predicting Churn for Bank Customers data sets were
shrunk from 122 to 29 variables and from 14 to 11 variables, respectively. The rationale
behind the deletion of variables was, apart from easing the computational burden, to
remove inexplicit and vague variables. The descriptions and metadata of these
variables gave reasons to believe they lacked meaning in predicting the minority class.
For example, variables such as surname, customer id and row number were deemed to
have this property. For the Credit Card Fraud Detection data set, all variables had
names that lacked descriptions (Var1, Var2, etc.). Furthermore, no metadata of the
variables were available. Thus, there was no way of determining their meaning in
prediction, resulting in none of them being removed. Table 2.2 summarizes the data
sets after removing missing values and non-relevant variables; a description of the
response variables, the number of observations, the number of variables included and
their class imbalance ratio. The class imbalance ratio reflects the proportion of observations belonging to the minority class in each data set.
3 Theory
In this section the theory behind the thesis is presented. This includes the over-sampling techniques and the statistical learning models.

3.1 Over-sampling techniques

In this subsection, the theory behind the over-sampling techniques SMOTE and ADASYN is described. This includes short descriptions of how they work as well as their respective algorithms.

3.1.1 Synthetic Minority Over-sampling Technique

SMOTE was proposed by Chawla et al. (2002) and generates synthetic observations
for the minority class. For a given minority class observation, synthetic observations
are generated in a random range between the observation and its k-nearest minority
class neighbors. This procedure is done for every minority class observation. For each minority class observation, the number of synthetic observations is specified prior to the procedure and should reflect the degree of class imbalance in the data.
SMOTE Algorithm

Choose k and N, which denote the number of nearest neighbors and the number of synthetic observations to be generated for each minority class observation.
1. Let 𝑥𝑖 denote a minority class observation, i = 1, ..., 𝑛𝑠, where 𝑛𝑠 is the number of minority class observations.
2. For every 𝑥𝑖:
3. Let 𝑆𝑖𝑘 denote the set of the k-nearest minority class neighbors of 𝑥𝑖.
4. Draw an observation 𝑥𝑖𝑗 from 𝑆𝑖𝑘 with replacement.
5. Let 𝜆 denote a number in the range [0,1]. For a given 𝑥𝑖𝑗, draw a 𝜆 uniformly and generate the synthetic observation 𝑥𝑛𝑒𝑤 = 𝑥𝑖 + 𝜆(𝑥𝑖𝑗 − 𝑥𝑖).
6. Repeat steps 4-5 until N synthetic observations have been generated for 𝑥𝑖.
7. Stop algorithm.

Here, (𝑥𝑖𝑗 − 𝑥𝑖) is the difference vector in p-dimensional space, where p is the number of variables in the data.
Figure 3.1.1. — Illustration of the SMOTE procedure on an imbalanced data set in two-dimensional space. For the minority observation 𝑥1 with k = 3 and N = 2, the synthetic observations 𝑥𝑘1 and 𝑥𝑘2 are placed at a random distance along the straight lines between 𝑥1 and two of its nearest minority class neighbors.
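The thesis does not state which software was used to implement the over-sampling. As an illustration only, a minimal sketch of the SMOTE generation step in Python/NumPy could look as follows, where X_min is assumed to be a matrix holding only the minority class observations.

```python
import numpy as np

def smote(X_min, k=5, N=2, seed=0):
    """Minimal SMOTE sketch (Chawla et al., 2002): for each minority observation,
    generate N synthetic points along lines towards randomly drawn members of
    its k nearest minority class neighbors."""
    rng = np.random.default_rng(seed)
    n, p = X_min.shape
    # Pairwise Euclidean distances between the minority observations.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    synthetic = []
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]   # k nearest, excluding x_i itself
        for _ in range(N):
            j = rng.choice(neighbors)              # step 4: draw a neighbor
            lam = rng.uniform(0.0, 1.0)            # step 5: draw lambda in [0, 1]
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Example: 20 minority observations in two dimensions give 40 synthetic observations.
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote(X_min, k=3, N=2)
```

In practice a library implementation, such as the one in the imbalanced-learn package, would typically be used instead of hand-written code (see Section 4.2).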
3.1.2 Adaptive Synthetic sampling approach
ADASYN was proposed by He et al. (2008) and works similarly to SMOTE in that it generates synthetic observations for the minority class. The key difference is that ADASYN uses a weighted distribution over the minority class observations, generating more synthetic data for observations that are harder to learn than those
that are easier to learn for a given model. As with SMOTE, ADASYN generates
synthetic observations along a straight line between a minority class observation and
its k-nearest minority class neighbors. As with SMOTE, the number of k-nearest neighbors is specified prior to the procedure. In contrast to SMOTE, ADASYN generates more synthetic observations for minority class observations which have more majority class observations inside the k-
nearest neighbors’ region. On the other hand, if a minority observation has no majority
observations inside its k-nearest neighbor range, then no synthetic observations will be
generated for this observation. The rationale is that minority observations with many majority class neighbors are harder to learn than minority observations that lie far from the majority observations.
ADASYN Algorithm

Choose k and 𝛽, which denote the number of nearest neighbors and the desired level of balance after the procedure, 𝛽 ∈ [0,1].
1. Let 𝑛𝑙 denote the number of observations of the majority class and let 𝑛𝑠 denote the number of observations of the minority class.
2. Calculate the number of synthetic observations to be generated, G = (𝑛𝑙 − 𝑛𝑠)𝛽.
3. Let 𝑥𝑖 denote a minority class observation and let 𝐴 denote the set of all 𝑥𝑖, such that 𝐴 ∋ 𝑥𝑖. For every 𝑥𝑖:
4. Find the k-nearest neighbors of 𝑥𝑖 among all observations in the training data.
5. Define 𝛥𝑖 as the number of observations in the k-nearest neighbors' region of 𝑥𝑖 that belong to the majority class, and calculate the ratio 𝑟𝑖 = 𝛥𝑖 / k.
6. Normalize the ratios, 𝑟̂𝑖 = 𝑟𝑖 / Σ𝑖 𝑟𝑖, so that they sum to one over all 𝑥𝑖 in 𝐴.
7. Calculate the number of synthetic observations to be generated for each 𝑥𝑖, 𝑔𝑖 = 𝑟̂𝑖 G.
8. For each 𝑥𝑖, draw 𝑔𝑖 observations 𝑥𝑖𝑗 from the k-nearest minority class neighbors of 𝑥𝑖 with replacement.
9. Let 𝜆 denote a number in the range [0,1]. For a given 𝑥𝑖𝑗, generate a synthetic observation 𝑥𝑘 = 𝑥𝑖 + 𝜆(𝑥𝑖𝑗 − 𝑥𝑖), drawing a 𝜆 uniformly for each 𝑥𝑘.

Here, (𝑥𝑖𝑗 − 𝑥𝑖) is the difference vector in p-dimensional space, where p is the number of variables in the data. If 𝛽 = 1 then the data set will be fully balanced after the procedure.
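To make the adaptive part concrete, the sketch below (hypothetical Python/NumPy, not the thesis's code) computes the per-observation generation counts 𝑔𝑖; each 𝑔𝑖 would then replace the fixed N of the SMOTE step above. It assumes the minority class is labeled 1 and that at least one minority observation has majority class neighbors.

```python
import numpy as np

def adasyn_counts(X, y, k=5, beta=1.0):
    """Sketch of the ADASYN weighting (He et al., 2008): number of synthetic
    observations g_i to generate for each minority observation (label 1),
    with -1 denoting the majority class."""
    X_min = X[y == 1]
    n_s, n_l = len(X_min), int(np.sum(y == -1))
    G = (n_l - n_s) * beta                      # total number of synthetic observations
    r = np.empty(n_s)
    for i, x in enumerate(X_min):
        dist = np.linalg.norm(X - x, axis=1)
        neighbors = np.argsort(dist)[1:k + 1]   # k nearest among all observations
        r[i] = np.mean(y[neighbors] == -1)      # share of majority class neighbors
    r_hat = r / r.sum()                         # normalize so the weights sum to one
    return np.rint(r_hat * G).astype(int)       # g_i for each minority observation
```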
3.2 Learning models

3.2.1 Logistic regression

A suitable model in the classification setting is the Logistic regression model. According
to Hair et al. (2010, p.341), it is widely preferred among researchers because of its
simplicity, and in the often-present case where there is more than one explanatory
variable, the model is called multiple logistic regression. This model is given by

p(X) = exp(𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑝𝑋𝑝) / (1 + exp(𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑝𝑋𝑝)),

where 𝑋 = (𝑋1, . . . , 𝑋𝑝) are p explanatory variables used to predict the response variable and p(X) = Pr(Y = 1 | X). Rearranging the expression gives

log( p(X) / (1 − p(X)) ) = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑝𝑋𝑝,

where the left-hand side is the logit function of p. It can thus be seen that the multiple
logistic regression model has a logit that is linear in the explanatory variables. An
estimated multiple logistic regression model can thus be used to predict the probability
of a given observation belonging to either one of the classes. The question is then which probability threshold to use in order to correctly classify new observations which the model has not been trained
on. James et al. (2013, p.37-39) explain that the number of misclassifications on test
data, called the test error rate, is minimized on average by assigning each observation
to its most likely class, conditioned on the values of the explanatory variables. This
classifier is called the Bayes classifier. In other words, a test observation with explanatory variables 𝑥0 should be assigned to class 1 if

Pr(𝑌 = 1 | 𝑋 = 𝑥0) > 0.5,

and to class -1 otherwise.
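As a concrete sketch of this decision rule (not the thesis's actual code), a multiple logistic regression can be fit with scikit-learn and new observations assigned to class 1 whenever the predicted probability exceeds 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data with labels in {-1, 1}.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))
y_train = np.where(X_train[:, 0] + X_train[:, 1] + rng.normal(size=500) > 0, 1, -1)

model = LogisticRegression().fit(X_train, y_train)

# Bayes-classifier rule: assign class 1 if Pr(Y = 1 | X = x0) > 0.5, else class -1.
X_new = rng.normal(size=(5, 3))
p_hat = model.predict_proba(X_new)[:, list(model.classes_).index(1)]
y_pred = np.where(p_hat > 0.5, 1, -1)
```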
3.2.2 Random forest classifier

The random forest classifier is based on classification trees, that is decision trees used in classification problems. Classification trees are made by dividing the predictor space 𝑋1, . . . , 𝑋𝑝 into J distinct and non-overlapping regions 𝑅1, . . . , 𝑅𝐽. The same prediction is then made for every observation that falls into the region 𝑅𝑗, which in the classification setting is the majority group that occupies that specific region, which again can be regarded as an approximation of the Bayes classifier. The rule by which the predictor space is partitioned is called
recursive binary splitting, in which the predictor space is split iteratively based on
the highest reduction of some measure of classification error. More formally, consider
the predictor 𝑋𝑗 and the cutpoint s. Then recursive binary splitting is done by defining the pair of half-planes

R1(j, s) = {X | 𝑋𝑗 < s} and R2(j, s) = {X | 𝑋𝑗 ≥ s},

and choosing the values of j and s that give the largest reduction in the chosen measure of classification error. In the classification setting one measure that is often used for splitting is the Gini index, given by

G = Σ𝑘 𝑝̂𝑚𝑘 (1 − 𝑝̂𝑚𝑘),

where 𝑝̂𝑚𝑘 is the proportion of training observations in the mth region that belong to the kth class.
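As a small illustration of the splitting criterion, the Gini index of a region can be computed from the class proportions within it (a hypothetical helper, not code from the thesis):

```python
def gini_index(class_proportions):
    """Gini index of a region: sum of p_mk * (1 - p_mk) over the classes."""
    return sum(p * (1 - p) for p in class_proportions)

# A pure region has index 0; a 50/50 region in the binary case has index 0.5.
print(gini_index([1.0, 0.0]), gini_index([0.5, 0.5]))
```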
Applying decision trees in the learning setting will likely lead to overfitting the data.
An approach that addresses this issue is called bagging, which is a method that utilizes
the bootstrap. In bagging, bootstrap samples from the training set are generated. Then,
the model is trained on the individual bootstrapped training sets in order to get B classification functions/models 𝑓̂∗1(𝑥), . . . , 𝑓̂∗𝐵(𝑥). Each individual tree has high variance but low bias. However, since the B trees are averaged the variance is reduced. In the classification setting, the simplest approach is to record the class predicted by each of the B trees and take a majority vote, that is the overall class prediction is the most commonly occurring class among the B predictions.
The random forest is an extension of bagged trees that aims at further reducing the
variance of the model. Averaging a set of classification models does not lead to a large
reduction of variance if the models are highly correlated. Random forest produces less
correlated models by only considering a subset of the predictors at each split. At each
split a random sample of m predictors is chosen as split candidates from the full set of p predictors. The number m is typically chosen to be approximately equal to the square root of the total number of predictors, such that m ≈ √p. This process decorrelates the trees, and thus the total variance of the averaged models is reduced.
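A sketch of this configuration with scikit-learn (the number of trees and the m ≈ √p rule match what is later used in Section 4.3; the software itself is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier

# B = 500 bagged trees, with m = sqrt(p) randomly chosen split candidates per split.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
# rf.fit(X_train, y_train) then averages the B de-correlated trees by majority vote.
```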
3.2.3 Support vector machines

According to James et al. (2013, p.337-353), support vector machines (SVM) consist
of a group of similar classifiers that have been shown to perform well in a wide variety
of settings and have grown in popularity since their introduction in the 1990s. They are based on the idea of separating the classes with a hyperplane, which is a flat affine subspace of dimension p − 1 in a p-dimensional feature space. For example, if the feature space is of dimension 2 then the hyperplane is simply a line. The class of an observation 𝑥𝑖 is determined by which side of the hyperplane it falls on, that is, if

𝛽0 + 𝛽1𝑥𝑖1 + ⋯ + 𝛽𝑝𝑥𝑖𝑝 > 0 or 𝛽0 + 𝛽1𝑥𝑖1 + ⋯ + 𝛽𝑝𝑥𝑖𝑝 < 0.
If the hyperplane perfectly classifies the observations, then the hyperplane can be used
as a classifier. However, if a data set can be separated using a hyperplane then there
are in fact an infinite number of such hyperplanes. Of all the possible hyperplanes the
maximal margin classifier chooses the hyperplane that lies the farthest from all the
training observations. Hence, of all the training observations, there is a subset of observations that lie on the margin of the hyperplane and determine its shape. These observations are known as support vectors.
The maximal margin classifier works only if a separating hyperplane exists in the first
place. In many settings this is not true, and the classes are non-separable by a linear class boundary. In such cases, a hyperplane can instead be chosen that correctly classifies most observations using a so-called soft margin. The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier. It allows some observations to be on the wrong side of the margin, or even of the hyperplane, in exchange for greater robustness to individual observations and better classification of most of the training observations. The support vector classifier thus classifies observations depending on which side of the hyperplane they lie, and does so in a manner such that most of the observations are correctly classified. The classifier is the solution to an optimization problem that involves a nonnegative tuning parameter C, which determines the number and severity of the violations of the hyperplane and the margins that are allowed. In practice, C is chosen via cross-
validation and controls the variance-bias trade-off of the model. A higher value of this
parameter allows for more violations of the hyperplane and makes it less likely to
overfit the data — some added bias is traded for a smaller amount of variance of the
model.
The support vector machine (SVM) is an extension of the support vector classifier that accommodates non-linear class boundaries by expanding the feature space using functions known as kernels. The bridge to the SVM
from the support vector classifier starts in the above optimization problem. It turns
out that the solution to the problem involves only the inner products of the observations, rather than the observations themselves. To estimate the parameters of this model, one needs to compute the inner product of each new point x
and each of the training points 𝑥𝑖 . It turns out that the parameters are nonzero only
for the support vectors. Now, suppose that each time an inner product appears in the solution, it is replaced by a generalization of the inner product of the form K(𝑥𝑖, 𝑥𝑖′), where K is a kernel function. The support vector machine can then be represented as

f(x) = 𝛽0 + Σ𝑖∈𝑆 𝛼𝑖 K(x, 𝑥𝑖),

where S denotes the set of all observations that are support vectors. A popular choice of kernel is the radial kernel,

K(x, 𝑥𝑖) = exp(−𝛾 Σ𝑗 (𝑥𝑗 − 𝑥𝑖𝑗)²),

where 𝛾 is a positive tuning parameter.
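Written out as a small helper (a sketch of the formula above, not code from the thesis), the radial kernel is simply an exponentially decaying function of the squared Euclidean distance:

```python
import numpy as np

def radial_kernel(x, x_i, gamma=1.0):
    """Radial (RBF) kernel: K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    x, x_i = np.asarray(x, dtype=float), np.asarray(x_i, dtype=float)
    return np.exp(-gamma * np.sum((x - x_i) ** 2))
```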
4 Method
In this section, the method is presented. This includes the split of training and test
sets, chosen over-sampling rates, model selection, choice of evaluation metrics and the bootstrap estimation of standard deviations.

4.1 Training and test sets

The observations in the data sets were split into training sets and test sets. The split
of observations was done randomly while keeping the class imbalances in the training
and test sets the same as in the data sets prior to the split. The split in the Predicting
Churn for Bank Customers data set was done such that 70 percent of the observations
were included in the training set and the remaining 30 percent of the observations were
included in the test set. This is equivalent to approximately 7000 observations in the
training set and 3000 observations in the test set. The size of the training sets for the
Home Credit Default Risk and the Credit Card Fraud Detection data sets were set to
be roughly equal to the training set of the Predicting Churn for Bank Customers data
set. The reason for this was to ease the computational burden of learning the models
from these data sets, as they consist of a very large number of observations to begin
with. The remaining observations were assigned to the test sets, resulting in
approximately 300 000 and 278 000 observations for the Home Credit Default Risk
data set and the Credit Card Fraud Detection data set, respectively. The classification
models were then trained on the unprocessed training sets, training sets pre-processed with SMOTE and training sets pre-processed with ADASYN.

4.2 Over-sampling rates

The over-sampling rates were set so that the number of observations in the minority classes in each data set was as close as possible to the number of observations in the majority classes. Although not all of the existing literature explicitly states the
over-sampling rates used, the ones who do (Taneja et al., 2019; He et al., 2008 and
Barros et al., 2019) all aim at perfectly balancing the classes. Moreover, since the
reason for using the pre-processing methods is to shift the bias of the learning model away from the majority class (He & Ma, 2013, p.30), aiming at balanced classes in order to completely reduce the bias of the learning model against a particular class seems like a reasonable method.
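The thesis does not state the software used for the over-sampling; as an illustration, aiming for fully balanced classes can be sketched with the imbalanced-learn package, where sampling_strategy=1.0 requests a one-to-one class ratio.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Hypothetical imbalanced training data (roughly 8 % minority class).
X_train, y_train = make_classification(n_samples=5000, n_features=10,
                                       weights=[0.92], random_state=0)

# sampling_strategy=1.0: over-sample until the classes are (approximately) balanced.
X_sm, y_sm = SMOTE(sampling_strategy=1.0, k_neighbors=5,
                   random_state=0).fit_resample(X_train, y_train)
X_ad, y_ad = ADASYN(sampling_strategy=1.0, n_neighbors=5,
                    random_state=0).fit_resample(X_train, y_train)
```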
4.3 Model selection

For the choice of kernel of the SVM, a radial basis kernel was selected. Moreover, to
determine the value of the hyperparameters in the model, a 5-fold cross validation was
conducted in order to calculate the error for different values of the hyperparameters.
In 5-fold cross validation, a given training set is randomly split into five sets of as close to equal size as possible. Then, a model is trained on four of these sets and the classification error is calculated on the fifth set. The procedure is repeated until each of the five sets has been held out once and five validation errors have been calculated. The average of these errors is then a good estimate of the out-of-sample error. Since the SVM with a radial basis kernel contains two hyperparameters, the cost parameter C and the kernel parameter 𝛾, the cross-validation procedure was repeated for different values of the parameters. The hyperparameters resulting in the lowest cross-validation error were then selected for the final models.
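A sketch of this tuning step using 5-fold cross-validation in scikit-learn; the actual software and the grid of candidate values used in the thesis are not stated, so both are assumptions.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the two radial-kernel hyperparameters: cost C and gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# 5-fold CV over the training set; the combination with the best CV score is kept.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_params_ then gives the selected values.
```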
Both the Logistic regression and the Random forest classifier contain no
hyperparameters, and thus cross-validation was not used for these models. For the
Random forest classifier however, the choice of the number of bagged trees and number
of allowed variables at each split had to be set. The number of trees was set to 500,
which seemed sufficiently large in order for the error rate to decrease. According to
James et al. (2013, p.321) the Random forest does not overfit when the number of
trees is increased. The allowed number of randomly selected variables at each split was set to the square root of the total number of explanatory variables, which according to James et al. (2013) is a common choice.

4.4 Evaluation metrics

In this section, the tools and metrics used in order to evaluate the model outputs are
presented. The used evaluation metrics have been selected to reflect the ability of a
classifier to classify the minority observations correctly as well as the classifiers overall
classification performance.
4.4.1 Confusion Matrix

The confusion matrix is a table that summarizes the output of a classifier in terms of predicted and actual values, where values refer to class labels. On one axis lies the
actual values and on the other axis lies the predicted values. In the binary classification
setting where the response variable belongs to class -1 or 1, i.e., 𝑦𝑖 ∈ {−1, 1}, the confusion matrix has four possible outcomes. An observation belonging to class 1 is referred to as a positive and an observation belonging to class -1 is referred to as a negative. An observation that truly belongs to class -1 and that the model correctly predicts as belonging to this class is referred to as a true negative. In the opposite case, where an observation belongs to class 1 and the model correctly predicts it as
belonging to this class is referred to as a true positive. If the model predicts that an
observation is a positive but in fact is a negative, then that observation is called a false
positive. In the opposite case, where the model predicts that an observation is a
negative but in fact is a positive, that observation is called a false negative (Burkov,
2019). For the rest of this paper, observations referred to as positive are synonymous with observations belonging to the minority class, i.e. class 1. The confusion matrix in the binary setting is presented in Table 4.4.1.
                                 Actual values
 'Positive' class: 1            -1                  1
 Predicted values     -1    True negative      False negative
                       1    False positive     True positive
Table 4.4.1. — Confusion matrix presenting the output of a classifier in terms of actual and
predicted values, where -1 and 1 refer to class labels. True/false represent whether the
classifier predicted an observation correctly/incorrectly. Positive/negative refers to the class
label, -1 or 1, of the predicted observation.
4.4.2 Sensitivity
Sensitivity measures how well the observations of the positive class are correctly classified. It is defined as the fraction of true positives in relation to the total number of observations belonging to the positive class. The formula for sensitivity is given by

Sensitivity = TP / (TP + FN),

where TP denotes the number of true positives and FN the number of false negatives.
This evaluation metric is suitable when dealing with imbalanced data, since it measures
how well the model can correctly classify observations belonging to the minority group
4.4.3 F-measure
Precision measures the ratio between positive observations being correctly classified and all observations that the model classifies as positive,

Precision = TP / (TP + FP),

where FP denotes the number of false positives.
It is easy to see the similarity between precision and sensitivity. While sensitivity
measures the degree to which a model correctly classifies all truly positives correctly,
precision measures the degree to which a model correctly classifies positives in relation to all observations it predicts as positive. The F-measure is defined as the harmonic mean of sensitivity and precision (Kaur et al., 2019),

F-measure = 2 · (Precision · Sensitivity) / (Precision + Sensitivity).

The rationale of the choice of this metric lies in its
widespread usage when comparing models and its intuitiveness when comparing it to
sensitivity: If the F-measure for a model is higher than its sensitivity then the precision
is higher than the sensitivity, and vice versa. It is thus, along with sensitivity, a useful
4.4.4 Matthews correlation coefficient
The Matthews correlation coefficient (MCC) is a measure used to evaluate the output of binary classification models. It is obtained by calculating the Pearson correlation coefficient between actual and predicted values in
a contingency table, such as the confusion matrix. Although it is not as widely used
as the F-measure, its usefulness has been shown in different scientific fields when it
comes to evaluation of predictive models (Liu et al., 2015; The MAQC Consortium,
2010). Chicco & Jurman (2020) argue that the MCC should be preferred to the F-measure when evaluating binary classification models. The reason is that the MCC generates results that reflect the overall predictions made
by a model, which is not the case for the F-measure. It can be shown that when a
model predicts only one of the two classes well (i.e., displaying many true positives
but few true negatives or vice versa), the F-measure can give a misleading result while
the MCC will not. The MCC is given by the following formula:

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

As the Pearson's correlation coefficient, the MCC returns a value in the range [-1, +1], where +1 represents perfect prediction, -1 represents total disagreement between predicted and actual values, and 0 represents a prediction no better than random guessing based on coin tossing. The MCC will be used in conjunction with the F-measure and
sensitivity when evaluating the models. The reason for not only using the MCC is to
include metrics that give an idea of how the model performs in terms of predicting the
positive class only. The MCC will give an idea of the overall classification performance
of the model.
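For completeness, the three metrics can be computed directly from the confusion-matrix counts; the helper below is a sketch, not code from the thesis.

```python
import numpy as np

def evaluation_metrics(tp, fp, fn, tn):
    """Sensitivity, F-measure and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sensitivity, f_measure, mcc

# Example with made-up counts.
print(evaluation_metrics(tp=80, fp=20, fn=40, tn=860))
```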
4.5 Bootstrap estimation of standard deviation
After training the models on the training sets, they will be used to classify the
observations in the test sets. Afterwards, the predicted values will be bootstrapped
into sets of bootstrap replicates. The replicates will then be compared to the actual
values in order to extract the evaluation metrics from each comparison. The reason for this procedure is to obtain estimates of the standard deviations and 95% confidence intervals of the metrics. The confidence intervals will be used in order to determine whether the results are statistically significant, where non-overlapping intervals are taken to indicate a significant difference. Although bootstrapping the training sets would make it possible to calculate the standard
deviation of the estimated models, this option is not viable since the computation time
is too long. Instead of being able to calculate the standard deviation and confidence
interval of the estimated models, the procedure implemented here will arrive at an
estimated standard deviation and confidence interval of the evaluation metrics for each combination of data set, pre-processing method and classification model.
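A sketch of the bootstrap procedure described above, for a generic metric that takes actual and predicted labels (hypothetical helper, not the thesis's code):

```python
import numpy as np

def bootstrap_metric(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Resample the (actual, predicted) pairs of the test set with replacement and
    return the metric's bootstrap standard deviation and a percentile 95% CI."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    values = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # bootstrap replicate of the predictions
        values[b] = metric(y_true[idx], y_pred[idx])
    return values.std(ddof=1), tuple(np.percentile(values, [2.5, 97.5]))

# Example: bootstrap the sensitivity of a set of predictions.
sensitivity = lambda yt, yp: np.sum((yt == 1) & (yp == 1)) / np.sum(yt == 1)
```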
5 Results
In this section, the results of using the pre-processing methods on the minority class
proportions of the training sets are presented. Furthermore, the sensitivity, F-measures
and MCC’s are presented using combinations of pre-processing methods and models
for the different data sets, whilst the confidence intervals are presented in the
Appendix. The class imbalance for each data set without any pre-processing, after
implementing SMOTE and after implementing ADASYN can be seen in Table 5.1.
5.1 Churn Modeling Data Set

The sensitivity, F-measure and MCC for each combination of pre-processing method
and classification model is presented in Table 5.1.1. The bootstrap standard deviations
of the metrics are within parentheses. SVM in conjunction with SMOTE or ADASYN increases the sensitivity compared to the SVM with no pre-processing.
Although the highest sensitivity is achieved using SVM in conjunction with SMOTE,
the value is not significantly different than the value achieved using SVM in
conjunction with ADASYN. Regarding the F-measure and MCC, there are no
significant differences amongst the values for the different pre-processing methods when using the SVM.
Logistic regression together with SMOTE or ADASYN gave higher sensitivity and F-
measure compared to the Logistic regression with no pre-processing method, whilst the
values when using SMOTE and ADASYN are not significantly different from each
other. There are no significant differences amongst the MCC’s when using the Logistic
regression. When using the Random forest in conjunction with the different pre-
processing methods, none of the differences between the evaluation metrics are
statistically significant.
Churn Modeling
Table 5.1.1. — Sensitivity, F-measure, and Matthews correlation coefficient of the models
using different pre-processing methods on the Churn Modeling data set. Highest values are
marked in bold. If two values are marked in bold, no statistically significant difference exists
between them. No bold values indicate that there is no significant difference amongst the
values.
5.2 Home Credit Group Data Set
The sensitivity, F-measure and MCC for each combination of pre-processing method
and classification model is presented in Table 5.2.1. The bootstrap standard deviations
of the metrics are within parentheses. As shown, Random forest has the overall worst
performance as a model compared to SVM and Logistic regression for almost all
metrics. Furthermore, there are no significant differences for the evaluation metrics
comparing SMOTE and ADASYN when using the SVM. The implementation of
ADASYN shows the largest increase in sensitivity when applying the Logistic
regression, whilst SMOTE performs better with regards to the F-measure when
applying the Logistic regression; the same also applies to the sensitivity and F-measure when applying the Random forest. The MCC for the Random forest is the only case
Table 5.2.1. — Sensitivity, F-measure, and Matthews Correlation Coefficient of the models
using different pre-processing methods on the Home Credit Group data set. Highest values
are marked in bold. If two values are marked in bold, no statistically significant difference
exists between them. No bold values indicate that there is no significant difference amongst
the values.
5.3 Credit Card Fraud Data Set
The sensitivity, F-measure and MCC for each combination of pre-processing method
and classification model is presented in Table 5.3.1. The bootstrap standard deviations
of the metrics are within parentheses. SVM in conjunction with SMOTE or ADASYN
improves all metrics, whilst the largest improvement is achieved using SMOTE.
However, both SMOTE and ADASYN worsen the performance when applying the
Logistic regression. When applying the Random forest, the sensitivity is not significantly different between SMOTE and ADASYN. Although the SVM in conjunction with SMOTE achieves the highest sensitivity amongst all models and pre-processing methods, the highest F-measure was obtained using the Random forest in conjunction with ADASYN. Lastly,
there are no significant differences amongst the MCC’s for the Random forest.
Table 5.3.1. — Sensitivity, F-measure, and Matthews Correlation Coefficient of the models
using different pre-processing methods on the Credit Card Fraud data set. Highest values are
marked in bold. If two values are marked in bold, no statistically significant difference exists
between them. No bold values indicate that there is no significant difference amongst the
values.
6 Discussion
When applying SVM to the data sets, all metrics are improved after implementing
SMOTE or ADASYN, except for the F-measure and the MCC on the Churn Modeling
data set where there are no significant differences. The increase in performance is larger
when using SMOTE compared to using ADASYN on the Credit Card Fraud data set,
whilst there are no significant differences in performance in any of the other cases.
Comparing the two pre-processing methods, there does seem to be a pattern that
speaks in favor of SMOTE when applying the SVM, especially as the degree of
imbalance increases. However, in the majority of scenarios the differences are non-
significant.
When applying the Logistic regression to the Churn Modeling data set, both pre-processing methods improve the sensitivity and the F-measure compared to using no pre-processing, with no significant difference between SMOTE and ADASYN. When applying the Logistic regression to the Home Credit Group data set, both
ADASYN and SMOTE improve the performance of the classification models, although
the improvements from ADASYN are greater with regards to sensitivity, whilst the improvements from SMOTE are greater with regards to the F-measure, meaning that no clear ranking emerges between the two. When applying the Logistic regression to the Credit Card Fraud data
set, both SMOTE and ADASYN worsen the performance of the metrics. Thus, the
results point to that both ADASYN and SMOTE in conjunction with the Logistic
regression seem to have an equally large potential of improving the metrics. However,
this does not hold for all cases as the metrics from the Credit Card Fraud data set
speak against this claim. Lastly, the results from using the Logistic regression do not speak for either ADASYN or SMOTE being relatively better the more imbalance there is.
When applying the Random forest to the Churn Modeling data set, there are no
significant differences between the metrics for the pre-processing methods, whilst an
increase of the sensitivity and F-measure can be identified when implementing SMOTE
on the Home Credit Group data set. For the Credit Card Fraud data set, both SMOTE
and ADASYN improve the performance of the classifier with regards to sensitivity,
whilst the F-measure is improved only by using Random forest in conjunction with
ADASYN. The results speak for ADASYN and SMOTE being relatively better the
more imbalance there is for the Random forest. However, there does not seem to be a
pattern that speaks in favor of either SMOTE or ADASYN being consistently better than the other.
The confidence intervals calculated using the bootstrap are used in order to determine
whether the results are statistically significant. However, a limitation of the standard
deviations and confidence intervals is that they are derived from bootstrap replicates
of the predicted values of the test sets. A more informative procedure would have been
to bootstrap the training sets and train the models on every replicate of the training
sets. The reason for not choosing this method was the computational time required to
implement this. One of the limitations of this paper lies in this matter.
Another matter that is subject to discussion is the choice of over-sampling rates. This paper aimed at making the class proportions of the processed training sets as close to equal as possible. The rationale was that the models were assumed to benefit the most if the
classes were perfectly balanced. However, there might be a risk associated with this
procedure since this may introduce substantial overlapping between the classes. Even
though SMOTE and ADASYN aim at making the class boundaries more distinct and easier for the learning models to distinguish between, there may be cases where a
substantial over-sampling of the minority class leads to synthetic observations in the
data space that are placed in ill-fitted locations. The minority class observations may
not always be located in clusters more or less close to each other, but may be more scattered across the data space, with some outlying minority observations located in regions where majority observations are located. Then, the learning models may have an even harder time to correctly distinguish between the classes than without any pre-processing, which would worsen the results. Following this hypothetical example, ADASYN, which is meant to create
synthetic observations for the observations that are harder to learn from, may generate
synthetic observations for outlier minority observations that would have been ‘ignored’
by the learning model. SMOTE would not have generated the same number of
observations for the outlier observation, leading to better results from this pre-
processing method. Thus, lower levels of over-sampling rates might be more adequate
than aiming for full balance in some cases. In other cases, implementing other over-sampling techniques might mitigate the risk of over-sampling when outliers or scattered minority observations are present in the data.
7 Conclusion
The results suggest that there is no pre-processing method that
consistently improves the performance of all the models with regards to sensitivity, the
F-measure and the MCC. The results do, however, point towards SMOTE improving the performance of the SVM in most cases, especially as the degree of class imbalance increases. Furthermore, the results suggest that both pre-processing methods improve the relative performance of the Random forest the more
imbalance there is in the original data set. However, there does not seem to be a
pattern that speaks in favor of either SMOTE or ADASYN being consistently better
than the other. That is, no pre-processing method consistently outperforms the other
in its contribution to better performance as the degree of class imbalance varies. Both
SMOTE and ADASYN improve the overall performance of the models. However, this
does not apply to every case, as they sometimes give similar results as the models trained on the unprocessed data, and in some cases even worsen the performance.
Reference list
Amin, Adnan; Anwar, Sajid; Adnan, Awais; Nawaz, Muhammad; Howard, Newton;
Qadir, Junaid; Hawalah, Ahmad; Hussain, Amir. 2016. Comparing Oversampling
Techniques to Handle the Class Imbalance Problem: A Customer Churn Prediction
Case Study. IEEE Access 4. 7940-7957.
Barros, Thiago M.; Souza Neto, Plácido A.; Silva, Ivanovitch; Guedes, Luiz Affonso.
2019. Predictive Models for Imbalanced Data: A School Dropout Perspective.
Education sciences 9(4). 275-292.
Burkov, Andriy. 2019. The Hundred-Page Machine Learning Book. Hardcover edn.
Andriy Burkov, 2019.
Chawla, Nitesh V.; Bowyer, Kevin W.; Hall, Lawrence O.; Kegelmeyer, W. Philip.
2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research 16. 321-357.
Chicco, Davide; Jurman, Giuseppe. 2020. The advantages of the Matthews correlation
coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC
Genomics 21(6). 6-19.
Davagdorj, Khishigsuren; Lee, Jong Seol; Pham, Van Huy; Ryu, Keun Ho. 2020. A
Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking
Cessation Intervention. Applied sciences 10(9). 3307-3327.
Gosain, Anjana; Sardana, Saanchi. 2017. Handling class imbalance problem using
oversampling techniques: A review. ICACCI: 2017 International Conference on
Advances in Computing, Communications and Informatics. Udupi, India. 79-85.
Hair, Joseph F.; Black, William C.; Babin, Barry J; Anderson, Rolph E. 2010.
Multivariate Data Analysis: A Global Perspective. 7th edn. New Jersey: Pearson
Education, Inc.
Han, Hui; Wang, Wen-Yuan; Mao, Bing-Huan. 2005. Borderline-SMOTE: A New
Over-Sampling Method in Imbalanced Data Sets Learning. In Huang, De-Shuang;
Zhang, Xiao-Ping; Huang, Guang-Bin. (eds) Advances in Intelligent Computing. ICIC
2005. Lecture Notes in Computer Science 3644(5). 878-887.
He, Haibo; Bai, Yang, Garcia, Edwardo A.; Li, Shutao. 2008. ADASYN: Adaptive
Synthetic Sampling Approach for Imbalanced Learning. IEEE World Congress on
Computational Intelligence: 2008 IEEE International Joint Conference on Neural
Networks. Hong Kong, China, 1322-1328.
He, Haibo; Ma, Yunqian. 2013. Imbalanced Learning: Foundations, Algorithms, and
Applications. 1st edn. New Jersey: John Wiley & Sons, Inc.
Herrera, Francisco; Ventura, Sebastián; Bello, Rafael; Cornelis, Chris; Zafra, Amelia;
Sánchez-Tarrago, Dánel; Vluymans, Sarah. 2016. Multiple Instance Learning:
Foundations and Algorithms. 1st edn. New York: Springer.
Home Credit Group. 2018. Home Credit Default Risk. Retrieved November 17, 2020
from https://www.kaggle.com/c/home-credit-default-risk.

James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert. 2013. An
Introduction to Statistical Learning: with Applications in R. New York: Springer.
Krawczyk, Bartosz. 2016. Learning from imbalanced data: open challenges and future
directions. Progress in Artificial Intelligence 5(4). 221-232.
Kumar, Santosh. 2018. Bank Customers Churn, Version 1. Retrieved November 17,
2020 from https://www.kaggle.com/santoshd3/bank-customers.
Liu, Yingbo; Cheng, Jiujun; Yan, Chendan, Wu, Xiao; Chen, Fuzchen. 2015. Research
on the Matthews correlation coefficients metrics of personalized recommendation
algorithm evaluation. International Journal of Hybrid Information Technology 8(1).
163-172.
Machine Learning Group. 2018. Credit Card Fraud Detection, Version 3. Retrieved
November 17, 2020 from https://www.kaggle.com/mlg-ulb/creditcardfraud.
International Conference on Computing, Power and Communication Technologies.
New Delhi, India. 753-758.
The MicroArray Quality Control (MAQC) Consortium. 2010. The MAQC-II Project:
a comprehensive study of common practices for the development and validation of
microarray-based predictive models. Nature Biotechnology 28(8). 827-838.
Appendix
Churn Modeling
Table 1.1. — 95% confidence intervals of the sensitivity, F-measure, and Matthews Correlation
Coefficient of the models using different pre-processing methods on the Churn Modeling data set.
SVM 0 0 0
LR 0 0 0
RF 0 0 (-0.0005, 0.0049)
Table 1.2. — 95% confidence intervals of the sensitivity, F-measure, and Matthews Correlation
Coefficient of the models using different pre-processing methods on the Home Credit Group data set.
Credit Card Fraud
Table 1.3. — 95% confidence intervals of the sensitivity, F-measure, and Matthews Correlation
Coefficient of the models using different pre-processing methods on the Credit Card Fraud data set.