WINSEM2024-25 CBS3006 ETH VL2024250505168 2025-01-11 Reference-Material-I

Imbalanced data refers to situations where the distribution of samples across classes is uneven, leading to biased model predictions favoring the majority class. This can result in unreliable outcomes, particularly in critical applications like fraud detection and medical diagnosis. Solutions include resampling techniques, cost-sensitive methods, and evaluation metrics like precision, recall, and F1-score to better assess classifier performance in these scenarios.

Imbalanced Data

Imbalanced Data
• Imbalanced data is where the distribution of samples among classes is not uniform.

• One class is more common than the other, with the more common class called the majority class and the less common class called the minority class.

• Imbalanced data can lead to biased outcomes in models, which can affect their reliability and effectiveness.
Contd..
• A model trained on imbalanced data might learn to consistently predict the majority class, even if that's not the most important prediction in a real-world scenario.

• Challenges: modeling and learning feature correlation properties for lower sampled classes, detecting relevant feature class separation, and adding large bias to standard evaluation metrics.
Examples
• Credit card transactions
• A bank might have a dataset of credit card transactions where 99.9% of transactions are legitimate and only 0.1% are fraudulent.
• Virus detection
• A dataset for detecting viruses might have a minority class of 0.5% and a majority class of 99.5%.
• Bank transactions
• A dataset of bank transactions might have 100 "non-fraud" cases and only 20 "fraud" cases.
Setup
1. For 1 hour, Google collects 1M e-mails randomly.
2. They pay people to label them as “phishing” or “not-phishing”.
3. They give the data to you to learn to classify e-mails as phishing or not.
4. You, having taken ML, try out a few of your favorite classifiers.
5. You achieve an accuracy of 99.997%.
Should you be happy?
Imbalanced data

[Figure: labeled data, 99.997% not-phishing vs. 0.003% phishing]

The phishing problem is what is called an imbalanced data problem.

This occurs where there is a large discrepancy between the number of examples with each class label, e.g. for our 1M example dataset only about 30 would actually represent phishing e-mails.

What is probably going on with our classifier?
Imbalanced data

Always predict “not-phishing” → 99.997% accuracy

Why does the classifier learn this?
Imbalanced data
Many classifiers are designed to optimize error/accuracy.

This tends to bias performance towards the majority class.

Anytime there is an imbalance in the data this can happen.

It is particularly pronounced, though, when the imbalance is severe.
Imbalanced problem domains

Besides phishing (and spam), what are some other imbalanced problem domains?
Imbalanced problem domains
Medical diagnosis

Predicting faults/failures (e.g. hard-drive failures, mechanical failures, etc.)

Predicting rare events (e.g. earthquakes)

Detecting fraud (credit card transactions, internet traffic)
Imbalanced data: current classifiers

[Figure: labeled data, 99.997% not-phishing vs. 0.003% phishing]

How will our current classifiers do on this problem?
Imbalanced data: current classifiers
All will do fine if the data can be easily separated/distinguished

Decision trees:
– explicitly minimizes training error
– when pruning pick “majority” label at leaves
– tend to do very poorly on imbalanced problems

k-NN:
– even for small k, majority class will tend to overwhelm the vote

perceptron:
– can be reasonable since only updates when a mistake is made
– can take a long time to learn
Part of the problem: evaluation
Accuracy is not the right measure of classifier performance in these domains.

Other ideas for evaluation measures?

“identification” tasks
View the task as trying to find/identify “positive” examples (i.e. the rare events).

Precision: proportion of test examples predicted as positive that are correct

    precision = (# correctly predicted as positive) / (# examples predicted as positive)

Recall: proportion of test examples labeled as positive that are correctly predicted

    recall = (# correctly predicted as positive) / (# positive examples in test set)
“identification” tasks

Precision: proportion of test examples predicted as positive that are correct
    precision = (# correctly predicted as positive) / (# examples predicted as positive)

Recall: proportion of test examples labeled as positive that are correctly predicted
    recall = (# correctly predicted as positive) / (# positive examples in test set)

[Diagram: predicted positive vs. all positive examples]
precision and recall

 label | predicted
 ------+----------
   0   |    0
   0   |    1
   1   |    0
   1   |    1
   0   |    1
   1   |    1
   0   |    0

precision = (# correctly predicted as positive) / (# examples predicted as positive)
recall    = (# correctly predicted as positive) / (# positive examples in test set)
precision and recall

 label | predicted
 ------+----------
   0   |    0
   0   |    1
   1   |    0
   1   |    1
   0   |    1
   1   |    1
   0   |    0

precision = 2/4
recall    = 2/3
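A quick sanity check (not part of the original slides): the short Python sketch below recomputes the precision = 2/4 and recall = 2/3 values from the table above, assuming the label and prediction columns are stored as plain Python lists.

labels      = [0, 0, 1, 1, 0, 1, 0]   # "label" column from the table
predictions = [0, 1, 0, 1, 1, 1, 0]   # "predicted" column from the table

true_positives      = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
predicted_positives = sum(predictions)   # examples predicted as positive
actual_positives    = sum(labels)        # positive examples in the test set

precision = true_positives / predicted_positives   # 2/4 = 0.5
recall    = true_positives / actual_positives      # 2/3 ≈ 0.67
print(precision, recall)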
precision and recall

Same example as above:

 label | predicted
 ------+----------
   0   |    0
   0   |    1
   1   |    0
   1   |    1
   0   |    1
   1   |    1
   0   |    0

Why do we have both measures?
How can we maximize precision?
How can we maximize recall?
Maximizing precision

 label | predicted
 ------+----------
   0   |    0
   0   |    0
   1   |    0
   1   |    0
   0   |    0
   1   |    0
   0   |    0

Don’t predict anything as positive!
Maximizing recall

 label | predicted
 ------+----------
   0   |    1
   0   |    1
   1   |    1
   1   |    1
   0   |    1
   1   |    1
   0   |    1

Predict everything as positive!
precision vs. recall
Often there is a tradeoff between precision and recall:

increasing one tends to decrease the other.

For our algorithms, how might we increase/decrease precision/recall?
precision/recall tradeoff

 label | predicted | confidence
 ------+-----------+-----------
   0   |     0     |   0.75
   0   |     1     |   0.60
   1   |     0     |   0.20
   1   |     1     |   0.80
   0   |     1     |   0.50
   1   |     1     |   0.55
   0   |     0     |   0.90

- For many classifiers we can get some notion of the prediction confidence
- Only predict positive if the confidence is above a given threshold
- By varying this threshold, we can vary precision and recall
precision/recall tradeoff

Put the most confident positive predictions at the top and the most confident negative predictions at the bottom:

 label | predicted | confidence
 ------+-----------+-----------
   1   |     1     |   0.80
   0   |     1     |   0.60
   1   |     1     |   0.55
   0   |     1     |   0.50
   1   |     0     |   0.20
   0   |     0     |   0.75
   0   |     0     |   0.90

Calculate precision/recall at each break point/threshold: classify everything above the threshold as positive and everything else as negative.
precision/recall tradeoff

Precision/recall at each break point (threshold placed just below the given row):

 label | predicted | confidence | precision  | recall
 ------+-----------+------------+------------+-----------
   1   |     1     |   0.80     | 1/1 = 1.0  | 1/3 = 0.33
   0   |     1     |   0.60     | 1/2 = 0.5  | 1/3 = 0.33
   1   |     1     |   0.55     | 2/3 = 0.66 | 2/3 = 0.66
   0   |     1     |   0.50     | 2/4 = 0.5  | 2/3 = 0.66
   1   |     0     |   0.20     | 3/5 = 0.6  | 3/3 = 1.0
   0   |     0     |   0.75     | 3/6 = 0.5  | 3/3 = 1.0
   0   |     0     |   0.90     | 3/7 = 0.43 | 3/3 = 1.0
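A minimal sketch (not from the slides) that recomputes the precision/recall pairs in the table above, assuming the true labels are listed in the sorted order shown:

sorted_labels = [1, 0, 1, 0, 1, 0, 0]   # true labels, most confidently positive first
total_positives = sum(sorted_labels)    # 3 positive examples in the test set

tp = 0
for k, label in enumerate(sorted_labels, start=1):
    tp += label                         # running count of true positives
    precision = tp / k                  # k examples classified as positive so far
    recall = tp / total_positives
    print(f"top {k}: precision = {precision:.2f}, recall = {recall:.2f}")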


precision-recall curve

[Figure: precision-recall curve, Precision (0.0–1.0) on the y-axis vs. Recall (0.0–1.0) on the x-axis]
Which system is better?

[Figure: precision-recall curves for two systems]

How can we quantify this?
Area under the curve
Area under the curve (AUC) is one metric that encapsulates both precision and recall:

calculate the precision/recall values for all thresholdings of the test set (like we did before)

then calculate the area under the curve

can also be calculated as the average precision for all the recall points
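A short sketch (not part of the original slides) of how the area under the precision-recall curve can be computed with scikit-learn; the labels reuse the earlier example, while the scores are assumed, illustrative values:

from sklearn.metrics import precision_recall_curve, auc, average_precision_score

y_true  = [1, 0, 1, 0, 1, 0, 0]                       # labels from the sorted example above
y_score = [0.80, 0.60, 0.55, 0.50, 0.20, 0.15, 0.10]  # assumed P(positive) scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("PR-AUC:", auc(recall, precision))              # trapezoidal area under the PR curve
print("Average precision:", average_precision_score(y_true, y_score))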
Area under the curve?

[Figure: precision-recall curves for two systems]

Any concerns/problems?

Area under the curve?

[Figure: precision-recall curves, with a particular recall range highlighted]

For real use, we are often only interested in performance in a particular range.

Eventually, we need to deploy. How do we decide what threshold to use?

Area under the curve?

Ideas? We’d like a compromise between precision and recall.
A combined measure: F
Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

    F = 1 / (α·(1/P) + (1−α)·(1/R)) = ((β² + 1)·P·R) / (β²·P + R)    (with β² = (1 − α)/α)

F1-measure
Most common is α = 0.5 (β = 1): equal balance/weighting between precision and recall:

    F1 = (2·P·R) / (P + R)
A combined measure: F
Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

    F = 1 / (α·(1/P) + (1−α)·(1/R)) = ((β² + 1)·P·R) / (β²·P + R)

Why harmonic mean?

Why not normal mean (i.e. average)?
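A minimal sketch (assumption: P and R have already been computed as fractions) of the weighted F measure and the usual F1:

def f_measure(precision, recall, beta=1.0):
    # (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the usual F1
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 2/3))   # F1 for the earlier example (P = 2/4, R = 2/3) ≈ 0.571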
F1 and other averages

[Figure: “Combined Measures” — value of the minimum, maximum, arithmetic, geometric, and harmonic means as precision varies from 0 to 100, with recall fixed at 70%]

Harmonic mean encourages precision/recall values that are similar!


Evaluation summarized
Accuracy is often NOT an appropriate evaluation metric for imbalanced data problems.

precision/recall capture different characteristics of our classifier

AUC and F1 can be used as a single metric to compare algorithm variations (and to tune hyperparameters)
Phishing – imbalanced data
Black box approach
Abstraction: we have a generic binary classifier; how can we use it to solve our new problem?

[Diagram: binary classifier outputs +1 or -1; optionally it also outputs a confidence/score]

Can we do some pre-processing/post-processing of our data to allow us to still use our binary classifiers?
Solutions to Imbalanced Learning

Data level: sampling methods

Algorithmic level: cost-sensitive methods; kernel and active learning methods
Several Common Approaches
• At the data level: re-sampling
  • Oversampling (random or directed)
    o Add more examples to the minority class
  • Undersampling (random or directed)
    o Remove samples from the majority class

• At the algorithmic level:
  • Adjusting the costs or weights of classes
  • Adjusting the decision threshold / probabilistic estimate at the tree leaf

Most machine learning models provide a parameter called class weights (a minimal example follows this slide).
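As a minimal illustration of the class-weights parameter mentioned above (scikit-learn is used here as an example; the exact setting names are scikit-learn specific):

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights classes inversely proportional to their frequencies in the data;
# an explicit dict such as {0: 1, 1: 20} is also accepted.
clf = LogisticRegression(class_weight='balanced')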
Sampling Methods
Create balance through sampling

[Diagram: if the data is imbalanced, modify the data distribution to create a balanced dataset]

A widely adopted technique for dealing with highly unbalanced datasets is called resampling:
1. Removing samples from the majority class (under-sampling).
2. Adding more examples to the minority class (over-sampling).
3. Or performing both simultaneously.
Sampling Methods
Create balance through sampling

Under-sampling can cause loss of information; oversampling may just randomly replicate records within the dataset, which can cause overfitting!

Advantages and disadvantages of under-sampling and oversampling?
Idea 1: subsampling
Create a new training data set by:
- including all k “positive” examples
- randomly picking k “negative” examples

[Diagram: labeled data (99.997% not-phishing, 0.003% phishing) → new training set with 50% not-phishing, 50% phishing]

pros/cons? (a sketch of the idea follows)
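A minimal sketch of the subsampling idea (assumptions: the data is in a pandas DataFrame named df with a 0/1 column named "label"; both names are illustrative):

import pandas as pd

def subsample(df: pd.DataFrame, label_col: str = "label", seed: int = 0) -> pd.DataFrame:
    positives = df[df[label_col] == 1]                                               # keep all k positives
    negatives = df[df[label_col] == 0].sample(n=len(positives), random_state=seed)   # pick k negatives at random
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)       # shuffle the new set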
Subsampling
Pros:
– Easy to implement
– Training becomes much more efficient (smaller
training set)
– For some domains, can work very well

Cons:
– Throwing away a lot of data/information
Idea 2: oversampling
Create a new training data set by:
- including all m “negative” examples
- including m “positive” examples:
  - repeat each example a fixed number of times, or
  - sample with replacement

[Diagram: labeled data (99.997% not-phishing, 0.003% phishing) → new training set with 50% not-phishing, 50% phishing]

pros/cons?
oversampling
Pros:
– Easy to implement
– Utilizes all of the training data
– Tends to perform well in a broader set of
circumstances than subsampling

Cons:
– Computationally expensive to train classifier
Idea 2b: weighted examples
Add costs/weights to the training set:

“negative” examples get weight 1

“positive” examples get a much larger weight

change the learning algorithm to optimize weighted training error

[Diagram: labeled data, 99.997% not-phishing with weight 1, 0.003% phishing with weight 99.997/0.003 = 33332]

pros/cons?
weighted examples
Pros:
– Achieves the effect of oversampling without the
computational cost
– Utilizes all of the training data
– Tends to perform well in a broader set of circumstances

Cons:
– Requires a classifier that can deal with weights

Of our three classifiers, can all be modified to handle weights?


Building decision trees
Otherwise:
- calculate the “score” for each feature if we used it to split the data
- pick the feature with the highest score, partition the data based on that feature value, and call recursively

We used the training error to decide on which feature to choose: use the weighted training error.

In general, any time we do a count, use the weighted count (e.g. in calculating the majority label at a leaf). A sketch follows.
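A small sketch (not from the slides) of per-example weights with a scikit-learn decision tree; the dataset and the weight ratio are made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic imbalanced dataset (illustrative only): ~1% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.01).astype(int)

# Weight the rare positives much more heavily, as on the slide (the ratio is illustrative).
weights = np.where(y == 1, 100.0, 1.0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y, sample_weight=weights)   # split scores and leaf majorities use weighted counts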
Idea 3: optimize a different error metric

Train classifiers that try and optimize F1 measure or AUC or …

or, come up with another learning algorithm designed specifically for imbalanced problems

pros/cons?
Idea 3: optimize a different error metric
Train classifiers that try and optimize F1 measure or AUC or …

Challenge: not all classifiers are amenable to this

or, come up with another learning algorithm designed specifically for imbalanced problems

Don’t want to reinvent the wheel!

That said, there are a number of approaches that have been developed to specifically handle imbalanced problems
SMOTE
SMOTE: Resampling Approach
• SMOTE stands for: Synthetic Minority Oversampling Technique
• It is a technique designed by Chawla et al. in 2002.
• SMOTE is an oversampling method that synthesizes new plausible examples in the minority class.

• SMOTE not only increases the size of the training set, but also increases the variety!

• SMOTE currently yields the best results as far as re-sampling and modifying the probabilistic estimate techniques go (Chawla, 2003).
SMOTE’s Informed Oversampling Procedure

For each minority sample:
I. Find its k-nearest minority neighbors
II. Randomly select j of these neighbors
III. Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbors
(j depends on the amount of oversampling desired)

For instance, if it sees two examples (of the same class) near each other, it creates a third artificial one, in the middle of the original two. (A sketch of the interpolation step follows.)
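A minimal sketch (not from the slides) of SMOTE’s interpolation step for a single minority sample; the array name minority and the function itself are illustrative, not a library API:

import numpy as np

def smote_point(minority: np.ndarray, i: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate one synthetic point from minority sample i (minority is an (n, d) array)."""
    rng = np.random.default_rng(seed)
    # I. distances from sample i to every other minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]        # its k nearest minority neighbors
    # II. randomly select one of those neighbors
    j = rng.choice(neighbors)
    # III. place a synthetic sample somewhere on the line joining the two points
    gap = rng.random()
    return minority[i] + gap * (minority[j] - minority[i])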
SMOTE
Synthetic Minority Oversampling Technique (SMOTE)

[Diagram: find the sample’s k-nearest minority neighbors, randomly select j of these, and randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbors]
SMOTE’s Shortcomings
• Overgeneralization
  a) SMOTE’s procedure may blindly generalize the minority area without regard to the majority class.
  b) It may oversample noisy samples.
  c) It may oversample uninformative samples.

• Lack of Flexibility (i.e., Static)
  a) The number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.
  b) It would be nice to increase the minority class just to the right value (i.e., not excessively) to avoid the side effects of unbalanced datasets.
Cost-Sensitive LR
Cost-Sensitive Approach
• Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model.
• Logistic regression doesn’t support imbalanced classification directly.
• Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed distribution into account.
• This can be achieved by specifying a class weighting configuration that is used to influence the amount that logistic regression coefficients are updated during training.
• The weighting can penalize the model less for errors made on examples from the majority class and penalize the model more for errors made on examples from the minority class.
• The result is a version of logistic regression that performs better on imbalanced classification tasks, generally referred to as cost-sensitive or weighted logistic regression.
https://machinelearningmastery.com/cost-sensitive-logistic-regression/
Cost-Sensitive Approach
• In logistic regression, we calculate the loss per example using binary cross-entropy:

  Loss = −y log(p(y)) − (1−y) log(1−p(y))

  where y is the label (1 for class A and 0 for class B) and p(y) is the predicted probability of the point being class A.

• In this particular form, we give equal weight to both the positive and the negative classes.
• However, if we set class_weight = {0:1, 1:20}, the classifier in the background tries to minimize:

  NewLoss = −20·y log(p(y)) − 1·(1−y) log(1−p(y))

• That means we penalize our model around 20 times more when it misclassifies a positive minority example than when it misclassifies a negative majority example. (A sketch follows.)
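A minimal sketch (not from the slides) of the weighted loss above and of the equivalent class_weight setting in scikit-learn; the 20x ratio simply mirrors the {0:1, 1:20} example:

import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_bce(y, p, w_pos=20.0, w_neg=1.0):
    """Per-example weighted binary cross-entropy, as in the NewLoss formula above."""
    return -(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))

# The same effect inside scikit-learn: pass class weights to the estimator.
clf = LogisticRegression(class_weight={0: 1, 1: 20})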
SKLearn Example
SKLearn Code
• The dataset is about Abalone.
• Abalone is a species of marine snail.
• There are 4174 instances with 8 features for each record.
• % of Negative instances: 99.23%
• % of Positive instances: 0.77%

• Our goal is to identify whether an abalone belongs to a specific class: Positive (class 19), Negative (all remaining).
• So, this is a binary classification problem of either positive (class 19) or negative.

• You can download the data from the following link:
• https://github.com/liannewriting/YouTube-videos-public/tree/main/imbalanced-data-machine-learning-abalone19

https://www.youtube.com/watch?v=xFErz6I-FyE&list=PL2L4c5jChmctqiXvOaJA91o0OJhYq1rR9&index=1
SKLearn Code
# How to handle imbalanced data in machine learning classification
# The slides presented are based on the following tutorial:
# https://www.justintodata.com/imbalanced-data-machine-learning-classification/
# This tutorial will focus on imbalanced data in machine learning for binary classes,
# but you could extend the concept to multi-class.

import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
SKLearn Code
# Read the dataset
df = pd.read_csv('abalone19.dat')
df.head()

   Sex  Length  Diameter  Height  W_weight  S_weight  V_weight  Shell_weight  Class
0    M   0.455     0.365   0.095    0.5140    0.2245    0.1010         0.150  negative
1    M   0.350     0.265   0.090    0.2255    0.0995    0.0485         0.070  negative
2    F   0.530     0.420   0.135    0.6770    0.2565    0.1415         0.210  negative
3    M   0.440     0.365   0.125    0.5160    0.2155    0.1140         0.155  negative
4    I   0.330     0.255   0.080    0.2050    0.0895    0.0395         0.055  negative
SKLearn Code
# Find out more about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4174 entries, 0 to 4173
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- --------- ------------------ --------
0 Sex 4174 non-null object
1 Length 4174 non-null float64
2 Diameter 4174 non-null float64
3 Height 4174 non-null float64
4 Whole_weight 4174 non-null float64
5 Shucked_weight 4174 non-null float64
6 Viscera_weight 4174 non-null float64
7 Shell_weight 4174 non-null float64
8 Class 4174 non-null object

dtypes: float64(7), object(2)


memory usage: 293.6+ KB

SKLearn Code
# Produce some stats on the dataset
df.describe()

Length Diameter Height Whole_weight Shucked_weight Viscera_weight Shell_weight Class Sex_I Sex_M
Count 4174.0 4174.0 4174.0 4174.0 4174.0 4174.0 4174.0 4174.000000 4174.0 4174.0
Mean 0.5240 0.4079 0.139524 0.828771 0.359361 0.180607 0.238853 0.007667 0.321275 0.365597
Std 0.1200 0.0991 0.041818 0.490065 0.221771 0.109574 0.139143 0.087233 0.467022 0.481655
Min 0.0750 0.0550 0.000000 0.002000 0.001000 0.000500 0.001500 0.000000 0.000000 0.000000
25% 0.4500 0.3500 0.115000 0.442125 0.186500 0.093500 0.130000 0.000000 0.000000 0.000000
50% 0.5450 0.4250 0.140000 0.799750 0.336000 0.171000 0.234000 0.000000 0.000000 0.000000
75% 0.6150 0.4800 0.165000 1.153000 0.501875 0.252875 0.328875 0.000000 1.000000 1.000000
Max 0.8150 0.6500 1.130000 2.825500 1.488000 0.760000 1.005000 1.000000 1.000000 1.000000

SKLearn Code
# We'll use the most basic machine learning classification algorithm: logistic regression.
# It is better to convert all the categorical columns for logistic regression to dummy variables.
# We'll convert the categorical columns (Sex and Class) within the dataset before modeling.

# Let's look at the category of Sex
# Three classes: Male, Infant and Female
df['Sex'].value_counts()

M    1526
I    1341
F    1307
Name: Sex, dtype: int64

# Let's look at the category of Class
# Two classes: Negative and Positive
df['Class'].value_counts()
SKLearn Code
# Let us convert the Class label into 0 and 1
df['Class'] = df['Class'].map(lambda x: 0 if x == 'negative' else 1)
df

      Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  Shell_weight  Class
0       M   0.455     0.365   0.095        0.5140          0.2245          0.1010        0.1500      0
1       M   0.350     0.265   0.090        0.2255          0.0995          0.0485        0.0700      0
2       F   0.530     0.420   0.135        0.6770          0.2565          0.1415        0.2100      0
3       M   0.440     0.365   0.125        0.5160          0.2155          0.1140        0.1550      0
4       I   0.330     0.255   0.080        0.2050          0.0895          0.0395        0.0550      0
...   ...     ...       ...     ...           ...             ...             ...           ...    ...
4169    M   0.560     0.430   0.155        0.8675          0.4000          0.1720        0.2290      0
4170    F   0.565     0.450   0.165        0.8870          0.3700          0.2390        0.2490      0
4171    M   0.590     0.440   0.135        0.9660          0.4390          0.2145        0.2605      0
4172    M   0.600     0.475   0.205        1.1760          0.5255          0.2875        0.3080      0
4173    F   0.625     0.485   0.150        1.0945          0.5310          0.2610        0.2960      0

4174 rows × 9 columns
SKLearn Code
# Let us convert the Sex feature into two dummy variables
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df

   Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  Shell_weight  Class  Sex_I  Sex_M
0   0.455     0.365   0.095        0.5140          0.2245          0.1010        0.1500      0      0      1
1   0.350     0.265   0.090        0.2255          0.0995          0.0485        0.0700      0      0      1

4174 rows × 10 columns
SKLearn Code
df['Class'].value_counts(normalize=True)

0    0.992333
1    0.007667
Name: Class, dtype: float64

df['Class'].value_counts().plot(kind='bar')
SKLearn Code
# Splitting training and testing sets
# Let's split the dataset into training (80%) and test sets (20%).
# Use the train_test_split function with the stratify argument based on Class categories,
# so that both the training and test datasets will have similar portions of classes as
# the complete dataset.
# This is important for imbalanced data.

df_train, df_test = train_test_split(df, test_size=0.2, stratify=df['Class'], random_state=888)

features = df_train.drop(columns=['Class']).columns
SKLearn Code
# Two sets: df_train and df_test.
# We'll use df_train for modeling, and df_test for evaluation.

# Print the different classes (0 and 1) that are present in the training set
df_train['Class'].value_counts()

Training data
0    3313
1      26
Name: Class, dtype: int64

# Print the different classes (0 and 1) that are present in the testing set
df_test['Class'].value_counts()

Testing data
0    829
1      6
Name: Class, dtype: int64
SKLearn Code
# Let us train a logistic regression with the unbalanced data and check the AUC
clf = LogisticRegression(random_state=888)

features = df_train.drop(columns=['Class']).columns
clf.fit(df_train[features], df_train['Class'])

y_pred = clf.predict_proba(df_test[features])[:, 1]

print("The AUC score for this model using the original unbalanced data ...")
roc_auc_score(df_test['Class'], y_pred)

The AUC score for this model using the original unbalanced data ...
0.683956574185766

[Figure: ROC curve, TPR vs. FPR]
SKLearn Code
# We could use the library imbalanced-learn to randomly oversample.

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

ros = RandomOverSampler(random_state=888)
X_resampled, y_resampled = ros.fit_resample(df_train[features], df_train['Class'])
y_resampled.value_counts()

0    3313
1    3313
Name: Class, dtype: int64
SKLearn Code
# We can then apply logistic regression and calculate the AUC metric.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]

print("The AUC score for this model after Random Over Sampling ...")
roc_auc_score(df_test['Class'], y_pred)

The AUC score for this model after Random Over Sampling ...
0.838962605548854
SKLearn Code
# Random sampling is easy, but the new samples don't add more information.
# SMOTE improves on that.
# SMOTE oversamples the minority class by creating 'synthetic' examples.
# It involves some methods (nearest neighbors) to generate plausible examples.
print("Oversampling using SMOTE ...")
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=888)
X_resampled, y_resampled = smote.fit_resample(df_train[features], df_train['Class'])

y_resampled.value_counts()

Oversampling using SMOTE ...
0    3313
1    3313
Name: Class, dtype: int64
SKLearn Code
# We'll apply logistic regression on the balanced dataset and calculate its AUC.

clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after SMOTE ...")
roc_auc_score(df_test['Class'], y_pred)

The AUC score for this model after SMOTE ...
0.7913148371531966
SKLearn Code
# Now we will use undersampling.
# With undersampling, we downsize the majority class to balance with the minority class.
# Simple random undersampling
# We'll begin with simple random undersampling.
rus = RandomUnderSampler(random_state=888)
X_resampled, y_resampled = rus.fit_resample(df_train[features], df_train['Class'])

y_resampled.value_counts()

0    26
1    26
Name: Class, dtype: int64
SKLearn Code
# And this produces the same AUC as pandas undersampling, since we use the same random seed.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after Under Sampling ...")
roc_auc_score(df_test['Class'], y_pred)

The AUC score for this model after Under Sampling ...
0.6465621230398071
