WINSEM2024-25 CBS3006 ETH VL2024250505168 2025-01-11 Reference-Material-I
Imbalanced Data

• Imbalanced data is where the distribution of samples among classes is not uniform.

Example: a labeled dataset that is 99.997% not-phishing and 0.003% phishing. A classifier that always predicts "not-phishing" achieves 99.997% accuracy.
Decision trees:
– explicitly minimize training error
– when pruning, pick the "majority" label at leaves
– tend to do very poorly on imbalanced problems

k-NN:
– even for small k, the majority class will tend to overwhelm the vote

perceptron:
– can be reasonable since it only updates when a mistake is made
– can take a long time to learn
Part of the problem: evaluation
Accuracy is not the right measure of classifier performance in these domains, as the sketch below illustrates.
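A minimal sketch (toy data constructed to mirror the 0.003% phishing example) showing how an always-negative classifier scores near-perfect accuracy while catching nothing:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 3 phishing examples out of 100,000 (0.003% positive class)
y_true = np.zeros(100_000, dtype=int)
y_true[:3] = 1

# a "classifier" that always predicts the majority class (not-phishing)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99997 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0 -- misses every phishing example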
precision and recall

data   label  predicted
       0      0
       0      1
       1      0
       1      1
       0      1
       1      1
       0      0

precision = # correctly predicted as positive / # examples predicted as positive
recall = # correctly predicted as positive / # positive examples in test set

For this data: precision = 2/4, recall = 2/3
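A minimal sketch checking those two numbers with scikit-learn, using the label/predicted columns above as lists:

from sklearn.metrics import precision_score, recall_score

label     = [0, 0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 1, 0]

print(precision_score(label, predicted))  # 2/4 = 0.5
print(recall_score(label, predicted))     # 2/3 ≈ 0.667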
Why do we have both measures?
How can we maximize precision?
How can we maximize recall?
Maximizing precision

data   label  predicted
       0      0
       0      0
       1      0
       1      0
       0      0
       1      0
       0      0

Don't predict anything as positive!
Maximizing recall

data   label  predicted
       0      1
       0      1
       1      1
       1      1
       0      1
       1      1
       0      1

Predict everything as positive!
precision vs. recall
Often there is a tradeoff between precision and recall.
precision/recall tradeoff

data   label  predicted  confidence
       0      0          0.90
       1      1          0.80
       0      1          0.60
       1      1          0.55
       0      1          0.50
       1      0          0.20
       0      0          0.75

put most confident positive predictions at top
put most confident negative predictions at bottom
precision/recall tradeoff

Ranked list (most confident positive predictions at the top, most confident negative predictions at the bottom):

data   label  predicted  confidence
       1      1          0.80
       0      1          0.60
       1      1          0.55
       0      1          0.50
       1      0          0.20
       0      0          0.75
       0      0          0.90

For each cutoff in this ranked list, treat everything above the cutoff as predicted positive and compute precision and recall; each cutoff gives one (precision, recall) point on the tradeoff curve.
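These points can be generated automatically by sweeping a threshold over the classifier's scores; a minimal sketch with scikit-learn, using made-up positive-class scores rather than the exact confidences in the table above:

from sklearn.metrics import precision_recall_curve

label = [1, 0, 1, 0, 1, 0, 0]                        # true labels, in ranked order
score = [0.80, 0.60, 0.55, 0.50, 0.20, 0.15, 0.10]   # hypothetical P(positive) per example

precision, recall, thresholds = precision_recall_curve(label, score)
print(precision)  # one precision value per threshold (plus a trailing 1.0)
print(recall)     # the corresponding recall values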
[Figure: precision-recall curve, with Recall on the x-axis and Precision on the y-axis, both ranging from 0.0 to 1.0.]
Which system is better?

[Figure: two systems' precision-recall curves plotted on the same precision vs. recall axes.]

Any concerns/problems? Area under the curve?
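The area under the precision-recall curve gives one number per system to compare; a minimal sketch (same made-up labels and scores as the previous sketch):

from sklearn.metrics import precision_recall_curve, auc, average_precision_score

label = [1, 0, 1, 0, 1, 0, 0]
score = [0.80, 0.60, 0.55, 0.50, 0.20, 0.15, 0.10]

precision, recall, _ = precision_recall_curve(label, score)
print(auc(recall, precision))                  # area under the precision-recall curve
print(average_precision_score(label, score))   # a closely related single-number summary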
A combined measure: F

Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),   where α = 1 / (1 + β²)

F1-measure

Most common α = 0.5: equal balance/weighting between precision and recall:

F1 = 2·P·R / (P + R)
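A minimal sketch computing F1 by hand from P and R and cross-checking with scikit-learn, reusing the toy label/predicted lists from earlier (β = 2 is shown only as an example of weighting recall more heavily):

from sklearn.metrics import f1_score, fbeta_score

label     = [0, 0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 1, 0]

P, R = 2 / 4, 2 / 3
print(2 * P * R / (P + R))                     # F1 by hand ≈ 0.571
print(f1_score(label, predicted))              # same value from sklearn
print(fbeta_score(label, predicted, beta=2))   # weights recall more heavily than precision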
[Figure: minimum, maximum, arithmetic mean, geometric mean, and harmonic mean of precision and recall, plotted as precision varies from 0 to 100 with recall fixed at 70%.]
Several Common Approaches

• Data level: sampling methods
• Algorithmic level: cost-sensitive methods
At the Data Level: Re-Sampling

• Oversampling (random or directed)
  o Add more examples to the minority class
• Undersampling (random or directed)
  o Remove samples from the majority class

Most machine learning models also provide a parameter for class weights (a minimal sketch follows below).
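A minimal sketch of the class-weights route in scikit-learn, on toy arrays; class_weight='balanced' reweights classes inversely to their frequency:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]])
y = np.array([0, 0, 0, 0, 1, 1])   # the positive class is the minority

# 'balanced' gives the rare class a proportionally larger weight in the loss
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)
print(clf.predict(X))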
Sampling Methods

Create balance through sampling: if the data is imbalanced (e.g. 0.003% phishing), modify the data distribution to create a balanced dataset (50% not-phishing / 50% phishing).

pros/cons?
Subsampling

Pros:
– Easy to implement
– Training becomes much more efficient (smaller training set)
– For some domains, can work very well

Cons:
– Throwing away a lot of data/information
Idea 2: oversampling

Create a new training data set by:
– including all m "negative" examples
– including m "positive" examples, either by
  o repeating each example a fixed number of times, or
  o sampling with replacement

(This turns the 99.997% not-phishing / 0.003% phishing labeled data into 50% not-phishing / 50% phishing.)

pros/cons?
oversampling

Pros:
– Easy to implement
– Utilizes all of the training data
– Tends to perform well in a broader set of circumstances than subsampling

Cons:
– Computationally expensive to train the classifier
Idea 2b: weighted examples

Add costs/weights to the training set:
– "negative" examples get weight 1
– "positive" examples get weight 99.997 / 0.003 ≈ 33332

pros/cons?
weighted examples

Pros:
– Achieves the effect of oversampling without the computational cost
– Utilizes all of the training data
– Tends to perform well in a broader set of circumstances

Cons:
– Requires a classifier that can deal with weights
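A minimal sketch of weighted training in scikit-learn, on toy arrays; most estimators accept a sample_weight argument in fit (the weight values here are illustrative, not the 33332 from the slide):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]])
y = np.array([0, 0, 0, 0, 1, 1])

# negatives get weight 1, positives get a larger weight (a scaled-down version of the slide's ratio)
w = np.where(y == 1, 2.0, 1.0)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=w)
print(clf.predict(X))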
Idea 3: optimize a different error metric

Train classifiers that try to optimize the F1 measure or AUC or …
SMOTE’s Informed Oversampling Procedure
https://www.youtube.com/watch?v=xFErz6I-FyE&list=PL2L4c5jChmctqiXvOaJA91o0OJhYq1rR9&index=1
SKLearn Code

# How to handle imbalanced data in machine learning classification.
# The slides presented are based on the following tutorial:
# https://www.justintodata.com/imbalanced-data-machine-learning-classification/
# This tutorial focuses on imbalanced data in machine learning for binary classes,
# but you could extend the concept to multi-class.
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
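The slides never show how df is loaded. A minimal sketch, assuming a local CSV copy of the dataset used in the tutorial (the filename is a placeholder; the columns are those listed by df.info() below):

# hypothetical local file containing the columns shown below
df = pd.read_csv('abalone.csv')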
SKLearn Code
# Find out more about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4174 entries, 0 to 4173
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- --------- ------------------ --------
0 Sex 4174 non-null object
1 Length 4174 non-null float64
2 Diameter 4174 non-null float64
3 Height 4174 non-null float64
4 Whole_weight 4174 non-null float64
5 Shucked_weight 4174 non-null float64
6 Viscera_weight 4174 non-null float64
7 Shell_weight 4174 non-null float64
8 Class 4174 non-null object
SKLearn Code
# Produce some stats on the dataset
df.describe()
Length Diameter Height Whole_weight Shucked_weight Viscera_weight Shell_weight Class Sex_I Sex_M
Count 4174.0 4174.0 4174.0 4174.0 4174.0 4174.0 4174.0 4174.000000 4174.0 4174.0
Mean 0.5240 0.4079 0.139524 0.828771 0.359361 0.180607 0.238853 0.007667 0.321275 0.365597
Std 0.1200 0.0991 0.041818 0.490065 0.221771 0.109574 0.139143 0.087233 0.467022 0.481655
Min 0.0750 0.0550 0.000000 0.002000 0.001000 0.000500 0.001500 0.000000 0.000000 0.000000
25% 0.4500 0.3500 0.115000 0.442125 0.186500 0.093500 0.130000 0.000000 0.000000 0.000000
50% 0.5450 0.4250 0.140000 0.799750 0.336000 0.171000 0.234000 0.000000 0.000000 0.000000
75% 0.6150 0.4800 0.165000 1.153000 0.501875 0.252875 0.328875 0.000000 1.000000 1.000000
Max 0.8150 0.6500 1.130000 2.825500 1.488000 0.760000 1.005000 1.000000 1.000000 1.000000
SKLearn Code

# We'll use the most basic machine learning classification algorithm: logistic regression.
# It is better to convert all the categorical columns to dummy variables for logistic regression.
# We'll convert the categorical columns (Sex and Class) within the dataset before modeling.

# Let's look at the categories of Sex.
# Three classes: Male, Infant and Female.
df['Sex'].value_counts()

M    1526
I    1341
F    1307
Name: Sex, dtype: int64
SKLearn Code

# Let us convert the Class label into 0 and 1
df['Class'] = df['Class'].map(lambda x: 0 if x == 'negative' else 1)
df
SKLearn Code
# Let us convert the Sex feature into two dummy variables
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df
Length Diameter Height Whole_weight Shucked_weight Viscera_weight Shell_weight Class Sex_I Sex_M
0 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 0 0 1
1 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 0 0 1
SKLearn Code

df['Class'].value_counts(normalize=True)

0    0.992333
1    0.007667
Name: Class, dtype: float64

df['Class'].value_counts().plot(kind='bar')
SKLearn Code

# Splitting training and testing sets
# Let's split the dataset into training (80%) and test sets (20%).
# Use the train_test_split function with the stratify argument based on the Class categories,
# so that both the training and test datasets have similar proportions of classes
# as the complete dataset. This is important for imbalanced data.
features = df_train.drop(columns=['Class']).columns
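The split itself is not shown on the slides; a minimal sketch of the stratified split described above (reusing the 888 seed from the later models is an assumption):

from sklearn.model_selection import train_test_split

# stratify on Class so train and test keep the same class proportions
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df['Class'], random_state=888)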
SKLearn Code

# Two sets: df_train and df_test.
# We'll use df_train for modeling, and df_test for evaluation.

# Print the different classes (0 and 1) that are present in the training set
df_train['Class'].value_counts()

Training data:
0    3313
1      26
Name: Class, dtype: int64

# Print the different classes (0 and 1) that are present in the testing set
df_test['Class'].value_counts()

Testing data:
0    829
1      6
Name: Class, dtype: int64
SKLearn Code

# Let us train a logistic regression with the unbalanced data and check the AUC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

clf = LogisticRegression(random_state=888)
features = df_train.drop(columns=['Class']).columns
clf.fit(df_train[features], df_train['Class'])
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model using the original unbalanced data ...")
roc_auc_score(df_test['Class'], y_pred)

The AUC score for this model using the original unbalanced data ...
0.683956574185766

[Figure: ROC curve, TPR vs. FPR]
SKLearn Code

# We could use the library imbalanced-learn to randomly oversample.
ros = RandomOverSampler(random_state=888)
X_resampled, y_resampled = ros.fit_resample(df_train[features], df_train['Class'])
y_resampled.value_counts()

0    3313
1    3313
Name: Class, dtype: int64
SKLearn Code
# We can then apply Logistic Regression and calculate the AUC metric.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after Random Over Sampling ...")
roc_auc_score(df_test['Class'], y_pred)
The AUC score for this model after Random Over Sampling ...
0.838962605548854
SKLearn Code

# Random sampling is easy, but the new samples don't add more information.
# SMOTE improves on that.
# SMOTE oversamples the minority class by creating 'synthetic' examples.
# It uses nearest neighbors to generate plausible new examples.
print("Oversampling using SMOTE ...")
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=888)
X_resampled, y_resampled = smote.fit_resample(df_train[features], df_train['Class'])
y_resampled.value_counts()
SKLearn Code

# We'll apply logistic regression on the balanced dataset and calculate its AUC.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after SMOTE ...")
roc_auc_score(df_test['Class'], y_pred)
SKLearn Code

# Now we will use undersampling.
# In undersampling, we downsize the majority class to balance it with the minority class.
# Simple random undersampling
# We'll begin with simple random undersampling.
rus = RandomUnderSampler(random_state=888)
X_resampled, y_resampled = rus.fit_resample(df_train[features], df_train['Class'])
y_resampled.value_counts()

0    26
1    26
Name: Class, dtype: int64
SKLearn Code

# And this produces the same AUC as pandas undersampling, since we use the same random seed.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after Under Sampling ...")
roc_auc_score(df_test['Class'], y_pred)
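The remaining imports at the top (ClusterCentroids, TomekLinks, SMOTETomek) are not exercised on these slides; a minimal sketch of how they plug into the same fit_resample pattern (the seed choice is an assumption):

# undersample by replacing majority-class clusters with their centroids
cc = ClusterCentroids(random_state=888)
X_cc, y_cc = cc.fit_resample(df_train[features], df_train['Class'])

# undersample by removing Tomek links (deterministic, no random seed needed)
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(df_train[features], df_train['Class'])

# combine SMOTE oversampling with Tomek-link cleaning
st = SMOTETomek(random_state=888)
X_st, y_st = st.fit_resample(df_train[features], df_train['Class'])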