WINSEM2024-25 CBS3006 ETH VL2024250505168 2025-01-11 Reference-Material-I
Imbalanced Data

• Imbalanced data is where the distribution of samples among classes is not uniform.

Example: a labeled dataset that is 99.997% not-phishing and 0.003% phishing. A classifier that always predicts "not-phishing" achieves 99.997% accuracy.
Decision trees:
– explicitly minimize training error
– when pruning, pick the "majority" label at leaves
– tend to do very poorly on imbalanced problems

k-NN:
– even for small k, the majority class will tend to overwhelm the vote

perceptron:
– can be reasonable since it only updates when a mistake is made
– can take a long time to learn
Part of the problem: evaluation
Accuracy is not the right measure of classifier performance in these domains, as the sketch below illustrates.
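A minimal sketch (toy data constructed to mirror the 0.003% phishing example) showing how an always-negative classifier scores near-perfect accuracy while catching nothing:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 3 phishing examples out of 100,000 (0.003% positive class)
y_true = np.zeros(100_000, dtype=int)
y_true[:3] = 1

# a "classifier" that always predicts the majority class (not-phishing)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99997 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0 -- misses every phishing example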
precision and recall

data   label  predicted
       0      0
       0      1
       1      0
       1      1
       0      1
       1      1
       0      0

precision = # correctly predicted as positive / # examples predicted as positive
recall = # correctly predicted as positive / # positive examples in test set

For this data: precision = 2/4, recall = 2/3
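A minimal sketch checking those two numbers with scikit-learn, using the label/predicted columns above as lists:

from sklearn.metrics import precision_score, recall_score

label     = [0, 0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 1, 0]

print(precision_score(label, predicted))  # 2/4 = 0.5
print(recall_score(label, predicted))     # 2/3 ≈ 0.667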
Why do we have both measures?
How can we maximize precision?
How can we maximize recall?
Maximizing precision

data   label  predicted
       0      0
       0      0
       1      0
       1      0
       0      0
       1      0
       0      0

Don't predict anything as positive!
Maximizing recall

data   label  predicted
       0      1
       0      1
       1      1
       1      1
       0      1
       1      1
       0      1

Predict everything as positive!
precision vs. recall
Often there is a tradeoff between precision and recall.
precision/recall tradeoff

data   label  predicted  confidence
       0      0          0.90
       1      1          0.80
       0      1          0.60
       1      1          0.55
       0      1          0.50
       1      0          0.20
       0      0          0.75

put most confident positive predictions at top
put most confident negative predictions at bottom
precision/recall tradeoff

Ranked list (most confident positive predictions at the top, most confident negative predictions at the bottom):

data   label  predicted  confidence
       1      1          0.80
       0      1          0.60
       1      1          0.55
       0      1          0.50
       1      0          0.20
       0      0          0.75
       0      0          0.90

For each cutoff in this ranked list, treat everything above the cutoff as predicted positive and compute precision and recall; each cutoff gives one (precision, recall) point on the tradeoff curve.
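These points can be generated automatically by sweeping a threshold over the classifier's scores; a minimal sketch with scikit-learn, using made-up positive-class scores rather than the exact confidences in the table above:

from sklearn.metrics import precision_recall_curve

label = [1, 0, 1, 0, 1, 0, 0]                        # true labels, in ranked order
score = [0.80, 0.60, 0.55, 0.50, 0.20, 0.15, 0.10]   # hypothetical P(positive) per example

precision, recall, thresholds = precision_recall_curve(label, score)
print(precision)  # one precision value per threshold (plus a trailing 1.0)
print(recall)     # the corresponding recall values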
[Figure: precision-recall curve, with Recall on the x-axis and Precision on the y-axis, both ranging from 0.0 to 1.0.]
Which system is better?

[Figure: two systems' precision-recall curves plotted on the same precision vs. recall axes.]

Any concerns/problems? Area under the curve?
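The area under the precision-recall curve gives one number per system to compare; a minimal sketch (same made-up labels and scores as the previous sketch):

from sklearn.metrics import precision_recall_curve, auc, average_precision_score

label = [1, 0, 1, 0, 1, 0, 0]
score = [0.80, 0.60, 0.55, 0.50, 0.20, 0.15, 0.10]

precision, recall, _ = precision_recall_curve(label, score)
print(auc(recall, precision))                  # area under the precision-recall curve
print(average_precision_score(label, score))   # a closely related single-number summary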
A combined measure: F

Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),   where α = 1 / (1 + β²)

F1-measure

Most common α = 0.5: equal balance/weighting between precision and recall:

F1 = 2·P·R / (P + R)
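A minimal sketch computing F1 by hand from P and R and cross-checking with scikit-learn, reusing the toy label/predicted lists from earlier (β = 2 is shown only as an example of weighting recall more heavily):

from sklearn.metrics import f1_score, fbeta_score

label     = [0, 0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 1, 0]

P, R = 2 / 4, 2 / 3
print(2 * P * R / (P + R))                     # F1 by hand ≈ 0.571
print(f1_score(label, predicted))              # same value from sklearn
print(fbeta_score(label, predicted, beta=2))   # weights recall more heavily than precision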
[Figure: minimum, maximum, arithmetic mean, geometric mean, and harmonic mean of precision and recall, plotted as precision varies from 0 to 100 with recall fixed at 70%.]
Several Common Approaches

• Data level: sampling methods
• Algorithmic level: cost-sensitive methods
At the Data Level: Re-Sampling

• Oversampling (random or directed)
  o Add more examples to the minority class
• Undersampling (random or directed)
  o Remove samples from the majority class

Most machine learning models also provide a parameter for class weights (a minimal sketch follows below).
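A minimal sketch of the class-weights route in scikit-learn, on toy arrays; class_weight='balanced' reweights classes inversely to their frequency:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]])
y = np.array([0, 0, 0, 0, 1, 1])   # the positive class is the minority

# 'balanced' gives the rare class a proportionally larger weight in the loss
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)
print(clf.predict(X))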
Sampling Methods

Create balance through sampling: if the data is imbalanced (e.g. 0.003% phishing), modify the data distribution to create a balanced dataset (50% not-phishing / 50% phishing).

pros/cons?
Subsampling

Pros:
– Easy to implement
– Training becomes much more efficient (smaller training set)
– For some domains, can work very well

Cons:
– Throwing away a lot of data/information
Idea 2: oversampling

Create a new training data set by:
– including all m "negative" examples
– including m "positive" examples, either by
  o repeating each example a fixed number of times, or
  o sampling with replacement

(This turns the 99.997% not-phishing / 0.003% phishing labeled data into 50% not-phishing / 50% phishing.)

pros/cons?
oversampling

Pros:
– Easy to implement
– Utilizes all of the training data
– Tends to perform well in a broader set of circumstances than subsampling

Cons:
– Computationally expensive to train the classifier
Idea 2b: weighted examples

Add costs/weights to the training set:
– "negative" examples get weight 1
– "positive" examples get weight 99.997 / 0.003 ≈ 33332

pros/cons?
weighted examples

Pros:
– Achieves the effect of oversampling without the computational cost
– Utilizes all of the training data
– Tends to perform well in a broader set of circumstances

Cons:
– Requires a classifier that can deal with weights
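A minimal sketch of weighted training in scikit-learn, on toy arrays; most estimators accept a sample_weight argument in fit (the weight values here are illustrative, not the 33332 from the slide):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]])
y = np.array([0, 0, 0, 0, 1, 1])

# negatives get weight 1, positives get a larger weight (a scaled-down version of the slide's ratio)
w = np.where(y == 1, 2.0, 1.0)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=w)
print(clf.predict(X))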
Idea 3: optimize a different error metric

Train classifiers that try to optimize the F1 measure or AUC or …
SMOTE’s Informed Oversampling Procedure
https://www.youtube.com/watch?v=xFErz6I-FyE&list=PL2L4c5jChmctqiXvOaJA91o0OJhYq1rR9&index=1
SKLearn Code

# How to handle imbalanced data in machine learning classification.
# The slides presented are based on the following tutorial:
# https://www.justintodata.com/imbalanced-data-machine-learning-classification/
# This tutorial focuses on imbalanced data in machine learning for binary classes,
# but you could extend the concept to multi-class.
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
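The slides never show how df is loaded. A minimal sketch, assuming a local CSV copy of the dataset used in the tutorial (the filename is a placeholder; the columns are those listed by df.info() below):

# hypothetical local file containing the columns shown below
df = pd.read_csv('abalone.csv')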
SKLearn Code
# Find out more about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4174 entries, 0 to 4173
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- --------- ------------------ --------
0 Sex 4174 non-null object
1 Length 4174 non-null float64
2 Diameter 4174 non-null float64
3 Height 4174 non-null float64
4 Whole_weight 4174 non-null float64
5 Shucked_weight 4174 non-null float64
6 Viscera_weight 4174 non-null float64
7 Shell_weight 4174 non-null float64
8 Class 4174 non-null object
SKLearn Code
# Produce some stats on the dataset
df.describe()
Length Diameter Height Whole_weight Shucked_weight Viscera_weight Shell_weight Class Sex_I Sex_M
Count 4174.0 4174.0 4174.0 4174.0 4174.0 4174.0 4174.0 4174.000000 4174.0 4174.0
Mean 0.5240 0.4079 0.139524 0.828771 0.359361 0.180607 0.238853 0.007667 0.321275 0.365597
Std 0.1200 0.0991 0.041818 0.490065 0.221771 0.109574 0.139143 0.087233 0.467022 0.481655
Min 0.0750 0.0550 0.000000 0.002000 0.001000 0.000500 0.001500 0.000000 0.000000 0.000000
25% 0.4500 0.3500 0.115000 0.442125 0.186500 0.093500 0.130000 0.000000 0.000000 0.000000
50% 0.5450 0.4250 0.140000 0.799750 0.336000 0.171000 0.234000 0.000000 0.000000 0.000000
75% 0.6150 0.4800 0.165000 1.153000 0.501875 0.252875 0.328875 0.000000 1.000000 1.000000
Max 0.8150 0.6500 1.130000 2.825500 1.488000 0.760000 1.005000 1.000000 1.000000 1.000000
SKLearn Code

# We'll use the most basic machine learning classification algorithm: logistic regression.
# It is better to convert all the categorical columns to dummy variables for logistic regression.
# We'll convert the categorical columns (Sex and Class) within the dataset before modeling.

# Let's look at the categories of Sex.
# Three classes: Male, Infant and Female.
df['Sex'].value_counts()

M    1526
I    1341
F    1307
Name: Sex, dtype: int64
SKLearn Code

# Let us convert the Class label into 0 and 1
df['Class'] = df['Class'].map(lambda x: 0 if x == 'negative' else 1)
df
SKLearn Code
# Let us convert the Sex feature into two dummy variables
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df
Length Diameter Height Whole_weight Shucked_weight Viscera_weight Shell_weight Class Sex_I Sex_M
0 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 0 0 1
1 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 0 0 1
SKLearn Code

df['Class'].value_counts(normalize=True)

0    0.992333
1    0.007667
Name: Class, dtype: float64

df['Class'].value_counts().plot(kind='bar')
SKLearn Code

# Splitting training and testing sets
# Let's split the dataset into training (80%) and test sets (20%).
# Use the train_test_split function with the stratify argument based on the Class categories,
# so that both the training and test datasets have similar proportions of classes
# as the complete dataset. This is important for imbalanced data.
features = df_train.drop(columns=['Class']).columns
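The split itself is not shown on the slides; a minimal sketch of the stratified split described above (reusing the 888 seed from the later models is an assumption):

from sklearn.model_selection import train_test_split

# stratify on Class so train and test keep the same class proportions
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df['Class'], random_state=888)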
SKLearn Code

# Two sets: df_train and df_test.
# We'll use df_train for modeling, and df_test for evaluation.

# Print the different classes (0 and 1) that are present in the training set
df_train['Class'].value_counts()

Training data:
0    3313
1      26
Name: Class, dtype: int64

# Print the different classes (0 and 1) that are present in the testing set
df_test['Class'].value_counts()

Testing data:
0    829
1      6
Name: Class, dtype: int64
SKLearn Code

# Let us train a logistic regression with the unbalanced data and check the AUC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

clf = LogisticRegression(random_state=888)
features = df_train.drop(columns=['Class']).columns
clf.fit(df_train[features], df_train['Class'])
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model using the original unbalanced data ...")
roc_auc_score(df_test['Class'], y_pred)

The AUC score for this model using the original unbalanced data ...
0.683956574185766

[Figure: ROC curve, TPR vs. FPR]
SKLearn Code

# We could use the library imbalanced-learn to randomly oversample.
ros = RandomOverSampler(random_state=888)
X_resampled, y_resampled = ros.fit_resample(df_train[features], df_train['Class'])
y_resampled.value_counts()

0    3313
1    3313
Name: Class, dtype: int64
SKLearn Code
# We can then apply Logistic Regression and calculate the AUC metric.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after Random Over Sampling ...")
roc_auc_score(df_test['Class'], y_pred)
The AUC score for this model after Random Over Sampling ...
0.838962605548854
SKLearn Code

# Random sampling is easy, but the new samples don't add more information.
# SMOTE improves on that.
# SMOTE oversamples the minority class by creating 'synthetic' examples.
# It uses nearest neighbors to generate plausible new examples.
print("Oversampling using SMOTE ...")
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=888)
X_resampled, y_resampled = smote.fit_resample(df_train[features], df_train['Class'])
y_resampled.value_counts()
SKLearn Code

# We'll apply logistic regression on the balanced dataset and calculate its AUC.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after SMOTE ...")
roc_auc_score(df_test['Class'], y_pred)
SKLearn Code

# Now we will use undersampling.
# In undersampling, we downsize the majority class to balance it with the minority class.
# Simple random undersampling
# We'll begin with simple random undersampling.
rus = RandomUnderSampler(random_state=888)
X_resampled, y_resampled = rus.fit_resample(df_train[features], df_train['Class'])
y_resampled.value_counts()

0    26
1    26
Name: Class, dtype: int64
SKLearn Code

# And this produces the same AUC as pandas undersampling, since we use the same random seed.
clf = LogisticRegression(random_state=888)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict_proba(df_test[features])[:, 1]
print("The AUC score for this model after Under Sampling ...")
roc_auc_score(df_test['Class'], y_pred)
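The remaining imports at the top (ClusterCentroids, TomekLinks, SMOTETomek) are not exercised on these slides; a minimal sketch of how they plug into the same fit_resample pattern (the seed choice is an assumption):

# undersample by replacing majority-class clusters with their centroids
cc = ClusterCentroids(random_state=888)
X_cc, y_cc = cc.fit_resample(df_train[features], df_train['Class'])

# undersample by removing Tomek links (deterministic, no random seed needed)
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(df_train[features], df_train['Class'])

# combine SMOTE oversampling with Tomek-link cleaning
st = SMOTETomek(random_state=888)
X_st, y_st = st.fit_resample(df_train[features], df_train['Class'])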