SMOTE for Imbalanced Classification with Python
Last Updated: 13 Nov, 2025
When a dataset has more samples of one class and very few of another, the model tends to predict the majority class more often. This problem is called class imbalance. The Synthetic Minority Over-sampling Technique (SMOTE) helps fix this issue by creating new synthetic samples for the smaller (minority) class. This makes the dataset more balanced and helps the model learn both classes properly.
- SMOTE creates new synthetic data instead of copying existing samples.
- It improves accuracy for the minority class.
- Variants like ADASYN, Borderline SMOTE, SMOTE-ENN and SMOTE-TOMEK make SMOTE even more effective.
- It can be easily used with the Python library imbalanced-learn (imblearn).
Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE is a data-level resampling technique that generates synthetic (artificial) samples for the minority class. Instead of simply duplicating existing examples, it creates new data points by interpolating between existing ones. This approach allows the model to learn broader patterns and reduces the risk of overfitting to repeated samples.
Working:
- Identify the Minority Class: The process begins by detecting which class (or classes) have significantly fewer samples compared to others.
- Find Nearest Neighbors: For each sample in the minority class, SMOTE locates its k nearest neighbours (based on distance in the feature space). The value of k is a user-defined parameter that controls how many neighbours are considered.
- Generate Synthetic Samples: A random neighbour is chosen from these k nearest points and a new synthetic instance is created along the line segment connecting the original sample and the chosen neighbour. This ensures the new points are realistic yet distinct (a small sketch of this step follows these steps).
- Control Oversampling Amount: The number of synthetic samples to be generated is determined by an oversampling ratio, which is chosen so that both classes reach a similar size or desired balance.
- Repeat for All Minority Samples: Steps 2 to 4 are repeated for all minority class examples to produce enough synthetic data for balancing.
- Form the Final Balanced Dataset: After generating these synthetic examples, the dataset becomes more balanced, helping machine learning models train more effectively and fairly across all classes.
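The core of step 3 is a simple linear interpolation: x_new = x + λ · (x_neighbour − x) with a random λ between 0 and 1. Below is a minimal from-scratch sketch of that idea using NumPy and scikit-learn's NearestNeighbors; the function name smote_sketch, the toy array X_min and the parameter n_new are illustrative assumptions, not part of the imblearn API.
Python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, random_state=42):
    # Illustrative only: generate n_new synthetic minority samples by interpolation
    rng = np.random.default_rng(random_state)
    # Fit k+1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample at random
        j = rng.choice(neigh_idx[i][1:])    # pick one of its k minority neighbours
        lam = rng.random()                  # random position on the connecting segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Tiny 2-D minority cluster: generate 10 synthetic points
X_min = np.array([[1.0, 2.0], [1.2, 1.9], [0.8, 2.2], [1.1, 2.1], [0.9, 1.8], [1.3, 2.3]])
print(smote_sketch(X_min, n_new=10, k=3).shape)   # (10, 2)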
Implementation
Python
import matplotlib.pyplot as plt
import pandas as pd
from imblearn.over_sampling import SMOTE

# Load the dataset and separate features from the target column
data = pd.read_csv('diabetes.csv')
X = data.drop("Outcome", axis=1)
y = data["Outcome"]

# Visualise the class distribution before resampling
count_class = y.value_counts()
plt.bar(count_class.index, count_class.values)
plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Class Distribution (Before SMOTE)')
plt.xticks(count_class.index, ['Class 0', 'Class 1'])
plt.show()

# Oversample the minority class so both classes reach the same size
smote = SMOTE(sampling_strategy='minority')
X_res, y_res = smote.fit_resample(X, y)
print("After SMOTE:\n", y_res.value_counts())
Output:
The dataset used can be downloaded from here.
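One practical caution that the snippet above does not show: resampling is normally applied only to the training split, so the test set stays free of synthetic points. A hedged sketch of that workflow, reusing X and y from the code above, could look like this:
Python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first, then oversample only the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Train on the resampled data; evaluate on the original, untouched test set
print(y_train_res.value_counts())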
Variants of SMOTE
SMOTE effectively addresses data imbalance by generating synthetic samples, enriching the minority class and refining decision boundaries. Despite its benefits, SMOTE's computational demands can escalate with larger datasets and high-dimensional feature spaces. To enhance SMOTE's capability to handle various data scenarios, several extensions have been developed:
1. ADASYN (Adaptive Synthetic Sampling)
ADASYN stands for Adaptive Synthetic Sampling. It is an improved version of SMOTE that automatically focuses more on minority samples that are difficult to learn. Instead of generating synthetic samples uniformly, ADASYN creates more new samples for minority points that are near the decision boundary where the model usually makes more mistakes.
Working:
- Calculate the level of difficulty for each minority sample; samples surrounded mostly by majority-class neighbours are considered harder to learn (a sketch of this calculation follows this list).
- Assign higher weights to difficult samples so that more synthetic examples are created around them.
- Generate synthetic samples by interpolating between each difficult sample and its nearest minority neighbors.
- The final dataset has more new samples near the boundary, improving the model’s ability to classify challenging regions.
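To make the difficulty weighting concrete, here is a minimal sketch of how a per-sample ratio could be computed: the fraction of majority-class points among each minority sample's k nearest neighbours, normalised so the ratios decide how many synthetic points each sample receives. The function name adasyn_difficulty and the toy arrays are illustrative assumptions, not the imblearn implementation.
Python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_difficulty(X, y, minority_label, k=5):
    # Fraction of majority-class points among each minority sample's k neighbours
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # neighbours over the whole dataset
    _, idx = nn.kneighbors(X_min)
    # Drop the sample itself (column 0) and count majority-class neighbours
    ratios = np.array([(y[row[1:]] != minority_label).mean() for row in idx])
    # Normalised ratios decide how many synthetic points each sample receives
    return ratios / ratios.sum() if ratios.sum() > 0 else ratios

# Toy example: class 1 is the minority
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 0, 1, 1])
print(adasyn_difficulty(X_toy, y_toy, minority_label=1, k=3))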
Implementation:
Python
from imblearn.over_sampling import ADASYN

# Adaptively oversample the minority class, reusing X and y loaded earlier
adasyn = ADASYN(sampling_strategy='minority')
X_res, y_res = adasyn.fit_resample(X, y)
print(y_res.value_counts())
Output:
2. Borderline SMOTE
Borderline SMOTE is a modified version of SMOTE that focuses only on minority samples that lie near the boundary between classes. These are the samples most likely to be misclassified, so generating synthetic samples around them helps strengthen the classifier’s performance near decision boundaries.
Working:
- Identify minority samples whose nearest neighbors are mostly from the majority class; these are called borderline samples (a sketch of this check follows this list).
- Generate synthetic samples only around these borderline points, avoiding areas deep inside the majority class.
- This keeps the generated data clean and helps the model learn class boundaries more accurately.
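As a rough illustration of the first step, a minority sample can be flagged as borderline when at least half (but not all) of its neighbours belong to the majority class. The function name find_borderline and the arrays X and y are assumptions for illustration, not the imblearn internals.
Python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_borderline(X, y, minority_label, k=5):
    # Indices of minority samples whose neighbourhood is dominated, but not fully
    # occupied, by the majority class (the "danger" set of Borderline SMOTE)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    borderline = []
    for i in np.where(y == minority_label)[0]:
        maj = (y[idx[i][1:]] != minority_label).sum()
        if k / 2 <= maj < k:          # half or more majority neighbours, but not all
            borderline.append(i)
    return np.array(borderline)

# Toy example: class 1 is the minority
X_toy = np.array([[0, 0], [0, 2], [3, 0], [1.0, 1.0], [1.5, 1.5], [5, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(find_borderline(X_toy, y_toy, minority_label=1, k=3))   # the two minority points near the class-0 cluster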
Implementation:
Python
from imblearn.over_sampling import BorderlineSMOTE

# 'borderline-1' interpolates only between borderline minority samples and their minority neighbours
blsmote = BorderlineSMOTE(sampling_strategy='minority', kind='borderline-1')
X_res, y_res = blsmote.fit_resample(X, y)
print(y_res.value_counts())
Output:
3. SMOTE-ENN (Edited Nearest Neighbors)
SMOTE-ENN combines two techniques, i.e., SMOTE for oversampling and Edited Nearest Neighbors (ENN) for cleaning. First, SMOTE generates synthetic data to balance the dataset. Then, ENN removes noisy or misclassified points from both classes to make the dataset cleaner and more reliable.
Working:
- Apply SMOTE to oversample the minority class.
- For each sample, look at its nearest neighbors.
- If most of its neighbors belong to a different class, remove that sample because it is likely noise (a sketch of this rule follows this list).
- The result is a balanced and denoised dataset that helps improve model performance and generalization.
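A minimal sketch of the ENN cleaning rule, the step applied after SMOTE, might look like the following; the function name enn_clean and the arrays X and y are assumptions for illustration, not the imblearn internals.
Python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_clean(X, y, k=3):
    # Drop samples whose label disagrees with the majority of their k nearest neighbours
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    keep = []
    for i in range(len(X)):
        neigh_labels = y[idx[i][1:]]                  # the sample's k neighbours (excluding itself)
        if (neigh_labels == y[i]).sum() >= k / 2:     # keep only if its label agrees with the majority
            keep.append(i)
    return X[keep], y[keep]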
Implementation:
Python
from imblearn.combine import SMOTEENN

# Oversample with SMOTE, then clean noisy points with ENN (reusing X and y from earlier)
smote_enn = SMOTEENN()
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
print(y_resampled.value_counts())
Output:
4. SMOTE-TOMEK (Hybrid Method)
SMOTE-TOMEK is a hybrid resampling technique that combines SMOTE and Tomek Links. After oversampling with SMOTE, Tomek Links are identified and removed to eliminate overlapping or borderline points between classes.
Working:
- SMOTE first oversamples the minority class by creating synthetic examples.
- Find Tomek links: pairs of samples from opposite classes that are each other's nearest neighbors (a sketch of this check follows this list).
- Remove those pairs, as they often lie in overlapping regions that confuse the model.
- The final dataset becomes balanced and cleaner, improving the separation between classes.
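The Tomek-link step can be sketched as follows: two samples from opposite classes form a link if each is the other's single nearest neighbour. The function name find_tomek_links and the arrays X and y are illustrative assumptions, not the imblearn code.
Python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_tomek_links(X, y):
    # Index pairs (i, j) of opposite-class samples that are mutual nearest neighbours
    nn = NearestNeighbors(n_neighbors=2).fit(X)         # neighbour 0 is the point itself
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]
    links = []
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i and i < j:  # mutual neighbours, opposite classes
            links.append((i, int(j)))
    return links

# Toy example: the class-0 and class-1 points closest to each other form the only link
X_toy = np.array([[0, 0], [0, 0.8], [1, 1], [1.2, 1.2], [5, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(find_tomek_links(X_toy, y_toy))   # [(2, 3)]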
Implementation:
Python
from imblearn.combine import SMOTETomek

# Oversample with SMOTE, then drop Tomek links (reusing X and y from earlier)
smt = SMOTETomek(sampling_strategy='auto')
X_resampled, y_resampled = smt.fit_resample(X, y)
print(y_resampled.value_counts())
Output:
5. SMOTE-NC (Nominal Continuous)
SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) is a version of SMOTE designed for datasets that contain both numerical and categorical variables. Traditional SMOTE works by interpolating between numeric features, but it fails when applied directly to categorical data because we can’t interpolate between category labels. SMOTE-NC fixes this by treating categorical and continuous features differently during the generation of synthetic samples.
Working:
- Identify which features in the dataset are categorical and which are continuous.
- For continuous features, generate synthetic samples using interpolation just like standard SMOTE.
- For categorical features, assign the most frequent category among the nearest neighbors instead of interpolating (sketched after the note below).
- Combine the newly generated synthetic samples to create a balanced dataset that respects both numeric and categorical data integrity.
Note: If our dataset doesn’t include categorical features (like the diabetes dataset), use standard SMOTE instead of SMOTE-NC.
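To make the categorical step concrete: in a synthetic sample, numeric columns are interpolated while each categorical column takes the most common value among the chosen neighbours. Below is a hedged sketch of that single step with made-up arrays (sample, neighbour, neighbour_cats and categorical_idx are illustrative names, not the imblearn internals).
Python
import numpy as np

# One minority sample, its chosen neighbour and the categories seen among its k neighbours
sample = np.array([2.5, 'red', 10.0, 'yes'], dtype=object)
neighbour = np.array([3.5, 'blue', 14.0, 'yes'], dtype=object)
neighbour_cats = np.array([['red', 'yes'], ['blue', 'yes'], ['red', 'no']])
categorical_idx = [1, 3]

lam = np.random.default_rng(0).random()
new_sample = sample.copy()
for col in range(len(sample)):
    if col in categorical_idx:
        # Categorical column: take the most frequent category among the neighbours
        pos = categorical_idx.index(col)
        values, counts = np.unique(neighbour_cats[:, pos], return_counts=True)
        new_sample[col] = values[counts.argmax()]
    else:
        # Numeric column: interpolate between the sample and its neighbour
        new_sample[col] = sample[col] + lam * (neighbour[col] - sample[col])

print(new_sample)   # numeric columns interpolated, categorical columns set to the mode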
Implementation:
Python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC
from collections import Counter
# Synthetic dataset: class 0 is the minority (weights=[0.1, 0.9])
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           n_features=5, n_clusters_per_class=1, n_samples=100,
                           random_state=42)

# Treat columns 0 and 3 as categorical for demonstration purposes
categorical_features = [0, 3]
print("Before SMOTE-NC:", Counter(y))
smote_nc = SMOTENC(categorical_features=categorical_features, random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
print("After SMOTE-NC:", Counter(y_res))
Output:
When to Use Each SMOTE Variant
Let's discuss when to use each variant:
| SMOTE Variant | Best Use Case | Main Strength | When to Use / Key Notes |
|---|---|---|---|
| SMOTE (Standard) | Datasets with continuous numeric features and moderate imbalance. | Balances classes by generating synthetic samples through interpolation. | Use when the dataset is numeric and not too noisy or overlapping. Works well as a general solution. |
| ADASYN (Adaptive SMOTE) | Datasets where imbalance severity differs across regions. | Focuses more on harder-to-learn (borderline) samples by generating adaptive synthetic data. | Use when some areas of the minority class are harder to classify, as it gives better boundary learning. |
| Borderline SMOTE | Minority samples close to class boundaries. | Generates samples only near decision boundaries where misclassification is likely. | Use when the data shows overlap between classes or frequent boundary confusion. |
| SMOTE-ENN (Hybrid) | Noisy datasets containing misclassified or ambiguous samples. | Combines oversampling (SMOTE) and cleaning (ENN) to remove noisy instances. | Use when the dataset has noise or outliers and a cleaner, balanced dataset is wanted. |
| SMOTE-TOMEK (Hybrid) | Datasets with overlapping classes that need clearer separation. | Removes Tomek links after SMOTE to reduce class overlap and enhance class separation. | Use to improve boundary clarity after oversampling. |
| SMOTE-NC (Nominal Continuous) | Datasets with both categorical and continuous features. | Handles mixed feature types by combining interpolation for numeric data and mode assignment for categorical data. | Use when the dataset includes categorical columns; not suitable for purely numeric datasets. |