Data Preprocessing
Essential Steps for Preparing Data Before Modeling
Introduction
Data preprocessing is a crucial step in the machine learning and
statistical analysis pipeline.
It involves transforming raw data into a clean and usable format,
ensuring that the data is consistent, accurate, and relevant for the
analysis.
Here are the key reasons why data preprocessing is essential:
Improving Data Quality
Enhancing Model Performance
Improving Interpretability
Ensuring Consistency
Basic Steps in Data Preprocessing
Step 1: Import important libraries
Step 2: Import dataset
Step 3: Preprocessing:
Find duplicates
Missing value treatment
Encoding
Handling data types
Outlier treatment
Feature scaling
Data balancing
Import Important Libraries
Purpose of Libraries
os: Functions to interact with the operating system.
Example Usage: os.listdir() lists files and directories in the specified
path.
numpy: Support for arrays, matrices, and mathematical functions.
pandas: Data manipulation and analysis.
matplotlib & seaborn: Data visualization.
warnings: Manage warning messages in code.
sns.set(): Automatically sets the seaborn plot aesthetics to a default
theme.
%matplotlib inline: A magic command used in Jupyter notebooks to
display matplotlib plots inline within the notebook.
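A minimal sketch of a typical import cell, assuming a Jupyter notebook environment:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')   # suppress warning messages
sns.set()                           # apply seaborn's default plot theme

# In a Jupyter notebook, also run the magic command on its own line:
# %matplotlib inline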
Importing Data
Data = pd.read_csv(r"C:\Desktop\DataScience\data.csv")
Data = pd.read_csv("C:/Desktop/DataScience/data.csv")
Data.head()
Data.tail()
Finding and Handling Duplicates
If there is any repetition of data in the dataset, then it must be removed for healthy analysis and prediction.
Handling Duplicates
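A minimal sketch of finding and removing duplicate rows with pandas, assuming the DataFrame is named Data as above:
# Count fully duplicated rows
print(Data.duplicated().sum())

# View the duplicated rows
print(Data[Data.duplicated()])

# Drop duplicates, keeping the first occurrence of each row
Data = Data.drop_duplicates()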
Handling Missing Values
Identifying missing values using `df.isnull().sum()`.
In percentage form: df.isnull().sum()/len(df)*100
Techniques to handle missing values:
If any variable has more than 25% missing values, drop it:
data = data.drop(['name of column'], axis=1)
- Else Imputing missing values
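A short sketch of these checks in pandas (df is a generic DataFrame name used for illustration):
# Count of missing values per column
print(df.isnull().sum())

# Missing values as a percentage of rows
print(df.isnull().sum() / len(df) * 100)

# Drop a column whose missing percentage exceeds 25%
df = df.drop(['name of column'], axis=1)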
Imputation Method
Various imputation approaches are:
Simple statistical imputation:
Mean: if there are no outliers.
Median: if the data has outliers.
Mode: if the variable is categorical.
KNN imputation: Replaces missing values based on the mean or median of the
nearest neighbors' values.
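A minimal sketch of these imputation options; 'num_col' and 'cat_col' are placeholder column names, and KNN imputation assumes scikit-learn is available:
# Simple statistical imputation with pandas (choose one depending on the column)
df['num_col'] = df['num_col'].fillna(df['num_col'].mean())     # mean: no outliers
# df['num_col'] = df['num_col'].fillna(df['num_col'].median()) # median: with outliers
# df['cat_col'] = df['cat_col'].fillna(df['cat_col'].mode()[0]) # mode: categorical

# KNN imputation (scikit-learn): fill missing values from the nearest neighbors
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = knn_imputer.fit_transform(df[num_cols])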
Encoding Categorical Variables
Encoding categorical variables is a critical step in data
preprocessing for machine learning models, as most models
require numerical input.
There are two approaches:
Label encoding and one-hot encoding.
Label Encoding
It converts each category in a categorical variable to a unique
integer.
When to Use: When the categorical variable has an ordinal
relationship (e.g., low, medium, high).
When there are a limited number of categories.
One-Hot Encoding
One-hot encoding converts categorical variables into a series of
binary columns, each representing a unique category.
When to Use
When the categorical variable is nominal (no intrinsic order).
When you want to avoid introducing ordinal relationships.
pd.get_dummies() is a function in the pandas library in Python used
for one-hot encoding categorical variables.
In one-hot encoding, we usually drop one of the dummy columns.
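A brief sketch of both approaches on hypothetical columns, 'size' (ordinal) and 'color' (nominal):
# Label encoding for an ordinal variable
# (for a true ordinal order, define the category order explicitly; cat.codes alone is alphabetical)
df['size'] = df['size'].astype('category').cat.codes

# One-hot encoding for a nominal variable, dropping one dummy column
df = pd.get_dummies(df, columns=['color'], drop_first=True)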
Handling Outliers
Identifying outliers:
Visualization-Based Detection:
Box plots
Histograms with normal distribution curve
➢ Statistical Methods:
➢ Z-Score (standard deviation approach)
➢ IQR (Interquartile Range):
➢ Values below Q1−1.5×IQR
➢ Values above Q3+1.5×IQR.
Approaches to Handle Outliers
Capping method:
Using IQR or Z score
Transformation Approach:
Log Transformation
Square root Transformation
Box-Cox Transformation
Winsorization Method
Finding Outliers Using the IQR Method
Handling Outliers Using Capping by IQR
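A minimal sketch of finding and capping outliers with the IQR rule; 'col' is a placeholder numeric column:
# IQR bounds
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# View the outliers
print(df[(df['col'] < lower) | (df['col'] > upper)])

# Capping: clip values to the IQR bounds
df['col'] = df['col'].clip(lower=lower, upper=upper)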
Distribution and box plot
Handling Outliers Using Capping by Z-Score:
Step 1: Find the mean and standard deviation.
Step 2: Find the minimum and maximum based on the normal distribution parameters (typically mean ± 3 × standard deviation) to identify outliers.
Step 3: Capping
➢ To view the outliers based on min or max
To Calculate Z-score
To view the outliers based on Z-score:
To visualize the outliers by distribution plot and boxplot:
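A minimal sketch of the Z-score approach; 'col' is a placeholder column and |z| > 3 is assumed as the usual cut-off:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: mean and standard deviation
mean = df['col'].mean()
std = df['col'].std()

# Step 2: min / max limits from the normal distribution (mean ± 3*std)
lower = mean - 3 * std
upper = mean + 3 * std

# Z-score for each value; outliers have |z| > 3
df['z_score'] = (df['col'] - mean) / std
print(df[np.abs(df['z_score']) > 3])

# Step 3: capping at the limits
df['col'] = df['col'].clip(lower=lower, upper=upper)

# Visualize the result with a distribution plot and a boxplot
sns.histplot(df['col'], kde=True); plt.show()
sns.boxplot(x=df['col']); plt.show()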
Log Transformation
Logarithmic transformation is often used to reduce the effect of
large outliers by compressing the range of data values.
Formula:
y=log(x)
Works best when all values are positive (since log of zero or negative
values is undefined).
Helps stabilize variance and normalize skewed distributions.
Square Root Transform
Square root transformation reduces the magnitude of
large values but less aggressively than logarithmic
transformation.
It works with zero values but not negatives.
Formula:
y = sqrt(x)
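A short sketch applying both transformations with NumPy; 'col' is a placeholder column, and log1p is shown as a variant that tolerates zeros:
import numpy as np

# Log transformation: compresses large values; requires positive data
df['col_log'] = np.log(df['col'])       # strictly positive values
df['col_log1p'] = np.log1p(df['col'])   # log(1 + x), tolerates zeros

# Square root transformation: milder compression; works with zeros
df['col_sqrt'] = np.sqrt(df['col'])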
Box-Cox Transformation
The Box-Cox transformation is a family of power transformations that
stabilize variance, make the data more normal-like, and reduce the
impact of outliers.
Unlike logarithmic or square root transformations, the Box-Cox method includes an adjustable parameter (λ) that determines the transformation applied.
When to Use
When data is not normally distributed and transformations like log or
square root are insufficient.
To stabilize variance across different scales of data.
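A minimal sketch using scipy.stats.boxcox, which estimates λ automatically; note that Box-Cox requires strictly positive values, and 'col' is a placeholder column:
from scipy import stats

# Box-Cox transform; lam is the fitted λ parameter
df['col_boxcox'], lam = stats.boxcox(df['col'])
print("Fitted lambda:", lam)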
Winsorization:
Winsorization replaces extreme values with specified percentiles
to limit the influence of outliers while retaining the dataset's size.
Steps:
Define lower and upper limits (e.g., 5th and 95th percentiles).
Replace values below the lower limit with the 5th percentile and
above the upper limit with the 95th percentile.
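A minimal sketch clipping a placeholder column at its 5th and 95th percentiles:
# Lower and upper limits
low = df['col'].quantile(0.05)
high = df['col'].quantile(0.95)

# Replace values outside the limits with the limits themselves
df['col_winsor'] = df['col'].clip(lower=low, upper=high)
# (scipy.stats.mstats.winsorize(df['col'], limits=[0.05, 0.05]) expresses the same idea)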
Feature Scaling
Importance of scaling features.
Methods: Standard Scaler, Min-Max Scaler, Normalizer.
We do not apply feature scaling to the dependent variable, so first separate the data into independent and dependent variables.
Standardization:
Scaling features to have zero mean and unit variance. This is
particularly important for algorithms that rely on distance measures,
such as SVM and k-NN.
When to Use Standard Scaler
The Standard Scaler standardizes features by removing the mean and
scaling to unit variance.
Where "Remove the mean" means that for each feature in your dataset,
you subtract the mean (average) value of that feature from all the
values of that feature.
This process centers the feature around zero, ensuring that the
transformed feature has a mean of zero.
This means each feature will have a mean of 0 and a standard deviation of 1.
This ensures that each feature contributes equally to the model.
When your dataset contains features with different units (e.g.,
age in years, income in dollars, and height in centimeters),
StandardScaler helps to bring all features to the same scale.
You are using algorithms that assume normal distribution of
features.
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

X_scaled = [[-1.22474487 -1.22474487 -1.22474487]
            [ 0.          0.          0.        ]
            [ 1.22474487  1.22474487  1.22474487]]
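A short sketch that reproduces the output above with scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)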
When to Use Standard Scaler:
➢ Algorithms that Assume Normal Distribution:
1. Linear Regression: Assumes that the relationship between the input and output is linear.
2. Logistic Regression: Assumes a linear relationship between the input features and the log-odds of the target.
3. Linear Discriminant Analysis (LDA): Assumes data is normally distributed within each class.
4. Support Vector Machines (SVM): Assumes features have similar scales for optimal
performance.
5. Principal Component Analysis (PCA): Assumes the data is centered around the origin for
variance maximization.
6. K-Means Clustering: Assumes features are on similar scales for effective distance
calculation.
Normalizer:
Normalizer is suitable when the goal is to scale individual
samples to have unit norm (length).
This technique is useful when the direction of the data
points is more important than the magnitude of their
distance from the origin.
Normalizing the data to unit norm ensures that the focus is on the direction of each
sample.
Sample data: X = [[3, 4], [1, 2], [4, 5]]
Normalizing means scaling each sample (row) so that its Euclidean length equals 1; for example, [3, 4] has length 5, so it becomes [0.6, 0.8].
Normalized data:
X = [[0.6        0.8       ]
     [0.4472136  0.89442719]
     [0.62469505 0.78086881]]
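A short sketch that reproduces these values with scikit-learn's Normalizer (L2 norm by default):
from sklearn.preprocessing import Normalizer

X = [[3, 4], [1, 2], [4, 5]]
normalizer = Normalizer()            # norm='l2' by default
X_normalized = normalizer.fit_transform(X)
print(X_normalized)                  # each row now has Euclidean length 1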
When to Use Normalizer:
1. Feature Comparison (Cosine Similarity): When you want to measure the cosine similarity between samples. By normalizing, you ensure that the angle between vectors becomes the metric rather than their magnitude.
2. Clustering: When using clustering algorithms like K-Means, normalized data can improve the convergence speed and cluster quality, especially if the data has different scales.
3. Nearest Neighbor: When you want to perform k-nearest neighbors (KNN) classification or regression, normalizing ensures that all features contribute equally to the distance calculations.
4. Sparse Data: When working with sparse data (data with a lot of zeros), normalizing can make algorithms like Support Vector Machines (SVM) and Principal Component Analysis (PCA) perform better by ensuring that features with more non-zero values don't dominate.
5. Text Data (TF-IDF): In Natural Language Processing (NLP), TF-IDF vectors are often normalized to have unit norm to account for the difference in document lengths and to focus on the relative importance of terms.
Concept of fit_transform and transform:
When to Use
fit_transform: Use this on your training data to compute
the necessary parameters(mean, standard deviation) and
apply the transformation in one step.
fit the preprocessing transformers on the training data to
learn the necessary parameters.
transform the training data using the fitted transformers.
transform: Use this on your test data (or any new data) to
apply the transformation using the parameters computed
from the training data.
For test data, we do not use 'fit'; we reuse the same parameters calculated from the training data.
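A minimal sketch of this pattern with StandardScaler, assuming X_train and X_test come from a train/test split:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# Scale the test data with the SAME training parameters (no fitting here)
X_test_scaled = scaler.transform(X_test)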
Converting data types using `pd.to_numeric`.
Common Data Types and Their Handling
Numerical Data:
Integers: Whole numbers.
Floats: Decimal numbers.
Handling: Ensure numerical columns are in the correct format and handle
missing values appropriately.
Categorical Data:
Nominal: Categories without a specific order (e.g., color: red,
green, blue).
Ordinal: Categories with a specific order (e.g., rating: low,
medium, high).
Handling: Encode categorical variables using techniques such as
label encoding or one-hot encoding.
Label encoding:
Label encoding is a technique used to convert ordinal categorical data into numerical format.
It assigns a unique integer to each category in the categorical variable.
This is particularly useful for machine learning algorithms that require numeric input.
dataset['Col_Name'] = dataset['Col_Name'].astype('category')
dataset['Col_Name'] = dataset['Col_Name'].cat.codes
Convert 'Col_Name' to categorical type using astype('category'), then use cat.codes to assign a numeric label to each category in 'Col_Name' and store the encoded values.
One-hot encoding:
One-hot encoding is a technique used to convert Nominal type
categorical data into a binary format, where each category is
represented as a binary vector.
It creates new binary columns (also known as dummy variables)
for each category, with a value of 1 indicating the presence of that
category and 0 indicating absence.
After OHE, drop one dummy column per variable. Here is the Python code:
dataset = pd.get_dummies(dataset, columns=['Col_Name'], drop_first=True)
Where:
pd.get_dummies() is a pandas function used for one-hot encoding categorical variables.
columns=['Col_Name'] specifies the column(s) in the DataFrame that you want to encode.
drop_first=True drops the first dummy column of each encoded variable, so one category is represented implicitly.
Datetime Data:
Dates and times.
Handling: Convert columns to datetime format (e.g., with pd.to_datetime()) so that components such as year, month, and day can be extracted.
Convert Data Types:
Example
pd.to_numeric(df['number'], errors='coerce', downcast='integer')
➢ pd.to_numeric: This function is used to convert argument to a numeric
type.
➢ df['number']: This specifies the column number from the DataFrame df
that we want to convert.
➢ errors='coerce': This argument tells the function how to handle errors during the conversion process. Specifically, 'coerce' means that any values that cannot be converted to a numeric type will be set to NaN (Not a Number).
➢ downcast='integer': This argument attempts to downcast the numeric
type to the smallest possible integer subtype, which helps in saving
memory.
➢ For example, if all values can be represented by a smaller integer type
(like int8), it will use that type instead of a larger one (like int64).
np.where(df['num_numerical'].isnull(), df['number'], np.nan)
1. np.where(condition, x, y):
•This function from the NumPy library is used for element-wise selection from two arrays (x and y) based on a
condition. It returns elements chosen from x or y depending on the condition.
•condition: This specifies the condition to be checked.
•x: The values to select where the condition is True.
•y: The values to select where the condition is False.
2. df['num_numerical'].isnull():
•This checks for NaN values in the num_numerical column of the DataFrame df. It returns a boolean Series where
True indicates the presence of a NaN value and False indicates the absence of a NaN value.
3. df['number']:
•This specifies the original number column from the DataFrame df, which contains the original values before any
numeric conversion.
4. np.nan:
•This specifies that NaN should be used where the condition is False.
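A small end-to-end sketch on a hypothetical DataFrame, showing how the two lines above work together to separate numeric and non-numeric entries:
import numpy as np
import pandas as pd

df = pd.DataFrame({'number': ['1', '2', 'abc', '4']})

# Convert to numeric; 'abc' cannot be converted and becomes NaN
df['num_numerical'] = pd.to_numeric(df['number'], errors='coerce', downcast='integer')

# Keep the original value only where the conversion failed
df['non_numeric'] = np.where(df['num_numerical'].isnull(), df['number'], np.nan)
print(df)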
Data Balancing
Importance of balanced datasets.
Data is said to be imbalanced if twice the minority class count is still less than the majority class count.
To check this, look at the counts of the dependent variable.
Two popular oversampling approaches to solve the problem:
SMOTE
RandomOverSampler
Simply duplicates random instances of the minority class to increase its
representation in the dataset.
This can be effective but may lead to overfitting as the same instances are
repeated multiple times.
SMOTE
(Synthetic Minority Oversampling Technique)
➢ Generates synthetic samples by interpolating between existing
minority class instances.
➢ This technique creates more diverse samples compared to simple
duplication, potentially reducing overfitting.
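A minimal sketch of both approaches, assuming the imbalanced-learn package (imblearn) is installed and that X and y are the independent variables and the dependent variable (a pandas Series):
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Check the class counts of the dependent variable
print(y.value_counts())

# Random oversampling: duplicate minority-class rows
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)

# SMOTE: create synthetic minority-class samples by interpolation
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

print(y_ros.value_counts())
print(y_smote.value_counts())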
Cross Tabs for Feature Relationships
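A minimal sketch of a cross tab between two hypothetical categorical columns using pd.crosstab:
# 'gender' and 'target' are placeholder column names
pd.crosstab(df['gender'], df['target'], normalize='index')   # row-wise proportions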
Exploratory Data Analysis
(EDA):
Definition: Exploratory Data Analysis (EDA) is a critical step in
the data analysis process. It involves examining and summarizing
the main characteristics of a dataset, often using visual methods.
Here are some key steps and techniques you can use during EDA.
Techniques:
Data visualization: Histograms, scatter plots, box plots.
Summary statistics: Mean, median, standard deviation.
Outlier detection: Identifying data points that deviate significantly from the
rest of the dataset
What we can do in EDA
Visualization
Using plots for making inferences in machine learning involves
visualizing data to understand its structure, relationships, and
patterns, which can guide feature selection, model choice, and
evaluation.
Visualization Techniques for EDA
Histograms
Box plots
Scatter plots
Pair plots
Correlation matrix and heatmaps
Bar plots
Count plots
Violin plots
Histograms
Purpose: Understand the distribution of individual variables.
Inference: Identify skewness, outliers, and the presence of multiple
modes in the data.
Box Plots
Purpose: Summarize the distribution of a dataset.
Inference: Detect outliers, understand the spread and symmetry of the
data.
Scatter Plots
Purpose: Visualize the relationship between two continuous
variables.
Inference: Identify correlations, clusters, and potential outliers
Pair Plots (Scatterplot Matrix)
Purpose: Visualize pairwise relationships
between multiple variables.
Inference: Detect relationships between pairs
of features, spot trends, clusters, and outliers.
Correlation Matrix and Heatmaps
Purpose: Show the correlation coefficients between variables.
Inference: Identify highly correlated features that might be
redundant.
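A short sketch of a correlation heatmap with seaborn, computed on the numeric columns of a DataFrame df:
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')   # annotate each cell with its coefficient
plt.show()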
Bar Plots
Purpose: Compare categorical data.
Inference: Understand the frequency distribution of categorical
features.
Density plots
Density plots display the probability density function of a continuous
variable. They are useful for visualizing the overall shape of the distribution
and comparing multiple distributions.
Distribution plot
A distribution plot is a visualization that combines aspects of a histogram and a
kernel density plot to show the distribution of a continuous variable.
It is useful for understanding the distribution of data points in a dataset and
identifying patterns such as skewness, kurtosis, and the presence of outliers.
sns.histplot is used instead of sns.distplot.
The parameter kde=True adds the KDE line to the histogram
➢ In recent versions of Seaborn (0.11.0 and later), sns.distplot has been
deprecated.
➢ Instead, you should use sns.histplot or sns.kdeplot for similar functionality.
Here's how to create a similar plot using sns.histplot
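For example, with a placeholder column 'col':
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['col'], kde=True)   # histogram with a KDE curve, replacing sns.distplot
plt.show()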
Count Plots
Purpose: Show the counts of observations in each categorical bin.
Inference: Detect the distribution of categorical features.
Pie Plot
A pie plot (or pie chart) is a circular statistical graphic that is divided into slices
to illustrate numerical proportions. Each slice represents a category's
proportion to the whole dataset. Pie charts are useful for showing the relative
sizes of parts to a whole, making it easy to compare the parts of a single
categorical variable.
Feature Elimination Method
If the number of features is too large to handle, it is a wise approach to remove insignificant features.
There are two popular approaches:
PCA :Principal Component Analysis
RFE: Recursive Feature Elimination
Recursive Feature Elimination (RFE)
Purpose:
RFE is a feature selection method that iteratively removes less
important features based on the model's performance, identifying
the most influential features for predicting the target variable.
➢ Mathematical Basis:
1. Feature Importance:
➢ RFE uses an estimator (e.g., Logistic Regression, Random
Forest, etc.) that provides feature importance, such as weights
or coefficients.
Steps:
➢ Train the model on the dataset.
➢ Rank features based on their importance scores.
➢ Remove the least important feature(s).
➢ Repeat until the desired number of features remains.
➢ Advantages:
➢ Identifies the most critical features for the model.
➢ Helps improve model performance by eliminating redundant or
irrelevant features.
➢ Works well with small to medium datasets.
➢ Disadvantages:
➢ Computationally expensive for large datasets.
➢ Performance depends on the chosen estimator.
Code Example:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# Logistic Regression as estimator
logitR = LogisticRegression()
# Apply RFE to select top 2 features
selector = RFE(estimator=logitR, n_features_to_select=2, step=1)
selector.fit(X_train, y_train)
# Selected features
print("Selected Features:", selector.support_)
Principal Component Analysis (PCA)
Purpose:
PCA is a dimensionality reduction technique that
transforms the dataset into a lower-dimensional space
while preserving as much variance as possible.
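A minimal sketch with scikit-learn, keeping two components and assuming X is the independent-variable matrix:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA assumes centered (and usually scaled) data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Share of variance preserved by each component
print(pca.explained_variance_ratio_)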
Key Differences