Data Preprocessing
Essential Steps for Preparing Data Before Modeling
Introduction
Data preprocessing is a crucial step in the machine learning and
statistical analysis pipeline.
It involves transforming raw data into a clean and usable format,
ensuring that the data is consistent, accurate, and relevant for the
analysis.
Here are the key reasons why data preprocessing is essential:
Improving Data Quality
Enhancing Model Performance
Improving Interpretability
Ensuring Consistency
Basic Steps in Data Preprocessing
Step 1: Import important libraries
Step 2: Import dataset
Step 3: Preprocessing:
Find duplicates
Missing value treatment
Encoding
Handling data types
Outlier treatment
Feature scaling
Data balancing
Import Important Libraries
Purpose of Libraries
os: Functions to interact with the operating system.
Example Usage: os.listdir() lists files and directories in the specified
path.
numpy: Support for arrays, matrices, and mathematical functions.
pandas: Data manipulation and analysis.
matplotlib & seaborn: Data visualization.
warnings: Manage warning messages in code.
sns.set(): Automatically sets the seaborn plot aesthetics to a default
theme.
%matplotlib inline: A magic command used in Jupyter notebooks to
display matplotlib plots inline within the notebook.
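A minimal sketch of a typical import cell, assuming a Jupyter notebook environment:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')   # suppress warning messages
sns.set()                           # apply seaborn's default plot theme

# In a Jupyter notebook, also run the magic command on its own line:
# %matplotlib inline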
Importing Data
Data = pd.read_csv(r"C:\Desktop\DataScience\data.csv")
Data = pd.read_csv("C:/Desktop/DataScience/data.csv")
Data.head()
Data.tail()
Finding and Handling Duplicates
If there is any repetition of data in the dataset, then it must be removed for healthy analysis and prediction.
Handling Duplicates
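A minimal sketch of finding and removing duplicate rows with pandas, assuming the DataFrame is named Data as above:
# Count fully duplicated rows
print(Data.duplicated().sum())

# View the duplicated rows
print(Data[Data.duplicated()])

# Drop duplicates, keeping the first occurrence of each row
Data = Data.drop_duplicates()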
Handling Missing Values
Identifying missing values using `df.isnull().sum()`.
In percentage form: df.isnull().sum()/len(df)*100
Techniques to handle missing values:
If any variable has more than 25% missing values, drop it:
data = data.drop(['name of column'], axis=1)
- Else Imputing missing values
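A short sketch of these checks in pandas (df is a generic DataFrame name used for illustration):
# Count of missing values per column
print(df.isnull().sum())

# Missing values as a percentage of rows
print(df.isnull().sum() / len(df) * 100)

# Drop a column whose missing percentage exceeds 25%
df = df.drop(['name of column'], axis=1)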
Imputation Method
Various imputation approaches are:
Simple statistical imputation:
Mean: if there are no outliers.
Median: if the data has outliers.
Mode: if the variable is categorical.
KNN imputation: Replaces missing values based on the mean or median of the
nearest neighbors' values.
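A minimal sketch of these imputation options; 'num_col' and 'cat_col' are placeholder column names, and KNN imputation assumes scikit-learn is available:
# Simple statistical imputation with pandas (choose one depending on the column)
df['num_col'] = df['num_col'].fillna(df['num_col'].mean())     # mean: no outliers
# df['num_col'] = df['num_col'].fillna(df['num_col'].median()) # median: with outliers
# df['cat_col'] = df['cat_col'].fillna(df['cat_col'].mode()[0]) # mode: categorical

# KNN imputation (scikit-learn): fill missing values from the nearest neighbors
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = knn_imputer.fit_transform(df[num_cols])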
Encoding Categorical Variables
Encoding categorical variables is a critical step in data
preprocessing for machine learning models, as most models
require numerical input.
There are two approaches:
Label encoding and one-hot encoding.
Label Encoding
It converts each category in a categorical variable to a unique
integer.
When to Use: When the categorical variable has an ordinal
relationship (e.g., low, medium, high).
When there are a limited number of categories.
One-Hot Encoding
One-hot encoding converts categorical variables into a series of
binary columns, each representing a unique category.
When to Use
When the categorical variable is nominal (no intrinsic order).
When you want to avoid introducing ordinal relationships.
pd.get_dummies() is a function in the pandas library in Python used
for one-hot encoding categorical variables.
In one-hot encoding, we usually drop one of the dummy columns.
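A brief sketch of both approaches on hypothetical columns, 'size' (ordinal) and 'color' (nominal):
# Label encoding for an ordinal variable
# (for a true ordinal order, define the category order explicitly; cat.codes alone is alphabetical)
df['size'] = df['size'].astype('category').cat.codes

# One-hot encoding for a nominal variable, dropping one dummy column
df = pd.get_dummies(df, columns=['color'], drop_first=True)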
Handling Outliers
Identifying outliers:
Visualization-Based Detection:
Box plots
Histograms with normal distribution curve
➢ Statistical Methods:
➢ Z-Score (standard deviation approach)
➢ IQR (Interquartile Range):
➢ Values below Q1−1.5×IQR
➢ Values above Q3+1.5×IQR.
Approaches to Handle Outliers
Capping method:
Using IQR or Z score
Transformation Approach:
Log Transformation
Square root Transformation
Box-Cox Transformation
Winsorization Method
Finding Outliers Using the IQR Method
Handling Outliers Using Capping by IQR
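A minimal sketch of finding and capping outliers with the IQR rule; 'col' is a placeholder numeric column:
# IQR bounds
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# View the outliers
print(df[(df['col'] < lower) | (df['col'] > upper)])

# Capping: clip values to the IQR bounds
df['col'] = df['col'].clip(lower=lower, upper=upper)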
Distribution and box plot
Handling Outliers Using Capping by Z-Score:
Step 1: Find the mean and standard deviation.
Step 2: Find the minimum and maximum based on the normal distribution parameters (typically mean ± 3 × standard deviation) to identify outliers.
Step 3: Capping
➢ To view the outliers based on min or max
To Calculate Z-score
To view the outliers based on Z-score:
To visualize the outliers by distribution plot and boxplot:
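A minimal sketch of the Z-score approach; 'col' is a placeholder column and |z| > 3 is assumed as the usual cut-off:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: mean and standard deviation
mean = df['col'].mean()
std = df['col'].std()

# Step 2: min / max limits from the normal distribution (mean ± 3*std)
lower = mean - 3 * std
upper = mean + 3 * std

# Z-score for each value; outliers have |z| > 3
df['z_score'] = (df['col'] - mean) / std
print(df[np.abs(df['z_score']) > 3])

# Step 3: capping at the limits
df['col'] = df['col'].clip(lower=lower, upper=upper)

# Visualize the result with a distribution plot and a boxplot
sns.histplot(df['col'], kde=True); plt.show()
sns.boxplot(x=df['col']); plt.show()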
Log Transformation
Logarithmic transformation is often used to reduce the effect of
large outliers by compressing the range of data values.
Formula:
y=log(x)
Works best when all values are positive (since log of zero or negative
values is undefined).
Helps stabilize variance and normalize skewed distributions.
Square Root Transform
Square root transformation reduces the magnitude of
large values but less aggressively than logarithmic
transformation.
It works with zero values but not negatives.
Formula:
y = sqrt(x)
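A short sketch applying both transformations with NumPy; 'col' is a placeholder column, and log1p is shown as a variant that tolerates zeros:
import numpy as np

# Log transformation: compresses large values; requires positive data
df['col_log'] = np.log(df['col'])       # strictly positive values
df['col_log1p'] = np.log1p(df['col'])   # log(1 + x), tolerates zeros

# Square root transformation: milder compression; works with zeros
df['col_sqrt'] = np.sqrt(df['col'])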
Box-Cox Transformation
The Box-Cox transformation is a family of power transformations that
stabilize variance, make the data more normal-like, and reduce the
impact of outliers.
Unlike logarithmic or square root transformations, the Box-Cox method includes an adjustable parameter (λ) that determines the transformation applied.
When to Use
When data is not normally distributed and transformations like log or
square root are insufficient.
To stabilize variance across different scales of data.
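A minimal sketch using scipy.stats.boxcox, which estimates λ automatically; note that Box-Cox requires strictly positive values, and 'col' is a placeholder column:
from scipy import stats

# Box-Cox transform; lam is the fitted λ parameter
df['col_boxcox'], lam = stats.boxcox(df['col'])
print("Fitted lambda:", lam)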
Winsorization:
Winsorization replaces extreme values with specified percentiles
to limit the influence of outliers while retaining the dataset's size.
Steps:
Define lower and upper limits (e.g., 5th and 95th percentiles).
Replace values below the lower limit with the 5th percentile and
above the upper limit with the 95th percentile.
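A minimal sketch clipping a placeholder column at its 5th and 95th percentiles:
# Lower and upper limits
low = df['col'].quantile(0.05)
high = df['col'].quantile(0.95)

# Replace values outside the limits with the limits themselves
df['col_winsor'] = df['col'].clip(lower=low, upper=high)
# (scipy.stats.mstats.winsorize(df['col'], limits=[0.05, 0.05]) expresses the same idea)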
Feature Scaling
Importance of scaling features.
Methods: Standard Scaler, Min-Max Scaler, Normalizer.
We do not apply feature scaling to the dependent variable, so first separate the data into independent and dependent variables.
Standardization:
Scaling features to have zero mean and unit variance. This is
particularly important for algorithms that rely on distance measures,
such as SVM and k-NN.
When to Use Standard Scaler
The Standard Scaler standardizes features by removing the mean and
scaling to unit variance.
Where "Remove the mean" means that for each feature in your dataset,
you subtract the mean (average) value of that feature from all the
values of that feature.
This process centers the feature around zero, ensuring that the
transformed feature has a mean of zero.
This means each feature will have a mean of 0 and a standard deviation of 1.
This ensures that each feature contributes equally to the model.
When your dataset contains features with different units (e.g.,
age in years, income in dollars, and height in centimeters),
StandardScaler helps to bring all features to the same scale.
You are using algorithms that assume normal distribution of
features.
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

X_scaled = [[-1.22474487 -1.22474487 -1.22474487]
            [ 0.          0.          0.        ]
            [ 1.22474487  1.22474487  1.22474487]]
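A short sketch that reproduces the output above with scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)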
When to Use Standard Scaler:
➢ Algorithms that Assume Normal Distribution:
1. Linear Regression: Assumes that the relationship between the input and output is linear.
2. Logistic Regression: Assumes a linear relationship between the input features and the log-odds of the target.
3. Linear Discriminant Analysis (LDA): Assumes data is normally distributed within each class.
4. Support Vector Machines (SVM): Assumes features have similar scales for optimal
performance.
5. Principal Component Analysis (PCA): Assumes the data is centered around the origin for
variance maximization.
6. K-Means Clustering: Assumes features are on similar scales for effective distance
calculation.
Normalizer:
Normalizer is suitable when the goal is to scale individual
samples to have unit norm (length).
This technique is useful when the direction of the data
points is more important than the magnitude of their
distance from the origin.
Normalizing the data to unit norm ensures that the focus is on the direction of each
sample.
Sample data: X = [[3, 4], [1, 2], [4, 5]]
Normalizing means scaling each sample (row) so that its Euclidean length equals 1; for example, [3, 4] has length 5, so it becomes [0.6, 0.8].
Normalized data:
X = [[0.6        0.8       ]
     [0.4472136  0.89442719]
     [0.62469505 0.78086881]]
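A short sketch that reproduces these values with scikit-learn's Normalizer (L2 norm by default):
from sklearn.preprocessing import Normalizer

X = [[3, 4], [1, 2], [4, 5]]
normalizer = Normalizer()            # norm='l2' by default
X_normalized = normalizer.fit_transform(X)
print(X_normalized)                  # each row now has Euclidean length 1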
When to Use Normalizer:
1. Feature Comparison (Cosine Similarity): When you want to measure the cosine similarity between samples. By normalizing, you ensure that the angle between vectors becomes the metric rather than their magnitude.
2. Clustering: When using clustering algorithms like K-Means, normalized data can improve the convergence speed and cluster quality, especially if the data has different scales.
3. Nearest Neighbor: When you want to perform k-nearest neighbors (KNN) classification or regression, normalizing ensures that all features contribute equally to the distance calculations.
4. Sparse Data: When working with sparse data (data with a lot of zeros), normalizing can make algorithms like Support Vector Machines (SVM) and Principal Component Analysis (PCA) perform better by ensuring that features with more non-zero values don't dominate.
5. Text Data (TF-IDF): In Natural Language Processing (NLP), TF-IDF vectors are often normalized to have unit norm to account for the difference in document lengths and to focus on the relative importance of terms.
Concept of fit_transform and transform:
When to Use
fit_transform: Use this on your training data to compute
the necessary parameters(mean, standard deviation) and
apply the transformation in one step.
fit the preprocessing transformers on the training data to
learn the necessary parameters.
transform the training data using the fitted transformers.
transform: Use this on your test data (or any new data) to
apply the transformation using the parameters computed
from the training data.
For test data, we do not use 'fit'; we reuse the same parameters calculated from the training data.
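A minimal sketch of this pattern with StandardScaler, assuming X_train and X_test come from a train/test split:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# Scale the test data with the SAME training parameters (no fitting here)
X_test_scaled = scaler.transform(X_test)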
Converting data types using `pd.to_numeric`.
Common Data Types and Their Handling
Numerical Data:
Integers: Whole numbers.
Floats: Decimal numbers.
Handling: Ensure numerical columns are in the correct format and handle
missing values appropriately.
Categorical Data:
Nominal: Categories without a specific order (e.g., color: red,
green, blue).
Ordinal: Categories with a specific order (e.g., rating: low,
medium, high).
Handling: Encode categorical variables using techniques such as
label encoding or one-hot encoding.
Label encoding:
Label encoding is a technique used to convert ordinal categorical data into numerical format.
It assigns a unique integer to each category in the categorical variable.
This is particularly useful for machine learning algorithms that require numeric input.
dataset['Col_Name'] = dataset['Col_Name'].astype('category')
dataset['Col_Name'] = dataset['Col_Name'].cat.codes
Convert 'Col_Name' to categorical type using astype('category'), then use cat.codes to assign a numeric label to each category in 'Col_Name' and store the encoded values.
One-hot encoding:
One-hot encoding is a technique used to convert Nominal type
categorical data into a binary format, where each category is
represented as a binary vector.
It creates new binary columns (also known as dummy variables)
for each category, with a value of 1 indicating the presence of that
category and 0 indicating absence.
After OHE, drop one dummy column per variable. Here is the Python code:
dataset = pd.get_dummies(dataset, columns=['Col_Name'], drop_first=True)
Where:
pd.get_dummies() is a pandas function used for one-hot encoding categorical variables.
columns=['Col_Name'] specifies the column(s) in the DataFrame that you want to encode.
drop_first=True drops the first dummy column of each encoded variable, so one category is represented implicitly.
Datetime Data:
Dates and times.
Handling: Convert columns to datetime format (e.g., with pd.to_datetime()) so that components such as year, month, and day can be extracted.
Convert Data Types:
Example
pd.to_numeric(df['number'], errors='coerce', downcast='integer')
➢ pd.to_numeric: This function is used to convert argument to a numeric
type.
➢ df['number']: This specifies the column number from the DataFrame df
that we want to convert.
➢ errors='coerce': This argument tells the function how to handle errors during the conversion process. Specifically, 'coerce' means that any values that cannot be converted to a numeric type will be set to NaN (Not a Number).
➢ downcast='integer': This argument attempts to downcast the numeric
type to the smallest possible integer subtype, which helps in saving
memory.
➢ For example, if all values can be represented by a smaller integer type
(like int8), it will use that type instead of a larger one (like int64).
np.where(df['num_numerical'].isnull(), df['number'], np.nan)
1. np.where(condition, x, y):
•This function from the NumPy library is used for element-wise selection from two arrays (x and y) based on a
condition. It returns elements chosen from x or y depending on the condition.
•condition: This specifies the condition to be checked.
•x: The values to select where the condition is True.
•y: The values to select where the condition is False.
2. df['num_numerical'].isnull():
•This checks for NaN values in the num_numerical column of the DataFrame df. It returns a boolean Series where
True indicates the presence of a NaN value and False indicates the absence of a NaN value.
3. df['number']:
•This specifies the original number column from the DataFrame df, which contains the original values before any
numeric conversion.
4. np.nan:
•This specifies that NaN should be used where the condition is False.
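A small end-to-end sketch on a hypothetical DataFrame, showing how the two lines above work together to separate numeric and non-numeric entries:
import numpy as np
import pandas as pd

df = pd.DataFrame({'number': ['1', '2', 'abc', '4']})

# Convert to numeric; 'abc' cannot be converted and becomes NaN
df['num_numerical'] = pd.to_numeric(df['number'], errors='coerce', downcast='integer')

# Keep the original value only where the conversion failed
df['non_numeric'] = np.where(df['num_numerical'].isnull(), df['number'], np.nan)
print(df)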
Data Balancing
Importance of balanced datasets.
Data is said to be imbalanced if twice the minority class count is still less than the majority class count.
To check this, look at the counts of the dependent variable.
Two popular oversampling approaches to solve the problem:
SMOTE
RandomOverSampler
Simply duplicates random instances of the minority class to increase its
representation in the dataset.
This can be effective but may lead to overfitting as the same instances are
repeated multiple times.
SMOTE
(Synthetic Minority Oversampling Technique)
➢ Generates synthetic samples by interpolating between existing
minority class instances.
➢ This technique creates more diverse samples compared to simple
duplication, potentially reducing overfitting.
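A minimal sketch of both approaches, assuming the imbalanced-learn package (imblearn) is installed and that X and y are the independent variables and the dependent variable (a pandas Series):
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Check the class counts of the dependent variable
print(y.value_counts())

# Random oversampling: duplicate minority-class rows
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)

# SMOTE: create synthetic minority-class samples by interpolation
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

print(y_ros.value_counts())
print(y_smote.value_counts())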
Cross Tabs for Feature Relationships
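A minimal sketch of a cross tab between two hypothetical categorical columns using pd.crosstab:
# 'gender' and 'target' are placeholder column names
pd.crosstab(df['gender'], df['target'], normalize='index')   # row-wise proportions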
Exploratory Data Analysis
(EDA):
Definition: Exploratory Data Analysis (EDA) is a critical step in
the data analysis process. It involves examining and summarizing
the main characteristics of a dataset, often using visual methods.
Here are some key steps and techniques you can use during EDA.
Techniques:
Data visualization: Histograms, scatter plots, box plots.
Summary statistics: Mean, median, standard deviation.
Outlier detection: Identifying data points that deviate significantly from the
rest of the dataset
What we can do in EDA
Visualization
Using plots for making inferences in machine learning involves
visualizing data to understand its structure, relationships, and
patterns, which can guide feature selection, model choice, and
evaluation.
Visualization Techniques for EDA
Histograms
Box plots
Scatter plots
Pair plots
Correlation matrix and heatmaps
Bar plots
Count plots
Violin plots
Histograms
Purpose: Understand the distribution of individual variables.
Inference: Identify skewness, outliers, and the presence of multiple
modes in the data.
Box Plots
Purpose: Summarize the distribution of a dataset.
Inference: Detect outliers, understand the spread and symmetry of the
data.
Scatter Plots
Purpose: Visualize the relationship between two continuous
variables.
Inference: Identify correlations, clusters, and potential outliers
Pair Plots (Scatterplot Matrix)
Purpose: Visualize pairwise relationships
between multiple variables.
Inference: Detect relationships between pairs
of features, spot trends, clusters, and outliers.
Correlation Matrix and Heatmaps
Purpose: Show the correlation coefficients between variables.
Inference: Identify highly correlated features that might be
redundant.
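A short sketch of a correlation heatmap with seaborn, computed on the numeric columns of a DataFrame df:
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')   # annotate each cell with its coefficient
plt.show()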
Bar Plots
Purpose: Compare categorical data.
Inference: Understand the frequency distribution of categorical
features.
Density plots
Density plots display the probability density function of a continuous
variable. They are useful for visualizing the overall shape of the distribution
and comparing multiple distributions.
Distribution plot
A distribution plot is a visualization that combines aspects of a histogram and a
kernel density plot to show the distribution of a continuous variable.
It is useful for understanding the distribution of data points in a dataset and
identifying patterns such as skewness, kurtosis, and the presence of outliers.
sns.histplot is used instead of sns.distplot.
The parameter kde=True adds the KDE line to the histogram
➢ In recent versions of Seaborn (0.11.0 and later), sns.distplot has been
deprecated.
➢ Instead, you should use sns.histplot or sns.kdeplot for similar functionality.
Here's how to create a similar plot using sns.histplot
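For example, with a placeholder column 'col':
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['col'], kde=True)   # histogram with a KDE curve, replacing sns.distplot
plt.show()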
Count Plots
Purpose: Show the counts of observations in each categorical bin.
Inference: Detect the distribution of categorical features.
Pie Plot
A pie plot (or pie chart) is a circular statistical graphic that is divided into slices
to illustrate numerical proportions. Each slice represents a category's
proportion to the whole dataset. Pie charts are useful for showing the relative
sizes of parts to a whole, making it easy to compare the parts of a single
categorical variable.
Feature Elimination Method
If the number of features is too large to handle, it is a wise approach to remove insignificant features.
There are two popular approaches:
PCA :Principal Component Analysis
RFE: Recursive Feature Elimination
Recursive Feature Elimination (RFE)
Purpose:
RFE is a feature selection method that iteratively removes less
important features based on the model's performance, identifying
the most influential features for predicting the target variable.
➢ Mathematical Basis:
1. Feature Importance:
➢ RFE uses an estimator (e.g., Logistic Regression, Random
Forest, etc.) that provides feature importance, such as weights
or coefficients.
Steps:
➢ Train the model on the dataset.
➢ Rank features based on their importance scores.
➢ Remove the least important feature(s).
➢ Repeat until the desired number of features remains.
➢ Advantages:
➢ Identifies the most critical features for the model.
➢ Helps improve model performance by eliminating redundant or
irrelevant features.
➢ Works well with small to medium datasets.
➢ Disadvantages:
➢ Computationally expensive for large datasets.
➢ Performance depends on the chosen estimator.
Code Example:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# Logistic Regression as estimator
logitR = LogisticRegression()
# Apply RFE to select top 2 features
selector = RFE(estimator=logitR, n_features_to_select=2, step=1)
selector.fit(X_train, y_train)
# Selected features
print("Selected Features:", selector.support_)
Principal Component Analysis (PCA)
Purpose:
PCA is a dimensionality reduction technique that
transforms the dataset into a lower-dimensional space
while preserving as much variance as possible.
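A minimal sketch with scikit-learn, keeping two components and assuming X is the independent-variable matrix:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA assumes centered (and usually scaled) data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Share of variance preserved by each component
print(pca.explained_variance_ratio_)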
Key Differences