Ch03 DS-Unit-2 ABM Final
Data Preprocessing
a. This labor-intensive phase covers all
aspects of preparing the final data set,
which will be used in the subsequent
phases, from the initial raw, dirty data.
b. Select the cases and variables that
you want to analyze and that are
appropriate for your analysis.
c. Perform transformations on certain
variables, if needed.
d. Clean the raw data so that it is ready
for the modeling tools.
Data Preprocessing
Data preparation:
• Evaluate the quality of the data
• Clean the raw data
• Deal with missing data
• Perform transformations on certain variables
Data Understanding
Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is used by data scientists to analyze
and investigate data sets and summarize their main characteristics,
often employing data visualization methods. It helps determine how
best to manipulate data sources to get the answers you need, making
it easier for data scientists to discover patterns, spot anomalies, test a
hypothesis, or check assumptions.
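As a minimal first-pass sketch (assuming the data has already been loaded into a pandas DataFrame; the file name data.csv is only illustrative), EDA often starts with summary statistics and quick plots:

import pandas as pd

df = pd.read_csv("data.csv")          # illustrative file name
print(df.shape)                       # number of rows and columns
print(df.dtypes)                      # data type of each variable
print(df.describe(include="all"))     # summary statistics for numeric and categorical columns
df.hist(figsize=(10, 8))              # quick distribution plots for the numeric columns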
Importance of data preprocessing
•Preprocessing data is an important step for data analysis. The
following are some benefits of preprocessing data:
•It improves accuracy and reliability. Preprocessing data removes
missing or inconsistent data values resulting from human or
computer error, which can improve the accuracy and quality of a
dataset, making it more reliable.
•It makes data consistent. When collecting data, it's possible to have
data duplicates, and discarding them during preprocessing can
ensure the data values for analysis are consistent, which helps
produce accurate results.
•It makes data more readable for algorithms. Preprocessing
enhances the data's quality and makes it easier for machine learning
algorithms to read, use, and interpret it.
Data Cleaning
What is Data Cleaning in Data Science?
Data Cleaning
• Real-world data is noisy and contains many errors; it is rarely in
its best format.
• So these data points need to be fixed.
• It is estimated that data scientists spend between 80 and 90
percent of their time on data cleaning.
• Your workflow should start with data cleaning. You are likely to
duplicate or incorrectly classify data when working with large
datasets and merging several data sources.
• Your algorithms and results will lose accuracy if you have
wrong or incomplete data.
• For example: consider data where we have the gender column. If the
data is being filled manually, then there is a chance that the data
column can contain records of ‘male’, ‘female’, ‘M’, ‘F’, ‘Male’, ‘Female’,
‘MALE’, ‘FEMALE’, etc. In such cases, while we perform analysis on
the columns, all these values will be considered distinct. But in reality,
‘Male’, ‘M’, ‘male’, and ‘MALE’ refer to the same information. The data
cleaning step will identify such incorrect formats and fix them.
Data Cleaning
Data Cleaning Process
Step 1: Remove Duplicates
•When you are working with large datasets, working across
multiple data sources, or have not implemented any quality
checks before adding an entry, your data will likely show
duplicated values.
•These duplicated values add redundancy to your data and can
throw off your calculations. Duplicate serial numbers of
products in a dataset will give you a higher count of products than
the actual number.
•Duplicate email IDs or mobile numbers might cause your
communication to look more like spam. We take care of these
duplicate records by keeping just one occurrence of any unique
observation in our data.
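A minimal pandas sketch of this step (the column names below are only illustrative):

import pandas as pd

df = pd.DataFrame({"serial_no": [101, 102, 102, 103],
                   "product":   ["A", "B", "B", "C"]})
# keep only the first occurrence of each fully identical row
df = df.drop_duplicates(keep="first")
# or deduplicate on a key column such as the product serial number
df = df.drop_duplicates(subset="serial_no", keep="first")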
Step 2: Remove Irrelevant Data
•Consider you are analyzing the after-sales service of a product. You get
data that contains various fields like service request date, unique service
request number, product serial number, product type, product purchase
date, etc.
•While these fields seem to be relevant, the data may also contain other
fields like attended by (name of the person who initiated the service
request), location of the service center, customer contact details, etc.,
which might not serve our purpose if we were to analyze the expected
period for a product to undergo servicing. In such cases, we remove
those fields irrelevant to our scope of work. This is the column-level
check we perform initially.
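A minimal sketch of this column-level check in pandas (assuming the after-sales DataFrame df described above; the column names are illustrative):

# drop the fields that fall outside the scope of the analysis
irrelevant = ["attended_by", "service_center_location", "customer_contact"]
df = df.drop(columns=irrelevant, errors="ignore")   # errors="ignore" skips columns that are absent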
Step 3: Standardize capitalization
You must ensure that the text in your data is consistent. If your
capitalization is inconsistent, it could result in the creation of many false
categories.
•For example: the column names “Total_Sales” and “total_sales” are treated as
different (most programming languages are case-sensitive).
•To avoid confusion and maintain uniformity among the column names,
we should follow a standardized way of providing the column names. The
most preferred naming conventions are snake case and cobra case.
•Cobra case is a writing style in which the first letter of each word is
written in uppercase and each space is replaced by the underscore (_)
character, while in snake case the first letter of each word is written
in lowercase and each space is replaced by the underscore.
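A minimal sketch that standardizes column names to snake case in pandas (assuming a DataFrame df with mixed-case names such as "Total_Sales"):

# lowercase all column names and replace spaces with underscores
df.columns = (df.columns
                .str.strip()
                .str.lower()
                .str.replace(" ", "_"))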
Step 5: Handling Outliers
An outlier is a data point in statistics that dramatically deviates from
other observations. An outlier may reflect measurement variability, or it
may point to an experimental error; the latter is occasionally removed
from the data set.
For example: let us consider pizza prices in a region. After surveying around
500 restaurants, the pizza prices vary between INR 100 and INR 7500. But after
analysis, we find that there is just one record in the dataset with a pizza price of
INR 7500, while the rest of the pizza prices lie between INR 100 and INR 1500.
The observation with a pizza price of INR 7500 is therefore an outlier, since it
deviates significantly from the rest of the population. Outliers are usually
identified using a box plot or scatter plot, and they result in skewed data. Some
models assume the data follow a normal distribution, and outliers can hurt model
performance if the data are skewed; thus, outliers must be handled before the
data are fed for model training. There are two common ways to deal with them.
•Remove the observations that consist of outlier values.
•Apply transformations like a log, square root, box-cox, etc., to make the
data values follow the normal or near-normal distribution.
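A minimal sketch of spotting such an outlier with a box plot (the pizza-price values below are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

prices = pd.Series([120, 250, 400, 650, 900, 1200, 1500, 7500])
prices.plot(kind="box")    # the INR 7500 point shows up far above the whiskers
plt.show()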
Step 6: Fix errors
Errors in your data can lead you to miss out on the key findings. This
needs to be avoided by fixing the errors that your data might have.
Systems that accept manually entered data without any data-quality
checks will almost always contain errors. To fix them, we first need to
understand the data. Consider the following example cases.
•Removing the country code from the mobile number field so that all the
values are exactly 10 digits.
•Removing any unit mentioned in columns like weight, height, etc. to
make them numeric fields.
•Identifying any incorrectly formatted value, such as an email address, and
either fixing it or removing it.
•Applying validation checks: the customer purchase date should be
later than the manufacturing date, the total amount should be
equal to the sum of the other amounts, no punctuation or special
characters should appear in a field that does not allow them, etc.
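A minimal sketch of two such fixes in pandas (the column names mobile, purchase_date, and manufacturing_date are illustrative):

import pandas as pd

# keep only the last 10 digits of the mobile number (drops country codes and punctuation)
df["mobile"] = df["mobile"].astype(str).str.replace(r"\D", "", regex=True).str[-10:]

# flag rows that violate the purchase-date validation rule
bad_dates = df[pd.to_datetime(df["purchase_date"]) < pd.to_datetime(df["manufacturing_date"])]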
Step 7: Language Translation
Datasets for machine translation are frequently combined from several
sources, which can result in linguistic discrepancies. Software used to
evaluate data typically uses monolingual Natural Language Processing
(NLP) models, which are unable to process more than one language.
Therefore, you must translate everything into a single language. There are
several translation models that can be used for this task.
Benefits of Data Cleaning in Data Science
Your analysis will be reliable and free of bias if you have a clean and correct
data collection. We have looked at eight steps for data cleansing in data
science. Let us discuss some of the benefits of data cleaning in data science.
DATA CLEANING
What is a Missing Value?
Missing data is defined as values or data that are not stored (or not present) for
some variables in the given dataset.
Below is a sample of the missing data from the Titanic dataset. You can see the
columns ‘Age’ and ‘Cabin’ have some missing values.
How is Missing Value Represented In The Dataset?
In the dataset, blanks indicate the missing values.
In pandas, missing values are usually represented by NaN,
which stands for Not a Number.
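A minimal sketch for counting the missing values per column (assuming the Titanic-style DataFrame df mentioned above):

import pandas as pd

print(df.isnull().sum())                    # number of missing values per column
print(df.isnull().mean().round(3) * 100)    # percentage of missing values per column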
Why Is Data Missing From The Dataset?
There can be multiple reasons why certain values are missing from the data.
Reasons for the missing data from the dataset affect the approach of handling
missing data. So it’s necessary to understand why the data could be missing.
Some of the reasons are listed below:
Past data might get corrupted due to improper maintenance.
Observations were not recorded for certain fields, for example because of a
failure to record the values due to human error.
The user intentionally did not provide the values.
Figure Out How To Handle The Missing Data
Analyze each column with missing values carefully to understand the
reasons behind the missing values as it is crucial to find out the strategy
for handling the missing values.
There are two primary ways of handling missing values:
1. Deleting the missing values
2. Imputing the missing values
Deleting the entire column
If a certain column has many missing values then you can choose to drop
the entire column.
Replacing With Mode
Mode is the most frequently occurring value. It is used in the case of
categorical features.
You can use the ‘fillna’ method for imputing the categorical columns
‘Gender’, ‘Married’, and ‘Self_Employed’.
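A minimal sketch using fillna for the categorical columns mentioned above (assuming a DataFrame df containing them):

import pandas as pd

for col in ["Gender", "Married", "Self_Employed"]:
    df[col] = df[col].fillna(df[col].mode()[0])   # mode() returns a Series; take its first value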
HANDLING MISSING DATA
Missing data is a problem that continues to
plague data analysis methods.
Even as our analysis methods gain
sophistication, we still encounter missing
values in many fields.
Delete Rows with Missing Values:
Missing values can be handled by deleting the rows or columns that have
null values. If a column has more than half of its rows null, then the
entire column can be dropped. Rows that have null values in one or more
columns can also be dropped.
Pros:
•Removing all missing values before training can produce a more robust
model.
Cons:
•Loss of a lot of information.
•Works poorly if the percentage of missing values is excessive in
comparison to the complete dataset.
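A minimal pandas sketch of both deletions (assuming a DataFrame df):

import pandas as pd

# keep a column only if at least half of its values are non-null
df = df.dropna(axis=1, thresh=(len(df) + 1) // 2)
# then drop any rows that still contain one or more null values
df = df.dropna(axis=0, how="any")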
Impute missing values with Mean/Median:
Columns in the dataset that have numeric continuous values can have their
missing values replaced with the mean, median, or mode of the remaining values
in the column. This method can prevent the loss of data compared to the earlier
method. Replacing missing values with these approximations (mean, median) is a
statistical approach to handling them.
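A minimal sketch (the numeric column name LoanAmount is only illustrative):

import pandas as pd

df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())
# or, more robust to outliers:
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())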
Pros:
•Prevents the data loss that results from deleting rows or columns.
•Works well with a small dataset and is easy to implement.
Cons:
•Works only with numerical continuous variables.
•Can cause data leakage.
•Does not factor in the covariance between features.
Imputation method for categorical columns:
When the missing values are in categorical columns (string or numerical),
they can be replaced with the most frequent category. If the number of
missing values is very large, they can be replaced with a new
category.
Pros:
•Prevents the data loss that results from deleting rows or columns.
•Works well with a small dataset and is easy to implement.
•Negates the loss of data by adding a unique category.
Cons:
•Works only with categorical variables.
•Adds new features to the model while encoding, which may result
in poor performance.
Other Imputation Methods:
Depending on the nature of the data or data type, some other imputation
methods may be more appropriate for imputing missing values.
For example, for a variable with longitudinal behavior, it might
make sense to use the last valid observation to fill the missing value. This
is known as the Last Observation Carried Forward (LOCF) method.
data["Age"] = data["Age"].ffill()   # forward-fill; fillna(method='ffill') is deprecated in recent pandas
Using Algorithms that support missing values:
Not all machine learning algorithms support missing values, but some
ML algorithms are robust to missing values in the dataset.
The k-NN algorithm can ignore a column from a distance measure when a
value is missing.
Naive Bayes can also support missing values when making a prediction.
These algorithms can be used when the dataset contains null or missing
values.
Another algorithm that can be used here is Random Forest, which works well on
non-linear and categorical data.
Pros:
No need to handle missing values in each column, as these algorithms handle them
efficiently.
Cons:
The scikit-learn implementations of these algorithms do not handle missing values natively.
Prediction of missing values:
The earlier methods for handling missing values do not take advantage of the
correlation between the variable containing the missing values and the
other variables. The other features, which do not have nulls, can be
used to predict the missing values.
A regression or classification model can be used to predict the
missing values, depending on the nature (continuous or categorical) of
the feature that has missing values.
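A minimal sketch, assuming the Titanic-style DataFrame data from earlier, with Age partly missing and Fare, SibSp, and Parch complete (these predictor names are illustrative):

import pandas as pd
from sklearn.linear_model import LinearRegression

predictors = ["Fare", "SibSp", "Parch"]
known   = data[data["Age"].notna()]
unknown = data[data["Age"].isna()]

# fit on the rows where Age is known, then predict it for the rows where it is missing
model = LinearRegression().fit(known[predictors], known["Age"])
data.loc[data["Age"].isna(), "Age"] = model.predict(unknown[predictors])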
Missing values categories
Missing Completely At Random (MCAR)
•In MCAR, the probability of data being missing is the same for all the
observations.
•In this case, there is no relationship between the missing data and any
other values observed or unobserved (the data which is not recorded)
within the given dataset.
•That is, missing values are completely independent of other data. There is
no pattern.
•In the case of MCAR data, the value could be missing due to human error,
some system/equipment failure, loss of sample, or some unsatisfactory
technicalities while recording the values.
•For Example, suppose in a library there are some overdue books. Some
values of overdue books in the computer system are missing.
•The reason might be a human error, like the librarian forgetting to type in
the values. So, the missing values of overdue books are not related to any
other variable/data in the system.
Missing At Random (MAR)
•MAR data means that the reason for missing values can be explained by
variables on which you have complete information, as there is some
relationship between the missing data and other values/data.
•In this case, the data is not missing for all the observations. It is missing
only within sub-samples of the data, and there is some pattern in the
missing values.
•For example, if you check the survey data, you may find that all the
people have answered their ‘Gender,’ but ‘Age’ values
are mostly missing for people who have answered their ‘Gender’ as
‘female.’ (The reason being most of the females don’t want to reveal their
age.)
Missing Not At Random (MNAR)
•Missing values depend on the unobserved data. If there is some
structure/pattern in missing data and other observed data can not
explain it, then it is considered to be Missing Not At Random (MNAR).
•If the missing data does not fall under the MCAR or MAR, it can be
categorized as MNAR. It can happen due to the reluctance of people to
provide the required information. A specific group of respondents may not
answer some questions in a survey.
•For example, suppose the name and the number of overdue books are
asked in the poll for a library. So most of the people having no overdue
books are likely to answer the poll. People having more overdue books are
less likely to answer the poll. So, in this case, the missing value of the
number of overdue books depends on the people who have more books
overdue.
HANDLING MISSING DATA Cont…
Result of replacing the missing values with the
constant 0 for the numerical variable cubic inches
and the label missing for the categorical variable
brand.
HANDLING MISSING DATA Cont…
Missing values may be replaced with the respective
field means and modes
HANDLING MISSING DATA (more methods) Cont…
XGBoost learns how to treat missing values.
Once a tree structure has been trained, it is not hard to handle
missing values in the test set as well: it is enough to attach a default
direction to each decision node.
The optimal default direction is determined during training, and missing
values are sent in that direction.
IDENTIFYING MISCLASSIFICATIONS
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
cars data set histogram
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Sometimes two-dimensional scatter plots can help to reveal outliers in more than one variable.
Qualitative = Quality
Qualitative data are measures of 'types' and may be represented by a name,
symbol, or a number code. Qualitative data are data about categorical
variables (e.g., what type).
Data collected about a numeric variable will always be quantitative and
data collected about a categorical variable will always be qualitative.
Therefore, you can identify the type of data, prior to collection, based on
whether the variable is numeric or categorical.
MEASURES OF CENTER AND SPREAD
A measure of central tendency (also referred to as measures of centre or
central location) is a summary measure that attempts to describe a whole
set of data with a single value that represents the middle or centre of its
distribution.
The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.
MEASURES OF CENTER AND SPREAD
There are some limitations to using the mode. In some distributions, the
mode may not reflect the centre of the distribution very well. When the
distribution of retirement age is ordered from lowest to highest value, it
is easy to see that the centre of the distribution is 57 years, but the mode
is lower, at 54 years.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Median
The median is the middle value in distribution when the values are
arranged in ascending or descending order.
The median divides the distribution in half. In a distribution with an odd
number of observations, the median value is the middle value.
Looking at the retirement age distribution (which has 11 observations),
the median is the middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median
value is the mean of the two middle values. In the following distribution,
the two middle values are 56 and 57, therefore the median equals 56.5
years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
MEASURES OF CENTER AND SPREAD
Mean
The mean is the sum of the value of each observation in a dataset divided
by the number of observations. This is also known as the arithmetic
average.
Looking at the retirement age distribution again:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the
number of observations (11) which equals 56.6 years.
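A minimal check of these values in Python, using the retirement-age data above:

import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(statistics.mean(ages))     # 56.636..., i.e. about 56.6 years
print(statistics.median(ages))   # 57
print(statistics.mode(ages))     # 54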
MEASURES OF CENTER AND SPREAD
Definition
The spread of data (also called dispersion or variability) refers to how
much the data values differ from each other or from the central value
(mean, median, or mode). It helps understand the distribution and
consistency of data points.
• Need for Measuring Spread
• Understanding Variability – Helps determine how much the data deviates
from the average.
• Comparing Datasets – Two datasets with the same mean can have different
spreads, impacting decision-making.
• Detecting Outliers – Large spread values can indicate the presence of
extreme values.
• Choosing the Right Model – Some machine learning models work better
with data having low variability.
• Measures of Spread
1. Range – Difference between the highest and lowest values.
2. Interquartile Range (IQR) – Range of the middle 50% of the data.
3. Variance – The average squared deviation from the mean.
4. Standard Deviation – The square root of variance, indicating average deviation
from the mean.
Example
Scenario: Exam Scores
Suppose two classes have students scoring as follows:
•Class A: [45, 50, 55, 60, 65]
•Class B: [30, 40, 55, 70, 80]
Calculating Spread:
•Standard Deviation:
• Class A has a smaller standard deviation → Scores are close to the mean.
• Class B has a higher standard deviation → Scores are more spread out.
Interpretation:
•Class A's students have scores closer to each other (low spread).
•Class B's students have scores that vary significantly (high spread).
•A teacher might conclude that Class A is more consistent in performance, while
Class B has more diverse performance levels.
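A minimal sketch of the comparison with NumPy (ddof=1 gives the sample standard deviation):

import numpy as np

class_a = [45, 50, 55, 60, 65]
class_b = [30, 40, 55, 70, 80]
print(np.std(class_a, ddof=1))   # ~7.9  -> scores close to the mean
print(np.std(class_b, ddof=1))   # ~20.6 -> scores more spread out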
Low variability in data means that the values in the dataset are closely clustered
around the central value (mean or median), indicating minimal spread or dispersion.
In other words, the data points do not fluctuate much and are more consistent.
Example:
•Low Variability Dataset: [48, 49, 50, 51, 52]
• Mean = 50
• Standard Deviation ≈ 1.58 (small value)
•High Variability Dataset: [20, 30, 50, 70, 80]
• Mean = 50
• Standard Deviation ≈ 25.50 (larger spread)
MEASURES OF CENTER AND SPREAD
What each measure of spread can tell us
Range
The range is the difference between the smallest value and the largest
value in a dataset.
Dataset A
4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8
The range is 4, the difference between the highest value (8) and the lowest value
(4)
Standard Deviation and
Variance
• The standard deviation is the average amount of variability in your
dataset. It tells you, on average, how far each value lies from the mean.
• A high standard deviation means that values are generally far from
the mean, while a low standard deviation indicates that values are
clustered close to the mean.
• Standard deviation is the square root of the variance. It provides a more
interpretable measure of spread because it is in the same unit as the
original data.
Example: Comparing different standard deviations
You collect data on job satisfaction ratings from three groups of employees
using simple random sampling.
The mean (M) ratings are the same for each group – it’s the value on the x-
axis when the curve is at its peak. However, their standard deviations (SD)
differ from each other.
The standard deviation reflects the dispersion of the distribution. The curve
with the lowest standard deviation has a high peak and a small spread,
while the curve with the highest standard deviation is more flat and
widespread.
Example: Standard deviation in a normal distribution
You administer a memory recall test to a group of students. The data follows a
normal distribution with a mean score of 50 and a standard deviation of 10.
Following the empirical rule:
• Around 68% of scores fall within 1 standard deviation of the mean (between 40 and 60).
• Around 95% of scores fall within 2 standard deviations of the mean (between 30 and 70).
• Around 99.7% of scores fall within 3 standard deviations of the mean (between 20 and 80).
Variance
Variance is defined as, “The measure of how far the set of data is
dispersed from their mean value”. Variance is represented with the
symbol σ2. In other words, we can also say that the variance is the
average of the squared difference from the mean.
Properties of Variance
Various properties of the Variance of the group of data are,
•As each term in the variance formula is first squared and then the mean of
the squares is taken, the variance is always a non-negative value, i.e. it can
be positive or zero but it can never be negative.
•Variance is always measured in squared units. For example, if we have to
find the variance of the heights of the students in a class, and the heights
are given in cm, then the variance is calculated in cm².
Difference between Standard Deviation and variance
• Standard deviation measures how far apart numbers are in a data set. Variance, on the
other hand, gives an actual value to how much the numbers in a data set vary from the
mean.
• Standard deviation is the square root of the variance and is expressed in the same units
as the data set. Variance can be expressed in squared units or as a percentage (especially
in the context of finance).
• Standard deviation can be greater than the variance since the square root of a decimal
is larger (and not smaller) than the original number when the variance is less than one
(1.0 or 100%).
• The standard deviation is smaller than the variance when the variance is more than one
(e.g. 1.2 or 120%).
example
Example: For the following data, determine the mean, variance, and
standard deviation:
DATA TRANSFORMATION
Raw data is difficult to trace or understand. That's why it needs to be
preprocessed before retrieving any information from it.
Data transformation is a technique used to convert the raw data into a
suitable format that efficiently eases data mining and retrieves strategic
information.
Data transformation includes data cleaning and data
reduction techniques to convert the data into the appropriate form.
Data transformation is an essential data preprocessing technique that must
be performed on the data before data mining to provide patterns that are
easier to understand.
Data transformation changes the format, structure, or values of the data
and converts them into clean, usable data. Data may be transformed at
two stages of the data pipeline for data analytics projects.
Data integration, migration, data warehousing, data wrangling may all involve
data transformation. Data transformation increases the efficiency of business and
analytic processes, and it enables businesses to make better data-driven decisions.
Data Transformation Techniques
There are several data transformation techniques that can help
structure and clean up the data before analysis or storage in a data
warehouse.
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset
using some algorithms. It allows for highlighting important features
present in the dataset and helps in predicting patterns. When collecting
data, the data can be smoothed to eliminate or reduce variance and other
forms of noise.
2. Attribute Construction
Attribute construction, also known as feature construction, is the process of creating
new attributes (features) from existing data to improve the performance of machine
learning models. This helps in better data representation, leading to improved
accuracy and efficiency.
3. Data Aggregation
Data aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources and integrated
into a single summary for data analysis.
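A minimal aggregation sketch in pandas (the sales data and column names are illustrative):

import pandas as pd

sales = pd.DataFrame({"region": ["north", "north", "south", "south"],
                      "amount": [120, 80, 200, 150]})
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)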
4. Data Discretization
Data discretization is the process of converting continuous numerical
data into discrete categories or bins. This transformation helps simplify
complex datasets and makes them more interpretable for certain
algorithms, such as decision trees and Naïve Bayes classifiers.
5. Data Normalization
Data normalization is the process of scaling the data to a much smaller
range, without losing information, to help minimize or exclude duplicated
data and improve algorithm efficiency and data extraction performance.
There are three methods to normalize an attribute:
Decimal Scaling Method For
Normalization
Decimal Scaling is a normalization technique where data values are divided
by a power of 10 such that the largest absolute value in the dataset is
between -1 and 1.
Min-Max Normalization
What is Data Normalization?
Data normalization is a technique in data transformation where numerical
values in a dataset are scaled to a common range without distorting their
relationships.
This process ensures that different features contribute equally to machine
learning models, improving model accuracy and efficiency.
Normalization is crucial in data preprocessing before applying machine
learning algorithms, especially when working with data that has different
units or scales.
MIN–MAX NORMALIZATION
MIN–MAX NORMALIZATION
Let X refer to our original field value, and X∗ refer to the
normalized field value.
Min–max normalization works by seeing how much greater the field value
is than the minimum value min(X), and scaling this difference by the range
of X. That is,
X∗ = (X − min(X)) / (max(X) − min(X))
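A minimal sketch of min–max normalization (a hand-rolled version and the scikit-learn equivalent; the values are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

x = pd.Series([20, 35, 50, 80, 100])
x_minmax = (x - x.min()) / (x.max() - x.min())    # values scaled to [0, 1]

scaler = MinMaxScaler()
x_sklearn = scaler.fit_transform(x.to_frame())     # same result, returned as a 2-D array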
Z-SCORE STANDARDIZATION
• Z-score standardization (also called standardization or normalization
using the Z-score) is a method that transforms data so that it has a mean
(μ) of 0 and a standard deviation (σ) of 1. This transformation makes
the dataset scale-independent, which is useful for algorithms
that assume normally distributed data, such as Linear Regression,
Logistic Regression, PCA, and clustering algorithms.
Standardized Data: [-1.41, -0.71, 0.0, 0.71, 1.41]
Now, the data has a mean of 0 and a standard deviation of 1.
Z-SCORE STANDARDIZATION
Z-score standardization, which is very widespread in the world of statistical
analysis, works by taking the difference between the field value and the field
mean value, and scaling this difference by the SD:
Z = (X − mean(X)) / SD(X)
Z-Score Normalization
Z-score normalization refers to the process of normalizing every value in a
dataset such that the mean of all of the values is 0 and the standard deviation is 1.
We use the following formula to perform a z-score normalization on every value in a
dataset:
New value = (x – μ) / σ
where:
•x: Original value
•μ: Mean of data
•σ: Standard deviation of data
Using a calculator, we can find that the mean of the
dataset is 21.2 and the standard deviation is 29.8.
To perform a z-score normalization on the first value in
the dataset, we can use the following formula:
•New value = (x – μ) / σ
•New value = (3 – 21.2) / 29.8
•New value = -0.61
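A minimal sketch of the same standardization in Python (the values below are illustrative; note that scipy's zscore uses the population standard deviation by default):

import numpy as np
from scipy.stats import zscore

x = np.array([3.0, 15.0, 21.2, 40.0, 55.0])    # illustrative values
z_manual = (x - x.mean()) / x.std(ddof=1)      # uses the sample standard deviation
z_scipy  = zscore(x)                           # uses the population standard deviation (ddof=0)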
DECIMAL SCALING
This method normalizes the value of attribute A by moving the
decimal point in the value. The movement of the decimal point
depends on the maximum absolute value of A. The formula for
decimal scaling is:
v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1.
For example, the observed values for attribute A range from -986 to 917, and the
maximum absolute value for attribute A is 986. Here, to normalize each value of
attribute A using decimal scaling, we have to divide each value of attribute A by
1000, i.e., j=3.
So, the value -986 would be normalized to -0.986, and 917 would be normalized to
0.917.
DECIMAL SCALING
Decimal scaling ensures that every normalized
value lies between −1 and 1.
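A minimal sketch of decimal scaling (the values match the example above):

import numpy as np

a = np.array([-986, 917])
j = int(np.ceil(np.log10(np.abs(a).max() + 1)))   # smallest j with max(|a|) / 10**j < 1, here j = 3
a_scaled = a / 10**j                               # [-0.986, 0.917]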
NORMALIZATION TECHNIQUES COMMENTS
What are the best normalization methods
(Z-Score, Min-Max, etc.)?
Z-score
preserve range (maximum and minimum)
If your data follow a Gaussian distribution, they are converted into an
N(0,1) distribution and probability calculations become easier.
More Researcher Comments
Depend on the data to be normalized. Normally Z-score is very
common for data normalization.
Min-Max and Z-score are not suitable for sparse data.
It depends on the aims of the study and the nature of the data.
NORMAL DISTRIBUTION OF DATA
A normal distribution has a probability distribution that is centered around the
mean. This means that the distribution has more data around the mean. The data
distribution decreases as you move away from the center. The resulting curve is
symmetrical about the mean and forms a bell-shaped distribution.
Most data scientists report more accurate results when they transform the
independent variables too, i.e., correct the skew of the independent
variables. The lower the skewness, the better the result.
Types Of Transformations For Better Normal Distribution
1. Log Transformation:
Numerical variables may have a highly skewed, non-normal (non-Gaussian)
distribution caused by outliers, highly exponential distributions, etc.
Therefore we apply a data transformation.
In the log transformation, each value x is replaced by log(x) with
base 10, base 2, or the natural log.
import numpy as np
log_target = np.log1p(df["Target"])
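# note: np.log1p computes log(1 + x), which also handles zero values safely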
2. Square-Root Transformation :
This transformation has a moderate effect on the distribution. The main advantage
of the square root transformation is that it can be applied to zero values.
Here x is replaced by the square root of x. It is weaker than the log
transformation.
sqrt_target = df["Target"]**(1/2)
TRANSFORMATIONS TO ACHIEVE NORMALITY
Variables must be normally distributed for some data mining algorithms
The normal distribution is a continuous probability distribution commonly known as
the bell curve, which is symmetric.
A distribution, or data set, is symmetric if it looks the same to the left and
right of the center point.
It is centered at mean 𝜇 and has its spread determined by SD 𝜎 (sigma)
TRANSFORMATIONS TO ACHIEVE NORMALITY
A common misconception is that data to which Z-score standardization has been
applied follow the standard normal Z distribution. This is not the case:
standardization shifts and rescales the data but does not change its shape, so
data that are right-skewed before standardization remain right-skewed (and not
normally distributed) afterwards.
TRANSFORMATIONS TO ACHIEVE NORMALITY
Statistic to measure the skewness of a distribution
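One commonly used statistic (the exact formula was not preserved on the original slide, so treat this as an assumption about which measure was intended) is Pearson's skewness coefficient:
skewness = 3 × (mean − median) / standard deviation
Positive values indicate right skew (mean > median), negative values indicate left skew, and values near zero indicate a roughly symmetric distribution.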
TRANSFORMATIONS TO ACHIEVE NORMALITY
To make data “more normally
distributed,” make it symmetric -
eliminate skewness.
Common transformations are
o the natural log transformation
o square root transformation, and
o Inverse square root
transformation
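A minimal sketch comparing these transformations (assuming a right-skewed pandas Series x of positive values):

import numpy as np
from scipy.stats import skew

print("original            :", skew(x))
print("natural log         :", skew(np.log(x)))
print("square root         :", skew(np.sqrt(x)))
print("inverse square root :", skew(1 / np.sqrt(x)))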
Why is Normality so important?
Linear Discriminant Analysis (LDA), Linear Regression, and many other
parametric machine learning models assume that data is normally
distributed. If this assumption is not met, the model may not provide
accurate predictions.
What is skewness?
Skewness is the degree of asymmetry of a distribution. A distribution is
symmetric if it is centred around its mean and the left and right sides are
mirror images of each other. A distribution is skewed if it is not
symmetric.
Suppose car prices may range from 100 to 10,00,000 with the average
being 5,00,000.
If the distribution’s peak is on the left side, our data is positively skewed
and the majority of the cars are being sold for less than the average
price.
If the distribution’s peak is on the right side, our data is negatively
skewed and the majority of the cars are being sold for more than the
average price.
TRANSFORMATIONS TO ACHIEVE NORMALITY
Application of the square root transformation
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
The Z-score method states that a data value is an outlier if it
has a Z-score that is either less than −3 or greater than 3.
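A minimal sketch of the Z-score rule (assuming a numeric pandas Series values):

z = (values - values.mean()) / values.std()   # Z-score of every data value
outliers = values[z.abs() > 3]                # values more than 3 SDs from the mean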
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
The mean and SD, which are part of the formula for Z-score
standardization, are sensitive to the presence of outliers:
the values of the mean and SD are both unduly affected by
the presence or absence of even a single extreme data value.
It is therefore not appropriate to detect outliers using
measures that are themselves sensitive to their presence.
Data analysts have developed more robust statistical
methods for outlier detection, which are less sensitive to
the presence of the outliers.
One elementary robust method is to use the interquartile
range (IQR).
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Using the interquartile range
The interquartile range (IQR) tells you the range of the middle half of
your dataset. You can use the IQR to create “fences” around your data
and then define outliers as any values that fall outside those fences.
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
This method is helpful if you have a few values on the
extreme ends of your dataset, but you aren’t sure
whether any of them might count as outliers.
22 24 25 28 29 31 35 37 41 53 64
Step 2: Identify the median, the first quartile (Q1), and the third
quartile (Q3)
The median is the value exactly in the middle of your dataset when all
values are ordered from low to high.
Since you have 11 values, the median is the 6th value. The median value
is 31.
22 24 25 28 29 31 35 37 41 53 64
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Next, we’ll use the exclusive method for identifying Q1 and Q3. This
means we remove the median from our calculations.
Q1 is the value in the middle of the first half of your dataset,
excluding the median. The first quartile value is 25.
22 24 25 28 29
Similarly, Q3 is the value in the middle of the second half of the dataset
(35 37 41 53 64), excluding the median. The third quartile value is 41.
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Step 3: Calculate the IQR
Formula: IQR = Q3 − Q1
Calculation: Q1 = 25, Q3 = 41, so IQR = 41 − 25 = 16
Step 4: Calculate your upper fence
Formula: Upper fence = Q3 + (1.5 × IQR)
Calculation: Upper fence = 41 + (1.5 × 16) = 41 + 24 = 65
Step 5: Calculate your lower fence
The lower fence is the boundary around the first quartile. Any values less
than the lower fence are outliers.
Formula: Lower fence = Q1 − (1.5 × IQR)
Calculation: Lower fence = 25 − (1.5 × 16) = 25 − 24 = 1
22 24 25 28 29 31 35 37 41 53 64
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Quartiles divide a data set into four
parts,
Each containing 25% of the data:
first quartile (Q1) is the 25th percentile.
second quartile (Q2) is 50th percentile,
(median).
The third quartile (Q3) is the 75th
percentile.
IQR is calculated as IQR=Q3−Q1
A robust measure of outlier detection: a data
value is an outlier if
a. it is located 1.5 × IQR or more below Q1, or
b. it is located 1.5 × IQR or more above Q3.
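A minimal sketch of the IQR fence rule on the example data above (note that pandas interpolates quantiles, so its Q1 and Q3 may differ slightly from the hand-calculated exclusive-method values):

import pandas as pd

values = pd.Series([22, 24, 25, 28, 29, 31, 35, 37, 41, 53, 64])
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())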
FLAG VARIABLES
Some analytical methods, such as regression, require predictors to be
numeric
Need to recode the categorical variable into one or more flag variables
A flag variable (or dummy variable, or indicator variable) is a
categorical variable taking only two values, 0 and 1.
For example, the categorical predictor sex, taking the values female and
male, could be recoded into the flag variable sex_flag as follows:
if sex = female then sex_flag = 0; if sex = male then sex_flag = 1.
When a categorical predictor takes k ≥ 3 possible values, then define
k−1 dummy variables and use the unassigned category as the
reference category
For example, region has k=4 possible categories, {north, east, south,
west}, then the analyst could define the following k−1=3 flag variables
north_flag: If region = north then north_flag = 1; otherwise
north_flag = 0.
east_flag: If region = east then east_flag = 1; otherwise east_flag =
0
south_flag: If region = south then south_flag = 1; otherwise
south_flag = 0.
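A minimal pandas sketch of the k−1 flag variables for the region example (illustrative data; drop_first=True keeps three of the four indicator columns, with "east" becoming the reference category because categories are ordered alphabetically):

import pandas as pd

df = pd.DataFrame({"region": ["north", "east", "south", "west", "north"]})
flags = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=int)
df = pd.concat([df, flags], axis=1)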
TRANSFORMING CATEGORICAL VARIABLES INTO NUMERICAL VARIABLES
One-Hot Encoding
Ordinal Encoding
Mean Encoding
Dummy Encoding
Effect Encoding
BINNING NUMERICAL VARIABLES
What is binning a numerical variable?
Numerical Binning is a way to group a number of more or less continuous values into
a smaller number of “bins”. Creating ranges or bins will help to understand the
numerical data better. For example, if you have age data on a group of people, you
might want to arrange their ages into a smaller number of age intervals.
In other words, binning will take a column with continuous numbers and place the
numbers in “bins” based on ranges that we determine. This will give us a new
categorical variable feature.
Advantages of binning:-
•Improves the accuracy of predictive models by reducing noise or non-linearity in the
dataset.
•Helps identify outliers and invalid and missing values of numerical variables.
Types of Binning
Equal Width (or Distance) Binning
This algorithm divides the continuous variable into several categories (bins or
ranges) of the same width. Let x be the number of categories and max and min be the
maximum and minimum values in the concerned column.
Then the width (w) will be: w = (max − min) / x
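A minimal sketch of equal-width binning with pandas (the age values and bin labels are illustrative):

import pandas as pd

df = pd.DataFrame({"age": [3, 17, 25, 34, 49, 58, 67, 81]})
# pd.cut splits the range (max - min) into bins of equal width
df["age_bin"] = pd.cut(df["age"], bins=4, labels=["child", "young", "middle", "senior"])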
ADDING AN INDEX FIELD
It is recommended that the data analyst
create an index field that
tracks the sort order of the records.
Data mining data gets partitioned at
least once (and sometimes several
times).
It is helpful to have an index field so
that the original sort order may be
recreated.
For example, using IBM/SPSS Modeler,
you can use the @Index function in the
Derive node to create an index field.
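A minimal pandas equivalent (assuming a DataFrame df):

# record the original sort order so it can be recreated after partitioning
df["index_field"] = range(1, len(df) + 1)
# ...later, after shuffling or partitioning:
df = df.sort_values("index_field")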
REMOVING VARIABLES THAT ARE NOT USEFUL
The data analyst may remove variables that will
not help the analysis.
Such variables include
unary variables and
variables that are very nearly unary.
Unary variables take on only a single value:
in a sample of students at an all-girls private
school, every record would have gender = female.
Sometimes a variable can be very nearly
unary.
For example, suppose that 99.95% of the
players in a field hockey league are female,
with the remaining 0.05% male.
REMOVAL OF DUPLICATE RECORDS
Records may have been inadvertently copied,
thus creating duplicate records.
Duplicate records lead to an overweighting of
the data values
Only one set of them should be retained
For example, if the ID field is duplicated, then
definitely remove the duplicate records.
Data analyst should apply common sense
Suppose a data set contains three nominal
fields, and each field takes only three values,
then 3 × 3 × 3 = 27 possible different records
If there are more than 27 records, at least one of
them has to be a duplicate
REMOVAL OF DUPLICATE RECORDS
Removing duplicate records is not
particularly difficult.
Most statistical packages and database
systems have built-in commands that
group records together.
In fact, in the database language SQL,
this command is called Group By.
Attribute selection
• Attribute selection is defined as “the process of finding the best subset of features
from the original set of features in a given data set.”
• Attribute subset selection is a technique used for data reduction in the data
mining process.
• The data set may have a large number of attributes, but some of those attributes
can be irrelevant or redundant.
• The goal of attribute subset selection is to find a minimum set of attributes.
• Dropping irrelevant attributes does not greatly affect the utility of the data, and the
cost of data analysis is reduced.
• The procedure is repeated again and again until every attribute remaining in the data
set has a p-value less than or equal to the significance level.
• This gives a reduced data set containing no irrelevant attributes.
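A minimal sketch of this p-value-driven elimination loop (an illustration, not the exact procedure from the original slides; it assumes a numeric feature DataFrame X, a target Series y, and a 0.05 significance level):

import statsmodels.api as sm

features = list(X.columns)
while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvalues = model.pvalues.drop("const")
    worst = pvalues.idxmax()
    if pvalues[worst] <= 0.05:      # every remaining attribute is significant: stop
        break
    features.remove(worst)          # drop the least significant attribute and refit

print("Selected attributes:", features)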
3. Combination of Forward Selection and Backward Elimination:
•The stepwise forward selection and backward elimination are combined so as to
select the relevant attributes most efficiently.
•This is the most common technique which is generally used for attribute selection.
Thank You !!!