Ch03 DS-Unit-2 ABM Final

Data preprocessing is a crucial step in preparing raw data for analysis, involving cleaning, transforming, and evaluating data quality. It includes steps like removing duplicates, handling missing values, and standardizing formats to improve data accuracy and algorithm readability. Effective data cleaning enhances the reliability of analysis and prevents biases in machine learning models.

Data Preprocessing
Data Preprocessing
a. This labor-intensive phase covers all
aspects of preparing the final data set,
which shall be used for subsequent
phases, from the initial, raw, dirty data.
b. Select the cases and variables you
want to analyse, and that are
appropriate for your analysis.
c. Perform transformations on certain
variables, if needed.
d. Clean the raw data so that it is ready
for the modeling tools.
Data Preprocessing 2
Data Preprocessing
Data preparation
 Evaluate the quality of the data,
 Clean the raw data,
 Deal with missing data, and
 Perform transformations on certain variables
Data Understanding
 Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is used by data scientists to analyze
and investigate data sets and summarize their main characteristics,
often employing data visualization methods. It helps determine how
best to manipulate data sources to get the answers you need, making
it easier for data scientists to discover patterns, spot anomalies, test a
hypothesis, or check assumptions.
Data Preprocessing 3
Importance of data preprocessing
•Preprocessing data is an important step for data analysis. The
following are some benefits of preprocessing data:
•It improves accuracy and reliability. Preprocessing data removes
missing or inconsistent data values resulting from human or
computer error, which can improve the accuracy and quality of a
dataset, making it more reliable.
•It makes data consistent. When collecting data, it's possible to have
data duplicates, and discarding them during preprocessing can
ensure the data values for analysis are consistent, which helps
produce accurate results.
•It increases the data's algorithm readability. Preprocessing
enhances the data's quality and makes it easier for machine learning
algorithms to read, use, and interpret it.

Data Preprocessing 4
Data Cleaning
What is Data Cleaning in Data Science?

Data cleaning is the process of identifying and fixing incorrect data.
Data can be in an incorrect format, duplicated, corrupt, inaccurate,
incomplete, or irrelevant. Various fixes can be applied to the data
values that represent incorrectness in the data.
The main steps of the data cleaning process are listed below.
Removing duplicates
Remove irrelevant data
Standardize capitalization
Convert data type
Handling outliers
Fix errors
Language Translation
Handle missing values

Data Preprocessing 5
Data Cleaning
• Real-world data is noisy and contains many errors; it is rarely in its
best format.
• So, it becomes important that these data points are identified and
fixed.
• It is estimated that data scientists spend between 80 and 90
percent of their time on data cleaning.
• Your workflow should start with data cleaning. You are likely to
duplicate or incorrectly classify data while working with large
datasets and merging several data sources.
• Your algorithms and results will lose their accuracy if you have
wrong or incomplete data.

• For example: consider data where we have the gender column. If the
data is being filled manually, then there is a chance that the data
column can contain records like ‘male’, ‘female’, ‘M’, ‘F’, ‘Male’, ‘Female’,
‘MALE’, ‘FEMALE’, etc. In such cases, while we perform analysis on
the columns, all these values will be considered distinct. But in reality,
‘Male’, ‘M’, ‘male’, and ‘MALE’ refer to the same information. The data
cleaning step will identify such incorrect formats and fix them.
Data Preprocessing 6
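A minimal pandas sketch of the kind of fix described above (the column name and the label mapping are assumptions for illustration):

import pandas as pd

# Hypothetical column with inconsistently formatted gender labels
df = pd.DataFrame({"gender": ["male", "F", "M", "Female", "MALE", "female"]})

# Normalize case and whitespace, then map the short codes to full labels
df["gender"] = df["gender"].str.strip().str.lower().replace({"m": "male", "f": "female"})

print(df["gender"].unique())   # ['male' 'female']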
Data Cleaning
Data Cleaning Process

Data Preprocessing 7
Step 1: Remove Duplicates
•When you are working with large datasets, working across
multiple data sources, or have not implemented any quality
checks before adding an entry, your data will likely show
duplicated values.
•These duplicated values add redundancy to your data and can
make your calculations go wrong. Duplicate serial numbers of
products in a dataset will give you a higher count of products than
the actual numbers.
•Duplicate email IDs or mobile numbers might cause your
communication to look more like spam. We take care of these
duplicate records by keeping just one occurrence of any unique
observation in our data.

Data Preprocessing 8
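A small pandas sketch of this step; the column names and values are hypothetical:

import pandas as pd

# Hypothetical product data in which one record appears twice
df = pd.DataFrame({
    "serial_no": ["A101", "A102", "A101", "A103"],
    "product": ["fan", "lamp", "fan", "fan"],
})

# Keep only the first occurrence of each unique observation
deduped = df.drop_duplicates()
print(len(df), "->", len(deduped))   # 4 -> 3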
Step 2: Remove Irrelevant Data
•Consider you are analyzing the after-sales service of a product. You get
data that contains various fields like service request date, unique service
request number, product serial number, product type, product purchase
date, etc.
•While these fields seem to be relevant, the data may also contain other
fields like attended by (name of the person who initiated the service
request), location of the service center, customer contact details, etc.,
which might not serve our purpose if we were to analyze the expected
period for a product to undergo servicing. In such cases, we remove
those fields irrelevant to our scope of work. This is the column-level
check we perform initially.

Data Preprocessing 9
Step 3: Standardize capitalization
You must ensure that the text in your data is consistent. If your
capitalization is inconsistent, it could result in the creation of many false
categories.
•For example: having column name as “Total_Sales” and “total_sales” is
different (most programming languages are case-sensitive).
•To avoid confusion and maintain uniformity among the column names,
we should follow a standardized way of providing the column names. The
most preferred naming conventions are snake case and cobra case.
•Cobra case is a writing style in which the first letter of each word is
written in uppercase and each space is substituted by the underscore (_)
character. In snake case, by contrast, the first letter of each word is
written in lowercase and each space is substituted by the underscore.

Data Preprocessing 10
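A minimal pandas sketch of standardizing column names to snake case (the example column names are assumptions):

import pandas as pd

df = pd.DataFrame({"Total_Sales": [100], "Purchase Date": ["2024-01-01"]})

# Snake case: lowercase everything and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(df.columns))   # ['total_sales', 'purchase_date']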
Step 5: Handling Outliers
An outlier is a data point in statistics that dramatically deviates from
other observations. An outlier may reflect measurement variability, or it
may point to an experimental error; the latter is occasionally removed
from the data set.

For example: let us consider pizza prices in a region. After surveying around
500 restaurants, the pizza prices in the region range from INR 100 to INR 7500.
But after analysis, we found that there is just one record in the dataset with a
pizza price of INR 7500, while all the other pizza prices are between INR 100
and INR 1500. Therefore, the observation with a pizza price of INR 7500 is an
outlier, since it significantly deviates from the rest of the population. Such
outliers are usually identified using a box plot or scatter plot, and they
result in skewed data. Some models assume the data to follow a normal
distribution, and outliers can hurt the performance of such models if the data
is skewed; thus, these outliers must be handled before the data is fed for
model training. There are two common ways to deal with these outliers.
•Remove the observations that consist of outlier values.
•Apply transformations like log, square root, Box-Cox, etc., to make the
data values follow a normal or near-normal distribution.
Data Preprocessing 11
Step 6: Fix errors
Errors in your data can lead you to miss out on the key findings. This
needs to be avoided by fixing the errors that your data might have.
Systems that manually input data without any provision for data
checks are almost always going to contain errors. To fix them, we
need to first build an understanding of the data. Consider the following
example cases.
•Removing the country code from the mobile field so that all the
values are exactly 10 digits.
•Remove any unit mentioned in columns like weight, height, etc. to
make it a numeric field.
•Identifying any incorrect data format like email address and then
either fixing it or removing it.
•Making validation checks, such as: the customer purchase date should
be later than the manufacturing date, the total amount should equal the
sum of the component amounts, and no punctuation or special characters
should appear in a field that does not allow them.

Data Preprocessing 12
Step 7: Language Translation
Datasets for machine translation are frequently combined from several
sources, which can result in linguistic discrepancies. Software used to
evaluate data typically uses monolingual Natural Language Processing
(NLP) models, which are unable to process more than one language.
Therefore, you must translate everything into a single language. There are
a few language translation AI models that we can use for this task.

Step 8: Handle missing values

During cleaning and munging in data science, handling missing values is one
of the most common tasks. The real-life data might contain missing values
which need a fix before the data can be used for analysis. We can handle
missing values by:
•Either removing the records that have missing values or
•Filling the missing values using some statistical technique or by gathering
data understanding.

Data Preprocessing 13
Example:-

Consider a dataset where we have information about the laborers working on a
construction site. Suppose the gender column in this dataset has around 30
percent missing values. We cannot drop 30 percent of the observations, but on
further digging we find that, among the remaining 70 percent of observations,
around 90 percent of the records are male. Therefore, we can choose to fill
these missing values with the male gender. By doing this we have made an
assumption, but it can be a safe assumption because the workforce on the
construction site is male dominated and the data suggests the same.

Data Preprocessing 14
Benefits of Data Cleaning in Data Science

Your analysis will be reliable and free of bias if you have a clean and correct
data collection. We have looked at eight steps for data cleansing in data
science. Let us discuss some of the benefits of data cleaning in data science.

•Avoiding mistakes: Your analysis results will be accurate and consistent
if data cleansing techniques are effective.
•Improving productivity: Cleaning the data maintains data quality and
enables more precise analytics that support the overall decision-making
process.
•Avoiding unnecessary costs and errors: Keeping track of errors and
improving reporting to determine where errors originate makes it easier to
correct faulty or mistaken data in the future.
•Staying organized
•Improved mapping

Data Preprocessing 15
DATA CLEANING

Data Preprocessing 16
What is a Missing Value?
Missing data is defined as the values or data that is not stored (or not present) for
some variable/s in the given dataset.
Below is a sample of the missing data from the Titanic dataset. You can see the
columns ‘Age’ and ‘Cabin’ have some missing values.

Data Preprocessing 17
How is Missing Value Represented In The Dataset?
In the dataset, blank shows the missing values.
In Pandas, usually, missing values are represented by NaN
It stands for Not a Number.

Data Preprocessing 18
Why Is Data Missing From The Dataset
There can be multiple reasons why certain values are missing from the data.
Reasons for the missing data from the dataset affect the approach of handling
missing data. So it’s necessary to understand why the data could be missing.
Some of the reasons are listed below:
Past data might get corrupted due to improper maintenance.
Observations are not recorded for certain fields for various reasons; there might
be a failure in recording the values due to human error.
The user may have intentionally not provided the values.

Why Do We Need To Care About Handling Missing Values?

It is important to handle the missing values appropriately.
•Many machine learning algorithms fail if the dataset contains missing
values. However, algorithms like k-nearest neighbors and Naive Bayes can
support data with missing values.
•You may end up building a biased machine learning model which will
lead to incorrect results if the missing values are not handled properly.
•Missing data can lead to a lack of precision in the statistical analysis

Data Preprocessing 19
Figure Out How To Handle The Missing Data
Analyze each column with missing values carefully to understand the
reasons behind the missing values as it is crucial to find out the strategy
for handling the missing values.
There are 2 primary ways of handling missing values:
1.Deleting the Missing values
2.Imputing the Missing Values

Deleting the Missing value


Generally, this approach is not recommended. It is one of the quick and dirty
techniques one can use to deal with missing values.

There are 2 ways one can delete the missing values:


Deleting the entire row
If a row has many missing values then you can choose to drop the entire row.
If every row has some (column) value missing then you might end up deleting the
whole data.

Data Preprocessing 20
Deleting the entire column
If a certain column has many missing values then you can choose to drop
the entire column.

Imputing the Missing Value


There are different ways of replacing the missing values. You can use the Python
libraries Pandas and scikit-learn as follows:

Replacing With Arbitrary Value


If you can make an educated guess about the missing value then you can replace it
with some arbitrary value using the following code.
Ex. In the following code, we are replacing the missing values of the ‘Dependents’
column with ‘0’.

Replacing With Mean

This is the most common method of imputing missing values of numeric


columns. If there are outliers then the mean will not be appropriate. In
such cases, outliers need to be treated first

Data Preprocessing 21
Replacing With Mode
Mode is the most frequently occurring value. It is used in the case of
categorical features.
You can use the ‘fillna’ method for imputing the categorical columns
‘Gender’, ‘Married’, and ‘Self_Employed’.

Replacing With Median


Median is the middlemost value. It’s better to use the median value for imputation in
the case of outliers.

Replacing with previous value – Forward fill


In some cases, imputing the values with the previous value instead of mean, mode
or median is more appropriate. This is called forward fill. It is mostly used in time
series data.

Replacing with next value – Backward fill


In backward fill, the missing value is imputed using the next value.

Data Preprocessing 22
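The code referred to on this slide is not reproduced here, so the following is a minimal pandas sketch of the fillna-based replacements described above (the ‘Dependents’, ‘LoanAmount’, and ‘Gender’ columns follow the slide’s examples; the values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Dependents": [0, 2, np.nan, 1],
    "LoanAmount": [120.0, np.nan, 150.0, 90.0],
    "Gender": ["Male", np.nan, "Female", "Male"],
})

# Arbitrary value: fill missing 'Dependents' with 0
df["Dependents"] = df["Dependents"].fillna(0)

# Mean (or median) for a numeric column; treat outliers before using the mean
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

# Mode (most frequent value) for a categorical column
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# Forward fill / backward fill, often used for time series data
# df["LoanAmount"] = df["LoanAmount"].ffill()
# df["LoanAmount"] = df["LoanAmount"].bfill()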
HANDLING MISSING DATA
 Missing data is a problem that continues to
plague data analysis methods.
 Even as our analysis methods gain
sophistication, we still encounter missing
values in many fields.

A common method of “handling” missing values is simply to
omit the records containing them from the analysis. This may be
dangerous, as it can lead to a biased subset of the data.
Data Preprocessing 23
HANDLING MISSING DATA Cont…
Common criteria for choosing
replacement values for missing
data:
1.Replace the missing value with some
constant, specified by the analyst.
2.Replace the missing value with the field
mean (for numeric variables) or the mode (for
categorical variables).
3.Replace the missing values with a value
generated at random from the observed
distribution of the variable.
4.Replace the missing values with imputed
values based on the other characteristics of
the record.
Data Preprocessing 24
Ways to Handle Missing Values
The real-world data often has a lot of missing values. The cause of
missing values can be data corruption or failure to record data. The
handling of missing data is very important during the preprocessing of
the dataset as many machine learning algorithms do not support missing
values.

Deleting Rows with missing values


Impute missing values for continuous variable
Impute missing values for categorical variable
Other Imputation Methods
Using Algorithms that support missing values
Prediction of missing values
Imputation using Deep Learning Library — Datawig

Data Preprocessing 25
Delete Rows with Missing Values:
Missing values can be handled by deleting the rows or columns that have
null values. If a column has more than half of its rows as null, then the
entire column can be dropped. Rows that have one or more column values
as null can also be dropped.

Pros:
•A model trained with the removal of all missing values creates a robust
model.
Cons:
•Loss of a lot of information.
•Works poorly if the percentage of missing values is excessive in
comparison to the complete dataset.
Data Preprocessing 26
Impute missing values with Mean/Median:
Missing values in columns that hold numeric continuous values can be
replaced with the mean, median, or mode of the remaining values in the
column. This method can prevent the loss of data compared to the earlier
method. Replacing missing values with these approximations (mean, median)
is a statistical approach to handling them.

Data Preprocessing 27
Pros:
•Prevent data loss which results in deletion of rows or columns
•Works well with a small dataset and is easy to implement.
Cons:
•Works only with numerical continuous variables.
•Can cause data leakage
•Does not factor in the covariance between features.

Data Preprocessing 28
Imputation method for categorical columns:
When missing values are in categorical columns (string or numerical), they
can be replaced with the most frequent category. If the number of missing
values is very large, then they can be replaced with a new category.

Data Preprocessing 29
Pros:
•Prevent data loss which results in deletion of rows or columns
•Works well with a small dataset and is easy to implement.
•Negates the loss of data by adding a unique category
Cons:
•Works only with categorical variables.
•Addition of new features to the model while encoding, which may result
in poor performance

Data Preprocessing 30
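A scikit-learn sketch of both imputation strategies above, using SimpleImputer (the column names and values are hypothetical):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Mumbai", np.nan, "Pune"],
})

# Numeric column: replace NaN with the column mean (or strategy="median")
num_imputer = SimpleImputer(strategy="mean")
df[["age"]] = num_imputer.fit_transform(df[["age"]])

# Categorical column: replace NaN with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)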
Other Imputation Methods:
Depending on the nature of the data or data type, some other imputation
methods may be more appropriate to impute missing values.
For example, for the data variable having longitudinal behavior, it might
make sense to use the last valid observation to fill the missing value. This
is known as the Last observation carried forward (LOCF) method.

data["Age"] = data["Age"].fillna(method='ffill')

Data Preprocessing 31
Using Algorithms that support missing values:
Not all machine learning algorithms support missing values, but some ML
algorithms are robust to missing values in the dataset.
The k-NN algorithm can ignore a column from a distance measure when a
value is missing.
Naive Bayes can also support missing values when making a prediction.
These algorithms can be used when the dataset contains null or missing
values.
Another algorithm that can be used here is Random Forest, which works well on
non-linear and categorical data.

Pros:
No need to handle missing values in each column, as these ML algorithms handle
them efficiently.
Cons:
The scikit-learn implementations of these algorithms generally do not handle
missing values out of the box.

Data Preprocessing 32
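As a sketch (assuming the xgboost package is available), a model of this kind can be trained directly on data that still contains NaNs:

import numpy as np
from xgboost import XGBClassifier

# Tiny illustrative dataset with missing values left in place
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

# XGBoost learns a default direction for missing values at each split,
# so the NaNs do not need to be imputed beforehand
model = XGBClassifier(n_estimators=10, max_depth=2)
model.fit(X, y)
print(model.predict(X))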
Prediction of missing values:
In the earlier methods of handling missing values, we do not exploit the
correlation between the variable containing the missing value and the
other variables. The other features which don't have nulls can be used
to predict the missing values.
The regression or classification model can be used for the prediction of
missing values depending on the nature (categorical or continuous) of
the feature having missing value.

Data Preprocessing 33
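One way to sketch this idea is scikit-learn's IterativeImputer, which fits a regression model for each feature with missing values using the other features (the feature names and numbers below are made up):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns: [square_feet, bedrooms, price]; one price is missing
X = np.array([[1000.0, 2.0, 150.0],
              [1500.0, 3.0, 220.0],
              [1200.0, 2.0, np.nan],
              [2000.0, 4.0, 310.0]])

# Each feature with missing values is predicted from the remaining features
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 2])   # model-based estimate of the missing price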
Missing values categories

Data Preprocessing 34
Missing Completely At Random (MCAR)
•In MCAR, the probability of data being missing is the same for all the
observations.
•In this case, there is no relationship between the missing data and any
other values observed or unobserved (the data which is not recorded)
within the given dataset.
•That is, missing values are completely independent of other data. There is
no pattern.
•In the case of MCAR data, the value could be missing due to human error,
some system/equipment failure, loss of sample, or some unsatisfactory
technicalities while recording the values.
•For Example, suppose in a library there are some overdue books. Some
values of overdue books in the computer system are missing.
•The reason might be a human error, like the librarian forgetting to type in
the values. So, the missing values of overdue books are not related to any
other variable/data in the system.

Data Preprocessing 35
Missing At Random (MAR)
•MAR data means that the reason for missing values can be explained by
variables on which you have complete information, as there is some
relationship between the missing data and other values/data.
•In this case, the data is not missing for all the observations. It is missing
only within sub-samples of the data, and there is some pattern in the
missing values.
•For example, if you check the survey data, you may find that all the
people have answered their ‘Gender,’ but ‘Age’ values
are mostly missing for people who have answered their ‘Gender’ as
‘female.’ (The reason being most of the females don’t want to reveal their
age.)

Data Preprocessing 36
Missing Not At Random (MNAR)
•Missing values depend on the unobserved data. If there is some
structure/pattern in missing data and other observed data can not
explain it, then it is considered to be Missing Not At Random (MNAR).
•If the missing data does not fall under the MCAR or MAR, it can be
categorized as MNAR. It can happen due to the reluctance of people to
provide the required information. A specific group of respondents may not
answer some questions in a survey.
•For example, suppose the name and the number of overdue books are
asked in the poll for a library. So most of the people having no overdue
books are likely to answer the poll. People having more overdue books are
less likely to answer the poll. So, in this case, whether the number of
overdue books is missing depends on the number of overdue books itself,
which is unobserved.

Data Preprocessing 37
HANDLING MISSING DATA Cont…
 Result of replacing the missing values with the
constant 0 for the numerical variable cubic inches
and the label missing for the categorical variable
brand.

Data Preprocessing 38
HANDLING MISSING DATA Cont…
 Missing values may be replaced with the respective
field means and modes

 May not always be the best choice


 Observed that mean is greater than the 81st
percentile
 Measures of spread will be artificially reduced
 replacing missing values is a gamble
Data Preprocessing 39
HANDLING MISSING DATA Cont…
 Missing values replaced with values
generated at random from the observed
distribution of the variable

 benefit is measures of center and spread should


remain closer to the original
 no guarantee that the resulting records would make
sense
Data Preprocessing 40
HANDLING MISSING DATA Cont…
Data imputation methods
 In data imputation, we ask “What would be the
most likely value for this missing value, given all
the other attributes for a particular record?”
 For instance, an American car with 300 cubic
inches and 150 horsepower would probably be
expected to have more cylinders than a
Japanese car with 100 cubic inches and 90
horsepower.
 This is called imputation of missing data.
 Tools needed, such as multiple regression or
classification and regression trees.
Data Preprocessing 41
HANDLING MISSING DATA (more methods) Cont…
KNN (K Nearest Neighbors)
Machine Learning algorithms can be used to handle missing data/for data
imputation
ML techniques like KNN, XGBoost and Random Forest can be used
k neighbors are chosen based on some distance measure and their
average is used as an imputation estimate.
Method requires
 Selection of the number of nearest neighbors, and
 Distance metric
KNN can predict both discrete attributes (the most frequent value among
the k nearest neighbors) and continuous attributes
Distance metric varies according to type of data:
1. Continuous Data: The commonly used distance metrics for
continuous data are Euclidean, Manhattan and Cosine
2. Categorical Data: Hamming distance is generally used in this case.

Data Preprocessing 42
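A minimal scikit-learn sketch of KNN-based imputation using KNNImputer (the numbers are made up):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the mean of that feature over the
# k nearest neighbours, measured on the observed features
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))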
HANDLING MISSING DATA (more methods) Cont…
XGBoost learns how to treat missing values
Once a tree structure has been trained, it isn’t too hard to also handle
the presence of missing values in the test set: it’s enough to attach a default
direction to each decision node.
The optimum default direction is determined during training, and missing
values are sent in that direction.

Data Preprocessing 43
Handling Missing Data

Data Preprocessing 44
IDENTIFYING MISCLASSIFICATIONS

 Frequency distribution shows five classes


 Two of the classes, USA and France, have a count of only one
automobile each.
 Two of the records have been inconsistently classified with
respect to the origin of manufacture
 To maintain consistency, the record with origin USA should have
been labelled US, and France should have been labelled Europe.
Data Preprocessing 45
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Outliers are extreme values that go against the trend of the
remaining data.
Wikipedia Definition -In statistics, an outlier is an observation
point that is distant from other observations.
In data mining, outlier detection is the identification of rare items,
events or observations which raise suspicions by differing
significantly from the majority of the data.
 Identifying outliers is important because they may represent errors
in data entry.
 Statistical methods are sensitive to the presence of outliers, and
may deliver unreliable results
 A simple method for identifying outliers in numeric variables is to examine
a histogram of the variable.

Data Preprocessing 46
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
cars data set histogram

Data Preprocessing 47
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Sometimes two-dimensional scatter plots can
help to reveal outliers in more than one variable

 A record may be an outlier in a particular dimension
but not in another
Data Preprocessing 48
What is Data?
Data are measurements or observations that are collected as a source of
information. There are a variety of different types of data, and different
ways to represent data.
The number of people in Australia, the countries where people were born,
number of calls received by the emergency services each day, the value
of sales of a particular product, or the number of times Australia has won
a cricket match, are all examples of data.

Quantitative and qualitative data


Quantitative = Quantity
Quantitative data are
•measures of values or counts and are expressed as numbers.
•data about numeric variables (e.g. how many, how much or how often).

Data Preprocessing 49
Qualitative = Quality
Qualitative data are
•measures of 'types' and may be represented by a name, symbol, or a
number code.
•Qualitative data are data about categorical variables (e.g. what type).
Data collected about a numeric variable will always be quantitative and
data collected about a categorical variable will always be qualitative.
Therefore, you can identify the type of data, prior to collection, based on
whether the variable is numeric or categorical.

Data Preprocessing 50
MEASURES OF CENTER AND SPREAD
A measure of central tendency (also referred to as measures of centre or
central location) is a summary measure that attempts to describe a whole
set of data with a single value that represents the middle or centre of its
distribution.

There are three main measures of central tendency:


mode
median
mean
Each of these measures describes a different indication of the typical or central value
in the distribution.
Mode
The mode is the most commonly occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.

Data Preprocessing 51
MEASURES OF CENTER AND SPREAD
There are some limitations to using the mode. In some distributions, the
mode may not reflect the centre of the distribution very well. When the
distribution of retirement age is ordered from lowest to highest value, it
is easy to see that the centre of the distribution is 57 years, but the mode
is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Median
The median is the middle value in distribution when the values are
arranged in ascending or descending order.
The median divides the distribution in half. In a distribution with an odd
number of observations, the median value is the middle value.
Looking at the retirement age distribution (which has 11 observations),
the median is the middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median
value is the mean of the two middle values. In the following distribution,
the two middle values are 56 and 57, therefore the median equals 56.5
years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Data Preprocessing 52
MEASURES OF CENTER AND SPREAD
Mean
The mean is the sum of the value of each observation in a dataset divided
by the number of observations. This is also known as the arithmetic
average.
Looking at the retirement age distribution again:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the
number of observations (11) which equals 56.6 years.

Data Preprocessing 53
MEASURES OF CENTER AND SPREAD
Definition
The spread of data (also called dispersion or variability) refers to how
much the data values differ from each other or from the central value
(mean, median, or mode). It helps understand the distribution and
consistency of data points.

When spread can be measured


The spread of the values can be measured for quantitative data, as the variables
are numeric and can be arranged into a logical order with a low end value and a
high end value

Reasons to measure spread


Summarizing the dataset can help us understand the data, especially when the
dataset is large. Measures of spread summarizes the data in a way that shows how
scattered the values are and how much they differ from the mean value.

Data Preprocessing 54
• Need for Measuring Spread
• Understanding Variability – Helps determine how much the data deviates
from the average.
• Comparing Datasets – Two datasets with the same mean can have different
spreads, impacting decision-making.
• Detecting Outliers – Large spread values can indicate the presence of
extreme values.
• Choosing the Right Model – Some machine learning models work better
with data having low variability.

• Measures of Spread
1. Range – Difference between the highest and lowest values.
2. Interquartile Range (IQR) – Range of the middle 50% of the data.
3. Variance – The average squared deviation from the mean.
4. Standard Deviation – The square root of variance, indicating average deviation
from the mean.

Data Preprocessing 55
Example
Scenario: Exam Scores
Suppose two classes have students scoring as follows:
•Class A: [45, 50, 55, 60, 65]
•Class B: [30, 40, 55, 70, 80]
Calculating Spread:

•Mean for both classes = 55


•Range:
• Class A: 65 − 45 = 20
• Class B: 80 − 30 = 50 → More spread out

•Standard Deviation:
• Class A has a smaller standard deviation → Scores are close to the mean.
• Class B has a higher standard deviation → Scores are more spread out.
Interpretation:
•Class A's students have scores closer to each other (low spread).
•Class B's students have scores that vary significantly (high spread).
•A teacher might conclude that Class A is more consistent in performance, while
Class B has more diverse performance levels.
Data Preprocessing 56
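A short NumPy sketch that reproduces the comparison above:

import numpy as np

class_a = np.array([45, 50, 55, 60, 65])
class_b = np.array([30, 40, 55, 70, 80])

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    print(name,
          "mean =", scores.mean(),
          "range =", scores.max() - scores.min(),
          "std =", round(scores.std(ddof=1), 2))
# Both means are 55, but Class B has the larger range and standard deviation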
Low variability in data means that the values in the dataset are closely clustered
around the central value (mean or median), indicating minimal spread or dispersion.
In other words, the data points do not fluctuate much and are more consistent.

Characteristics of Low Variability Data:


1.Small Range – The difference between the maximum and minimum values is
small.
2.Low Standard Deviation – The data points are close to the mean.
3.Low Variance – The squared differences from the mean are small.
4.Narrow Interquartile Range (IQR) – The middle 50% of data is closely packed.

Example:
•Low Variability Dataset: [48, 49, 50, 51, 52]
• Mean = 50
• Standard Deviation ≈ 1.58 (small value)
•High Variability Dataset: [20, 30, 50, 70, 80]
• Mean = 50
• Standard Deviation ≈ 25.50 (larger spread)

Data Preprocessing 57
MEASURES OF CENTER AND SPREAD
What each measure of spread can tell us

Range
The range is the difference between the smallest value and the largest
value in a dataset.

Calculating the range

Dataset A
4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8
The range is 4, the difference between the highest value (8) and the lowest value
(4)

Data Preprocessing 58
Standard Deviation and
Variance
• The standard deviation is the average amount of variability in your
dataset. It tells you, on average, how far each value lies from the mean.
• A high standard deviation means that values are generally far from
the mean, while a low standard deviation indicates that values are
clustered close to the mean.
• Standard deviation is the square root of the variance. It provides a more
interpretable measure of spread because it is in the same unit as the
original data.

What does standard deviation tell you?


• Standard deviation is a useful measure of spread for normal
distributions.
• In normal distributions, data is symmetrically distributed with no skew.
Most values cluster around a central region, with values tapering off as
they go further away from the center.
• The standard deviation tells you how spread out from the center of the
distribution your data is on average.

Data Preprocessing 59
Example: Comparing different standard deviations You collect data on job
satisfaction ratings from three groups of employees using simple random
sampling.

The mean (M) ratings are the same for each group – it’s the value on the x-
axis when the curve is at its peak. However, their standard deviations (SD)
differ from each other.
The standard deviation reflects the dispersion of the distribution. The curve
with the lowest standard deviation has a high peak and a small spread,
while the curve with the highest standard deviation is more flat and
widespread.

Data Preprocessing 60
Example: Standard deviation in a normal distribution
You administer a memory recall test to a group of students. The data follows a
normal distribution with a mean score of 50 and a standard deviation of 10.
Following the empirical rule:

Around 68% of scores are between 40 and 60.


Around 95% of scores are between 30 and 70.
Around 99.7% of scores are between 20 and 80

Data Preprocessing 61
Data Preprocessing 62
Data Preprocessing 63
Variance
Variance is defined as, “The measure of how far the set of data is
dispersed from their mean value”. Variance is represented with the
symbol σ². In other words, we can also say that the variance is the
average of the squared difference from the mean.

Properties of Variance
Various properties of the Variance of the group of data are,

•As each term in the variance formula is first squared and then the
mean of these squares is found, variance is always a non-negative value,
i.e. it can be positive or zero but it can never be negative.
•Variance is always measured in squared units. For example, if we have to
find the variance of the height of the students in a class, and the height
of the students is given in cm, then the variance is calculated in cm².

Data Preprocessing 64
Difference between Standard Deviation and variance

• Standard deviation measures how far apart numbers are in a data set. Variance, on the
other hand, gives an actual value to how much the numbers in a data set vary from the
mean.
• Standard deviation is the square root of the variance and is expressed in the same units
as the data set. Variance can be expressed in squared units or as a percentage (especially
in the context of finance).
• Standard deviation can be greater than the variance since the square root of a decimal
is larger (and not smaller) than the original number when the variance is less than one
(1.0 or 100%).
• The standard deviation is smaller than the variance when the variance is more than one
(e.g. 1.2 or 120%).

Data Preprocessing 65
Data Preprocessing 66
Data Preprocessing 67
Data Preprocessing 68
example
Example: For the following data, determine the mean, variance, and
standard deviation:

Data Preprocessing 69
Data Preprocessing 70
DATA TRANSFORMATION
Raw data is difficult to trace or understand. That's why it needs to be
preprocessed before retrieving any information from it.
Data transformation is a technique used to convert the raw data into a
suitable format that efficiently eases data mining and retrieves strategic
information.
Data transformation includes data cleaning techniques and a data
reduction technique to convert the data into the appropriate form.
Data transformation is an essential data preprocessing technique that must
be performed on the data before data mining to provide patterns that are
easier to understand.
Data transformation changes the format, structure, or values of the data
and converts them into clean, usable data. Data may be transformed at
two stages of the data pipeline for data analytics projects.

Data integration, migration, data warehousing, data wrangling may all involve
data transformation. Data transformation increases the efficiency of business and
analytic processes, and it enables businesses to make better data-driven decisions.

Data Preprocessing 71
Data Transformation Techniques
There are several data transformation techniques that can help
structure and clean up the data before analysis or storage in a data
warehouse.

Data Preprocessing 72
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset
using some algorithms. It allows for highlighting important features
present in the dataset. It helps in predicting the patterns. When collecting
data, it can be manipulated to eliminate or reduce any variance or any
other noise form.

2. Attribute Construction
Attribute construction, also known as feature construction, is the process of creating
new attributes (features) from existing data to improve the performance of machine
learning models. This helps in better data representation, leading to improved
accuracy and efficiency.

3. Data Aggregation
Data aggregation is the method of storing and presenting data in a
summary format. The data may be obtained from multiple data sources, which are
integrated to produce a unified description for data analysis.

Data Preprocessing 73
4. Data Discretization
Data discretization is the process of converting continuous numerical
data into discrete categories or bins. This transformation helps simplify
complex datasets and makes them more interpretable for certain
algorithms, such as decision trees and Naïve Bayes classifiers.

Why is Data Discretization Needed?


Improves Interpretability – Easier for humans to understand and
analyze data.
Reduces Data Complexity – Converts large ranges of values into fewer
groups, making computations faster.
Enables Certain Algorithms – Some machine learning models work
better with categorical rather than continuous data.
Handles Noise and Outliers – Grouping similar values can reduce the
effect of minor fluctuations.

Data Preprocessing 74
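A minimal pandas sketch of discretization using pd.cut (the bins and labels are assumptions for illustration):

import pandas as pd

ages = pd.Series([5, 17, 23, 35, 46, 60, 72])

# Bin a continuous variable into labelled, discrete categories
age_groups = pd.cut(ages,
                    bins=[0, 18, 40, 60, 100],
                    labels=["child", "young_adult", "adult", "senior"])
print(age_groups.value_counts())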
5. Data Normalization
Data normalization is the process of scaling the data to a much smaller
range, without losing information, to help minimize or exclude duplicated
data and improve algorithm efficiency and data extraction performance.
There are three methods to normalize an attribute:

•Min-max normalization: Where you perform a linear transformation on


the original data.

•Z-score normalization: In z-score normalization (or zero-mean


normalization) you are normalizing the value for attribute A using the mean
and standard deviation.

•Decimal scaling: Where you can normalize the value of attribute A by


moving the decimal point in the value.
Normalization methods are frequently used when you have values that
skew your dataset and you find it hard to extract valuable insights.

Data Preprocessing 75
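A NumPy sketch of the three methods on a small made-up set of values:

import numpy as np

x = np.array([150.0, 300.0, 450.0, 720.0, 997.0])

# Min-max normalization: linear rescaling onto [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: subtract the mean, divide by the standard deviation
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such that
# the largest absolute scaled value is below 1 (here j = 3)
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / 10 ** j

print(min_max, z_score, decimal_scaled, sep="\n")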
Decimal Scaling Method For
Normalization
Decimal Scaling is a normalization technique where data values are divided
by a power of 10 such that the largest absolute value in the dataset is
between -1 and 1.

Data Preprocessing 76
Min-Max Normalization
What is Data Normalization?
Data normalization is a technique in data transformation where numerical
values in a dataset are scaled to a common range without distorting their
relationships.
This process ensures that different features contribute equally to machine
learning models, improving model accuracy and efficiency.
Normalization is crucial in data preprocessing before applying machine
learning algorithms, especially when working with data that has different
units or scales.

Data Preprocessing 77
Data Preprocessing 78
MIN–MAX NORMALIZATION

Data Preprocessing 79
Data Preprocessing 80
MIN–MAX NORMALIZATION
 Let X refer to our original field value, and X∗ refer to the
normalized field value.
 Min–max normalization works by seeing how much greater
the field value is than the minimum value min(X), and scaling
this difference by the range. That is,
X* = (X − min(X)) / (max(X) − min(X))

 The minimum weight is 1613 pounds, and the range =
max(X) − min(X) = 4997 − 1613 = 3384 pounds.
Data Preprocessing 81
MIN–MAX NORMALIZATION
 Let us find the min–max normalization for three automobiles weighing 1613,
3384, and 4997 pounds, respectively.

Data Preprocessing 82
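A sketch of the computation, using the range of 3384 pounds found above:
X*(1613) = (1613 − 1613) / 3384 = 0
X*(3384) = (3384 − 1613) / 3384 = 1771 / 3384 ≈ 0.52
X*(4997) = (4997 − 1613) / 3384 = 3384 / 3384 = 1
So the minimum weight maps to 0, the maximum weight maps to 1, and intermediate weights map to values in between.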
Z-SCORE STANDARDIZATION
• Z-score standardization (also called Standardization or Normalization
using Z-score) is a method that transforms data so that it has a mean
(μ) of 0 and a standard deviation (σ) of 1. This transformation helps in
making the dataset scale-independent, making it useful for algorithms
that assume normally distributed data, such as Linear Regression,
Logistic Regression, PCA, and Clustering algorithms.

Data Preprocessing 83
Data Preprocessing 84
Standardized Data: [-1.41, -0.71, 0.0, 0.71, 1.41]
Now, the data has a mean of 0 and a standard deviation of 1.

Data Preprocessing 85
Data Preprocessing 86
Z-SCORE STANDARDIZATION
 Z-score standardization, which is very widespread in the
world of statistical analysis
 Works by taking the difference between the field value and the field mean value,
and scaling this difference by the SD

Data Preprocessing 87
Z-Score Normalization
Z-score normalization refers to the process of normalizing every value in a
dataset such that the mean of all of the values is 0 and the standard deviation is 1.
We use the following formula to perform a z-score normalization on every value in a
dataset:

New value = (x – μ) / σ

where:
•x: Original value
•μ: Mean of data
•σ: Standard deviation of data

Example: Performing Z-Score Normalization


Suppose we have the dataset:

Data Preprocessing 88
Using a calculator, we can find that the mean of the
dataset is 21.2 and the standard deviation is 29.8.
To perform a z-score normalization on the first value in
the dataset, we can use the following formula:
•New value = (x – μ) / σ
•New value = (3 – 21.2) / 29.8
•New value = -0.61

We can use this formula to perform a z-score


normalization on every value in the dataset:

Data Preprocessing 89
DECIMAL SCALING
 This method normalizes the value of attribute A by moving the
decimal point in the value. This movement of a decimal point
depends on the maximum absolute value of A. The formula for
the decimal scaling is given below:

v'i = vi / 10^j

 Here j is the smallest integer such that max(|v'i|) < 1

 For example, the observed values for attribute A range from -986 to 917, and the
maximum absolute value for attribute A is 986. Here, to normalize each value of
attribute A using decimal scaling, we have to divide each value of attribute A by
1000, i.e., j=3.
So, the value -986 would be normalized to -0.986, and 917 would be normalized to
0.917.

Data Preprocessing 90
DECIMAL SCALING
 Decimal scaling ensures that every normalized
value lies between −1 and 1.

where d represents the number of digits in the data value with


the largest absolute value
 For the weight data, the largest absolute value is |4997| =
4997, which has d=4 digits.
 The decimal scaling for the minimum and maximum weight are

Data Preprocessing 91
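A sketch of the computation with d = 4:
X*(1613) = 1613 / 10^4 = 0.1613
X*(4997) = 4997 / 10^4 = 0.4997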
NORMALIZATION TECHNIQUES COMMENTS
What are the best normalization methods
(Z-Score, Min-Max, etc.)?
Z-score
 preserve range (maximum and minimum)
 If your data follow a Gaussian distribution, they are converted into a
N(0,1) distribution and probability calculations will be easier.
More Researcher Comments
 It depends on the data to be normalized. Normally Z-score is very
common for data normalization.
 Min-Max and Z-Score are not suitable for sparse data.
 It depends on the aims of the study and the nature of the data.

Data Preprocessing 92
NORMAL DISTRIBUTION OF DATA
A normal distribution has a probability distribution that is centered around the
mean. This means that the distribution has more data around the mean. The data
distribution decreases as you move away from the center. The resulting curve is
symmetrical about the mean and forms a bell-shaped distribution.

Normally distributed data is crucial for the application of large-scale statistical


analysis.

To statisticians, the most important assumptions are the adequacy of the data
and the normal distribution of the data. However, users are constantly forced
to deal with non-normal data. This means either changing the method used to
one that is less sensitive to non-normal data, or transforming that data to
normal data.

In regression analysis, the response variable should be normally distributed
to get better prediction results.

Most data scientists claim they get more accurate results when they transform
the independent variables too. This means applying skew correction to the
independent variables; the lower the skewness, the better the result.

Data Preprocessing 93
Types Of Transformations For Better Normal Distribution
1. Log Transformation :
Numerical variables may have highly skewed and non-normal (non-Gaussian)
distributions caused by outliers, highly exponential distributions, etc.
Therefore we go for data transformation.
In Log transformation each variable of x will be replaced by log(x) with
base 10, base 2, or natural log.

import numpy as np
# df is assumed to be a pandas DataFrame with a numeric 'Target' column;
# np.log1p computes log(1 + x), so it also handles zero values safely
log_target = np.log1p(df["Target"])

Data Preprocessing 94
Data Preprocessing 95
2. Square-Root Transformation :
This transformation has a moderate effect on the distribution. The main
advantage of the square root transformation is that it can be applied to zero
values.

Here x is replaced by square root(x). It is weaker than the log
transformation.

sqrt_target = df["Target"]**(1/2)

Data Preprocessing 96
Data Preprocessing 97
Data Preprocessing 98
TRANSFORMATIONS TO ACHIEVE NORMALITY
 Variables must be normally distributed for some data mining algorithms
 The normal distribution is a continuous probability distribution commonly known as
the bell curve, which is symmetric.
A distribution, or data set, is symmetric if it looks the same to the left and
right of the center point.
 It is centered at mean 𝜇 and has its spread determined by SD 𝜎 (sigma)

Data Preprocessing 99
TRANSFORMATIONS TO ACHIEVE NORMALITY
A common misconception is that data to which Z-score standardization has been
applied follow the standard normal Z distribution.

[Figure: histograms of the original data and the Z-standardized data; the
Z-standardized data is still right-skewed, not normally distributed]
Data Preprocessing 10
TRANSFORMATIONS TO ACHIEVE NORMALITY
 Statistic to measure the skewness of a distribution

 Right-skewed data has positive skewness


 Mean is greater than the median
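A commonly used statistic for this (and presumably the one the slide refers to) is Pearson's skewness coefficient:
skewness = 3 × (mean − median) / standard deviation
which is positive when the mean exceeds the median and negative when the mean falls below it.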

Data Preprocessing 10
TRANSFORMATIONS TO ACHIEVE NORMALITY

For left-skewed data, the mean is smaller than the median,
and skewness takes negative values.
For perfectly symmetric (and unimodal)
data, the mean, median, and mode are all
equal, and so the skewness equals zero.
Data Preprocessing 10
TRANSFORMATIONS TO ACHIEVE NORMALITY
Z-score standardization has no effect on skewness

Data Preprocessing 10
TRANSFORMATIONS TO ACHIEVE NORMALITY
To make data “more normally
distributed,” make it symmetric -
eliminate skewness.
Common transformations are
o the natural log transformation
o square root transformation, and
o Inverse square root
transformation

Data Preprocessing 10
Why is Normality so important?
Linear Discriminant Analysis (LDA), Linear Regression, and many other
parametric machine learning models assume that data is normally
distributed. If this assumption is not met, the model will not provide
accurate predictions.

What is normal distribution?


Normal distribution is a type of probability distribution that is defined by
a symmetric bell-shaped curve. The curve is defined by its centre (mean),
spread (standard deviation), and skewness.

Data Preprocessing 10
What is skewness?
Skewness is the degree of asymmetry of a distribution. A distribution is
symmetric if it is centred around its mean and the left and right sides are
mirror images of each other. A distribution is skewed if it is not
symmetric.

There are two types of skewness:


Positive Skewness: If the bulk of the values fall on the right side of the
curve and the tail extends towards the right, it is known as positive
skewness.
Negative Skewness: If the bulk of the values fall on the left side of the
curve and the tail extends towards the left, it is known as negative
skewness.
Data Preprocessing 10
What does skewness tell us?
To understand this better consider a example:

Suppose car prices may range from 100 to 10,00,000 with the average
being 5,00,000.
If the distribution’s peak is on the left side, our data is positively skewed
and the majority of the cars are being sold for less than the average
price.
If the distribution’s peak is on the right side, our data is negatively
skewed and the majority of the cars are being sold for more than the
average price.

Data Preprocessing 10
TRANSFORMATIONS TO ACHIEVE NORMALITY
Application of the square root transformation

Data Preprocessing 10
TRANSFORMATIONS TO ACHIEVE NORMALITY

Data Preprocessing 10
TRANSFORMATIONS TO ACHIEVE NORMALITY

Data Preprocessing 11
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
The Z-score method states that a data value is an outlier if it
has a Z-score that is either less than −3 or greater than 3.

 Variable much beyond this range may bear further


investigation (data entry errors or other issues)
 Should not automatically omit outliers from analysis.
 There are no outliers among the vehicle weights: the Z-score for the
1613-pound vehicle is −1.63, and the Z-score for the 4997-pound vehicle is 2.34.
As neither Z-score is less than −3 or greater than 3, we
conclude that there are no outliers among the vehicle weights.

Data Preprocessing 11
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
 Mean and SD, part of the formula for the Z-score
standardization, are sensitive to the presence of outliers
 Values of mean and SD will both be unduly affected by
the presence or absence of this new data value.
 Not appropriate to use measures that are themselves
sensitive to their presence.
 Data analysts have developed more robust statistical
methods for outlier detection, which are less sensitive to
the presence of the outliers.
 One elementary robust method is to use the Interquartile
Range (IQR)

Data Preprocessing 11
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Using the interquartile range
The interquartile range (IQR) tells you the range of the middle half of
your dataset. You can use the IQR to create “fences” around your data
and then define outliers as any values that fall outside those fences.

Data Preprocessing
11
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
This method is helpful if you have a few values on the
extreme ends of your dataset, but you aren’t sure
whether any of them might count as outliers.

Interquartile range method


1.Sort your data from low to high
2.Identify the first quartile (Q1), the median, and the
third quartile (Q3).
3.Calculate your IQR = Q3 – Q1
4.Calculate your upper fence = Q3 + (1.5 * IQR)
5.Calculate your lower fence = Q1 – (1.5 * IQR)
6.Use your fences to highlight any outliers, all values that
fall outside your fences.
Your outliers are any values greater than your upper
fence or less than your lower fence.
Data Preprocessing 11
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Example: Using the interquartile range to find outliers
We’ll walk you through the popular IQR method for identifying outliers
using a step-by-step example.
Your dataset has 11 values. You have a couple of extreme values in
your dataset, so you’ll use the IQR method to check whether they are
outliers.
26 37 24 28 35 22 31 53 41 64 29

Step 1: Sort your data from low to high


First, you’ll simply sort your data in ascending order.

22 24 26 28 29 31 35 37 41 53 64

Step 2: Identify the median, the first quartile (Q1), and the third
quartile (Q3)
The median is the value exactly in the middle of your dataset when all
values are ordered from low to high.
Since you have 11 values, the median is the 6th value. The median value
is 31.
22 24 26 28 29 31 35 37 41 53 64

Data Preprocessing 11
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Next, we’ll use the exclusive method for identifying Q1 and Q3. This
means we remove the median from our calculations.
The Q1 is the value in the middle of the first half of your dataset,
excluding the median. The first quartile value is 26.
22 24 26 28 29

Your Q3 value is in the middle of the second half of your dataset,


excluding the median. The third quartile value is 41.
35 37 41 53 64

Step 3: Calculate your IQR


The IQR is the range of the middle half of your dataset. Subtract Q1
from Q3 to calculate the IQR.

Data Preprocessing 11
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Formula: IQR = Q3 – Q1
Calculation: Q1 = 26
             Q3 = 41
             IQR = 41 – 26
                 = 15

Step 4: Calculate your upper fence


The upper fence is the boundary around the third quartile. It tells you that
any values exceeding the upper fence are outliers.

Formula: Upper fence = Q3 + (1.5 * IQR)
Calculation: Upper fence = 41 + (1.5 * 15)
                         = 41 + 22.5
                         = 63.5

Data Preprocessing 11
Step 5: Calculate your lower fence
The lower fence is the boundary around the first quartile. Any values less
than the lower fence are outliers.
Formula: Lower fence = Q1 – (1.5 * IQR)
Calculation: Lower fence = 26 – (1.5 * 15)
                         = 26 – 22.5
                         = 3.5
Step 6: Use your fences to highlight any outliers
Go back to your sorted dataset from Step 1 and highlight any values
that are greater than the upper fence or less than your lower fence.
These are your outliers.
•Upper fence = 63.5
•Lower fence = 3.5

22 24 26 28 29 31 35 37 41 53 64 (the value 64 is greater than the upper fence of 63.5, so it is an outlier)

Data Preprocessing 11
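A NumPy sketch of the same fence calculation (note that np.percentile interpolates, so its quartiles can differ slightly from the hand-calculated exclusive-method values above, but the flagged outlier is the same):

import numpy as np

data = np.array([26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(iqr, lower_fence, upper_fence, outliers)   # 64 is flagged as an outlier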
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
 Quartiles of a data set divide the data set into four
parts
 Each containing 25% of the data:
 first quartile (Q1) is the 25th percentile.
 second quartile (Q2) is 50th percentile,
(median).
 The third quartile (Q3) is the 75th
percentile.
 IQR is calculated as IQR=Q3−Q1
 Robust measure of outlier detection - A data
value is an outlier if
a. it is located 1.5(IQR) or more below Q1, or
b. it is located 1.5(IQR) or more above Q3.
Data Preprocessing 11
Data Preprocessing 12
Data Preprocessing 12
FLAG VARIABLES
 Some analytical methods, such as regression, require predictors to be
numeric
 Need to recode the categorical variable into one or more flag variables
 A flag variable (or dummy variable, or indicator variable) is a
categorical variable taking only two values, 0 and 1.
 For example, the categorical predictor gender, taking the values female and male, could be recoded into the flag variable gender_flag as follows:
If gender = female then gender_flag = 0; if gender = male then gender_flag = 1.
 When a categorical predictor takes k ≥ 3 possible values, then define
k−1 dummy variables and use the unassigned category as the
reference category
 For example, region has k=4 possible categories, {north, east, south,
west}, then the analyst could define the following k−1=3 flag variables
 north_flag: If region = north then north_flag = 1; otherwise
north_flag = 0.
 east_flag: If region = east then east_flag = 1; otherwise east_flag =
0
 south_flag: If region = south then south_flag = 1; otherwise
south_flag = 0.
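As a hedged sketch (not from the slides), the region example above can be recoded into k−1 flag variables with pandas; the column name region and the sample values are assumptions made for illustration.

```python
# Minimal sketch: recode a k-category predictor into k-1 flag (dummy) variables,
# using "west" as the unassigned reference category.
import pandas as pd

df = pd.DataFrame({"region": ["north", "east", "south", "west", "east"]})

for level in ["north", "east", "south"]:          # k - 1 = 3 flags; "west" is the reference
    df[f"{level}_flag"] = (df["region"] == level).astype(int)

print(df)
```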
TRANSFORMING CATEGORICAL VARIABLES INTO NUMERICAL VARIABLES

Would it not be easier to simply transform the categorical variable region into a single numerical variable rather than using several different flag variables?
Unfortunately, this is a common and hazardous error. The algorithm would then erroneously assume the following:
 The four regions are ordered.
 West > South > East > North.
 West is three times closer to South than to North, and so on.
 In most instances, the data analyst should avoid transforming categorical variables to numerical variables.
 Exception: categorical variables that are clearly ordered, such as survey responses ranging from strongly disagree to strongly agree.

The most popular techniques are:
 Label Encoding or Ordinal Encoding
 One hot Encoding
 Dummy Encoding
 Effect Encoding
 Binary Encoding
 BaseN Encoding
 Hash Encoding
 Target Encoding

Label Encoding

Label Encoding is a technique of transforming categorical data into a format that can be provided to machine learning algorithms to improve their performance. While the idea is simple — replace the categories of a categorical variable with numerical labels — the implications and subtleties of this method are worth understanding in detail.

How Label Encoding Works


Label Encoding begins by identifying all the unique categories within a
categorical variable. Then, each category is assigned a unique integer. For
instance, if we have a ‘Color’ variable with the categories ‘Red,’ ‘Blue,’
and ‘Green,’ we might assign ‘Red’ as 1, ‘Blue’ as 2, and ‘Green’ as 3.
There’s no strict rule on how these numerical labels are assigned. One
common method is to assign labels based on the alphabetical order of
categories, though the labels could also be assigned randomly or based
on the order of appearance in the data.

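A minimal sketch of label encoding, assuming pandas and scikit-learn are available; note that scikit-learn's LabelEncoder assigns labels in alphabetical order.

```python
# Minimal sketch of label encoding; the 'Color' column and its values are illustrative.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})

encoder = LabelEncoder()
df["Color_encoded"] = encoder.fit_transform(df["Color"])  # alphabetical: Blue=0, Green=1, Red=2
print(df)
```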
One-Hot Encoding

One-Hot Encoding is another popular technique for converting categorical variables into a form that can be provided to machine learning algorithms. It creates binary (0 or 1) features for each category in the original variable, effectively mapping each category to a vector in a high-dimensional binary space.
How One-Hot Encoding Works
Let’s say we have a ‘Color’ variable with three categories: ‘Red,’ ‘Blue,’
and ‘Green.’ With One-Hot Encoding, we would create three new variables
(or ‘features’), one for each category: ‘Is_Red,’ ‘Is_Blue,’ and ‘Is_Green.’
Each of these new features is binary, meaning it takes the value 1 if the
original feature was that color and 0 if it was not.

So if we had five data points (Red, Blue, Green, Blue, Red), they would be transformed into:

Color   Is_Red   Is_Blue   Is_Green
Red     1        0         0
Blue    0        1         0
Green   0        0         1
Blue    0        1         0
Red     1        0         0
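A minimal sketch of the same transformation with pandas get_dummies; the column name and values mirror the example above.

```python
# Minimal sketch of one-hot encoding: one binary column per category.
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})
one_hot = pd.get_dummies(df["Color"], prefix="Is").astype(int)
print(pd.concat([df, one_hot], axis=1))
```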
Binary Encoding

Binary Encoding is another technique for converting categorical variables into numerical form. This method is particularly useful when dealing with high-cardinality categorical variables, where a variable has many unique categories.
How Binary Encoding Works
Binary Encoding combines ordinal encoding with a binary conversion of the resulting integers. First, the categories of a variable are encoded as ordinal, meaning integers are assigned to categories just as in integer encoding. Then, those integers are converted into binary code, resulting in binary digits or bits.
For example, let’s say we have a ‘Color’ variable with four categories:
‘Red,’ ‘Blue,’ ‘Green,’ and ‘Yellow.’ These categories would first be
assigned integer values. Let’s say ‘Red’ is 1, ‘Blue’ is 2, ‘Green’ is 3, and
‘Yellow’ is 4. Then, these integers are converted into binary format:
Red = 1 → 001, Blue = 2 → 010, Green = 3 → 011, Yellow = 4 → 100
Each binary digit then becomes its own column.
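A hand-rolled sketch of binary encoding is shown below (assuming the integer codes above); in practice a library such as category_encoders offers a BinaryEncoder that does this in one call, if it is available.

```python
# Hand-rolled binary encoding: assign integer codes, then spread each code's bits
# across separate 0/1 columns (here 3 bits cover the codes 1-4).
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Yellow", "Blue"]})

codes = {"Red": 1, "Blue": 2, "Green": 3, "Yellow": 4}   # ordinal step (assumed order)
n_bits = max(codes.values()).bit_length()

for shift in range(n_bits - 1, -1, -1):                  # most significant bit first
    df[f"Color_bit_{shift}"] = [(codes[c] >> shift) & 1 for c in df["Color"]]

print(df)
```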
Ordinal Encoding

Ordinal Encoding is a technique used to convert categorical data into a numerical format. As the name suggests, it is particularly suited to ordinal categorical variables, where the categories have an inherent order or hierarchy.
How Ordinal Encoding Works
In Ordinal Encoding, each unique category value is assigned an integer
value. For example, for an ‘Education Level’ variable with categories ‘No
degree’, ‘High School’, ‘Bachelor’s’, ‘Master’s’, and ‘PhD’, we could assign
‘No degree’ to 0, ‘High School’ to 1, ‘Bachelor’s’ to 2, ‘Master’s’ to 3, and
‘PhD’ to 4.
The critical aspect of Ordinal Encoding is to respect the inherent ordering
of the categories. The integers should be assigned in such a way that the
order of the categories is preserved.

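A minimal sketch of ordinal encoding using an explicit mapping that preserves the stated order of the Education Level categories.

```python
# Minimal sketch of ordinal encoding with an explicit, order-preserving mapping.
import pandas as pd

education_order = {"No degree": 0, "High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}

df = pd.DataFrame({"Education Level": ["High School", "PhD", "Bachelor's", "No degree"]})
df["Education_encoded"] = df["Education Level"].map(education_order)
print(df)
```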
Mean Encoding

Mean Encoding is a method used to encode categorical variables based on the mean value of the target variable. It is a way to capture information within the label and is therefore mainly used for supervised learning tasks.
How Mean Encoding Works
In Mean Encoding, each category in the feature variable is replaced with
the mean value of the target variable for that category. For example,
suppose we’re predicting the price of a car (target variable), and we have
a categorical variable ‘Color’. If the average price of red cars is $20,000,
then ‘Red’ would be replaced by ‘20000’ in the encoded feature.

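A minimal sketch of mean encoding with pandas; the prices are invented for illustration. In practice the category means should be computed on training data only (often with smoothing or cross-validation) to limit target leakage.

```python
# Minimal sketch of mean (target) encoding: replace each category with the mean
# of the target variable for that category.
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green"],
    "Price": [20000, 22000, 15000, 17000, 18000],
})

category_means = df.groupby("Color")["Price"].mean()
df["Color_encoded"] = df["Color"].map(category_means)
print(df)
```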
Dummy Encoding

The dummy coding scheme is similar to one-hot encoding. This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables). In the case of one-hot encoding, for N categories in a variable, it uses N binary variables. Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.

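A minimal sketch of dummy encoding with pandas, where drop_first=True keeps N-1 of the N indicator columns.

```python
# Minimal sketch of dummy encoding: the dropped first category acts as the reference.
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})
dummies = pd.get_dummies(df["Color"], prefix="Color", drop_first=True).astype(int)
print(pd.concat([df, dummies], axis=1))
```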
Effect Encoding

This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect encoding is very similar to dummy encoding, with a small difference: dummy coding uses 0 and 1 to represent the data, whereas effect encoding uses three values, i.e., 1, 0, and -1.
The row containing only 0s in dummy encoding (the reference category) is encoded with -1s in effect encoding. For example, if a dummy-encoded data set of five cities represents the reference city Bangalore as 0 0 0 0, effect encoding represents it as -1 -1 -1 -1.

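A minimal sketch of effect encoding built on top of dummy encoding; the city names are assumptions for illustration.

```python
# Effect (deviation) encoding sketch: start from N-1 dummy columns, then recode the
# reference category's all-zero rows as all -1s.
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Bangalore", "Delhi", "Chennai"]})

# drop_first drops the alphabetically first category ("Bangalore"), making it the reference
dummies = pd.get_dummies(df["City"], prefix="City", drop_first=True).astype(int)
reference_rows = dummies.sum(axis=1) == 0      # rows of the dropped (reference) category
dummies.loc[reference_rows, :] = -1
print(pd.concat([df, dummies], axis=1))
```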
BINNING NUMERICAL VARIABLES
What is binning numerical value?
Numerical Binning is a way to group a number of more or less continuous values into
a smaller number of “bins”. Creating ranges or bins will help to understand the
numerical data better. For example, if you have age data on a group of people, you
might want to arrange their ages into a smaller number of age intervals.

Data binning, or bucketing, is a process used to minimize the effects of observation errors. It is the process of transforming numerical variables into their categorical counterparts.

In other words, binning will take a column with continuous numbers and place the
numbers in “bins” based on ranges that we determine. This will give us a new
categorical variable feature.

Advantages of binning:-
•Improves the accuracy of predictive models by reducing noise or non-linearity in the
dataset.
•Helps identify outliers and invalid and missing values of numerical variables.

Types of Binning
Equal Width (or distance) Binning
This algorithm divides the continuous variable into several categories having bins or
ranges of the same width. Let x be the number of categories and max and min be the
maximum and minimum values in the concerned column.
Then the width (w) of each bin will be: w = (max − min) / x
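A short sketch of equal-width binning with pandas; the ages and the number of bins are illustrative.

```python
# Minimal sketch of equal-width binning: pd.cut splits the range (max - min)
# into a chosen number of equally wide bins.
import pandas as pd

ages = pd.Series([3, 15, 22, 27, 31, 38, 45, 52, 60, 71])
age_bins = pd.cut(ages, bins=4)                                   # 4 bins of equal width
labeled = pd.cut(ages, bins=4, labels=["child", "young", "middle", "senior"])
print(pd.DataFrame({"age": ages, "bin": age_bins, "label": labeled}))
```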
ADDING AN INDEX FIELD
It is recommended that the data analyst create an index field, which tracks the sort order of the records.
Data mining data gets partitioned at
least once (and sometimes several
times).
It is helpful to have an index field so
that the original sort order may be
recreated.
For example, using IBM/SPSS Modeler,
you can use the @Index function in the
Derive node to create an index field. 136
Data Preprocessing
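Outside IBM/SPSS Modeler, the same idea can be sketched in pandas (column names are illustrative):

```python
# Minimal sketch: add an index field that records the original sort order, so it
# can be restored after partitioning or shuffling.
import pandas as pd

df = pd.DataFrame({"value": [10, 7, 13, 4]})
df["index_field"] = range(len(df))                 # original record order

shuffled = df.sample(frac=1, random_state=42)      # e.g., a random partition/shuffle
restored = shuffled.sort_values("index_field")     # original order recovered
print(restored)
```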
REMOVING VARIABLES THAT ARE NOT USEFUL
 Data analyst may remove variables that will
not help analysis
 Such variables include
 unary variables and
 variables that are very nearly unary.
Unary variables take on only a single value
 e.g., in a sample of students at an all-girls private school, the variable gender would take only the value female.
Sometimes a variable can be very nearly
unary
 For example, suppose that 99.95% of the
players in a field hockey league are female,
with the remaining 0.05% male.
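A hedged sketch of how nearly unary variables might be flagged in pandas; the 99.9% threshold is an assumption, not a rule from the slides.

```python
# Minimal sketch: flag variables that are unary or very nearly unary, i.e. where a
# single value accounts for at least `threshold` of the records.
import pandas as pd

def nearly_unary_columns(df: pd.DataFrame, threshold: float = 0.999) -> list:
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

league = pd.DataFrame({"sex": ["female"] * 9995 + ["male"] * 5,   # 99.95% female
                       "goals": range(10000)})
print(nearly_unary_columns(league))   # ['sex']
```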
REMOVAL OF DUPLICATE RECORDS
 Records may have been inadvertently copied,
thus creating duplicate records.
 Duplicate records lead to an overweighting of
the data values
 Only one set of them should be retained
 For example, if the ID field is duplicated, then
definitely remove the duplicate records.
 Data analyst should apply common sense
 Suppose a data set contains three nominal
fields, and each field takes only three values,
then 3 × 3 × 3 = 27 possible different records
 If there are more than 27 records, at least one of them must be a duplicate, without necessarily representing an error.

Removing duplicate records is not
particularly difficult.
Most statistical packages and database
systems have built-in commands that
group records together.
In the database language SQL, for example, records can be grouped with the GROUP BY clause, and duplicate rows can be removed by selecting DISTINCT records.

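A minimal pandas sketch of duplicate detection and removal; the sample records are invented.

```python
# Minimal sketch: detect and drop duplicate records with pandas, keeping one copy.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [88, 91, 91, 75]})

print(df.duplicated().sum())        # number of fully duplicated rows (here: 1)
deduplicated = df.drop_duplicates() # retain only one record from each duplicate set
print(deduplicated)
```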
Attribute selection
• Attribute selection is defined as “the process of finding the best subset of features from the original set of features in a given data set.”
• Attribute subset Selection is a technique which is used for data reduction in data
mining process.
• The data set may have a large number of attributes. But some of those attributes
can be irrelevant or redundant.
• The goal of attribute subset selection is to find a minimum set of attributes
• Dropping irrelevant attributes does not much affect the utility of the data, while the cost of data analysis is reduced.

Process of Attribute Subset Selection


• Use statistical significance tests so that the best (or worst) attributes can be recognized.
• This is a kind of greedy approach in which a significance level is decided (a commonly used value is 5%) and the model is refit repeatedly until the p-value of every remaining attribute is less than or equal to the selected significance level.
• The attributes having a p-value higher than the significance level are discarded.

• The procedure is repeated until every attribute remaining in the data set has a p-value less than or equal to the significance level.
• This gives the reduced data set, containing no irrelevant attributes.

Methods of Attribute Subset Selection-


1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.

1. Stepwise Forward Selection:


• This procedure starts with an empty set of attributes as the minimal set.
• The most relevant attribute (having the minimum p-value) is chosen and added to the minimal set.
• In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination:
• Here all the attributes are included in the initial set.
• In each iteration, the attribute whose p-value is higher than the significance level (typically the least significant one) is eliminated from the set; a code sketch of this procedure appears at the end of this section.

3. Combination of Forward Selection and Backward Elimination:
•The stepwise forward selection and backward elimination are combined so as to
select the relevant attributes most efficiently.
•This is the most commonly used technique for attribute selection.

4. Decision Tree Induction:


• This approach uses a decision tree for attribute selection.
• It constructs a flowchart-like structure in which internal nodes denote tests on attributes.
• Each branch corresponds to an outcome of the test, and each leaf node is a class prediction.
• Attributes that do not appear in the tree are considered irrelevant and hence discarded.

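A hedged sketch (not from the slides) of p-value-based backward elimination using statsmodels; the data and column names are synthetic.

```python
# Sketch of stepwise backward elimination using p-values from an OLS fit.
import pandas as pd
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
y = 2 * X["a"] - 3 * X["c"] + rng.normal(size=100)   # only "a" and "c" are relevant

alpha = 0.05                     # chosen significance level
features = list(X.columns)
while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvals = model.pvalues.drop("const")          # p-value per remaining attribute
    worst = pvals.idxmax()
    if pvals[worst] <= alpha:                    # all attributes significant: stop
        break
    features.remove(worst)                       # discard the least significant attribute

print("Selected attributes:", features)
```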
Thank You !!!
