Data Preparation.2

This document discusses techniques for preparing data for machine learning models, including feature scaling of numerical data through normalization and standardization, and encoding of categorical data through label encoding and one-hot encoding.

Uploaded by

yasmine hussein

Principles and Practices of

Data Science
1. Numerical Data
● One of the most important steps in the Data Preparation stage when we have
numerical data is Feature Scaling.

● Feature Scaling is a method used to transform numerical data into a
standard range to improve the performance of the machine learning model.

It can be achieved by two methods:

1. Normalization.
2. Standardization.
Importance of Feature Scaling
● Consider a car dataset where the age of a car ranges from 5 to 20 years,
while the distance travelled ranges from 10,000 km to 50,000 km. When we
compare the two ranges, they are very far apart from each other.
● The machine learning algorithm treats the feature with the larger range of
values as more important when predicting the output and tends to ignore the
feature with the smaller range. This can result in biased or wrong
predictions.
1. Normalization
● Normalization is the process of transforming the data by rescaling each
feature to a common range.

● It is also called Min-Max scaling because it shrinks the range of data
values to between 0 and 1. For example, if a dataset contains the height of
a person with values ranging from 40 cm (baby) to 200 cm (adult), feature
scaling transforms all the values to the range 0 to 1, where 0 represents
the smallest height and 1 represents the largest height, instead of
representing the height in centimeters.
● Normalization (Min-Max scaling) formula:

x_scaled = (x - min(X)) / (max(X) - min(X))

● Where x is the current value to be scaled, min(X) is the minimum value in
the list of values, and max(X) is the maximum value in the list of values.

● Example:
if X = [1, 3, 5, 7, 9], then min(X) = 1 and max(X) = 9, and the scaled
values would be [0, 0.25, 0.5, 0.75, 1].
Here we can observe that min(X) = 1 is represented as 0 and max(X) = 9 is
represented as 1.
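The min-max example above can be sketched in plain Python (no external libraries assumed; scikit-learn's MinMaxScaler does the same thing for whole arrays):

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

X = [1, 3, 5, 7, 9]
print(min_max_scale(X))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```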
2. Standardization

● It represents the values in standard deviations from the mean:

z = (x - µ) / σ

● Where x is the current value to be scaled, µ is the mean of the list of
values and σ is the standard deviation of the list of values. The scaled
values are distributed such that their mean is 0 and their standard
deviation is 1.

● Example: if X = [1, 3, 5, 7, 9], then µ = 5 and σ = 2.83, and the scaled
values will be [-1.41, -0.71, 0, 0.71, 1.41].

● Here the values range from -1.41 to 1.41. This range changes depending on
the values of X.
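The standardization example can be sketched the same way (using the population standard deviation, which reproduces the values above):

```python
import math

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    return [(x - mu) / sigma for x in values]

X = [1, 3, 5, 7, 9]
print([round(z, 2) for z in standardize(X)])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```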
Data distribution before and after feature scaling:
Step 6: Outlier Removal
● Outliers are one of the most common issues that can easily produce
misleading results.
● An outlier is an observation that lies outside the overall pattern of a
distribution. It lies at an abnormal distance from the other values in a
random sample from a population.
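The slides do not name a specific removal method; one common choice (an assumption here, not stated above) is Tukey's IQR rule, which drops values more than 1.5 interquartile ranges outside the middle 50% of the data. A plain-Python sketch:

```python
def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    s = sorted(values)

    def quantile(q):
        # Linear-interpolation quantile over the sorted list.
        idx = q * (len(s) - 1)
        lo_i, frac = int(idx), idx - int(idx)
        hi_i = min(lo_i + 1, len(s) - 1)
        return s[lo_i] + frac * (s[hi_i] - s[lo_i])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if lo <= x <= hi]

data = [10, 12, 11, 13, 12, 95]
print(remove_outliers_iqr(data))  # [10, 12, 11, 13, 12] -- 95 is dropped
```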
2. Categorical Data
● It is very common to see categorical data, which can be nominal or
ordinal, in your data set. Since machine learning algorithms work on
numbers only, we need to transform the categorical data into numeric data;
this technique is called Categorical Data Encoding.

● There are multiple types of categorical data encoding; we will discuss
these two:
1. Label Encoding.
2. One-hot Encoding.
1. Label Encoding

● It converts each value in a column/feature of categorical type to a number.

● Each category is assigned a unique label starting from 0 and going up to
N - 1, where N is the number of categories in the feature.
1. Label Encoding
It can be applied to nominal data, for example the BRIDGE-TYPE feature:
BRIDGE-TYPE

Arch

Beam

Truss

Cantilever

Tied Arch

Suspension

Cable
1. Label Encoding
Label encoding can also be applied to ordinal data, for example:
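Label encoding can be sketched in plain Python; note that this sketch assigns labels in alphabetical order (the same convention scikit-learn's LabelEncoder uses), which may differ from the order the categories appear in:

```python
def label_encode(values):
    """Map each distinct category to an integer 0..N-1, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

teams = ["A", "B", "C", "A", "B"]
encoded, mapping = label_encode(teams)
print(encoded)  # [0, 1, 2, 0, 1]
print(mapping)  # {'A': 0, 'B': 1, 'C': 2}
```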
2. One-hot Encoding

● One-hot encoding is one of the most widely used encoding approaches.
● It works by creating a new column for each category present in the feature
and assigning a 1 or 0 to indicate the presence of that category in the
data, where a value of 1 in a column represents the presence of that level
in the original data and a value of 0 represents its absence.
2. One-hot Encoding
Example: (Label Encoding vs. One-hot Encoding)
In this example, by applying the label encoding approach, we can see:

● Each "A" value has been converted to 0.
● Each "B" value has been converted to 1.
● Each "C" value has been converted to 2.
Example: (Label Encoding vs. One-hot Encoding)
Using one-hot encoding, we would convert the Team column into new variables
that contain only 0 and 1 values:
Example: (Label Encoding vs. One-hot Encoding)
● For example, the categorical variable Team had three unique values, so we
created three new columns in the dataset that all contain 0 or 1 values.

Here’s how to interpret the values in the new columns:

● The value in the new Team_A column is 1 if the original value in the Team column was A.
Otherwise, the value is 0.
● The value in the new Team_B column is 1 if the original value in the Team column was B.
Otherwise, the value is 0.
● The value in the new Team_C column is 1 if the original value in the Team column was C.
Otherwise, the value is 0.
