Data Preparation.2
Data Science
1.Numerical Data
● One of the most important steps in the Data Preparation stage when we have Numerical Data is Feature Scaling. There are two common techniques:
1. Normalization.
2. Standardization.
Importance of Feature Scaling
● Consider this car dataset: the age of a car ranges from 5 years to 20 years, whereas the Distance Travelled ranges from 10,000 km to 50,000 km. When we compare the two ranges, they are very far apart from each other.
● A machine learning algorithm may treat the feature with the higher range of values as more important when predicting the output and tend to ignore the feature with the smaller range of values. This can result in biased or wrongly predicted outputs.
1.Normalization
● Normalization is the process of transforming the data by rescaling individual values to a common range.
● It is also called Min-Max scaling, which shrinks the range of data values to between 0 and 1. For example, if we have the height of a person in a dataset with values in the range 40 cm (baby) to 200 cm (adult), then feature scaling transforms all the values to the range 0 to 1, where 0 represents the shortest height and 1 represents the tallest height, instead of representing the height in centimetres.
● Normalization (Min-Max Scaling) formula:
x_scaled = (x - min(X)) / (max(X) - min(X))
● Where x is the current value to be scaled, min(X) is the minimum value in the list of values and max(X) is the maximum value in the list of values.
● Example:
If X = [1, 3, 5, 7, 9], then min(X) = 1 and max(X) = 9, and the scaled values would be [0, 0.25, 0.5, 0.75, 1].
Here we can observe that the min(X) 1 is represented as 0 and the max(X) 9 is represented as 1.
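As an illustration (not part of the original slides), here is a minimal Python sketch of Min-Max scaling applied to the example X = [1, 3, 5, 7, 9], assuming scikit-learn's MinMaxScaler is available:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The example values from the slide, reshaped to a column because
# scikit-learn scalers expect a 2D array (samples x features).
X = np.array([1, 3, 5, 7, 9], dtype=float).reshape(-1, 1)

# Min-Max scaling: x_scaled = (x - min(X)) / (max(X) - min(X))
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```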
2.Standardization
● Standardization formula: z = (x - µ) / σ
● Where x is the current value to be scaled, µ is the mean of the list of values and σ is the standard deviation of the list of values. The scaled values are distributed such that the mean of the values is 0 and the standard deviation is 1.
● For the example above, X = [1, 3, 5, 7, 9], the standardized values range from -1.41 to 1.41. This range changes depending on the values of X.
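A similar sketch for Standardization on the same example, assuming scikit-learn's StandardScaler; it reproduces the -1.41 to 1.41 range mentioned above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([1, 3, 5, 7, 9], dtype=float).reshape(-1, 1)

# Standardization: z = (x - mu) / sigma, so the result has mean 0 and std 1.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.ravel())               # approx. [-1.41 -0.71  0.    0.71  1.41]
print(X_std.mean(), X_std.std())   # approx. 0.0 and 1.0
```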
Data distribution before and after feature scaling:
Step 6: Outlier Removal
● Outliers are one of the most common issues that can easily give misleading results.
● An outlier is an observation that lies outside the overall pattern of a distribution. It lies at an abnormal distance from the other values in a random sample from a population.
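The slides do not name a specific removal method; one common choice is the IQR rule. The sketch below applies it to a small hypothetical car dataset (the column names and values are illustrative, not the slide's actual data):

```python
import pandas as pd

# Hypothetical car data with one obvious outlier in Distance_Travelled.
df = pd.DataFrame({
    "Age":                [5,     8,     12,    15,    20,     7],
    "Distance_Travelled": [10000, 18000, 25000, 40000, 50000, 500000],
})

# IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
col = "Distance_Travelled"
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

clean = df[(df[col] >= lower) & (df[col] <= upper)]
print(clean)  # the 500000 km row is dropped
```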
2.Categorical Data
● It is very common to see categorical data, which can be of type nominal or ordinal, in your dataset. Since machine learning algorithms work on numbers only, we need to transform the categorical data into numeric data; this technique is called Categorical Data Encoding.
● There are multiple types of categorical data encoding; we will discuss these two types:
1. Label Encoding.
2. One-hot Encoding
1.Label Encoding
● Label Encoding assigns each distinct category in a feature a unique integer value. For example, a nominal feature describing bridge types could contain the following categories, each of which would be mapped to a different integer:
Arch
Beam
Truss
Cantilever
Tied Arch
Suspension
Cable
● Label Encoding can also be applied to ordinal data, where the integer codes should follow the natural order of the categories (an ordinal example is included in the sketch below).
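A minimal sketch of Label Encoding with pandas, using the bridge types above as the nominal example and a hypothetical Small/Medium/Large feature as the ordinal example (the ordinal values are illustrative and not from the slides):

```python
import pandas as pd

# Nominal example from the slide: bridge types.
bridges = pd.DataFrame({
    "Bridge_Type": ["Arch", "Beam", "Truss", "Cantilever",
                    "Tied Arch", "Suspension", "Cable"]
})

# Label Encoding: each distinct category is mapped to an integer code
# (here the codes follow the alphabetical order of the categories).
bridges["Bridge_Type_Code"] = bridges["Bridge_Type"].astype("category").cat.codes
print(bridges)

# Ordinal example (hypothetical sizes): the mapping is written explicitly
# so that the integer order matches the category order.
sizes = pd.Series(["Small", "Medium", "Large", "Medium"])
size_map = {"Small": 0, "Medium": 1, "Large": 2}
print(sizes.map(size_map).tolist())  # [0, 1, 2, 1]
```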
2.One Hot Encoding
● One Hot Encoding is one of the most widely used encoding approaches.
● It works by creating a new column for each category present in the feature and assigning a 1 or 0 to indicate the presence of a category in the data, where a value of 1 in a column represents the presence of that level in the original data and a value of 0 represents its absence.
Example: (Label Encoding vs. One Hot Encoding)
In this example, by applying the One Hot Encoding approach to the Team column, we can see:
● The value in the new Team_A column is 1 if the original value in the Team column was A.
Otherwise, the value is 0.
● The value in the new Team_B column is 1 if the original value in the Team column was B.
Otherwise, the value is 0.
● The value in the new Team_C column is 1 if the original value in the Team column was C.
Otherwise, the value is 0.
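A minimal sketch of One Hot Encoding of a Team column with pandas get_dummies; the specific rows are made up for illustration, and only the category names A, B and C come from the example above:

```python
import pandas as pd

# Hypothetical rows for the Team column from the example.
df = pd.DataFrame({"Team": ["A", "B", "B", "C", "A"]})

# One Hot Encoding: one new column per category, where 1 marks the
# presence of that category in the row and 0 its absence.
one_hot = pd.get_dummies(df, columns=["Team"], prefix="Team", dtype=int)
print(one_hot)
#    Team_A  Team_B  Team_C
# 0       1       0       0
# 1       0       1       0
# 2       0       1       0
# 3       0       0       1
# 4       1       0       0
```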