Data Preparation.2
Data Science
1.Numerical Data
● One of the most important steps in the Data Preparation stage when we have Numerical Data is Feature Scaling. There are two common techniques:
1. Normalization.
2. Standardization.
Importance of Feature Scaling
● Consider this car dataset: the age of a car ranges from 5 years to 20 years, whereas the Distance Travelled ranges from 10,000 km to 50,000 km. When we compare the two ranges, they are very far apart from each other.
● A machine learning algorithm may treat the feature with the higher range of values as more important when predicting the output and tend to ignore the feature with the smaller range of values. This can result in biased or wrongly predicted outputs.
1.Normalization
● Normalization is the process of transforming the data by rescaling individual values to a common range.
● It is also called Min-Max scaling, which shrinks the range of data values to between 0 and 1. For example, if we have the height of a person in a dataset with values in the range 40 cm (baby) to 200 cm (adult), then feature scaling transforms all the values to the range 0 to 1, where 0 represents the shortest height and 1 represents the tallest height, instead of representing the height in centimetres.
● Normalization (Min-Max Scaling) formula:
x_scaled = (x - min(X)) / (max(X) - min(X))
● Where x is the current value to be scaled, min(X) is the minimum value in the list of values and max(X) is the maximum value in the list of values.
● Example:
If X = [1, 3, 5, 7, 9], then min(X) = 1 and max(X) = 9, and the scaled values would be [0, 0.25, 0.5, 0.75, 1].
Here we can observe that the min(X) 1 is represented as 0 and the max(X) 9 is represented as 1.
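As an illustration (not part of the original slides), here is a minimal Python sketch of Min-Max scaling applied to the example X = [1, 3, 5, 7, 9], assuming scikit-learn's MinMaxScaler is available:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The example values from the slide, reshaped to a column because
# scikit-learn scalers expect a 2D array (samples x features).
X = np.array([1, 3, 5, 7, 9], dtype=float).reshape(-1, 1)

# Min-Max scaling: x_scaled = (x - min(X)) / (max(X) - min(X))
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```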
2.Standardization
● Standardization formula: z = (x - µ) / σ
● Where x is the current value to be scaled, µ is the mean of the list of values and σ is the standard deviation of the list of values. The scaled values are distributed such that the mean of the values is 0 and the standard deviation is 1.
● For the example above, X = [1, 3, 5, 7, 9], the standardized values range from -1.41 to 1.41. This range changes depending on the values of X.
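A similar sketch for Standardization on the same example, assuming scikit-learn's StandardScaler; it reproduces the -1.41 to 1.41 range mentioned above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([1, 3, 5, 7, 9], dtype=float).reshape(-1, 1)

# Standardization: z = (x - mu) / sigma, so the result has mean 0 and std 1.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.ravel())               # approx. [-1.41 -0.71  0.    0.71  1.41]
print(X_std.mean(), X_std.std())   # approx. 0.0 and 1.0
```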
Data distribution before and after feature scaling:
Step 6: Outlier Removal
● Outliers are one of the most common issues that can easily give misleading results.
● An outlier is an observation that lies outside the overall pattern of a distribution. It lies at an abnormal distance from the other values in a random sample from a population.
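The slides do not name a specific removal method; one common choice is the IQR rule. The sketch below applies it to a small hypothetical car dataset (the column names and values are illustrative, not the slide's actual data):

```python
import pandas as pd

# Hypothetical car data with one obvious outlier in Distance_Travelled.
df = pd.DataFrame({
    "Age":                [5,     8,     12,    15,    20,     7],
    "Distance_Travelled": [10000, 18000, 25000, 40000, 50000, 500000],
})

# IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
col = "Distance_Travelled"
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

clean = df[(df[col] >= lower) & (df[col] <= upper)]
print(clean)  # the 500000 km row is dropped
```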
2.Categorical Data
● It is very common to see categorical data, which can be of type nominal or ordinal, in your dataset. Since machine learning algorithms work on numbers only, we need to transform the categorical data into numeric data; this technique is called Categorical Data Encoding.
● There are multiple types of categorical data encoding; we will discuss these two types:
1. Label Encoding.
2. One-hot Encoding
1.Label Encoding
● Label Encoding assigns each distinct category in a feature a unique integer value. For example, a nominal feature describing bridge types could contain the following categories, each of which would be mapped to a different integer:
Arch
Beam
Truss
Cantilever
Tied Arch
Suspension
Cable
● Label Encoding can also be applied to ordinal data, where the integer codes should follow the natural order of the categories (an ordinal example is included in the sketch below).
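A minimal sketch of Label Encoding with pandas, using the bridge types above as the nominal example and a hypothetical Small/Medium/Large feature as the ordinal example (the ordinal values are illustrative and not from the slides):

```python
import pandas as pd

# Nominal example from the slide: bridge types.
bridges = pd.DataFrame({
    "Bridge_Type": ["Arch", "Beam", "Truss", "Cantilever",
                    "Tied Arch", "Suspension", "Cable"]
})

# Label Encoding: each distinct category is mapped to an integer code
# (here the codes follow the alphabetical order of the categories).
bridges["Bridge_Type_Code"] = bridges["Bridge_Type"].astype("category").cat.codes
print(bridges)

# Ordinal example (hypothetical sizes): the mapping is written explicitly
# so that the integer order matches the category order.
sizes = pd.Series(["Small", "Medium", "Large", "Medium"])
size_map = {"Small": 0, "Medium": 1, "Large": 2}
print(sizes.map(size_map).tolist())  # [0, 1, 2, 1]
```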
2.One Hot Encoding
● One Hot Encoding is one of the most widely used encoding approaches.
● It works by creating a new column for each category present in the feature and assigning a 1 or 0 to indicate the presence of a category in the data, where a value of 1 in a column represents the presence of that level in the original data and a value of 0 represents its absence.
Example: (Label Encoding vs. One Hot Encoding)
In this example, by applying the One Hot Encoding approach to the Team column, we can see:
● The value in the new Team_A column is 1 if the original value in the Team column was A.
Otherwise, the value is 0.
● The value in the new Team_B column is 1 if the original value in the Team column was B.
Otherwise, the value is 0.
● The value in the new Team_C column is 1 if the original value in the Team column was C.
Otherwise, the value is 0.
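A minimal sketch of One Hot Encoding of a Team column with pandas get_dummies; the specific rows are made up for illustration, and only the category names A, B and C come from the example above:

```python
import pandas as pd

# Hypothetical rows for the Team column from the example.
df = pd.DataFrame({"Team": ["A", "B", "B", "C", "A"]})

# One Hot Encoding: one new column per category, where 1 marks the
# presence of that category in the row and 0 its absence.
one_hot = pd.get_dummies(df, columns=["Team"], prefix="Team", dtype=int)
print(one_hot)
#    Team_A  Team_B  Team_C
# 0       1       0       0
# 1       0       1       0
# 2       0       1       0
# 3       0       0       1
# 4       1       0       0
```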