Data Science
Lecture no: 02
Data Preparation
Instructor: Dr. Khalid Mahboob
Email-id: [Link]@[Link]
Outline
Data Preparation
Importance of Data Preparation
Good Quality
Data Preparation and Model training Steps
Benefits of Data Preparation
Data Preparation
Data preparation is the process of cleaning raw data by reformatting it, correcting errors, and combining data sets.
In other words, data preparation is the cleaning and transforming of raw data before processing and analysis.
Data preparation is also sometimes called “pre-processing.”
Importance of Data Preparation
Data should be formatted according to the requirements of the software tool being used.
A machine learning algorithm follows the rules it learns from the data it is given (it learns like a child).
Data in the real world is dirty.
Incomplete: Some data lack attribute values, lack certain attributes of interest, or contain only aggregate data.
For example, First name = “” or Last name = “”
Noisy: Data contain errors or implausible values, often due to human error during data entry.
For example, Age = -10
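The two kinds of dirty data above can be detected with simple checks. The following is a minimal sketch, not a definitive implementation: the records, the field names, and the valid age range of 0 to 120 are all assumptions made for illustration.

```python
# Hypothetical records; field names and the 0-120 age range are assumptions.
records = [
    {"first_name": "Ali", "last_name": "Khan", "age": 25},
    {"first_name": "", "last_name": "Ahmed", "age": 31},
    {"first_name": "Sara", "last_name": "", "age": 40},
    {"first_name": "Omar", "last_name": "Malik", "age": -10},
]

def is_incomplete(record):
    """True if either name field is missing or empty."""
    return not record.get("first_name") or not record.get("last_name")

def is_noisy(record, min_age=0, max_age=120):
    """True if an age is present but outside the assumed valid range."""
    age = record.get("age")
    return age is not None and not (min_age <= age <= max_age)

incomplete = [r for r in records if is_incomplete(r)]
noisy = [r for r in records if is_noisy(r)]

print("Incomplete:", incomplete)
print("Noisy:", noisy)
```

Flagging records first, rather than deleting them immediately, leaves the decision of how to repair them to the cleaning step.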
Good Quality
Which one is good quality: a Mercedes-Benz or a Suzuki Mehran?
Two Views of Quality Definition
Popular View: Quality is directly related to CLASS.
Technical View: Quality means meeting the customer's level of satisfaction within time, budget, and scope.
Data Preparation and Model training Steps
1. Problem Definition
2. Data Collection
3. Data Preparation
4. Data Exploration
5. Data Modeling
6. Evaluation
7. Deployment
8. Monitoring and Maintenance
1. Problem Definition: The first step is to clearly define the problem you are going to solve.
For example, a company such as Daraz is losing customers.
• Understand the root causes: the problem behind the problem.
• Identify the stakeholders and the users: understand the market and the customers' needs and wants.
• Explain the constraints imposed on the solution.
2. Data Collection: The next step is to collect the right set of data from various sources.
3. Data Preparation: Data preparation is the process of collecting, integrating, structuring, and organizing data so that it
can be used in business.
Data preparation involves many activities that can be performed in different ways.
• Data cleaning: fixing incomplete or erroneous data
• Data integration: unifying data from different sources
• Data transformation: formatting the data
• Data reduction: reducing data to its simplest form
• Data discretization: reducing the number of values to make data management easier
• Feature engineering: selecting and transforming variables to work better with machine learning
Data cleaning: Most of the data gathered during the collection phase is unstructured, irrelevant, or unfiltered.
Cleaning eliminates duplicates, fills in or deletes missing values, and reduces outliers.
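The three cleaning actions can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the data is hypothetical, missing values are filled with the median (one common choice among several), and outliers are handled by clipping to an assumed valid range of 0 to 1000.

```python
from statistics import median

# Hypothetical raw rows: (customer_id, monthly_spend); None marks a missing value.
raw = [(1, 120.0), (2, None), (2, None), (3, 95.0), (4, 5000.0), (5, 110.0), (1, 120.0)]

def clean(rows, lower=0.0, upper=1000.0):
    # 1. Drop exact duplicate rows while preserving order.
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    # 2. Fill missing values with the median of the observed values.
    observed = [value for _, value in unique if value is not None]
    fill = median(observed)
    filled = [(cid, fill if value is None else value) for cid, value in unique]
    # 3. Clip outliers into the assumed valid range [lower, upper].
    return [(cid, min(max(value, lower), upper)) for cid, value in filled]

cleaned = clean(raw)
print(cleaned)
```

Deleting rows with missing values instead of filling them is equally valid; which option is better depends on how much data can be spared.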
Data integration: Combine data collected from different sources and in different formats (CSV, TSV, text, etc.) into one coherent dataset.
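As a rough illustration of integration, the sketch below joins a CSV source and a TSV source on a shared key. The two in-memory strings stand in for files from different systems; their contents and the column names `customer_id`, `name`, and `amount` are assumptions.

```python
import csv
import io

# Hypothetical in-memory sources standing in for files from two systems.
customers_csv = "customer_id,name\n1,Ali\n2,Sara\n3,Omar\n"
orders_tsv = "customer_id\tamount\n1\t250\n2\t100\n1\t75\n"

def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

customers = read_rows(customers_csv, ",")
orders = read_rows(orders_tsv, "\t")

# Sum order amounts per customer.
totals = {}
for order in orders:
    cid = order["customer_id"]
    totals[cid] = totals.get(cid, 0) + int(order["amount"])

# Left join: keep every customer, even those without any orders.
integrated = []
for customer in customers:
    cid = customer["customer_id"]
    integrated.append({
        "customer_id": cid,
        "name": customer["name"],
        "total_amount": totals.get(cid, 0),
    })

print(integrated)
```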
Data transformation: The process of changing data from one format or structure into another.
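A small sketch of both kinds of change, format and structure, is given here. The records and the day/month/year input format are assumptions; the example reformats date text to ISO form and restructures a list of rows into a dictionary of columns.

```python
from datetime import datetime

# Hypothetical row-oriented records with dates in day/month/year text form.
rows = [
    {"name": "Ali", "joined": "05/01/2023"},
    {"name": "Sara", "joined": "17/11/2022"},
]

def to_iso(date_text):
    """Convert 'DD/MM/YYYY' text to ISO 'YYYY-MM-DD' text."""
    return datetime.strptime(date_text, "%d/%m/%Y").strftime("%Y-%m-%d")

# Restructure a list of rows into a dict of columns, reformatting dates on the way.
columns = {
    "name": [row["name"] for row in rows],
    "joined": [to_iso(row["joined"]) for row in rows],
}

print(columns)
```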
Data reduction: A process that reduces the volume of the original data and represents it in a much smaller volume while preserving its integrity. Dimensionality reduction is one such technique.
For example: going from 2D to 1D, or removing duplicate data.
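Both examples can be shown on one tiny dataset. This is only a sketch: the data is hypothetical, and the 2D to 1D step works here because the second column is an exact rescaling of the first, which is a deliberately simple case chosen for illustration. Real dimensionality reduction usually relies on techniques such as PCA.

```python
# Hypothetical 2-column data: the same length recorded in centimetres and in metres.
points = [(150.0, 1.5), (160.0, 1.6), (150.0, 1.5), (175.0, 1.75), (160.0, 1.6)]

# Step 1: remove duplicate rows, keeping the first occurrence of each.
unique_points = list(dict.fromkeys(points))

# Step 2: the second column is just the first divided by 100, so keeping
# only the first column loses no information (2D -> 1D).
reduced = [cm for cm, _ in unique_points]

print(unique_points)
print(reduced)
```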
Data discretization: The process of converting a large number of attribute values into a finite set of intervals with minimal loss of information, and associating each interval with a specific data value or conceptual label.
For example, Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Attribute (Age)              After discretization
1, 5, 4, 9, 7                Child
11, 14, 17, 13, 18, 19       Young
31, 33, 36, 42, 44, 46       Mature
70, 74, 77, 78               Old
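The age example can be reproduced with a short binning routine. Note the assumption: the example gives only the grouped values, not the interval boundaries, so the cut points 10, 30, and 60 below are hypothetical values chosen so that the groups match.

```python
from bisect import bisect_right

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

# Assumed cut points: the example lists the grouped values, not the boundaries.
cut_points = [10, 30, 60]
labels = ["Child", "Young", "Mature", "Old"]

def discretize(age):
    """Map an age to the label of the interval it falls into."""
    return labels[bisect_right(cut_points, age)]

# Collect the ages under their interval labels.
groups = {label: [] for label in labels}
for age in ages:
    groups[discretize(age)].append(age)

for label, members in groups.items():
    print(label, members)
```

After discretization, 21 distinct age values are represented by only four labels.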
Feature engineering: The process of creating or extracting features from data using domain knowledge.
Features can be created from raw data or from existing features.
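As an illustration, the sketch below derives two new features from raw customer fields. The records and the feature names `tenure_days` and `orders_per_30_days` are hypothetical; the domain knowledge assumed here is that order frequency may say more about a customer than the raw order count.

```python
from datetime import date

# Hypothetical raw customer records.
customers = [
    {"name": "Ali", "signup": date(2022, 1, 15), "last_order": date(2023, 1, 15), "orders": 12},
    {"name": "Sara", "signup": date(2023, 3, 1), "last_order": date(2023, 3, 31), "orders": 3},
]

def add_features(customer):
    """Return a copy of the record with two derived features added."""
    tenure_days = (customer["last_order"] - customer["signup"]).days
    enriched = dict(customer)
    enriched["tenure_days"] = tenure_days
    # Orders per 30 days of tenure; guard against division by zero.
    if tenure_days > 0:
        enriched["orders_per_30_days"] = round(customer["orders"] * 30 / tenure_days, 2)
    else:
        enriched["orders_per_30_days"] = 0.0
    return enriched

features = [add_features(c) for c in customers]
for row in features:
    print(row["name"], row["tenure_days"], row["orders_per_30_days"])
```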
4. Data Exploration: Data exploration is the first step in data analysis. It uses data visualization tools and statistical techniques to uncover the characteristics of a data set, such as size, dimensions, and accuracy, along with initial patterns, in order to better understand the nature of the data.
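The statistical side of exploration can be sketched with Python's standard library. The dataset below is hypothetical, and visualization is left out to keep the example self-contained.

```python
from statistics import mean, median, stdev

# Hypothetical dataset: each row is (age, monthly_spend).
data = [(25, 120.0), (31, 95.0), (40, 150.0), (22, 80.0), (35, 130.0)]

ages = [row[0] for row in data]
spend = [row[1] for row in data]

# Size, dimensions, and basic statistics of each column.
summary = {
    "rows": len(data),
    "columns": len(data[0]),
    "age_min": min(ages),
    "age_max": max(ages),
    "age_mean": mean(ages),
    "spend_median": median(spend),
    "spend_stdev": stdev(spend),
}

for key, value in summary.items():
    print(key, value)
```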
5. Data Modeling: Choose a model that suits the data and the problem. Machine learning models fall into different types, such as supervised, unsupervised, and reinforcement learning.
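To make the supervised case concrete, here is a minimal sketch of a nearest-centroid classifier, chosen only because it fits in a few lines. The training data, the two input features, and the labels "stays" and "leaves" are hypothetical.

```python
# Hypothetical labelled training data: (monthly_orders, support_tickets) -> label.
training = [
    ((10, 0), "stays"), ((8, 1), "stays"), ((9, 0), "stays"),
    ((1, 4), "leaves"), ((2, 5), "leaves"), ((0, 3), "leaves"),
]

def fit(examples):
    """Learn one centroid (mean point) per label from labelled examples."""
    sums, counts = {}, {}
    for (x, y), label in examples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    px, py = point

    def squared_distance(label):
        cx, cy = centroids[label]
        return (cx - px) ** 2 + (cy - py) ** 2

    return min(centroids, key=squared_distance)

model = fit(training)
print(predict(model, (7, 1)))
print(predict(model, (1, 5)))
```

The model is supervised because it learns from examples that already carry the correct label; an unsupervised model would have to find the two groups without them.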
6. Evaluation: Model evaluation is the process of using different evaluation metrics to understand a
machine learning model's performance, as well as its strengths and weaknesses.
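Three widely used evaluation metrics for a classifier are accuracy, precision, and recall. The sketch below computes them by hand; the true labels and predictions are hypothetical.

```python
# Hypothetical true labels and model predictions (1 = customer leaves, 0 = stays).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found

print(accuracy, precision, recall)
```

Looking at more than one metric exposes strengths and weaknesses that accuracy alone can hide, for example a model that rarely raises false alarms but misses many real cases.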
7. Model Deployment: Machine learning model deployment is the process of putting a fully trained machine learning model into a live environment where it can be used for its intended purpose.
8. Monitoring and Maintenance: Monitoring and maintenance are actions intended to ensure that the deployed model continues to meet its objectives over time.
Benefits of Data Preparation
Produce top-quality data: Cleaning and reformatting datasets ensures that all data used in the analysis is of high quality.
Make better business decisions: Higher-quality data can be processed and analyzed more quickly and efficiently, which leads to more timely and higher-quality business decisions.
Reduce loss of money: Data preparation helps catch errors before processing.
Save time
Improve business reputation