
Data Preparation in Data Science

The document outlines the importance and steps of data preparation in data science, emphasizing the need for cleaning, integrating, and transforming raw data before analysis. It details various activities involved in data preparation, such as data cleaning, integration, transformation, reduction, discretization, and feature engineering. The benefits of effective data preparation include improved data quality, better business decisions, cost savings, and enhanced business reputation.

Uploaded by

Marium Zehra

Data Science

Lecture no: 02
Data Preparation
Instructor: Dr. Khalid Mahboob
Email-id: [Link]@[Link]
Outline

 Data Preparation

 Importance of Data Preparation

 Good Quality

 Data Preparation and Model training Steps

 Benefits of Data Preparation


Data Preparation

 Data preparation is the process of cleaning data by reformatting it, correcting errors, and combining data sets.

 In other words, data preparation is the cleaning and transforming of raw data before processing and analysis.

 Data preparation is also sometimes called “pre-processing.”


Importance of Data Preparation

 Data should be formatted according to the requirements of the software tool that will process it.

 A machine learning algorithm follows the rules it learns from the data it is given (it learns like a child).

 Data in the real world is dirty.

 Incomplete: some records lack attribute values, lack certain attributes of interest, or contain only aggregate data.

 For example, First name = “” or Last name = “”

 Noisy: the data contains errors or implausible values, which may be due to human error during data entry.

 For example, Age = -10
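
The two kinds of dirty data above can be flagged with a few simple checks. This is a minimal sketch only: the records, the field names (first_name, last_name, age), and the plausible age range of 0 to 120 are assumptions made for illustration.

```python
# Hypothetical records; the field names and values are illustrative assumptions.
records = [
    {"first_name": "Ali", "last_name": "Khan", "age": 34},
    {"first_name": "", "last_name": "Ahmed", "age": 27},       # incomplete
    {"first_name": "Sara", "last_name": "Malik", "age": -10},  # noisy
]


def find_issues(record):
    """Return a list of data-quality problems found in one record."""
    issues = []
    for field in ("first_name", "last_name"):
        # An empty or whitespace-only name counts as incomplete data.
        if not (record.get(field) or "").strip():
            issues.append(f"incomplete: {field} is empty")
    age = record.get("age")
    if age is None:
        issues.append("incomplete: age is missing")
    elif not 0 <= age <= 120:  # assumed plausible range
        issues.append(f"noisy: age={age} is out of range")
    return issues


# Map each record's position to the problems found in it.
report = {index: find_issues(record) for index, record in enumerate(records)}
```

Records with an empty list in report are clean; the others need to be fixed or removed before analysis.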


Good Quality

?
GOOD QUALITY
 MERCEDES-BENZ OR SUZUKI MEHRAN
Good Quality

 Two Views of Quality Definition

 Popular View: Quality is directly related to CLASS.

 Technical View: To meets Customer Level of Satisfaction within Time Budget and Scope.
Data Preparation and Model training Steps

1. Problem Definition

2. Data Collection

3. Data Preparation

4. Data Exploration

5. Data Modeling

6. Evaluation

7. Deployment

8. Monitoring and Maintenance


Data Preparation and Model training Steps

1. Problem Definition: The first step is to clearly define the problem you are going to solve.

 For example, the company Daraz is losing customers.

 Understand the root causes: the problem behind the problem.

 Identify the stakeholders and the users: understand the market and customers’ needs and wants.

 Explain the constraints imposed on the solution.


Data Preparation and Model training Steps

2. Data Collection: The next step is to collect the right set of data from various sources.
Data Preparation and Model training Steps

3. Data Preparation: Data preparation is the process of collecting, integrating, structuring, and organizing data so that it can be used in business.

 Data preparation involves many activities that can be performed in different ways.
• Data cleaning: fixing incomplete or erroneous data
• Data integration: unifying data from different sources
• Data transformation: formatting the data
• Data reduction: reducing data to its simplest form
• Data discretization: reducing the number of values to make data management easier
• Feature engineering: selecting and transforming variables to work better with machine learning
Data Preparation and Model training Steps

3. Data Preparation:

 Cleaning data: Most of the data gathered during the collection phase will be unstructured, irrelevant, or unfiltered.

 Cleaning eliminates duplicates, fills in or deletes missing values, and reduces outliers.
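
The three cleaning actions can be sketched in a few lines. The rows, the column names, the choice of the median as the fill value, and the 0 to 120 clipping range are all assumptions for illustration; other fill and outlier strategies are equally valid.

```python
from statistics import median

# Hypothetical rows; the column names and values are illustrative assumptions.
rows = [
    {"id": 1, "age": 25},
    {"id": 1, "age": 25},    # exact duplicate
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 31},
    {"id": 4, "age": 240},   # outlier
]


def clean(rows, low=0, high=120):
    """Drop exact duplicates, fill missing ages, and clip outliers."""
    # 1. Eliminate exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    # 2. Fill missing ages with the median of the known, in-range ages.
    known = [r["age"] for r in unique
             if r["age"] is not None and low <= r["age"] <= high]
    fill_value = median(known)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill_value
        # 3. Reduce outliers by clipping to the assumed valid range.
        row["age"] = min(max(row["age"], low), high)
    return unique


cleaned = clean(rows)
```

Deleting the row with the missing value, instead of filling it, is the other option the slide mentions; which one is appropriate depends on how much data can be spared.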
Data Preparation and Model training Steps

3. Data Preparation:
 Data integration: Combine data collected from different sources and in different formats (CSV, TSV, text, etc.) into a single coherent dataset.
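
As a rough illustration, the sketch below integrates one CSV source and one TSV source by joining them on a shared key. The two in-memory "files", the column names, and the join key customer_id are hypothetical.

```python
import csv
import io

# Two hypothetical sources: one CSV, one TSV (inline strings stand in for files).
customers_csv = "customer_id,name\n1,Ali\n2,Sara\n"
orders_tsv = "customer_id\tamount\n1\t250\n2\t100\n1\t75\n"


def read_table(text, delimiter):
    """Parse delimited text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Index customers by key so each order can be matched to its customer.
customers = {row["customer_id"]: row for row in read_table(customers_csv, ",")}
orders = read_table(orders_tsv, "\t")

# Join the two sources into one coherent dataset.
combined = [
    {
        "customer_id": order["customer_id"],
        "name": customers[order["customer_id"]]["name"],
        "amount": float(order["amount"]),
    }
    for order in orders
    if order["customer_id"] in customers
]
```

Orders whose customer_id has no match are dropped here; keeping them with an empty name would be an equally reasonable design choice.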
Data Preparation and Model training Steps

3. Data Preparation:

 Data transformation is the process of converting data from one format or structure into another.
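
A small sketch of both kinds of change: a format change (text dates and prices become standard values) and a structure change (rows become columns). The input rows and their DD/MM/YYYY date format are assumptions.

```python
from datetime import datetime

# Hypothetical raw rows: dates as DD/MM/YYYY text, prices as text with commas.
raw = [
    {"date": "05/03/2024", "price": "1,200"},
    {"date": "17/11/2023", "price": "950"},
]


def transform(row):
    """Change the format: ISO dates and integer prices."""
    return {
        "date": datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
        "price": int(row["price"].replace(",", "")),
    }


transformed = [transform(row) for row in raw]

# Change the structure: from a list of rows to a dictionary of columns.
columns = {key: [row[key] for row in transformed] for key in transformed[0]}
```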
Data Preparation and Model training Steps

3. Data Preparation:

 Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume while preserving the integrity of the data.

 Dimension reduction is one such technique.

 For example: going from 2D to 1D, or removing duplicate data.
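
Both examples can be sketched together. The slides do not say how the 2D to 1D step is done; the sketch below assumes a principal-component projection, which is one common choice, and the points are made up for illustration.

```python
import math

# Hypothetical 2D points lying on the line y = 2x, with one duplicate.
points = [(1.0, 2.0), (2.0, 4.0), (2.0, 4.0), (3.0, 6.0)]

# Reduce the number of rows: drop duplicates while keeping the order.
unique_points = list(dict.fromkeys(points))


def project_to_1d(pts):
    """Project 2D points onto their first principal axis (assumed technique)."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    centered = [(x - mean_x, y - mean_y) for x, y in pts]
    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the direction with the largest variance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Each point is now described by one coordinate along that direction.
    return [x * ux + y * uy for x, y in centered]


reduced = project_to_1d(unique_points)
```

Because these points lie exactly on a line, the single coordinate keeps all of the information; with real data some information is lost, and the aim is to keep that loss small.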


Data Preparation and Model training Steps

3. Data Preparation:

 Data discretization is the process of converting a large number of attribute values into a finite set of intervals with minimal loss of information, and associating each interval with a specific data value or conceptual label.

 For example, Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Attribute (Age)            After discretization
1, 5, 4, 9, 7              Child
11, 14, 17, 13, 18, 19     Young
31, 33, 36, 42, 44, 46     Mature
70, 74, 77, 78             Old
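
The age example can be reproduced with a few cut points. The slides give only the resulting groups, so the exact boundaries used here (10, 30, and 60) are assumptions chosen to be consistent with them.

```python
from bisect import bisect_right

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
        31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

# Assumed interval boundaries; the slides show only the resulting groups.
CUT_POINTS = [10, 30, 60]
LABELS = ["Child", "Young", "Mature", "Old"]


def discretize(age):
    """Map a numeric age to the label of the interval it falls into."""
    return LABELS[bisect_right(CUT_POINTS, age)]


# Group the original ages under their conceptual labels.
groups = {label: [] for label in LABELS}
for age in ages:
    groups[discretize(age)].append(age)
```

After this step, 21 distinct numeric values are represented by only 4 labels.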


Data Preparation and Model training Steps

3. Data Preparation:

 Feature engineering is the process of creating or extracting features from data using domain knowledge.

 Features can be created from raw data or from existing data.
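
As an illustration for the customer-loss problem, the sketch below derives features from a raw customer record. The record layout and every feature name are hypothetical, and the domain knowledge assumed is that inactive customers are more likely to leave.

```python
from datetime import date

# Hypothetical raw record for one customer.
customer = {
    "signup": date(2023, 1, 15),
    "last_order": date(2024, 3, 2),
    "order_amounts": [250.0, 100.0, 75.0],
}


def engineer_features(record, today):
    """Create new features from the raw fields of one customer record."""
    amounts = record["order_amounts"]
    return {
        "tenure_days": (today - record["signup"]).days,
        "days_since_last_order": (today - record["last_order"]).days,
        "order_count": len(amounts),
        "avg_order_value": sum(amounts) / len(amounts) if amounts else 0.0,
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "last_order_on_weekend": record["last_order"].weekday() >= 5,
    }


features = engineer_features(customer, today=date(2024, 4, 1))
```

None of these features exist in the raw record; each one is computed from it so that a model can use it directly.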


Data Preparation and Model training Steps

4. Data Exploration: Data exploration is the first step in data analysis. It uses data visualization tools and statistical techniques to uncover data set characteristics and initial patterns, such as size, dimensions, and accuracy, in order to better understand the nature of the data.
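
The statistical side of exploration can start with a simple summary of size, dimensions, and missing values. The rows and column names below are made up for illustration.

```python
from statistics import mean

# Hypothetical data set; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": 31, "income": None},
    {"age": 47, "income": 82000},
    {"age": 38, "income": 56000},
]


def summarize(rows):
    """Report the size, dimensions, and per-column statistics of a data set."""
    columns = list(rows[0])
    summary = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}
    for column in columns:
        values = [row[column] for row in rows if row[column] is not None]
        summary["columns"][column] = {
            "missing": len(rows) - len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
    return summary


summary = summarize(rows)
```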
Data Preparation and Model training Steps

5. Data Modeling: Choose the model according to the data, as there are different types of models in machine learning, such as supervised, unsupervised, and reinforcement learning models.
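
As one small illustration of a supervised model, the sketch below uses a 1-nearest-neighbour classifier, chosen here only because it fits in a few lines. The training examples, their two numeric features, and the labels "stays" and "leaves" are hypothetical.

```python
def predict_1nn(train, query):
    """Return the label of the training example closest to the query."""
    nearest = min(
        train,
        # Squared Euclidean distance between feature vectors.
        key=lambda example: sum((a - b) ** 2 for a, b in zip(example[0], query)),
    )
    return nearest[1]


# Supervised learning needs labeled data: (features, label) pairs.
train = [
    ((1.0, 1.0), "stays"),
    ((1.5, 2.0), "stays"),
    ((8.0, 9.0), "leaves"),
    ((9.0, 8.5), "leaves"),
]

prediction = predict_1nn(train, (2.0, 1.5))
```

If the data had no labels, a supervised model like this one could not be used, and an unsupervised method such as clustering would be the natural choice instead.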


Data Preparation and Model training Steps

6. Evaluation: Model evaluation is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses.
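
For a binary classifier, four common metrics can be computed directly from the true and predicted labels. The labels below are made up for illustration, and the slides do not say which metrics the course uses.

```python
# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]


def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = evaluate(y_true, y_pred)
```

Precision and recall show different weaknesses: low precision means many false alarms, while low recall means many missed positives.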


Data Preparation and Model training Steps

7. Model Deployment: Machine learning model deployment is the process of placing a fully trained machine learning model into a live environment where it can be used for its intended purpose.
Data Preparation and Model training Steps

8. Monitoring and Maintenance: Monitoring and maintenance are actions intended to ensure that the deployed model continues to meet the objectives of the project over time.
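
For a deployed model, one simple form of monitoring is to track accuracy over the most recent predictions and raise a flag when it falls below a threshold. The class name, window size, and threshold below are hypothetical choices, not a standard API.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over a sliding window of recent predictions."""

    def __init__(self, window=100, threshold=0.8):
        # The deque discards the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, actual):
        """Store whether one live prediction turned out to be correct."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True when recent accuracy has dropped below the threshold."""
        current = self.accuracy()
        return current is not None and current < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
healthy_before = not monitor.needs_attention()

# Two wrong predictions push recent accuracy down to 0.5.
for predicted, actual in [(1, 0), (0, 1)]:
    monitor.record(predicted, actual)
```

When the flag is raised, the maintenance step is typically to investigate the incoming data and retrain the model if needed.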


Benefits of Data Preparation

 Produce top-quality data: cleaning and reformatting datasets ensures that all data used in the analysis is of high quality.

 Make better business decisions: higher-quality data that can be processed and analyzed more quickly and efficiently leads to more timely, efficient, and high-quality business decisions.

 Reduce financial losses: data preparation helps catch errors before processing.

 Save time.

 Improve the business's reputation.
