
Data cleaning

Cleaning techniques
Please do not copy without permission. © ALX 2024.
Data overview

| We will use a dataset containing the prices of maize in three major cities in Kenya from
January to March 2022.

01. The last column in the dataset, namely exchrate, contains values calculated by dividing the price column values by the usdprice column values.

02. At the bottom of the dataset, we calculated the sum and count of the price, usdprice, and exchrate columns, and these results were used to calculate the average of each column.

A short code sketch of this setup follows below.
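As an illustration only, here is a minimal pandas sketch of this setup. The column names price, usdprice, and exchrate come from the slides; the sample values are invented.

```python
import pandas as pd

# Hypothetical maize prices; real values would come from the dataset.
df = pd.DataFrame({
    "price":    [3500.0, 3600.0, 3450.0],   # price in KES (assumed)
    "usdprice": [30.43, 31.30, 30.00],      # price in USD (assumed)
})

# 01. exchrate is derived by dividing price by usdprice.
df["exchrate"] = df["price"] / df["usdprice"]

# 02. Sum, count, and mean of each column, mirroring the rows
#     calculated at the bottom of the sheet.
summary = df[["price", "usdprice", "exchrate"]].agg(["sum", "count", "mean"])
print(summary)
```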
What is data cleaning?

| Data cleaning is the process of identifying and correcting erroneous data within a
dataset. This improves the quality of our data in preparation for analysis.

What qualifies as erroneous data?

Data collected from different sources and in different ways are bound to have mistakes, such as being:

● Incorrect
● Incomplete
● Duplicated
● Irrelevant
● Wrongly formatted

Common data cleaning tasks

Data cleaning is not just about deleting bad data but trying to achieve optimal accuracy by correcting what we can.

● Removing duplicate observations
● Removing irrelevant observations
● Handling missing data
● Filtering unwanted outliers
● Fixing structural issues

Data practitioners spend 60% to 80% of their time getting their data ready through cleaning.
Why data cleaning is important

| Data-driven decisions have become increasingly popular within organizations. As a result,
there is a great need for clean and accurate data that can provide reliable insights.

Our analysis results are directly affected by the quality of our data, regardless of how closely we follow conventional analysis processes.

An analysis conducted with erroneous data may lead to wrong results. Basing business decisions and strategies on these results may do more harm than good.

Benefits of data cleaning

● Prevents time-consuming and costly fixes due to decisions brought about by bad data.
● Makes the analysis more efficient since the data is well organized.
● Tidy data enable more organized and effective storage.
● Helps to avoid mistakes during daily business operations involving the data.
● Reduces clutter by getting rid of unwanted data.
Missing data

| Missing data refers to data points that are incomplete or do not contain any value.

Why it’s bad for our data

● Can affect how the data will be interpreted and lead to biased results, as conclusions are not based on all values.
● Can reduce the statistical power of the analysis.

Possible causes

● Data entry errors
● Incomplete survey responses
● Error during data collection

Missing data can be represented in different ways, depending on how the data were collected and the software being used. It is often represented by a blank, null, or NaN.
NaNs vs nulls

| Both NaNs and nulls are used to represent missing data, but they represent slightly different values.

Nulls:

Nulls represent the complete absence of any value. In Google Sheets, null values are indicated by empty or blank cells.

01. Null values are present in the dataset.

02. Our calculations below are therefore not inclusive of all the observations in our dataset, since the functions used are built to ignore blank cells.
NaNs:

NaNs (Not a Number) are used to represent unrepresentable calculation results.

In Google Sheets, a NaN value appears as an error returned by a formula when the result is not a valid numerical value. In a cell, this can be displayed as “#VALUE!” where a formula attempts to perform a mathematical operation on non-numeric values, or “#DIV/0!” where it attempts to divide by zero, including dividing zero by zero.

01. NaN values brought about by dividing zero by zero.

02. The presence of these NaNs in the column has affected some of our calculations, which are also returning NaN values.

A brief illustration of the same behaviour in code follows below.
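As a rough parallel outside Google Sheets, the following Python sketch (assuming pandas and numpy) shows the same distinction: None plays the role of a null, while an invalid calculation such as dividing zero by zero produces a NaN.

```python
import numpy as np
import pandas as pd

# A null (None) in a numeric Series is stored as NaN and skipped by
# aggregation functions, much like Sheets functions ignore blank cells.
s = pd.Series([3500.0, None, 3600.0])
print(s.sum())    # 7100.0 -> the missing value is ignored
print(s.count())  # 2      -> only non-missing values are counted

# Dividing zero by zero yields NaN instead of a valid number,
# analogous to #DIV/0! appearing in a Sheets cell.
with np.errstate(invalid="ignore"):
    print(np.float64(0.0) / np.float64(0.0))  # nan
```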
Handling missing data

| During data cleaning, we search for and deal with missing values using various strategies.
The appropriate strategy depends on the type and amount of missing data.

01. Imputation

The process of filling in the blanks using an estimated value. This can be achieved in several ways:

● Using the median or mean.
● Copying data from a similar dataset.
● Using domain knowledge.
● Using linear regression.

Limitation: The accuracy of imputed values can’t be guaranteed, and they could further confuse the results by reinforcing false patterns within the data.

02. Flag as missing

Missing data may be informative in itself. We can inform our analysis of missing values by filling them in with a uniform value such as ‘null’.

03. Drop observations with missing values

We can also choose to delete observations with missing values, especially if they contain too many blanks to be useful.

Limitation: We may end up losing valuable information, especially if we delete too much data.

A sketch of all three strategies in code follows below.
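Here is a hedged sketch of the three strategies using pandas; the columns and values are hypothetical, not taken from the maize dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "price":  [3500.0, None, 3600.0, None, 3450.0],             # hypothetical
    "market": ["Nairobi", "Mombasa", None, "Kisumu", "Nairobi"],
})

# 01. Imputation: fill numeric blanks with an estimate such as the median.
imputed = df.assign(price=df["price"].fillna(df["price"].median()))

# 02. Flag as missing: fill blanks with a uniform marker value.
flagged = df.assign(market=df["market"].fillna("missing"))

# 03. Drop observations: remove any row that still contains a blank.
dropped = df.dropna()

print(imputed, flagged, dropped, sep="\n\n")
```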
Duplicate observations

| Duplicate observations are entries that have been repeated in the dataset.

Why it’s bad for our data

● Can lead to biased results.
● Can lead to an excessively large dataset, which is difficult to deal with and wastes time and storage space.
● Can skew the data, causing inaccurate and confusing results.
● Can make visualizations difficult to read.

Possible causes

● Data combined from multiple sources
● Error during data entry
● System glitches
Duplicate observations often occur when two or more observations have identical values for all or most of the variables in the dataset.

In this example, we see that the identified duplicate observations are exact duplicates, and we can simply remove the second occurrence to clean the dataset.

01. These particular observations have been duplicated.

02. Therefore, our calculations below are not accurate, as they include the duplicates. For instance, our ‘Count’ values are in excess by 2 because of counting the 2 duplicate entries.
Removing duplicate observations

| Deduplication is the process of identifying and getting rid of the duplicate records using
various strategies, including deleting and merging.

Steps to deduplication

01. Search for the duplicate values in Google Sheets using:
   a. The filter option in the toolbar.
   b. Data > Column stats in the toolbar.
   c. A conditional statement in a formula.
   d. Mixed references and comparisons in a formula.

02. Handle the duplicate observations by:
   a. Removing the duplicate, retaining only the first or last occurrence.
   b. Merging the duplicate observations.

03. Both steps 01. and 02. can be done manually, where we inspect duplicates one by one, or automatically, using built-in data cleaning functions in software (see the sketch after these steps).

Be careful when removing seemingly duplicate observations. Ensure that they do not represent distinct cases.
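As one programmatic counterpart to these steps (the slides describe the Google Sheets route), here is a small pandas sketch on invented data.

```python
import pandas as pd

df = pd.DataFrame({
    "market": ["Nairobi", "Mombasa", "Nairobi", "Kisumu"],
    "price":  [3500.0, 3600.0, 3500.0, 3450.0],
})

# Step 01: flag duplicates (True marks a repeat of an earlier row).
print(df.duplicated())

# Step 02a: remove duplicates, retaining only the first occurrence
# (keep="last" would retain the last instead).
deduped = df.drop_duplicates(keep="first")

# Step 02b: merge duplicates instead, e.g. by averaging repeated markets.
# Check first that merged rows really are the same case, not distinct ones.
merged = df.groupby("market", as_index=False)["price"].mean()
```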
Unwanted outliers

| Outliers are unusual data points that differ significantly from the rest of the values in the
dataset.

Why it’s bad for our data

● Can skew the analysis results towards the outliers.
● Can distort the distribution of the data.
● Can affect the readability of results.

Possible causes

● Sampling errors
● Measurement errors
● Data entry errors
● Natural variation in the data

Although outliers may provide valuable insights into data, unwanted outliers can
significantly impact the accuracy and validity of our analysis results.
Can you identify a probable cause for the outlier in the example? Tip: Consider the other features in the dataset.

01. This value seems to be significantly higher than the rest, which means it is an outlier.

02. This means that our sum and average results have been skewed towards this outlier. We need to investigate the reason for this and find an appropriate way to handle the outlier.
Filtering unwanted outliers

| During data cleaning, we remove unwanted outliers that may affect the accuracy of our
analysis.

Not all outliers are errors. Outliers can provide valuable insights into the data. It’s important to first determine the validity of removing an outlier.

01. Identifying the outliers

● Plot the data for visual inspection, using a scatter plot, box plot, or histogram.
● Use statistical measures that are based on the distribution of the data, such as the interquartile range (see the sketch after this list).
● Use domain knowledge to determine validity.

02. Removing the outliers

● Delete the entire observation containing the outlier, if the observation is clearly an error.
● Replace the outlier value with a more accurate or representative value, such as the mean.
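The sketch below shows one common statistical measure mentioned above, an interquartile-range filter, in pandas; the 1.5 multiplier is the conventional rule of thumb, and the values are invented.

```python
import pandas as pd

prices = pd.Series([3450.0, 3500.0, 3520.0, 3480.0, 3550.0, 35000.0])  # last value looks suspect

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
filtered = prices[(prices >= lower) & (prices <= upper)]

# Flagged values should be investigated, not deleted automatically.
print(outliers)
```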
Irrelevant observations

| Irrelevant observations are observations that are not useful to the context or problem to be
solved and therefore do not contribute to our analysis.

Why it’s bad for our data

● Can make our dataset unnecessarily large, which makes it less manageable and efficient to work with.
● Can reduce the accuracy and effectiveness of our analysis.
● Can introduce bias.

Possible causes

● Unnecessary data picked up during data collection/web scraping
● Data entry errors
● Merging of multiple datasets

Whether an observation or feature is irrelevant, and whether or not it should be removed, depends on the size and complexity of the dataset and the problem to be solved.
In this example, the irrelevant observations and features are easy to identify and simple to remove because no other features are dependent on them.

However, in many cases it will require greater insight to identify these irrelevant observations and to determine possible interdependency.

01. The second row in our data seems to contain some additional metadata that does not add any value to our data.

02. The currency column is also unnecessary. Since our dataset contains maize prices in Kenya, it is unlikely that we will have differing currencies.

A small code sketch of removing both follows below.
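A minimal sketch of removing both items in pandas, assuming a metadata row at index 0 and a constant currency column; the names and layout are hypothetical, mirroring the example rather than a real file.

```python
import pandas as pd

df = pd.DataFrame({
    "market":   ["source: web scrape", "Nairobi", "Mombasa"],  # row 0 is metadata
    "price":    [None, 3500.0, 3600.0],
    "currency": ["KES", "KES", "KES"],  # constant, adds no information
})

df = df.drop(index=0)               # 01. drop the irrelevant metadata row
df = df.drop(columns=["currency"])  # 02. drop the unnecessary currency column
print(df)
```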
Removing irrelevant observations

| We remove irrelevant observations such that we are only left with data that are necessary
to our analysis.

The appropriate strategy for handling irrelevant observations and features depends on the goals of our analysis and the structure of the data.

In some cases, removing irrelevant observations may be appropriate, while in other cases, they may be retained but not used in the analysis.

We need to be careful about the criteria used to identify irrelevant observations, so that no biases are introduced.

It is important to note that additional observations and features often contribute to the story we are trying to tell with the data, even if they do not influence the analysis.
Structural issues

| Structural errors are issues within the data such as inconsistent naming conventions,
inconsistent formatting, typos, incorrect capitalization, and inconsistent data types.

Why it’s bad for our data

● Can cause the data to be processed incorrectly and thus give incorrect or biased results.
● Can lead to mislabeled categories, which may cause us to miss out on or misinterpret key findings.

Possible causes

● Data entry errors
● Incomplete survey responses
● Error during data collection
● Incorrect importing of data
● Formula errors

The severity of structural errors depends on the data structure, how the data are being used in analysis, and the specific type of structural error.
The structural issues in this example did not influence our sum, count, and average analysis. However, this is not always the case.

Incorrect file imports into Google Sheets often result in structural issues due to varying data types, inconsistent delimiters, etc.

01. In the date column, we have different date formats. This is not only untidy but may also cause confusion in our analysis.

02. For the market column, inconsistent naming conventions have been used. We have some rows where the full city name has been used, while others have the abbreviated form of the name. These will be treated as different categories even though they represent the same thing.
Fixing structural issues

| Depending on the type of structural issue, we can apply various strategies to correct it,
including correcting data entries and imports, converting to consistent formats, and
imputation.

01. Standardise the data and structure

This involves keeping the data consistent throughout the dataset, for example, lowercase, uppercase, naming conventions, measurements, date formats, padding for string size, etc.

02. Convert to the correct data type

This ensures that each column has the appropriate data type and that every value will be processed appropriately.

03. Spell check to correct typos

We can do this by inspecting and correcting typos manually, or automatically using programming or spelling and grammar tools.

These are only some of the ways to fix structural issues. Often, some of our other data cleaning strategies will also help with fixing structural issues. A sketch of the first two strategies in code follows below.
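As an illustration, here is a hedged pandas sketch of the first two strategies on invented data: standardising names and date formats, then converting data types. Note that format="mixed" requires pandas 2.x.

```python
import pandas as pd

df = pd.DataFrame({
    "market": ["Nairobi", "NRB", "Mombasa", "nairobi"],                  # mixed conventions
    "date":   ["2022-01-15", "15/02/2022", "2022-03-01", "01/03/2022"],  # mixed formats
    "price":  ["3500", "3600", "3450", "3550"],                          # numbers stored as text
})

# 01. Standardise names: consistent case, then map known abbreviations.
df["market"] = df["market"].str.title().replace({"Nrb": "Nairobi"})

# 01. Standardise dates: parse the mixed formats into a single datetime
#     dtype (format="mixed" is available in pandas 2.x).
df["date"] = pd.to_datetime(df["date"], format="mixed", dayfirst=True)

# 02. Convert to the correct data type: price as a number, not a string.
df["price"] = df["price"].astype(float)

print(df.dtypes)
```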
