
Data cleaning

Cleaning techniques
Please do not copy without permission. © ALX 2024.
Data overview

| We will use a dataset containing the prices of maize in three major cities in Kenya from
January to March 2022.

01. The last column in the dataset, namely exchrate, contains values calculated by dividing the price column values by the usdprice column values.

02. At the bottom of the dataset, we calculated the sum and count of the price, usdprice, and exchrate columns, and these results were used to calculate the average of each column.

A short code sketch of this setup follows below.
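As an illustration only, here is a minimal pandas sketch of this setup. The column names price, usdprice, and exchrate come from the slides; the sample values are invented.

```python
import pandas as pd

# Hypothetical maize prices; real values would come from the dataset.
df = pd.DataFrame({
    "price":    [3500.0, 3600.0, 3450.0],   # price in KES (assumed)
    "usdprice": [30.43, 31.30, 30.00],      # price in USD (assumed)
})

# 01. exchrate is derived by dividing price by usdprice.
df["exchrate"] = df["price"] / df["usdprice"]

# 02. Sum, count, and mean of each column, mirroring the rows
#     calculated at the bottom of the sheet.
summary = df[["price", "usdprice", "exchrate"]].agg(["sum", "count", "mean"])
print(summary)
```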
What is data cleaning?

| Data cleaning is the process of identifying and correcting erroneous data within a
dataset. This improves the quality of our data in preparation for analysis.

What qualifies as erroneous data?

Data collected from different sources and in different ways are bound to have mistakes, such as being:

● Incorrect
● Incomplete
● Duplicated
● Irrelevant
● Wrongly formatted

Common data cleaning tasks

Data cleaning is not just about deleting bad data but trying to achieve optimal accuracy by correcting what we can.

● Removing duplicate observations
● Removing irrelevant observations
● Handling missing data
● Filtering unwanted outliers
● Fixing structural issues

Data practitioners spend 60% to 80% of their time getting their data ready through cleaning.
Why data cleaning is important

| Data-driven decisions have become increasingly popular within organizations. As a result,
there is a great need for clean and accurate data that can provide reliable insights.

Our analysis results are directly affected by the quality of our data, regardless of how closely we follow conventional analysis processes.

An analysis conducted with erroneous data may lead to wrong results. Basing business decisions and strategies on these results may do more harm than good.

Benefits of data cleaning

● Prevents time-consuming and costly fixes due to decisions brought about by bad data.
● Makes the analysis more efficient since the data is well organized.
● Tidy data enable more organized and effective storage.
● Helps to avoid mistakes during daily business operations involving the data.
● Reduces clutter by getting rid of unwanted data.
Missing data

| Missing data refers to data points that are incomplete or do not contain any value.

Why it’s bad for our data

● Can affect how the data will be interpreted and lead to biased results, as conclusions are not based on all values.
● Can reduce the statistical power of the analysis.

Possible causes

● Data entry errors
● Incomplete survey responses
● Error during data collection

Missing data can be represented in different ways, depending on how the data were collected and the software being used. It is often represented by a blank, null, or NaN.
NaNs vs nulls

| Both NaNs and nulls are used to represent missing data, but they represent slightly different values.

Nulls:

Nulls represent the complete absence of any value. In Google Sheets, null values are indicated by empty or blank cells.

01. Null values are present in the dataset.

02. Our calculations below are therefore not inclusive of all the observations in our dataset, since the functions used are built to ignore blank cells.
NaNs:

NaNs (Not a Number) are used to represent unrepresentable calculation results.

In Google Sheets, a NaN value appears as an error returned by a formula when the result is not a valid numerical value. In a cell, this can be displayed as “#VALUE!” where a formula attempts to perform a mathematical operation on non-numeric values, or “#DIV/0!” where it attempts to divide by zero, including dividing zero by zero.

01. NaN values brought about by dividing zero by zero.

02. The presence of these NaNs in the column has affected some of our calculations, which are also returning NaN values.

A brief illustration of the same behaviour in code follows below.
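As a rough parallel outside Google Sheets, the following Python sketch (assuming pandas and numpy) shows the same distinction: None plays the role of a null, while an invalid calculation such as dividing zero by zero produces a NaN.

```python
import numpy as np
import pandas as pd

# A null (None) in a numeric Series is stored as NaN and skipped by
# aggregation functions, much like Sheets functions ignore blank cells.
s = pd.Series([3500.0, None, 3600.0])
print(s.sum())    # 7100.0 -> the missing value is ignored
print(s.count())  # 2      -> only non-missing values are counted

# Dividing zero by zero yields NaN instead of a valid number,
# analogous to #DIV/0! appearing in a Sheets cell.
with np.errstate(invalid="ignore"):
    print(np.float64(0.0) / np.float64(0.0))  # nan
```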
Handling missing data

| During data cleaning, we search for and deal with missing values using various strategies.
The appropriate strategy depends on the type and amount of missing data.

01. Imputation

The process of filling in the blanks using an estimated value. This can be achieved in several ways:

● Using the median or mean.
● Copying data from a similar dataset.
● Using domain knowledge.
● Using linear regression.

Limitation: The accuracy of imputed values can’t be guaranteed, and they could further confuse the results by reinforcing false patterns within the data.

02. Flag as missing

Missing data may be informative in itself. We can inform our analysis of missing values by filling them in with a uniform value such as ‘null’.

03. Drop observations with missing values

We can also choose to delete observations with missing values, especially if they contain too many blanks to be useful.

Limitation: We may end up losing valuable information, especially if we delete too much data.

A sketch of all three strategies in code follows below.
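Here is a hedged sketch of the three strategies using pandas; the columns and values are hypothetical, not taken from the maize dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "price":  [3500.0, None, 3600.0, None, 3450.0],             # hypothetical
    "market": ["Nairobi", "Mombasa", None, "Kisumu", "Nairobi"],
})

# 01. Imputation: fill numeric blanks with an estimate such as the median.
imputed = df.assign(price=df["price"].fillna(df["price"].median()))

# 02. Flag as missing: fill blanks with a uniform marker value.
flagged = df.assign(market=df["market"].fillna("missing"))

# 03. Drop observations: remove any row that still contains a blank.
dropped = df.dropna()

print(imputed, flagged, dropped, sep="\n\n")
```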
Duplicate observations

| Duplicate observations are entries that have been repeated in the dataset.

Why it’s bad for our data

● Can lead to biased results.
● Can lead to an excessively large dataset, which is difficult to deal with and wastes time and storage space.
● Can skew the data, causing inaccurate and confusing results.
● Can make visualizations difficult to read.

Possible causes

● Data combined from multiple sources
● Error during data entry
● System glitches
Duplicate observations often occur when two or more observations have identical values for all or most of the variables in the dataset.

In this example, we see that the identified duplicate observations are exact duplicates, and we can simply remove the second occurrence to clean the dataset.

01. These particular observations have been duplicated.

02. Therefore, our calculations below are not accurate, as they include the duplicates. For instance, our ‘Count’ values are in excess by 2 because of counting the 2 duplicate entries.
Removing duplicate observations

| Deduplication is the process of identifying and getting rid of the duplicate records using
various strategies, including deleting and merging.

Steps to deduplication

01. Search for the duplicate values in Google Sheets using:
   a. The filter option in the toolbar.
   b. Data > Column stats in the toolbar.
   c. A conditional statement in a formula.
   d. Mixed references and comparisons in a formula.

02. Handle the duplicate observations by:
   a. Removing the duplicate, retaining only the first or last occurrence.
   b. Merging the duplicate observations.

03. Both steps 01. and 02. can be done manually, where we inspect duplicates one by one, or automatically, using built-in data cleaning functions in software (see the sketch after these steps).

Be careful when removing seemingly duplicate observations. Ensure that they do not represent distinct cases.
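As one programmatic counterpart to these steps (the slides describe the Google Sheets route), here is a small pandas sketch on invented data.

```python
import pandas as pd

df = pd.DataFrame({
    "market": ["Nairobi", "Mombasa", "Nairobi", "Kisumu"],
    "price":  [3500.0, 3600.0, 3500.0, 3450.0],
})

# Step 01: flag duplicates (True marks a repeat of an earlier row).
print(df.duplicated())

# Step 02a: remove duplicates, retaining only the first occurrence
# (keep="last" would retain the last instead).
deduped = df.drop_duplicates(keep="first")

# Step 02b: merge duplicates instead, e.g. by averaging repeated markets.
# Check first that merged rows really are the same case, not distinct ones.
merged = df.groupby("market", as_index=False)["price"].mean()
```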
Unwanted outliers

| Outliers are unusual data points that differ significantly from the rest of the values in the
dataset.

Why it’s bad for our data

● Can skew the analysis results towards the outliers.
● Can distort the distribution of the data.
● Can affect the readability of results.

Possible causes

● Sampling errors
● Measurement errors
● Data entry errors
● Natural variation in the data

Although outliers may provide valuable insights into data, unwanted outliers can
significantly impact the accuracy and validity of our analysis results.
Can you identify a probable cause for the outlier in the example? Tip: Consider the other features in the dataset.

01. This value seems to be significantly higher than the rest, which means it is an outlier.

02. This means that our sum and average results have been skewed towards this outlier. We need to investigate the reason for this and find an appropriate way to handle the outlier.
Filtering unwanted outliers

| During data cleaning, we remove unwanted outliers that may affect the accuracy of our
analysis.

Not all outliers are errors. Outliers can provide valuable insights into the data. It’s important to first determine the validity of removing an outlier.

01. Identifying the outliers

● Plot the data for visual inspection, using a scatter plot, box plot, or histogram.
● Use statistical measures that are based on the distribution of the data, such as the interquartile range (see the sketch after this list).
● Use domain knowledge to determine validity.

02. Removing the outliers

● Delete the entire observation containing the outlier, if the observation is clearly an error.
● Replace the outlier value with a more accurate or representative value, such as the mean.
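The sketch below shows one common statistical measure mentioned above, an interquartile-range filter, in pandas; the 1.5 multiplier is the conventional rule of thumb, and the values are invented.

```python
import pandas as pd

prices = pd.Series([3450.0, 3500.0, 3520.0, 3480.0, 3550.0, 35000.0])  # last value looks suspect

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
filtered = prices[(prices >= lower) & (prices <= upper)]

# Flagged values should be investigated, not deleted automatically.
print(outliers)
```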
Irrelevant observations

| Irrelevant observations are observations that are not useful to the context or problem to be
solved and therefore do not contribute to our analysis.

Why it’s bad for our data

● Can make our dataset unnecessarily large, which makes it less manageable and efficient to work with.
● Can reduce the accuracy and effectiveness of our analysis.
● Can introduce bias.

Possible causes

● Unnecessary data picked up during data collection/web scraping
● Data entry errors
● Merging of multiple datasets

Whether an observation or feature is irrelevant, and whether or not it should be removed, depends on the size and complexity of the dataset and the problem to be solved.
In this example, the irrelevant observations and features are easy to identify and simple to remove because no other features are dependent on them.

However, in many cases it will require greater insight to identify these irrelevant observations and to determine possible interdependency.

01. The second row in our data seems to contain some additional metadata that does not add any value to our data.

02. The currency column is also unnecessary. Since our dataset contains maize prices in Kenya, it is unlikely that we will have differing currencies.

A small code sketch of removing both follows below.
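A minimal sketch of removing both items in pandas, assuming a metadata row at index 0 and a constant currency column; the names and layout are hypothetical, mirroring the example rather than a real file.

```python
import pandas as pd

df = pd.DataFrame({
    "market":   ["source: web scrape", "Nairobi", "Mombasa"],  # row 0 is metadata
    "price":    [None, 3500.0, 3600.0],
    "currency": ["KES", "KES", "KES"],  # constant, adds no information
})

df = df.drop(index=0)               # 01. drop the irrelevant metadata row
df = df.drop(columns=["currency"])  # 02. drop the unnecessary currency column
print(df)
```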
Removing irrelevant observations

| We remove irrelevant observations such that we are only left with data that are necessary
to our analysis.

The appropriate strategy for handling irrelevant observations and features depends on the goals of our analysis and the structure of the data.

In some cases, removing irrelevant observations may be appropriate, while in other cases, they may be retained but not used in the analysis.

We need to be careful about the criteria used to identify irrelevant observations, so that no biases are introduced.

It is important to note that additional observations and features often contribute to the story we are trying to tell with the data, even if they do not influence the analysis.
Structural issues

| Structural errors are issues within the data such as inconsistent naming conventions,
inconsistent formatting, typos, incorrect capitalization, and inconsistent data types.

Why it’s bad for our data

● Can cause the data to be processed incorrectly and thus give incorrect or biased results.
● Can lead to mislabeled categories, which may cause us to miss out on or misinterpret key findings.

Possible causes

● Data entry errors
● Incomplete survey responses
● Error during data collection
● Incorrect importing of data
● Formula errors

The severity of structural errors depends on the data structure, how the data are being used in analysis, and the specific type of structural error.
The structural issues in this example did not influence our sum, count, and average analysis. However, this is not always the case.

Incorrect file imports into Google Sheets often result in structural issues due to varying data types, inconsistent delimiters, etc.

01. In the date column, we have different date formats. This is not only untidy but may also cause confusion in our analysis.

02. For the market column, inconsistent naming conventions have been used. We have some rows where the full city name has been used, while others have the abbreviated form of the name. These will be treated as different categories even though they represent the same thing.
Fixing structural issues

| Depending on the type of structural issue, we can apply various strategies to correct it,
including correcting data entries and imports, converting to consistent formats, and
imputation.

01. Standardise the data and structure

This involves keeping the data consistent throughout the dataset, for example, lowercase, uppercase, naming conventions, measurements, date formats, padding for string size, etc.

02. Convert to the correct data type

This ensures that each column has the appropriate data type and that every value will be processed appropriately.

03. Spell check to correct typos

We can do this by inspecting and correcting typos manually, or automatically using programming or spelling and grammar tools.

These are only some of the ways to fix structural issues. Often, some of our other data cleaning strategies will also help with fixing structural issues. A sketch of the first two strategies in code follows below.
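As an illustration, here is a hedged pandas sketch of the first two strategies on invented data: standardising names and date formats, then converting data types. Note that format="mixed" requires pandas 2.x.

```python
import pandas as pd

df = pd.DataFrame({
    "market": ["Nairobi", "NRB", "Mombasa", "nairobi"],                  # mixed conventions
    "date":   ["2022-01-15", "15/02/2022", "2022-03-01", "01/03/2022"],  # mixed formats
    "price":  ["3500", "3600", "3450", "3550"],                          # numbers stored as text
})

# 01. Standardise names: consistent case, then map known abbreviations.
df["market"] = df["market"].str.title().replace({"Nrb": "Nairobi"})

# 01. Standardise dates: parse the mixed formats into a single datetime
#     dtype (format="mixed" is available in pandas 2.x).
df["date"] = pd.to_datetime(df["date"], format="mixed", dayfirst=True)

# 02. Convert to the correct data type: price as a number, not a string.
df["price"] = df["price"].astype(float)

print(df.dtypes)
```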
