Data cleaning is the process of transforming raw, inconsistent data into a structured format suitable for analysis. It removes the errors, missing values and inconsistencies that would otherwise lead to incorrect analysis or poorly fitted models. The main goals of data cleaning include:
- Eliminating errors and redundancy
- Ensuring accuracy and completeness
- Standardizing formats and types
Characteristics of Clean Data
Clean data is structured, complete and free from errors or irrelevant values.
- No duplicate rows or values
- No spelling mistakes or inconsistencies
- Correct data types
- No unwanted special characters
- Outliers handled or understood
Common Signs of Messy Data
Messy data often requires transformation or correction before use.
- Special characters (e.g., commas in numbers)
- Numbers stored as text
- Duplicate or missing rows
- Spelling inconsistencies
- Extra whitespace or zeros instead of nulls
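Several of these signs can be fixed with base R alone. The following is a minimal sketch on a made-up data frame; the column names, the values and the decision to treat 0 as a stand-in for a missing value are assumptions for illustration only.

```r
# Hypothetical messy data: stray whitespace, inconsistent case,
# numbers stored as text with commas, a duplicate row and a 0 used as "missing"
messy <- data.frame(
  city  = c(" Delhi", "Mumbai ", "Mumbai ", "delhi"),
  sales = c("1,200", "950", "950", "0"),
  stringsAsFactors = FALSE
)

clean <- messy

# Strip surrounding whitespace and standardize capitalization
clean$city <- tolower(trimws(clean$city))

# Remove the thousands separator, then convert text to numeric
clean$sales <- as.numeric(gsub(",", "", clean$sales))

# Assumption for this example: 0 was entered in place of a missing value
clean$sales[clean$sales == 0] <- NA

# Drop exact duplicate rows
clean <- unique(clean)

print(clean)
str(clean)
```

Whether a zero really means "missing" depends on the dataset, so that step should only be applied after checking how the data was recorded.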
Overview of a Typical Data Analysis Chain
Data flows through successive stages from raw data to consistent data, with each stage requiring specific cleaning and validation steps. Knowing this flow shows where changes such as type correction, NULL/NA handling or standardization are made before the data moves forward. The chain has three stages:
- Raw Data: As received, possibly lacking headers or with incorrect types.
- Technically Correct Data: Imported into R with proper names and data types, ready for inspection.
- Consistent Data: Fully cleaned and standardized, suitable for statistical analysis or modeling.
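The step from raw to technically correct data can be sketched as follows. The input text and the column names are hypothetical; the point is only that names and types are assigned explicitly after import.

```r
# Hypothetical raw input: no header row, every field read as text
raw_text <- "41,190,7.4\nNA,118,8.0\n12,149,12.6"
raw <- read.csv(text = raw_text, header = FALSE, colClasses = "character")
str(raw)   # character columns with default names V1, V2, V3

# Technically correct: meaningful names and proper data types
tech <- raw
names(tech) <- c("ozone", "solar", "wind")
tech$ozone <- as.integer(tech$ozone)
tech$solar <- as.integer(tech$solar)
tech$wind  <- as.numeric(tech$wind)
str(tech)  # integer and numeric columns, NA preserved in ozone
```

The result is ready for inspection but not yet consistent, because the missing value in the first column still has to be handled.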
Data Cleaning in R
Implementation of Data Cleaning
We will explore the process of data cleaning using the R programming language.
1. Importing the Dataset
The airquality dataset is built into R, and we will use it to implement the data cleaning steps. We begin by loading the dataset and inspecting the first few records.
R
head(airquality)
Output:
Messy Data
We can notice missing values (NA) in the Ozone and Solar.R columns.
2. Identify Missing Values
We check whether the columns contain missing values by computing their means.
R
mean(airquality$Solar.R)
mean(airquality$Ozone)
mean(airquality$Wind)
Output:
[1] NA
[1] NA
[1] 9.957516
Both Solar.R and Ozone contain missing values, so mean() returns NA for them. The Wind column has no missing values, so its mean is computed normally.
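A more direct way to locate missing values is to count them per column. The sketch below uses only base R functions (is.na() and colSums()) on the same built-in dataset; the variable names na_counts and cols_with_na are chosen here for illustration.

```r
# Count missing values in each column of the built-in airquality dataset
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Names of the columns that contain at least one NA
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

Unlike the mean() check, this reports how many values are missing, which helps in deciding whether to impute them or drop the affected rows.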
3. Handle Missing Values with na.rm
We pass the argument na.rm = TRUE, which tells functions such as mean() to ignore NA values when computing the result.
R
mean(airquality$Solar.R, na.rm = TRUE)
mean(airquality$Ozone, na.rm = TRUE)
Output:
[1] 185.9315
[1] 42.12931
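The same argument is accepted by other base R summary functions. The sketch below also shows na.omit() as an alternative; note that it removes whole rows rather than skipping individual values, so the two approaches are not equivalent.

```r
# na.rm = TRUE works the same way in other summary functions
median(airquality$Ozone, na.rm = TRUE)
sd(airquality$Ozone, na.rm = TRUE)

# Column-wise means for the whole data frame in a single call
col_means <- colMeans(airquality, na.rm = TRUE)
print(round(col_means, 2))

# Alternative: drop every row that contains at least one NA
complete_rows <- na.omit(airquality)
nrow(airquality)     # rows before
nrow(complete_rows)  # rows after dropping incomplete records
```

Dropping rows is simple, but it also discards the valid measurements in those rows, which is one reason imputation is often preferred.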
4. Summary and Boxplot for Inspection
We will use summary(airquality), which reports the number of NA values in each column, and boxplot(airquality), which draws one box per column so that outliers stand out.
R
summary(airquality)
boxplot(airquality)
Output:
Summary of messy data
Box plot of messy data
5. Replace Missing Values with the Median
We replace NA values in Ozone and Solar.R with the median of their respective columns.
R
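# Work on a copy so the original dataset stays unchanged
New_df <- airquality
# Replace each NA in Ozone with the median of the observed Ozone values
New_df$Ozone <- ifelse(is.na(New_df$Ozone),
                       median(New_df$Ozone, na.rm = TRUE),
                       New_df$Ozone)
# Do the same for Solar.R
New_df$Solar.R <- ifelse(is.na(New_df$Solar.R),
                         median(New_df$Solar.R, na.rm = TRUE),
                         New_df$Solar.R)
summary(New_df)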
Output:
Cleaned dataset's summary
6. Handling Outliers using IQR method
We use the IQR method on the Ozone column: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers and capped at those bounds so they do not dominate the analysis.
R
Q1 <- quantile(New_df$Ozone, 0.25, na.rm = TRUE)
Q3 <- quantile(New_df$Ozone, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
New_df$Ozone <- pmin(pmax(New_df$Ozone, lower_bound), upper_bound)
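The same logic can be wrapped in a small helper and applied to another column. This is a sketch, not part of the original steps: iqr_bounds is a hypothetical function name, and the Wind column is used only as an example. Flagging the values first shows how many observations would be affected before anything is changed.

```r
# Hypothetical helper: lower and upper fences for a numeric vector
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

wind <- airquality$Wind
b <- iqr_bounds(wind)
print(b)

# Flag values outside the fences before modifying anything
is_outlier <- wind < b["lower"] | wind > b["upper"]
sum(is_outlier)

# Cap flagged values at the nearest fence; other values are unchanged
wind_capped <- pmin(pmax(wind, b["lower"]), b["upper"])
range(wind_capped)
```

Capping keeps every row in the dataset, whereas removing flagged rows would shrink it; which option is appropriate depends on whether the extreme values are errors or genuine measurements.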
7. Visualizing the results
After cleaning, we use head() and boxplot() to inspect the cleaned dataset. To confirm that no missing values remain, anyNA(New_df) should return FALSE.
R
head(New_df)
boxplot(New_df)
Output:
Cleaned Data
Cleaned Data boxplot
This article demonstrated how to clean a dataset in R using basic techniques such as handling missing values and capping outliers. Clean data supports accurate statistical analysis and is a critical step in any data science project.