
Data Cleaning in R

Last Updated : 25 Jun, 2025

Data cleaning is the process of transforming raw, inconsistent data into a structured format suitable for analysis. It ensures that datasets are free of errors, missing values and inconsistencies that could otherwise lead to incorrect analysis or poor models. The main goals of data cleaning include:

  • Eliminating errors and redundancy
  • Ensuring accuracy and completeness
  • Standardizing formats and types

Characteristics of Clean Data

Clean data is structured, complete and free from errors or irrelevant values.

  • No duplicate rows or values
  • No spelling mistakes or inconsistencies
  • Correct data types
  • No unwanted special characters
  • Outliers handled or understood

Common Signs of Messy Data

Messy data often requires transformation or correction before use.

  • Special characters (e.g., commas in numbers)
  • Numbers stored as text
  • Duplicate or missing rows
  • Spelling inconsistencies
  • Extra whitespace or zeros instead of nulls
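
As a minimal sketch of how some of these signs can be addressed in base R, the example below uses a small made-up data frame. The `messy` object, its columns and its values are hypothetical and are not part of the dataset used later in this article. Treating zero as a placeholder for a missing value is also an assumption that only holds when zero is not a legitimate measurement.

```r
# Hypothetical messy data frame (illustrative values only)
messy <- data.frame(
  name   = c(" Alice", "Bob ", "Bob ", "Carol"),
  income = c("1,200", "3,400", "3,400", "0"),
  stringsAsFactors = FALSE
)

# Trim extra whitespace around text values
messy$name <- trimws(messy$name)

# Remove commas, then convert numbers stored as text to numeric
messy$income <- as.numeric(gsub(",", "", messy$income))

# Assumption: 0 stands in for "unknown" here, so recode it as NA
messy$income[messy$income == 0] <- NA

# Drop exact duplicate rows
clean <- unique(messy)
print(clean)
```

Trimming whitespace before removing duplicates matters: "Bob " and "Bob" would otherwise be treated as different values.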

Overview of a Typical Data Analysis Chain

Data passes through successive stages, from raw data to consistent data, and each stage requires specific cleaning and validation steps. Knowing this flow shows where steps such as type correction, NA handling or standardization belong before moving forward. The stages in this chain are:

  • Raw Data: As received, possibly lacking headers or with incorrect types.
  • Technically Correct Data: Imported into R with proper names and data types, ready for inspection.
  • Consistent Data: Fully cleaned and standardized, suitable for statistical analysis or modeling.
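
One way to check whether a data frame has reached the "technically correct" stage is to inspect its dimensions, column names and column types. The sketch below does this for the built-in airquality dataset; the variable name `col_types` is only illustrative.

```r
# airquality ships with base R (datasets package), so no import is needed

# Compact overview: dimensions, column names, types and first values
str(airquality)

# Class of every column, as a named character vector
col_types <- sapply(airquality, class)
print(col_types)

# Number of rows and columns
print(dim(airquality))
```

If a column that should be numeric shows up as character or factor here, it needs type correction before any further cleaning.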

Implementation of Data Cleaning

We will explore the process of data cleaning using the R programming language.

1. Importing the Dataset

The airquality dataset is built into R, and we will use it to implement the data cleaning steps. We begin by inspecting the first few records with head().

R
head(airquality)

Output:

[Figure: Messy Data]

We can notice missing values (NA) in the Ozone and Solar.R columns.

2. Identify Missing Values

By default, mean() returns NA if a column contains any missing value, so calling it on each column is a quick way to check for missing values.

R
mean(airquality$Solar.R)
mean(airquality$Ozone)
mean(airquality$Wind)


Output:

[1] NA
[1] NA
[1] 9.957516

The mean of both Solar.R and Ozone is NA, which tells us that these columns contain missing values. The Wind column returns a number, so it has no missing values.
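
A more direct check is to count the missing values in each column. The following is a minimal sketch using base R's is.na() and colSums(); the variable name `na_counts` is only illustrative.

```r
# is.na() marks each cell as TRUE (missing) or FALSE;
# colSums() then counts the TRUE values per column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# anyNA() gives a single TRUE/FALSE answer for one column
print(anyNA(airquality$Wind))
```

Unlike the mean() check, this shows how many values are missing in each column, which helps when deciding how to handle them.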

3. Handle Missing Values with na.rm

We use the na.rm = TRUE argument, which tells functions such as mean() to ignore NA values when computing the result.

R
mean(airquality$Solar.R, na.rm = TRUE)
mean(airquality$Ozone, na.rm = TRUE)

Output:

[1] 185.9315
[1] 42.12931
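
The same argument can be passed through sapply() to compute the mean of every column in one call. This is a small sketch; the variable name `col_means` is only illustrative.

```r
# Extra arguments to sapply() are forwarded to mean(),
# so na.rm = TRUE applies to every column
col_means <- sapply(airquality, mean, na.rm = TRUE)
print(round(col_means, 2))
```

Note that na.rm = TRUE only skips missing values during the calculation. The NA values remain in the dataset.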

4. Summary and Boxplot for Inspection

We use summary(airquality) to see per-column statistics, including the number of NA values, and boxplot(airquality) to spot outliers.


R
summary(airquality)
boxplot(airquality)

Output:

[Figure: Summary of messy data]
[Figure: Box plot of messy data]

5. Replace Missing Values with Median

We replace NA values in Ozone and Solar.R with the median of their respective columns.

R
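# Work on a copy so the original dataset stays unchanged
New_df <- airquality

# Where Ozone is NA, use the column median; otherwise keep the value
New_df$Ozone <- ifelse(is.na(New_df$Ozone),
                       median(New_df$Ozone, na.rm = TRUE),
                       New_df$Ozone)

# Same replacement for Solar.R
New_df$Solar.R <- ifelse(is.na(New_df$Solar.R),
                         median(New_df$Solar.R, na.rm = TRUE),
                         New_df$Solar.R)

summary(New_df)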

Output:

[Figure: Cleaned dataset's summary]
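
When several columns need the same treatment, the replacement can be wrapped in a small function and applied to every numeric column. This is a sketch rather than the article's method: `impute_median` and `imputed_df` are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: replace NA in a numeric vector with the vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Work on a copy of the original data
imputed_df <- airquality

# Apply the helper to every numeric column
num_cols <- sapply(imputed_df, is.numeric)
imputed_df[num_cols] <- lapply(imputed_df[num_cols], impute_median)

# Every column should now report zero missing values
print(colSums(is.na(imputed_df)))
```

Columns without missing values pass through unchanged, so applying the helper to all numeric columns is safe here.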

6. Handling Outliers using IQR method

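We use the interquartile range (IQR) method to detect outliers in Ozone and cap them so that extreme values do not dominate the analysis. Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers and replaced with the nearest bound.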

R
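# Quartiles of the imputed Ozone column (na.rm = TRUE is only a safeguard after step 5)
Q1 <- quantile(New_df$Ozone, 0.25, na.rm = TRUE)
Q3 <- quantile(New_df$Ozone, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

# Bounds beyond which a value counts as an outlier
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

# Cap values: raise anything below lower_bound, lower anything above upper_bound
New_df$Ozone <- pmin(pmax(New_df$Ozone, lower_bound), upper_bound)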
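
Before capping, it can help to see how many values the IQR rule actually flags. The sketch below rebuilds the median-imputed Ozone column so that it runs on its own; the names `ozone`, `is_outlier` and `capped` are illustrative and are not used elsewhere in the article.

```r
# Rebuild the median-imputed Ozone column
ozone <- airquality$Ozone
ozone[is.na(ozone)] <- median(ozone, na.rm = TRUE)

# First and third quartiles, then the IQR-based bounds
q <- quantile(ozone, c(0.25, 0.75))
iqr_val <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr_val
upper <- q[[2]] + 1.5 * iqr_val

# Flag values outside the bounds and count them
is_outlier <- ozone < lower | ozone > upper
print(sum(is_outlier))

# Cap the flagged values at the nearest bound
capped <- pmin(pmax(ozone, lower), upper)
print(range(capped))
```

Capping keeps every row in the dataset, unlike removing outliers, but it changes the flagged values. The count shows how much of the column is affected.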

7. Visualizing the results

After cleaning, we use head() to view the first rows and boxplot() to visualise the distributions of the cleaned dataset. The rows that previously showed NA in Ozone and Solar.R should now contain the column medians.

R
head(New_df)
boxplot(New_df)

Output:

[Figure: Cleaned Data]
[Figure: Cleaned Data boxplot]
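
A plot cannot prove that no missing values remain, so a programmatic check is a useful complement. The sketch below repeats the median replacement on a separate copy so that it runs on its own; `check_df` is a hypothetical name introduced for this check.

```r
# Rebuild a median-imputed copy of the data
check_df <- airquality
for (col in c("Ozone", "Solar.R")) {
  check_df[[col]][is.na(check_df[[col]])] <- median(check_df[[col]], na.rm = TRUE)
}

# TRUE if any NA remains anywhere in the data frame
print(anyNA(check_df))

# stopifnot() raises an error if the condition fails,
# which makes the check usable inside a script
stopifnot(!anyNA(check_df))
```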

This article demonstrated how to clean a dataset in R using basic techniques: identifying missing values, replacing them with the median and capping outliers with the IQR method. Clean data supports accurate statistical analysis and is a critical step in any data science project.

