Data Cleaning in R

Last Updated : 25 Jun, 2025

Data cleaning is the process of transforming raw, inconsistent data into a structured format suitable for analysis. It ensures that datasets are free of errors, missing values and inconsistencies that can lead to incorrect analysis or poor model performance. The main goals of data cleaning include:

Eliminating errors and redundancy
Ensuring accuracy and completeness
Standardizing formats and types

Characteristics of Clean Data

Clean data is structured, complete and free from errors or irrelevant values.

No duplicate rows or values
No spelling mistakes or inconsistencies
Correct data types
No unwanted special characters
Outliers handled or understood
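
Several of these properties can be checked directly in base R. The snippet below is a minimal sketch using the built-in airquality dataset; which checks matter in practice depends on the data at hand.

```r
# Count duplicated rows (0 means every row is unique)
dup_rows <- anyDuplicated(airquality)

# Inspect the type of each column
col_types <- sapply(airquality, class)

# Count missing values across the whole data frame
total_na <- sum(is.na(airquality))

print(dup_rows)
print(col_types)
print(total_na)
```

A non-zero result from anyDuplicated() gives the index of the first duplicated row, which is a convenient starting point for inspection.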

Common Signs of Messy Data

Messy data often requires transformation or correction before use.

Special characters (e.g., commas in numbers)
Numbers stored as text
Duplicate or missing rows
Spelling inconsistencies
Extra whitespace or zeros instead of nulls
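
A few of the problems listed above can be fixed with base R string and conversion functions. The data frame below is hypothetical (the column names and values are made up for illustration), and the steps are a sketch rather than a complete cleaning routine.

```r
# Hypothetical messy data: extra whitespace, commas in numbers, a duplicate row
messy <- data.frame(
  city  = c(" Delhi", "Mumbai ", "Mumbai ", "  Pune"),
  sales = c("1,200", "950", "950", "3,400"),
  stringsAsFactors = FALSE
)

# Trim leading and trailing whitespace from the text column
messy$city <- trimws(messy$city)

# Remove thousands separators, then convert text to numeric
messy$sales <- as.numeric(gsub(",", "", messy$sales, fixed = TRUE))

# Drop exact duplicate rows
clean <- unique(messy)

str(clean)
```

Converting to numeric only after removing the commas matters: as.numeric("1,200") would return NA with a warning.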

Overview of a Typical Data Analysis Chain

Data flows through successive stages, from raw data to consistent data, with each stage requiring specific cleaning and validation steps. Understanding this flow shows where changes such as type correction, NA handling or standardization happen before the data moves forward. The stages are:

Raw Data: As received, possibly lacking headers or with incorrect types.
Technically Correct Data: Imported into R with proper names and data types, ready for inspection.
Consistent Data: Fully cleaned and standardized, suitable for statistical analysis or modeling.
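
The move from raw data to technically correct data can be sketched as follows. The input text and the column names (date, ozone, wind) are assumptions chosen for illustration; real raw files will differ.

```r
# Hypothetical raw input: no header row, every field read as text
raw_text <- "2024-01-05,41,7.4\n2024-01-06,NA,8.0\n2024-01-07,12,12.6"
raw <- read.csv(text = raw_text, header = FALSE, colClasses = "character")

# Technically correct data: meaningful names and proper types
names(raw) <- c("date", "ozone", "wind")
tidy <- transform(
  raw,
  date  = as.Date(date),
  ozone = as.integer(ozone),
  wind  = as.numeric(wind)
)

str(tidy)
```

Reaching the consistent stage would still require handling the remaining NA in ozone, as the implementation steps later in this article do for airquality.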


Implementation of Data Cleaning

We will explore the process of data cleaning using R programming language.

1. Importing the Dataset

airquality is a built-in R dataset that we will use to implement the data cleaning steps. We begin by inspecting the first few records.

head(airquality)

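Output:

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6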

We can notice missing values (NA) in the Ozone and Solar.R columns.

2. Identify Missing Values

A quick way to check whether a column contains missing values is to compute its mean: mean() returns NA if any value in the column is missing.

mean(airquality$Solar.R)
mean(airquality$Ozone)
mean(airquality$Wind)

Output:

[1] NA
[1] NA
[1] 9.957516

Both Solar.R and Ozone contain missing values, so mean() returns NA for them. The Wind column has no missing values, so its mean is computed normally.
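
Computing a mean only reveals that something is missing, not how much. A more direct check, sketched here with base R, counts the NA values in every column at once.

```r
# Count missing values per column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Share of rows that contain at least one missing value
incomplete_share <- mean(!complete.cases(airquality))
print(incomplete_share)
```

Columns with a count of zero need no missing-value handling, which keeps the later steps focused on Ozone and Solar.R.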

3. Handle Missing Values with `na.rm`

Setting na.rm = TRUE tells functions such as mean() to ignore NA values when computing the result.

mean(airquality$Solar.R, na.rm = TRUE)
mean(airquality$Ozone, na.rm = TRUE)

Output:

[1] 185.9315
[1] 42.12931
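
The same argument can be passed through sapply() to summarize every column in one call. This is a small sketch; note that na.rm only affects the calculation and leaves the data itself unchanged.

```r
# Column means, skipping NA values in each column
col_means <- sapply(airquality, mean, na.rm = TRUE)
print(round(col_means, 2))

# The data frame still contains its missing values
still_missing <- anyNA(airquality)
print(still_missing)
```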

4. Summary and Boxplot for Inspection

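We will use summary(airquality) and boxplot(airquality) to spot missing values and outliers. summary() reports the quartiles and the NA count of each column, while boxplot() draws one box per column so that extreme values stand out.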

summary(airquality)
boxplot(airquality)
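
When a plot is not convenient, the values that a boxplot would draw as outliers can be listed numerically. A minimal sketch for the Ozone column, using boxplot.stats() from base R:

```r
# Compute the boxplot statistics without drawing anything
ozone_stats <- boxplot.stats(airquality$Ozone)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(ozone_stats$stats)

# Observations that fall beyond the whiskers
print(ozone_stats$out)
```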


5. Replace Missing Values with Median

We replace NA values in Ozone and Solar.R with the median of their respective columns.

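# Work on a copy so the original dataset stays untouched
New_df <- airquality

# Replace NA in Ozone with the median of the observed Ozone values
New_df$Ozone <- ifelse(is.na(New_df$Ozone),
                       median(New_df$Ozone, na.rm = TRUE),
                       New_df$Ozone)

# Replace NA in Solar.R with the median of the observed Solar.R values
New_df$Solar.R <- ifelse(is.na(New_df$Solar.R),
                         median(New_df$Solar.R, na.rm = TRUE),
                         New_df$Solar.R)

# The NA counts should no longer appear in the summary
summary(New_df)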
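
When several columns need the same treatment, the replacement can be wrapped in a small function. The helper name impute_median below is hypothetical, and the sketch assumes that median imputation is appropriate for every column it is applied to.

```r
# Hypothetical helper: replace NA in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to the columns that contain missing values
imputed <- airquality
cols <- c("Ozone", "Solar.R")
imputed[cols] <- lapply(imputed[cols], impute_median)

# Every count should now be zero
print(colSums(is.na(imputed)))
```

The median is often preferred over the mean for skewed columns because it is less affected by extreme values.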


6. Handling Outliers using IQR method

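We use the IQR method to detect outliers in the Ozone column and cap them so they do not distort the analysis. Values below Q1 - 1.5 * IQR are raised to that lower bound, and values above Q3 + 1.5 * IQR are lowered to that upper bound.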

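# First and third quartiles of Ozone
Q1 <- quantile(New_df$Ozone, 0.25, na.rm = TRUE)
Q3 <- quantile(New_df$Ozone, 0.75, na.rm = TRUE)

# Interquartile range and the 1.5 * IQR fences
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

# Cap values outside the fences at the nearest bound
New_df$Ozone <- pmin(pmax(New_df$Ozone, lower_bound), upper_bound)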
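
The capping logic can also be written as a reusable function. The name cap_iqr and its parameter k are hypothetical, and the sketch runs on the original Ozone column so that it works on its own; missing values pass through unchanged.

```r
# Hypothetical helper: cap values outside the k * IQR fences
cap_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

ozone  <- airquality$Ozone
capped <- cap_iqr(ozone)

# Number of observations changed by capping
n_capped <- sum(capped != ozone, na.rm = TRUE)
print(n_capped)

# Range of the capped values
print(range(capped, na.rm = TRUE))
```

Capping keeps every row in the dataset, unlike removing outliers, at the cost of pulling extreme readings toward the fences.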

7. Visualizing the results

After cleaning, we use head() and boxplot() to inspect the cleaned dataset and confirm that the missing values have been filled and the Ozone outliers capped.

head(New_df)
boxplot(New_df)
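
Visual inspection can be backed by explicit checks. The sketch below rebuilds the cleaned data so that it runs on its own (the name cleaned is an assumption) and then tests the two properties the earlier steps were meant to establish.

```r
# Rebuild the cleaned data: median imputation, then IQR capping of Ozone
cleaned <- airquality
for (col in c("Ozone", "Solar.R")) {
  col_median <- median(cleaned[[col]], na.rm = TRUE)
  cleaned[[col]][is.na(cleaned[[col]])] <- col_median
}
q <- quantile(cleaned$Ozone, c(0.25, 0.75), names = FALSE)
fences <- c(q[1] - 1.5 * diff(q), q[2] + 1.5 * diff(q))
cleaned$Ozone <- pmin(pmax(cleaned$Ozone, fences[1]), fences[2])

# Explicit checks instead of relying on the plot alone
checks <- c(
  no_missing      = !anyNA(cleaned),
  ozone_in_fences = all(cleaned$Ozone >= fences[1] & cleaned$Ozone <= fences[2])
)
print(checks)
```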


This article demonstrated how to clean a dataset in R using basic techniques: identifying missing values, replacing them with the median and capping outliers with the IQR method. Clean data supports accurate statistical analysis and is a critical step in any data science project.

geetansh044
