
Data Preprocessing in R

Last Updated : 23 Apr, 2025

Data preprocessing is an important step in data analysis and machine learning. R provides a range of tools to clean, transform and prepare data for analysis. In this article, we will walk through the essential steps involved in data preprocessing using R.

1. Installing and Loading Required Packages

The tidyverse is a collection of packages that provides essential tools for data manipulation and visualization. We can install it using the install.packages() function. The library() function then loads it for use in the script.

R
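# Install once; this line can be skipped in later sessions
install.packages("tidyverse")
# Load the package in every session that uses it
library(tidyverse)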

2. Loading Data

To begin working with data, we need to load it into R. For CSV files, we can do this with the read.csv() function. We will be using the Titanic dataset; replace the path below with the location of the CSV file on your machine.

R
data <- read.csv("/path/to/your/data.csv")
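How missing entries are encoded in the file affects every later step. The following is a minimal sketch, assuming a tiny made-up CSV (the `csv_text` string and the `toy` data frame are illustrative, not the real Titanic file), of how the `na.strings` argument of read.csv() can turn blank fields into NA at load time.

```r
# Toy CSV text standing in for a real file (values are made up)
csv_text <- "PassengerId,Sex,Age,Fare
1,male,22,7.25
2,female,,71.28
3,,26,7.92"

# na.strings lists the raw strings that should be read as missing;
# without "" here, the blank Sex field would load as an empty string
toy <- read.csv(text = csv_text,
                na.strings = c("", "NA"),
                stringsAsFactors = FALSE)

toy$Sex          # the blank field in row 3 is now NA
is.na(toy$Age)   # the blank Age in row 2 is NA as well
```

With a real file, the same arguments would be passed alongside the file path instead of `text`.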

3. Exploring the Dataset

To understand the structure of the data, we can use the dim() and head() functions to see the dimensions and the first few rows of the dataset.

R
dim(data)
head(data)

Output:

Figure: First Few Rows of the Data
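Beyond dim() and head(), it can help to check column types and count missing values early. The sketch below uses a small made-up data frame (`toy` is an assumption for illustration, not the Titanic data) together with the base R functions str(), is.na() and colSums().

```r
# Small illustrative data frame (values are made up)
toy <- data.frame(
  Sex  = c("male", "female", "female", NA),
  Age  = c(22, 38, NA, 35),
  Fare = c(7.25, 71.28, 7.92, 53.10),
  stringsAsFactors = FALSE
)

# Column types plus a preview of the values
str(toy)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```

Columns with a non-zero count are the ones that need attention in the missing-value step.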


4. Descriptive Statistics

To get an overview of the dataset, we can compute descriptive statistics using the summary() function. For each numeric column, summary() reports the minimum, maximum, mean, median and quartiles. For character columns, it reports only the length and type.

R
summary(data)

Output:

Figure: Statistics of the Data
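Individual statistics can also be computed directly, which is handy for categorical columns where summary() says little. This is a rough sketch on made-up values (`toy` is illustrative) using quantile() for a numeric column and table() for a categorical one.

```r
# Small illustrative data frame (values are made up)
toy <- data.frame(
  Sex  = c("male", "female", "female", "male", "male"),
  Fare = c(7.25, 71.28, 7.92, 53.10, 8.05),
  stringsAsFactors = FALSE
)

# Quartiles of a numeric column
fare_quartiles <- quantile(toy$Fare, probs = c(0.25, 0.5, 0.75))
fare_quartiles

# Frequency of each category in a character column
sex_counts <- table(toy$Sex)
sex_counts
```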


5. Data Cleaning

Data cleaning involves addressing missing values, correcting data types and removing irrelevant columns. Here, the as.numeric() function ensures that the Age and Fare columns are treated as numeric values for further analysis. Any entry that cannot be parsed as a number becomes NA.

R
data$Age <- as.numeric(data$Age)
data$Fare <- as.numeric(data$Fare)
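Type conversion can silently introduce missing values, so it is worth inspecting what failed to convert. The sketch below uses made-up vectors (`raw_age` and `f` are illustrative) to show how as.numeric() behaves on unparseable strings and on factors.

```r
# Character input with one non-numeric entry and one blank (made up)
raw_age <- c("22", "38.5", "unknown", "")

# Unparseable entries become NA; R normally warns about this
age <- suppressWarnings(as.numeric(raw_age))
age

# Raw values that turned into NA during conversion
bad <- raw_age[is.na(age) & !is.na(raw_age)]
bad

# On a factor, as.numeric() returns the internal level codes,
# so convert to character first to recover the actual numbers
f <- factor(c("10", "20"))
codes  <- as.numeric(f)                 # 1 2
values <- as.numeric(as.character(f))   # 10 20
```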

6. Handling Missing Values

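We can fill missing values with the median for numerical columns and with the most common value (mode) for categorical columns. Below, missing Age values are replaced with the median age and missing Sex values with "male", the most frequent category in this dataset.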

R
data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)

data$Sex[is.na(data$Sex)] <- "male"
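Rather than hard-coding the most common category, the mode can be computed from the data. The following is a minimal sketch; `most_common` is a hypothetical helper written for this example, not a base R function, and the `sex` vector is made up.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# table() ignores NA by default; ties go to the first level in sort order.
most_common <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Made-up column with one missing entry
sex <- c("male", "female", "male", NA, "male", "female")

# Fill the gap with the computed mode
sex[is.na(sex)] <- most_common(sex)
sex
```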

7. Feature Scaling

Feature scaling ensures that all features have a similar scale, which is essential for many machine learning algorithms. We use two common methods: Standardization and Normalization.

1. Standardization

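This method is useful when features have different units. The scale() function standardizes the Fare and Age columns so that each has a mean of 0 and a standard deviation of 1, which matters for models that rely on distance metrics. Because scale() returns a one-column matrix, we wrap it in as.numeric() to keep each column a plain numeric vector.

R
# scale() returns a matrix; as.numeric() converts it back to a vector
data$Fare <- as.numeric(scale(data$Fare))
data$Age <- as.numeric(scale(data$Age))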
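To make the transformation concrete, the sketch below compares scale() against the formula it applies, z = (x - mean) / sd, on a made-up vector (`fare` here is illustrative and separate from the dataset).

```r
# Made-up fares
fare <- c(7.25, 71.28, 7.92, 53.10, 8.05)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (fare - mean(fare)) / sd(fare)

# scale() does the same but returns a one-column matrix
z_matrix <- scale(fare)
is.matrix(z_matrix)

# Flatten to a plain numeric vector before comparing
z_scale <- as.numeric(z_matrix)
all.equal(z_manual, z_scale)
```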


2. Normalization

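Normalization (min-max scaling) rescales the data to a fixed range, usually [0, 1]. This is useful when features need to be within a specific range for machine learning algorithms. Standardization and normalization are alternatives: running the code below after the standardization step replaces the standardized Fare values with their min-max scaled version.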

R
data$Fare <- (data$Fare - min(data$Fare)) / (max(data$Fare) - min(data$Fare))
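The same formula can be wrapped in a small reusable function. This is a sketch only; `min_max` is a hypothetical helper name, and the choice to ignore NA values when finding the range is an assumption made for this example.

```r
# Hypothetical helper: rescale a numeric vector to [0, 1].
# NA values are ignored when finding the range and stay NA in the result.
# A constant vector would divide by zero, so it is not handled here.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up fares with one missing value
fare <- c(7.25, 71.28, NA, 53.10, 8.05)
fare_norm <- min_max(fare)
fare_norm
```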

8. Encoding Categorical Data

Categorical data needs to be converted into numeric form for machine learning models. We can use One-Hot Encoding or Ordinal Encoding.

1. One-Hot Encoding

One-hot encoding creates a binary column for each category in a categorical variable. model.matrix() creates binary columns for each level of the Sex column; the "- 1" in the formula removes the intercept so that every level gets its own column. cbind() then combines these columns with the original data.

R
encoded_data <- cbind(data, model.matrix(~ Sex - 1, data))
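To see what model.matrix() produces, the sketch below runs the same formula on a four-row made-up data frame (`toy` is illustrative) and inspects the resulting column names.

```r
# Made-up categorical column
toy <- data.frame(Sex = c("male", "female", "female", "male"),
                  stringsAsFactors = FALSE)

# One indicator column per level; levels are sorted alphabetically
dummies <- model.matrix(~ Sex - 1, data = toy)
colnames(dummies)

# Each row has exactly one 1 across the indicator columns
rowSums(dummies)
```

The column names are the variable name followed by the level, here `Sexfemale` and `Sexmale`.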

2. Ordinal Encoding

Ordinal encoding is used for categorical variables with an inherent order, where numerical values are assigned according to that order. Sex has no natural order, so the mapping below is simply a binary label encoding: we map "male" to 0 and "female" to 1 and replace the categorical values in Sex with the corresponding numbers.

R
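mapping <- c("male" = 0, "female" = 1)
# Look up by label (as.character() guards against factor input) and drop the names
data$Sex <- unname(mapping[as.character(data$Sex)])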
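For a variable that does have an inherent order, an ordered factor keeps the ordering explicit. The sketch below uses hypothetical ticket-class labels (`ticket_class` and its level names are made up for illustration) and converts them to integer codes.

```r
# Hypothetical ordered labels (not taken from the dataset)
ticket_class <- c("third", "first", "second", "third")

# List the levels from lowest to highest
class_levels <- c("third", "second", "first")
class_factor <- factor(ticket_class, levels = class_levels, ordered = TRUE)

# Integer codes follow the level order: third = 1, second = 2, first = 3
class_code <- as.integer(class_factor)
class_code

# Ordered factors support comparisons
class_factor[2] > class_factor[1]
```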

9. Handling Outliers

Outliers can distort statistical analyses and machine learning models. We can use box plots to detect them. Below, we compute the interquartile range (IQR) of the Fare column and remove rows whose Fare lies more than 1.5 times the IQR below the first quartile or above the third quartile.

R
boxplot(data$Fare, main = "Before Removing Outliers")

IQR_value <- IQR(data$Fare)
lower_bound <- quantile(data$Fare, 0.25) - 1.5 * IQR_value
upper_bound <- quantile(data$Fare, 0.75) + 1.5 * IQR_value
data <- data[data$Fare >= lower_bound & data$Fare <= upper_bound, ]

boxplot(data$Fare, main="After Removing Outliers")

Output:

Figure: Before Outlier Removal
Figure: After Outlier Removal
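Dropping rows is only one way to act on the IQR bounds. As a sketch of two alternatives, the code below flags outliers without deleting them and caps extreme values at the bounds (sometimes called winsorizing). The `fare` vector is made up, and capping is a different technique from the row removal shown above.

```r
# Made-up fares with one extreme value
fare <- c(7.25, 7.92, 8.05, 13.00, 26.55, 53.10, 512.33)

# Same 1.5 * IQR rule as above
q <- quantile(fare, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Option 1: flag outliers instead of removing them
is_outlier <- fare < lower | fare > upper
which(is_outlier)

# Option 2: cap values at the bounds, keeping every row
fare_capped <- pmin(pmax(fare, lower), upper)
fare_capped
```

Capping keeps the number of rows unchanged, which can matter when the dataset is small.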

10. Visualizing the Results

After performing the above steps, we can see how our dataset has changed.

Figure: After Preprocessing the Data

We can see that the categorical column "Sex" is converted into a numerical column and the columns "Age" and "Fare" have been scaled. Therefore, the dataset can now be used for analysis or machine learning.


