
Encoding Categorical Data in R

Last Updated : 26 Apr, 2025

Encoding is the process of converting categorical data into numerical values. Categorical data is data that can be classified into categories or groups (such as colors or job titles). Since many statistical and machine learning models require numeric input, encoding is necessary to represent categorical variables in a format those models can process.

Different Techniques to Encode Categorical Data

Categorical data can be encoded in R using a variety of techniques. We'll go over three of the most popular approaches: one-hot encoding, label encoding, and frequency encoding.

1. One-Hot Encoding

One-hot encoding is a technique used to convert categorical data into a binary matrix. Each unique category value in a variable is assigned its own column in the matrix. For each row, if a category value is present, the corresponding column is marked with a 1, while all other columns for that row are set to 0. This technique ensures that categorical values are represented numerically, allowing them to be used in machine learning models.

In this example, we create a sample dataset and convert the categorical gender column to a numerical format using one-hot encoding.

R
gender <-  c("male", "female", "male", "male", "female")
age    <-  c(23, 34, 52, 21, 19)
income <-  c(50000, 70000, 80000, 45000, 55000)
df     <-  data.frame(gender, age, income)

# "-1" removes the intercept so that every category gets its own column
encoded_gender <- model.matrix(~gender-1, data=df)
print(encoded_gender)

Output:

[Figure: One-hot Encoding]
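The indicator columns are usually combined with the remaining numeric columns before modelling. The following is a minimal sketch of that step on the same sample data; the names df_encoded and dummy_gender are illustrative assumptions, not part of the original example. It also shows the common variant that keeps only k - 1 indicator columns for k categories.

```r
# Sample data mirroring the example above
df <- data.frame(
  gender = c("male", "female", "male", "male", "female"),
  age    = c(23, 34, 52, 21, 19),
  income = c(50000, 70000, 80000, 45000, 55000)
)

# Full one-hot encoding: "- 1" drops the intercept, so each level gets a column
encoded_gender <- model.matrix(~ gender - 1, data = df)

# Replace the original gender column with the indicator columns
# (df_encoded is a hypothetical name used for illustration)
df_encoded <- cbind(df[, c("age", "income")], encoded_gender)
print(df_encoded)

# Variant: keep the intercept in the formula, then discard the intercept column.
# The first level ("female") becomes the baseline, leaving k - 1 columns.
dummy_gender <- model.matrix(~ gender, data = df)[, -1, drop = FALSE]
print(colnames(dummy_gender))
```

The k - 1 variant is often used with linear models, where a full set of indicator columns alongside an intercept would be redundant.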


2. Label Encoding

Label encoding assigns a distinct integer to each unique value of a categorical variable. For instance, a categorical variable with the three unique values "red," "green," and "blue" might be assigned the numerical values 1, 2, and 3, respectively. In R, the factor() function turns a categorical variable into a factor, which can then be converted to integers with the as.integer() function. Note that factor() sorts its levels alphabetically by default, so the codes follow that order rather than the order of appearance.

In this example, the data frame contains a categorical column named color. We label encode the color column using the factor() function and convert it to integers using the as.integer() function.

R
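color <-  c("red", "green", "blue", "blue", "red")
df    <-  data.frame(color)

# factor() sorts the levels alphabetically (blue, green, red),
# so blue = 1, green = 2, red = 3
df$color <-  as.integer(factor(df$color))
print(df)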

Output:

[Figure: Label Encoding]
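When the categories have a natural order, the integer codes should follow that order rather than the alphabetical default. The sketch below assumes a hypothetical ordinal variable size (not part of the article's dataset) and passes the levels explicitly to factor(); the mapping vector is an illustrative way to record which code belongs to which category.

```r
# Hypothetical ordinal variable (not from the article's dataset)
size <- c("low", "high", "medium", "low", "high")

# Default behaviour: levels are sorted alphabetically (high, low, medium)
default_codes <- as.integer(factor(size))
print(default_codes)   # 2 1 3 2 1

# Explicit levels make the codes follow the intended order
size_factor   <- factor(size, levels = c("low", "medium", "high"))
ordered_codes <- as.integer(size_factor)
print(ordered_codes)   # 1 3 2 1 3

# Record the category-to-code mapping so it can be reused on new data
mapping <- setNames(seq_along(levels(size_factor)), levels(size_factor))
print(mapping)
```

Storing the mapping helps keep the encoding consistent when the same variable appears later in new data.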


3. Frequency Encoding

In frequency encoding, each distinct value is replaced by the number of times it occurs in the data. For example, if a categorical variable has three distinct values (red, green, and blue) that appear three, four, and two times respectively, they are encoded as 3, 4, and 2.

In this example, we will frequency encode the color column of the data frame.

R
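color <-  c("red", "green", "blue", "blue", "red")
df    <-  data.frame(color)

# Count how often each category occurs
freq_count  <-  table(df$color)
# Replace each value with the count of its category
df$color    <-  as.integer(freq_count[match(df$color, names(freq_count))])
print(df)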

Output:

[Figure: Frequency Encoding]
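A minimal sketch of frequency encoding on the same color data, extended with a relative-frequency variant. The column names color_freq and color_prop are assumptions chosen for illustration. The example also shows one limitation: categories that occur equally often receive the same code.

```r
color <- c("red", "green", "blue", "blue", "red")
df    <- data.frame(color)

# Count how often each category occurs (blue = 2, green = 1, red = 2)
freq_count <- table(df$color)

# Look up each row's count by category name
df$color_freq <- as.integer(freq_count[match(df$color, names(freq_count))])

# Relative-frequency variant: share of rows in each category
freq_prop     <- prop.table(freq_count)
df$color_prop <- as.numeric(freq_prop[match(df$color, names(freq_prop))])

print(df)

# "red" and "blue" both occur twice, so they share the same encoded value
print(unique(df[df$color_freq == 2, "color"]))
```

Because red and blue end up with identical values, a model can no longer tell them apart from the encoded column alone.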

Choosing an Encoding Method

The choice of encoding method depends on the type of analysis or model being used and the characteristics of the data. One-Hot Encoding is typically preferred for nominal variables with a small number of unique values, because it adds one column per category. For categorical variables with many unique values, Label Encoding and Frequency Encoding are commonly used, since each produces a single numeric column.

Note that Label Encoding and Frequency Encoding can introduce an unintended order or hierarchy into the data, which may affect the validity of analysis or machine learning models. In such cases, One-Hot Encoding may be a more suitable choice.
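To make the trade-off concrete, the sketch below compares the shape of the encoded data for a variable with many categories. The variable city and its 26 single-letter values are hypothetical and used only for illustration.

```r
# Hypothetical nominal variable with 26 categories, each appearing 4 times
city <- rep(LETTERS, each = 4)

# One-hot encoding: one column per category
one_hot <- model.matrix(~ city - 1)
print(dim(one_hot))        # 104 rows, 26 columns

# Label encoding: a single column, but the codes imply an order (A < B < ... < Z)
label_codes <- as.integer(factor(city))
print(range(label_codes))  # 1 26
```

One-hot encoding keeps the categories unordered at the cost of many columns, while label encoding stays compact but presents the categories to a model as increasing integers.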

Differences Between the Methods

| Encoding Method | Description | When to Use |
| --- | --- | --- |
| One-Hot Encoding | Converts each category into a binary vector, where one element is set to 1 and all others are 0. | When there is no inherent order between categories. Best suited to variables with a small number of unique values, since each category adds a column. |
| Frequency Encoding | Assigns a numerical value to each category based on its frequency in the dataset. | When there are many categories and you want to retain information about the frequency of categories. Useful for large datasets but may imply an unintended hierarchy. |
| Label Encoding | Assigns a unique integer to each category (in R, following the order of the factor levels). | When there is an ordinal relationship between categories (e.g., low, medium, high). Suitable for variables with a limited number of categories. Not recommended for nominal data, as it may create an artificial ranking. |

In this article, we discussed three encoding methods (One-Hot Encoding, Frequency Encoding, and Label Encoding) and when to use each, based on the nature of the categorical data and the requirements of the analysis or model.

