How to Perform Ordinal Encoding Using Sklearn
Last Updated :
19 Jun, 2024
Have you ever played a game where you rank things, like your favorite pizza toppings or the scariest monsters? In the field of computers, it is essentially what ordinal encoding accomplishes! It converts ordered data, such as "small," "medium", and "large," into numerical values that a computer can comprehend.
Understanding Ordinal Encoding
What is Ordinal Encoding?
Imagine you're helping a friend sort their movie collection. Three categories are available to you : "terrible," "okay," and "awesome." Each category is given a number by ordinal encoding, such as "terrible" = 1, "okay" = 2, and "awesome" = 3. In this manner "awesome" movies come after "okay" ones so the computer can comprehend. One way to transform categorical data into numerical data is to use ordinal encoding. 'Red, Green and Blue ' are examples of categories or groups that are represented by categorical data. Each category in ordinal encoding is given an integer, such as 1 for "Red," 2 for "Green," and 3 for "Blue." The important thing to remember is that these categories have a purposeful hierarchy. Here are some key terms to remember:
- Category: A group of similar things, like the movie ratings in our example.
- Ordinal Data: Data that has a natural order like movie ratings shirt sizes (small, medium, large), or video game difficulty levels (easy, normal, hard).
- Encoding: Converting data into a format a computer can understand.
Why Use Ordinal Encoding?
Numerous algorithms in machine learning function best with numerical data. To assist the algorithms in processing the data and producing predictions, ordinal encoding converts categories into a number format. In cases when the categories are naturally arranged like grades (A, B, C) or levels (Low, Medium, High) it is extremely helpful.
Preparing the Data
Before we can perform ordinal encoding, we need to have some data to work with. Let's consider a simple dataset:
Student
| Grade
|
---|
Alice
| A
|
---|
Bob
| B
|
---|
Charlie
| C
|
---|
David
| A
|
---|
Eva
| B
|
---|
In this dataset, 'Grade' is a categorical variable that we want to encode.
Implementing Ordinal Encoding in Sklearn
Now, let's move on to the actual implementation using Sklearn.
Step 1: Install Sklearn
First, ensure you have Sklearn installed. You can install it using pip:
Python
Step 2: Import Necessary Libraries
Next, we'll import the required libraries.
Python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
Step 3: Create the DataFrame
We'll create a DataFrame using the sample data.
Python
data = {
'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Grade': ['A', 'B', 'C', 'A', 'B']
}
df = pd.DataFrame(data)
print(df)
Output:
Student Grade
0 Alice A
1 Bob B
2 Charlie C
3 David A
4 Eva B
Step 4: Initialize and Apply OrdinalEncoder
We'll initialize the OrdinalEncoder and apply it to the 'Grade' column.
Python
encoder = OrdinalEncoder(categories=[['A', 'B', 'C']])
df['Grade_encoded'] = encoder.fit_transform(df[['Grade']])
print(df)
Output:
Student Grade Grade_encoded
0 Alice A 0.0
1 Bob B 1.0
2 Charlie C 2.0
3 David A 0.0
4 Eva B 1.0
Verifying the Results
To verify our results, we can check that the 'Grade' column has been correctly encoded. Each category 'A', 'B', and 'C' has been replaced with 0.0, 1.0, and 2.0, respectively. This confirms that our ordinal encoding has been applied successfully.
Example 2: Ordinal Encoding with a Public Dataset
Step 1: Import Necessary Libraries
We'll start by importing the necessary libraries.
Python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load the Public Dataset
We'll load the "Titanic" dataset directly from a URL.
Python
# Load the Titanic dataset
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
df = pd.read_csv(url)
print("Original Data:")
print(df.head())
Output:
Original Data:
Survived Pclass Name \
0 0 3 Mr. Owen Harris Braund
1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum...
2 1 3 Miss. Laina Heikkinen
3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle
4 0 3 Mr. William Henry Allen
Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 male 22.0 1 0 7.2500
1 female 38.0 1 0 71.2833
2 female 26.0 0 0 7.9250
3 female 35.0 1 0 53.1000
4 male 35.0 0 0 8.0500
The Titanic dataset contains various features about passengers, such as 'Pclass' (passenger class), 'Sex', 'Age', etc.
Step 3: Visualize the Column
We'll focus on the 'Sex' column and visualize its distribution.
Python
# Visualizing the 'Sex' column
sns.countplot(x='Sex', data=df)
plt.title('Original Sex Distribution')
plt.show()
Output:

The plot shows the frequency of male and female passengers in the dataset.
Step 4: Initialize and Apply OrdinalEncoder
We'll initialize the OrdinalEncoder and apply it to the 'Sex' column.
Python
# Initialize the OrdinalEncoder
encoder = OrdinalEncoder(categories=[['female', 'male']])
df['Sex_encoded'] = encoder.fit_transform(df[['Sex']])
print("\nEncoded Data:")
print(df[['Sex', 'Sex_encoded']].head())
Output:
Encoded Data:
Sex Sex_encoded
0 male 1.0
1 female 0.0
2 female 0.0
3 female 0.0
4 male 1.0
We specify the order of the categories as 'female' and 'male'. The OrdinalEncoder assigns 0 to 'female' and 1 to 'male'.
Step 5: Visualize the Encoded Data
Let's visualize the encoded data to see the distribution of numerical values.
Python
# Visualizing the encoded data
sns.countplot(x='Sex_encoded', data=df)
plt.title('Encoded Sex Distribution')
plt.show()
Output:
.png)
Conclusion
Ordinal encoding is a handy way to prepare your data for machine learning tasks. The method is simple and seamless thanks to Sklearn's OrdinalEncoder. You can now use order to your advantage in your data analysis endeavors! When the categories have a natural order, ordinal encoding is a simple yet effective method for turning categorical data into numerical representation. This procedure is significantly more accessible, when Sklearn is used. You may use ordinal encoding into your machine learning projects with ease by following the instructions provided in this article.
Similar Reads
One Hot Encoding using Tensorflow
In this post, we will be seeing how to initialize a vector in TensorFlow with all zeros or ones. The function you will be calling is tf.ones(). To initialize with zeros you could use tf.zeros() instead. These functions take in a shape and return an array full of zeros and ones accordingly. Code: pyt
2 min read
Target encoding using nested CV in sklearn pipeline
In machine learning, feature engineering plays a pivotal role in enhancing model performance. One such technique is target encoding, which is particularly useful for categorical variables. However, improper implementation can lead to data leakage and overfitting. This article delves into the intrica
7 min read
How To Do Train Test Split Using Sklearn In Python
In this article, let's learn how to do a train test split using Sklearn in Python. Train Test Split Using Sklearn The train_test_split() method is used to split our data into train and test sets. First, we need to divide our data into features (X) and labels (y). The dataframe gets divided into X_t
5 min read
Label Encoding in R programming
The data that has to be processed for performing manipulations and Analyses should be easily understood and well denoted. The computer finds it difficult to process strings and other objects when data training and predictions based on it have to be performed. Label encoding is a mechanism to assign
3 min read
One Hot Encoding in Machine Learning
One Hot Encoding is a method for converting categorical variables into a binary format. It creates new columns for each category where 1 means the category is present and 0 means it is not. The primary purpose of One Hot Encoding is to ensure that categorical data can be effectively used in machine
9 min read
How to use datasets.fetch_mldata() in sklearn - Python?
mldata.org does not have an enforced convention for storing data or naming the columns in a data set. The default behavior of this function works well with most of the common cases mentioned below: Data values stored in the column are 'Dataâ, and target values stored in the column are âlabelâ.The fi
2 min read
How to Include Factors in Regression using R Programming?
Categorical variables (also known as a factor or qualitative variables) are variables that classify observational values into groups. They are either string or numeric are called factor variables in statistical modeling. Saving normal string variables as factors save a lot of memory. Factors can als
4 min read
Guided Ordinal Encoding Techniques
There are specifically two types of guided encoding techniques for categorical features, namely - target guided ordinal encoding & mean guided ordinal encoding. Tools and Technologies needed:Understanding of pandas libraryBasic knowledge of how a pandas Dataframe work.Jupyter Notebook or Google
3 min read
One-Hot Encoding in NLP
Natural Language Processing (NLP) is a quickly expanding discipline that works with computer-human language exchanges. One of the most basic jobs in NLP is to represent text data numerically so that machine learning algorithms can comprehend it. One common method for accomplishing this is one-hot en
9 min read
Bag of word and Frequency count in text using sklearn
Text data is ubiquitous in today's digital world, from emails and social media posts to research articles and customer reviews. To analyze and derive insights from this textual information, it's essential to convert text into a numerical form that machine learning algorithms can process. One of the
3 min read