How to process categorical features in Python?

This recipe helps you process categorical features in Python

Recipe Objective

Machine learning models cannot work with categorical variables stored as strings, so we need to convert them to a numerical form. We could assign a number to each category, but that is not effective when the difference between categories cannot be measured, because the numbers imply an order and a distance that do not exist. An alternative is to create a new feature for each category that holds boolean values. These new features are called dummy variables.
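The contrast between the two encodings can be seen in a minimal sketch. The `colors` series here is hypothetical toy data, not part of the recipe's dataset, and `dtype=int` is passed only so the dummy columns print as 0/1 (recent pandas versions return boolean columns by default).

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Integer coding: each category gets a number (assigned in sorted order here),
# which suggests blue < green < red even though no such order exists
codes = colors.astype("category").cat.codes
print(codes.tolist())

# Dummy coding: one indicator column per category, with no implied order
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)
```

Each row of `dummies` contains exactly one 1, marking the category that row belongs to.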

So this is the recipe on how we can process categorical features in Python.

Step 1 - Importing Library

from sklearn import preprocessing
import pandas as pd

We have imported only what the recipe needs: pandas and the preprocessing module from scikit-learn.

Step 2 - Creating DataFrame

We have created a dictionary and passed it to pd.DataFrame to create a DataFrame with different features.

raw_data = {"first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
            "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
            "age": [42, 52, 36, 24, 73],
            "city": ["San Francisco", "Baltimore", "Miami", "Douglas", "Boston"]}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "age", "city"])
print(df)

Step 3 - Processing Categorical variables

First we create dummy variables with binary values for the categorical feature city. Then we use LabelEncoder to fit and transform the same column into integer codes.

print(pd.get_dummies(df["city"]))
integerized_data = preprocessing.LabelEncoder().fit_transform(df["city"])
print(integerized_data)

The output is as follows:

  first_name last_name  age           city
0      Jason    Miller   42  San Francisco
1      Molly  Jacobson   52      Baltimore
2       Tina       Ali   36          Miami
3       Jake    Milner   24        Douglas
4        Amy     Cooze   73         Boston

   Baltimore  Boston  Douglas  Miami  San Francisco
0          0       0        0      0              1
1          1       0        0      0              0
2          0       0        0      1              0
3          0       0        1      0              0
4          0       1        0      0              0

[4 0 3 2 1]
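The recipe prints the encoded values but does not attach them to the DataFrame or keep the fitted encoder. A possible follow-up, sketched here under stated assumptions, replaces the city column with dummy columns in place and maps the integer codes back to city names. The `city` prefix, the `encoded` and `encoder` names, and the trimmed two-column DataFrame are illustrative choices, not part of the original recipe.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Trimmed copy of the recipe's data (only two columns, for brevity)
df = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "city": ["San Francisco", "Baltimore", "Miami", "Douglas", "Boston"],
})

# Replace "city" with one indicator column per city; dtype=int gives 0/1 values
encoded = pd.get_dummies(df, columns=["city"], prefix="city", dtype=int)
print(encoded.columns.tolist())

# Keep the fitted encoder so the integer codes can be mapped back to names
encoder = LabelEncoder()
codes = encoder.fit_transform(df["city"])
print(list(encoder.classes_))                  # categories in sorted order
print(list(encoder.inverse_transform(codes)))  # original city names
```

The scikit-learn documentation describes LabelEncoder as intended for target labels rather than input features, so for feature columns OrdinalEncoder or OneHotEncoder may be a better fit.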
