Streamlining Data Cleaning with PyJanitor: A Comprehensive Guide

Data cleaning is one of the most important steps in the data analysis process. Raw data often contains missing values, inconsistent column names, duplicates, or unwanted entries. Cleaning such data manually using Pandas can become repetitive and error-prone, especially for large datasets.

What is PyJanitor?

PyJanitor is an open-source Python library that extends Pandas by adding convenient functions for data cleaning. It helps perform common tasks such as renaming columns, handling missing values, filtering data, and encoding categorical variables with minimal code.

The goal of PyJanitor is to make data cleaning simpler, faster, and less error-prone, especially for beginners.

Key Features of PyJanitor

PyJanitor offers a variety of features that simplify data cleaning:

Method Chaining: Apply multiple cleaning steps in a single readable pipeline.
Convenient Functions: Ready-made functions for common data-cleaning tasks.
Pandas Integration: Works seamlessly with Pandas DataFrames.
Custom Functions: Allows users to integrate their own cleaning logic.

Installing PyJanitor

You can install PyJanitor using pip:

pip install pyjanitor

Using PyJanitor for Data Cleaning in Python

1. Cleaning Column Names

The clean_names() function standardizes column names by:

Converting to lowercase
Replacing spaces with underscores
Removing special characters (when remove_special=True is passed)

Python

import pandas as pd
import janitor

data = pd.DataFrame({'Column_1': [1, 2], 'Column_2': [3, 4]})
data = data.clean_names(remove_special=True)
print(data)

Output

column_1 column_2
0 1 3
1 2 4
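
To make the three steps above concrete, here is a rough pandas-only approximation of that kind of renaming. The helper approx_clean_name is hypothetical and written only for illustration; it is not PyJanitor's implementation and does not cover every case that clean_names() handles.

```python
import re

import pandas as pd


def approx_clean_name(name: str) -> str:
    """Hypothetical helper that mimics the three steps listed above."""
    name = name.strip().lower()              # convert to lowercase
    name = re.sub(r"\s+", "_", name)         # replace spaces with underscores
    name = re.sub(r"[^a-z0-9_]", "", name)   # remove remaining special characters
    return name


df = pd.DataFrame({"First Name": [1, 2], "Total ($)": [3, 4]})

# rename() accepts a function and applies it to every column label
df = df.rename(columns=approx_clean_name)
print(list(df.columns))  # ['first_name', 'total_']
```

In practice clean_names() is the better choice, since it handles these rules in one call.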

2. Removing Empty Rows and Columns

Use remove_empty() to remove rows or columns containing only missing values.

Python

import pandas as pd
import janitor

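# Row 1 is missing in every column, so it counts as an empty row
data = pd.DataFrame({'A': [1, None, 3], 'B': [4, None, 6]})

# Drop rows and columns that contain only missing values
data = data.remove_empty()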
print(data)

Output

A B
0 1.0 4.0
1 3.0 6.0
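
For comparison, the same cleanup can be sketched in plain pandas with dropna(how="all") on each axis. This is an illustrative equivalent rather than PyJanitor's own code, and the extra column C is added here only to show a fully empty column being dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, None, 3],
    "B": [4, None, 6],
    "C": [None, None, None],  # entirely empty column (added for illustration)
})

cleaned = (
    df.dropna(axis="index", how="all")     # drop rows where every value is missing
      .dropna(axis="columns", how="all")   # drop columns where every value is missing
      .reset_index(drop=True)              # renumber the remaining rows from 0
)
print(cleaned)
```

Only the middle row and column C are removed. A row or column with at least one real value is kept.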

3. Identifying Duplicate Data Points

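Repeated rows can be identified with duplicated(). This is a pandas DataFrame method rather than a PyJanitor function, so it fits into the same workflow without extra setup. It returns True for a row when every column value matches an earlier row, and False otherwise, so the first occurrence of a repeated row is not flagged.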

Python

import pandas as pd
import janitor

data = {
    'A': [1, 2, 2, 4],
    'B': [5, 6, 6, 8]
}
data = pd.DataFrame(data)

dup = data.duplicated()
print(dup)

Output

0 False
1 False
2 True
3 False
dtype: bool
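
Once duplicates are identified, the usual next step is to remove them. The sketch below uses the pandas keep parameter and drop_duplicates() on the same data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 4], "B": [5, 6, 6, 8]})

# keep=False flags every copy of a repeated row, including the first one
all_copies = df.duplicated(keep=False)
print(all_copies.tolist())  # [False, True, True, False]

# drop_duplicates() keeps the first occurrence and removes later repeats
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```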

4. Encoding Object Data Type to Categorical

The encode_categorical() function converts columns of object (string) type to the pandas category type. Pass the names of the columns to convert through the column_names argument.

Python

import pandas as pd
import janitor

data = {
    'A': ['low', 'medium', 'high', 'medium', 'low'],
    'B': ['type1', 'type2', 'type1', 'type3', 'type2']
}
data = pd.DataFrame(data)
print(data.dtypes)

data = data.encode_categorical(column_names=['A', 'B'])
print(data)
print(data.dtypes)

Output

A object
B object
dtype: object
A B
0 low type1
1 medium type2
2 high type1
3 medium type3
4 low type2
A category
B category
dtype: object

Explanation:

data = pd.DataFrame({...}): Create a DataFrame with columns 'A' and 'B'.
data.encode_categorical(column_names=['A', 'B']): Convert specified columns to categorical type.
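
To show what a category column offers once it exists, the following pandas-only sketch builds one with astype(), which is used here as a stand-in for the conversion. The ordered variant and the level order "low < medium < high" are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["low", "medium", "high", "medium", "low"]})

# Unordered categorical: pandas sorts the categories alphabetically
unordered = df["A"].astype("category")
print(list(unordered.cat.categories))  # ['high', 'low', 'medium']

# Ordered categorical: the order is stated explicitly, so comparisons work
levels = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
ordered = df["A"].astype(levels)
print(ordered.cat.codes.tolist())      # [0, 1, 2, 1, 0]
print((ordered >= "medium").tolist())  # [False, True, True, True, False]
```

Categorical columns store each distinct label once plus a small integer code per row, which is why they usually take less memory than object columns with many repeated strings.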

5. Renaming Columns

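Renaming columns is common in data cleaning. With its default settings, PyJanitor's clean_names converts names to lowercase and replaces spaces and punctuation such as parentheses with underscores. That is why 'Age (Years)' becomes age_years_ in the output below, with a trailing underscore left by the closing parenthesis.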

Python

import pandas as pd
import janitor
data = {
    'First Name': [1, 2, 3, 4],
    'Last Name': [5, 6, 7, 8],
    'Age (Years)': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

c1 = df.clean_names()
print(c1)

Output

first_name last_name age_years_
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12

6. Filtering Data

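Filtering data based on certain conditions is a common data cleaning task. PyJanitor provides the filter_string function, which keeps only the rows whose value in column_name contains search_string. In the example below, only names containing a lowercase 'a' are kept.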

Python

import pandas as pd
import janitor

data = {
    'Name': ['Smith', 'Robert', 'Mary', 'Harry'],
    'Age': [25, 30, 35, 40]
}
df = pd.DataFrame(data)
d1 = df.filter_string(column_name='Name', search_string='a')
print(d1)

Output

Name Age
2 Mary 35
3 Harry 40
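
The same substring filter can be written in plain pandas with str.contains(), which also makes it easy to see how inverting the match or ignoring case would work. This is an illustrative pandas sketch, not a description of how filter_string is implemented.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Smith", "Robert", "Mary", "Harry"],
    "Age": [25, 30, 35, 40],
})

# Boolean mask: True where the name contains a lowercase "a"
mask = df["Name"].str.contains("a", regex=False)

with_a = df[mask]      # rows that match
without_a = df[~mask]  # rows that do not match

# case=False makes the match case-insensitive, so "Robert" matches "r"
has_r = df[df["Name"].str.contains("r", case=False, regex=False)]

print(with_a["Name"].tolist())     # ['Mary', 'Harry']
print(without_a["Name"].tolist())  # ['Smith', 'Robert']
print(has_r["Name"].tolist())      # ['Robert', 'Mary', 'Harry']
```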

Pipe() Method: Chaining Operations

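pipe() is a pandas DataFrame method rather than a PyJanitor one, but the two pair well. Each PyJanitor cleaning function takes a DataFrame as its first argument and returns a DataFrame, so the functions can be imported from janitor and passed to pipe() one after another.
The result is a pipeline that reads from top to bottom in the order the steps are applied.
Here's an example that cleans the column names and then removes the empty row.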

Example:

Python

import pandas as pd
from janitor import clean_names, remove_empty

company_sales = {
    'SalesMonth': ['Jan', 'Feb', None, 'April'],
    'Company1': [150.0, 200.0, None, 400.0],
    'Company2': [180.0, 250.0, None, 500.0],
    'Company3': [400.0, 500.0, None, 675.0]
}

data = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)
print(data)

Output

salesmonth company1 company2 company3
0 Jan 150.0 180.0 400.0
1 Feb 200.0 250.0 500.0
2 April 400.0 500.0 675.0
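
pipe() accepts any function that takes a DataFrame and returns one, which is how custom cleaning logic fits into the same chain. The sketch below uses only pandas; drop_empty_rows and add_total are hypothetical helpers written for this example.

```python
import pandas as pd


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: drop rows where every value is missing."""
    return df.dropna(how="all").reset_index(drop=True)


def add_total(df: pd.DataFrame, columns: list, name: str = "total") -> pd.DataFrame:
    """Hypothetical step: add a column holding the row-wise sum of `columns`."""
    return df.assign(**{name: df[columns].sum(axis=1)})


sales = pd.DataFrame({
    "month": ["Jan", "Feb", None],
    "company1": [150.0, 200.0, None],
    "company2": [180.0, 250.0, None],
})

result = (
    sales
    .pipe(drop_empty_rows)
    # extra keyword arguments given to pipe() are forwarded to the function
    .pipe(add_total, columns=["company1", "company2"])
)
print(result)
```

Keeping each step in a small named function makes the pipeline easy to test and reorder.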

Key PyJanitor Functions

The sections above covered the most common cleaning tasks. The functions below handle a few more: filling missing values, filtering rows, renaming a column, and adding a column.

1. fill_empty(data, column_names, value)

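The fill_empty function replaces empty values in a column with a specified value. The example below calls jn.impute, which takes the same column_names and value arguments and which PyJanitor's documentation recommends as the replacement for fill_empty.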

column_names: the column(s) to fill (use a string for one column or a list for multiple columns).
value: the value to replace the empty cells.

Example:

Python

import pandas as pd
import janitor as jn

data = pd.DataFrame(
    {
        'col1': [1, 2, 3],
        'col2': [None, 4, None],
        'col3': [None, 5, 6]
    }
)

data = jn.impute(data, column_names=['col2', 'col3'], value=0)
print(data)

Output

col1 col2 col3
0 1 0.0 0.0
1 2 4.0 5.0
2 3 0.0 6.0

Explanation: jn.impute(data, column_names=['col2', 'col3'], value=0): Fill missing values in columns 'col2' and 'col3' with 0.
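
A single constant is not always the right fill value for every column. As a pandas-only sketch of one alternative, fillna() accepts a dictionary so that each column gets its own value. Using the column mean for col3 is an assumption made for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [None, 4, None],
    "col3": [None, 5, 6],
})

# Map each column to its own fill value: a constant for col2,
# the mean of the non-missing values for col3
filled = data.fillna({"col2": 0, "col3": data["col3"].mean()})
print(filled)
```

Here col2 becomes 0.0, 4.0, 0.0 and the missing value in col3 becomes 5.5, the mean of 5 and 6.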

2. filter_on(data, criteria, complement=False)

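The filter_on function filters rows in a DataFrame based on a condition and returns the result without changing the original data. The example below uses the pandas query() method, which takes the same criteria string and which PyJanitor's documentation recommends as the replacement for filter_on.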

criteria: the condition as a string (e.g., "score >= 50").
complement: default is False. Set to True to get rows not matching the condition.

Example:

Python

import pandas as pd

data = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4", "S5"],
    "score": [45, 75, 50, 90, 30],
})

f1 = data.query("score >= 50")
print(f1)

Output

student_id score
1 S2 75
2 S3 50
3 S4 90

Explanation: f1 = data.query("score >= 50"): Filter rows where the score is greater than or equal to 50.
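
The complement option described above returns the rows that fail the condition. With query() the same effect comes from negating the expression, as in this short pandas sketch.

```python
import pandas as pd

data = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4", "S5"],
    "score": [45, 75, 50, 90, 30],
})

criteria = "score >= 50"

passed = data.query(criteria)             # rows matching the condition
failed = data.query(f"not ({criteria})")  # rows not matching it

print(failed["student_id"].tolist())  # ['S1', 'S5']
```

Both calls return new DataFrames, so data itself still holds all five rows.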

3. rename_column(data, old_column_name, new_column_name)

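The rename_column function changes the name of a single column in a DataFrame. The example below uses the pandas rename() method, which does the same job through a mapping of old names to new names and which PyJanitor's documentation recommends as the replacement for rename_column.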

old_column_name: the current column name you want to change.
new_column_name: the new name you want to give.

Example:

Python

import pandas as pd

data = pd.DataFrame({"x": [10, 20, 30], "y": [40, 50, 60]})
data = data.rename(columns={'x': 'x_new'})
print(data)

Output

x_new y
0 10 40
1 20 50
2 30 60

Explanation: data = data.rename(columns={'x': 'x_new'}): Rename column x to x_new.

4. add_column(df, column_name, value, fill_remaining=False)

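The add_column function adds a new column to a DataFrame. The example below uses the pandas assign() method, which adds several columns in one call and which PyJanitor's documentation recommends as the replacement for add_column.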

column_name: the name of the new column.
value: the value(s) to fill in the column. You can provide a single value (e.g., 3) for all rows, or a list/range for different values in each row.

Example:

Python

import pandas as pd

data = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
data = data.assign(
    c=1,
    d=list("efg"),
    e=range(4, 7)
)
print(data)

Output

a b c d e
0 0 a 1 e 4
1 1 b 1 f 5
2 2 c 1 g 6

Explanation:

pd.DataFrame({"a": list(range(3)), "b": list("abc")}): Create a DataFrame with columns a and b.
data.assign(c=1, d=list("efg"), e=range(4, 7)): Add new columns:
c with all values 1
d with values 'e', 'f', 'g'
e with values 4, 5, 6
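
The fill_remaining parameter in the signature above is not used in the example. Assuming it recycles a list that is shorter than the DataFrame until every row has a value, as PyJanitor's documentation describes, the effect can be sketched in plain pandas as follows. The column name tag and the values are illustrative.

```python
from itertools import cycle, islice

import pandas as pd

df = pd.DataFrame({"a": list(range(5))})
short_values = ["x", "y"]  # fewer values than rows

# Repeat the short list until it covers every row, then cut it to length
recycled = list(islice(cycle(short_values), len(df)))

df = df.assign(tag=recycled)
print(df["tag"].tolist())  # ['x', 'y', 'x', 'y', 'x']
```

Without recycling, assigning a list whose length differs from the number of rows raises an error in pandas.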
