What is the Equivalent of DataFrame.drop_duplicates() from Pandas in Polars?

Last Updated : 09 Aug, 2024

In data analysis, removing duplicate rows is a routine data-cleaning task. Duplicates matter because they can silently skew counts, averages, and other results, so an analysis may look correct even though it is flawed.

Pandas, a popular data manipulation library in Python, provides the drop_duplicates() method for removing duplicate rows. Polars, a DataFrame library known for its speed and efficiency, provides a similar method.

In this article, we will learn how to use Polars to remove duplicates in data.

What is Polars in Python?

Polars is a high-performance DataFrame library for Rust and Python, designed to handle large datasets efficiently. The core of Polars is written in Rust, which gives it performance comparable to C/C++, and bindings are available for Python, R, and Node.js. The goal of Polars is to provide a lightning-fast DataFrame library that can utilize all cores on your machine, handle datasets larger than your available RAM, and offer a consistent and predictable API.

Polars can be installed in Python with a single pip command. Open your terminal, type the following command, and press Enter.

pip install polars

Once Polars is installed, you can verify the installation with a short Python script. It prints detailed information about the installed version of Polars.

Python

# import module
import polars as pl

# print version
pl.show_versions()

Removing Duplicates in Polars

In Polars, the equivalent of Pandas' drop_duplicates() method is the unique() method. It removes duplicate rows, comparing all columns by default or only the columns passed through the subset parameter.

Let us see step by step how we can load the data into Polars and remove duplicate values using the unique() function.

Import Polars

Once the Polars library is installed on your system, you can import it and use it in your programs.

import polars as pl

Load Data

The next step is to load the data into a Polars DataFrame using the DataFrame() constructor. All columns must have the same length.

df = pl.DataFrame({
    "a": [1, 2, 2, 3],
    "b": ["a", "b", "b", "c"]
})

Remove Duplicate values

After loading the data, use the unique() function to remove duplicates.

df_unique = df.unique()

Finally, print the DataFrame with print(df_unique).

Code Examples

Now let us look at a few examples to see how the unique() function actually works.

Removing Duplicates from all Columns

In this example, we will create a simple DataFrame with two columns. Then we will use the unique() function to remove duplicate rows and print the result.

Python

# import polars
import polars as pl

# create data frame
df = pl.DataFrame({
    "a": [1, 2, 2, 3],
    "b": ["a", "b", "b", "c"]
})

# remove duplicate values
df_unique = df.unique()

# print dataframe
print(df_unique)

Output:

Removed duplicate values using unique()

Remove Duplicates from Specific Columns

In this example, we will create a simple DataFrame with some duplicate values in it. Then we will use the unique() function to remove duplicates based on only one specific column and print the result.

To remove duplicates based on a specific column, we pass the subset parameter to the unique() function with the name of the column to check. Rows are then treated as duplicates when they share a value in that column, and the keep parameter decides which of them is retained.

Python

# import polars
import polars as pl

# create data frame
df = pl.DataFrame({
    "a": [1, 2, 2, 3, 4, 4, 5],
    "b": ["a", "b", "b", "c", "d", "d", "e"]
})

# remove duplicates from column "a"
df_unique = df.unique(subset=["a"], keep="first")

# print dataframe
print(df_unique)

Output:

Removing Duplicate values from specific Column

Conclusion

In this article, we saw that the unique() method in Polars plays the role of drop_duplicates() in Pandas. Both libraries offer straightforward methods to remove duplicate rows from a DataFrame, each with its own syntax and performance characteristics. Polars, with its focus on speed and efficiency, is particularly well-suited for handling large datasets. By following the methods mentioned in this article, you can effectively remove duplicates from your DataFrame.


punitss8u0w
