What is the Equivalent of DataFrame.drop_duplicates() from Pandas in Polars?
Last Updated: 09 Aug, 2024
In data analysis, data manipulation is a critical task, and it often involves removing duplicates from the data. Removing duplicates matters because repeated rows can silently skew results, so an analysis may look correct even though it is flawed.
Pandas, a popular data manipulation library in Python, provides the drop_duplicates() method to remove duplicate rows. Polars, a DataFrame library known for its speed and efficiency, provides a similar method for removing duplicates.
In this article, we will learn how to use Polars to remove duplicates in data.
What is Polars in Python?
Polars is a high-performance DataFrame library for Rust and Python, designed to handle large datasets efficiently. The core of Polars is written in Rust, which gives it performance comparable to C/C++, and bindings are available for Python, R, and Node.js. The goal of Polars is to provide a lightning-fast DataFrame library that can utilize all cores on your machine, handle datasets larger than your available RAM, and offer a consistent and predictable API.
Polars can be installed in Python using a simple pip command. Open your terminal, type the following command, and press Enter.
pip install polars
Once Polars is installed, you can verify it with a short Python script that prints detailed information about the installed version.
Python
# import module
import polars as pl
# print version
pl.show_versions()
Output:
[Image: Polars version]
Removing Duplicates in Polars
In Polars, the equivalent of Pandas' drop_duplicates() method is the unique() method. This method removes duplicate rows, either across all columns or based on the specified columns.
Let us see step by step how we can load the data into Polars and remove duplicate values using the unique() function.
Import Polars
Once the Polars library is installed on your system, you can import it and use it in your programs.
import polars as pl
Load Data
The next step is to load the data into a Polars DataFrame using the DataFrame() constructor.
df = pl.DataFrame({
"a": [1, 2, 2, 3],
"b": ["a", "b", "b", "c"]
})
Remove Duplicate values
After loading the data, use the unique() function to remove duplicates.
df_unique = df.unique()
Finally, print the DataFrame with print(df_unique).
Code Examples
Now let us look at a few examples to see how the unique() function actually works.
Removing Duplicates from all Columns
In this example, we will create a simple DataFrame with two columns. Then we will use the unique() function to remove duplicate rows and print the DataFrame.
Python
# import polars
import polars as pl
# create data frame
df = pl.DataFrame({
"a": [1, 2, 2, 3],
"b": ["a", "b", "b", "c"]
})
# remove duplicate values
df_unique = df.unique()
# print dataframe
print(df_unique)
Output:
[Image: Removed duplicate values using unique()]
Remove Duplicates from Specific Columns
In this example, we will create a simple DataFrame with some duplicate values in it. Then we will use the unique() function to remove rows that repeat a value in one specific column and print the DataFrame.
To remove duplicates based on specific columns, pass the column names to the subset parameter of the unique() function. Rows are then compared only on those columns, and the keep parameter controls which row of each duplicate group is retained.
Python
# import polars
import polars as pl
# create data frame
df = pl.DataFrame({
"a": [1, 2, 2, 3, 4, 4, 5],
"b": ["a", "b", "b", "c", "d", "d", "e"]
})
# remove duplicates from column "a"
df_unique = df.unique(subset=["a"], keep="first")
# print dataframe
print(df_unique)
Output:
[Image: Removing duplicate values from a specific column]
Conclusion
In this article, we have explored Polars and some of its functions. Both Pandas and Polars offer straightforward methods to remove duplicate rows from a DataFrame, each with its own syntax and performance characteristics. Polars, with its focus on speed and efficiency, is particularly well-suited for handling large datasets. By following the methods mentioned in this article, you can effectively remove duplicates from your DataFrame.