How to use pandas cut() and qcut()?
Last Updated :
22 May, 2024
Pandas is an open-source library that is made mainly for working with relational or labeled data both easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series.
In this tutorial, we'll look at pandas' intelligent cut and qcut functions. Basically, we use cut and qcut to convert a numerical column into a categorical one, perhaps to make it better suited for a machine learning model (in case of a fairly skewed numerical column), or just for better analyzing the data at hand. Now, rather than blurting out technical definitions of cut and qcut, we'd be better off seeing what both these functions are good at and how to use them.
We'll first import the necessary data-manipulating libraries.
Python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We'll be using the CarDekho dataset, containing data about used cars listed on the platform. You can find the dataset here:
- 'Year' is the year in which the car was purchased.
- 'Selling_Price' is the price the owner wants to sell the car at.
- 'Present_Price' is the current ex-showroom price of the car.
- 'Owner' defines the number of owners the car has previously had before this car was put up on the platform.
The rest of the columns are pretty self-explanatory.
Python3
df = pd.read_csv('cardekho.csv')
df.head()
Output:

Python3
# no missing values in the dataset
df.info()

We'll infuse a missing value to better demonstrate how cut and qcut would handle an 'imperfect' dataset.
Python3
df.loc[0, 'Year'] = np.nan
# these are the 'unique' years in
# the data
np.array(sorted(df.Year.unique()))
Output:

pd.cut()
We can use the 'cut' function in broadly 2 ways: by specifying the number of bins directly and let pandas do the work of calculating equal-sized bins for us, or we can manually specify the bin edges as we desire.
Python3
# right = True, by default
pd.cut(df.Year, bins=3, right=True).head()
Output:

When we specified bins=3, pandas saw that the year range in the data is 2003 to 2018, hence appropriately cut it into 3 equal-width bins of 5 years each: [(2002.985, 2008.0] < (2008.0, 2013.0] < (2013.0, 2018.0]. Note that pandas automatically took the lower bound value of the first category (2002.985) to be a fraction less that the least value in the 'Year' column (2003), to include the year 2003 in the results as well, because you see, the lower bounds of the bins are open ended, while the upper bounds are closed ended (as right=True)
Note: Did you notice that the NaN values are kept as NaNs in the output result as well? When we specify right=False, the left bounds are now closed ended, while right bounds get open ended.
Python3
pd.cut(df.Year, bins=3, right=False).head()
Output:

Instead of getting the intervals back, we can specify the 'labels' parameter as a list for better analysis.
Python3
pd.cut(df.Year, bins=3,
labels=['old', 'medium', 'new']).head()
Output:

We will assign this series back to the original dataframe:
Python3
df['Yr_cut'] = pd.cut(df.Year, bins=3,
labels=['old', 'medium', 'new'])
df.head()
Output:

If we specify labels=False, instead of bin labels, we will get numeric representation of the bins: Here, 0 represents old, 1 is medium and 2 is new.
Python3
pd.cut(df.Year, bins=3, labels=False).head()
Output:

Just to see how many values fall in each bin:
Python3
df['Yr_cut'].value_counts()
Output:

And just because drawing a graph pleases more people than offends..
Python3
df['Yr_cut'].value_counts().plot(kind='barh')
plt.show()
Output:

Now, if we need the bin intervals along with the discretized series at one go, we specify retbins=True. Pandas will give us back a tuple containing 2 elements: the series, and the bin intervals. We will use tuple unpacking to grab both outputs.
Python3
cut_series, cut_intervals = pd.cut(df.Year,
bins=3,
retbins=True)
print("Cut series:")
print(cut_series.head())
print()
print("Cut intervals: ", cut_intervals)
Output:

As mentioned earlier, we can also specify bin edges manually by passing in a list:
Python3
pd.cut(df.Year, bins=[2003,
2007,
2010,
2015,
2018],
include_lowest=True).head()
Output:

Here, we had to mention include_lowest=True. Can you guess why? Because by default 'include_lowest' parameter is set to False, and hence when pandas sees the list that we passed, it will exclude 2003 from calculations. For the eagle-eyed, we could have used any value less than 2003 as well, like 1999 or 2002 or 2002.255 etc and gone ahead with the default setting of include_lowest=False.
pd.qcut()
Qcut (quantile-cut) differs from cut in the sense that, in qcut, the number of elements in each bin will be roughly the same, but this will come at the cost of differently sized interval widths. On the other hand, in cut, the bin edges were equal sized (when we specified bins=3) with uneven number of elements in each bin or group. Also, cut is useful when you know for sure the interval ranges and the bins,
For example, if binning an 'age' column, we know infants are between 0 and 1 years old, 1-12 years are kids, 13-19 are teenagers, 20-60 are working class grownups, and 60+ senior citizens. So we can appropriately set bins=[0, 1, 12, 19, 60, 140] and labels=['infant', 'kid', 'teenager', 'grownup', 'senior citizen']. In qcut, when we specify q=5, we are telling pandas to cut the Year column into 5 equal quantiles, i.e. 0-20%, 20-40%, 40-60%, 60-80% and 80-100% buckets/bins.
Python3
pd.qcut(df.Year, q=5).head(7)
Output:

We'll assign this series to the dataframe.
Python3
df['Yr_qcut'] = pd.qcut(df.Year, q=5,
labels=['oldest',
'not so old',
'medium',
'newer',
'latest'])
df.head()
df['Yr_qcut'].value_counts().plot(kind='barh')
plt.show()
qcut_series, qcut_intervals = pd.qcut(df.Year, q=5,
labels=['oldest',
'not so old',
'medium',
'newer',
'latest'],
retbins=True)
qcut_series.value_counts()
Output:



Now just to highlight the fact that q=5 indeed implies splitting values into 5 equal quantiles of 20% each, we'll manually specify the quantiles, and get the same bin distributions as above.
Python3
pd.qcut(df.Year,
q=[0, 0.2, 0.4,
0.6, 0.8, 1.0],
labels=['oldest',
'not so old',
'medium',
'newer',
'latest']).value_counts()
Output:

We'll now see the qcut intervals array we got using tuple unpacking:
Python3
Output:

You see? Here in qcut, the bin edges are of unequal widths, because it is accommodating 20% of the values in each bucket, and hence it is calculating the bin widths on its own to achieve that objective.
Python3
# demonstrating with some random quantiles
pd.qcut(df.Year, q=[0,0.15,0.35,
0.51,0.78,1]).head()
Output:

Now it is binning the data into our custom made list of quantiles of 0-15%, 15-35%, 35-51%, 51-78% and 78-100%.
With qcut, we're answering the question of "which data points lie in the first 15% of the data, or in the 51-78 percentile range etc. Additionally, we can also use pandas' interval_range, or numpy's linspace and arange to generate a list of interval ranges and feed it to cut and qcut as the bins and q parameter respectively.
Python3
pd.cut(df.Year,
bins=pd.interval_range(start=2002.99,
end=2018,
periods=3)).head()
Output:

Using linspace:
Python3
# basically dividing data into 3
# quantiles
pd.qcut(df.Kms_Driven,
q=np.linspace(0,1,4)).head()
Output:

Using arange:
Python3
pd.qcut(df.Selling_Price,
q=np.arange(0, 1.01, 0.25),
labels=['inexpensive',
'average',
'high cost',
'over budget']).head()
Output:

Sometimes when we ask pandas to calculate the bin edges for us, you may run into an error which looks like:
ValueError: Bin edges must be unique error. This implies that while calculating the bin intervals, pandas found that some bin edges were the same on both ends, like an interval of (2014, 2014] and hence it raised that error. This usually happens when the number of bins is large and the value range of the particular column is small. You can silence this error by passing the argument of duplicates='drop'.
Similar Reads
How to Use Python Pandas
to manipulate and analyze data efficientlyPandas is a Python toolbox for working with data collections. It includes functions for analyzing, cleaning, examining, and modifying data. In this article, we will see how we can use Python Pandas with the help of examples.What is Python Pandas?A Python lib
5 min read
How to Use axis=0 and axis=1 in Pandas?
In this article, we will discuss how to use axis=0 and axis=1 in pandas using Python. Sometimes we need to do operations only on rows, and sometimes only on columns, in such situations, we specify the axis parameter. In this article, let's see a few examples to know when and how to use the axis para
2 min read
How to Filter rows using Pandas Chaining?
In this article, we will learn how to filter rows using Pandas chaining. For this first we have to look into some previous terms which are given below : Pandas DataFrame: It is a two-dimensional data structure, i.e. the data is tabularly aligned in rows and columns. The Pandas DataFrame has three ma
4 min read
How to use Pandas filter with IQR?
The IQR or Inter Quartile Range is a statistical measure used to measure the variability in a given data. In naive terms, it tells us inside what range the bulk of our data lies. It can be calculated by taking the difference between the third quartile and the first quartile within a dataset. IQR = Q
4 min read
How to Drop First Row in Pandas?
We are given a DataFrame and our task is to remove the first row from it. For example, if we have a DataFrame like this: A B C 0 1 2 3 1 4 5 6 2 7 8 9 Then the output will be: A B C 1 4 5 6 2 7 8 9 Using iloc for Index-Based SelectionTo drop the first row using iloc, select all rows starting from in
4 min read
How to Calculate Quantiles by Group in Pandas?
In this article, how to calculate quantiles by group in Pandas using Python. There are many methods to calculate the quantile, but pandas provide groupby.quantile() function to find it in a simple few lines of code. This is the Method to use when the desired quantile falls between two points. Syntax
2 min read
Methods to Round Values in Pandas DataFrame
There are various ways to Round Values in Pandas DataFrame so let's see each one by one: Let's create a Dataframe with 'Data Entry' Column only: Code: Python3 # import Dataframe class # from pandas library from pandas import DataFrame # import numpy library import numpy as np # dictionary Myvalue =
3 min read
How to write Pandas DataFrame as TSV using Python?
In this article, we will discuss how to write pandas dataframe as TSV using Python. Let's start by creating a data frame. It can be done by importing an existing file, but for simplicity, we will create our own. Python3 # importing the module import pandas as pd # creating some sample data sample =
1 min read
Python | Pandas Index.slice_locs()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas Index.slice_locs() function compute slice locations for input labels. It takes
2 min read
Pandas query() Method
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages that makes importing and analyzing data much easier. Analyzing data requires a lot of filtering operations. Pandas Dataframe provide many
2 min read