Write custom aggregation function in Pandas
Last Updated :
20 Aug, 2020
Pandas is widely used in Python for data analysis and provides data structures such as DataFrame and Series. Among its many helpful functions is the aggregate function, agg(). An aggregate function returns a single value from multiple input values that are grouped together on some criterion. Common aggregate functions include mean, count and maximum.
Syntax: DataFrame.agg(func=None, axis=0, *args, **kwargs)
Parameters:
- axis: {0 or ‘index’, 1 or ‘columns’}, default 0. 0 or ‘index’ means the function is applied to each column; 1 or ‘columns’ means the function is applied to each row.
- func: function, str, list or dict. The function(s) to use for aggregation. Accepted forms are: a function, a string function name, a list of functions and/or function names, or a dict mapping axis labels to functions, function names or lists of them.
- *args: It specifies the positional arguments to pass to the function.
- **kwargs: It specifies the keyword arguments to pass to the function.
Return: scalar, Series or DataFrame. The result is a scalar when Series.agg is called with a single function, a Series when DataFrame.agg is called with a single function, and a DataFrame when DataFrame.agg is called with several functions.
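The parameters and return types described above can be checked with a minimal sketch. The small frame and the helper value_range (with its scale keyword) are hypothetical names made up for this illustration; they are not part of pandas.

```python
import pandas as pd

# Small illustrative frame (values chosen for this sketch only)
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.agg with a single function -> scalar
scalar_result = df["a"].agg("sum")

# DataFrame.agg with a single function -> Series (one entry per column)
series_result = df.agg("sum")

# DataFrame.agg with a list of functions -> DataFrame (one row per function)
frame_result = df.agg(["sum", "min"])


def value_range(column, scale=1):
    """Hypothetical helper: spread of a column, multiplied by `scale`."""
    return (column.max() - column.min()) * scale


# Extra keyword arguments (**kwargs) are forwarded to the callable
scaled_result = df.agg(value_range, scale=2)

print(type(series_result).__name__, type(frame_result).__name__)
print(scaled_result)
```

Here scale=2 is not an argument of agg() itself; it reaches value_range through **kwargs.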
Let’s create a Dataframe:
Python3
import pandas as pd

# Four rows, three numeric columns
df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])
df
Output:

Now, let’s perform some operations:
1. Performing aggregation over the rows: With the default axis=0, each function is applied down the rows, so every column is reduced to a single value. Example 1 passes two function names to agg(), sum and min. The sum adds up the values of Col_A (10, 40, 70, 100), Col_B (20, 50, 80, 110) and Col_C (30, 60, 90, 120) separately, and min finds the smallest value in each column. Example 2 works the same way and adds max.
Example 1:
Python3
df.agg(['sum', 'min'])
Output:

Example 2:
Python3
df.agg(['sum', 'min', 'max'])
Output:

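To double-check the description above, the following minimal sketch rebuilds the same DataFrame and compares the aggregated values with the column totals worked out by hand. The variable names summary and manual_sum_a are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])

# One result row per function, one result column per original column
summary = df.agg(['sum', 'min', 'max'])

# The 'sum' entry for Col_A is the total of that column's four values
manual_sum_a = 10 + 40 + 70 + 100

print(summary)
print(summary.loc['sum', 'Col_A'] == manual_sum_a)
```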
2. Performing aggregation per column: A dict passed to agg() selects specific columns and the functions to apply to each of them. In the first example, two columns are selected, ‘Col_A’ and ‘Col_B’. For Col_A, the sum and the minimum are calculated; for Col_B, the minimum and the maximum. Example 2 works the same way and adds the sum and mean of Col_C.
Example 1:
Python3
df.agg({'Col_A': ['sum', 'min'],
        'Col_B': ['min', 'max']})
Output:

Example 2:
Python3
df.agg({'Col_A': ['sum', 'min'],
        'Col_B': ['min', 'max'],
        'Col_C': ['sum', 'mean']})
Output:

Note: The result contains NaN wherever a particular aggregation was not requested for a particular column.
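The NaN behaviour in the note can be seen in a minimal sketch that repeats the first dict example and inspects the missing cells with isna(). The variable name result is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])

# 'max' is not requested for Col_A and 'sum' is not requested for Col_B,
# so those two cells have nothing to show and are filled with NaN
result = df.agg({'Col_A': ['sum', 'min'],
                 'Col_B': ['min', 'max']})

print(result)
print(result.isna())
```

Col_C does not appear in the result at all, because it is not a key of the dict.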
3. Performing aggregation over the columns: With axis="columns", the function is applied across the columns, so every row is reduced to a single value. In the example below, the mean of the first (10, 20, 30), second (40, 50, 60), third (70, 80, 90) and fourth (100, 110, 120) rows is calculated separately.
Example:
Python3
df.agg("mean", axis="columns")
Output:

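As a quick check of the row-wise behaviour, this minimal sketch rebuilds the DataFrame and compares the result with the means computed by hand. The variable name row_means is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  columns=['Col_A', 'Col_B', 'Col_C'])

# axis="columns" reduces each row to one value, indexed by the row labels
row_means = df.agg("mean", axis="columns")

# Mean of the first row by hand: (10 + 20 + 30) / 3
first_row_mean = (10 + 20 + 30) / 3

print(row_means)
```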
4. Custom aggregate function: Sometimes the built-in functions are not enough and we need to write our own aggregate function.
Example: Consider a data frame consisting of student id (stud_id), subject code (sub_code) and marks (marks).
Python3
import pandas as pd

df = pd.DataFrame(
    {'stud_id': [101, 102, 103, 104,
                 101, 102, 103, 104],
     'sub_code': ['CSE6001', 'CSE6001', 'CSE6001', 'CSE6001',
                  'CSE6002', 'CSE6002', 'CSE6002', 'CSE6002'],
     'marks': [77, 86, 55, 90,
               65, 90, 80, 67]}
)
df
Output:

Now suppose you need to calculate the total marks (across both subjects) of each student (each unique stud_id). This can be done with a custom aggregate function. Here the custom aggregate function is ‘total’.
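Python3
from functools import reduce

# Custom aggregate: folds one group's marks (a Series) into a single total
def total(series):
    return reduce(lambda x, y: x + y, series)

# Apply the built-in 'sum' and the custom 'total' side by side
df.groupby('stud_id').agg({'marks': ['sum', total]})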
Output:

As you can see, the ‘sum’ and ‘total’ columns hold the same values, so the custom aggregate function calculates the total marks correctly in this case.
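A custom aggregate function does not have to duplicate a built-in one. As a minimal sketch, the hypothetical helper marks_spread below (not part of pandas, introduced only for illustration) returns the gap between a student's best and worst mark, and pandas named aggregation is used to give the result columns readable names.

```python
import pandas as pd

df = pd.DataFrame(
    {'stud_id': [101, 102, 103, 104,
                 101, 102, 103, 104],
     'sub_code': ['CSE6001', 'CSE6001', 'CSE6001', 'CSE6001',
                  'CSE6002', 'CSE6002', 'CSE6002', 'CSE6002'],
     'marks': [77, 86, 55, 90,
               65, 90, 80, 67]}
)


def marks_spread(series):
    """Hypothetical custom aggregate: best mark minus worst mark."""
    return series.max() - series.min()


# Named aggregation: output_column=(input_column, function)
report = df.groupby('stud_id').agg(
    total=('marks', 'sum'),
    spread=('marks', marks_spread),
)

print(report)
```

For student 101, for instance, the marks are 77 and 65, so the total is 142 and the spread is 12.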