Pandas is an open-source Python library used for data manipulation and analysis. In interviews, questions on Pandas are often asked to assess your ability to work with structured data effectively. Below are some of the most frequently asked interview questions and answers covering key Pandas topics.
1. List Key Features of Pandas.
Pandas is used for efficient data analysis. The key features of Pandas are as follows:
- Fast and efficient data manipulation and analysis
- Built-in time-series functionality
- Easy handling of missing data
- Fast data merging and joining
- Flexible reshaping of data sets
- Group-by functionality
- Loading data from many different file formats
- Tight integration with NumPy
2. What are the Different Types of Data Structures in Pandas?
The two data structures that are supported by Pandas are Series and DataFrames.
- Series is a one-dimensional labelled array that can hold data of any type. It is mostly used to represent a single column or row of data.
- DataFrame is a two-dimensional heterogeneous data structure. It stores data in a tabular form. Its three main components are data, rows and columns.
3. What is Series in Pandas?
A Series in Pandas is a one-dimensional labelled array, comparable to a single column of an Excel sheet. It can hold data of any type, such as integers, strings and Python objects. Its axis labels are collectively known as the index. A Series contains homogeneous data (a single dtype); its values can be changed, but the size of the series is immutable. A series can be created from a Python tuple, list or dictionary. The syntax for creating a series is as follows:
import pandas as pd
series = pd.Series(data)
4. What are the Different Ways to Create a Series?
In Pandas, a series can be created in many ways. They are as follows:
1. Creating a Series from a List
We can create a series by passing a Python list to the Series() constructor.
import pandas as pd
list1 = ['g', 'e', 'e', 'k', 's']
print(pd.Series(list1))
Output:
0 g
1 e
2 e
3 k
4 s
dtype: object
2. Creating a Series from Dictionary
A Series can also be created from a Python dictionary. The keys of the dictionary are used to construct the index of the series.
import pandas as pd
d = {'Geeks': 10, 'for': 20, 'geeks': 30}
print(pd.Series(d))
Output:
Geeks 10
for 20
geeks 30
dtype: int64
3. Creating a Series from Scalar Value
To create a series from a scalar value, we must provide an index. The Series constructor takes two arguments: the scalar value and a list of index labels. The value is repeated for every index label.
import pandas as pd
ser = pd.Series(10, index=[0, 1, 2, 3, 4, 5])
print(ser)
Output:
0 10
1 10
2 10
3 10
4 10
5 10
dtype: int64
4. Creating a Series using NumPy Functions
The NumPy module's functions, such as numpy.linspace() and numpy.random.randn(), can also be used to create a Pandas series.
import pandas as pd
import numpy as np
ser1 = pd.Series(np.linspace(3, 33, 3))
print(ser1)
ser2 = pd.Series(np.random.randn(3))
print("\n", ser2)
Output:
0 3.0
1 18.0
2 33.0
dtype: float64
0 -0.341027
1 -1.700664
2 0.364409
dtype: float64
5. Creating a Series using List Comprehension
Here, we will use the Python list comprehension technique to create a series in Pandas. We will use the range() function for the values and a list comprehension for the index labels.
import pandas as pd
ser = pd.Series(range(1, 20, 3), index=[x for x in 'abcdefg'])
print(ser)
Output:
a 1
b 4
c 7
d 10
e 13
f 16
g 19
dtype: int64
5. What is a DataFrame in Pandas?
A DataFrame in Pandas is a data structure used to store data in tabular form, that is, in the form of rows and columns. It is two-dimensional, size-mutable and heterogeneous in nature. The main components of a dataframe are data, rows and columns. A dataframe can be created by loading the dataset from existing storage such as a SQL database, CSV file, Excel file, etc. The syntax for creating a dataframe is as follows:
import pandas as pd
dataframe = pd.DataFrame(data)
6. What are the Different ways to Create a DataFrame in Pandas?
In Pandas, a dataframe can be created in many ways. They are as follows:
1. Creating a DataFrame using a List
In order to create a DataFrame from a Python list, just pass the list to the DataFrame() constructor.
import pandas as pd
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
print(pd.DataFrame(lst))
Output:
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
2. Creating a DataFrame using a Dictionary
A DataFrame can be created from a Python dictionary by passing it to the DataFrame() constructor. The keys of the dictionary become the column names and the values of the dictionary become the data of the DataFrame.
import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
print(pd.DataFrame(data))
Output:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
3. Creating a DataFrame using a List of Dictionaries
Another way to create a DataFrame is by using a Python list of dictionaries. The list is passed to the DataFrame() constructor, and the keys of each dictionary element become the column names.
import pandas as pd
lst = [{1: 'Geeks', 2: 'For', 3: 'Geeks'},
{1: 'Portal', 2: 'for', 3: 'Geeks'}]
print(pd.DataFrame(lst))
Output:
1 2 3
0 Geeks For Geeks
1 Portal for Geeks
4. Creating a DataFrame from Pandas Series
A DataFrame in Pandas can also be created by using the Pandas series.
import pandas as pd
lst = pd.Series(['Geeks', 'For', 'Geeks'])
print(pd.DataFrame(lst))
Output:
0
0 Geeks
1 For
2 Geeks
7. How to Read Data into a DataFrame from a CSV file?
We can create a dataframe from a CSV (Comma Separated Values) file by using the read_csv() method, which takes the file path as a parameter.
pandas.read_csv(file_name)
Another way is the read_table() method, which works like read_csv() but uses a tab as the default separator; a different delimiter can be supplied via the sep parameter.
pandas.read_table(file_name, sep=',')
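As a minimal, self-contained sketch (an in-memory StringIO buffer stands in for a real file path, and the column names are made up for illustration):

```python
import pandas as pd
from io import StringIO

# An in-memory buffer stands in for a CSV file on disk
csv_data = StringIO("Name,Age\nTom,20\nNick,21\nKrish,19")

df = pd.read_csv(csv_data)
print(df.shape)  # (3, 2)
```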
8. How can a DataFrame be Converted to an Excel File?
A Pandas dataframe can be converted to an Excel file by using the to_excel() method, which takes the file name as a parameter. We can also specify the sheet name via the sheet_name parameter. (Writing .xlsx files requires an Excel engine such as openpyxl to be installed.)
DataFrame.to_excel(file_name, sheet_name='Sheet1')
9. How to Convert a DataFrame into a Numpy Array?
NumPy is a Python package used to perform large numerical computations. It processes multidimensional array elements to perform complicated mathematical operations efficiently.
A Pandas dataframe can be converted to a NumPy array by using the to_numpy() method. We can also provide the datatype as an optional argument.
Dataframe.to_numpy()
We can also use the .values attribute to convert dataframe values to a NumPy array, although to_numpy() is the recommended approach in recent versions of pandas.
df.values
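For example, a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
arr = df.to_numpy()
print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2)
```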
10. How to access the first few rows of a dataframe?
The first few records of a dataframe can be accessed by using the pandas head() method. It takes one optional argument n, which is the number of rows. By default, it returns the first 5 rows of the dataframe. The head() method has the following syntax:
df.head(n)
Another way to do it is by using the iloc indexer, which is similar to the Python list-slicing technique. It has the following syntax:
df.iloc[:n]
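A minimal sketch showing that both approaches return the same rows (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': range(10)})

first3 = df.head(3)
print(first3['A'].tolist())       # [0, 1, 2]
print(df.iloc[:3]['A'].tolist())  # [0, 1, 2] -- same rows via slicing
```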
11. How to Select a Single Column of a DataFrame?
There are many ways to Select a single column of a dataframe. They are as follows:
By using the dot operator, we can access any column of a dataframe whose name is a valid Python identifier and does not clash with a DataFrame attribute.
Dataframe.column_name
Another way to select a column is by using the square brackets [].
DataFrame[column_name]
12. How to Rename a Column in a DataFrame?
A column of the dataframe can be renamed by using the rename() function. We can rename a single as well as multiple columns at the same time using this method.
DataFrame.rename(columns={'column1': 'COLUMN_1', 'column2':'COLUMN_2'}, inplace=True)
Another way is by using the set_axis() method, which takes the new labels and the axis to be relabelled. In recent versions of pandas it returns a new object instead of accepting inplace=True.
DataFrame.set_axis(['COLUMN_1', 'COLUMN_2'], axis=1)
In case we want to add a prefix or suffix to the column names, we can use the add_prefix() or add_suffix() methods.
DataFrame.add_prefix(prefix='PREFIX_')
DataFrame.add_suffix(suffix='_suffix')
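A short sketch of renaming in practice (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom'], 'age': [20]})

# Rename specific columns via a mapping
renamed = df.rename(columns={'name': 'Name', 'age': 'Age'})
print(list(renamed.columns))           # ['Name', 'Age']

# Add a prefix to every column name
prefixed = df.add_prefix('col_')
print(list(prefixed.columns))          # ['col_name', 'col_age']
```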
13. How to add Row or Column to an Existing Dataframe?
1. Adding Rows
The df.loc[] is used to access a group of rows or columns and can be used to add a row to a dataframe.
DataFrame.loc[Row_Index]=new_row
We can also add multiple rows in a dataframe by using pandas.concat() function which takes a list of dataframes to be added together.
pandas.concat([Dataframe1,Dataframe2])
2. Adding Columns
We can add a column to an existing dataframe by simply assigning a list or Series of values to a new column name.
DataFrame['column_name'] = list_of_values
Another way to add a column is by using the df.insert() method, which takes the position at which the column should be inserted, the column name and the column values as parameters.
DataFrameName.insert(col_index, col_name, value)
We can also add a column to a dataframe by using df.assign() function
DataFrame.assign(**kwargs)
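Putting the above together in one minimal sketch (the names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick'], 'Age': [20, 21]})

# Add a row at a new index label with loc
df.loc[2] = ['Krish', 19]

# Add a column by assigning a list of values
df['City'] = ['Delhi', 'Mumbai', 'Pune']

# Insert a column at a specific position
df.insert(1, 'ID', [101, 102, 103])

print(list(df.columns))  # ['Name', 'ID', 'Age', 'City']
print(len(df))           # 3
```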
14. How to Delete a Row or Column from an Existing DataFrame?
We can delete a row or a column from a dataframe by using the df.drop() method and providing the row index or column name as the parameter.
1. To delete a column
DataFrame.drop(['Column_Name'], axis=1)
2. To delete a row
DataFrame.drop([Row_Index_Number], axis=0)
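For example, a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

no_col = df.drop(['B'], axis=1)  # drop a column
print(list(no_col.columns))      # ['A']

no_row = df.drop([0], axis=0)    # drop the row with index label 0
print(no_row['A'].tolist())      # [2, 3]
```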
15. How to Merge Two DataFrames?
In pandas, we can combine two dataframes using the pandas.merge() method, which takes the two dataframes as parameters. The join keys can be given via the on parameter (or left_index/right_index, as below) and the join type via how (inner by default).
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[10, 20, 30])
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]}, index=[20, 30, 40])
result = pd.merge(df1, df2, left_index=True, right_index=True)
print(result)
Output:
A B C D
20 2 5 7 10
30 3 6 8 11
16. How to Sort a Dataframe?
A dataframe in pandas can be sorted in ascending or descending order according to a particular column. We can do so by using the sort_values() method and providing the column name according to which we want to sort the dataframe. We can also sort it by multiple columns.
To sort in descending order, we pass the additional parameter ascending and set it to False.
DataFrame.sort_values(by='Age', ascending=False)
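A minimal sketch of both directions (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'Krish'], 'Age': [20, 21, 19]})

asc = df.sort_values(by='Age')                    # ascending (the default)
print(asc['Age'].tolist())                        # [19, 20, 21]

desc = df.sort_values(by='Age', ascending=False)  # descending
print(desc['Age'].tolist())                       # [21, 20, 19]
```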
17. How to Compute Mean, Median, Mode, Variance, Standard Deviation and Various Quantile Ranges in Pandas?
The mean, median, mode, variance, standard deviation and quantiles can be computed using the following methods in Pandas:
- DataFrame.mean(): To calculate the mean
- DataFrame.median(): To calculate median
- DataFrame.mode(): To calculate the mode
- DataFrame.var(): To calculate variance
- DataFrame.std(): To calculate the standard deviation
- DataFrame.quantile(): To calculate quantile range, with range value as a parameter
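The methods above can be sketched on a small Series of made-up values:

```python
import pandas as pd

ser = pd.Series([1, 2, 2, 3, 4])
print(ser.mean())            # 2.4
print(ser.median())          # 2.0
print(ser.mode().tolist())   # [2]
print(ser.var())             # 1.3 (sample variance)
print(ser.std())             # ~1.14
print(ser.quantile(0.75))    # 3.0
```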
18. Difference between Shallow Copy and Deep Copy?
In Pandas, there are two ways to create a copy of the Series. They are as follows:
1. Shallow Copy is a copy of the series object where the indices and the data of the original object are not copied. It only copies the references to the indices and data. This means any changes made to a series will be reflected in the other. A shallow copy of the series can be created by writing the following syntax:
ser.copy(deep=False)
2. Deep Copy is a copy of the series object where it has its own indices and data. This means changes made to a copy of the object will not be reflected to the original series object. A deep copy of the series can be created by writing the following syntax:
ser.copy(deep=True)
The default value of the deep parameter of the copy() function is set to True.
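A small sketch of the difference (note that under Copy-on-Write, the default behaviour from pandas 3.0, writing through a shallow copy no longer modifies the original):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

deep = ser.copy(deep=True)      # own copy of indices and data
deep[0] = 99
print(ser[0])                   # 1 -> the original is unchanged

shallow = ser.copy(deep=False)  # shares the underlying data with ser
# In classic (pre-Copy-on-Write) pandas, writing to `shallow`
# would also change `ser`.
```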
19. How to Check and Remove Duplicate Values in Pandas?
In pandas, duplicate values can be checked by using the duplicated() method.
DataFrame.duplicated()
To remove duplicate values, we can use the drop_duplicates() method.
DataFrame.drop_duplicates()
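For example, a minimal sketch with a deliberately duplicated row:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': ['x', 'y', 'y', 'z']})

print(df.duplicated().tolist())  # [False, False, True, False]

deduped = df.drop_duplicates()
print(len(deduped))              # 3
```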
20. How to Handle Missing Data in Pandas?
Datasets generally have missing values, which can occur for a variety of reasons such as data collection issues, data entry errors or data simply not being available for certain observations. This can cause big problems in analysis. To handle these missing values, Pandas provides various functions.
These functions are used for detecting, removing and replacing null values in Pandas DataFrame:
- isnull(): It returns True for NaN values or null values and False for present values
- notnull(): It returns False for NaN values and True for present values
- dropna(): It analyzes and drops Rows/Columns with Null values
- fillna(): It let the user replace NaN values with some value of their own
- replace(): It is used to replace a string, regex, list, dictionary, series, number, etc.
- interpolate(): It fills NA values in the dataframe or series.
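The main functions above can be sketched on a tiny Series with one missing value:

```python
import pandas as pd
import numpy as np

ser = pd.Series([1.0, np.nan, 3.0])

print(ser.isnull().tolist())       # [False, True, False]
print(ser.fillna(0).tolist())      # [1.0, 0.0, 3.0]
print(ser.dropna().tolist())       # [1.0, 3.0]
print(ser.interpolate().tolist())  # [1.0, 2.0, 3.0] (linear interpolation)
```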
21. Difference between the interpolate() and fillna()
The interpolate() and fillna() methods in pandas are used to handle missing or NaN (Not a Number) values in a DataFrame or Series. The following table shows the difference between interpolate() and fillna():
| Feature | interpolate() | fillna() |
|---|---|---|
| Purpose | Estimates and fills missing values using interpolation techniques | Fills missing values using a specified constant or computed value |
| How it works | Calculates missing values based on existing surrounding data | Directly replaces NaN with given value(s) or strategies |
| Common methods supported | linear, polynomial, time, spline, etc. | 0, mean(), median(), mode(), forward-fill (ffill), back-fill (bfill) etc. |
| Data types supported | Mainly numeric and datetime data (where logical continuity exists) | Numeric, categorical and datetime data |
| Use case | Used when missing values depend on surrounding trends or time sequence | Used when missing values can be filled with a fixed or computed known value |
| Return type | Returns a DataFrame/Series with estimated values replacing NaN | Returns a DataFrame/Series with specified values replacing NaN |
| Example | df['col'].interpolate(method='linear') | df['col'].fillna(df['col'].mean()) |
22. Difference between map(), applymap() and apply()
The map(), applymap() and apply() methods are used in pandas for applying functions or transformations to elements in a DataFrame or Series. (Note that DataFrame.applymap() is deprecated since pandas 2.1 in favour of DataFrame.map().) The following table shows the difference between map(), applymap() and apply():
| Feature | map() | applymap() | apply() |
|---|---|---|---|
| Defined on | Series only | DataFrame only | Both Series and DataFrame |
| Works on | Each element of a Series | Each element of a DataFrame | Entire row/column (or whole Series) |
| Axis support | No axis parameter | No axis parameter | Has axis parameter (axis=0 for columns, axis=1 for rows) |
| Function application level | Element-wise | Element-wise | Row-wise or Column-wise (can also be element-wise on Series) |
| Typical use | Apply a function/dict to each element of a Series | Apply a function to each element of a DataFrame | Apply a function across rows/columns or to a whole Series |
| Example use case | Convert each name in a Series to uppercase | Square each element in a numeric DataFrame | Calculate sum/mean of each row or column |
| Return type | Series | DataFrame | Series (if applied on DataFrame rows/columns) or scalar (if aggregated) |
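A minimal sketch of map() on a Series and apply() on a DataFrame, with made-up data:

```python
import pandas as pd

ser = pd.Series(['tom', 'nick'])
print(ser.map(str.upper).tolist())     # ['TOM', 'NICK'] (element-wise)

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df.apply(sum, axis=0).tolist())  # [3, 7]  column-wise
print(df.apply(sum, axis=1).tolist())  # [4, 6]  row-wise
```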
23. How to Set and Reset the Index in a Pandas DataFrame?
1. Set Index: We can set the index to a Pandas dataframe by using the set_index() method, which is used to set a list, series or dataframe as the index of a dataframe.
DataFrame.set_index('Column_Name')
2. Reset Index: The index of Pandas dataframes can be reset by using the reset_index() method. It can be used to simply reset the index to the default integer index beginning at 0.
DataFrame.reset_index(inplace=True)
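For example, a minimal sketch (column names made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick'], 'Age': [20, 21]})

indexed = df.set_index('Name')
print(indexed.loc['Tom', 'Age'])  # 20 -> rows are now looked up by name

restored = indexed.reset_index()
print(list(restored.columns))     # ['Name', 'Age']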
24. What is Reindexing in Pandas?
Reindexing in Pandas as the name suggests means changing the index of the rows and columns of a dataframe. It can be done by using the Pandas reindex() method. In case of missing values or new values that are not present in the dataframe, the reindex() method assigns it as NaN.
df.reindex(new_index)
25. What is Multi-Indexing in Pandas?
Multi-indexing (also called hierarchical indexing) means having two or more index levels on an axis of a pandas object. It makes it possible to store and analyse higher-dimensional data in a two-dimensional DataFrame or one-dimensional Series. A MultiIndex in Pandas can be constructed using a number of functions such as:
- MultiIndex.from_arrays
- MultiIndex.from_tuples
- MultiIndex.from_product
- MultiIndex.from_frame
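A short sketch using MultiIndex.from_tuples (the department/name data is made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('IT', 'Tom'), ('IT', 'Nick'), ('HR', 'Krish')],
    names=['Dept', 'Name'])
ser = pd.Series([50000, 60000, 45000], index=idx)

print(ser.loc[('IT', 'Tom')])  # 50000 -> full two-level label
print(ser.loc['IT'].tolist())  # [50000, 60000] -> all rows of one level
```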
26. What is the difference between loc and iloc in Pandas?
1. loc: It is label-based, i.e. you access rows and columns using their labels (row and column names).
df.loc[row_labels, column_labels]
2. iloc: It is integer-position based and here you access rows and columns using their numeric index positions (row and column numbers).
df.iloc[row_positions, column_positions]
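Both indexers can be sketched on the same cell (the labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 21, 19]},
                  index=['Tom', 'Nick', 'Krish'])

print(df.loc['Nick', 'Age'])  # 21 -> by label
print(df.iloc[1, 0])          # 21 -> same cell, by integer position
```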
27. What is the Significance of the Pandas describe() Method?
Pandas describe() is used to view basic statistical details of a dataframe or a series of numeric values, such as count, mean, standard deviation, minimum, maximum and percentiles. It gives a different output (count, unique, top, freq) when applied to a series of strings.
DataFrame.describe()
28. How to Find the Correlation Using Pandas?
The Pandas dataframe.corr() method is used to find the pairwise correlation of all the columns of a dataframe. Missing values are automatically ignored; in recent versions of pandas, non-numeric columns must be excluded explicitly, e.g. by passing numeric_only=True.
DataFrame.corr()
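A minimal sketch with two perfectly correlated made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
corr = df.corr()
print(corr.loc['x', 'y'])  # 1.0 -> perfectly positively correlated
```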
29. What is groupby() in Pandas and how is it used?
The groupby() function in Pandas is used to split the data into groups based on one or more columns, then apply an operation (like aggregation, transformation or filtering) on each group separately.
df.groupby(by_column)
For example:
import pandas as pd
data = {'Dept': ['IT', 'IT', 'HR', 'HR'],
'Salary': [50000, 60000, 45000, 55000]}
df = pd.DataFrame(data)
result = df.groupby('Dept')['Salary'].mean()
print(result)
Output:
Dept
HR 50000.0
IT 55000.0
Name: Salary, dtype: float64
30. How can we use Pivot table in Pandas?
In Pandas, pivot_table() is used to summarize and reshape data into a tabular format. It allows you to aggregate values like sum, mean, count, etc. by specifying which columns become rows (index), which become columns and which contain the values to aggregate.
We can pivot the dataframe in Pandas by using the pivot_table() method. To unpivot the dataframe to its original form we can melt the dataframe by using the melt() method.
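For example, a minimal sketch with made-up department data:

```python
import pandas as pd

df = pd.DataFrame({'Dept': ['IT', 'IT', 'HR', 'HR'],
                   'Gender': ['M', 'F', 'M', 'F'],
                   'Salary': [50000, 60000, 45000, 55000]})

# Mean salary with departments as rows and genders as columns
table = pd.pivot_table(df, values='Salary', index='Dept',
                       columns='Gender', aggfunc='mean')
print(table.loc['IT', 'M'])  # 50000.0
print(table.loc['HR', 'F'])  # 55000.0
```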
31. What is the difference between pivot_table() and groupby()
Both pivot_table() and groupby() are useful methods in pandas used for aggregating and summarizing data. The following table shows the difference between pivot_table() and groupby():
| Feature | pivot_table() | groupby() |
|---|---|---|
| Purpose | Summarizes and aggregates data in a tabular (pivoted) format | Performs aggregation on grouped data of one or more columns |
| Reshaping | Used to reshape data based on column values | Used to group data based on categorical variables |
| Output structure | Returns a new reshaped DataFrame | Returns a GroupBy object which must be followed by aggregation functions |
| Multi-level grouping | Can handle multiple levels of grouping using index and columns parameters | Can handle multiple levels of grouping using multiple column names in groupby() |
| Comparison across dimensions | Used when we want to compare data across multiple dimensions | Used to summarize data within groups |
| Typical use case | Summarizing data with one axis as rows and another as columns | Grouping by one or more columns and then applying aggregation |
32. What is Data Aggregation in Pandas?
In Pandas, data aggregation refers to the act of summarizing or reducing data in order to produce a statistical summary of one or more columns in a dataset. Aggregation functions are applied to groups or subsets of data to calculate statistical measures like sum, mean, minimum, maximum, count, etc.
The agg() function in Pandas is frequently used to aggregate data. It applies one or more aggregation functions to one or more columns of a DataFrame or Series. Pandas' built-in functions or user-defined functions can be used as aggregation functions.
DataFrame.agg({'Col_name1': ['sum', 'min', 'max'], 'Col_name2': 'count'})
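The syntax above can be sketched on a small made-up dataframe; functions that do not apply to a column appear as NaN in the result:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = df.agg({'A': ['sum', 'min'], 'B': 'max'})
print(result.loc['sum', 'A'])  # 6.0
print(result.loc['max', 'B'])  # 6.0
```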
33. Difference between join(), merge() and concat()
The following table shows the difference between join(), merge() and concat():
| Feature | join() | merge() | concat() |
|---|---|---|---|
| Purpose | Combines two DataFrames on their index or on a key column. | Combines DataFrames using common columns or indices (like SQL joins). | Combines DataFrames along rows or columns. |
| Default on | Joins on index by default. | Joins on common columns by default. | Just stacks DataFrames without joining keys. |
| Join types | left, right, inner, outer | left, right, inner, outer | Not applicable (simply concatenates). |
| Axis support | Always horizontal (columns) | Always horizontal (columns) | Can be vertical (rows) or horizontal (columns) using axis. |
| Typical use | Combine DataFrames by their index labels. | Combine DataFrames based on matching column values. | Stack multiple DataFrames into one. |
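A minimal sketch contrasting the three (the keys and values are made up):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'L': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'R': [3, 4]})

# merge: SQL-style join on a common column
merged = pd.merge(left, right, on='key', how='inner')
print(merged['key'].tolist())  # ['b']

# concat: stack the frames along rows
stacked = pd.concat([left, right], ignore_index=True)
print(len(stacked))            # 4

# join: combine on the index (left join by default)
joined = left.set_index('key').join(right.set_index('key'))
print(joined.loc['b', 'R'])    # 3.0
```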
34. What is Time Series in Pandas?
Time series is a collection of data points with timestamps. It depicts how a quantity evolves over time. Pandas provides various functions to handle time series data efficiently: working with timestamps, resampling a time series to different periods, handling missing data and slicing the data using timestamps.
We have various time-series functions in pandas, for example:
| Pandas Built-in Function | Operation |
|---|---|
| pandas.to_datetime(DataFrame['Date']) | Convert the 'Date' column of a DataFrame to datetime dtype |
| DataFrame.set_index('Date', inplace=True) | Set 'Date' as the index |
| DataFrame.resample('H').sum() | Resample the time series to a different frequency (e.g. hourly, daily, weekly, monthly) |
| DataFrame.interpolate() | Fill missing values using linear interpolation |
| DataFrame.loc[start_date:end_date] | Slice the data based on timestamps |
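A minimal sketch of resampling and timestamp slicing on a made-up daily series:

```python
import pandas as pd

rng = pd.date_range('2023-01-01', periods=4, freq='D')
ser = pd.Series([1, 2, 3, 4], index=rng)

# Resample the daily data into 2-day sums
print(ser.resample('2D').sum().tolist())            # [3, 7]

# Slice by timestamp labels
print(ser.loc['2023-01-02':'2023-01-03'].tolist())  # [2, 3]
```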
35. How to convert a String to Datetime in Pandas?
A Python string can be converted to a DateTime object by using:
1. pandas.to_datetime()
import pandas as pd
date_string = '2023-07-17'
dateTime = pd.to_datetime(date_string)
print(dateTime)
Output:
2023-07-17 00:00:00
2. datetime.strptime
from datetime import datetime
date_string = '2023-07-17'
dateTime = datetime.strptime(date_string, '%Y-%m-%d')
print(dateTime)
Output:
2023-07-17 00:00:00
36. What is Time Delta in Pandas?
The time delta is the difference in dates and time. It indicates the duration or difference in time. The time delta object can be created by using the timedelta() method and providing the number of weeks, days, seconds, milliseconds, etc as the parameter.
Duration = pandas.Timedelta(days=7, hours=4, minutes=30, seconds=23)
With the help of the Timedelta data type, you can easily perform arithmetic operations, comparisons and other time-related manipulations in different units such as days, hours, minutes, seconds, milliseconds and microseconds.
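For example, a minimal sketch of Timedelta arithmetic:

```python
import pandas as pd

delta = pd.Timedelta(days=7, hours=4, minutes=30, seconds=23)
ts = pd.Timestamp('2023-07-17')

print(ts + delta)             # 2023-07-24 04:30:23
print(delta.total_seconds())  # 621023.0
```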
37. How to make Label Encoding using Pandas?
Label encoding is used to convert categorical data into numerical data so that a machine-learning model can fit it. To apply label encoding using pandas we can use:
1. pandas.Categorical().codes: It only gives codes.
- pd.Categorical() converts the data into a Categorical type.
- .codes gives the integer code for each category.
import pandas as pd
data = pd.Series(['Red', 'Blue', 'Green', 'Blue'])
encoded = pd.Categorical(data).codes
print(encoded)
Output:
[2 0 1 0]
2. pandas.factorize(): It gives both codes and unique labels.
- factorize() returns a tuple: (encoded_array, unique_categories).
- The first array gives integer codes and the second gives the mapping of categories.
- It automatically detects unique values and assigns codes in their first-seen order.
import pandas as pd
data = pd.Series(['Red', 'Blue', 'Green', 'Blue'])
encoded, uniques = pd.factorize(data)
print(encoded)
Output:
[0 1 2 1]
38. How to make Onehot Encoding using Pandas?
One-hot encoding is a technique for representing categorical data as numerical values in a machine-learning model. It works by creating a separate binary variable for each category in the data. The value of the binary variable is 1 if the observation belongs to that category and 0 otherwise. It can improve the performance of the model.
To apply one-hot encoding, we create dummy columns for our dataframe by using the get_dummies() method. (In recent versions of pandas the dummy columns are boolean by default; pass dtype=int for 0/1 integer output.)
pd.get_dummies(data, columns=None, dtype=int)
For example:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(encoded)
Output:
Color_Blue Color_Green Color_Red
0 0 0 1
1 1 0 0
2 0 1 0