Pandas is an open-source Python library used for data manipulation and analysis. In interviews, questions on Pandas are often asked to assess your ability to work with structured data effectively. Below are some of the most frequently asked interview questions and answers covering key Pandas topics.
1. List Key Features of Pandas.
Pandas is used for efficient data analysis. The key features of Pandas are as follows:
- Fast and efficient data manipulation and analysis
- Built-in time-series functionality
- Easy handling of missing data
- Fast data merging and joining
- Flexible reshaping of data sets
- Group-by functionality
- Loading data from many different file formats
- Tight integration with NumPy
2. What are the Different Types of Data Structures in Pandas?
The two data structures that are supported by Pandas are Series and DataFrames.
- Series is a one-dimensional labelled array that can hold data of any type. It is mostly used to represent a single column or row of data.
- DataFrame is a two-dimensional heterogeneous data structure. It stores data in a tabular form. Its three main components are data, rows and columns.
3. What is Series in Pandas?
A Series in Pandas is a one-dimensional labelled array, comparable to a single column of an Excel sheet. It can hold data of any type, such as integers, strings and Python objects. Its axis labels are collectively known as the index. A Series contains homogeneous data (a single dtype); its values can be changed, but the size of the series is immutable. A series can be created from a Python tuple, list or dictionary. The syntax for creating a series is as follows:
import pandas as pd
series = pd.Series(data)
4. What are the Different Ways to Create a Series?
In Pandas, a series can be created in many ways. They are as follows:
1. Creating a Series from a List
We can create a series by passing a Python list to the Series() constructor.
import pandas as pd
list1 = ['g', 'e', 'e', 'k', 's']
print(pd.Series(list1))
Output:
0 g
1 e
2 e
3 k
4 s
dtype: object
2. Creating a Series from Dictionary
A Series can also be created from a Python dictionary. The keys of the dictionary are used to construct the index of the series.
import pandas as pd
d = {'Geeks': 10, 'for': 20, 'geeks': 30}
print(pd.Series(d))
Output:
Geeks 10
for 20
geeks 30
dtype: int64
3. Creating a Series from Scalar Value
To create a series from a scalar value, we must provide an index. The Series constructor takes two arguments: the scalar value and a list of index labels. The value is repeated for every index label.
import pandas as pd
ser = pd.Series(10, index=[0, 1, 2, 3, 4, 5])
print(ser)
Output:
0 10
1 10
2 10
3 10
4 10
5 10
dtype: int64
4. Creating a Series using NumPy Functions
The NumPy module's functions, such as numpy.linspace() and numpy.random.randn(), can also be used to create a Pandas series.
import pandas as pd
import numpy as np
ser1 = pd.Series(np.linspace(3, 33, 3))
print(ser1)
ser2 = pd.Series(np.random.randn(3))
print("\n", ser2)
Output:
0 3.0
1 18.0
2 33.0
dtype: float64
0 -0.341027
1 -1.700664
2 0.364409
dtype: float64
5. Creating a Series using List Comprehension
Here, we will use the Python list comprehension technique to create a series in Pandas. We will use the range() function for the values and a list comprehension for the index labels.
import pandas as pd
ser = pd.Series(range(1, 20, 3), index=[x for x in 'abcdefg'])
print(ser)
Output:
a 1
b 4
c 7
d 10
e 13
f 16
g 19
dtype: int64
5. What is a DataFrame in Pandas?
A DataFrame in Pandas is a data structure used to store data in tabular form, that is, in the form of rows and columns. It is two-dimensional, size-mutable and heterogeneous in nature. The main components of a dataframe are data, rows and columns. A dataframe can be created by loading the dataset from existing storage such as a SQL database, CSV file, Excel file, etc. The syntax for creating a dataframe is as follows:
import pandas as pd
dataframe = pd.DataFrame(data)
6. What are the Different ways to Create a DataFrame in Pandas?
In Pandas, a dataframe can be created in many ways. They are as follows:
1. Creating a DataFrame using a List
In order to create a DataFrame from a Python list, just pass the list to the DataFrame() constructor.
import pandas as pd
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
print(pd.DataFrame(lst))
Output:
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
2. Creating a DataFrame using a Dictionary
A DataFrame can be created from a Python dictionary by passing it to the DataFrame() constructor. The keys of the dictionary become the column names and the values of the dictionary become the data of the DataFrame.
import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
print(pd.DataFrame(data))
Output:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
3. Creating a DataFrame using a List of Dictionaries
Another way to create a DataFrame is by using a Python list of dictionaries. The list is passed to the DataFrame() constructor, and the keys of each dictionary element become the column names.
import pandas as pd
lst = [{1: 'Geeks', 2: 'For', 3: 'Geeks'},
{1: 'Portal', 2: 'for', 3: 'Geeks'}]
print(pd.DataFrame(lst))
Output:
1 2 3
0 Geeks For Geeks
1 Portal for Geeks
4. Creating a DataFrame from Pandas Series
A DataFrame in Pandas can also be created by using the Pandas series.
import pandas as pd
lst = pd.Series(['Geeks', 'For', 'Geeks'])
print(pd.DataFrame(lst))
Output:
0
0 Geeks
1 For
2 Geeks
7. How to Read Data into a DataFrame from a CSV file?
We can create a dataframe from a CSV (Comma Separated Values) file by using the read_csv() method, which takes the file path as a parameter.
pandas.read_csv(file_name)
Another way is the read_table() method, which works like read_csv() but uses a tab as the default separator; a different delimiter can be supplied via the sep parameter.
pandas.read_table(file_name, sep=',')
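As a minimal, self-contained sketch (an in-memory StringIO buffer stands in for a real file path, and the column names are made up for illustration):

```python
import pandas as pd
from io import StringIO

# An in-memory buffer stands in for a CSV file on disk
csv_data = StringIO("Name,Age\nTom,20\nNick,21\nKrish,19")

df = pd.read_csv(csv_data)
print(df.shape)  # (3, 2)
```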
8. How can a DataFrame be Converted to an Excel File?
A Pandas dataframe can be converted to an Excel file by using the to_excel() method, which takes the file name as a parameter. We can also specify the sheet name via the sheet_name parameter. (Writing .xlsx files requires an Excel engine such as openpyxl to be installed.)
DataFrame.to_excel(file_name, sheet_name='Sheet1')
9. How to Convert a DataFrame into a Numpy Array?
NumPy is a Python package used to perform large numerical computations. It processes multidimensional array elements to perform complicated mathematical operations efficiently.
A Pandas dataframe can be converted to a NumPy array by using the to_numpy() method. We can also provide the datatype as an optional argument.
Dataframe.to_numpy()
We can also use the .values attribute to convert dataframe values to a NumPy array, although to_numpy() is the recommended approach in recent versions of pandas.
df.values
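For example, a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
arr = df.to_numpy()
print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2)
```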
10. How to access the first few rows of a dataframe?
The first few records of a dataframe can be accessed by using the pandas head() method. It takes one optional argument n, which is the number of rows. By default, it returns the first 5 rows of the dataframe. The head() method has the following syntax:
df.head(n)
Another way to do it is by using the iloc indexer, which is similar to the Python list-slicing technique. It has the following syntax:
df.iloc[:n]
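A minimal sketch showing that both approaches return the same rows (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': range(10)})

first3 = df.head(3)
print(first3['A'].tolist())       # [0, 1, 2]
print(df.iloc[:3]['A'].tolist())  # [0, 1, 2] -- same rows via slicing
```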
11. How to Select a Single Column of a DataFrame?
There are many ways to Select a single column of a dataframe. They are as follows:
By using the dot operator, we can access any column of a dataframe whose name is a valid Python identifier and does not clash with a DataFrame attribute.
Dataframe.column_name
Another way to select a column is by using the square brackets [].
DataFrame[column_name]
12. How to Rename a Column in a DataFrame?
A column of the dataframe can be renamed by using the rename() function. We can rename a single as well as multiple columns at the same time using this method.
DataFrame.rename(columns={'column1': 'COLUMN_1', 'column2':'COLUMN_2'}, inplace=True)
Another way is by using the set_axis() method, which takes the new labels and the axis to be relabelled. In recent versions of pandas it returns a new object instead of accepting inplace=True.
DataFrame.set_axis(['COLUMN_1', 'COLUMN_2'], axis=1)
In case we want to add a prefix or suffix to the column names, we can use the add_prefix() or add_suffix() methods.
DataFrame.add_prefix(prefix='PREFIX_')
DataFrame.add_suffix(suffix='_suffix')
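A short sketch of renaming in practice (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom'], 'age': [20]})

# Rename specific columns via a mapping
renamed = df.rename(columns={'name': 'Name', 'age': 'Age'})
print(list(renamed.columns))           # ['Name', 'Age']

# Add a prefix to every column name
prefixed = df.add_prefix('col_')
print(list(prefixed.columns))          # ['col_name', 'col_age']
```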
13. How to add Row or Column to an Existing Dataframe?
1. Adding Rows
The df.loc[] is used to access a group of rows or columns and can be used to add a row to a dataframe.
DataFrame.loc[Row_Index]=new_row
We can also add multiple rows in a dataframe by using pandas.concat() function which takes a list of dataframes to be added together.
pandas.concat([Dataframe1,Dataframe2])
2. Adding Columns
We can add a column to an existing dataframe by simply assigning a list or Series of values to a new column name.
DataFrame['column_name'] = list_of_values
Another way to add a column is by using the df.insert() method, which takes the position at which the column should be inserted, the column name and the column values as parameters.
DataFrameName.insert(col_index, col_name, value)
We can also add a column to a dataframe by using df.assign() function
DataFrame.assign(**kwargs)
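Putting the above together in one minimal sketch (the names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick'], 'Age': [20, 21]})

# Add a row at a new index label with loc
df.loc[2] = ['Krish', 19]

# Add a column by assigning a list of values
df['City'] = ['Delhi', 'Mumbai', 'Pune']

# Insert a column at a specific position
df.insert(1, 'ID', [101, 102, 103])

print(list(df.columns))  # ['Name', 'ID', 'Age', 'City']
print(len(df))           # 3
```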
14. How to Delete a Row or Column from an Existing DataFrame?
We can delete a row or a column from a dataframe by using the df.drop() method and providing the row index or column name as the parameter.
1. To delete a column
DataFrame.drop(['Column_Name'], axis=1)
2. To delete a row
DataFrame.drop([Row_Index_Number], axis=0)
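For example, a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

no_col = df.drop(['B'], axis=1)  # drop a column
print(list(no_col.columns))      # ['A']

no_row = df.drop([0], axis=0)    # drop the row with index label 0
print(no_row['A'].tolist())      # [2, 3]
```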
15. How to Merge Two DataFrames?
In pandas, we can combine two dataframes using the pandas.merge() method, which takes the two dataframes as parameters. The join keys can be given via the on parameter (or left_index/right_index, as below) and the join type via how (inner by default).
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[10, 20, 30])
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]}, index=[20, 30, 40])
result = pd.merge(df1, df2, left_index=True, right_index=True)
print(result)
Output:
A B C D
20 2 5 7 10
30 3 6 8 11
16. How to Sort a Dataframe?
A dataframe in pandas can be sorted in ascending or descending order according to a particular column. We can do so by using the sort_values() method and providing the column name according to which we want to sort the dataframe. We can also sort it by multiple columns.
To sort in descending order, we pass the additional parameter ascending and set it to False.
DataFrame.sort_values(by='Age', ascending=False)
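A minimal sketch of both directions (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'Krish'], 'Age': [20, 21, 19]})

asc = df.sort_values(by='Age')                    # ascending (the default)
print(asc['Age'].tolist())                        # [19, 20, 21]

desc = df.sort_values(by='Age', ascending=False)  # descending
print(desc['Age'].tolist())                       # [21, 20, 19]
```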
17. How to Compute Mean, Median, Mode, Variance, Standard Deviation and Various Quantile Ranges in Pandas?
The mean, median, mode, variance, standard deviation and quantiles can be computed using the following methods in Pandas:
- DataFrame.mean(): To calculate the mean
- DataFrame.median(): To calculate median
- DataFrame.mode(): To calculate the mode
- DataFrame.var(): To calculate variance
- DataFrame.std(): To calculate the standard deviation
- DataFrame.quantile(): To calculate quantile range, with range value as a parameter
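The methods above can be sketched on a small Series of made-up values:

```python
import pandas as pd

ser = pd.Series([1, 2, 2, 3, 4])
print(ser.mean())            # 2.4
print(ser.median())          # 2.0
print(ser.mode().tolist())   # [2]
print(ser.var())             # 1.3 (sample variance)
print(ser.std())             # ~1.14
print(ser.quantile(0.75))    # 3.0
```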
18. Difference between Shallow Copy and Deep Copy?
In Pandas, there are two ways to create a copy of the Series. They are as follows:
1. Shallow Copy is a copy of the series object where the indices and the data of the original object are not copied. It only copies the references to the indices and data. This means any changes made to a series will be reflected in the other. A shallow copy of the series can be created by writing the following syntax:
ser.copy(deep=False)
2. Deep Copy is a copy of the series object where it has its own indices and data. This means changes made to a copy of the object will not be reflected to the original series object. A deep copy of the series can be created by writing the following syntax:
ser.copy(deep=True)
The default value of the deep parameter of the copy() function is set to True.
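A small sketch of the difference (note that under Copy-on-Write, the default behaviour from pandas 3.0, writing through a shallow copy no longer modifies the original):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

deep = ser.copy(deep=True)      # own copy of indices and data
deep[0] = 99
print(ser[0])                   # 1 -> the original is unchanged

shallow = ser.copy(deep=False)  # shares the underlying data with ser
# In classic (pre-Copy-on-Write) pandas, writing to `shallow`
# would also change `ser`.
```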
19. How to Check and Remove Duplicate Values in Pandas?
In pandas, duplicate values can be checked by using the duplicated() method.
DataFrame.duplicated()
To remove duplicate values, we can use the drop_duplicates() method.
DataFrame.drop_duplicates()
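For example, a minimal sketch with a deliberately duplicated row:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': ['x', 'y', 'y', 'z']})

print(df.duplicated().tolist())  # [False, False, True, False]

deduped = df.drop_duplicates()
print(len(deduped))              # 3
```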
20. How to Handle Missing Data in Pandas?
Datasets generally have missing values, which can occur for a variety of reasons such as data collection issues, data entry errors or data simply not being available for certain observations. This can cause big problems in analysis. To handle these missing values, Pandas provides various functions.
These functions are used for detecting, removing and replacing null values in Pandas DataFrame:
- isnull(): It returns True for NaN values or null values and False for present values
- notnull(): It returns False for NaN values and True for present values
- dropna(): It analyzes and drops Rows/Columns with Null values
- fillna(): It let the user replace NaN values with some value of their own
- replace(): It is used to replace a string, regex, list, dictionary, series, number, etc.
- interpolate(): It fills NA values in the dataframe or series.
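The main functions above can be sketched on a tiny Series with one missing value:

```python
import pandas as pd
import numpy as np

ser = pd.Series([1.0, np.nan, 3.0])

print(ser.isnull().tolist())       # [False, True, False]
print(ser.fillna(0).tolist())      # [1.0, 0.0, 3.0]
print(ser.dropna().tolist())       # [1.0, 3.0]
print(ser.interpolate().tolist())  # [1.0, 2.0, 3.0] (linear interpolation)
```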
21. Difference between the interpolate() and fillna()
The interpolate() and fillna() methods in pandas are used to handle missing or NaN (Not a Number) values in a DataFrame or Series. The following table shows the difference between interpolate() and fillna():
| Feature | interpolate() | fillna() |
|---|---|---|
| Purpose | Estimates and fills missing values using interpolation techniques | Fills missing values using a specified constant or computed value |
| How it works | Calculates missing values based on existing surrounding data | Directly replaces NaN with given value(s) or strategies |
| Common methods supported | linear, polynomial, time, spline, etc. | 0, mean(), median(), mode(), forward-fill (ffill), back-fill (bfill) etc. |
| Data types supported | Mainly numeric and datetime data (where logical continuity exists) | Numeric, categorical and datetime data |
| Use case | Used when missing values depend on surrounding trends or time sequence | Used when missing values can be filled with a fixed or computed known value |
| Return type | Returns a DataFrame/Series with estimated values replacing NaN | Returns a DataFrame/Series with specified values replacing NaN |
| Example | df['col'].interpolate(method='linear') | df['col'].fillna(df['col'].mean()) |
22. Difference between map(), applymap() and apply()
The map(), applymap() and apply() methods are used in pandas for applying functions or transformations to elements in a DataFrame or Series. (Note that DataFrame.applymap() is deprecated since pandas 2.1 in favour of DataFrame.map().) The following table shows the difference between map(), applymap() and apply():
| Feature | map() | applymap() | apply() |
|---|---|---|---|
| Defined on | Series only | DataFrame only | Both Series and DataFrame |
| Works on | Each element of a Series | Each element of a DataFrame | Entire row/column (or whole Series) |
| Axis support | No axis parameter | No axis parameter | Has axis parameter (axis=0 for columns, axis=1 for rows) |
| Function application level | Element-wise | Element-wise | Row-wise or Column-wise (can also be element-wise on Series) |
| Typical use | Apply a function/dict to each element of a Series | Apply a function to each element of a DataFrame | Apply a function across rows/columns or to a whole Series |
| Example use case | Convert each name in a Series to uppercase | Square each element in a numeric DataFrame | Calculate sum/mean of each row or column |
| Return type | Series | DataFrame | Series (if applied on DataFrame rows/columns) or scalar (if aggregated) |
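A minimal sketch of map() on a Series and apply() on a DataFrame, with made-up data:

```python
import pandas as pd

ser = pd.Series(['tom', 'nick'])
print(ser.map(str.upper).tolist())     # ['TOM', 'NICK'] (element-wise)

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df.apply(sum, axis=0).tolist())  # [3, 7]  column-wise
print(df.apply(sum, axis=1).tolist())  # [4, 6]  row-wise
```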
23. How to Set and Reset the Index in a Pandas DataFrame?
1. Set Index: We can set the index to a Pandas dataframe by using the set_index() method, which is used to set a list, series or dataframe as the index of a dataframe.
DataFrame.set_index('Column_Name')
2. Reset Index: The index of Pandas dataframes can be reset by using the reset_index() method. It can be used to simply reset the index to the default integer index beginning at 0.
DataFrame.reset_index(inplace=True)
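For example, a minimal sketch (column names made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick'], 'Age': [20, 21]})

indexed = df.set_index('Name')
print(indexed.loc['Tom', 'Age'])  # 20 -> rows are now looked up by name

restored = indexed.reset_index()
print(list(restored.columns))     # ['Name', 'Age']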
24. What is Reindexing in Pandas?
Reindexing in Pandas as the name suggests means changing the index of the rows and columns of a dataframe. It can be done by using the Pandas reindex() method. In case of missing values or new values that are not present in the dataframe, the reindex() method assigns it as NaN.
df.reindex(new_index)
25. What is Multi-Indexing in Pandas?
Multi-indexing (also called hierarchical indexing) means having two or more index levels on an axis of a pandas object. It makes it possible to store and analyse higher-dimensional data in a two-dimensional DataFrame or one-dimensional Series. A MultiIndex in Pandas can be constructed using a number of functions such as:
- MultiIndex.from_arrays
- MultiIndex.from_tuples
- MultiIndex.from_product
- MultiIndex.from_frame
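A short sketch using MultiIndex.from_tuples (the department/name data is made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('IT', 'Tom'), ('IT', 'Nick'), ('HR', 'Krish')],
    names=['Dept', 'Name'])
ser = pd.Series([50000, 60000, 45000], index=idx)

print(ser.loc[('IT', 'Tom')])  # 50000 -> full two-level label
print(ser.loc['IT'].tolist())  # [50000, 60000] -> all rows of one level
```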
26. What is the difference between loc and iloc in Pandas?
1. loc: It is label-based, i.e. you access rows and columns using their labels (row and column names).
df.loc[row_labels, column_labels]
2. iloc: It is integer-position based and here you access rows and columns using their numeric index positions (row and column numbers).
df.iloc[row_positions, column_positions]
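Both indexers can be sketched on the same cell (the labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 21, 19]},
                  index=['Tom', 'Nick', 'Krish'])

print(df.loc['Nick', 'Age'])  # 21 -> by label
print(df.iloc[1, 0])          # 21 -> same cell, by integer position
```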
27. What is the Significance of the Pandas describe() Method?
Pandas describe() is used to view basic statistical details of a dataframe or a series of numeric values, such as count, mean, standard deviation, minimum, maximum and percentiles. It gives a different output (count, unique, top, freq) when applied to a series of strings.
DataFrame.describe()
28. How to Find the Correlation Using Pandas?
The Pandas dataframe.corr() method is used to find the pairwise correlation of all the columns of a dataframe. Missing values are automatically ignored; in recent versions of pandas, non-numeric columns must be excluded explicitly, e.g. by passing numeric_only=True.
DataFrame.corr()
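A minimal sketch with two perfectly correlated made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
corr = df.corr()
print(corr.loc['x', 'y'])  # 1.0 -> perfectly positively correlated
```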
29. What is groupby() in Pandas and how is it used?
The groupby() function in Pandas is used to split the data into groups based on one or more columns, then apply an operation (like aggregation, transformation or filtering) on each group separately.
df.groupby(by_column)
For example:
import pandas as pd
data = {'Dept': ['IT', 'IT', 'HR', 'HR'],
'Salary': [50000, 60000, 45000, 55000]}
df = pd.DataFrame(data)
result = df.groupby('Dept')['Salary'].mean()
print(result)
Output:
Dept
HR 50000.0
IT 55000.0
Name: Salary, dtype: float64
30. How can we use Pivot table in Pandas?
In Pandas, pivot_table() is used to summarize and reshape data into a tabular format. It allows you to aggregate values like sum, mean, count, etc. by specifying which columns become rows (index), which become columns and which contain the values to aggregate.
We can pivot the dataframe in Pandas by using the pivot_table() method. To unpivot the dataframe to its original form we can melt the dataframe by using the melt() method.
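For example, a minimal sketch with made-up department data:

```python
import pandas as pd

df = pd.DataFrame({'Dept': ['IT', 'IT', 'HR', 'HR'],
                   'Gender': ['M', 'F', 'M', 'F'],
                   'Salary': [50000, 60000, 45000, 55000]})

# Mean salary with departments as rows and genders as columns
table = pd.pivot_table(df, values='Salary', index='Dept',
                       columns='Gender', aggfunc='mean')
print(table.loc['IT', 'M'])  # 50000.0
print(table.loc['HR', 'F'])  # 55000.0
```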
31. What is the difference between pivot_table() and groupby()
Both pivot_table() and groupby() are useful methods in pandas used for aggregating and summarizing data. The following table shows the difference between pivot_table() and groupby():
| Feature | pivot_table() | groupby() |
|---|---|---|
| Purpose | Summarizes and aggregates data in a tabular (pivoted) format | Performs aggregation on grouped data of one or more columns |
| Reshaping | Used to reshape data based on column values | Used to group data based on categorical variables |
| Output structure | Returns a new reshaped DataFrame | Returns a GroupBy object which must be followed by aggregation functions |
| Multi-level grouping | Can handle multiple levels of grouping using index and columns parameters | Can handle multiple levels of grouping using multiple column names in groupby() |
| Comparison across dimensions | Used when we want to compare data across multiple dimensions | Used to summarize data within groups |
| Typical use case | Summarizing data with one axis as rows and another as columns | Grouping by one or more columns and then applying aggregation |
32. What is Data Aggregation in Pandas?
In Pandas, data aggregation refers to the act of summarizing or reducing data in order to produce a statistical summary of one or more columns in a dataset. Aggregation functions are applied to groups or subsets of data to calculate statistical measures like sum, mean, minimum, maximum, count, etc.
The agg() function in Pandas is frequently used to aggregate data. It applies one or more aggregation functions to one or more columns of a DataFrame or Series. Pandas' built-in functions or user-defined functions can be used as aggregation functions.
DataFrame.agg({'Col_name1': ['sum', 'min', 'max'], 'Col_name2': 'count'})
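The syntax above can be sketched on a small made-up dataframe; functions that do not apply to a column appear as NaN in the result:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result = df.agg({'A': ['sum', 'min'], 'B': 'max'})
print(result.loc['sum', 'A'])  # 6.0
print(result.loc['max', 'B'])  # 6.0
```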
33. Difference between join(), merge() and concat()
The following table shows the difference between join(), merge() and concat():
| Feature | join() | merge() | concat() |
|---|---|---|---|
| Purpose | Combines two DataFrames on their index or on a key column. | Combines DataFrames using common columns or indices (like SQL joins). | Combines DataFrames along rows or columns. |
| Default on | Joins on index by default. | Joins on common columns by default. | Just stacks DataFrames without joining keys. |
| Join types | left, right, inner, outer | left, right, inner, outer | Not applicable (simply concatenates). |
| Axis support | Always horizontal (columns) | Always horizontal (columns) | Can be vertical (rows) or horizontal (columns) using axis. |
| Typical use | Combine DataFrames by their index labels. | Combine DataFrames based on matching column values. | Stack multiple DataFrames into one. |
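A minimal sketch contrasting the three (the keys and values are made up):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'L': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'R': [3, 4]})

# merge: SQL-style join on a common column
merged = pd.merge(left, right, on='key', how='inner')
print(merged['key'].tolist())  # ['b']

# concat: stack the frames along rows
stacked = pd.concat([left, right], ignore_index=True)
print(len(stacked))            # 4

# join: combine on the index (left join by default)
joined = left.set_index('key').join(right.set_index('key'))
print(joined.loc['b', 'R'])    # 3.0
```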
34. What is Time Series in Pandas?
Time series is a collection of data points with timestamps. It depicts how a quantity evolves over time. Pandas provides various functions to handle time series data efficiently: working with timestamps, resampling a time series to different periods, handling missing data and slicing the data using timestamps.
We have various time-series functions in pandas, for example:
| Pandas Built-in Function | Operation |
|---|---|
| pandas.to_datetime(DataFrame['Date']) | Convert the 'Date' column of a DataFrame to datetime dtype |
| DataFrame.set_index('Date', inplace=True) | Set 'Date' as the index |
| DataFrame.resample('H').sum() | Resample the time series to a different frequency (e.g. hourly, daily, weekly, monthly) |
| DataFrame.interpolate() | Fill missing values using linear interpolation |
| DataFrame.loc[start_date:end_date] | Slice the data based on timestamps |
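A minimal sketch of resampling and timestamp slicing on a made-up daily series:

```python
import pandas as pd

rng = pd.date_range('2023-01-01', periods=4, freq='D')
ser = pd.Series([1, 2, 3, 4], index=rng)

# Resample the daily data into 2-day sums
print(ser.resample('2D').sum().tolist())            # [3, 7]

# Slice by timestamp labels
print(ser.loc['2023-01-02':'2023-01-03'].tolist())  # [2, 3]
```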
35. How to convert a String to Datetime in Pandas?
A Python string can be converted to a DateTime object by using:
1. pandas.to_datetime()
import pandas as pd
date_string = '2023-07-17'
dateTime = pd.to_datetime(date_string)
print(dateTime)
Output:
2023-07-17 00:00:00
2. datetime.strptime
from datetime import datetime
date_string = '2023-07-17'
dateTime = datetime.strptime(date_string, '%Y-%m-%d')
print(dateTime)
Output:
2023-07-17 00:00:00
36. What is Time Delta in Pandas?
The time delta is the difference in dates and time. It indicates the duration or difference in time. The time delta object can be created by using the timedelta() method and providing the number of weeks, days, seconds, milliseconds, etc as the parameter.
Duration = pandas.Timedelta(days=7, hours=4, minutes=30, seconds=23)
With the help of the Timedelta data type, you can easily perform arithmetic operations, comparisons and other time-related manipulations in different units such as days, hours, minutes, seconds, milliseconds and microseconds.
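For example, a minimal sketch of Timedelta arithmetic:

```python
import pandas as pd

delta = pd.Timedelta(days=7, hours=4, minutes=30, seconds=23)
ts = pd.Timestamp('2023-07-17')

print(ts + delta)             # 2023-07-24 04:30:23
print(delta.total_seconds())  # 621023.0
```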
37. How to make Label Encoding using Pandas?
Label encoding is used to convert categorical data into numerical data so that a machine-learning model can fit it. To apply label encoding using pandas we can use:
1. pandas.Categorical().codes: It only gives codes.
- pd.Categorical() converts the data into a Categorical type.
- .codes gives the integer code for each category.
import pandas as pd
data = pd.Series(['Red', 'Blue', 'Green', 'Blue'])
encoded = pd.Categorical(data).codes
print(encoded)
Output:
[2 0 1 0]
2. pandas.factorize(): It gives both codes and unique labels.
- factorize() returns a tuple: (encoded_array, unique_categories).
- The first array gives integer codes and the second gives the mapping of categories.
- It automatically detects unique values and assigns codes in their first-seen order.
import pandas as pd
data = pd.Series(['Red', 'Blue', 'Green', 'Blue'])
encoded, uniques = pd.factorize(data)
print(encoded)
Output:
[0 1 2 1]
38. How to make Onehot Encoding using Pandas?
One-hot encoding is a technique for representing categorical data as numerical values in a machine-learning model. It works by creating a separate binary variable for each category in the data. The value of the binary variable is 1 if the observation belongs to that category and 0 otherwise. It can improve the performance of the model.
To apply one-hot encoding, we create dummy columns for our dataframe by using the get_dummies() method. (In recent versions of pandas the dummy columns are boolean by default; pass dtype=int for 0/1 integer output.)
pd.get_dummies(data, columns=None, dtype=int)
For example:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(encoded)
Output:
Color_Blue Color_Green Color_Red
0 0 0 1
1 1 0 0
2 0 1 0