Merging DataFrames is a common operation when working with multiple datasets in Pandas. The `merge()` function allows you to combine two DataFrames based on a common column or index. In this article, we will explore how to merge DataFrames using various options and techniques.
We will load the datasets into two Pandas DataFrames and merge them based on the ID column.
Python
import pandas as pd
# Create DataFrame for Dataset 1
data1 = [[1, 'Raj', 25],
[2, 'Sharma', 42],
[3, 'Saroj', 22],
[4, 'Raja', 51],
[5, 'Kajal', 26]]
df1 = pd.DataFrame(data1, columns=['ID', 'Name', 'Age'])
# Create DataFrame for Dataset 2
data2 = [[1, 'Male', 25000],
[2, 'Male', 42500],
[3, 'Female', 22300],
[4, 'Male', 51400],
[5, 'Female', 26800]]
df2 = pd.DataFrame(data2, columns=['ID', 'Gender', 'Salary'])
Method 1: Inner Merge
An inner merge returns only the rows that have matching values in both DataFrames. If a row doesnt have a corresponding match in either DataFrame, it is excluded.
- Call the merge() function with how=inner.
Python
# Perform an inner merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
This merge includes only the rows where the ID exists in both DataFrames. In this case, the rows for ID = 1, 2, 4, and 5 will be included in the result.
Method 2: Left Merge
A left merge returns all the rows from the left DataFrame (first DataFrame) and the matching rows from the right DataFrame. If there is no match, the result will contain NaN values for columns from the right DataFrame.
- Call the merge() function with how=left.
Python
# Perform a left merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='left')
print(merged_df)
This method ensures all rows from the left DataFrame (df1) are kept, and columns from df2 are added where a matching ID is found. If no match is found, NaN is used for the columns from df2.
Method 3: Right Merge
A right merge is the opposite of a left merge. It returns all the rows from the right DataFrame (second DataFrame) and the matching rows from the left DataFrame.
- Call the merge() function with how=right.
Python
# Perform a right merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='right')
print(merged_df)
This merge ensures that all rows from the right DataFrame (df2) are included. If no match is found for a row in df1, the left DataFrame’s columns will be NaN.
Method 4: Outer Merge
An outer merge returns all rows from both DataFrames, with `NaN` for missing values where there is no match. This is useful when you want to keep all the data from both DataFrames.
- Call the merge() function with how=outer.
Python
# Perform an outer merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='outer')
print(merged_df)
This method includes all rows from both DataFrames. If a row in df1 doesnt have a match in df2, or vice versa, the corresponding columns will have NaN values.
Method 5: Merging on Multiple Columns
You can merge DataFrames based on multiple columns by passing a list of column names to the on parameter. This allows you to perform more complex merges when the matching criteria are based on multiple columns.
- Call the merge() function with a list of column names.
Python
# Example with additional columns
df1['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'USA']
df2['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'Mexico']
# Merge on 'ID' and 'Country'
merged_df = pd.merge(df1, df2, on=['ID', 'Country'], how='inner')
print(merged_df)
Here, the merge is based on both the ID and Country columns. Only rows that have matching values in both columns from both DataFrames will be included.
You can refer this article for more detailed explanation: Merge Join two dataframes on multiple columns in Pandas
Method 6: Merging with Different Column Names
If the columns you want to merge on have different names in the two DataFrames, you can use the left_on and right_on parameters to specify the columns to merge on.
- Use left_on and right_on to specify the merge columns.
Python
# Rename 'ID' in df2 to 'EmployeeID' for demonstration
df2.rename(columns={'ID': 'EmployeeID'}, inplace=True)
# Merge using different column names
merged_df = pd.merge(df1, df2, left_on='ID', right_on='EmployeeID', how='inner')
print(merged_df)
Here, we merged the DataFrames using different column names: ID in df1 and EmployeeID in df2. The left_on and right_on parameters allow you to specify which columns to use for the merge.
You can refer this article for more detailed explanation: Merge two dataframes with different columns
Summary:
- Inner merge: Includes only the matching rows from both DataFrames.
- Left merge: Keeps all rows from the left DataFrame, adding matching rows from the right.
- Right merge: Keeps all rows from the right DataFrame, adding matching rows from the left.
- Outer merge: Includes all rows from both DataFrames, with NaN where there are no matches.
- Merge on multiple columns: Combine DataFrames based on multiple criteria.
- Merge with different column names: Specify different columns to merge using left_on and right_on.
Recommendation: For most scenarios, the inner merge is the most commonly used, but depending on your data and the relationship between the two DataFrames, other merge types may be more appropriate. Experiment with these options to find the best fit for your data merging needs.
Similar Reads
Merge Multiple Dataframes - Pandas
Merging allow us to combine data from two or more DataFrames into one based on index values. This is used when we want to bring together related information from different sources. In Pandas there are different ways to combine DataFrames: 1. Merging DataFrames Using merge()We use merge() when we wan
3 min read
Python | Pandas dataframe.melt()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.melt() function unpivots a DataFrame from wide format to long format,
2 min read
Python | Pandas dataframe.eq()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.eq() is a wrapper used for the flexible comparison. It provides a con
3 min read
Python | Pandas dataframe.mul()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier.Pandas dataframe.mul() function return multiplication of dataframe and other element- w
2 min read
Pandas DataFrame.reset_index()
In Pandas, reset_index() method is used to reset the index of a DataFrame. By default, it creates a new integer-based index starting from 0, making the DataFrame easier to work with in various scenarios, especially after performing operations like filtering, grouping or multi-level indexing. Example
3 min read
Python | Pandas dataframe.mask()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.mask() function return an object of same shape as self and whose corr
3 min read
Pandas Dataframe Rename Index
To rename the index of a Pandas DataFrame, rename() method is most easier way to rename specific index values in a pandas dataFrame; allows to selectively change index names without affecting other values. [GFGTABS] Python import pandas as pd data = {'Name': ['John', 'Alice',
3 min read
Joining two Pandas DataFrames using merge()
The merge() function is designed to merge two DataFrames based on one or more columns with matching values. The basic idea is to identify columns that contain common data between the DataFrames and use them to align rows. Let's understand the process of joining two pandas DataFrames using merge(), e
4 min read
How to Join Pandas DataFrames using Merge?
Joining and merging DataFrames is that the core process to start out with data analysis and machine learning tasks. It's one of the toolkits which each Data Analyst or Data Scientist should master because in most cases data comes from multiple sources and files. In this tutorial, you'll how to join
3 min read
Slicing Pandas Dataframe
Slicing a Pandas DataFrame is a important skill for extracting specific data subsets. Whether you want to select rows, columns or individual cells, Pandas provides efficient methods like iloc[] and loc[]. In this guide weâll explore how to use integer-based and label-based indexing to slice DataFram
3 min read