
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
Merge Two CSV Files by Specific Column Using Pandas in Python
CSV (Comma Separated Values) files are widely used for storing and exchanging data in a simple format. In many data processing tasks, it is necessary to merge two or more CSV files based on a specific column. Fortunately, this can be easily achieved using the Pandas library in Python.
In this article, we will learn how to merge two CSV files by a specific column using Pandas in Python.
What is Pandas Library?
Pandas is an open-source library for information control and examination in Python. It offers tools for working with structured data, such as tabular, time-series, and multidimensional data, as well as high-performance data structures. Pandas is widely used in finance, data science, machine learning, and other fields that require data manipulation.
Steps to Merge two CSV Files by Specific Column in Python
Below are the complete steps to merge the two CSV files by their specific column in Python using the Pandas library ?
Step 1: Import Pandas library
The first step to merging two CSV files is to import the pandas library. Pandas is a powerful data analysis library for Python. It provides data structures for efficiently storing and manipulating large datasets. To use Pandas, we first need to import it in our Python program. We can do this using the following command ?
import pandas as pd
Step 2: Read CSV Files
Our next step is to read the two CSV files that we want to merge. We can use the read_csv() function of Pandas to read CSV files into a Pandas DataFrame. We need to provide the file path or URL of the CSV file as an argument to the read_csv() function. For example ?
df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv')
Step 3: Merging CSV Files
After reading the CSV files into Pandas DataFrames, now its time to merge them based on a specific column using the merge() function. The merge() function takes two DataFrames as input and merges them based on a common column. Here's the basic syntax ?
merged_df = pd.merge(df1, df2, on='column_name')
In this example, merged_df is the new DataFrame that contains the merged data from df1 and df2. The on parameter specifies the name of the common column on which the two DataFrames are merged.
For example, let's say we have two CSV files sales.csv and customers.csv. Both files contain a column named CustomerID. We want to merge the two files based on the CustomerID column. Here's how we can do it ?
sales = pd.read_csv('sales.csv') customers = pd.read_csv('customers.csv') merged_df = pd.merge(sales, customers, on='CustomerID')
Step 4: Handling Missing Values (if any)
When merging two CSV files based on a specific column, there may be missing values in the merged DataFrame. These missing values may occur if there are no matching values in the corresponding column of one of the CSV files.
To handle missing values, we can use the fillna() function of Pandas to replace them with a default value or a calculated value. For example, we can replace missing values with a string 'Unknown' using the following code ?
merged_df.fillna('Unknown', inplace=True)
In this example, we have used the fillna() function to replace missing values in the merged DataFrame with the string 'Unknown'. We have also specified the inplace parameter as True, which means that the original DataFrame will be modified instead of creating a new one.
Example 1: Using the Merge() Function
In this example, we will use the merge() function of Pandas to merge two CSV files based on a specific column.
CSV Files
Suppose we have two CSV files: employees.csv and departments.csv. The employees.csv file contains the following data ?
EmployeeID | Name | DepartmentID | Salary |
---|---|---|---|
1 | John | 1 | 50000 |
2 | Sarah | 2 | 60000 |
3 | David | 1 | 55000 |
4 | Alex | 3 | 65000 |
5 | Emily | 2 | 55000 |
The departments.csv file contains the following data ?
DepartmentID | DepartmentName |
---|---|
1 | IT |
2 | Sales |
3 | Marketing |
4 | HR |
Example
import pandas as pd # Load CSV files employees = pd.read_csv('employees.csv') departments = pd.read_csv('departments.csv') # Merge dataframes based on DepartmentID column merged_df = pd.merge(employees, departments, on='DepartmentID') # Print merged dataframe print(merged_df.head()) # Save merged dataframe to a new CSV file merged_df.to_csv('merged_employees_departments.csv', index=False)
Output
EmployeeID Name DepartmentID Salary DepartmentName 0 1 John 1 50000 IT 1 3 David 1 55000 IT 2 2 Sarah 2 60000 Sales 3 5 Emily 2 55000 Sales 4 4 Alex 3 65000 Marketing
Example 2: Using the Join() Function
In this example, we will use the join() method of Pandas to merge two CSV files based on a specific column.
CSV Files
Suppose we have two CSV files: orders.csv and customers.csv. The orders.csv file contains the following data ?
OrderID | CustomerID | OrderDate | TotalAmount |
---|---|---|---|
1 | 1 | 2022-05-01 | 100.0 |
2 | 3 | 2022-05-02 | 150.0 |
3 | 2 | 2022-05-03 | 200.0 |
4 | 1 | 2022-05-04 | 75.0 |
5 | 4 | 2022-05-05 | 120.0 |
The customers.csv file contains the following data ?
CustomerID | CustomerName | |
---|---|---|
1 | John | [email protected] |
2 | Sarah | [email protected] |
3 | David | [email protected] |
4 | Emily | [email protected] |
Example
import pandas as pd # Load CSV files orders = pd.read_csv('orders.csv') customers = pd.read_csv('customers.csv') # Join dataframes based on CustomerID column joined_df = orders.set_index('CustomerID').join(customers.set_index('CustomerID')) # Print joined dataframe print(joined_df.head()) # Save joined dataframe to a new CSV file joined_df.to_csv('joined_orders_customers.csv')
Output
OrderID OrderDate TotalAmount CustomerName Email CustomerID 1 1 2022-05-01 100.0 John [email protected] 1 4 2022-05-04 75.0 John [email protected] 2 3 2022-05-03 200.0 Sarah [email protected] 3 2 2022-05-02 150.0 David [email protected] 4 5 2022-05-05 120.0 Emily [email protected]
Using Pandas' merge() function, we have merged two CSV files based on the "id" column in this example. The DataFrame that is the result of merging the two CSV files includes the "id" column in addition to the "name_x," "email_x," "name_y," and "email_y" columns.
Note that there are missing values in the combined DataFrame for the 'name_y' and 'email_y' segments, which relate to the lines where there were no matching qualities in the second CSV record. As demonstrated in the previous step, the Pandas fillna() and dropna() functions can be used to handle these missing values.
Conclusion
Merging two CSV files based on a specific column is a common data processing task, and it can be easily achieved using the Pandas library in Python. In this article, we learned how to merge two CSV files using the merge() function of Pandas. We have also discussed how to handle missing values and how to save the merged DataFrame to a new CSV file.