
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
Fuzzy Matching on Pandas DataFrame Column using Python
We will match words in the first DataFrame with words in the second DataFrame. For closest matches, we will use threshold. We took the value of threshold as 70 i.e., match occurs when the strings at more than 70% close to each other.
Let us first create Dictionaries and convert to pandas dataframe −
# dictionaries d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]} d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]} # convert dictionaries to pandas dataframes df1 = pd.DataFrame(d1) df2 = pd.DataFrame(d2)
Now, convert dataframe column to list of elements for fuzzy matching −
myList1 = df1['Car'].tolist() myList2 = df2['Car'].tolist()
Example
Following is the complete code −
import pandas as pd from fuzzywuzzy import fuzz from fuzzywuzzy import process # dictionaries d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]} d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]} # convert dictionaries to pandas dataframes df1 = pd.DataFrame(d1) df2 = pd.DataFrame(d2) # printing the pandas dataframes print("Dataframe 1 =
",df1) print("Dataframe 2 =
",df2) # empty lists for storing the matches later match1 = [] match2 = [] k = [] # converting dataframe column to list of elements for fuzzy matching myList1 = df1['Car'].tolist() myList2 = df2['Car'].tolist() threshold = 70 # iterating myList1 to extract closest match from myList2 for i in myList1: match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio)) df1['matches'] = match1 for j in df1['matches']: if j[1] >= threshold: k.append(j[0]) match2.append(",".join(k)) k = [] # saving matches to df1 df1['matches'] = match2 print("
Matches...") print(df1)
Output
This will produce the following output −
Dataframe 1 = Car 0 BMW 1 Audi 2 Lexus 3 Mercedes 4 Rolls Dataframe 2 = Car 0 BM 1 Audi 2 Le 3 Mercedes 4 Rolls Royce Matches... Car matches 0 BM BM 1 Audi Audi 2 Lexus 3 Mercedes MERCEDES 4 Rolls
Advertisements