PySpark Entity Resolution
In [ ]: import os
import sys
from pyspark import SparkContext
from pyspark.sql import SparkSession
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
# create the SparkSession used by the cells below (app name and config are placeholders)
spark = SparkSession.builder.appName("entity-resolution").getOrCreate()
Read the data from the csv file specified by the path and convert it into a DataFrame
In [ ]: prev = spark.read.csv("data/linkage/donation/block_1/block_1.csv")
prev
In [ ]: prev.show(2)
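The next cells operate on a DataFrame named parsed, built by re-reading the CSV with header parsing, schema inference, and a null marker; a minimal sketch of that cell (the option values are assumptions based on the linkage dataset's layout):
In [ ]: parsed = spark.read\
    .option("header", "true")\
    .option("nullValue", "?")\
    .option("inferSchema", "true")\
    .csv("data/linkage/donation/block_1/block_1.csv")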
Display the schema of the DataFrame and print the first 5 rows
In [ ]: parsed.printSchema()
parsed.show(5)
In [ ]: parsed.count()
Cache the DataFrame in memory. This keeps the intermediate state of the DataFrame in the cluster nodes' memory so it can be reused by subsequent actions, avoiding repeated computation. The .cache() method on a DataFrame persists it at the default storage level, 'MEMORY_AND_DISK'.
In [ ]: parsed.cache()
Perform an aggregate operation and display the result:
group the rows of the DataFrame by the distinct values of the 'is_match' column,
count the number of rows in each group, and
order the groups in descending order of the count.
This shows how many record pairs in the DataFrame are labelled as matches and how many as non-matches.
In [ ]: parsed.createOrReplaceTempView("linkage")
spark.sql("""
SELECT is_match, COUNT(*) cnt
FROM linkage
GROUP BY is_match
ORDER BY cnt DESC
""").show()
The describe() function provides summary statistics for all numeric columns of the dataframe, such as count, mean, standard deviation, and minimum and maximum values. The result is stored in a DataFrame named summary, which has a column for each variable along with a 'summary' column naming the metric reported in each row.
In [ ]: summary = parsed.describe()
The summary statistics for the columns 'cmp_fname_c1' and 'cmp_fname_c2' are displayed by selecting those columns from the summary table.
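A sketch of that selection:
In [ ]: summary.select("summary", "cmp_fname_c1", "cmp_fname_c2").show()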
The parsed DataFrame is filtered into two tables: matches holds the rows whose 'is_match' column is true, and misses holds the rows whose 'is_match' column is false. Summary statistics are then computed for each table.
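A minimal sketch of those steps, assuming the match_summary and miss_summary names used later:
In [ ]: matches = parsed.where("is_match = true")
match_summary = matches.describe()
misses = parsed.where("is_match = false")
miss_summary = misses.describe()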
The summary table is converted to a pandas DataFrame using the toPandas() function, which makes it easier to reshape and manipulate.
In [ ]: summary_p = summary.toPandas()
Display the first 5 rows of the dataframe and its shape (no. of rows, no. of columns).
In [ ]: summary_p.head()
summary_p.shape
This sequence of operations transposes the summary table so that each row corresponds to one variable and each column to a calculated metric for that variable.
In [ ]: summary_p = summary_p.set_index('summary').transpose().reset_index()
summary_p = summary_p.rename(columns={'index':'field'})
summary_p = summary_p.rename_axis(None, axis=1)
summary_p.shape
The manipulated pandas dataframe is converted back to a Spark DataFrame and the schema
is displayed
In [ ]: summaryT = spark.createDataFrame(summary_p)
summaryT
summaryT.printSchema()
Every column listed in the DataFrame's .columns attribute, except 'field' (which inherently holds string data), is cast from string to double, replacing the original column. The withColumn function adds a column computed from some operation, in this case .cast() on an existing column; because the new column has the same name as the existing one, the existing column is replaced by the new one.
The result is visible in the DataFrame's new schema, where all the numeric columns are now of type double.
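The pivot_summary function referenced below bundles these reshaping and casting steps; a minimal sketch consistent with the description above (the original definition may differ in detail):
In [ ]: from pyspark.sql.types import DoubleType

def pivot_summary(desc):
    # transpose the describe() output in pandas: one row per field, one column per metric
    desc_p = desc.toPandas()
    desc_p = desc_p.set_index('summary').transpose().reset_index()
    desc_p = desc_p.rename(columns={'index': 'field'})
    desc_p = desc_p.rename_axis(None, axis=1)
    # convert back to a Spark DataFrame and cast every metric column from string to double
    descT = spark.createDataFrame(desc_p)
    for c in descT.columns:
        if c == 'field':
            continue
        descT = descT.withColumn(c, descT[c].cast(DoubleType()))
    return descT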
Both the miss_summary and match_summary dataframes are pivoted and reshaped using
the pivot_summary function defined above
In [ ]: match_summaryT = pivot_summary(match_summary)
miss_summaryT = pivot_summary(miss_summary)
Temporary views are created for both the pivoted match and miss summary tables.
A SQL query is then executed over an inner join of the two views on their 'field' columns; an inner join keeps only the rows where both tables have a matching field.
For every field other than 'id_1' and 'id_2', the query displays the sum of the two count values and the difference of the two means, ordered by descending difference.
This is used to select good features for determining whether two records refer to the same entity: good features usually have significantly different values for matches and non-matches (a large difference in means) and occur often enough in the data (a large count).
In [ ]: match_summaryT.createOrReplaceTempView("match_desc")
miss_summaryT.createOrReplaceTempView("miss_desc")
spark.sql("""
SELECT a.field, a.count + b.count total, a.mean - b.mean delta
FROM match_desc a INNER JOIN miss_desc b ON a.field = b.field
WHERE a.field NOT IN ("id_1", "id_2")
ORDER BY delta DESC, total DESC
""")
Using the results of the previous query, good features are selected and a string expression is created that sums all of the good features.
A new table, scored, is created from the parsed table using only the columns corresponding to the good features. A new column, 'score', is added by summing the values of these columns in each row. Finally, only the 'score' and 'is_match' columns are kept in the new table.
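A minimal sketch of those steps; the specific feature list below is an assumption standing in for whatever fields the query results suggest:
In [ ]: from pyspark.sql.functions import expr

# assumed good features; substitute the fields chosen from the query results above
good_features = ["cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm"]
sum_expression = " + ".join(good_features)

scored = parsed.fillna(0, subset=good_features)\
    .withColumn('score', expr(sum_expression))\
    .select('score', 'is_match')
scored.show()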
Finally, a suitable threshold must be chosen so that scores above it are classified as matches and scores below it as non-matches; the threshold should be the one that classifies the rows with maximum accuracy. To support this, a function is defined that creates a cross tabulation (essentially a confusion matrix): it counts how many rows score above or below a given threshold and, within each of those groups, how many rows were or were not actual matches. This makes it possible to compare candidate thresholds and choose a good value.
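The crossTabs helper used in the next cell can be sketched as follows (an assumed implementation that pivots on the is_match label; the original definition may differ):
In [ ]: from pyspark.sql import DataFrame

def crossTabs(scored: DataFrame, t: float) -> DataFrame:
    # bucket each row by whether its score clears the threshold,
    # then count matches and non-matches within each bucket
    return scored.selectExpr(f"score >= {t} as above", "is_match")\
        .groupBy("above")\
        .pivot("is_match", ("true", "false"))\
        .count()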
In [ ]: crossTabs(scored, 4.0).show()
crossTabs(scored, 2.0).show()
Cross tabulations are calculated for thresholds of 4.0 and 2.0. The high threshold filters out almost all non-matches while still capturing nearly all of the matches; the lower threshold captures every match, but at the cost of a much higher count of false positives.