03 Hash
Hashing
1 / 84
Reading
2 / 84
Libraries
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE)
# load libraries
library(rvest) # web scraping
library(stringr) # string manipulation
library(dplyr) # data manipulation
library(tidyr) # tidy data
library(purrr) # functional programming
library(scales) # formatting for rmd output
library(ggplot2) # plots
library(numbers) # divisors function
library(textreuse) # detecting text reuse and
# document similarity
3 / 84
Background
4 / 84
Piping
5 / 84
Examples of Piping
# Example One
1:50 %>% mean
## [1] 25.5
mean(1:50)
## [1] 25.5
6 / 84
Examples of Piping
# Example Two
years <- factor(2008:2012)
# nesting
as.numeric(as.character(years))
# piping
years %>% as.character %>% as.numeric
7 / 84
The dplyr package
A very useful package that I recommend you explore and learn
about is dplyr. There is a very good blog post about its
functions: https://www.r-bloggers.com/useful-dplyr-functions-wexamples/
8 / 84
The dplyr package
9 / 84
The dplyr package
There are many useful cheat sheets for R and R Markdown that will
help you with this module and with optimizing your R code in general.
You can find them here:
https://www.rstudio.com/resources/cheatsheets/
We won’t walk through an example, as the blog post above does a
very nice job of going through functions within dplyr, such as
filter, mutate, etc. Please go through these on your own and
become familiar with them.
10 / 84
Information Retrieval
11 / 84
Finding Similar Items
12 / 84
Notions of Similarity
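The Jaccard similarity of sets S and T is the size of their
intersection divided by the size of their union,
$|S \cap T| / |S \cup T|$.
We will refer to the Jaccard similarity of S and T by SIM(S, T).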
13 / 84
Jaccard Similarity
Figure 1: Two sets S and T with Jaccard similarity 2/5. The two sets
share two elements in common, and there are five elements in total.
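As a quick check of the arithmetic in Figure 1, here is a minimal base-R sketch. The function name `jaccard` and the element labels are illustrative assumptions rather than anything from the slides; the sets are simply chosen to share two elements out of five in total.

```r
# Jaccard similarity: size of the intersection over size of the union
jaccard <- function(s, t) {
  length(intersect(s, t)) / length(union(s, t))
}

# Hypothetical sets: two shared elements ("b", "c"), five distinct elements overall
S <- c("a", "b", "c")
T_set <- c("b", "c", "d", "e")

sim <- jaccard(S, T_set)
print(sim)  # 0.4, i.e. 2/5
```

Note that this sketch returns NaN when both sets are empty, since the union then has size zero.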
14 / 84
Shingling of Documents
15 / 84
k-Shingles
16 / 84
k-Shingles
17 / 84
Choosing the shingle size
18 / 84
Choosing the shingle size
19 / 84
Choosing the shingle size
20 / 84
Jaccard similarity of Beatles songs
Now that we have shingled records (sets), we can look at the
Jaccard similarity for each pair of songs by treating the shingles of
song i as a set $S_i$ and computing
$$\frac{|S_i \cap S_j|}{|S_i \cup S_j|} \quad \text{for } i \neq j.$$
21 / 84
Data
We will scrape lyrics from http://www.metrolyrics.com. (Please
note that I do not expect you to be able to scrape the data, but the
code is on the next two slides in case you are interested in it.)
# get beatles lyrics
links <- read_html("http://www.metrolyrics.com/beatles-lyrics.html") %>% # lyrics site
html_nodes("td a") # get all links to all songs
words %>%
paste(collapse = ' ') %>% # paste all paragraphs together as one string
str_replace_all("[\r\n]" , ". ") %>% # replace newline chars with ". "
return()
}
22 / 84
Data (continued)
24 / 84
Shingling Eleanor Rigby
# get k = 3 shingles for "Eleanor Rigby"
# this song is the best
# filter keeps the rows where the condition is true
best_song <- songs %>% filter(name == "Eleanor Rigby")
# shingle the lyrics
shingles <- tokenize_ngrams(best_song$lyrics, n = 3)
# inspect results
head(shingles) %>%
kable() # pretty table
ah look at
look at all
at all the
all the lonely
the lonely people
lonely people ah
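To make the shingling step concrete, here is a minimal base-R sketch of word-level k-shingling. The helper `word_shingles` and the example sentence are hypothetical, and the tokenization (lowercasing, stripping punctuation) is an assumption that may differ in detail from what `tokenize_ngrams` does.

```r
# Build the set of k-word shingles (word n-grams) from a string
word_shingles <- function(text, k) {
  # lowercase, strip punctuation, split on whitespace
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  words <- words[words != ""]
  if (length(words) < k) return(character(0))
  starts <- seq_len(length(words) - k + 1)
  # paste each window of k consecutive words; unique() turns the result into a set
  unique(vapply(starts,
                function(i) paste(words[i:(i + k - 1)], collapse = " "),
                character(1)))
}

example_shingles <- word_shingles("The quick brown fox jumps over the lazy dog", k = 3)
print(example_shingles)
```

A text of w words yields at most w - k + 1 shingles, so larger k gives fewer but more specific shingles.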
26 / 84
Shingling Eleanor Rigby
27 / 84
Shingling all Beatles songs
28 / 84
Shingling all Beatles songs
29 / 84
Jaccard similarity of Beatles songs
# create all pairs to compare, then get the Jaccard similarity of each
# start by first getting all possible combinations
song_sim <- expand.grid(song1 = seq_len(nrow(songs)), song2 = seq_len(nrow(songs))) %>%
filter(song1 < song2) %>% # don't need to compare the same things twice
group_by(song1, song2) %>% # converts to a grouped table
mutate(jaccard_sim = jaccard_similarity(songs$shingles[song1][[1]],
songs$shingles[song2][[1]])) %>%
ungroup() %>% # Undoes the grouping
mutate(song1 = songs$name[song1],
song2 = songs$name[song2]) # store the names, not "id"
# inspect results
summary(song_sim)
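The same all-pairs idea can be sketched on a toy example in base R. The three "documents" below are hypothetical shingle sets, and `combn()` is used here in place of `expand.grid()` plus a filter; it enumerates each unordered pair exactly once.

```r
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical toy documents, each already reduced to a set of shingles
docs <- list(
  doc1 = c("a b", "b c", "c d"),
  doc2 = c("a b", "b c", "c e"),
  doc3 = c("x y", "y z")
)

# One row per unordered pair of document names
pairs <- t(combn(names(docs), 2))
pair_sim <- data.frame(
  first = pairs[, 1],
  second = pairs[, 2],
  jaccard_sim = apply(pairs, 1, function(p) jaccard(docs[[p[1]]], docs[[p[2]]]))
)
print(pair_sim)
```

With n documents there are n(n - 1)/2 pairs, so the cost of this approach grows quadratically in the number of documents.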
30 / 84
Jaccard similarity of Beatles songs
[Figure: heatmap of jaccard_sim for every pair of songs, with song1 on the x-axis and song2 on the y-axis; the colour legend is marked at 0.00, 0.25, 0.50 and 0.75.]
It looks like there are a couple of song pairings with very high Jaccard similarity. Let’s filter to find out which ones.
31 / 84
Jaccard similarity of Beatles songs
Two of the three matches seem to be from duplicate songs in the data, although it’s interesting that their Jaccard
similarity coefficients are not equal to 1. (This is not surprising. Why?) Perhaps these are different versions of the
song.
32 / 84
Jaccard similarity of Beatles songs
# inspect lyrics
print(songs$lyrics[songs$name == "In My Life"])
## [1] "There are places I remember. All my life though some have changed. Some forever not for better. So
## [1] "There are places I remember all my life. Though some have changed. Some forever, not for better. S
By looking at the lyrics of the seemingly different pair of songs, we can see that these are actually the same song as well,
but with slightly different transcriptions of the lyrics. Thus, we have identified three songs that have duplicates in
our database by using the Jaccard similarity. What about non-duplicates? Of these, which is the most similar pair
of songs?
33 / 84
Jaccard dissimilarity of Beatles songs
These appear to not be the same songs, and hence, we don’t want to look at these. Typically our goal is to look at
items that are most similar and discard items that aren’t similar.
34 / 84
Jaccard similarity of Beatles songs
35 / 84
Solution to Exercise (Go over on your own)
36 / 84
Solution to Exercise (Go over on your own)
37 / 84
Solution to Exercise (Go over on your own)
a b score
38 / 84
Jaccard similarity of Beatles songs
39 / 84
Hashing
40 / 84
Hashing shingles
41 / 84
Similarity Preserving Summaries of Sets
Sets of shingles are large. Even if we hash them to four bytes each,
the space needed to store a set is still roughly four times the space
taken up by the document.
If we have millions of documents, it may not be possible to store all
the shingle-sets in memory. (There is another more serious problem,
which we will address later in the module.)
Here, we replace large sets by smaller representations, called
signatures. The important property we need for signatures is that
we can compare the signatures of two sets and estimate the Jaccard
similarity of the underlying sets from the signatures alone.
It is not possible for the signatures to give the exact similarity of the
sets they represent, but the estimates they provide are close.
42 / 84
Characteristic Matrix
Before describing how to construct small signatures from large sets,
we visualize a collection of sets as a characteristic matrix.
The columns correspond to sets and the rows correspond to the
universal set from which the elements are drawn.
There is a 1 in row r, column c if the element for row r is a member
of the set for column c. Otherwise, the value for (r , c) is 0.
Below is an example of a characteristic matrix, with five shingles
(rows) and four records (columns).
library("pander")
element <- c("a","b","c","d","e")
S1 <- c(0,1,1,0,1)
S2 <- c(0,0,1,0,0)
S3 <- c(1,0,0,0,0)
S4 <- c(1,1,0,1,1)
my.data <- cbind(element,S1,S2,S3,S4)
43 / 84
Characteristic Matrix
pandoc.table(my.data,style="rmarkdown")
##
##
## | element | S1 | S2 | S3 | S4 |
## |:-------:|:--:|:--:|:--:|:--:|
## | a | 0 | 0 | 1 | 1 |
## | b | 1 | 0 | 0 | 1 |
## | c | 1 | 1 | 0 | 0 |
## | d | 0 | 0 | 0 | 1 |
## | e | 1 | 0 | 0 | 1 |
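The Jaccard similarity of two sets can be read directly off the columns of the characteristic matrix. The sketch below rebuilds the matrix above as a numeric matrix; the helper name `col_jaccard` is an assumption introduced for illustration.

```r
# Characteristic matrix from the slide: rows are elements, columns are sets
char_mat <- cbind(S1 = c(0, 1, 1, 0, 1),
                  S2 = c(0, 0, 1, 0, 0),
                  S3 = c(1, 0, 0, 0, 0),
                  S4 = c(1, 1, 0, 1, 1))
rownames(char_mat) <- c("a", "b", "c", "d", "e")

# Rows where both columns are 1, divided by rows where at least one is 1
col_jaccard <- function(m, i, j) {
  sum(m[, i] == 1 & m[, j] == 1) / sum(m[, i] == 1 | m[, j] == 1)
}

sim_14 <- col_jaccard(char_mat, "S1", "S4")  # shared: b, e; union: a, b, c, d, e
sim_12 <- col_jaccard(char_mat, "S1", "S2")  # shared: c; union: b, c, e
print(c(sim_14, sim_12))
```

Rows in which both columns are 0 play no role, which is why sparse sets can be compared without looking at the whole universal set.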
44 / 84
Minhashing
45 / 84
Minhashing
Now, we permute the rows of the characteristic matrix to form a
permuted matrix.
The permuted matrix is simply a reordering of the original
characteristic matrix, with the rows swapped in some arrangement.
Figure 2 shows the characteristic matrix converted to a permuted
matrix by a given permutation. We repeat the permutation step for
several iterations to obtain multiple permuted matrices.
46 / 84
The Signature Matrix
47 / 84
Signature Matrix
library("pander")
signature <- c(2,4,3,1)
pandoc.table(signature,style="rmarkdown")
##
## | | | | |
## |:-:|:-:|:-:|:-:|
## | 2 | 4 | 3 | 1 |
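As a sketch of how one row of a signature matrix arises, the code below applies a single row permutation to the characteristic matrix from the earlier slide and records, for each set, the position of the first row containing a 1. The permutation is an assumption: it was picked by hand because it happens to reproduce the signature above, and it need not be the one shown in the figure.

```r
# Characteristic matrix from the earlier slide
char_mat <- cbind(S1 = c(0, 1, 1, 0, 1),
                  S2 = c(0, 0, 1, 0, 0),
                  S3 = c(1, 0, 0, 0, 0),
                  S4 = c(1, 1, 0, 1, 1))
rownames(char_mat) <- c("a", "b", "c", "d", "e")

# Hypothetical permutation of the rows
perm <- c("d", "b", "a", "c", "e")
permuted <- char_mat[perm, ]

# Minhash of a column: position, in permuted order, of its first 1
minhash_one <- unname(apply(permuted, 2, function(col) which(col == 1)[1]))
names(minhash_one) <- colnames(char_mat)
print(minhash_one)  # S1 = 2, S2 = 4, S3 = 3, S4 = 1
```

Repeating this with further permutations adds one row per permutation to the signature matrix.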
48 / 84
Minhashing and Jaccard Similarity
49 / 84
Minhashing and Jaccard Similarity
The relationship between the random permutations of the
characteristic matrix and the Jaccard Similarity is:
$$\Pr\{\min[h(A)] = \min[h(B)]\} = \frac{|A \cap B|}{|A \cup B|}$$
The equation means that the probability that the minimum value
of a given hash function h is the same for sets A and B equals the
Jaccard similarity of A and B. The fraction of hash functions on
which two sets agree therefore estimates their Jaccard similarity,
and the estimate improves as the number of hash functions increases.
We use this relationship to calculate the similarity between any two
records.
We look down each column of the signature matrix and compare it to
any other column: the number of rows in which the two columns agree,
divided by the total number of rows, estimates the Jaccard similarity.
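The relationship can be checked empirically with a small simulation. This is a minimal sketch, assuming the toy characteristic matrix from the earlier slide, an arbitrary seed, and 1000 random permutations; none of these settings come from the slides.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

char_mat <- cbind(S1 = c(0, 1, 1, 0, 1),
                  S2 = c(0, 0, 1, 0, 0),
                  S3 = c(1, 0, 0, 0, 0),
                  S4 = c(1, 1, 0, 1, 1))

n_perm <- 1000  # assumed number of permutations; more gives a tighter estimate

# Each row of sig holds the minhash of every set under one random row permutation
sig <- t(replicate(n_perm, {
  permuted <- char_mat[sample(nrow(char_mat)), , drop = FALSE]
  unname(apply(permuted, 2, function(col) which(col == 1)[1]))
}))
colnames(sig) <- colnames(char_mat)

# Fraction of permutations on which S1 and S4 share the same minhash
estimate <- mean(sig[, "S1"] == sig[, "S4"])

# Exact Jaccard similarity of S1 and S4 for comparison (2/5)
truth <- sum(char_mat[, "S1"] & char_mat[, "S4"]) /
  sum(char_mat[, "S1"] | char_mat[, "S4"])

print(c(estimate = estimate, truth = truth))
```

The estimate is a sample proportion, so its standard error shrinks like one over the square root of the number of permutations.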
50 / 84
Back to the Beatles
Recall that the textreuse::TextReuseCorpus() function hashes our
shingled lyrics automatically using a function that hashes a string to
an integer. We can look at these hashes for “Eleanor Rigby”.
51 / 84
Back to the Beatles
52 / 84
Back to the Beatles
Now instead of storing the strings (shingles), we can just store the
hashed values. Since these are integers, they will take less space
which will be useful if we have large documents. Instead of
performing the pairwise Jaccard similarities on the strings, we can
perform them on the hashes.
# compute jaccard similarity on hashes instead of shingled lyrics
# add this column to our song data.frame
song_sim %>%
group_by(song1, song2) %>%
mutate(jaccard_sim_hash =
jaccard_similarity(songs$hash[songs$name == song1][[1]],
songs$hash[songs$name == song2][[1]])) -> song_sim
## [1] 0
53 / 84
Back to the Beatles
54 / 84
Characteristic Matrix for the Beatles
# return if an item is in a list
item_in_list <- function(item, list) {
as.integer(item %in% list)
}
# inspect results
char_matrix[1:4, 1:4] %>%
kable()
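| item | A Day In The Life | A Hard Day’s Night | Abbey Road Medley |
|:-----------:|:-----------------:|:------------------:|:-----------------:|
| -2147299362 | 0 | 0 | 0 |
| -2147099044 | 0 | 0 | 0 |
| -2146305897 | 0 | 0 | 0 |
| -2145853136 | 0 | 0 | 0 |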
55 / 84
Characteristic Matrix for the Beatles
56 / 84
Signature Matrix for the Beatles
57 / 84
Signature Matrix for the Beatles
# set seed for reproducibility
set.seed(09142017)
# inspect results
sig_matrix[1, 1:4] %>% kable()
# inspect results
sig_matrix[1:4, 1:4] %>% kable()
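| A Day In The Life | A Hard Day’s Night | Abbey Road Medley | Across The Universe |
|:-----------------:|:------------------:|:-----------------:|:-------------------:|
| 28 | 87 | 6 | 63 |
| 70 | 30 | 4 | 98 |
| 3 | 181 | 21 | 20 |
| 42 | 23 | 5 | 25 |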
59 / 84
Signature Matrix, Jaccard Similarity, and the Beatles
$$\Pr\{\min[h(A)] = \min[h(B)]\} = \frac{|A \cap B|}{|A \cup B|}$$
60 / 84
Signature Matrix, Jaccard Similarity, and the Beatles
61 / 84
Evaluating Performance
62 / 84
Evaluating Performance
63 / 84
Evaluating Performance
Once we have a handmatched data set, we can calculate the recall
and the precision.
How do we calculate the recall and precision?
We need to learn some new terminology first and then we’re all set!
64 / 84
Classifications
1. Pairs of data can be linked in both the handmatched training
data (which we refer to as ‘truth’) and under the estimated
linked data. We refer to this situation as true positives (TP).
2. Pairs of data can be linked under the truth but not linked
under the estimate, which are called false negatives (FN).
3. Pairs of data can be not linked under the truth but linked
under the estimate, which are called false positives (FP).
4. Pairs of data can be not linked under the truth and also not
linked under the estimate, which we refer to as true negatives
(TN).
Then the true number of links is TP + FN, while the estimated
number of links is TP + FP. In this module, we define the false
negative rate and false positive rate as
FNR = FN/(TP+FN)
FPR = FP/(TP+FP)
(Elsewhere, the quantity FP/(TP+FP) is usually called the false
discovery rate.)
65 / 84
Classifications
FNR = FN/(TP+FN)
FPR = FP/(TP+FP)
Then
recall = 1 − FNR.
precision = 1 − FPR.
How would you look at the sensitivity of a method for the recall
and the precision?
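A small numerical sketch of these definitions follows. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical counts from comparing estimated links against the handmatched truth
tp <- 8  # linked under both truth and estimate
fp <- 2  # linked under the estimate only
fn <- 4  # linked under the truth only

fnr <- fn / (tp + fn)  # false negative rate as defined on the slide
fpr <- fp / (tp + fp)  # false positive rate as defined on the slide

recall <- 1 - fnr     # equivalently tp / (tp + fn)
precision <- 1 - fpr  # equivalently tp / (tp + fp)

print(c(recall = recall, precision = precision))
```

Here recall is 8/12 and precision is 8/10. True negatives do not enter either quantity, which matters because almost all pairs of records are non-links.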
66 / 84
Signature Matrix, Jaccard Similarity, and the Beatles
67 / 84
General idea of LSH
68 / 84
Candidate Pairs
Any pair that is hashed to the same bucket for any of the hashings is
called a candidate pair.
We only check candidate pairs for similarity.
The hope is that most of the dissimilar pairs will never hash to the
same bucket (and never be checked).
69 / 84
False Negative and False Positives
Dissimilar pairs that are hashed to the same bucket are called false
positives.
We hope that most of the truly similar pairs will hash to the same
bucket under at least one of the hash functions.
Those that don’t are called false negatives.
70 / 84
How to choose the hashings
71 / 84
Example
The second and fourth columns each have a column vector [0, 2, 1]
in the first band, so they will be mapped to the same bucket in the
hashing for the first band.
Regardless of the other columns in the other bands, this pair of
columns will be a candidate pair.
72 / 84
Example (continued)
It’s possible that other columns, such as the first two, will hash to the
same bucket according to the hashing in the first band.
But their column vectors are different, [1, 3, 0] and [0, 2, 1], and
there are many buckets for each hashing, so the chance of an accidental
collision here is expected to be small.
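The example can be sketched in base R for a single band. Columns 1, 2 and 4 use the vectors quoted above, while column 3 and the document names are made up for illustration. Pasting each column vector into a string serves as the bucket key, so in this sketch only identical vectors share a bucket and accidental collisions cannot occur.

```r
# One band (three rows) of a hypothetical signature matrix with four documents
band <- cbind(c(1, 3, 0), c(0, 2, 1), c(2, 0, 3), c(0, 2, 1))
colnames(band) <- paste0("doc", 1:4)

# Bucket key for each document: its column vector pasted into one string
keys <- apply(band, 2, paste, collapse = "-")

# Group documents by key and keep only buckets holding more than one document
buckets <- split(names(keys), keys)
shared <- buckets[lengths(buckets) > 1]

# Every pair of documents within a shared bucket is a candidate pair
candidates <- do.call(rbind, unname(lapply(shared, function(d) t(combn(d, 2)))))
print(candidates)
```

Running the same steps on every band and taking the union of the resulting pairs gives the full candidate set.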
73 / 84
Analysis of the banding technique
You can learn more about the analysis of the banding technique on
your own in Mining of Massive Datasets, Ch. 3, Section 3.4.2.
74 / 84
Combining the Techniques
We can now give an approach to find the set of candidate pairs for similar documents and then find similar
documents among these.
1. Pick a value k and construct from each document the set of k-shingles. (Optionally, hash the k-shingles to
shorter bucket numbers).
2. Sort the document-shingle pairs to order them by shingle.
3. Pick a length n for the minhash signatures. Feed the sorted list to the algorithm to compute the minhash
signatures for all the documents.
4. Choose a threshold t that defines how similar documents have to be in order to be regarded as a similar pair.
Pick a number of bands b and a number of rows r such that br = n, and then the threshold is approximately $(1/b)^{1/r}$.
5. Construct candidate pairs by applying LSH.
6. Examine each candidate pair’s signatures and determine whether the fraction of components in which they agree is
at least t.
7. Optionally, if the signatures are sufficiently similar, go to the documents themselves and check that they
are truly similar.
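To illustrate step 4, the sketch below computes the approximate threshold and the probability that a pair with Jaccard similarity s becomes a candidate, using the standard banding formula 1 - (1 - s^r)^b. The signature length n = 360 is an assumption, chosen because it is consistent with the bin sizes tabulated for the Beatles example; the function name `candidate_prob` is likewise illustrative.

```r
n <- 360    # assumed signature length
b <- 90     # number of bands
r <- n / b  # rows per band, so that b * r = n

# Approximate similarity threshold at which pairs start to become candidates
threshold <- (1 / b)^(1 / r)

# Probability that a pair with similarity s agrees on all r rows of at least one band
candidate_prob <- function(s, b, r) 1 - (1 - s^r)^b

p_low <- candidate_prob(0.25, b, r)
p_high <- candidate_prob(0.75, b, r)
print(c(threshold = threshold, p_low = p_low, p_high = p_high))
```

Increasing b (and so decreasing r) lowers the threshold, which catches more true matches at the cost of more false positives.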
75 / 84
LSH for the Beatles
76 / 84
LSH for the Beatles
We want to hash items several times such that similar items are
more likely to be hashed into the same bucket. Any pair that is
hashed to the same bucket for any hashing is called a candidate pair
and we only check candidate pairs for similarity.
78 / 84
LSH for the Beatles
# look at probability of binned together for various bin sizes and similarity values
tibble(s = c(.25, .75), h = m) %>% # look at two different similarity values
mutate(b = (map(h, divisors))) %>% # find possible bin sizes for m
unnest(b) %>% # expand the list column b into one row per bin size
group_by(h, b, s) %>%
mutate(prob = lsh_probability(h, b, s)) %>%
ungroup() -> bin_probs # get probabilities
# plot as curves
bin_probs %>%
mutate(s = factor(s)) %>%
ggplot() +
geom_line(aes(x = prob, y = b, colour = s, group = s)) +
geom_point(aes(x = prob, y = b, colour = factor(s)))
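[Figure: probability of being binned together (prob, x-axis, 0.00 to 1.00) against the number of bins b (y-axis, 0 to 300), with one curve for each similarity value s = 0.25 and s = 0.75.]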
| b | s = 0.25 | s = 0.75 |
|:---:|:---------:|:---------:|
| 60 | 0.0145434 | 0.9999922 |
| 72 | 0.0679295 | 1.0000000 |
| 90 | 0.2968963 | 1.0000000 |
| 120 | 0.8488984 | 1.0000000 |
| 180 | 0.9999910 | 1.0000000 |
For b = 90, a pair of records with Jaccard similarity of .25 will have a 29.7% chance of being matched as
candidates and a pair of records with Jaccard similarity of .75 will have a 100% chance of being matched as
candidates. This seems adequate for our application, so I will continue with b = 90.
Why would we not consider b = 120? What happens if you look at b = 72 and b = 60? (Do this on your own.)
80 / 84
LSH for the Beatles
Now that we know how to bin our signature matrix, we can go ahead
and get the candidates.
# bin the signature matrix
b <- 90
sig_matrix %>%
as_tibble() %>%
mutate(bin = rep(1:b, each = m/b)) %>% # add bins
gather(song, hash, -bin) %>% # tall data instead of wide
group_by(bin, song) %>% # within each bin, get the min-hash values for each song
summarise(hash = paste0(hash, collapse = "-")) %>%
ungroup() -> binned_sig
# inspect results
binned_sig %>%
head() %>%
kable()
81 / 84
LSH for the Beatles
binned_sig %>%
group_by(bin, hash) %>%
filter(n() > 1) %>% # find only those hashes with more than one song
summarise(song = paste0(song, collapse = ";;")) %>%
separate(song, into = c("song1", "song2"), sep = ";;") %>%
group_by(song1, song2) %>%
summarise() %>% # get the unique song pairs
ungroup() -> candidates
# inspect candidates
candidates %>%
kable()
| song1 | song2 |
|:---------------:|:----------------------------------------:|
| Golden Slumbers | Golden Slumbers/Carry That Wieght/Ending |
| In My Life | There are places I remember |
| When I’m 64 | When I’m Sixty-four |
82 / 84
LSH for the Beatles
Notice that LSH has identified the same pairs of documents as
potential matches that we found with pairwise comparisons, but did
so without having to calculate all of the pairwise comparisons. We
can now compute the Jaccard similarity scores for only these
candidate pairs of songs instead of all possible pairs.
# calculate the Jaccard similarity only for the candidate pairs of songs
candidates %>%
group_by(song1, song2) %>%
mutate(jaccard_sim = jaccard_similarity(songs$hash[songs$name == song1][[1]],
songs$hash[songs$name == song2][[1]])) %>%
arrange(desc(jaccard_sim)) %>%
kable()
83 / 84
LSH for the Beatles
There is a much easier way to do this whole process using the
built-in functions in textreuse, namely
textreuse::minhash_generator and textreuse::lsh.
# create the minhash function
minhash <- minhash_generator(n = m, seed = 09142017)
a b score
84 / 84