Chapter Three
Term Weighting and Similarity Measures
Term-Document Matrix
• Documents and queries are represented as vectors or “bags of
  words” (BOW) in a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term
  in a document.
  – The weight wij of term i in document j may be binary or
    non-binary; wij = 0 means the term does not occur in the
    document. Binary weights are defined as:
    wij = 1 if freqij > 0, and 0 otherwise

        T1   T2   …   Tt
  D1   w11  w21   …  wt1
  D2   w12  w22   …  wt2
  :     :    :        :
  Dn   w1n  w2n   …  wtn
Binary Weights
• Only the presence (1) or absence (0) of a term is included in
  the vector.
• The binary formula gives every word that appears in a document
  equal relevance.
• It can be useful when frequency is not important.
• Binary weights formula:
  wij = 1 if freqij > 0; 0 if freqij = 0

  docs  t1  t2  t3
  D1    1   0   1
  D2    1   0   0
  D3    0   1   1
  D4    1   0   0
  D5    1   1   1
  D6    1   1   0
  D7    0   1   0
  D8    0   1   0
  D9    0   0   1
  D10   0   1   1
  D11   1   0   1
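As a quick illustration, here is a minimal Python sketch of the binary weighting formula; the example counts are hypothetical:

```python
# Turning raw term counts into binary weights:
# w_ij = 1 if freq_ij > 0, and 0 otherwise.
freq = {
    "D1": {"t1": 2, "t2": 0, "t3": 3},   # hypothetical raw counts
    "D2": {"t1": 1, "t2": 0, "t3": 0},
}

binary = {
    doc: {term: 1 if count > 0 else 0 for term, count in counts.items()}
    for doc, counts in freq.items()
}
print(binary["D1"])   # {'t1': 1, 't2': 0, 't3': 1}
```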
Why use term weighting?
• Binary weights are too limiting.
  – Terms are either present or absent.
  – They do not allow ordering documents by their level of
    relevance for a given query.
• Non-binary weights allow us to model partial matching.
  – Partial matching allows retrieval of documents that
    approximate the query.
• Term weighting supports best-match retrieval, which improves
  the quality of the answer set.
  – Term weighting enables ranking of retrieved documents, so
    that the best-matching documents are ordered at the top as
    they are more relevant than the others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs
  in a document.
  fij = frequency of term i in document j
• The more times a term t occurs in a document d, the more
  likely it is that t is relevant to the document, i.e. more
  indicative of its topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear more frequently.
• We may therefore want to normalize term frequency (tf).

  docs  t1  t2  t3
  D1    2   0   3
  D2    1   0   0
  D3    0   4   7
  D4    3   0   0
  D5    1   6   3
  D6    3   5   0
  D7    0   8   0
  D8    0  10   0
  D9    0   0   1
  D10   0   3   5
  D11   4   0   1
Document Normalization
• Long documents have an unfair advantage:
  – They use a lot of terms, so they get more matches than short
    documents.
  – And they use the same words repeatedly, so they have much
    higher term frequencies.
• Normalization seeks to remove these effects:
  – Related somehow to maximum term frequency.
  – But also sensitive to the number of terms.
• Two common normalizations (see the sketch below):
  – Max-frequency normalization:
    tfij = fij / max_k(fik)
  – Min-max normalization:
    tfij = (fij − min_l(fil)) / (max_k(fik) − min_l(fil))
• If we don’t normalize, short documents may not be recognized
  as relevant.
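A minimal Python sketch of the two normalizations above for a single document; the raw counts are hypothetical:

```python
# Normalizing raw term counts f_ij for one document. Both formulas assume
# not all counts are equal (otherwise the min-max denominator is zero).
freqs = {"t1": 2, "t2": 0, "t3": 3}

f_max = max(freqs.values())
f_min = min(freqs.values())

tf_max = {t: f / f_max for t, f in freqs.items()}                        # f_ij / max_k(f_ik)
tf_minmax = {t: (f - f_min) / (f_max - f_min) for t, f in freqs.items()}

print(tf_max)     # {'t1': 0.666..., 't2': 0.0, 't3': 1.0}
print(tf_minmax)  # identical here, since the minimum count is 0
```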
Problems with term frequency
• Need a mechanism for attenuating the effect of terms that
occur too often in the collection to be meaningful for
relevance/meaning determination
• Scale down the weight of terms with high collection
frequency
– Reduce the tf weight of a term by a factor that grows
with the collection frequency
• More commonly used for this purpose is document frequency:
  – how many documents in the collection contain the term.
• Note that collection frequency (the total number of occurrences
  of a term in the collection) and document frequency can behave
  quite differently.
Document Frequency
• It is defined to be the number of documents in the
collection that contain a term
DF = document frequency
– Count the frequency considering the whole collection of
documents.
  – The less frequently a term appears in the whole collection,
    the more discriminating it is.
  dfi = document frequency of term i
      = number of documents containing term i
Inverse Document Frequency (IDF)
• IDF measures rarity of the term in collection. The IDF is a
measure of the general importance of the term
– Inverts the document frequency.
• It diminishes the weight of terms that occur very frequently
in the collection and increases the weight of terms that
occur rarely.
– Gives full weight to terms that occur in one document
only.
– Gives zero weight to terms that occur in all documents.
– Terms that appear in many different documents are less indicative
of overall topic.
  idfi = inverse document frequency of term i, where N is the
  total number of documents:
  idfi = log2(N / dfi)
Inverse Document Frequency
• Example: given a collection of 1,000 documents and the document
  frequencies below, compute the IDF of each word.
Word N DF IDF
the 1000 1000 0
some 1000 100 3.322
car 1000 10 6.644
merge 1000 1 9.966
• IDF provides high values for rare words and low values
for common words.
• IDF is an indication of a term’s discrimination power.
  – The log is used to dampen the effect relative to tf.
  – Note the difference between document frequency and collection
    (corpus) frequency.
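A short Python check that reproduces the IDF table above:

```python
import math

# idf_i = log2(N / df_i) with N = 1000 documents.
N = 1000
for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
    print(f"{word:6s} df={df:4d} idf={math.log2(N / df):.3f}")
# the 0.000, some 3.322, car 6.644, merge 9.966
```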
TF*IDF Weighting
• A good weight must take into account two effects:
– Quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
– Quantification of inter-documents separation (dissimilarity)
• idf factor, the inverse document frequency
• As a result, the most widely used term-weighting scheme in IR
  systems is the tf*idf technique:
  wij = tfij * idfi = tfij * log2(N / dfi)
• A term occurring frequently in the document but rarely in the
  rest of the collection is given high weight.
  – The tf*idf value for a term is always greater than or equal
    to zero.
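A minimal sketch of the weighting formula above; the example values are hypothetical:

```python
import math

def tfidf(tf_ij: float, df_i: int, n_docs: int) -> float:
    """tf*idf weight as defined above: w_ij = tf_ij * log2(N / df_i)."""
    return tf_ij * math.log2(n_docs / df_i)

# A term with tf = 5 that occurs in 10 of 1000 documents:
print(round(tfidf(5, 10, 1000), 2))   # 5 * log2(100) = 33.22
```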
TF*IDF weighting
• When does TF*IDF register a high weight? When a term t occurs
  many times within a small number of documents.
  – The highest tf*idf for a term indicates a high term frequency
    (in the given document) and a low document frequency (in the
    whole collection of documents);
  – the weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer times
  in a document, or occurs in many documents,
  – thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in
  virtually all documents.
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents, and statistical
  analysis shows that the document frequencies (DF) of three terms
  are: A(50), B(1300), C(250). The raw term frequencies of these
  terms in a given document are: A(3), B(2), C(1). Compute TF*IDF
  for each term, normalizing tf by the maximum frequency (3):
  A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644;  tf*idf = 7.644
  B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
  C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322;  tf*idf = 1.774
• The query is also treated as a short document and tf-idf
  weighted, often with the augmented formula:
  wiq = (0.5 + 0.5 * tfiq) * log2(N / dfi)
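To double-check the arithmetic, a short Python sketch reproducing the example:

```python
import math

# N = 10,000 documents; term frequencies are normalized by the
# maximum raw frequency in the document (3).
N = 10_000
for term, f, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    w = (f / 3) * math.log2(N / df)
    print(f"{term}: tf*idf = {w:.3f}")   # A 7.644, B 1.962, C 1.774
```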
More Example
• Consider a document containing 100 words wherein the word
  computer appears 3 times. Now assume we have 10,000,000
  documents and computer appears in 1,000 of these.
  – The term frequency (TF) for computer:
    3/100 = 0.03
  – The inverse document frequency is
    log2(10,000,000 / 1,000) = 13.288
  – The TF*IDF score is the product of these two:
    0.03 * 13.288 = 0.3986
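The same computation as a two-line Python check:

```python
import math

# Verifying the "computer" example above.
tf = 3 / 100                            # 0.03
idf = math.log2(10_000_000 / 1_000)     # log2(10000) ≈ 13.288
print(round(tf * idf, 4))               # ≈ 0.3986
```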
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term below
  (a sketch follows the table).

  Word      C   TW  TD  DF  TF  IDF  TFIDF
  airplane  5   46  3   1
  blue      1   46  3   1
  chair     7   46  3   3
  computer  3   46  3   1
  forest    2   46  3   1
  justice   7   46  3   3
  love      2   46  3   1
  might     2   46  3   1
  perl      5   46  3   2
  rose      6   46  3   3
  shoe      4   46  3   1
  thesis    2   46  3   2
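A sketch of one possible solution, assuming TF = C / TW and IDF = log2(TD / DF) as defined earlier in this chapter (the log base is an assumption here):

```python
import math

rows = [("airplane", 5, 1), ("blue", 1, 1), ("chair", 7, 3),
        ("computer", 3, 1), ("forest", 2, 1), ("justice", 7, 3),
        ("love", 2, 1), ("might", 2, 1), ("perl", 5, 2),
        ("rose", 6, 3), ("shoe", 4, 1), ("thesis", 2, 2)]
TW, TD = 46, 3
for word, c, df in rows:
    tf = c / TW                  # TF = C / TW
    idf = math.log2(TD / df)     # IDF = log2(TD / DF)
    print(f"{word:10s} TF={tf:.3f} IDF={idf:.3f} TF*IDF={tf * idf:.3f}")
```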
Similarity Measure
• We now have vectors for all documents in the collection and a
  vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of
  similarity or distance between a document vector and a query
  vector.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order
    of presumed relevance.
  – It is possible to enforce a certain threshold so that the
    size of the retrieved set can be controlled.
  [Figure: document vectors D1 and D2 and query vector Q plotted
  in a space with term axes t1, t2 and t3]
Similarity/Dissimilarity Measures
• Euclidean distance
  – The most common distance measure. Euclidean distance examines
    the root of squared differences between the coordinates of a
    pair of document and query vectors.
• Dot product
  – The dot product is also known as the scalar product or inner
    product.
  – It is defined as the sum of the products of the corresponding
    components of the query and document vectors.
• Cosine similarity (or normalized inner product)
  – It projects document and query vectors into the term space
    and calculates the cosine of the angle between them.
Euclidean distance
• The distance between the vectors for document dj and query q
  can be computed as (a smaller distance means greater similarity):
  sim(dj, q) = |dj − q| = sqrt( Σi=1..n (wij − wiq)^2 )
  where wij is the weight of term i in document j and wiq is the
  weight of term i in the query.
• Example: determine the Euclidean distance between the document
  vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0),
  where 0 means the corresponding term is not found in the
  document or query:
  sqrt((0−2)^2 + (3−7)^2 + (2−1)^2 + (1−0)^2 + (10−0)^2)
  = sqrt(122) ≈ 11.05
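A one-line Python check of the example above:

```python
import math

# Euclidean distance between the example document and query vectors.
d = [0, 3, 2, 1, 10]
q = [2, 7, 1, 0, 0]
print(round(math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q))), 2))
# 11.05 (sqrt of 122)
```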
Dissimilarity Measures
• Euclidean distance generalizes to the popular dissimilarity
  measure called the Minkowski distance:
  Dis(dj, q) = ( Σi=1..n |wij − wiq|^m )^(1/m)
  where n is the number of terms in the vectors and m = 1, 2, 3, …
• If m = 1, the measure is the Manhattan distance:
  Dis(dj, q) = Σi=1..n |wij − wiq|
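A minimal sketch of the Minkowski distance, reusing the example vectors from the previous slide:

```python
# Minkowski distance of order m; m = 1 gives Manhattan, m = 2 Euclidean.
def minkowski(d, q, m=2):
    return sum(abs(wd - wq) ** m for wd, wq in zip(d, q)) ** (1 / m)

d = [0, 3, 2, 1, 10]
q = [2, 7, 1, 0, 0]
print(minkowski(d, q, m=1))             # 18.0 (Manhattan)
print(round(minkowski(d, q, m=2), 2))   # 11.05 (Euclidean)
```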
Inner Product
• The similarity between the vectors for document dj and query q
  can be computed as the vector inner product:
  sim(dj, q) = dj • q = Σi=1..n wij * wiq
  where wij is the weight of term i in document j and wiq is the
  weight of term i in the query q.
• For binary vectors, the inner product is the number of matched
  query terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the
  weights of the matched terms.
Inner Product -- Examples
• Given the following term-document matrix, which document is more
  relevant for the query Q under the inner product?

        Retrieval  Database  Architecture
  D1    2          3         5
  D2    3          7         1
  Q     1          0         2

• sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
• sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
• So D1 is more relevant to Q than D2.
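The same computation as a short Python check:

```python
# Inner-product similarity for the example above.
def inner(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

Q, D1, D2 = [1, 0, 2], [2, 3, 5], [3, 7, 1]
print(inner(D1, Q), inner(D2, Q))   # 12 5 -> D1 is the more relevant document
```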
Cosine similarity
• Measures the similarity between a document dj and a query q as
  the cosine of the angle θ between their vectors:
  sim(dj, q) = (dj • q) / (|dj| |q|)
             = Σi=1..n (wij * wiq) /
               ( sqrt(Σi=1..n wij^2) * sqrt(Σi=1..n wiq^2) )
• The denominator involves the lengths of the vectors, e.g.:
  |dj| = sqrt( Σi=1..n wij^2 )
• So the cosine measure is also known as the normalized inner
  product.
Example 1: Computing Cosine Similarity
• Say we have the query vector Q = (0.4, 0.8) and the document
  vector D1 = (0.2, 0.7). Compute their cosine similarity:
  sim(Q, D1) = (0.4*0.2 + 0.8*0.7) /
               sqrt([(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2])
             = 0.64 / sqrt(0.8 * 0.53)
             = 0.64 / 0.651
             ≈ 0.98
Example 2: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and
  D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8),
  determine which document is more relevant for the query:
  cos θ1 = sim(Q, D1) ≈ 0.73
  cos θ2 = sim(Q, D2) ≈ 0.98
• Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.
  [Figure: Q, D1 and D2 plotted in a two-term space; θ1 is the
  angle between Q and D1, θ2 the angle between Q and D2]
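A short Python sketch of cosine similarity, checked against Examples 1 and 2:

```python
import math

# Cosine similarity: dot product over the product of vector lengths.
def cosine(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm

Q = [0.4, 0.8]
print(round(cosine([0.2, 0.7], Q), 2))   # 0.98 (Example 1; also D2 above)
print(round(cosine([0.8, 0.3], Q), 2))   # 0.73 (D1 above)
```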
Exercise
• Document 1: The game of life is a game of everlasting
learning
• Document 2: The unexamined life is not worth living
• Document 3: Never stop learning
• Imagine that you are searching these documents with the
  following query: life learning
• Given the three documents D1, D2 and D3:
  – Compute tf, idf and tf*idf.
  – Which documents are most similar to the query under the
    cosine similarity measure?
Step 1: Term Frequency (TF)
• Term frequency (TF) measures the number of times a term (word)
  occurs in a document. Given below are the terms and their
  frequencies in each of the documents.

TF for Document 1
  Term:      the  game  of  life  is  a  everlasting  learning
  Frequency: 1    2     2   1     1   1  1            1

TF for Document 2
  Term:      the  unexamined  life  is  not  worth  living
  Frequency: 1    1           1     1   1    1      1

TF for Document 3
  Term:      never  stop  learning
  Frequency: 1      1     1
• In reality each document will be of a different size. In a large
  document the raw frequencies of the terms will be much higher
  than in smaller ones, so we need to normalize for document size.
  A simple trick is to divide each term frequency by the total
  number of terms in the document.

Normalized TF for Document 1 (10 terms)
  Term:          the  game  of   life  is   a    everlasting  learning
  Normalized TF: 0.1  0.2   0.2  0.1   0.1  0.1  0.1          0.1

Normalized TF for Document 2 (7 terms)
  Term:          the       unexamined  life      is        not       worth     living
  Normalized TF: 0.142857  0.142857    0.142857  0.142857  0.142857  0.142857  0.142857

Normalized TF for Document 3 (3 terms)
  Term:          never     stop      learning
  Normalized TF: 0.333333  0.333333  0.333333
Step 2: Inverse Document Frequency (IDF)
• Certain terms that occur too frequently have little power in
  determining relevance, so we need a way to weigh down the
  effects of too frequently occurring terms. Conversely, terms
  that occur in fewer documents can be more relevant, so we need
  a way to weigh up the effects of less frequently occurring
  terms. (The blanks below are computed in the sketch that
  follows.)

  Term         DF  IDF
  the          ?   ?
  game         ?   ?
  of           ?   ?
  life         ?   ?
  is           ?   ?
  a            ?   ?
  everlasting  ?   ?
  learning     ?   ?
  unexamined   ?   ?
  not          ?   ?
  worth        ?   ?
  living       ?   ?
  never        ?   ?
  stop         ?   ?
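A sketch that fills in the DF/IDF blanks, assuming idf = log2(N / df) with N = 3 as defined earlier in this chapter (other log bases are also common for this example):

```python
import math

docs = [
    "the game of life is a game of everlasting learning".split(),
    "the unexamined life is not worth living".split(),
    "never stop learning".split(),
]
N = len(docs)
vocab = sorted({t for d in docs for t in d})
for term in vocab:
    df = sum(term in d for d in docs)               # documents containing term
    print(f"{term:12s} DF={df} IDF={math.log2(N / df):.3f}")
```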
Step 3: TF * IDF
• For each term in the query, multiply its normalized term
  frequency by its IDF in each document.

            Document1  Document2  Document3
  life      ?          ?          ?
  learning  ?          ?          ?
• The query entered by the user can also be represented as a
  vector. We calculate the TF*IDF for the query; each of the two
  query terms occurs once in the two-term query, so TF = 1/2 = 0.5.

            TF   IDF  TF*IDF
  life      0.5  ?    ?
  learning  0.5  ?    ?
Step 4: Cosine Similarity
• From trigonometry, the cosine value always lies between −1 and 1:
  the cosine of a small angle is near 1, and the cosine of a large
  angle near 180 degrees is close to −1. Small angles therefore map
  to high similarity. Since tf*idf weights are never negative, the
  angle between two tf*idf vectors is at most 90 degrees, so in
  practice the cosine similarity lies between 0 and 1.
• Given below are the similarity scores between each document and
  the query (computed in the sketch below):

                     Document1  Document2  Document3
  Cosine Similarity  ?          ?          ?
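An end-to-end sketch of Steps 1 through 4 for the query "life learning", assuming normalized tf = count / document length, idf = log2(N / df), and scoring only the two query-term dimensions as the tables above do:

```python
import math

docs = [
    "the game of life is a game of everlasting learning".split(),
    "the unexamined life is not worth living".split(),
    "never stop learning".split(),
]
query = "life learning".split()
N = len(docs)

def idf(term):
    df = sum(term in d for d in docs)
    return math.log2(N / df) if df else 0.0

def tfidf_vector(tokens):
    # tf*idf weight of each query term: (count / length) * idf
    return [tokens.count(t) / len(tokens) * idf(t) for t in query]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = tfidf_vector(query)
for i, d in enumerate(docs, 1):
    print(f"Document {i}: cosine = {cosine(tfidf_vector(d), q_vec):.3f}")
# Under these assumptions Document 1 scores highest, since it contains
# both query terms.
```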