Chapter Three
Term Weighting and Similarity Measures
Term-Document Matrix
• Documents and queries are represented as vectors or “bags of
  words” (BOW) in a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term
  in a document.
  – The weight wij of term i in document j may be binary or
    non-binary; wij = 0 means the term does not occur in the
    document. Binary weights are defined as:
    wij = 1 if freqij > 0, and 0 otherwise

        T1   T2   …   Tt
  D1   w11  w21   …  wt1
  D2   w12  w22   …  wt2
  :     :    :        :
  Dn   w1n  w2n   …  wtn
Binary Weights
• Only the presence (1) or absence (0) of a term is included in
  the vector.
• The binary formula gives every word that appears in a document
  equal relevance.
• It can be useful when frequency is not important.
• Binary weights formula:
  wij = 1 if freqij > 0; 0 if freqij = 0

  docs  t1  t2  t3
  D1    1   0   1
  D2    1   0   0
  D3    0   1   1
  D4    1   0   0
  D5    1   1   1
  D6    1   1   0
  D7    0   1   0
  D8    0   1   0
  D9    0   0   1
  D10   0   1   1
  D11   1   0   1
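As a quick illustration, here is a minimal Python sketch of the binary weighting formula; the example counts are hypothetical:

```python
# Turning raw term counts into binary weights:
# w_ij = 1 if freq_ij > 0, and 0 otherwise.
freq = {
    "D1": {"t1": 2, "t2": 0, "t3": 3},   # hypothetical raw counts
    "D2": {"t1": 1, "t2": 0, "t3": 0},
}

binary = {
    doc: {term: 1 if count > 0 else 0 for term, count in counts.items()}
    for doc, counts in freq.items()
}
print(binary["D1"])   # {'t1': 1, 't2': 0, 't3': 1}
```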
Why use term weighting?
• Binary weights are too limiting.
  – Terms are either present or absent.
  – They do not allow ordering documents by their level of
    relevance for a given query.
• Non-binary weights allow us to model partial matching.
  – Partial matching allows retrieval of documents that
    approximate the query.
• Term weighting supports best-match retrieval, which improves
  the quality of the answer set.
  – Term weighting enables ranking of retrieved documents, so
    that the best-matching documents are ordered at the top as
    they are more relevant than the others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs
  in a document.
  fij = frequency of term i in document j
• The more times a term t occurs in a document d, the more
  likely it is that t is relevant to the document, i.e. more
  indicative of its topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear more frequently.
• We may therefore want to normalize term frequency (tf).

  docs  t1  t2  t3
  D1    2   0   3
  D2    1   0   0
  D3    0   4   7
  D4    3   0   0
  D5    1   6   3
  D6    3   5   0
  D7    0   8   0
  D8    0  10   0
  D9    0   0   1
  D10   0   3   5
  D11   4   0   1
Document Normalization
• Long documents have an unfair advantage:
  – They use a lot of terms, so they get more matches than short
    documents.
  – And they use the same words repeatedly, so they have much
    higher term frequencies.
• Normalization seeks to remove these effects:
  – Related somehow to maximum term frequency.
  – But also sensitive to the number of terms.
• Two common normalizations (see the sketch below):
  – Max-frequency normalization:
    tfij = fij / max_k(fik)
  – Min-max normalization:
    tfij = (fij − min_l(fil)) / (max_k(fik) − min_l(fil))
• If we don’t normalize, short documents may not be recognized
  as relevant.
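A minimal Python sketch of the two normalizations above for a single document; the raw counts are hypothetical:

```python
# Normalizing raw term counts f_ij for one document. Both formulas assume
# not all counts are equal (otherwise the min-max denominator is zero).
freqs = {"t1": 2, "t2": 0, "t3": 3}

f_max = max(freqs.values())
f_min = min(freqs.values())

tf_max = {t: f / f_max for t, f in freqs.items()}                        # f_ij / max_k(f_ik)
tf_minmax = {t: (f - f_min) / (f_max - f_min) for t, f in freqs.items()}

print(tf_max)     # {'t1': 0.666..., 't2': 0.0, 't3': 1.0}
print(tf_minmax)  # identical here, since the minimum count is 0
```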
Problems with term frequency
• Need a mechanism for attenuating the effect of terms that
occur too often in the collection to be meaningful for
relevance/meaning determination
• Scale down the weight of terms with high collection
frequency
– Reduce the tf weight of a term by a factor that grows
with the collection frequency
• More commonly used for this purpose is document frequency:
  – how many documents in the collection contain the term.
• Note that collection frequency (the total number of occurrences
  of a term in the collection) and document frequency can behave
  quite differently.
Document Frequency
• It is defined to be the number of documents in the
collection that contain a term
DF = document frequency
– Count the frequency considering the whole collection of
documents.
  – The less frequently a term appears in the whole collection,
    the more discriminating it is.
  dfi = document frequency of term i
      = number of documents containing term i
Inverse Document Frequency (IDF)
• IDF measures rarity of the term in collection. The IDF is a
measure of the general importance of the term
– Inverts the document frequency.
• It diminishes the weight of terms that occur very frequently
in the collection and increases the weight of terms that
occur rarely.
– Gives full weight to terms that occur in one document
only.
– Gives zero weight to terms that occur in all documents.
– Terms that appear in many different documents are less indicative
of overall topic.
  idfi = inverse document frequency of term i, where N is the
  total number of documents:
  idfi = log2(N / dfi)
Inverse Document Frequency
• Example: given a collection of 1,000 documents and the document
  frequencies below, compute the IDF of each word.
Word N DF IDF
the 1000 1000 0
some 1000 100 3.322
car 1000 10 6.644
merge 1000 1 9.966
• IDF provides high values for rare words and low values
for common words.
• IDF is an indication of a term’s discrimination power.
  – The log is used to dampen the effect relative to tf.
  – Note the difference between document frequency and collection
    (corpus) frequency.
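A short Python check that reproduces the IDF table above:

```python
import math

# idf_i = log2(N / df_i) with N = 1000 documents.
N = 1000
for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
    print(f"{word:6s} df={df:4d} idf={math.log2(N / df):.3f}")
# the 0.000, some 3.322, car 6.644, merge 9.966
```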
TF*IDF Weighting
• A good weight must take into account two effects:
– Quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
– Quantification of inter-documents separation (dissimilarity)
• idf factor, the inverse document frequency
• As a result, the most widely used term-weighting scheme in IR
  systems is the tf*idf technique:
  wij = tfij * idfi = tfij * log2(N / dfi)
• A term occurring frequently in the document but rarely in the
  rest of the collection is given high weight.
  – The tf*idf value for a term is always greater than or equal
    to zero.
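A minimal sketch of the weighting formula above; the example values are hypothetical:

```python
import math

def tfidf(tf_ij: float, df_i: int, n_docs: int) -> float:
    """tf*idf weight as defined above: w_ij = tf_ij * log2(N / df_i)."""
    return tf_ij * math.log2(n_docs / df_i)

# A term with tf = 5 that occurs in 10 of 1000 documents:
print(round(tfidf(5, 10, 1000), 2))   # 5 * log2(100) = 33.22
```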
TF*IDF weighting
• When does TF*IDF register a high weight? When a term t occurs
  many times within a small number of documents.
  – The highest tf*idf for a term indicates a high term frequency
    (in the given document) and a low document frequency (in the
    whole collection of documents);
  – the weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer times
  in a document, or occurs in many documents,
  – thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in
  virtually all documents.
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents, and statistical
  analysis shows that the document frequencies (DF) of three terms
  are: A(50), B(1300), C(250). The raw term frequencies of these
  terms in a given document are: A(3), B(2), C(1). Compute TF*IDF
  for each term, normalizing tf by the maximum frequency (3):
  A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644;  tf*idf = 7.644
  B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
  C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322;  tf*idf = 1.774
• The query is also treated as a short document and tf-idf
  weighted, often with the augmented formula:
  wiq = (0.5 + 0.5 * tfiq) * log2(N / dfi)
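To double-check the arithmetic, a short Python sketch reproducing the example:

```python
import math

# N = 10,000 documents; term frequencies are normalized by the
# maximum raw frequency in the document (3).
N = 10_000
for term, f, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    w = (f / 3) * math.log2(N / df)
    print(f"{term}: tf*idf = {w:.3f}")   # A 7.644, B 1.962, C 1.774
```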
More Example
• Consider a document containing 100 words wherein the word
  computer appears 3 times. Now assume we have 10,000,000
  documents and computer appears in 1,000 of these.
  – The term frequency (TF) for computer:
    3/100 = 0.03
  – The inverse document frequency is
    log2(10,000,000 / 1,000) = 13.288
  – The TF*IDF score is the product of these two:
    0.03 * 13.288 = 0.3986
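The same computation as a two-line Python check:

```python
import math

# Verifying the "computer" example above.
tf = 3 / 100                            # 0.03
idf = math.log2(10_000_000 / 1_000)     # log2(10000) ≈ 13.288
print(round(tf * idf, 4))               # ≈ 0.3986
```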
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term below
  (a sketch follows the table).

  Word      C   TW  TD  DF  TF  IDF  TFIDF
  airplane  5   46  3   1
  blue      1   46  3   1
  chair     7   46  3   3
  computer  3   46  3   1
  forest    2   46  3   1
  justice   7   46  3   3
  love      2   46  3   1
  might     2   46  3   1
  perl      5   46  3   2
  rose      6   46  3   3
  shoe      4   46  3   1
  thesis    2   46  3   2
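A sketch of one possible solution, assuming TF = C / TW and IDF = log2(TD / DF) as defined earlier in this chapter (the log base is an assumption here):

```python
import math

rows = [("airplane", 5, 1), ("blue", 1, 1), ("chair", 7, 3),
        ("computer", 3, 1), ("forest", 2, 1), ("justice", 7, 3),
        ("love", 2, 1), ("might", 2, 1), ("perl", 5, 2),
        ("rose", 6, 3), ("shoe", 4, 1), ("thesis", 2, 2)]
TW, TD = 46, 3
for word, c, df in rows:
    tf = c / TW                  # TF = C / TW
    idf = math.log2(TD / df)     # IDF = log2(TD / DF)
    print(f"{word:10s} TF={tf:.3f} IDF={idf:.3f} TF*IDF={tf * idf:.3f}")
```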
Similarity Measure
• We now have vectors for all documents in the collection and a
  vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of
  similarity or distance between a document vector and a query
  vector.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order
    of presumed relevance.
  – It is possible to enforce a certain threshold so that the
    size of the retrieved set can be controlled.
  [Figure: document vectors D1 and D2 and query vector Q plotted
  in a space with term axes t1, t2 and t3]
Similarity/Dissimilarity Measures
• Euclidean distance
  – The most common distance measure. Euclidean distance examines
    the root of squared differences between the coordinates of a
    pair of document and query vectors.
• Dot product
  – The dot product is also known as the scalar product or inner
    product.
  – It is defined as the sum of the products of the corresponding
    components of the query and document vectors.
• Cosine similarity (or normalized inner product)
  – It projects document and query vectors into the term space
    and calculates the cosine of the angle between them.
Euclidean distance
• The distance between the vectors for document dj and query q
  can be computed as (a smaller distance means greater similarity):
  sim(dj, q) = |dj − q| = sqrt( Σi=1..n (wij − wiq)^2 )
  where wij is the weight of term i in document j and wiq is the
  weight of term i in the query.
• Example: determine the Euclidean distance between the document
  vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0),
  where 0 means the corresponding term is not found in the
  document or query:
  sqrt((0−2)^2 + (3−7)^2 + (2−1)^2 + (1−0)^2 + (10−0)^2)
  = sqrt(122) ≈ 11.05
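A one-line Python check of the example above:

```python
import math

# Euclidean distance between the example document and query vectors.
d = [0, 3, 2, 1, 10]
q = [2, 7, 1, 0, 0]
print(round(math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q))), 2))
# 11.05 (sqrt of 122)
```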
Dissimilarity Measures
• Euclidean distance generalizes to the popular dissimilarity
  measure called the Minkowski distance:
  Dis(dj, q) = ( Σi=1..n |wij − wiq|^m )^(1/m)
  where n is the number of terms in the vectors and m = 1, 2, 3, …
• If m = 1, the measure is the Manhattan distance:
  Dis(dj, q) = Σi=1..n |wij − wiq|
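A minimal sketch of the Minkowski distance, reusing the example vectors from the previous slide:

```python
# Minkowski distance of order m; m = 1 gives Manhattan, m = 2 Euclidean.
def minkowski(d, q, m=2):
    return sum(abs(wd - wq) ** m for wd, wq in zip(d, q)) ** (1 / m)

d = [0, 3, 2, 1, 10]
q = [2, 7, 1, 0, 0]
print(minkowski(d, q, m=1))             # 18.0 (Manhattan)
print(round(minkowski(d, q, m=2), 2))   # 11.05 (Euclidean)
```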
Inner Product
• The similarity between the vectors for document dj and query q
  can be computed as the vector inner product:
  sim(dj, q) = dj • q = Σi=1..n wij * wiq
  where wij is the weight of term i in document j and wiq is the
  weight of term i in the query q.
• For binary vectors, the inner product is the number of matched
  query terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the
  weights of the matched terms.
Inner Product -- Examples
• Given the following term-document matrix, which document is more
  relevant for the query Q under the inner product?

        Retrieval  Database  Architecture
  D1    2          3         5
  D2    3          7         1
  Q     1          0         2

• sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
• sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
• So D1 is more relevant to Q than D2.
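The same computation as a short Python check:

```python
# Inner-product similarity for the example above.
def inner(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

Q, D1, D2 = [1, 0, 2], [2, 3, 5], [3, 7, 1]
print(inner(D1, Q), inner(D2, Q))   # 12 5 -> D1 is the more relevant document
```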
Cosine similarity
• Measures the similarity between a document dj and a query q as
  the cosine of the angle θ between their vectors:
  sim(dj, q) = (dj • q) / (|dj| |q|)
             = Σi=1..n (wij * wiq) /
               ( sqrt(Σi=1..n wij^2) * sqrt(Σi=1..n wiq^2) )
• The denominator involves the lengths of the vectors, e.g.:
  |dj| = sqrt( Σi=1..n wij^2 )
• So the cosine measure is also known as the normalized inner
  product.
Example 1: Computing Cosine Similarity
• Say we have the query vector Q = (0.4, 0.8) and the document
  vector D1 = (0.2, 0.7). Compute their cosine similarity:
  sim(Q, D1) = (0.4*0.2 + 0.8*0.7) /
               sqrt([(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2])
             = 0.64 / sqrt(0.8 * 0.53)
             = 0.64 / 0.651
             ≈ 0.98
Example 2: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and
  D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8),
  determine which document is more relevant for the query:
  cos θ1 = sim(Q, D1) ≈ 0.73
  cos θ2 = sim(Q, D2) ≈ 0.98
• Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.
  [Figure: Q, D1 and D2 plotted in a two-term space; θ1 is the
  angle between Q and D1, θ2 the angle between Q and D2]
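A short Python sketch of cosine similarity, checked against Examples 1 and 2:

```python
import math

# Cosine similarity: dot product over the product of vector lengths.
def cosine(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm

Q = [0.4, 0.8]
print(round(cosine([0.2, 0.7], Q), 2))   # 0.98 (Example 1; also D2 above)
print(round(cosine([0.8, 0.3], Q), 2))   # 0.73 (D1 above)
```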
Exercise
• Document 1: The game of life is a game of everlasting
learning
• Document 2: The unexamined life is not worth living
• Document 3: Never stop learning
• Imagine that you are searching these documents with the
  following query: life learning
• Given the three documents D1, D2 and D3:
  – Compute tf, idf and tf*idf.
  – Which documents are most similar to the query under the
    cosine similarity measure?
Step 1: Term Frequency (TF)
• Term frequency (TF) measures the number of times a term (word)
  occurs in a document. Given below are the terms and their
  frequencies in each of the documents.

TF for Document 1
  Term:      the  game  of  life  is  a  everlasting  learning
  Frequency: 1    2     2   1     1   1  1            1

TF for Document 2
  Term:      the  unexamined  life  is  not  worth  living
  Frequency: 1    1           1     1   1    1      1

TF for Document 3
  Term:      never  stop  learning
  Frequency: 1      1     1
• In reality each document will be of a different size. In a large
  document the raw frequencies of the terms will be much higher
  than in smaller ones, so we need to normalize for document size.
  A simple trick is to divide each term frequency by the total
  number of terms in the document.

Normalized TF for Document 1 (10 terms)
  Term:          the  game  of   life  is   a    everlasting  learning
  Normalized TF: 0.1  0.2   0.2  0.1   0.1  0.1  0.1          0.1

Normalized TF for Document 2 (7 terms)
  Term:          the       unexamined  life      is        not       worth     living
  Normalized TF: 0.142857  0.142857    0.142857  0.142857  0.142857  0.142857  0.142857

Normalized TF for Document 3 (3 terms)
  Term:          never     stop      learning
  Normalized TF: 0.333333  0.333333  0.333333
Step 2: Inverse Document Frequency (IDF)
• Certain terms that occur too frequently have little power in
  determining relevance, so we need a way to weigh down the
  effects of too frequently occurring terms. Conversely, terms
  that occur in fewer documents can be more relevant, so we need
  a way to weigh up the effects of less frequently occurring
  terms. (The blanks below are computed in the sketch that
  follows.)

  Term         DF  IDF
  the          ?   ?
  game         ?   ?
  of           ?   ?
  life         ?   ?
  is           ?   ?
  a            ?   ?
  everlasting  ?   ?
  learning     ?   ?
  unexamined   ?   ?
  not          ?   ?
  worth        ?   ?
  living       ?   ?
  never        ?   ?
  stop         ?   ?
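A sketch that fills in the DF/IDF blanks, assuming idf = log2(N / df) with N = 3 as defined earlier in this chapter (other log bases are also common for this example):

```python
import math

docs = [
    "the game of life is a game of everlasting learning".split(),
    "the unexamined life is not worth living".split(),
    "never stop learning".split(),
]
N = len(docs)
vocab = sorted({t for d in docs for t in d})
for term in vocab:
    df = sum(term in d for d in docs)               # documents containing term
    print(f"{term:12s} DF={df} IDF={math.log2(N / df):.3f}")
```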
Step 3: TF * IDF
• For each term in the query, multiply its normalized term
  frequency by its IDF in each document.

            Document1  Document2  Document3
  life      ?          ?          ?
  learning  ?          ?          ?
• The query entered by the user can also be represented as a
  vector. We calculate the TF*IDF for the query; each of the two
  query terms occurs once in the two-term query, so TF = 1/2 = 0.5.

            TF   IDF  TF*IDF
  life      0.5  ?    ?
  learning  0.5  ?    ?
Step 4: Cosine Similarity
• From trigonometry, the cosine value always lies between −1 and 1:
  the cosine of a small angle is near 1, and the cosine of a large
  angle near 180 degrees is close to −1. Small angles therefore map
  to high similarity. Since tf*idf weights are never negative, the
  angle between two tf*idf vectors is at most 90 degrees, so in
  practice the cosine similarity lies between 0 and 1.
• Given below are the similarity scores between each document and
  the query (computed in the sketch below):

                     Document1  Document2  Document3
  Cosine Similarity  ?          ?          ?
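An end-to-end sketch of Steps 1 through 4 for the query "life learning", assuming normalized tf = count / document length, idf = log2(N / df), and scoring only the two query-term dimensions as the tables above do:

```python
import math

docs = [
    "the game of life is a game of everlasting learning".split(),
    "the unexamined life is not worth living".split(),
    "never stop learning".split(),
]
query = "life learning".split()
N = len(docs)

def idf(term):
    df = sum(term in d for d in docs)
    return math.log2(N / df) if df else 0.0

def tfidf_vector(tokens):
    # tf*idf weight of each query term: (count / length) * idf
    return [tokens.count(t) / len(tokens) * idf(t) for t in query]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = tfidf_vector(query)
for i, d in enumerate(docs, 1):
    print(f"Document {i}: cosine = {cosine(tfidf_vector(d), q_vec):.3f}")
# Under these assumptions Document 1 scores highest, since it contains
# both query terms.
```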