Chapter 3
Term Weighting and Similarity Measures
Term-Document Matrix
• Documents and queries are represented as vectors or "bags of words" (BOW) in a term-document matrix.
• An entry in the matrix corresponds to the "weight" of a term in the document.
  – The weight wij of a term may be binary or non-binary.
  – wij = 0 means the term does not occur in the document.

        T1   T2   …   Tt
  D1    w11  w21  …   wt1
  D2    w12  w22  …   wt2
  :      :    :        :
  Dn    w1n  w2n  …   wtn

  In the binary case: wij = 1 if freqij > 0, else wij = 0.
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.
• Binary weights formula: wij = 1 if freqij > 0, and wij = 0 if freqij = 0.

  docs  t1  t2  t3
  D1     1   0   1
  D2     1   0   0
  D3     0   1   1
  D4     1   0   0
  D5     1   1   1
  D6     1   1   0
  D7     0   1   0
  D8     0   1   0
  D9     0   0   1
  D10    0   1   1
  D11    1   0   1
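As a sketch, the binary weighting above can be computed directly from document text; the small vocabulary and documents below are made-up illustrations, not taken from the slides.

```python
# Minimal sketch of binary term weighting: w_ij = 1 if term i occurs
# in document j, else 0. Documents are represented as sets of terms.
def binary_weights(docs, vocab):
    """Return the binary term-document matrix, one row per document."""
    return [[1 if term in doc else 0 for term in vocab] for doc in docs]

docs = [
    {"information", "retrieval"},   # hypothetical document 1
    {"database", "retrieval"},      # hypothetical document 2
]
vocab = ["information", "retrieval", "database"]
matrix = binary_weights(docs, vocab)
print(matrix)  # [[1, 1, 0], [0, 1, 1]]
```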
Why Use Term Weighting?
• Binary weights are too limiting.
  – Terms are either present or absent.
  – They do not allow ordering documents by their level of relevance for a given query.
• Non-binary weights allow us to model partial matching.
  – Partial matching allows retrieval of documents that approximate the query.
• Term weighting supports best-match retrieval, which improves the quality of the answer set.
  – Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than the others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a document.
  fij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear frequently.
• We may want to normalize term frequency (tf) across the entire corpus.

  docs  t1  t2  t3
  D1     2   0   3
  D2     1   0   0
  D3     0   4   7
  D4     3   0   0
  D5     1   6   3
  D6     3   5   0
  D7     0   8   0
  D8     0  10   0
  D9     0   0   1
  D10    0   3   5
  D11    4   0   1
Document Normalization
• Long documents have an unfair advantage:
  – They use a lot of terms, so they get more matches than short documents.
  – And they use the same words repeatedly, so they have much higher term frequencies.
• If we don't normalize, short documents may not be recognized as relevant.
• Two common normalizations:
  – Maximum normalization:  tfij = fij / maxk(fik)
  – Min-max normalization:  tfij = (fij − minl(fil)) / (maxk(fik) − minl(fil))
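The two normalizations above can be sketched as follows; the raw frequency vector is an invented example for one document.

```python
# Maximum normalization: divide each frequency by the document's
# largest term frequency.
def max_norm_tf(freqs):
    m = max(freqs)
    return [f / m for f in freqs]

# Min-max normalization: rescale frequencies into [0, 1] using the
# document's smallest and largest term frequencies.
def min_max_tf(freqs):
    lo, hi = min(freqs), max(freqs)
    return [(f - lo) / (hi - lo) for f in freqs]

freqs = [2, 0, 3]          # hypothetical raw counts of three terms in one document
print(max_norm_tf(freqs))  # [0.666..., 0.0, 1.0]
print(min_max_tf(freqs))   # same here, since the minimum frequency is 0
```

When the minimum frequency in a document is 0 (some vocabulary term is absent), the two formulas coincide; they differ only when every term occurs at least once.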
Problems with Term Frequency
• We need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance determination.
• We should scale down the weight of terms with high collection frequency.
• More common for this purpose is document frequency:
  – how many documents in the collection contain the term.
• Collection frequency and document frequency behave differently: two terms may occur the same number of times in the collection overall, yet one may be concentrated in a few documents while the other is spread across many.
Document Frequency
• Document frequency (DF) is the number of documents in the collection that contain a term.
  – Count the frequency considering the whole collection of documents.
  – The less frequently a term appears in the whole collection, the more discriminating it is.

  dfi (document frequency of term i) = number of documents containing term i
Inverse Document Frequency (IDF)
• IDF measures the rarity of a term in the collection; it is a measure of the general importance of the term.
  – It inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
  – It gives full weight to terms that occur in only one document.
  – It gives zero weight to terms that occur in all documents.
  – Terms that appear in many different documents are less indicative of the overall topic.

  idfi = inverse document frequency of term i = log2(N / dfi), where N is the total number of documents.
Inverse Document Frequency
• Example: given a collection of 1000 documents and the document frequencies below, compute the IDF for each word.

  Word    N     DF    IDF
  the     1000  1000  0
  some    1000  100   3.322
  car     1000  10    6.644
  merge   1000  1     9.966

• IDF yields high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
  – What is the difference between document frequency and collection (corpus) frequency?
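The IDF table above can be reproduced with a few lines of Python:

```python
import math

# idf_i = log2(N / df_i), with N = 1000 documents as in the table.
def idf(n_docs, df):
    return math.log2(n_docs / df)

N = 1000
for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
    print(f"{word:5s}  df={df:4d}  idf={idf(N, df):.3f}")
```

A term in every document ("the") gets IDF 0; a term in a single document ("merge") gets the maximum, log2(1000) ≈ 9.966.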
TF*IDF Weighting
A good weight must take into account two effects:
• Quantification of intra-document contents (similarity)
  – This refers to how relevant a term is within a single document.
  – The more frequently a term appears in a document (term frequency), the more important it is considered for describing that document's content.
  – tf factor: the term frequency within a document.
• Quantification of inter-document separation (dissimilarity)
  – This refers to how distinct a term is across the entire document collection.
  – Terms that appear in many documents are less useful for differentiating between documents.
  – idf factor: the inverse document frequency.
TF*IDF Weighting
• When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
  – The highest tf*idf for a term indicates a high term frequency (in the given document) and a low document frequency (in the whole collection of documents).
  – The weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents,
  – thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF*IDF: An Example
• Assume the collection contains 10,000 documents, and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies (TF) of these terms in a given document are: A(3), B(2), C(1), and the maximum term frequency in that document is 3. Compute TF*IDF for each term.

  A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644;   tf*idf = 7.644
  B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
  C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322;  tf*idf = 1.774
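The worked example above can be checked in code; the numbers (N = 10,000, max tf = 3, the DF values) come straight from the example.

```python
import math

# tf*idf with maximum-normalized term frequency, as in the example:
# tf = raw frequency / max frequency in the document.
def tf_idf(tf, max_tf, n_docs, df):
    return (tf / max_tf) * math.log2(n_docs / df)

N, MAX_TF = 10_000, 3
for term, tf, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    print(term, round(tf_idf(tf, MAX_TF, N, df), 3))
# A 7.644
# B 1.962
# C 1.774
```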
Another Example
• Consider a document containing 100 words in which the word computer appears 3 times. Now assume we have 10,000,000 documents and computer appears in 1,000 of these.
  – The term frequency (TF) for computer: 3/100 = 0.03
  – The inverse document frequency: log2(10,000,000 / 1,000) = 13.288
  – The TF*IDF score is the product of these frequencies: 0.03 × 13.288 ≈ 0.3986
Exercises
• A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times, and it is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using the following term-weighting methods:
  (i) normalized and unnormalized TF;
  (ii) TF*IDF based on normalized and unnormalized TF.
Similarity Measure
• We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of similarity (or distance) between a document vector and a query vector.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order of presumed relevance.
  – It is possible to enforce a threshold so that the size of the retrieved set can be controlled.
Similarity/Dissimilarity Measures
Euclidean Distance
• It is a very common distance measure.
• Euclidean distance examines the root of the squared differences between the coordinates of a pair of document and query vectors.
• The distance between the vector for document dj and query q is computed as:

  dist(dj, q) = |dj − q| = sqrt( Σi=1..n (wij − wiq)² )

  where wij is the weight of term i in document j and wiq is the weight of term i in the query.
• Note that this is a dissimilarity measure: smaller values mean the vectors are more similar.
• Example: determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). (A 0 means the corresponding term is not found in the document or query.)

  sqrt((0−2)² + (3−7)² + (2−1)² + (1−0)² + (10−0)²) = sqrt(122) ≈ 11.05
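The distance in the example can be verified with a short sketch, using the vectors given above:

```python
import math

# Euclidean distance between two term-weight vectors of equal length.
def euclidean(d, q):
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

d = [0, 3, 2, 1, 10]
q = [2, 7, 1, 0, 0]
print(round(euclidean(d, q), 2))  # 11.05
```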
Inner Product
• Also known as the scalar product or dot product.
• The similarity between the vector for document dj and query q is computed as the vector inner product:

  sim(dj, q) = dj • q = Σi=1..n wij · wiq

  where wij is the weight of term i in document j and wiq is the weight of term i in query q.
• For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Inner Product -- Examples
Binary weights (size of vector = size of vocabulary = 7):

       Retrieval  Database  Term  Computer  Text  Manage  Data
  D    1          1         1     0         1     1       0
  Q    1          0         1     0         0     1       1

  sim(D, Q) = 3 (the matched terms are Retrieval, Term, and Manage)
Inner Product -- Examples
• Given the following term-document matrix, use the inner product to decide which document is more relevant for the query Q:

       Retrieval  Database  Architecture
  D1   2          3         5
  D2   3          7         1
  Q    1          0         2

• sim(D1, Q) = 2·1 + 3·0 + 5·2 = 12
• sim(D2, Q) = 3·1 + 7·0 + 1·2 = 5
• So D1 is more relevant to the query.
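The example can be reproduced directly; the vectors are the rows of the matrix above.

```python
# Inner (dot) product of two weighted term vectors.
def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

Q = [1, 0, 2]
print(inner_product([2, 3, 5], Q))  # 12 (D1)
print(inner_product([3, 7, 1], Q))  # 5  (D2)
```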
Cosine Similarity
• It projects the document and query vectors into a term space and calculates the cosine of the angle between them:

  sim(dj, q) = (dj • q) / (|dj| · |q|) = Σi=1..n wi,j wi,q / ( sqrt(Σi=1..n wi,j²) · sqrt(Σi=1..n wi,q²) )

• The denominator involves the lengths of the vectors, so the cosine measure is also known as the normalized inner product.

  Length |dj| = sqrt( Σi=1..n wi,j² )

• Cosine similarity ranges from −1 to 1:
  – 1 means the vectors point in the same direction (perfect similarity);
  – 0 means the vectors are orthogonal (no similarity);
  – −1 means the vectors point in opposite directions; this requires negative weights, so with non-negative term weights the range is 0 to 1.
Example 1: Computing Cosine Similarity
• Say we have the query vector Q = (0.4, 0.8) and the document D1 = (0.2, 0.7). Compute their similarity using cosine:

  sim(Q, D1) = (0.4 × 0.2 + 0.8 × 0.7) / sqrt[ (0.4² + 0.8²) × (0.2² + 0.7²) ]
             = 0.64 / sqrt(0.8 × 0.53)
             = 0.64 / 0.651
             ≈ 0.98
Example 2: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which document is more relevant for the query.

  cos θ1 = sim(D1, Q) ≈ 0.73
  cos θ2 = sim(D2, Q) ≈ 0.98

[Figure: Q, D1, and D2 plotted in 2-D term space; the angle θ2 between Q and D2 is smaller than the angle θ1 between Q and D1.]

• Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.
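Example 2 can be checked with a small cosine sketch, using the vectors from the example:

```python
import math

# Cosine similarity: normalized inner product of two term-weight vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

Q = [0.4, 0.8]
print(round(cosine([0.8, 0.3], Q), 2))  # 0.73 (cos θ1, for D1)
print(round(cosine([0.2, 0.7], Q), 2))  # 0.98 (cos θ2, for D2)
```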
Example
• Given three documents D1, D2, and D3 with the corresponding TF*IDF weights below, which documents are most similar under each of the three similarity measures?

  Terms      D1     D2     D3
  affection  0.996  0.993  0.847
  jealous    0.087  0.120  0.466
  gossip     0.017  0.000  0.254
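One way to approach the exercise is to compare the documents pairwise under all three measures; the sketch below does this for the table above (it answers the document-to-document comparison only, since no query vector is given).

```python
import math

# TF*IDF weight vectors from the table (order: affection, jealous, gossip).
D = {
    "D1": [0.996, 0.087, 0.017],
    "D2": [0.993, 0.120, 0.000],
    "D3": [0.847, 0.466, 0.254],
}

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

for x, y in [("D1", "D2"), ("D1", "D3"), ("D2", "D3")]:
    a, b = D[x], D[y]
    print(f"{x}-{y}: inner={inner(a, b):.3f}  "
          f"euclid={euclid(a, b):.3f}  cosine={cosine(a, b):.3f}")
```

Under cosine similarity, D1 and D2 come out as the most similar pair (their angle is smallest); note that Euclidean distance ranks "most similar" as the *smallest* value, while the other two rank it as the largest.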