Dereje Teferi
Terms
Terms are usually stems. Terms can also be phrases,
such as "Computer Science", "World Wide Web", etc.
Documents and queries are represented as vectors or
“bags of words” (BOW).
Each vector holds a place for every term in the
collection.
Position 1 corresponds to term 1, position 2 to term 2,
position n to term n.
$D_i = (w_{d_i 1}, w_{d_i 2}, \ldots, w_{d_i n})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qn})$
$w = 0$ if a term is absent.
Documents are represented by binary weights or
non-binary weighted vectors of terms.
Document Collection
A collection of n documents can be represented in
the vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of
a term in the document; zero means the term has
no significance in the document or it simply
doesn’t exist in the document.
      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn
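To make this concrete, here is a minimal sketch in Python (the toy documents are invented for illustration) that builds such a term-document matrix, using raw term counts as the weights:

```python
from collections import Counter

# Hypothetical toy collection; real systems would tokenize and stem first.
docs = {
    "D1": "computer science and web science",
    "D2": "world wide web",
    "D3": "computer networks on the web",
}

# Vocabulary: one vector position per term in the collection.
vocab = sorted({t for text in docs.values() for t in text.split()})

# counts[d][t] = raw frequency of term t in document d (0 if absent).
counts = {d: Counter(text.split()) for d, text in docs.items()}

# Term-document matrix: rows are terms, columns are documents.
for term in vocab:
    print(f"{term:10s}", [counts[d][term] for d in docs])
```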
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.
• Binary weights formula:

$$w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{if } freq_{ij} = 0 \end{cases}$$

docs   t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
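A minimal sketch of this rule, with toy frequencies chosen to match rows D1 and D2 of the table above:

```python
from collections import Counter

# Raw term frequencies for two toy documents (matching rows D1 and D2).
freq = {
    "D1": Counter({"t1": 2, "t3": 3}),
    "D2": Counter({"t1": 1}),
}
vocab = ["t1", "t2", "t3"]

def binary_weights(doc_freq, vocab):
    # w_ij = 1 if freq_ij > 0, else 0: only presence/absence is kept.
    return [1 if doc_freq[t] > 0 else 0 for t in vocab]

for d, f in freq.items():
    print(d, binary_weights(f, vocab))  # D1 -> [1, 0, 1], D2 -> [1, 0, 0]
```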
Why use term weighting?
Binary weights are too limiting:
Terms are either present or absent.
They do not allow ordering documents according to their level of
relevance for a given query.
Non-binary weights allow modeling of partial matching.
Partial matching allows retrieval of documents that
approximate the query.
• Term weighting improves the quality of the answer set.
Term weighting enables ranking of retrieved documents,
so that the best-matching documents are ordered at the top,
as they are more relevant than others.
Term Weighting: Term Frequency (TF)
TF (term frequency): count the number of times a term occurs in a document.
$f_{ij}$ = frequency of term i in document j
The more times a term t occurs in document d, the more likely it is
that t is relevant to the document, i.e., more indicative of the topic.
If used alone, it favors common words and long documents;
it gives too much credit to words that appear more frequently.
Many IR systems therefore normalize term frequency (tf), typically by
the maximum frequency of any term in the same document:
$tf_{ij} = f_{ij} / \max\{f_{ij}\}$

docs   t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
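A small sketch of this normalization, using the D3 row of the table (assuming the maximum is taken within the document):

```python
# Raw frequencies for D3 above: t2 occurs 4 times, t3 occurs 7 times.
f = {"t1": 0, "t2": 4, "t3": 7}

max_f = max(f.values())  # the most frequent term in this document
tf = {t: f_ij / max_f for t, f_ij in f.items()}
print(tf)  # {'t1': 0.0, 't2': 0.571..., 't3': 1.0}
```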
Document Normalization
Long documents have an unfair advantage:
They use a lot of terms,
so they get more matches than short documents.
And they use the same words repeatedly,
so they have much higher term frequencies.
Normalization seeks to remove these effects:
Related somehow to maximum term frequency,
but also sensitive to the number of terms.
If we don't normalize, short documents may not be
recognized as relevant.
Problems with term frequency
One needs a mechanism for attenuating the effect of terms that
occur too often in the collection to be meaningful for
relevance/meaning determination:
Scale down the weight of terms with high collection frequency.
Reduce the tf weight of a term by a factor that grows with
the collection frequency.
A more common normalizer for term frequency itself is the total
number of words in the document.
Document Frequency
It is defined as the number of documents in the collection
that contain a term.
DF = document frequency:
Count the frequency considering the whole collection of documents.
The less frequently a term appears in the whole collection,
the more discriminating power it has.
df_i = document frequency of term i
     = number of documents containing term i
Inverse Document Frequency (IDF)
IDF measures the rarity of a term in the collection; it is a
measure of the general importance of the term.
It inverts the document frequency.
It diminishes the weight of terms that occur very frequently
in the collection and increases the weight of terms that occur rarely.
It gives full weight to terms that occur in only one document,
and the lowest weight to terms that occur in all documents.
Terms that appear in many different documents are less
indicative of the overall topic.
idf_i = inverse document frequency of term i
      = log2(N / df_i)    (N: total number of documents)
Inverse Document Frequency
• E.g.: given a collection of 1,000 documents and the document
frequencies below, compute the IDF of each word.

Word    N     DF    IDF
the     1000  1000  0
some    1000  100   3.322
car     1000  10    6.644
merge   1000  1     9.966
• IDF provides high values for rare words and low values
for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• Note the difference between document frequency and collection
(corpus) frequency: the former counts documents containing the term,
the latter counts the term's total occurrences in the collection.
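A short sketch that reproduces the table above, assuming the log2 formulation given earlier:

```python
import math

# Reproducing the IDF table for a collection of N = 1000 documents.
N = 1000
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

for word, df_i in df.items():
    idf = math.log2(N / df_i)          # idf_i = log2(N / df_i)
    print(f"{word:6s} df={df_i:5d} idf={idf:.3f}")
# the    df= 1000 idf=0.000
# some   df=  100 idf=3.322
# car    df=   10 idf=6.644
# merge  df=    1 idf=9.966
```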
TF*IDF Weighting
The most widely used term-weighting scheme is tf*idf:
$w_{ij} = tf_{ij} \times idf_i = tf_{ij} \times \log_2(N / df_i)$
A term occurring frequently in the document but
rarely in the rest of the collection is given high weight.
The tf*idf value for a term is always greater than or equal to zero.
Experimentally, tf*idf has been found to work well.
It is often used in the vector space model together with
cosine similarity to determine the similarity between two documents.
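A minimal sketch of the weighting function (the function name and arguments are illustrative):

```python
import math

def tf_idf(tf_ij, df_i, n_docs):
    # w_ij = tf_ij * idf_i = tf_ij * log2(N / df_i)
    return tf_ij * math.log2(n_docs / df_i)

# A term with tf = 3 that appears in 10 of 1000 documents gets a high weight:
print(tf_idf(3, 10, 1000))    # 3 * 6.644 = 19.93...
# The same tf for a term present in every document gets zero weight:
print(tf_idf(3, 1000, 1000))  # 0.0
```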
TF*IDF weighting
When does tf*idf register a high weight? When a term t occurs
many times within a small number of documents.
The highest tf*idf for a term shows that it has a high term
frequency (in the given document) and a low document frequency
(in the whole collection of documents);
the weights hence tend to filter out common terms.
A lower tf*idf is registered when the term occurs fewer times in a
document, or occurs in many documents in the collection,
thus offering a less pronounced relevance signal.
The lowest tf*idf is registered when the term occurs in
virtually all documents.
Computing TF*IDF: An Example
Assume the collection contains 10,000 documents and statistical
analysis shows that the document frequencies (DF) of three terms
A, B, and C are: A(50), B(1300), C(250).
In addition, the term frequencies (TF) of these terms in a
document are: A(3), B(2), C(1), where the total number of
words in the document is 30. Compute tf*idf for each term in
this specific document of 30 words.

A: tf = 3/30 = 0.1     idf = log2(10000/50) = 7.644    tf*idf = 0.7644
B: tf = 2/30 = 0.0667  idf = log2(10000/1300) = 2.943  tf*idf = 0.1962
C: tf = 1/30 = 0.0333  idf = log2(10000/250) = 5.322   tf*idf = 0.1774
Query vector is typically treated as a document and
also tf*idf weighted.
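A short sketch that reproduces the numbers above (variable names are illustrative):

```python
import math

N = 10_000        # documents in the collection
total_words = 30  # words in the specific document
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term: (freq, df)

for term, (f, df) in terms.items():
    tf = f / total_words
    idf = math.log2(N / df)
    print(f"{term}: tf={tf:.4f}  idf={idf:.3f}  tf*idf={tf * idf:.4f}")
# A: tf=0.1000  idf=7.644  tf*idf=0.7644
# B: tf=0.0667  idf=2.943  tf*idf=0.1962
# C: tf=0.0333  idf=5.322  tf*idf=0.1774
```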
Another Example
Consider a document containing 100 words in which the
word cow appears 3 times. Now assume we have 10
million documents, and cow appears in one thousand of them.
The term frequency (tf) for cow:
3/100 = 0.03
The inverse document frequency:
idf = log2(10,000,000 / 1,000) = log2(10,000) = 13.288
The tf*idf score is the product of these frequencies:
0.03 * 13.288 = 0.3986
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in the corpus;
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term
(a sketch follows the table).

Word      C  TW  TD  DF  TF  IDF  TF*IDF
airplane  5  46   3   1
blue      1  46   3   1
chair     7  46   3   3
computer  3  46   3   1
forest    2  46   3   1
justice   7  46   3   3
love      2  46   3   1
might     2  46   3   1
perl      5  46   3   2
rose      6  46   3   3
shoe      4  46   3   1
thesis    2  46   3   2
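One possible answer key, as a sketch that fills in the blank columns assuming TF = C/TW and IDF = log2(TD/DF), consistent with the earlier examples; other tf normalizations would give different numbers:

```python
import math

TW, TD = 46, 3  # words per document, documents in the corpus
words = {  # word: (C, DF)
    "airplane": (5, 1), "blue": (1, 1), "chair": (7, 3), "computer": (3, 1),
    "forest": (2, 1), "justice": (7, 3), "love": (2, 1), "might": (2, 1),
    "perl": (5, 2), "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2),
}

for w, (c, df) in words.items():
    tf = c / TW               # TF = C / TW
    idf = math.log2(TD / df)  # IDF = log2(TD / DF)
    print(f"{w:9s} TF={tf:.3f}  IDF={idf:.3f}  TF*IDF={tf * idf:.4f}")
```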
Concluding remarks
Suppose from a set of English documents we wish to determine
which ones are the most relevant to the query "the brown cow."
A simple way to start is by eliminating documents that do not
contain all three words "the," "brown," and "cow," but this still
leaves many documents.
To further distinguish them, we might count the number of times
each term occurs in each document and sum the counts;
the number of times a term occurs in a document is called its TF. However,
because the term "the" is so common, this will tend to incorrectly
emphasize documents which happen to use the word "the" more often, without
giving enough weight to the more meaningful terms "brown" and "cow."
Also, the term "the" is not a good keyword for distinguishing relevant
from non-relevant documents, while terms like "brown" and "cow" that
occur rarely are good keywords for distinguishing relevant documents
from non-relevant ones.
Concluding remarks
Hence IDF is incorporated, which diminishes the weight of terms
that occur very frequently in the collection and increases the
weight of terms that occur rarely.
This leads to the use of TF*IDF as a better weighting technique.
On top of that, we apply similarity measures to calculate the
distance between document i and query j.
There are a number of similarity measures; the most common are
Euclidean distance, inner (dot) product, cosine similarity,
Dice similarity, Jaccard similarity, etc.
Similarity Measure
We now have vectors for all documents in the collection and a
vector for the query; how do we compute similarity?
A similarity measure is a function that computes the degree of
similarity or distance between a document vector and a query vector.
[Figure: document vectors D1, D2 and query vector Q plotted in a
three-term space (t1, t2, t3).]
Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in the order of
presumed relevance.
It is possible to enforce a certain threshold so that the size of
the retrieved set can be controlled.
[Figure: documents d1 through d5 plotted in a three-term space
(t1, t2, t3), with angles θ and φ between selected vectors.]
Postulate: Documents that are "close together" in the vector space
talk about the same things and are more similar than others.
Similarity Measure
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
3. No document is closer to d than d itself;
i.e., the maximum possible similarity is the "distance"
between a document d and itself.
A similarity measure attempts to compute the distance between a
document vector wj and the query vector wq.
The assumption here is that documents whose vectors are close to
the query vector are more relevant to the query than documents
whose vectors are far away from it.
Similarity Measure: Techniques
• Euclidean distance
It is the most common distance measure. Euclidean distance
examines the root of the squared differences between the
coordinates of a pair of document and query vectors.
• Dot product
The dot product is also known as the scalar product or inner product.
It is defined as the sum of the products of the weights of
corresponding terms in the query and document vectors.
• Cosine similarity (or normalized inner product)
It projects the document and query vectors into the term space and
calculates the cosine of the angle between them.
Euclidean distance
Similarity between the vectors for document dj and query q can be
computed as:

$$\mathrm{sim}(d_j, q) = |d_j - q| = \sqrt{\sum_{i=1}^{n} (w_{ij} - w_{iq})^2}$$

where w_ij is the weight of term i in document j and w_iq is the
weight of term i in the query.
• Example: Determine the Euclidean distance between the document
vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0).
A 0 means the corresponding term is not found in the document or query.

$$\sqrt{(0-2)^2 + (3-7)^2 + (2-1)^2 + (1-0)^2 + (10-0)^2} = \sqrt{122} \approx 11.05$$
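A minimal sketch of this computation (the helper name is illustrative):

```python
import math

def euclidean(d, q):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((w_d - w_q) ** 2 for w_d, w_q in zip(d, q)))

# The example above: document (0, 3, 2, 1, 10) vs. query (2, 7, 1, 0, 0).
print(euclidean([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]))  # sqrt(122) = 11.045...
```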
Inner Product
Similarity between the vectors for document dj and query q can be
computed as the vector inner product:

$$\mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{n} w_{ij} \, w_{iq}$$

where w_ij is the weight of term i in document j and w_iq is the
weight of term i in query q.
For binary vectors, the inner product is the number of matched
query terms in the document (the size of the intersection).
For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
Properties of Inner Product
Favors long documents with a large number of
unique terms.
Again, the issue of normalization
Measures how many terms matched but not how
many terms are not matched.
Inner Product: Examples
• Binary weights:
Size of vector = size of vocabulary = 7

     Retrieval  Database  Term  Computer  Text  Manage  Data
D    1          1         1     0         1     1       0
Q    1          0         1     0         0     1       1

sim(D, Q) = 3

• Term weighted:

     Retrieval  Database  Architecture
D1   2          3         5
D2   3          7         1
Q    1          0         2

sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
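A sketch reproducing both examples (the helper name is illustrative):

```python
def inner_product(d, q):
    # Sum of products of matched term weights; for binary vectors this
    # is simply the number of shared terms.
    return sum(w_d * w_q for w_d, w_q in zip(d, q))

# Binary example (Retrieval, Database, Term, Computer, Text, Manage, Data):
print(inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1]))  # 3

# Weighted example (Retrieval, Database, Architecture):
Q = [1, 0, 2]
print(inner_product([2, 3, 5], Q))  # D1: 12
print(inner_product([3, 7, 1], Q))  # D2: 5
```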
Inner Product: Example 1
[Figure: documents d1 through d7 positioned by which of the
keywords k1, k2, k3 they contain.]

     k1  k2  k3  q . dj
d1   1   0   1   2
d2   1   0   0   1
d3   0   1   1   2
d4   1   0   0   1
d5   1   1   1   3
d6   1   1   0   2
d7   0   1   0   1
q    1   1   1
Inner Product: Exercise
[Figure: documents d1 through d7 positioned by which of the
keywords k1, k2, k3 they contain.]

     k1  k2  k3  q . dj
d1   1   0   1   ?
d2   1   0   0   ?
d3   0   1   1   ?
d4   1   0   0   ?
d5   1   1   1   ?
d6   1   1   0   ?
d7   0   1   0   ?
q    1   2   3
Cosine similarity
Measures the similarity between dj and q by the cosine of the
angle between them:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^2}\;\sqrt{\sum_{i=1}^{n} w_{iq}^2}}$$

Or, between two documents:

$$\mathrm{sim}(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j|\,|d_k|} = \frac{\sum_{i=1}^{n} w_{ij}\, w_{ik}}{\sqrt{\sum_{i=1}^{n} w_{ij}^2}\;\sqrt{\sum_{i=1}^{n} w_{ik}^2}}$$

The denominator involves the lengths of the vectors:

$$\mathrm{Length}(d_j) = |d_j| = \sqrt{\sum_{i=1}^{n} w_{ij}^2}$$

So the cosine measure is also known as the normalized inner product.
Example: Computing Cosine Similarity
• Let us assume we have query vector Q = (0.4, 0.8) and document
D1 = (0.2, 0.7). Compute their cosine similarity.

$$\mathrm{sim}(Q, D_1) = \frac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{(0.4)^2 + (0.8)^2}\;\sqrt{(0.2)^2 + (0.7)^2}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98$$
Example: Computing Cosine Similarity
• Say we have two documents in our corpus: D1 = (0.8, 0.3) and
D2 = (0.2, 0.7). Given query vector Q = (0.4, 0.8), determine
which document is most relevant to the query.

cos θ1 = sim(Q, D1) ≈ 0.74
cos θ2 = sim(Q, D2) ≈ 0.98

Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.
[Figure: Q, D1 and D2 plotted in a two-term space; θ2, the angle
between Q and D2, is smaller than θ1.]
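A sketch reproducing this comparison (the helper name is illustrative); note the exact value for D1 is 0.733, which the slide rounds to 0.74:

```python
import math

def cosine(d, q):
    # Inner product divided by the product of the two vector lengths.
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) *
                  math.sqrt(sum(w * w for w in q)))

Q = (0.4, 0.8)
print(cosine((0.8, 0.3), Q))  # D1: 0.733...
print(cosine((0.2, 0.7), Q))  # D2: 0.983 -> D2 ranks higher
```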
Example
Given three documents D1, D2 and D3 with the corresponding
tf*idf weights below, which documents are most similar under
the three measures?

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
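A sketch that computes all three measures for each document pair (helper names are illustrative); remember that for Euclidean distance smaller means more similar, while for dot product and cosine larger means more similar:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# tf*idf vectors over (affection, jealous, gossip) from the table above.
D1, D2, D3 = (0.996, 0.087, 0.017), (0.993, 0.120, 0.000), (0.847, 0.466, 0.254)

pairs = {"D1-D2": (D1, D2), "D1-D3": (D1, D3), "D2-D3": (D2, D3)}
for name, (a, b) in pairs.items():
    print(f"{name}: euclidean={euclidean(a, b):.3f}  "
          f"dot={dot(a, b):.3f}  cosine={cosine(a, b):.3f}")
```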
Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two
vectors: the inner product normalized by the vector lengths.

$$\mathrm{CosSim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\;\sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$

$$\mathrm{InnerProduct}(d_j, q) = d_j \cdot q$$

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3

D1 is about 6 times better than D2 using cosine similarity, but only
5 times better using the inner product, in terms of closeness to the
query Q.