
Welcome to

INTERNSHIP STUDIO
Module 06 | Lesson 03
Bag of Words and TF-IDF

Word embeddings

Idea: learn an embedding from words into vectors

We need a function W(word) that returns a vector encoding of that word.
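
A minimal sketch of such a lookup, assuming a toy vocabulary and randomly initialised numpy vectors (VOCAB, embeddings, and W are illustrative names; real embeddings would be learned, not random):

import numpy as np

# Hypothetical toy vocabulary; in practice this comes from the training corpus.
VOCAB = ["the", "cat", "sat", "on", "floor"]
EMBED_DIM = 50

# Embedding table: one 50-dimensional vector per word, randomly initialised.
rng = np.random.default_rng(0)
embeddings = {word: rng.normal(size=EMBED_DIM) for word in VOCAB}

def W(word):
    """Return the vector encoding of `word` (random here until trained)."""
    return embeddings[word]

print(W("cat").shape)  # (50,)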

Learning word embeddings

First attempt:
Input data is sets of 5 words taken from a meaningful sentence, e.g. “one of the best places”. Half of them are modified by replacing the middle word with a random word, e.g. “one of function best places”.
W is a map (depending on parameters Q) from words to 50-dimensional vectors, e.g. a look-up table or an RNN.
Feed the 5 embeddings into a module R that decides ‘valid’ or ‘invalid’.
Optimize over Q to predict better. A minimal sketch of this setup follows below.
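
A minimal sketch of this setup, assuming numpy and a toy vocabulary; here R is just a linear scorer on the concatenated embeddings, and all names (VOCAB, Q, score_window) are illustrative, with untrained weights:

import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["one", "of", "the", "best", "places", "function"]
EMBED_DIM = 50

# W: look-up table from word to 50-dim vector (the parameters Q to be trained).
Q = {w: rng.normal(size=EMBED_DIM) for w in VOCAB}

# R: scores the concatenated 5 embeddings; after training, > 0 would mean 'valid'.
R_weights = rng.normal(size=5 * EMBED_DIM)

def score_window(words):
    """Return a validity score for a 5-word window."""
    x = np.concatenate([Q[w] for w in words])
    return float(R_weights @ x)

print(score_window("one of the best places".split()))       # real sentence
print(score_window("one of function best places".split()))  # corrupted sentence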

What is a Bag of Words

The Bag-of-Words model is also called the BoW model. Aside from its funny-sounding name, BoW is a critical part of Natural Language Processing (NLP) and one of the building blocks of performing Machine Learning on text.

A BoW is simply an unordered collection of words and their frequencies (counts). For
example, let's look at the following text:

"I sat on a plane and sat on a chair."

word:   and   chair   on   plane   sat
count:   1      1      2     1      2
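
A minimal sketch that reproduces these counts with Python's standard library, assuming "I" and "a" are dropped as stop words (which matches the table above):

import re
from collections import Counter

text = "I sat on a plane and sat on a chair."
tokens = re.findall(r"[a-z]+", text.lower())   # lowercase, strip punctuation
stop_words = {"i", "a"}                        # assumed stop words, to match the table
bow = Counter(t for t in tokens if t not in stop_words)

print(bow)  # Counter({'sat': 2, 'on': 2, 'and': 1, 'plane': 1, 'chair': 1})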

Types of BoW

Predict words using their context.

Two versions: CBOW (continuous bag of words) and Skip-gram

CBOW

CBOW takes the vector embeddings of the n words before the target and the n words after it and adds them (as vectors).
It also discards word order, but the vector sum is meaningful enough to deduce the missing word.

CBOW

E.g. “The cat sat on floor”, with window size = 2: the target word is “sat” and its context words are “the”, “cat”, “on”, “floor”.

CBOW

[Figure: CBOW network. The context words (e.g. “cat” and “on”) enter the input layer as one-hot vectors over the vocabulary (a 1 at the word’s index, 0 elsewhere), are projected to a shared hidden layer, and the output layer produces a prediction for the one-hot vector of the target word “sat”.]
Source: www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
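
A minimal numpy sketch of this forward pass, assuming a toy vocabulary; W_in, W_out, one_hot, and cbow_forward are illustrative names and the weights are untrained:

import numpy as np

VOCAB = ["the", "cat", "sat", "on", "floor"]
V, H = len(VOCAB), 10            # vocabulary size, hidden-layer size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, H))   # input -> hidden weights (the word embeddings)
W_out = rng.normal(size=(H, V))  # hidden -> output weights

def one_hot(word):
    v = np.zeros(V)
    v[VOCAB.index(word)] = 1.0
    return v

def cbow_forward(context_words):
    """Average the context embeddings, then softmax over the vocabulary."""
    h = np.mean([one_hot(w) @ W_in for w in context_words], axis=0)
    scores = h @ W_out
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = cbow_forward(["the", "cat", "on", "floor"])
print(VOCAB[int(np.argmax(probs))])  # untrained, so effectively a random guess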

Skip gram

Skip-gram is an alternative to CBOW.

Start with a single word embedding and try to predict the surrounding words.
This is a much less well-defined problem, but it works better in practice (it scales better).
In this approach, each word or token is called a “gram”. Building a vocabulary of two-word pairs is, in turn, called a bigram model. Again, only the bigrams that actually appear in the corpus are modeled, not all possible bigrams.
A sketch contrasting CBOW and skip-gram training pairs follows below.
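
A minimal sketch contrasting the two training-pair layouts, assuming a window of size 2; training_pairs is an illustrative helper, not part of any library:

def training_pairs(tokens, window=2, mode="skipgram"):
    """Yield (input, target) pairs for a toy corpus.

    skipgram: input is the center word, target is one context word.
    cbow:     input is the list of context words, target is the center word.
    """
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "skipgram":
            for c in context:
                yield center, c
        else:  # cbow
            yield context, center

tokens = "the cat sat on floor".split()
print(list(training_pairs(tokens, mode="skipgram"))[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
print(list(training_pairs(tokens, mode="cbow"))[:2])
# [(['cat', 'sat'], 'the'), (['the', 'sat', 'on'], 'cat')]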

Skip gram

The model maps from the center word to a probability distribution over the surrounding words, one input/output word pair at a time.
There is no activation function on the hidden-layer neurons, but the output neurons use a softmax.
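
If the gensim library is available (this sketch assumes gensim 4.x; the corpus and hyperparameters are illustrative), both architectures can be trained in a couple of lines; sg=1 selects skip-gram and sg=0 selects CBOW:

from gensim.models import Word2Vec

# Toy corpus: in practice this would be thousands of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "floor"],
    ["the", "dog", "sat", "on", "the", "chair"],
]

# sg=1 -> skip-gram, sg=0 -> CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity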

Skip gram/CBOW intuition

Similar “contexts” (that is, the words likely to appear around them) lead to similar embeddings for two words.
One way for the network to output similar context predictions for two words is for their word vectors to be similar. So, if two words have similar contexts, the network is motivated to learn similar word vectors for those two words!

Term Frequency (TF) / Inverse Document Frequency (IDF)

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

This concept includes two steps (a small sketch of both follows this list):

· Counts. Count the number of times each word appears in a document.

· Frequencies. Calculate the frequency of each word in a document out of all the words in that document.
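
A minimal sketch of these two steps on a toy two-document corpus (plain Python; the variable names are illustrative):

from collections import Counter

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
]

for doc in docs:
    tokens = doc.split()
    counts = Counter(tokens)                                  # Counts
    freqs = {w: c / len(tokens) for w, c in counts.items()}   # Frequencies
    print(counts)
    print(freqs)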

Term Frequency (TF)

Term frequency (TF) is used in connection with information retrieval and shows how frequently an expression (term, word) occurs in a document.

TF can be interpreted as the probability of finding a word in a document (e.g., a review).
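
A minimal sketch of the usual definition, TF(t, d) = count of t in d divided by the total number of terms in d (the helper name is illustrative):

def term_frequency(term, document_tokens):
    """TF = (# occurrences of the term in the document) / (total terms in the document)."""
    return document_tokens.count(term) / len(document_tokens)

doc = "the car is driven on the road".split()
print(term_frequency("the", doc))  # 2 / 7 ≈ 0.286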

Inverse Document Frequency (IDF)

The inverse document frequency is a measure of how much information a word provides, i.e., whether it is common or rare across all documents.

It is used to weight rare words across all documents in the corpus: words that occur rarely in the corpus have a high IDF score.
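
A minimal sketch of a common definition, IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t (some implementations add smoothing terms; the helper name is illustrative):

import math

def inverse_document_frequency(term, tokenized_docs):
    """IDF = log(total documents / documents containing the term)."""
    df = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / df)

docs = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
]
print(inverse_document_frequency("the", docs))  # log(2/2) = 0.0    -> common word
print(inverse_document_frequency("car", docs))  # log(2/1) ≈ 0.693  -> rarer word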

Term frequency–Inverse document frequency:

TF-IDF gives larger values to words that are less frequent across the document corpus. The TF-IDF value is high when both TF and IDF are high, i.e., when a word is rare across the whole corpus but frequent within a particular document.

Term frequency–Inverse document frequency:


Sentence 1: The car is driven on the road.
Sentence 2: The truck is driven on the highway.
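
A minimal end-to-end sketch over these two sentences, using the plain definitions TF = count / document length and IDF = log(N / df) (variable names are illustrative; library implementations such as scikit-learn’s TfidfVectorizer add smoothing and normalisation, so their exact numbers differ):

import math
from collections import Counter

docs = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
]
N = len(docs)
vocab = sorted(set(w for doc in docs for w in doc))
df = {w: sum(1 for doc in docs if w in doc) for w in vocab}

for i, doc in enumerate(docs, start=1):
    counts = Counter(doc)
    tfidf = {w: (counts[w] / len(doc)) * math.log(N / df[w]) for w in counts}
    print(f"Sentence {i}:", {w: round(v, 3) for w, v in tfidf.items()})

# Shared words ("the", "is", "driven", "on") score 0; "car"/"road" and
# "truck"/"highway" get positive scores because each is unique to one sentence.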
