Welcome to
INTERNSHIP STUDIO
Module 06 | Lesson 03
Bag of Words and TF-IDF
WWW.INTERNSHIPSTUDIO.COM
Word embeddings
Idea: learn an embedding from words into vectors
Need to have a function W(word) that returns a vector encoding that word.
Learning word embeddings
First attempt:
Input data consists of 5-word windows taken from meaningful sentences, e.g., “one of the best
places”. Half of them are modified by replacing the middle word with a random word, e.g., “one of
function best places”.
W is a map (depending on parameters Q) from words to 50-dimensional vectors, e.g., a look-up
table or an RNN.
Feed the 5 embeddings into a module R that decides ‘valid’ or ‘invalid’.
Optimize over Q to predict better.
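A minimal PyTorch sketch of this setup (the layer sizes, module structure, and optimizer are illustrative assumptions, not taken from the slide):

import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # assumed vocabulary size
EMBED_DIM = 50        # 50-dimensional vectors, as on the slide

# W: look-up table from word indices to 50-dim vectors (the parameters "Q")
W = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

# R: module that scores a window of 5 embeddings as 'valid' or 'invalid'
R = nn.Sequential(
    nn.Linear(5 * EMBED_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def score_window(word_ids):
    # word_ids: tensor of shape (batch, 5) holding vocabulary indices
    vectors = W(word_ids)                       # (batch, 5, 50)
    flat = vectors.view(word_ids.size(0), -1)   # (batch, 250)
    return R(flat)                              # logit: valid vs. invalid

# Optimize over Q, i.e., the parameters of W (and R), with a binary classification loss
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(list(W.parameters()) + list(R.parameters()), lr=1e-3)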
What is a Bag of Words
The Bag-of-Words model is also called the BoW model. Aside from its funny-sounding name, BoW
is a critical part of Natural Language Processing (NLP) and one of the building blocks of applying
Machine Learning to text.
A BoW is simply an unordered collection of words and their frequencies (counts). For
example, let's look at the following text:
"I sat on a plane and sat on a chair."
Word  | and | chair | on | plane | sat
Count |  1  |   1   |  2 |   1   |  2
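A quick sketch of building such a bag of words in Python with only the standard library (variable names are illustrative):

import re
from collections import Counter

text = "I sat on a plane and sat on a chair."
words = re.findall(r"[a-z]+", text.lower())   # lowercase tokens, punctuation dropped
bow = Counter(words)                          # unordered word -> count mapping

for word, count in sorted(bow.items()):
    print(word, count)
# a 2, and 1, chair 1, i 1, on 2, plane 1, sat 2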
Types of BOW
Predict words using context
Two versions: CBOW (continuous bag of words) and Skip-gram
CBOW
Takes the vector embeddings of the n words before the target and the n words after, and adds them
(as vectors).
This also removes word order, but the vector sum is meaningful enough to deduce the missing
word.
CBOW
E.g. “The cat sat on floor”
Window size = 2
Target word: sat; context words: the, cat, on, floor.
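A small Python sketch of how (context, target) training pairs for CBOW can be generated from this sentence with window size 2 (names and details are illustrative):

sentence = "the cat sat on floor".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # all words within the window around position i, excluding the target itself
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    pairs.append((context, target))

print(pairs[2])   # (['the', 'cat', 'on', 'floor'], 'sat')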
CBOW
[Figure] CBOW network: the context words (e.g., “cat” and “on”) enter the input layer as one-hot
vectors over the vocabulary (a 1 at the word’s index, 0 elsewhere), pass through a shared hidden
layer, and the output layer predicts the target word “sat”.
Source: www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
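A minimal numpy sketch of the forward pass in the figure: each context word’s one-hot vector selects a row of the input weight matrix, the rows are summed into the hidden layer, and a softmax over the output layer scores every vocabulary word as the possible target. The sizes and random initialization are assumptions for illustration.

import numpy as np

vocab = ["the", "cat", "sat", "on", "floor"]
V, N = len(vocab), 3                    # vocabulary size and hidden (embedding) size, assumed

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))          # input-to-hidden weights (the word embeddings)
W_out = rng.normal(size=(N, V))         # hidden-to-output weights

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

context = ["cat", "on"]                                         # context words shown in the figure
hidden = np.sum([one_hot(w) @ W_in for w in context], axis=0)   # sum of context embeddings
scores = hidden @ W_out                                         # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()                   # softmax over the vocabulary
print(dict(zip(vocab, probs.round(3))))                         # probability of each word being the target ("sat")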
Skip gram
Skip-gram – an alternative to CBOW.
Start with a single word embedding and try to
predict the surrounding words.
A much less well-defined problem, but it works
better in practice (and scales better).
In this approach, each word or token is called a
“gram”. Creating a vocabulary of two-word
pairs is, in turn, called a bigram model. Again,
only the bigrams that appear in the corpus are
modeled, not all possible bigrams.
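A small Python sketch of the (center word, surrounding word) training pairs that skip-gram produces for the earlier example sentence, plus the bigrams that actually occur in it (illustrative only):

sentence = "the cat sat on floor".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))   # one training pair per surrounding word

print([p for p in pairs if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'floor')]

# Bigram vocabulary: only the two-word pairs that appear in the corpus
bigrams = list(zip(sentence, sentence[1:]))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'floor')]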
Skip gram
Map from the center word to a probability
distribution over the surrounding words, one
input/output pair at a time.
There is no activation function on the
hidden-layer neurons, but the output
neurons use softmax.
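A minimal PyTorch sketch of that architecture: a plain embedding lookup as the hidden layer (no activation) and a linear output layer whose scores are turned into probabilities by softmax during training. The sizes are assumptions.

import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 10_000, 300   # assumed vocabulary and embedding sizes

skipgram = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, EMBED_DIM),   # hidden layer: plain lookup, no activation
    nn.Linear(EMBED_DIM, VOCAB_SIZE),      # output layer: one score per vocabulary word
)
# nn.CrossEntropyLoss applies the softmax over these scores during training.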
Skip gram/CBOW intuition
Two words that appear in similar “contexts” (that is, the words likely
to appear around them) end up with similar embeddings.
One way for the network to output similar context predictions
for these two words is if the word vectors are similar. So, if two
words have similar contexts, then the network is motivated to
learn similar word vectors for these two words!
Term Frequency (TF) / Inverse Document Frequency (IDF)
TF-IDF, short for term frequency–inverse document frequency, is a numerical
statistic that is intended to reflect how important a word is to a document in
a collection or corpus.
This concept includes:
· Counts. Count the number of times each word appears in a document.
· Frequencies. Calculate the frequency of each word in a document as a
fraction of all the words in that document.
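A small Python sketch of those two steps for a single example sentence (standard library only; names are illustrative):

from collections import Counter

doc = "the car is driven on the road".split()
counts = Counter(doc)                                   # raw counts per word
freqs = {w: c / len(doc) for w, c in counts.items()}    # frequency of each word in the document

print(counts["the"])   # 2
print(freqs["the"])    # 0.2857142857142857  (2 of the 7 words are 'the')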
Term Frequency (TF)
Term frequency (TF) is used in connection with information retrieval and
shows how frequently a term (word or expression) occurs in a document.
TF can be interpreted as the probability of finding a word in a given document
(e.g., a review).
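One common way to write this (other weighting variants exist; this is the plain relative-frequency form):

\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}

where f_{t,d} is the number of times term t occurs in document d, and the denominator is the total number of terms in d.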
Inverse Document Frequency (IDF)
The inverse document frequency is a measure of how much information the
word provides, i.e., if it’s common or rare across all documents.
It is used to calculate the weight of rare words across all documents in the
corpus. The words that occur rarely in the corpus have a high IDF score.
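A standard way to write it (some variants add 1 to the denominator or to the result to avoid division by zero):

\mathrm{IDF}(t) = \log \frac{N}{\lvert \{ d : t \in d \} \rvert}

where N is the number of documents in the corpus and the denominator counts the documents that contain term t.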
Term frequency–Inverse document frequency:
TF-IDF gives larger values for less frequent words in the document corpus.
The TF-IDF value is high when both the TF and IDF values are high, i.e., the word is
rare across the corpus as a whole but frequent within a particular document.
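Putting the two pieces together (using the TF and IDF forms sketched above):

\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)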
Term frequency–Inverse document frequency:
Sentence 1: The car is driven on the road.
Sentence 2: The truck is driven on the highway.
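A quick worked computation under the definitions sketched above (natural logarithm; this is one common convention, not the only one):

import math

s1 = "the car is driven on the road".split()
s2 = "the truck is driven on the highway".split()
docs = [s1, s2]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing)

print(tf("the", s1) * idf("the"))   # 0.0   -- 'the' appears in both sentences
print(tf("car", s1) * idf("car"))   # about 0.099 -- 'car' is unique to Sentence 1

Common words such as "the" get a TF-IDF of zero because they occur in every document, while distinctive words such as "car" and "highway" receive positive weights.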