An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec
Introduction
Before we start, have a look at the below examples.
1. You open Google and search for a news article on the ongoing Champions Trophy
and get hundreds of search results in return about it.
2. Nate Silver analysed millions of tweets and correctly predicted the results of 49 out
of 50 states in the 2008 U.S. Presidential Election.
3. You type a sentence in English into Google Translate and get an equivalent Chinese
translation.
You possibly guessed it right – TEXT processing. All three scenarios deal with
humongous amounts of text to perform different tasks: clustering in the Google search
example, classification in the second, and Machine Translation in the third.
Humans can deal with text quite intuitively, but with millions of documents being
generated every single day, we cannot have humans perform the above three tasks. It is
neither scalable nor effective.
So, how do we make today's computers perform clustering, classification etc. on text
data, given that they are generally inefficient at handling and processing strings or raw
text for any fruitful output?
Sure, a computer can match two strings and tell you whether they are the same or not. But
how do we make computers tell you about football or Ronaldo when you search for Messi?
How do you make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit
that can be eaten and not a company?
The answer to the above questions lies in creating a representation for words that captures
their meanings, semantic relationships and the different types of contexts they are used
in.
And all of this is achieved using Word Embeddings, or numerical representations of text, so that computers may handle them.
Table of Contents
1. What are Word Embeddings?
2. Different types of Word Embeddings
2.1 Frequency based Embedding
2.1.1 Count Vectors
2.1.2 TF-IDF
2.1.3 Co-Occurrence Matrix
2.2 Prediction based Embedding
2.2.1 CBOW
2.2.2 Skip-Gram
3. Word Embeddings use case scenarios (what all can be done using word embeddings?
e.g. similarity, odd one out etc.)
4. Using pre-trained Word Vectors
5. Training your own Word Vectors
6. End Notes
1. What are Word Embeddings?

As it turns out, many Machine Learning algorithms and almost all Deep Learning
architectures are incapable of processing strings or plain text in their raw form. They
require numbers as inputs to perform any sort of job, be it classification, regression etc., in
broad terms. And with the huge amount of data present in text format, it is
imperative to extract knowledge out of it and build applications. Some real world
applications of text data are: sentiment analysis of reviews by Amazon etc., and
document or news classification or clustering by Google etc.
Let us now define Word Embeddings formally. A Word Embedding format generally tries
to map a word using a dictionary to a vector. Let us break this sentence down into finer
details to have a clear view.
Take a look at this example – sentence = "Word Embeddings are Word Converted into
numbers".
A dictionary may be the list of all unique words in the sentence. So, a dictionary may
look like – ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers'].
A vector representation of a word may be a one-hot encoded vector, where 1 stands for the
position where the word exists and 0 everywhere else. In this format, according to the
above dictionary, the vector representation of "numbers" is [0,0,0,0,0,1] and of
"Converted" is [0,0,0,1,0,0].
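As a quick illustration, here is a minimal Python sketch of this one-hot scheme (the dictionary and words are the ones from the example above):

# dictionary of unique words from the example sentence
dictionary = ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']

def one_hot(word):
    # 1 at the position of the word in the dictionary, 0 everywhere else
    return [1 if w == word else 0 for w in dictionary]

print(one_hot('numbers'))    # [0, 0, 0, 0, 0, 1]
print(one_hot('Converted'))  # [0, 0, 0, 1, 0, 0]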
This is just a very simple method to represent a word in the vector form. Let us look at
different types of Word Embeddings or Word Vectors and their advantages and
disadvantages over the rest.
2. Different types of Word Embeddings

Broadly, Word Embeddings can be classified into two categories: Frequency based Embedding and Prediction based Embedding. Let us try to understand each of these methods in detail.

2.1 Frequency based Embedding

There are generally three types of vectors that we come across under this category:
1. Count Vector
2. TF-IDF Vector
3. Co-Occurrence Vector

2.1.1 Count Vector

Consider a corpus of two documents:
D1: "He is a lazy boy. She is also lazy."
D2: "Neeraj is a lazy person."
A dictionary may be built from the tokens in the corpus; taking the six tokens ['He', 'She', 'lazy', 'boy', 'Neeraj', 'person'], the count matrix M of size 2 X 6 is:

       He   She   lazy   boy   Neeraj   person
D1      1     1      2     1        0        0
D2      0     0      1     0        1        1
Now, a column can also be understood as the word vector for the corresponding word in the
matrix M. For example, the word vector for 'lazy' in the above matrix is [2,1], and so
on. Here, the rows correspond to the documents in the corpus and the columns
correspond to the tokens in the dictionary. The second row in the above matrix may be
read as – D2 contains 'lazy' once, 'Neeraj' once and 'person' once.
Now there may be quite a few variations while preparing the above matrix M. The
variations will generally be in:

1. The way the dictionary is prepared.
Why? Because in real world applications we might have a corpus which contains
millions of documents, and from millions of documents we can extract hundreds of
millions of unique words. So basically, a matrix prepared like the one above
will be very sparse and inefficient for any computation. An alternative to
using every unique word as a dictionary element would be to pick, say, the top 10,000
words based on frequency and then prepare the dictionary.
2. The way the count is taken for each word.
We may either take the frequency (the number of times a word has appeared in the
document) or the presence (has the word appeared in the document?) as the entry
in the count matrix M. But generally, the frequency method is preferred. (A small code sketch of count vectorization is given below.)
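To make the count-vector idea concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the library choice is an assumption of this example, not prescribed by the article). The max_features argument plays the role of the "top 10,000 words by frequency" dictionary trick described above; note that the default tokenizer also keeps words like 'is' and 'also', so the output is slightly larger than the 6-column table shown earlier.

from sklearn.feature_extraction.text import CountVectorizer

# the two example documents from the count matrix above
corpus = [
    "He is a lazy boy. She is also lazy.",   # D1
    "Neeraj is a lazy person.",              # D2
]

# max_features limits the dictionary to the most frequent tokens;
# binary=True would switch from frequency counts to presence/absence
vectorizer = CountVectorizer(max_features=10000, lowercase=False)
M = vectorizer.fit_transform(corpus)          # sparse count matrix, shape [D x N]

print(vectorizer.get_feature_names_out())     # the dictionary (tokens)
print(M.toarray())                            # rows = documents, columns = token counts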
2.1.2 TF-IDF

This is another method based on frequency, but it differs from count vectorization in the sense that it takes into account not just the occurrence of a word in a single document but in the entire corpus. So, what is the rationale behind this? Let us try to understand.
Common words like 'is', 'the', 'a' etc. tend to appear quite frequently in comparison to the
words which are important to a document. For example, a document A on Lionel Messi is
going to contain more occurrences of the word "Messi" in comparison to other documents.
But common words like "the" are also going to be present in high frequency in
almost every document.
Ideally, what we would want is to down-weight the common words occurring in almost all
documents and give more importance to words that appear in a subset of documents.
TF-IDF works by penalising these common words by assigning them lower weights while
giving importance to words like 'Messi' in a particular document.
Consider a sample count of terms (tokens/words) in two documents: say Document1
contains 8 terms in total, among which 'Messi' appears 4 times and 'This' appears once,
while Document2 contains 5 terms in total, with 'This' appearing once.

TF = (Number of times term t appears in a document) / (Number of terms in the document)

So, TF(This, Document1) = 1/8 and TF(This, Document2) = 1/5.

TF denotes the contribution of the word to the document, i.e. words relevant to the
document should be frequent in it. e.g. a document about Messi should contain the word
'Messi' a large number of times.
IDF = log(N/n), where N is the total number of documents and n is the number of
documents a term t has appeared in.
So, how do we explain the reasoning behind IDF? Ideally, if a word has appeared in all the
documents, then probably that word is not relevant to a particular document. But if it has
appeared in a subset of documents, then probably the word is of some relevance to the
documents it is present in.
Now, let us compare the TF-IDF for a common word 'This' and the word 'Messi', which
seems to be of relevance to Document1. With 2 documents in the corpus, IDF(This) =
log(2/2) = 0 and IDF(Messi) = log(2/1) = 0.301. So:

TF-IDF(This, Document1) = (1/8)*0 = 0
TF-IDF(This, Document2) = (1/5)*0 = 0
TF-IDF(Messi, Document1) = (4/8)*0.301 = 0.15
As you can see, for Document1 the TF-IDF method heavily penalises the word 'This' but
assigns greater weight to 'Messi'. So, this may be understood as: 'Messi' is an important
word for Document1 in the context of the entire corpus.
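The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch; the term counts are assumed values consistent with the numbers used in this section, not the exact table from the original figure.

import math

# assumed term counts per document, consistent with the example above
doc1 = {"This": 1, "is": 1, "about": 2, "Messi": 4}      # 8 terms in total
doc2 = {"This": 1, "is": 2, "about": 1, "football": 1}   # 5 terms in total
docs = [doc1, doc2]

def tf(term, doc):
    return doc.get(term, 0) / sum(doc.values())

def idf(term, docs):
    n = sum(1 for d in docs if term in d)   # documents containing the term
    return math.log10(len(docs) / n)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("This", doc1, docs))    # (1/8) * log10(2/2) = 0.0
print(tfidf("Messi", doc1, docs))   # (4/8) * log10(2/1) ≈ 0.15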
2.1.3 Co-Occurrence Matrix

The big idea – similar words tend to occur together and will have a similar context. For
example: Apple is a fruit. Mango is a fruit.
Apple and mango tend to have a similar context, i.e. fruit.
Before I dive into the details of how a co-occurrence matrix is constructed, there are two
concepts that need to be clarified – Co-Occurrence and Context Window.
Co-occurrence – For a given corpus, the co-occurrence of a pair of words say w1 and w2 is
the number of times they have appeared together in a Context Window.
Context Window – a context window is specified by a number and a direction. So what
does a context window of 2 (around) mean? It means the 2 words to the left and the 2
words to the right of the word in focus. For example, for the word 'Fox' in the sentence
"Quick Brown Fox Jump Over The Lazy Dog", the 2 (around) context window consists of
'Quick', 'Brown', 'Jump' and 'Over', and only these words are counted towards the
co-occurrence of 'Fox'. The context window for the word 'Over' is obtained in the same way.
Consider, for example, the corpus "He is not lazy. He is intelligent. He is smart."
(treated as one continuous stream of tokens). Its co-occurrence matrix with a context
window of 2 is:

              He   is   not   lazy   intelligent   smart
He             0    4     2      1             2       1
is             4    0     1      2             2       1
not            2    1     0      1             0       0
lazy           1    2     1      0             0       0
intelligent    2    2     0      0             0       0
smart          1    1     0      0             0       0
Let us understand this co-occurrence matrix by looking at two entries in the table above:
the ('He', 'is') entry and the ('lazy', 'intelligent') entry.

('He', 'is') – this is the number of times 'He' and 'is' have appeared together in a context
window of 2, and the count turns out to be 4. Meanwhile, the word 'lazy' has never
appeared with 'intelligent' in a context window and has therefore been assigned 0.
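Here is a minimal Python sketch of how such a co-occurrence matrix can be counted. The corpus below is the assumed example from above, treated as a single token stream, with a window of 2 words on either side.

from collections import defaultdict

# assumed corpus, treated as one token stream; window = 2 words on either side
tokens = "He is not lazy He is intelligent He is smart".split()
window = 2

cooc = defaultdict(int)
for i, word in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[(word, tokens[j])] += 1     # symmetric pair counts

print(cooc[("He", "is")])                    # -> 4, the ('He', 'is') entry above
print(cooc[("lazy", "intelligent")])         # -> 0, the ('lazy', 'intelligent') entry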
Let’s say there are V unique words in the corpus. So Vocabulary size = V. The columns of
the Co-occurrence matrix form the context words. The different variations of Co-
Occurrence Matrix are-
1. A co-occurrence matrix of size V X V. Now, for even a decent corpus V gets very
large and difficult to handle. So generally, this architecture is never preferred in
practice.
2. A co-occurrence matrix of size V X N, where the N columns correspond to a subset of
the V words, obtained for example by removing irrelevant words like stopwords. This is
still very large and presents computational difficulties.
But, remember, this co-occurrence matrix is not the word vector representation that is
generally used. Instead, the co-occurrence matrix is decomposed using techniques like
PCA, SVD etc. into factors, and a combination of these factors forms the word vector
representation.
Let me illustrate this more clearly. For example, you perform PCA on the above matrix of
size V X V. You will obtain V principal components. You can choose k components out of
these V components. So, the new matrix will be of the form V X k.
So, what happens at the back (via SVD) is that the co-occurrence matrix is decomposed
into three matrices, U, S and V, where U and V are both orthogonal matrices. What is of
importance is that the product of U and S gives the word vector representations, while V
gives the word context representation.
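As a sketch of this idea (using plain SVD from numpy rather than any particular word-embedding library), the word vectors can be taken as the rows of U scaled by the singular values S:

import numpy as np

# the co-occurrence matrix from the table above
# (rows/columns: He, is, not, lazy, intelligent, smart)
X = np.array([
    [0, 4, 2, 1, 2, 1],
    [4, 0, 1, 2, 2, 1],
    [2, 1, 0, 1, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [2, 2, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(X)         # X = U * diag(S) * Vt
k = 2                               # keep only the top k components
word_vectors = U[:, :k] * S[:k]     # shape [V x k]: one k-dimensional vector per word

print(word_vectors)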
Advantages of Co-occurrence Matrix
1. It preserves the semantic relationship between words, i.e. 'man' and 'woman' tend to
be closer than 'man' and 'apple'.
2. It uses SVD at its core, which produces more accurate word vector representations
than the simpler count-based methods.
3. It uses factorization which is a well-defined problem and can be efficiently solved.
4. It has to be computed once and can be used anytime once computed. In this sense, it
is faster in comparison to others.
2.2 Prediction based Embedding

So far, we have seen deterministic methods to determine word vectors. But these methods
proved to be limited in their word representations until Mikolov et al. introduced
word2vec to the NLP community. These methods are prediction based in the sense that
they provide probabilities for words, and they proved to be state of the art for tasks like
word analogies and word similarities. They were also able to achieve tasks like King - Man
+ Woman = Queen, which was considered an almost magical result. So let us look at the
word2vec model used today to generate word vectors.
2.2.1 CBOW (Continuous Bag of Words)

The CBOW model tends to predict the probability of a word given a context. A
context may be a single word or a group of words. But for simplicity, I will take a single
context word and try to predict a single target word.
Suppose we have a corpus C = "Hey, this is sample corpus using only one context word."
and we have defined a context window of 1. This corpus may be converted into a training
set for a CBOW model as follows: each target word is paired with its context word, and
both are represented as one-hot encoded vectors over the 10-word dictionary. For example,
a one-hot row such as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] has a 1 only at the position of the
corresponding dictionary word and 0 everywhere else.
These one-hot vectors are fed into a shallow neural network with three
layers: an input layer, a hidden layer and an output layer. The output layer is a softmax
layer, which ensures that the output probabilities sum to 1. Now let us
see how the forward propagation works to calculate the hidden layer activation.
The flow for a single data point is as follows:
1. The input layer and the target are both one-hot encoded, of size [1 X V]. Here V=10
in the above example.
2. There are two sets of weights: one between the input and the hidden layer, and one
between the hidden and the output layer.
The input-hidden weight matrix has size [V X N] and the hidden-output weight matrix
has size [N X V], where N is the number of dimensions we choose to represent our word
in. It is arbitrary and a hyper-parameter of the neural network; N is also the number of
neurons in the hidden layer. Here, N=4.
3. There is no activation function between the layers (more specifically, the activation
is linear).
4. The input is multiplied by the input-hidden weights; the result is called the hidden
activation. It is simply the corresponding row of the input-hidden matrix copied.
5. The hidden activation is multiplied by the hidden-output weights and the output is
calculated.
6. The error between the output and the target is calculated and propagated back to
re-adjust the weights.
7. The weights between the hidden layer and the output layer are taken as the word
vector representation of the word. (A minimal numpy sketch of this flow is given after
this list.)
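Here is a minimal numpy sketch of the above flow for a single (context, target) pair; the weight values and word indices are arbitrary placeholders, not values from the article.

import numpy as np

np.random.seed(0)
V, N = 10, 4                          # vocabulary size and hidden (embedding) size

W_in = np.random.rand(V, N)           # input-hidden weights  [V x N]
W_out = np.random.rand(N, V)          # hidden-output weights [N x V]

x = np.zeros((1, V)); x[0, 3] = 1     # one-hot context word  [1 x V]

h = x @ W_in                          # hidden activation = row 3 of W_in copied
u = h @ W_out                         # scores for every word [1 x V]
y = np.exp(u) / np.exp(u).sum()       # softmax: probabilities summing to 1

target = np.zeros((1, V)); target[0, 5] = 1
error = y - target                    # propagated back to re-adjust both weight matrices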
We saw the above steps for a single context word. Now, what if we have multiple
context words? The image below describes the architecture for multiple context words.
The image above takes 3 context words and predicts the probability of a target word. The
input can be assumed as taking three one-hot encoded vectors in the input layer as shown
above in red, blue and green.
So, the input layer will have 3 [1 X V] vectors in the input as shown above, and 1 [1 X V]
vector in the output layer. The rest of the architecture is the same as for a 1-context CBOW.
The steps remain the same, only the calculation of hidden activation changes. Instead of
just copying the corresponding rows of the input-hidden weight matrix to the hidden
layer, an average is taken over all the corresponding rows of the matrix. We can
understand this with the above figure. The average vector calculated becomes the hidden
activation. So, if we have three context words for a single target word, we will have three
initial hidden activations which are then averaged element-wise to obtain the final
activation.
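Continuing the numpy sketch above, with multiple context words the only change is that the hidden activation becomes an element-wise average of the corresponding rows of the input-hidden matrix (the context word indices here are arbitrary placeholders):

context_ids = [1, 4, 7]                              # indices of the three context words
h = np.mean([W_in[i] for i in context_ids], axis=0)  # averaged hidden activation of size N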
For both the single context word and multiple context word cases, I have shown the images
up to the calculation of the hidden activations, since this is the part where CBOW differs
from a simple MLP network. The steps after the calculation of the hidden layer are the
same as those of the MLP, as mentioned in this article – Understanding and Coding Neural
Networks from scratch.
The differences between MLP and CBOW are mentioned below for clarification: the
objective function in CBOW is the negative log likelihood of the target word given its
context, i.e. -log(p(wo | wi)), where

wo: output (target) word
wi: context words
Advantages of CBOW:

1. Being probabilistic in nature, it is supposed to perform better than the deterministic
methods (generally).
2. It is low on memory: it does not need the huge RAM requirements of a co-occurrence
matrix.
Disadvantages of CBOW:
1. CBOW takes the average of the contexts of a word (as seen above in the calculation of
the hidden activation). For example, 'Apple' can be both a fruit and a company, but
CBOW takes an average of both contexts and places it in between the clusters for
fruits and for companies.
2. Training a CBOW from scratch can take forever if not properly optimized.
2.2.2 Skip-Gram

Skip-gram follows the same topology as CBOW but flips the architecture on its head: it
tries to predict the context words given a target word. The weights between the input and
the hidden layer are taken as the word vector representation after training. The loss
function or the objective is of the same type as in the CBOW model.
For a better understanding, the matrix-style structure with the calculation is shown below.
Let us break down the image.
Input layer size: [1 X V]
Input-hidden weight matrix size: [V X N]
Number of neurons in hidden layer: N
Hidden-output weight matrix size: [N X V]
Output layer size: C [1 X V]
1. The row in red is the hidden activation corresponding to the input one-hot encoded
vector. It is basically the corresponding row of the input-hidden matrix copied.
2. The yellow matrix is the weight between the hidden layer and the output layer.
3. The blue matrix is obtained by the matrix multiplication of the hidden activation and
the hidden-output weights. There will be two rows calculated, one for each of the two
target (context) words.
4. Each row of the blue matrix is converted into its softmax probabilities individually
as shown in the green box.
5. The grey matrix contains the one-hot encoded vectors of the two context (target)
words.
6. The error is calculated by subtracting the first row of the grey matrix (target) from the
first row of the green matrix (output), element-wise. This is repeated for the next row, so
for n target context words we will have n error vectors.
7. An element-wise sum is taken over all the error vectors to obtain a final error vector.
8. This error vector is propagated back to update the weights. (A minimal numpy sketch
of these steps is given below.)
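A minimal numpy sketch of these skip-gram steps, with arbitrary placeholder indices and two context words as in the description above, could look like this:

import numpy as np

np.random.seed(0)
V, N = 10, 4                           # vocabulary size and embedding size

W_in = np.random.rand(V, N)            # input-hidden weights  [V x N]
W_out = np.random.rand(N, V)           # hidden-output weights [N x V]

x = np.zeros(V); x[3] = 1              # one-hot input (target) word
h = x @ W_in                           # hidden activation: row 3 of W_in copied

u = h @ W_out                          # the same score row is used for each context position
y = np.exp(u) / np.exp(u).sum()        # softmax probabilities

contexts = [np.eye(V)[1], np.eye(V)[5]]        # one-hot vectors of the two context words
errors = [y - c for c in contexts]             # one error vector per context word
total_error = np.sum(errors, axis=0)           # element-wise sum -> final error vector

# total_error is propagated back to update W_in and W_out;
# after training, the rows of W_in are taken as the word vectors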
Advantages of Skip-Gram Model
1. The skip-gram model can capture two semantics for a single word, i.e. it will have two
vector representations of 'Apple': one for the company and one for the fruit.
2. Skip-gram with negative sampling generally outperforms every other method.
This is an excellent interactive tool to visualise CBOW and skip-gram in action. I would
suggest you go through this link for a better understanding.
4. Using pre-trained word vectors

We are going to use Google's pre-trained model. It contains word vectors for a vocabulary
of 3 million words trained on around 100 billion words from the Google News dataset. The
download link for the model is this. Beware, it is a 1.5 GB download.
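Once the file is downloaded, it can be loaded with gensim. This is a minimal sketch; the file path below assumes the standard GoogleNews-vectors-negative300.bin file sits in the working directory.

from gensim.models import KeyedVectors

# load the pre-trained Google News vectors (binary word2vec format)
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(model.similarity('woman', 'man'))                                           # cosine similarity
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))   # analogy query
print(model.doesnt_match(['breakfast', 'cereal', 'dinner', 'lunch']))              # odd one out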
5. Training your own Word Vectors

Word2Vec requires the training data in a list-of-lists format, where every document is
contained in a list and each such list contains the tokens of that document. I won't be
covering the preprocessing part here, so let's take an example list of lists to train our
word2vec model:

sentence = [['Neeraj', 'Boy'], ['Sarwan', 'is'], ['good', 'boy']]
# training word2vec on 3 sentences
import gensim
model = gensim.models.Word2Vec(sentence, min_count=1, size=300, workers=4)
# note: in gensim 4.x the 'size' parameter is called 'vector_size'
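After training, the learned vectors can be accessed through the model's wv attribute, for example:

print(model.wv['good'])                    # the 300-dimensional vector for 'good'
print(model.wv.similarity('good', 'boy'))  # similarity between two words in the tiny corpus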
Projects

Now, it's time to take the plunge and actually play with some real datasets. So are you
ready to take on the challenge? Accelerate your NLP journey with the following practice
problem:
Practice Problem: Twitter Sentiment Analysis – to detect hate speech in tweets
6. End Notes

Word Embeddings is an active research area, with people trying to figure out better word
representations than the existing ones. But with time, they have grown large in number
and more complex. This article was aimed at simplifying some of the workings of these
embedding models without the mathematical overhead. If you think that I was able to
clear some of your confusion, comment below. Any changes or suggestions are welcome.
Note: We also have a video course on Natural Language Processing covering many NLP
topics including bag of words, TF-IDF, and word embeddings. Do check it out!