
Lecture 2

Word Embedding
What are Word Embeddings?
• Word embeddings map words to points in a vector space such that nearby words (points) have similar meanings.
• Many applications:
 Measure similarities between words using distance measures, e.g. cosine or Euclidean distance
 Use as features in document classification, document clustering, and sentiment analysis
• Simplest approach: the one-hot vector
 Represent every word as a |𝐶| × 1 vector of all 0s with a single 1 at the index of that word in the sorted vocabulary (vocabulary size = |𝐶|). A short sketch follows below.
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-
5dac99d5d795
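As an illustration, here is a minimal Python sketch of one-hot vectors over a toy vocabulary (the vocabulary and word choices are made up for this example):

```python
import numpy as np

# A minimal sketch of one-hot encoding over a toy vocabulary (illustrative only).
vocabulary = sorted(["he", "she", "lazy", "boy", "neeraj", "person"])
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a |C| x 1 vector with a single 1 at the word's index."""
    vector = np.zeros((len(vocabulary), 1))
    vector[word_to_index[word], 0] = 1.0
    return vector

print(vocabulary)          # ['boy', 'he', 'lazy', 'neeraj', 'person', 'she']
print(one_hot("lazy").T)   # [[0. 0. 1. 0. 0. 0.]]
```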
Types of word embeddings
• The different types of word embeddings can be broadly classified into two categories:
 Frequency based Embedding
• Count Vector
• TF-IDF Vector
• Co-Occurrence Vector
• GloVe (Global Vectors for Word Representation)
 Prediction based Embedding
• Learn embeddings as part of the process of word prediction
• Word2vec: train a neural network model to predict neighboring words (developed by a team of researchers led by Tomas Mikolov at Google in 2013)
 CBOW (Continuous Bag of Words)
 Skip-Gram model
• Many more word embeddings:
 FastText, developed by Facebook AI Research (FAIR) in 2016
 1D CNN (Convolutional Neural Network), Jacovi et al. (2018)
 ELMo (Embeddings from Language Models), Peters et al. (2018)
 BERT (Bidirectional Encoder Representations from Transformers), Devlin et al. (2019), Google AI Language
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
Count Vector
• Consider a corpus C of D documents {d1, d2, ..., dD} and N unique tokens extracted from the corpus. The N tokens form our dictionary, and the size of the count-vector matrix M is D × N. Each row of M contains the frequencies of the tokens in document D(i).
• Let us understand this using a simple example.
• D1: He is a lazy boy. She is also lazy.
• D2: Neeraj is a lazy person.
• The dictionary created may be a list of unique tokens (words) in the corpus = [‘He’, ’She’, ’lazy’, ’boy’, ’Neeraj’, ’person’]
• Here D = 2, N = 6. The count matrix M of size 2 × 6 is:

        He   She   lazy   boy   Neeraj   person
  D1     1     1      2     1        0        0
  D2     0     0      1     0        1        1
Count Vector
• A matrix prepared like the one above is very sparse and inefficient for computation. An alternative to using every unique word as a dictionary element is to pick, say, the top 10,000 words by frequency and build the dictionary from those (a scikit-learn sketch follows below).
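The same count-vector matrix can be built with standard tooling. A minimal sketch using scikit-learn's CountVectorizer (one possible library choice, not the only one; note that its default tokenizer lowercases and drops single-character tokens such as "a"):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A minimal sketch of a count-vector matrix with scikit-learn (illustrative only).
docs = ["He is a lazy boy. She is also lazy.",
        "Neeraj is a lazy person."]

# max_features keeps only the most frequent terms, mirroring the "top 10,000 words" idea.
vectorizer = CountVectorizer(max_features=10000)
M = vectorizer.fit_transform(docs)          # sparse D x N matrix

print(vectorizer.get_feature_names_out())   # default tokenizer lowercases and drops 1-char tokens like "a"
print(M.toarray())                          # row i holds token frequencies for document i
```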
TF-IDF Vector
• This is another frequency-based method, but it differs from count vectorization in that it takes into account not just the occurrence of a word in a single document but in the entire corpus.
• Common words like ‘is’, ‘the’, ‘a’, etc. tend to appear quite frequently in comparison to the words that are important to a document. Ideally, we want to down-weight the common words occurring in almost all documents and give more importance to words that appear in only a subset of documents.
• TF-IDF does this by penalising common words with lower weights while giving higher weight to words like ‘Messi’ that are specific to a particular document.
 TF-IDF Vector
• TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
• IDF(t) = log(N / n), where N is the number of documents and n is the number of documents containing term t
• TF-IDF(t, d) = TF(t, d) × IDF(t) (a scikit-learn sketch follows below)
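A minimal sketch of TF-IDF weighting using scikit-learn's TfidfVectorizer (scikit-learn applies a smoothed variant of the IDF formula above, so the exact numbers differ slightly from a hand calculation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A minimal TF-IDF sketch; the toy documents are made up for illustration.
docs = ["He is a lazy boy. She is also lazy.",
        "Neeraj is a lazy person.",
        "Messi scored a goal."]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Corpus-wide words such as "is"/"lazy" get a lower idf than rarer,
# document-specific words like "messi".
for word, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{word:10s} idf={idf:.3f}")
```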
Co-Occurrence Vector
• The big idea: similar words tend to occur together and will have similar contexts. For example: Apple is a fruit. Mango is a fruit. Apple and mango tend to have a similar context, i.e. fruit.
• Co-occurrence: for a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together within a context window.
• Context window: the context window is specified by a number and a direction. So what does a context window of 2 (around) mean? Let us see an example below (a counting sketch also follows).

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
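A minimal sketch of counting co-occurrences within a symmetric context window of 2, using the "Apple is a fruit. Mango is a fruit." example (tokenization is simplified to whitespace splitting):

```python
from collections import defaultdict

# A minimal sketch of symmetric co-occurrence counting with a context window of 2.
corpus = ["apple is a fruit", "mango is a fruit"]
window = 2

cooccurrence = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        # look at up to `window` words on each side of the current word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooccurrence[(word, tokens[j])] += 1

print(cooccurrence[("apple", "fruit")])   # 0: "fruit" is 3 positions away from "apple"
print(cooccurrence[("is", "fruit")])      # 2: within the window in both sentences
```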
Co-Occurrence Vector
(Worked examples of the context window and of the resulting co-occurrence matrix were shown as figures on these slides.)
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-to-word2vec-8231e18dbe92
Variations of Co-occurrence Matrix
• Let’s say there are V unique words in the corpus. So Vocabulary size = V.
The columns of the Co-occurrence matrix form the context words. The
different variations of Co-Occurrence Matrix are-
• A co-occurrence matrix of size V × V. Even for a modest corpus, V becomes very large and difficult to handle, so this architecture is generally not used in practice.
• A co-occurrence matrix of size V × N, where N is a subset of V obtained by removing irrelevant words such as stopwords. This is still very large and presents computational difficulties.
• So the co-occurrence matrix is decomposed using techniques like PCA or SVD into factors, and a combination of these factors forms the word vector representation.
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
N-grams
Words Co-occurrence Statistics
• Consider a corpus consisting of the following documents:
 penny wise and pound foolish
 a penny saved is a penny earned
• Letting count(wnext | wcurrent) represent how many times word wnext follows word wcurrent, we can summarize the co-occurrence statistics for the words “a” and “penny” as (a counting sketch follows below):

              earned   penny   saved   wise
  a                0       2       0      0
  penny            1       0       1      1
https://iksinc.online/2015/06/23/how-to-use-words-co-occurrence-statistics-to-map-words-to-vectors/
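A minimal sketch that reproduces these bigram counts from the two toy documents (whitespace tokenization, counting within each document only):

```python
from collections import Counter

# A minimal sketch of bigram (next-word) counts for the toy corpus above.
documents = ["penny wise and pound foolish",
             "a penny saved is a penny earned"]

bigram_counts = Counter()
for doc in documents:
    tokens = doc.split()
    for current_word, next_word in zip(tokens, tokens[1:]):
        bigram_counts[(current_word, next_word)] += 1

print(bigram_counts[("a", "penny")])       # 2: "a" is followed by "penny" twice
print(bigram_counts[("penny", "earned")])  # 1
print(bigram_counts[("penny", "saved")])   # 1
print(bigram_counts[("penny", "wise")])    # 1
```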
N-gram
• The above table shows that “a” is followed twice by “penny”, while the words “earned”, “saved”, and “wise” each follow “penny” once in our corpus. Thus, “earned” appears after “penny” one time out of three. The counts shown above are called bigram frequencies; they look only at the next word after the current word.
• Given a corpus of N words, we need a table of size N × N to represent the bigram frequencies of all possible word pairs. Such a table is highly sparse, as most frequencies are equal to zero. In practice, the co-occurrence counts are converted to probabilities, so that the entries in each row of the co-occurrence matrix add up to one.
• We may count how many times a sequence of three words occurs together to generate trigram frequencies.
• We may even count how many times a pair of words occurs together in sentences irrespective of their positions in the sentences. Such occurrences are called skip-bigram frequencies.
• Because of such variations in how co-occurrences are specified, these methods are known in general as n-gram methods.
N-grams
• The term context window is often used to specify the co-occurrence relationship. For bigrams, the context window is asymmetric: one word long, to the right of the current word.
• For trigrams, it is asymmetric and two words long.
• In the co-occurrence approach to converting words to vectors, it turns out that a symmetric context window, looking at one preceding word and one following word when computing bigram frequencies, gives better word vectors.
Hellinger Distance
• Hellinger distance is a measure of similarity between two probability distributions. Given two discrete probability distributions P = (p1, . . . , pk) and Q = (q1, . . . , qk), the Hellinger distance H(P, Q) between them is defined as:

  H(P, Q) = (1/√2) · √( Σᵢ (√pᵢ − √qᵢ)² )

• The 1/√2 factor simply scales the result so that the Hellinger distance always lies between 0 and 1 (a code sketch follows below).

https://iksinc.online/2015/06/23/how-to-use-words-co-occurrence-statistics-to-map-words-to-vectors/
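A minimal sketch of the Hellinger distance as defined above (the example distributions are made up for illustration):

```python
import numpy as np

# A minimal sketch of the Hellinger distance between two discrete distributions.
def hellinger(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Rows of a co-occurrence matrix, normalized to probabilities, can be compared this way.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(hellinger(p, q))   # a value between 0 and 1
print(hellinger(p, p))   # 0.0 for identical distributions
```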
Variations of Co-occurrence Matrix
• For example, you perform PCA on the above matrix of size V × V. You obtain V principal components, of which you can choose k, so the new matrix has the form V × k.
• A single word, instead of being represented in V dimensions, is then represented in k dimensions while still capturing almost the same semantic meaning. k is generally of the order of hundreds.
• Under the hood, this amounts to decomposing the co-occurrence matrix (e.g. via SVD) into three matrices U, S and V, where U and V are orthogonal. What matters is that the product of U and S gives the word vector representation, and V gives the word context representation (a sketch follows below).

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
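A minimal sketch of the decomposition described above, using numpy's SVD on a random stand-in for a co-occurrence matrix (V, k and the matrix values are placeholders):

```python
import numpy as np

# A minimal sketch of reducing a V x V co-occurrence matrix to k dimensions with SVD.
rng = np.random.default_rng(0)
V, k = 6, 2
cooccurrence = rng.integers(0, 5, size=(V, V)).astype(float)   # stand-in for real counts

U, S, Vt = np.linalg.svd(cooccurrence)
word_vectors = U[:, :k] * S[:k]    # product of U and S gives the k-dimensional word vectors
context_vectors = Vt[:k, :].T      # V gives the context representations

print(word_vectors.shape)          # (6, 2): each word now lives in k dimensions
```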
Latent Semantic Analysis
(Diagram: the N-gram × corpus matrix is factorized by SVD into U, D(n_factor) and VT; the product of U and D(n_factor) gives the N-gram embedding matrix.)
Co-Occurrence Matrix with a fixed context
window
• Advantages of Co-occurrence Matrix
 It preserves the semantic relationship between words, i.e. ‘man’ and ‘woman’ tend to be closer than ‘man’ and ‘apple’.
 It uses SVD at its core, which produces more accurate word vector representations than the earlier methods.
 It uses factorization, which is a well-defined problem and can be solved efficiently.
 It only has to be computed once and can then be reused; in this sense it is faster than the alternatives.
• Disadvantages of Co-Occurrence Matrix
 It requires huge memory to store the co-occurrence matrix.
 This problem can be circumvented by factorizing the matrix outside the main system (for example on Hadoop clusters) and saving the resulting factors.
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
Document Classification with SVD
(Worked example slides: the term loading matrix, the document loading matrix, and the resulting document vectors were shown as figures; a sketch of the same pipeline follows below.)
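A hedged sketch of the document-classification idea from these slides: a TF-IDF term-document matrix is reduced with a truncated SVD (the LSA factors) and the resulting document vectors are fed to a classifier. The documents, labels and parameter values below are invented for illustration and are not from the original example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy documents and labels, made up for this sketch.
docs = ["the striker scored a goal", "the keeper saved the penalty",
        "the court ruled on the appeal", "the judge delivered the verdict"]
labels = ["sport", "sport", "law", "law"]

model = make_pipeline(TfidfVectorizer(),
                      TruncatedSVD(n_components=2),   # low-rank "document loading" factors
                      LogisticRegression())
model.fit(docs, labels)

print(model.predict(["the referee awarded a penalty"]))   # most likely "sport"
```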
GloVe
• The basic idea behind the GloVe word embedding is to derive the relationships between words from corpus statistics. Unlike a simple occurrence (count) matrix, the co-occurrence matrix tells you how often a particular word pair occurs together: each value in the co-occurrence matrix represents a pair of words occurring together in some context.
• (An example co-occurrence table over a small corpus was shown here as a figure.)
https://analyticsindiamag.com/hands-on-guide-to-word-embeddings-using-glove/
https://nlp.stanford.edu/projects/glove/
GloVe (Global Vectors for Word Representation)
• The Global Vectors for Word Representation, or GloVe, algorithm is an
extension to the word2vec method for efficiently learning word vectors,
developed by Pennington, et al. at Stanford.
• GloVe is an unsupervised learning algorithm for obtaining vector
representations for words.
• Training is performed on aggregated global word-word co-occurrence statistics
from a corpus, and the resulting representations showcase interesting linear
substructures of the word vector space.
• GloVe is an approach to marry both the global statistics of matrix factorization
techniques like LSA (Latent Semantic Analysis) with the local context-based
learning in word2vec.
• Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus.
https://edumunozsala.github.io/BlogEms/jupyter/nlp/classification/embeddings/python/2020/08/15/Intro_NLP_WordEmbeddings_Classification.html
Example 1: Word Embeddings using GloVe
• See example codes

https://www.kaggle.com/code/floser/examples-of-similar-word-embeddings-in-glove/notebook
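One way to reproduce such an example is to load pretrained GloVe vectors through gensim's downloader API; a minimal sketch ("glove-wiki-gigaword-50" is one of gensim's packaged GloVe models, and the first call downloads it):

```python
import gensim.downloader as api

# A minimal sketch of exploring pretrained GloVe vectors with gensim's downloader.
glove = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors (downloads on first use)

print(glove.most_similar("king", topn=3))
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))  # classic analogy
print(glove.similarity("apple", "mango"))
```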
Word2vec
• Word2vec is not a single algorithm but a combination of two techniques: CBOW (continuous bag of words) and the skip-gram model.
 CBOW predicts the probability of a word given a context. A context may be a single word or a group of words; for simplicity we take a single context word and try to predict a single target word.
 The aim of skip-gram is to predict the context given a word.
• Both are shallow neural networks which map word(s) to a target variable which is also a word (or words).
• Both techniques learn weights which act as the word vector representations.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
CBOW (Continuous Bag of words)
• Suppose, we have a corpus C = “Hey, this is sample corpus using only one context
word.” and we have defined a context window of 1.
• This corpus may be converted into a training set for a CBOW model as follows. The input is shown below; the matrix on the right of the image contains the one-hot encoded form of the input on the left.

• The target for a single data point, say data point 4, is shown below.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
CBOW (Continuous Bag of words)
• The model below is a shallow neural network with three layers: an input layer, a hidden layer and an output layer. The output layer is a softmax layer, which normalizes the output scores so that they sum to 1. Now let us see how forward propagation works to calculate the hidden layer activation.
• The matrix representation of the above image for a single data point is below.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
CBOW (Continuous Bag of words)
• The flow is as follows:
 The input and the target are both one-hot encoded, of size [1 × V]. Here V = 10 in the above example.
 There are two sets of weights: one between the input and the hidden layer, and a second between the hidden and output layers.
 Input-hidden weight matrix size = [V × N]; hidden-output weight matrix size = [N × V], where N is the number of dimensions we choose to represent our word in. N is arbitrary and a hyper-parameter of the network; it is also the number of neurons in the hidden layer. Here N = 4.
 There is no non-linear activation function between the layers (more precisely, the activation is linear).
 The input is multiplied by the input-hidden weights; the result is called the hidden activation. It is simply the corresponding row of the input-hidden matrix, copied.
 The hidden activation is multiplied by the hidden-output weights and the output is calculated.
 The error between output and target is calculated and propagated back to re-adjust the weights.
 The weights between the hidden layer and the output layer are taken as the word vector representation of the word. (A forward-pass sketch follows below.)
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
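A minimal numpy sketch of the 1-context CBOW forward pass described above, with V = 10 and N = 4; the weights here are random stand-ins rather than trained values:

```python
import numpy as np

# A minimal sketch of the 1-context CBOW forward pass (random, untrained weights).
V, N = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # input-to-hidden weights   [V x N]
W_out = rng.normal(size=(N, V))   # hidden-to-output weights  [N x V]

x = np.zeros(V)
x[3] = 1.0                        # one-hot context word (index 3)

hidden = x @ W_in                 # with linear activation this is just row 3 of W_in
scores = hidden @ W_out
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

target = np.zeros(V); target[5] = 1.0           # one-hot target word (index 5)
error = probs - target                          # gradient of -log p(target) w.r.t. the scores
print(probs.round(3), error.round(3))
```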
CBOW (Continuous Bag of words)
• Now, what if we have multiple context words?
• The image above takes 3 context words and predicts the probability of a target word. The input can be thought of as three one-hot encoded vectors in the input layer, shown above in red, blue and green.
• So the input layer has 3 [1 × V] vectors in the input and 1 [1 × V] vector in the output layer. The rest of the architecture is the same as for a 1-context CBOW.
• The steps remain the same; only the calculation of the hidden activation changes: it becomes the average of the corresponding rows rather than a single copied row.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
CBOW (Continuous Bag of words)
• The differences between an MLP and CBOW are noted below for clarification:
• The objective function in an MLP is MSE (mean squared error), whereas in CBOW it is the negative log-likelihood of a word given its context, i.e. −log p(wo | wi), where p(wo | wi) is given by a softmax over the vocabulary (see the formula below).
• The gradients of the error with respect to the hidden-output weights and the input-hidden weights are different, since an MLP (generally) has sigmoid activations whereas CBOW has linear activations. The method for calculating the gradients, however, is the same as for an MLP.
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
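The expression referred to above is the standard word2vec softmax; assuming the usual notation (v for input/context vectors, u for output vectors), it reads:

```latex
p(w_o \mid w_i) \;=\; \frac{\exp\!\left(u_{w_o}^{\top} v_{w_i}\right)}{\sum_{w=1}^{V} \exp\!\left(u_{w}^{\top} v_{w_i}\right)}
```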
CBOW (Continuous Bag of words)
• Advantages of CBOW:
 Being probabilistic in nature, it generally performs better than deterministic methods.
 It is low on memory: it does not have the huge RAM requirements of the co-occurrence approach, which needs to store three huge matrices.
• Disadvantages of CBOW:
 CBOW takes the average of the contexts of a word (as seen above in the calculation of the hidden activation). For example, ‘Apple’ can be both a fruit and a company, but CBOW averages both contexts and places the word somewhere between the fruit cluster and the company cluster.
 Training a CBOW model from scratch can take a very long time if not properly optimized.
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
Skip – Gram model
• Skip-gram follows the same topology as CBOW. The aim of skip-gram is to predict the context given a word.
• Let us take the same corpus that we built our CBOW model on, C = ”Hey, this is sample corpus using only one context word.”, and construct the training data.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
Skip – Gram model
• The input vector for skip-gram is going to be similar to a 1-context CBOW model.
Also, the calculations up to hidden layer activations are going to be the same.
The difference will be in the target variable. Since we have defined a context
window of 1 on both the sides, there will be “two” one hot encoded target
variables and “two” corresponding outputs as can be seen by the blue section
in the image.
• Two separate errors are calculated with respect to the two target variables and
the two error vectors obtained are added element-wise to obtain a final error
vector which is propagated back to update the weights.
• The weights between the input and the hidden layer are taken as the word
vector representation after training. The loss function or the objective is of the
same type as of the CBOW model.
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
Skip – Gram model
• The skip-gram architecture and the matrix-style calculation are shown here.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
• Input layer  size – [1 X V], Input hidden weight matrix size – [V X N],
Number of neurons in hidden layer – N, Hidden-Output weight matrix
size – [N X V], Output layer size – C [1 X V]
• For example, C is the number of context words=2, V= 10, N=4
• The row in red is the hidden activation corresponding to the input one-
hot encoded vector. It is basically the corresponding row of input-hidden
matrix copied.
• The yellow matrix is the weight between the hidden layer and the output
layer.
• The blue matrix is obtained by the matrix multiplication of hidden
activation and the hidden output weights. There will be two rows
calculated for two target(context) words.
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
• Each row of the blue matrix is converted into its softmax probabilities
individually as shown in the green box.
• The grey matrix contains the one hot encoded vectors of the two
context words(target).
• The error is calculated by subtracting the first row of the grey matrix (target) from the first row of the green matrix (output), element-wise. This is repeated for the next row. Therefore, for n target context words we will have n error vectors.
• An element-wise sum is taken over all the error vectors to obtain a final error vector.
• This error vector is propagated back to update the weights. (A numpy sketch of these steps follows below.)
https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
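A minimal numpy sketch of the error computation described above, for C = 2 context words, V = 10 and N = 4 (random weights as stand-ins):

```python
import numpy as np

# A minimal sketch of the skip-gram error computation (random, untrained weights).
V, N, C = 10, 4, 2
rng = np.random.default_rng(1)
W_in, W_out = rng.normal(size=(V, N)), rng.normal(size=(N, V))

center = np.zeros(V); center[2] = 1.0   # one-hot input (center) word
hidden = center @ W_in                  # "red row": copied row of the input-hidden matrix
scores = hidden @ W_out                 # "blue row", the same scores for every context position

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(scores)                 # "green row": softmax over the vocabulary

targets = np.zeros((C, V))
targets[0, 1] = 1.0; targets[1, 3] = 1.0   # "grey matrix": one-hot context (target) words

errors = probs - targets                # one error vector per target word
final_error = errors.sum(axis=0)        # element-wise sum, propagated back to update the weights
print(final_error.round(3))
```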
Skip – Gram model
• Advantages of Skip-Gram Model
• The skip-gram model can capture two senses of a single word, i.e. it can have two vector representations of ‘Apple’: one for the company and one for the fruit.
• Skip-gram with negative sampling generally outperforms the other methods described here.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
Word Embeddings use case scenarios
• Since word embeddings (word vectors) are numerical representations of contextual similarities between words, they can be manipulated to perform tasks such as measuring word similarity and solving word analogies (e.g. king − man + woman ≈ queen).

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-
to-word2vec-8231e18dbe92
Training your own word vectors
• We will train our own word2vec model on a custom corpus. For training the model we will use gensim; the steps are illustrated in the sketch below.

https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-to-
word2vec-8231e18dbe92
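A minimal sketch of training word2vec with gensim 4.x on a toy corpus (the sentences and parameter values here are placeholders to be replaced by your own corpus and settings):

```python
from gensim.models import Word2Vec

# Toy, pre-tokenized corpus; replace with your own sentences.
sentences = [["hey", "this", "is", "sample", "corpus"],
             ["using", "only", "one", "context", "word"],
             ["apple", "is", "a", "fruit"],
             ["mango", "is", "a", "fruit"]]

model = Word2Vec(sentences=sentences,
                 vector_size=50,   # dimensionality N of the word vectors
                 window=2,         # context window size
                 min_count=1,      # keep every token in this tiny corpus
                 sg=1,             # 1 = skip-gram, 0 = CBOW
                 epochs=100)

print(model.wv["fruit"][:5])                   # first few dimensions of a learned vector
print(model.wv.most_similar("apple", topn=3))  # nearest neighbours in the learned space
```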
Example 2 Word Embedding using Word2Vec
• See example codes

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html
https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4
Other methods
• BERT (
https://towardsdatascience.com/bert-explained-state-of-the-art-language-mo
del-for-nlp-f8b21a9b6270
)
• ELMo (Embeddings from Language Models) developed by Peters et al. (2018) (
https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-
features-from-text/
)
• RNN
• LSTM
• Bi-LSTM
• Transformer (Attention)
• FastText developed by Facebook AI Research (FAIR) in 2016
References
• https://medium.com/analytics-vidhya/an-intuitive-understanding-of-word-embeddings-from-count-vectors-to-word2vec-8231e18dbe92
• https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html
• Word Embedding using GloVe
 https://nlp.stanford.edu/pubs/glove.pdf
 https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db
• Word Embedding using Word2Vec
 https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/
 https://www.geeksforgeeks.org/implement-your-own-word2vecskip-gram-model-in-python/
