CS7015 (Deep Learning): Lecture 10
Mitesh M. Khapra

Acknowledgments
'word2vec Parameter Learning Explained' by Xin Rong
'word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method' by Yoav Goldberg and Omer Levy
Sebastian Ruder's blogs on word embeddings (Blog1, Blog2, Blog3)
Ali Ghodsi's video lectures on Word2Vec

Module 10.1: One-hot representations of words
Let us start with a very simple motivation for why we are interested in vectorial representations of words.

Suppose we are given an input stream of words (sentence, document, etc.) and we are interested in learning some function of it (say, ŷ = sentiments(words)).

Say, we employ a machine learning algorithm (some mathematical model) for learning such a function (ŷ = f(x)).

We first need a way of converting the input stream (or each word in the stream) to a vector x (a mathematical quantity).

(Running example from the slide: the review "This is by far AAMIR KHAN's best one. Finest casting and terrific acting by all." is converted to a vector such as [5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7] before being fed to the model.)

Given a corpus, consider the set V of all unique words across all input streams (i.e., all sentences or documents).

Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

V is called the vocabulary of the corpus.

We need a representation for every word in V.

One very simple way of doing this is to use one-hot vectors of size |V|.

The representation of the i-th word will have a 1 in the i-th position and a 0 in the remaining |V| - 1 positions.

V = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]

machine: 0 1 0 ... 0 0 0

Problems:

cat:   0 0 0 0 0 1 0
dog:   0 1 0 0 0 0 0
truck: 0 0 0 1 0 0 0

V tends to be very large (for example, 50K for PTB, 13M for the Google 1T corpus).

These representations do not capture any notion of similarity.

Ideally, we would want the representations of cat and dog (both domestic animals) to be closer to each other than the representations of cat and truck.

However, with 1-hot representations, the Euclidean distance between any two words in the vocabulary is √2:
euclid_dist(cat, dog) = √2
euclid_dist(dog, truck) = √2

And the cosine similarity between any two words in the vocabulary is 0:
cosine_sim(cat, dog) = 0
cosine_sim(dog, truck) = 0

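To make this concrete, here is a minimal numpy sketch (my own, not part of the lecture) that builds one-hot vectors for a toy vocabulary and verifies these distances and similarities.

import numpy as np

vocab = ["he", "dog", "sat", "truck", "on", "cat", "a"]   # toy vocabulary, |V| = 7
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # |V|-dimensional vector with a 1 at the word's index and 0 elsewhere
    x = np.zeros(len(vocab))
    x[word_to_idx[word]] = 1.0
    return x

def euclid_dist(u, v):
    return np.linalg.norm(u - v)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

cat, dog, truck = one_hot("cat"), one_hot("dog"), one_hot("truck")
print(euclid_dist(cat, dog), euclid_dist(dog, truck))   # both sqrt(2) ≈ 1.414
print(cosine_sim(cat, dog), cosine_sim(dog, truck))     # both 0.0
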
Module 10.2: Distributed Representations of words
"You shall know a word by the company it keeps" - Firth, J. R. 1957:11

Distributional similarity based representations.

Example: A bank is a financial institution that accepts deposits from the public and creates credit.

This leads us to the idea of a co-occurrence matrix.

Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term.

The context is defined as a window of k words around the terms.

Let us build a co-occurrence matrix for this toy corpus with k = 2.

This is also known as a word × context matrix.

You could choose the set of words and contexts to be same or different.

Co-occurrence Matrix:
          human  machine  system  for  ...  user
human       0       1        0     1   ...    0
machine     1       0        0     1   ...    0
system      0       0        0     1   ...    2
for         1       1        1     0   ...    0
...        ...     ...      ...   ...  ...  ...
user        0       0        2     0   ...    0

Each row (column) of the co-occurrence matrix gives a vectorial representation of the corresponding word (context).

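A small sketch (my own, not from the slides) of building such a word × context co-occurrence matrix for the toy corpus with a symmetric window of k = 2; here the set of words and the set of contexts are taken to be the same.

import numpy as np

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]
k = 2  # context window size

# vocabulary: all unique words of the corpus
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

# count how often each word appears within k positions of another word
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

print(X[idx["human"], idx["machine"]])  # co-occurrence count of 'human' and 'machine'
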
Some (fixable) problems

Stop words (a, the, for, etc.) are very frequent → these counts will be very high.

Solution 1: Ignore very frequent words.

Solution 2: Use a threshold t (say, t = 100) and cap the counts at t. In the matrix below, the entries marked x involve the frequent word 'for' and are capped at the threshold:

          human  machine  system  for  ...  user
human       0       1        0     x   ...    0
machine     1       0        0     x   ...    0
system      0       0        0     x   ...    2
for         x       x        x     x   ...    x
...        ...     ...      ...   ...  ...  ...
user        0       0        2     x   ...    0

Some (fixable) problems

Solution 3: Instead of count(w, c) use PMI(w, c):

PMI(w, c) = log( p(c|w) / p(c) )
          = log( (count(w, c) * N) / (count(c) * count(w)) )

where N is the total number of words.

          human  machine  system   for   ...  user
human       0     2.944     0      2.25  ...    0
machine   2.944     0       0      2.25  ...    0
system      0       0       0      1.15  ...  1.84
for       2.25    2.25     1.15     0    ...    0
...        ...     ...      ...    ...   ...   ...
user        0       0      1.84     0    ...    0

If count(w, c) = 0, PMI(w, c) = -∞

Instead use
PMI_0(w, c) = PMI(w, c) if count(w, c) > 0, and 0 otherwise
or
PPMI(w, c) = PMI(w, c) if PMI(w, c) > 0, and 0 otherwise

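A sketch (my own) of turning raw counts into PPMI values, assuming X is the count matrix from the earlier snippet; zero counts are mapped to 0 instead of -∞ and negative PMI values are clipped to 0.

import numpy as np

def ppmi(X):
    # X[i, j] = count(w_i, c_j); returns the PPMI-transformed matrix
    N = X.sum()
    count_w = X.sum(axis=1, keepdims=True)   # count(w), one value per row
    count_c = X.sum(axis=0, keepdims=True)   # count(c), one value per column
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((X * N) / (count_w * count_c))
    pmi[~np.isfinite(pmi)] = 0.0             # count(w, c) = 0  ->  0 rather than -inf
    return np.maximum(pmi, 0.0)              # PPMI: clip negative PMI values to 0

print(ppmi(np.array([[0.0, 2.0], [2.0, 1.0]])))   # tiny smoke test
# X_ppmi = ppmi(X)   # X from the co-occurrence sketch above
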
Some (severe) problems

Very high dimensional (|V|)

Very sparse

Grows with the size of the vocabulary

Solution: Use dimensionality reduction (SVD)

Module 10.3: SVD for learning word representations
Singular Value Decomposition gives a rank k approximation of the original matrix:

X = X_PPMI (m × n) = U (m × k) Σ (k × k) V^T (k × n)

where U = [u_1 ... u_k] holds the left singular vectors as columns, Σ = diag(σ_1, ..., σ_k), and V^T stacks the right singular vectors v_1^T, ..., v_k^T as rows.

X_PPMI (simplifying notation to X) is the co-occurrence matrix with PPMI values.

SVD gives the best rank-k approximation of the original data (X).

Discovers latent semantics in the corpus (We will soon examine this with the help of an example).

Notice that the product can be written as a sum of k rank-1 matrices:

X = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ... + σ_k u_k v_k^T

Each σ_i u_i v_i^T ∈ R^(m×n) because it is a product of an m × 1 vector with a 1 × n vector.

If we truncate the sum at σ_1 u_1 v_1^T then we get the best rank-1 approximation of X (By the SVD theorem! But what does this mean? We will see on the next slide).

If we truncate the sum at σ_1 u_1 v_1^T + σ_2 u_2 v_2^T then we get the best rank-2 approximation of X, and so on.

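A numpy sketch (my own) checking both facts on a stand-in matrix: the truncated SVD can be written as a sum of k rank-1 terms, and it serves as a rank-k approximation of X.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 6))                 # stand-in for the m x n PPMI matrix
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# rank-k approximation written as a sum of k rank-1 matrices sigma_i * u_i * v_i^T
X_hat = sum(S_k[i] * np.outer(U_k[:, i], Vt_k[i, :]) for i in range(k))
assert np.allclose(X_hat, U_k @ np.diag(S_k) @ Vt_k)   # same thing in matrix form

print(np.linalg.norm(X - X_hat))       # reconstruction error; shrinks as k grows
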
What do we mean by approximation here?

Notice that X has m × n entries.

When we use the rank-1 approximation we are using only m + n + 1 entries to reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R^1].

But the SVD theorem tells us that u_1, v_1 and σ_1 store the most important information in X (akin to the principal components in X).

Each subsequent term (σ_2 u_2 v_2^T, σ_3 u_3 v_3^T, ...) stores less and less important information.

As an analogy, consider the case when we are using 8 bits to represent colors:

very light green: 0 0 0 1 1 0 1 1
light green:      0 0 1 0 1 0 1 1
dark green:       0 1 0 0 1 0 1 1
very dark green:  1 0 0 0 1 0 1 1

The representations of very light, light, dark and very dark green would look different.

But now what if we were asked to compress this into 4 bits? (akin to compressing m × n values into m + n + 1 values on the previous slide)

We will retain the most important 4 bits, and the previously (slightly) latent similarity between the colors now becomes very obvious.

Something similar is guaranteed by SVD (retain the most important information and discover the latent similarities between words).

X (PPMI co-occurrence matrix):
          human  machine  system   for   ...  user
human       0     2.944     0      2.25  ...    0
machine   2.944     0       0      2.25  ...    0
system      0       0       0      1.15  ...  1.84
for       2.25    2.25     1.15     0    ...    0
...
user        0       0      1.84     0    ...    0

X̂ (after low-rank reconstruction with SVD):
          human  machine  system   for   ...  user
human     2.01    2.01    0.23    2.14   ...  0.43
machine   2.01    2.01    0.23    2.14   ...  0.43
system    0.23    0.23    1.17    0.96   ...  1.29
for       2.14    2.14    0.96    1.87   ... -0.13
...
user      0.43    0.43    1.29   -0.13   ...  1.71

Notice that after low rank reconstruction with SVD, the latent co-occurrence between {system, machine} and {human, user} has become visible.

Recall that earlier each row of the original matrix X served as the representation of a word.

Then XX^T is a matrix whose ij-th entry is the dot product between the representation of word i (X[i :]) and word j (X[j :]).

Once we do an SVD, what is a good choice for the representation of word_i?

Obviously, taking the i-th row of the reconstructed matrix X̂ does not make sense because it is still high dimensional.

But we saw that the reconstructed matrix X̂ = U Σ V^T discovers latent semantics and its word representations are more meaningful.

X̂ X̂^T =
          human  machine  system    for    ...   user
human     25.4    25.4     7.6     21.9    ...   6.84
machine   25.4    25.4     7.6     21.9    ...   6.84
system     7.6     7.6    24.8    18.03    ...   20.6
for       21.9    21.9    0.96     24.6    ...  15.32
...
user      6.84    6.84    20.6    15.32    ...  17.11

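The extracted slides do not show which low-dimensional vectors are finally chosen, so treat the following as an assumption: a standard choice is W_word = UΣ ∈ R^(m×k), whose rows are k-dimensional word vectors that preserve the dot products of X̂ (since X̂X̂^T = (UΣ)(UΣ)^T). A small sketch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 6))                  # stand-in for the PPMI matrix
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

W_word = U_k * S_k                      # assumed choice: one k-dimensional row per word

# rows of W_word preserve the dot products of the rank-k reconstruction X_hat
X_hat = U_k @ np.diag(S_k) @ Vt_k
assert np.allclose(W_word @ W_word.T, X_hat @ X_hat.T)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_sim(W_word[0], W_word[1]))  # similarity between word 0 and word 1
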
The methods that we have seen so far are called count based models because they use the co-occurrence counts of words.

We will now see methods which directly learn word representations (these are called (direct) prediction based models).

The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!

Consider this Task: Predict the n-th word given the previous n-1 words.

Example: he sat on a chair

Training data: All n-word windows in your corpus.

(Example text used as a corpus on the slide: "Sometime in the 21st century, Joseph Cooper, a widowed former engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. It is a post-truth society (Cooper is reprimanded for telling Murphy that the Apollo missions did indeed happen) and a series of crop blights threatens humanity's survival. Murphy believes her bedroom is haunted by a poltergeist. When a pattern is created out of dust on the floor, Cooper realizes that gravity is behind its formation, not a "ghost". He interprets the pattern as a set of geographic coordinates formed into binary code. Cooper and Murphy follow the coordinates to a secret NASA facility, where they are met by Cooper's former professor, Dr. Brand.")

We will now try to answer these two questions:
How do you model this task?
What is the connection between this task and learning word representations?

We will model this problem using a feedforward neural network.

Input: one-hot representation of the context word, x ∈ R^|V| (e.g. sat: 0 1 0 ... 0 0 0).

Output: There are |V| words (classes) possible and we want to predict a probability distribution over these |V| classes (multi-class classification problem): P(chair|sat), P(man|sat), P(on|sat), P(he|sat), ...

The network maps x to a hidden representation h ∈ R^k and then to the output distribution.

Parameters: W_context ∈ R^(k×|V|) and W_word ∈ R^(k×|V|)
(we are assuming that the set of words and context words is the same: each of size |V|)

What is the product W_context x given that x is a one-hot vector?

It is simply the i-th column of W_context. For example:

[ -1  0.5   2 ]   [0]   [ 0.5]
[  3   -1  -2 ] . [1] = [-1  ]
[ -2  1.7   3 ]   [0]   [ 1.7]

So when the i-th word is present, the i-th element in the one-hot vector is ON and the i-th column of W_context gets selected.

In other words, there is a one-to-one correspondence between the words and the columns of W_context.

More specifically, we can treat the i-th column of W_context as the representation of context i.

How do we obtain P(on|sat)? For this multi-class classification problem, what is an appropriate output function? (softmax)

P(on|sat) = e^((W_word h)[i]) / Σ_j e^((W_word h)[j])

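A numpy sketch (my own, with made-up sizes and indices) of this forward pass: the one-hot input selects a column of W_context, and the output is a softmax over the scores u_c · v_w for every word w (written as (W_word h)[i] on the slide).

import numpy as np

rng = np.random.default_rng(0)
V, k = 7, 4                           # made-up vocabulary and embedding sizes
W_context = rng.normal(size=(k, V))   # column i = representation u_i of context word i
W_word = rng.normal(size=(k, V))      # column j = representation v_j of output word j

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

c = 1                                 # index of the context word, e.g. 'sat'
x = np.zeros(V); x[c] = 1.0           # one-hot input

h = W_context @ x                     # the product just selects the c-th column
assert np.allclose(h, W_context[:, c])

y_hat = softmax(W_word.T @ h)         # scores are the dot products v_w . u_c
print(y_hat.sum())                    # a valid probability distribution over |V| words
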
We denote the context word (sat) by the index c and the correct output word (on) by the index w.

For this multiclass classification problem, what is an appropriate output function (ŷ = f(x))? softmax

What is an appropriate loss function? cross entropy

How do we train this simple feedforward neural network? backpropagation

Let us consider one input-output pair (c, w) and see the update rule for v_w.

L(θ) = - log ŷ_w
     = - log [ exp(u_c · v_w) / Σ_{w' ∈ V} exp(u_c · v_w') ]
     = - ( u_c · v_w - log Σ_{w' ∈ V} exp(u_c · v_w') )

∇v_w = ∂L(θ)/∂v_w
     = - ( u_c - [exp(u_c · v_w) / Σ_{w' ∈ V} exp(u_c · v_w')] u_c )
     = - u_c (1 - ŷ_w)

And the update rule would be

v_w = v_w - η ∇v_w
    = v_w + η u_c (1 - ŷ_w)

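A self-contained sketch (my own, with made-up sizes and indices) of one stochastic gradient step for a single (c, w) pair using this closed-form gradient.

import numpy as np

rng = np.random.default_rng(0)
V, k, eta = 7, 4, 0.1                 # made-up sizes and learning rate
W_context = rng.normal(size=(k, V))
W_word = rng.normal(size=(k, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

c, w = 1, 3                           # indices of the context word ('sat') and the correct word ('on')

u_c = W_context[:, c]                 # representation of the context word
y_hat = softmax(W_word.T @ u_c)       # predicted distribution over the vocabulary

# gradient of the cross-entropy loss with respect to v_w (the correct word's vector)
grad_v_w = -u_c * (1.0 - y_hat[w])

# update: v_w moves towards u_c by an amount that shrinks as y_hat[w] approaches 1
W_word[:, w] -= eta * grad_v_w
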
This update rule has a nice interpretation:

v_w = v_w + η u_c (1 - ŷ_w)

If ŷ_w is already close to 1 the prediction is correct and v_w is barely changed; if ŷ_w is small, a larger fraction of u_c is added to v_w, pushing v_w towards u_c.

What happens to the representations of two words w and w' which tend to appear in similar contexts (c)?

The training ensures that both v_w and v_w' have a high cosine similarity with u_c, and hence transitively (intuitively) ensures that v_w and v_w' have a high cosine similarity with each other.

This is only an intuition (reasonable).

Haven't come across a formal proof for this!

In practice, instead of a window size of 1 it is common to use a window size of d.

So now,

h = Σ_{i=1}^{d-1} u_{c_i}

The input x ∈ R^(2|V|) stacks the one-hot vectors of the context words (here 'sat' and 'he'), and [W_context, W_context] just means that we are stacking 2 copies of the W_context matrix:

                                           [0]
[ -1  0.5   2   -1  0.5   2 ]              [1]  } sat       [ 2.5]
[  3 -1.0  -2    3 -1.0  -2 ]       .      [0]          =   [-3.0]
[ -2  1.7   3   -2  1.7   3 ]              [0]              [ 4.7]
[W_context, W_context] ∈ R^(k×2|V|)        [0]
                                           [1]  } he

The resultant product would simply be the sum of the columns corresponding to 'sat' and 'he'.

The output layer now gives a distribution over words conditioned on both context words: P(on|sat, he), P(he|sat, he), P(chair|sat, he), P(man|sat, he), ...

Of course, in practice we will not do this expensive matrix multiplication.

If 'he' is the i-th word in the vocabulary and 'sat' is the j-th word, then we will simply access columns W_context[:, i] and W_context[:, j] and add them.

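A one-line sketch of this shortcut (the indices i and j are hypothetical placeholders): add the two columns directly instead of multiplying by a 2|V|-dimensional one-hot vector.

import numpy as np

rng = np.random.default_rng(0)
V, k = 7, 4
W_context = rng.normal(size=(k, V))

i, j = 2, 1                                 # hypothetical indices of 'he' and 'sat'
h = W_context[:, i] + W_context[:, j]       # no |V|-sized matrix-vector product needed
print(h)
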
Now what happens during backpropagation?

Recall that

h = Σ_{i=1}^{d-1} u_{c_i}

and

P(on|sat, he) = e^((W_word h)[k]) / Σ_j e^((W_word h)[j])

Some problems:

Notice that the softmax function at the output is computationally very expensive:

ŷ_w = exp(u_c · v_w) / Σ_{w' ∈ V} exp(u_c · v_w')

(the denominator requires a summation over all |V| words in the vocabulary)

Module 10.5: Skip-gram model
The model that we just saw is called the continuous bag of words model (it predicts an output word given a bag of context words).

We will now see the skip gram model (which predicts context words given an input word).

Notice that the role of context and word has changed now: the one-hot input x ∈ R^|V| for 'on' is mapped through W_word ∈ R^(k×|V|) to h ∈ R^k, and W_context ∈ R^(k×|V|) now sits at the output, predicting the context words (he, sat, a, chair).

In the simple case when there is only one context word, we will arrive at the same update rule for u_c as we did for v_w earlier.

Notice that even when we have multiple context words the loss function would just be a summation of many cross entropy errors:

L(θ) = - Σ_{i=1}^{d-1} log ŷ_{w_i}

[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
42/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
D = [(sat, on), (sat, a), (sat, chair), (on, a), (on, chair), (a, chair), (on, sat), (a, sat), (chair, sat), (a, on), (chair, on), (chair, a)]

D' = [(sat, oxygen), (sat, magic), (chair, sad), (chair, walking)]

Let D be the set of all correct (w, c) pairs in the corpus
Let D' be the set of all incorrect (w, r) pairs in the corpus
D' can be constructed by randomly sampling a context word r which has never appeared with w and creating a pair (w, r)
As before, let v_w be the representation of the word w and u_c be the representation of the context word c
43/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
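As a concrete illustration, the sketch below builds D from a fixed-size window around each word of a toy sentence and builds D' by pairing each word with a randomly chosen vocabulary word it never co-occurred with. The window size, the extra vocabulary words, and the one-negative-per-word choice are assumptions for illustration, not taken from the slides.

```python
import random

sentence = "he sat on a chair".split()
vocab = set(sentence) | {"oxygen", "magic", "sad", "walking"}   # toy vocabulary
window = 2

# D: all (word, context) pairs within the window
D = [(w, c) for i, w in enumerate(sentence)
            for j, c in enumerate(sentence)
            if i != j and abs(i - j) <= window]

# D': for each word, sample one context word that never appeared with it
seen = {w: {c for ww, c in D if ww == w} for w in sentence}
D_neg = [(w, random.choice([r for r in vocab if r not in seen[w] and r != w]))
         for w in sentence]

print(D)
print(D_neg)
```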
[Figure: a sigmoid unit σ over u_c and v_w producing P(z = 1|w, c)]
For a given (w, c) ∈ D we are interested in maximizing
p(z = 1|w, c)
Let us model this probability by

p(z = 1|w, c) = σ(u_c^T v_w) = \frac{1}{1 + e^{-u_c^T v_w}}

44/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: a sigmoid unit σ over u_r and v_w producing P(z = 0|w, r)]
For (w, r) ∈ D' we are interested in maximizing
p(z = 0|w, r)
Again we model this as

p(z = 0|w, r) = 1 - σ(u_r^T v_w) = \frac{1}{1 + e^{u_r^T v_w}}

45/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Combining the two we get:

\text{maximize}_{\theta} \; \prod_{(w,c) \in D} p(z = 1|w, c) \prod_{(w,r) \in D'} p(z = 0|w, r)

= \text{maximize}_{\theta} \; \prod_{(w,c) \in D} p(z = 1|w, c) \prod_{(w,r) \in D'} (1 - p(z = 1|w, r))

= \text{maximize}_{\theta} \; \sum_{(w,c) \in D} \log p(z = 1|w, c) + \sum_{(w,r) \in D'} \log(1 - p(z = 1|w, r))

= \text{maximize}_{\theta} \; \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c^T v_w}} + \sum_{(w,r) \in D'} \log \frac{1}{1 + e^{v_r^T v_w}}

= \text{maximize}_{\theta} \; \sum_{(w,c) \in D} \log \sigma(v_c^T v_w) + \sum_{(w,r) \in D'} \log \sigma(-v_r^T v_w)

where \sigma(x) = \frac{1}{1 + e^{-x}}
46/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
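A minimal sketch of the final expression above, written as a loss for a single positive pair (w, c) and its sampled negatives (i.e., the negative of the quantity being maximized). The vector dimension, the number of negatives, and the random vectors are placeholders, and the parameter-update loop is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_w, v_c, v_rs):
    """Negative of: log sigma(v_c . v_w) + sum_r log sigma(-v_r . v_w)."""
    pos = np.log(sigmoid(v_c @ v_w))
    neg = sum(np.log(sigmoid(-v_r @ v_w)) for v_r in v_rs)
    return -(pos + neg)

rng = np.random.default_rng(0)
v_w  = rng.normal(size=50)         # representation of the word w
v_c  = rng.normal(size=50)         # representation of a true context word c
v_rs = rng.normal(size=(2, 50))    # representations of k = 2 sampled negative contexts r
print(sgns_loss(v_w, v_c, v_rs))
```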
[Figure: the sigmoid unit σ over u_r and v_w producing P(z = 0|w, r)]
In the original paper, Mikolov et al. sample k negative (w, r) pairs for every positive (w, c) pair
The size of D' is thus k times the size of D
The random context word r is drawn from a modified unigram distribution:

r \sim p(r)^{3/4}

i.e., r \sim \frac{count(r)^{3/4}}{N}

N = total number of words in the corpus
47/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
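One way to implement this sampling step is sketched below: raise each word count to the power 3/4 and normalise. Normalising by the sum of the smoothed counts (rather than by N) is my assumption, made so that the probabilities sum to one; the toy corpus and the choice k = 5 are also illustrative.

```python
import numpy as np
from collections import Counter

corpus = "he sat on a chair he sat on a mat".split()
counts = Counter(corpus)
words = list(counts)

probs = np.array([counts[w] ** 0.75 for w in words], dtype=float)
probs /= probs.sum()                       # count(r)^{3/4}, normalised to a proper distribution

def sample_negatives(k=5):
    """Draw k candidate negative context words r ~ count(r)^{3/4}."""
    return list(np.random.choice(words, size=k, p=probs))

print(sample_negatives())
```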
Module 10.6: Contrastive estimation
48/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
49/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Positive: He sat on a chair        Negative: He sat abracadabra a chair
[Figure: two copies of a scoring network; each takes the context and word representations (v_c, v_w), passes them through a hidden layer W_h ∈ R^{2d×h} and an output layer W_out ∈ R^{h×1}, and produces a score s for the positive window and s_r for the corrupted one]
50/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
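The slides only show the two copies of the scoring network (score s for the true window and s_r for the corrupted one). A standard way to train such a scorer, as in Collobert and Weston's pairwise ranking setup, is a hinge loss that pushes s above s_r by a margin; the sketch below assumes that objective, a margin of 1, and a tanh hidden layer, none of which are stated on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 50, 20
W_h   = rng.normal(scale=0.1, size=(2 * d, h))    # hidden layer W_h in R^{2d x h}, as in the figure
W_out = rng.normal(scale=0.1, size=(h, 1))        # output layer W_out in R^{h x 1}

def score(v_w, v_c):
    """Scalar score for a (context, word) pair: W_out^T tanh(W_h^T [v_c; v_w])."""
    z = np.tanh(np.concatenate([v_c, v_w]) @ W_h)
    return (z @ W_out).item()

def contrastive_loss(v_w, v_c, v_r, margin=1.0):
    """Hinge loss: push the positive score s above the corrupted score s_r by `margin`."""
    s, s_r = score(v_w, v_c), score(v_w, v_r)
    return max(0.0, margin - s + s_r)

v_w, v_c, v_r = rng.normal(size=(3, d))
print(contrastive_loss(v_w, v_c, v_r))
```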
Module 10.7: Hierarchical softmax
51/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
52/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
\max \frac{e^{v_c^T u_w}}{\sum_{w' \in V} e^{v_c^T u_{w'}}}

[Figure: the output layer arranged as a binary tree; input: one-hot vector for "sat", hidden layer h = v_c, internal node vectors u_1, u_2, ..., u_V, leaf "on" reached via π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary
There exists a unique path from the root node to a leaf node
Let l(w_1), l(w_2), ..., l(w_p) be the nodes on the path from root to w
Let π(w) be a binary vector such that:
π(w)_k = 1 if the path branches left at node l(w_k)
       = 0 otherwise
53/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the binary tree; path to "on": π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0; node vectors u_1, u_2, ..., u_V; h = v_c; input: one-hot vector for "sat"]
For a given pair (w, c) we are interested in the probability p(w|v_c)
We model this probability as

p(w|v_c) = \prod_k P(π(w)_k | v_c)

For example,

p(on|v_{sat}) = P(π(on)_1 = 1 | v_{sat}) \cdot P(π(on)_2 = 0 | v_{sat}) \cdot P(π(on)_3 = 0 | v_{sat})

54/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the binary tree with node vectors u_1, u_2, ..., u_V; path to "on": π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0; h = v_c; input: one-hot vector for "sat"]
We model

P(π(on)_i = 1) = \frac{1}{1 + e^{-v_c^T u_i}}

P(π(on)_i = 0) = 1 - P(π(on)_i = 1) = \frac{1}{1 + e^{v_c^T u_i}}

The above model ensures that the representation of a context word v_c will have a high (low) similarity with the representation of the node u_i if u_i appears on the path and the path branches to the left (right) at u_i
Again, transitively, the representations of contexts which appear with the same words will have high similarity
55/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the same binary tree]

P(w|v_c) = \prod_{k=1}^{|π(w)|} P(π(w)_k | v_c)

Computing this product requires only |π(w)| (roughly \log_2 |V| for a balanced tree) sigmoid evaluations, instead of the |V| terms needed by the full softmax
56/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
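A small sketch of this path probability: each leaf word stores the internal-node vectors on its path together with the branch directions π(w), and P(π = 1) is modelled with a sigmoid as above. The tree itself, the toy dimensions, and the random vectors are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_word_given_context(v_c, path_nodes, path_bits):
    """P(w | v_c) = prod_k P(pi(w)_k | v_c), with P(pi(w)_k = 1) = sigma(v_c . u_i)."""
    p = 1.0
    for u_i, bit in zip(path_nodes, path_bits):
        p_left = sigmoid(v_c @ u_i)               # probability of branching left at this node
        p *= p_left if bit == 1 else (1.0 - p_left)
    return p

rng = np.random.default_rng(0)
k = 50
v_c = rng.normal(size=k)                          # h = v_c, the context representation
path_nodes = rng.normal(size=(3, k))              # u_1, u_2, u_3: nodes on the path from root to "on"
path_bits = [1, 0, 0]                             # pi(on) = (1, 0, 0): left, then right, then right
print(p_word_given_context(v_c, path_nodes, path_bits))
```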
Module 10.8: GloVe representations
57/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations
Predict based methods learn word representations using co-occurrence information
Why not combine the two (count and learn)?
58/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

X =
          human   machine  system  for     ...   user
human     2.01    2.01     0.23    2.14    ...   0.43
machine   2.01    2.01     0.23    2.14    ...   0.43
system    0.23    0.23     1.17    0.96    ...   1.29
for       2.14    2.14     0.96    1.87    ...   -0.13
...
user      0.43    0.43     1.29    -0.13   ...   1.71

P(j|i) = \frac{X_{ij}}{\sum_j X_{ij}} = \frac{X_{ij}}{X_i},    X_{ij} = X_{ji}

X_{ij} encodes important global information about the co-occurrence between i and j (global: because it is computed from the entire corpus)
Why not learn word vectors which are faithful to this information?
For example, enforce

v_i^T v_j = \log P(j|i) = \log X_{ij} - \log X_i

Similarly,

v_j^T v_i = \log P(i|j) = \log X_{ij} - \log X_j

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

[X = the co-occurrence matrix from the previous slide; P(j|i) = X_{ij}/X_i, X_{ij} = X_{ji}]

Adding the two equations we get

2 v_i^T v_j = 2 \log X_{ij} - \log X_i - \log X_j

v_i^T v_j = \log X_{ij} - \frac{1}{2} \log X_i - \frac{1}{2} \log X_j

Note that \log X_i and \log X_j depend only on the words i & j and we can think of them as word-specific biases which will be learned

v_i^T v_j = \log X_{ij} - b_i - b_j
v_i^T v_j + b_i + b_j = \log X_{ij}

We can then formulate this as the following optimization problem

\min_{v_i, v_j, b_i, b_j} \sum_{i,j} (\underbrace{v_i^T v_j + b_i + b_j}_{\text{predicted value using model parameters}} - \underbrace{\log X_{ij}}_{\text{actual value computed from the given corpus}})^2

60/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

\min_{v_i, v_j, b_i, b_j} \sum_{i,j} (v_i^T v_j + b_i + b_j - \log X_{ij})^2

61/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
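Below is a minimal sketch that fits this objective with plain stochastic gradient descent on a toy symmetric count matrix. It implements exactly the unweighted squared error shown above (the published GloVe model additionally weights each term by a function f(X_ij)); the learning rate, matrix size, and number of passes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, lr = 6, 10, 0.05                       # toy vocabulary size, embedding size, learning rate
X = rng.integers(1, 10, size=(n, n)).astype(float)
X = (X + X.T) / 2                              # make the co-occurrence counts symmetric: X_ij = X_ji
logX = np.log(X)

V = rng.normal(scale=0.1, size=(n, dim))       # word vectors v_i (one per row)
b = np.zeros(n)                                # word-specific biases b_i

for _ in range(200):                           # plain SGD over all (i, j) cells
    for i in range(n):
        for j in range(n):
            err = V[i] @ V[j] + b[i] + b[j] - logX[i, j]   # predicted - actual
            grad_i, grad_j = err * V[j], err * V[i]
            V[i] -= lr * grad_i
            V[j] -= lr * grad_j
            b[i] -= lr * err
            b[j] -= lr * err

loss = sum((V[i] @ V[j] + b[i] + b[j] - logX[i, j]) ** 2 for i in range(n) for j in range(n))
print(loss)
```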
Module 10.9: Evaluating word representations
62/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
How do we evaluate the learned word representations?
63/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Semantic Relatedness
Ask humans to judge the relatedness between a pair of words
Compute the cosine similarity between the corresponding word vectors learned by the model
Given a large number of such word pairs, compute the correlation between S_model & S_human, and compare different models

S_human(cat, dog) = 0.8
S_model(cat, dog) = \frac{v_{cat}^T v_{dog}}{\|v_{cat}\| \|v_{dog}\|} = 0.7

Model 1 is better than Model 2 if
correlation(S_model1, S_human) > correlation(S_model2, S_human)
64/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
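A sketch of this evaluation: compute the cosine similarity for each word pair and then the correlation between the model scores and the human judgements (Pearson correlation here; published benchmarks often report Spearman). The embeddings and human scores below are made-up placeholders.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["cat", "dog", "car", "banana"]}   # placeholder embeddings
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "banana")]
s_human = np.array([0.8, 0.3, 0.1])               # hypothetical human relatedness judgements

s_model = np.array([cosine(emb[a], emb[b]) for a, b in pairs])
correlation = np.corrcoef(s_model, s_human)[0, 1] # higher correlation = better model
print(correlation)
```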
Synonym Detection
Given: a term and four candidate synonyms
Pick the candidate which has the largest cosine similarity with the term
Compute the accuracy of different models and compare

Term: levied
Candidates: {imposed, believed, requested, correlated}
65/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Analogy
Semantic Analogy: Find the nearest neighbour of v_{brother} - v_{sister} + v_{grandson}
Syntactic Analogy: Find the nearest neighbour of v_{work} - v_{works} + v_{speak}
66/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
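A sketch of how such analogy queries are scored: form the query vector and return the vocabulary word whose vector has the highest cosine similarity with it, excluding the query words themselves. The embeddings here are random placeholders, so the output only demonstrates the mechanics.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(emb, a, b, c):
    """Return the nearest neighbour of v_a - v_b + v_c, excluding a, b, c themselves."""
    query = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))

rng = np.random.default_rng(0)
words = ["brother", "sister", "grandson", "granddaughter", "work", "works", "speak", "speaks"]
emb = {w: rng.normal(size=50) for w in words}     # placeholder vectors
print(analogy(emb, "brother", "sister", "grandson"))   # the semantic analogy query from the slide
```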
So which algorithm gives the best result?
Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks.
Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction based models on similarity tasks but not on analogy tasks.
67/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Module 10.10: Relation between SVD & word2Vec
68/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
69/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Recall that SVD does a matrix factorization of the co-occurrence matrix
Levy et al. [2015] show that word2vec also implicitly does a matrix factorization
What does this mean?
Recall that word2vec gives us W_context & W_word
Turns out that we can also show that

M = W_context^T W_word

where

M_{ij} = PMI(w_i, c_j) - \log(k)
k = number of negative samples

70/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
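To connect the two views, here is a sketch that builds a shifted PMI matrix M_ij = PMI(w_i, c_j) − log k from a toy count matrix and factorizes it with SVD to obtain low-dimensional word vectors. The toy counts, the add-one smoothing, and the clipping of negative entries (the common positive-PMI variant) are assumptions, not part of the slides.

```python
import numpy as np

k_neg = 5                                         # number of negative samples k
rng = np.random.default_rng(0)
counts = rng.integers(0, 10, size=(8, 8)).astype(float) + 1.0   # toy word-context counts (+1 smoothing)

total = counts.sum()
p_wc = counts / total                             # joint probability estimate P(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total   # marginal P(w)
p_c = counts.sum(axis=0, keepdims=True) / total   # marginal P(c)

pmi = np.log(p_wc / (p_w * p_c))
M = np.maximum(pmi - np.log(k_neg), 0.0)          # shifted PMI, clipped at 0 (positive-PMI variant)

U, S, Vt = np.linalg.svd(M)
dim = 4
word_vectors = U[:, :dim] * np.sqrt(S[:dim])      # low-rank factor, playing the role of W_word
print(word_vectors.shape)                         # (8, 4)
```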