CS7015 (Deep Learning): Lecture 10
Mitesh M. Khapra

Acknowledgments
'word2vec Parameter Learning Explained' by Xin Rong
'word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method' by Yoav Goldberg and Omer Levy
Sebastian Ruder's blogs on word embeddings (Blog1, Blog2, Blog3)
Ali Ghodsi's video lectures on Word2Vec

Module 10.1: One-hot representations of words
Let us start with a very simple motivation for why we are interested in vectorial representations of words.

Suppose we are given an input stream of words (sentence, document, etc.) and we are interested in learning some function of it (say, ŷ = sentiments(words)).

Say, we employ a machine learning algorithm (some mathematical model) for learning such a function (ŷ = f(x)).

We first need a way of converting the input stream (or each word in the stream) to a vector x (a mathematical quantity).

(Running example from the slide: the review "This is by far AAMIR KHAN's best one. Finest casting and terrific acting by all." is converted to a vector such as [5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7] before being fed to the model.)

Given a corpus, consider the set V of all unique words across all input streams (i.e., all sentences or documents).

Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

V is called the vocabulary of the corpus.

We need a representation for every word in V.

One very simple way of doing this is to use one-hot vectors of size |V|.

The representation of the i-th word will have a 1 in the i-th position and a 0 in the remaining |V| - 1 positions.

V = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]

machine: 0 1 0 ... 0 0 0

Problems:

cat:   0 0 0 0 0 1 0
dog:   0 1 0 0 0 0 0
truck: 0 0 0 1 0 0 0

V tends to be very large (for example, 50K for PTB, 13M for the Google 1T corpus).

These representations do not capture any notion of similarity.

Ideally, we would want the representations of cat and dog (both domestic animals) to be closer to each other than the representations of cat and truck.

However, with 1-hot representations, the Euclidean distance between any two words in the vocabulary is √2:
euclid_dist(cat, dog) = √2
euclid_dist(dog, truck) = √2

And the cosine similarity between any two words in the vocabulary is 0:
cosine_sim(cat, dog) = 0
cosine_sim(dog, truck) = 0

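To make this concrete, here is a minimal numpy sketch (my own, not part of the lecture) that builds one-hot vectors for a toy vocabulary and verifies these distances and similarities.

import numpy as np

vocab = ["he", "dog", "sat", "truck", "on", "cat", "a"]   # toy vocabulary, |V| = 7
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # |V|-dimensional vector with a 1 at the word's index and 0 elsewhere
    x = np.zeros(len(vocab))
    x[word_to_idx[word]] = 1.0
    return x

def euclid_dist(u, v):
    return np.linalg.norm(u - v)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

cat, dog, truck = one_hot("cat"), one_hot("dog"), one_hot("truck")
print(euclid_dist(cat, dog), euclid_dist(dog, truck))   # both sqrt(2) ≈ 1.414
print(cosine_sim(cat, dog), cosine_sim(dog, truck))     # both 0.0
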
Module 10.2: Distributed Representations of words
"You shall know a word by the company it keeps" - Firth, J. R. 1957:11

Distributional similarity based representations.

Example: A bank is a financial institution that accepts deposits from the public and creates credit.

This leads us to the idea of a co-occurrence matrix.

Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term.

The context is defined as a window of k words around the terms.

Let us build a co-occurrence matrix for this toy corpus with k = 2.

This is also known as a word × context matrix.

You could choose the set of words and contexts to be same or different.

Co-occurrence Matrix:
          human  machine  system  for  ...  user
human       0       1        0     1   ...    0
machine     1       0        0     1   ...    0
system      0       0        0     1   ...    2
for         1       1        1     0   ...    0
...        ...     ...      ...   ...  ...  ...
user        0       0        2     0   ...    0

Each row (column) of the co-occurrence matrix gives a vectorial representation of the corresponding word (context).

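A small sketch (my own, not from the slides) of building such a word × context co-occurrence matrix for the toy corpus with a symmetric window of k = 2; here the set of words and the set of contexts are taken to be the same.

import numpy as np

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]
k = 2  # context window size

# vocabulary: all unique words of the corpus
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

# count how often each word appears within k positions of another word
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

print(X[idx["human"], idx["machine"]])  # co-occurrence count of 'human' and 'machine'
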
Some (fixable) problems

Stop words (a, the, for, etc.) are very frequent → these counts will be very high.

Solution 1: Ignore very frequent words.

Solution 2: Use a threshold t (say, t = 100) and cap the counts at t. In the matrix below, the entries marked x involve the frequent word 'for' and are capped at the threshold:

          human  machine  system  for  ...  user
human       0       1        0     x   ...    0
machine     1       0        0     x   ...    0
system      0       0        0     x   ...    2
for         x       x        x     x   ...    x
...        ...     ...      ...   ...  ...  ...
user        0       0        2     x   ...    0

Some (fixable) problems

Solution 3: Instead of count(w, c) use PMI(w, c):

PMI(w, c) = log( p(c|w) / p(c) )
          = log( (count(w, c) * N) / (count(c) * count(w)) )

where N is the total number of words.

          human  machine  system   for   ...  user
human       0     2.944     0      2.25  ...    0
machine   2.944     0       0      2.25  ...    0
system      0       0       0      1.15  ...  1.84
for       2.25    2.25     1.15     0    ...    0
...        ...     ...      ...    ...   ...   ...
user        0       0      1.84     0    ...    0

If count(w, c) = 0, PMI(w, c) = -∞

Instead use
PMI_0(w, c) = PMI(w, c) if count(w, c) > 0, and 0 otherwise
or
PPMI(w, c) = PMI(w, c) if PMI(w, c) > 0, and 0 otherwise

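A sketch (my own) of turning raw counts into PPMI values, assuming X is the count matrix from the earlier snippet; zero counts are mapped to 0 instead of -∞ and negative PMI values are clipped to 0.

import numpy as np

def ppmi(X):
    # X[i, j] = count(w_i, c_j); returns the PPMI-transformed matrix
    N = X.sum()
    count_w = X.sum(axis=1, keepdims=True)   # count(w), one value per row
    count_c = X.sum(axis=0, keepdims=True)   # count(c), one value per column
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((X * N) / (count_w * count_c))
    pmi[~np.isfinite(pmi)] = 0.0             # count(w, c) = 0  ->  0 rather than -inf
    return np.maximum(pmi, 0.0)              # PPMI: clip negative PMI values to 0

print(ppmi(np.array([[0.0, 2.0], [2.0, 1.0]])))   # tiny smoke test
# X_ppmi = ppmi(X)   # X from the co-occurrence sketch above
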
Some (severe) problems

Very high dimensional (|V|)

Very sparse

Grows with the size of the vocabulary

Solution: Use dimensionality reduction (SVD)

Module 10.3: SVD for learning word representations
Singular Value Decomposition gives a rank k approximation of the original matrix:

X = X_PPMI (m × n) = U (m × k) Σ (k × k) V^T (k × n)

where U = [u_1 ... u_k] holds the left singular vectors as columns, Σ = diag(σ_1, ..., σ_k), and V^T stacks the right singular vectors v_1^T, ..., v_k^T as rows.

X_PPMI (simplifying notation to X) is the co-occurrence matrix with PPMI values.

SVD gives the best rank-k approximation of the original data (X).

Discovers latent semantics in the corpus (We will soon examine this with the help of an example).

Notice that the product can be written as a sum of k rank-1 matrices:

X = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ... + σ_k u_k v_k^T

Each σ_i u_i v_i^T ∈ R^(m×n) because it is a product of an m × 1 vector with a 1 × n vector.

If we truncate the sum at σ_1 u_1 v_1^T then we get the best rank-1 approximation of X (By the SVD theorem! But what does this mean? We will see on the next slide).

If we truncate the sum at σ_1 u_1 v_1^T + σ_2 u_2 v_2^T then we get the best rank-2 approximation of X, and so on.

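A numpy sketch (my own) checking both facts on a stand-in matrix: the truncated SVD can be written as a sum of k rank-1 terms, and it serves as a rank-k approximation of X.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 6))                 # stand-in for the m x n PPMI matrix
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# rank-k approximation written as a sum of k rank-1 matrices sigma_i * u_i * v_i^T
X_hat = sum(S_k[i] * np.outer(U_k[:, i], Vt_k[i, :]) for i in range(k))
assert np.allclose(X_hat, U_k @ np.diag(S_k) @ Vt_k)   # same thing in matrix form

print(np.linalg.norm(X - X_hat))       # reconstruction error; shrinks as k grows
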
What do we mean by approximation here?

Notice that X has m × n entries.

When we use the rank-1 approximation we are using only m + n + 1 entries to reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R^1].

But the SVD theorem tells us that u_1, v_1 and σ_1 store the most important information in X (akin to the principal components in X).

Each subsequent term (σ_2 u_2 v_2^T, σ_3 u_3 v_3^T, ...) stores less and less important information.

As an analogy, consider the case when we are using 8 bits to represent colors:

very light green: 0 0 0 1 1 0 1 1
light green:      0 0 1 0 1 0 1 1
dark green:       0 1 0 0 1 0 1 1
very dark green:  1 0 0 0 1 0 1 1

The representations of very light, light, dark and very dark green would look different.

But now what if we were asked to compress this into 4 bits? (akin to compressing m × n values into m + n + 1 values on the previous slide)

We will retain the most important 4 bits, and the previously (slightly) latent similarity between the colors now becomes very obvious.

Something similar is guaranteed by SVD (retain the most important information and discover the latent similarities between words).

X (PPMI co-occurrence matrix):
          human  machine  system   for   ...  user
human       0     2.944     0      2.25  ...    0
machine   2.944     0       0      2.25  ...    0
system      0       0       0      1.15  ...  1.84
for       2.25    2.25     1.15     0    ...    0
...
user        0       0      1.84     0    ...    0

X̂ (after low-rank reconstruction with SVD):
          human  machine  system   for   ...  user
human     2.01    2.01    0.23    2.14   ...  0.43
machine   2.01    2.01    0.23    2.14   ...  0.43
system    0.23    0.23    1.17    0.96   ...  1.29
for       2.14    2.14    0.96    1.87   ... -0.13
...
user      0.43    0.43    1.29   -0.13   ...  1.71

Notice that after low rank reconstruction with SVD, the latent co-occurrence between {system, machine} and {human, user} has become visible.

Recall that earlier each row of the original matrix X served as the representation of a word.

Then XX^T is a matrix whose ij-th entry is the dot product between the representation of word i (X[i :]) and word j (X[j :]).

Once we do an SVD, what is a good choice for the representation of word_i?

Obviously, taking the i-th row of the reconstructed matrix X̂ does not make sense because it is still high dimensional.

But we saw that the reconstructed matrix X̂ = U Σ V^T discovers latent semantics and its word representations are more meaningful.

X̂ X̂^T =
          human  machine  system    for    ...   user
human     25.4    25.4     7.6     21.9    ...   6.84
machine   25.4    25.4     7.6     21.9    ...   6.84
system     7.6     7.6    24.8    18.03    ...   20.6
for       21.9    21.9    0.96     24.6    ...  15.32
...
user      6.84    6.84    20.6    15.32    ...  17.11

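The extracted slides do not show which low-dimensional vectors are finally chosen, so treat the following as an assumption: a standard choice is W_word = UΣ ∈ R^(m×k), whose rows are k-dimensional word vectors that preserve the dot products of X̂ (since X̂X̂^T = (UΣ)(UΣ)^T). A small sketch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 6))                  # stand-in for the PPMI matrix
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

W_word = U_k * S_k                      # assumed choice: one k-dimensional row per word

# rows of W_word preserve the dot products of the rank-k reconstruction X_hat
X_hat = U_k @ np.diag(S_k) @ Vt_k
assert np.allclose(W_word @ W_word.T, X_hat @ X_hat.T)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_sim(W_word[0], W_word[1]))  # similarity between word 0 and word 1
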
The methods that we have seen so far are called count based models because they use the co-occurrence counts of words.

We will now see methods which directly learn word representations (these are called (direct) prediction based models).

The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!

Consider this Task: Predict the n-th word given the previous n-1 words.

Example: he sat on a chair

Training data: All n-word windows in your corpus.

(Example text used as a corpus on the slide: "Sometime in the 21st century, Joseph Cooper, a widowed former engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. It is a post-truth society (Cooper is reprimanded for telling Murphy that the Apollo missions did indeed happen) and a series of crop blights threatens humanity's survival. Murphy believes her bedroom is haunted by a poltergeist. When a pattern is created out of dust on the floor, Cooper realizes that gravity is behind its formation, not a "ghost". He interprets the pattern as a set of geographic coordinates formed into binary code. Cooper and Murphy follow the coordinates to a secret NASA facility, where they are met by Cooper's former professor, Dr. Brand.")

We will now try to answer these two questions:
How do you model this task?
What is the connection between this task and learning word representations?

We will model this problem using a feedforward neural network.

Input: one-hot representation of the context word, x ∈ R^|V| (e.g. sat: 0 1 0 ... 0 0 0).

Output: There are |V| words (classes) possible and we want to predict a probability distribution over these |V| classes (multi-class classification problem): P(chair|sat), P(man|sat), P(on|sat), P(he|sat), ...

The network maps x to a hidden representation h ∈ R^k and then to the output distribution.

Parameters: W_context ∈ R^(k×|V|) and W_word ∈ R^(k×|V|)
(we are assuming that the set of words and context words is the same: each of size |V|)

What is the product W_context x given that x is a one-hot vector?

It is simply the i-th column of W_context. For example:

[ -1  0.5   2 ]   [0]   [ 0.5]
[  3   -1  -2 ] . [1] = [-1  ]
[ -2  1.7   3 ]   [0]   [ 1.7]

So when the i-th word is present, the i-th element in the one-hot vector is ON and the i-th column of W_context gets selected.

In other words, there is a one-to-one correspondence between the words and the columns of W_context.

More specifically, we can treat the i-th column of W_context as the representation of context i.

How do we obtain P(on|sat)? For this multi-class classification problem, what is an appropriate output function? (softmax)

P(on|sat) = e^((W_word h)[i]) / Σ_j e^((W_word h)[j])

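A numpy sketch (my own, with made-up sizes and indices) of this forward pass: the one-hot input selects a column of W_context, and the output is a softmax over the scores u_c · v_w for every word w (written as (W_word h)[i] on the slide).

import numpy as np

rng = np.random.default_rng(0)
V, k = 7, 4                           # made-up vocabulary and embedding sizes
W_context = rng.normal(size=(k, V))   # column i = representation u_i of context word i
W_word = rng.normal(size=(k, V))      # column j = representation v_j of output word j

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

c = 1                                 # index of the context word, e.g. 'sat'
x = np.zeros(V); x[c] = 1.0           # one-hot input

h = W_context @ x                     # the product just selects the c-th column
assert np.allclose(h, W_context[:, c])

y_hat = softmax(W_word.T @ h)         # scores are the dot products v_w . u_c
print(y_hat.sum())                    # a valid probability distribution over |V| words
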
We denote the context word (sat) by the index c and the correct output word (on) by the index w.

For this multiclass classification problem, what is an appropriate output function (ŷ = f(x))? softmax

What is an appropriate loss function? cross entropy

How do we train this simple feedforward neural network? backpropagation

Let us consider one input-output pair (c, w) and see the update rule for v_w.

L(θ) = - log ŷ_w
     = - log [ exp(u_c · v_w) / Σ_{w' ∈ V} exp(u_c · v_w') ]
     = - ( u_c · v_w - log Σ_{w' ∈ V} exp(u_c · v_w') )

∇v_w = ∂L(θ)/∂v_w
     = - ( u_c - [exp(u_c · v_w) / Σ_{w' ∈ V} exp(u_c · v_w')] u_c )
     = - u_c (1 - ŷ_w)

And the update rule would be

v_w = v_w - η ∇v_w
    = v_w + η u_c (1 - ŷ_w)

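A self-contained sketch (my own, with made-up sizes and indices) of one stochastic gradient step for a single (c, w) pair using this closed-form gradient.

import numpy as np

rng = np.random.default_rng(0)
V, k, eta = 7, 4, 0.1                 # made-up sizes and learning rate
W_context = rng.normal(size=(k, V))
W_word = rng.normal(size=(k, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

c, w = 1, 3                           # indices of the context word ('sat') and the correct word ('on')

u_c = W_context[:, c]                 # representation of the context word
y_hat = softmax(W_word.T @ u_c)       # predicted distribution over the vocabulary

# gradient of the cross-entropy loss with respect to v_w (the correct word's vector)
grad_v_w = -u_c * (1.0 - y_hat[w])

# update: v_w moves towards u_c by an amount that shrinks as y_hat[w] approaches 1
W_word[:, w] -= eta * grad_v_w
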
This update rule has a nice interpretation:

v_w = v_w + η u_c (1 - ŷ_w)

If ŷ_w is already close to 1 the prediction is correct and v_w is barely changed; if ŷ_w is small, a larger fraction of u_c is added to v_w, pushing v_w towards u_c.

What happens to the representations of two words w and w' which tend to appear in similar contexts (c)?

The training ensures that both v_w and v_w' have a high cosine similarity with u_c, and hence transitively (intuitively) ensures that v_w and v_w' have a high cosine similarity with each other.

This is only an intuition (reasonable).

Haven't come across a formal proof for this!

In practice, instead of a window size of 1 it is common to use a window size of d.

So now,

h = Σ_{i=1}^{d-1} u_{c_i}

The input x ∈ R^(2|V|) stacks the one-hot vectors of the context words (here 'sat' and 'he'), and [W_context, W_context] just means that we are stacking 2 copies of the W_context matrix:

                                           [0]
[ -1  0.5   2   -1  0.5   2 ]              [1]  } sat       [ 2.5]
[  3 -1.0  -2    3 -1.0  -2 ]       .      [0]          =   [-3.0]
[ -2  1.7   3   -2  1.7   3 ]              [0]              [ 4.7]
[W_context, W_context] ∈ R^(k×2|V|)        [0]
                                           [1]  } he

The resultant product would simply be the sum of the columns corresponding to 'sat' and 'he'.

The output layer now gives a distribution over words conditioned on both context words: P(on|sat, he), P(he|sat, he), P(chair|sat, he), P(man|sat, he), ...

Of course, in practice we will not do this expensive matrix multiplication.

If 'he' is the i-th word in the vocabulary and 'sat' is the j-th word, then we will simply access columns W_context[:, i] and W_context[:, j] and add them.

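A one-line sketch of this shortcut (the indices i and j are hypothetical placeholders): add the two columns directly instead of multiplying by a 2|V|-dimensional one-hot vector.

import numpy as np

rng = np.random.default_rng(0)
V, k = 7, 4
W_context = rng.normal(size=(k, V))

i, j = 2, 1                                 # hypothetical indices of 'he' and 'sat'
h = W_context[:, i] + W_context[:, j]       # no |V|-sized matrix-vector product needed
print(h)
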
Now what happens during backpropagation?

Recall that

h = Σ_{i=1}^{d-1} u_{c_i}

and

P(on|sat, he) = e^((W_word h)[k]) / Σ_j e^((W_word h)[j])

Some problems:

Notice that the softmax function at the output is computationally very expensive:

ŷ_w = exp(u_c · v_w) / Σ_{w' ∈ V} exp(u_c · v_w')

(the denominator requires a summation over all |V| words in the vocabulary)

Module 10.5: Skip-gram model
The model that we just saw is called the continuous bag of words model (it predicts an output word given a bag of context words).

We will now see the skip gram model (which predicts context words given an input word).

Notice that the role of context and word has changed now: the one-hot input x ∈ R^|V| for 'on' is mapped through W_word ∈ R^(k×|V|) to h ∈ R^k, and W_context ∈ R^(k×|V|) now sits at the output, predicting the context words (he, sat, a, chair).

In the simple case when there is only one context word, we will arrive at the same update rule for u_c as we did for v_w earlier.

Notice that even when we have multiple context words the loss function would just be a summation of many cross entropy errors:

L(θ) = - Σ_{i=1}^{d-1} log ŷ_{w_i}

[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
42/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
D = [(sat, on), (sat, a), (sat, chair), (on, a), (on, chair), (a, chair), (on, sat), (a, sat), (chair, sat), (a, on), (chair, on), (chair, a)]

D' = [(sat, oxygen), (sat, magic), (chair, sad), (chair, walking)]

Let D be the set of all correct (w, c) pairs in the corpus
Let D' be the set of all incorrect (w, r) pairs in the corpus
D' can be constructed by randomly sampling a context word r which has never appeared with w and creating a pair (w, r)
As before, let v_w be the representation of the word w and u_c be the representation of the context word c
43/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
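As a concrete illustration, the sketch below builds D from a fixed-size window around each word of a toy sentence and builds D' by pairing each word with a randomly chosen vocabulary word it never co-occurred with. The window size, the extra vocabulary words, and the one-negative-per-word choice are assumptions for illustration, not taken from the slides.

```python
import random

sentence = "he sat on a chair".split()
vocab = set(sentence) | {"oxygen", "magic", "sad", "walking"}   # toy vocabulary
window = 2

# D: all (word, context) pairs within the window
D = [(w, c) for i, w in enumerate(sentence)
            for j, c in enumerate(sentence)
            if i != j and abs(i - j) <= window]

# D': for each word, sample one context word that never appeared with it
seen = {w: {c for ww, c in D if ww == w} for w in sentence}
D_neg = [(w, random.choice([r for r in vocab if r not in seen[w] and r != w]))
         for w in sentence]

print(D)
print(D_neg)
```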
[Figure: a sigmoid unit σ over u_c and v_w producing P(z = 1|w, c)]
For a given (w, c) ∈ D we are interested in maximizing
p(z = 1|w, c)
Let us model this probability by

p(z = 1|w, c) = σ(u_c^T v_w) = \frac{1}{1 + e^{-u_c^T v_w}}

44/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: a sigmoid unit σ over u_r and v_w producing P(z = 0|w, r)]
For (w, r) ∈ D' we are interested in maximizing
p(z = 0|w, r)
Again we model this as

p(z = 0|w, r) = 1 - σ(u_r^T v_w) = \frac{1}{1 + e^{u_r^T v_w}}

45/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Combining the two we get:

\text{maximize}_{\theta} \; \prod_{(w,c) \in D} p(z = 1|w, c) \prod_{(w,r) \in D'} p(z = 0|w, r)

= \text{maximize}_{\theta} \; \prod_{(w,c) \in D} p(z = 1|w, c) \prod_{(w,r) \in D'} (1 - p(z = 1|w, r))

= \text{maximize}_{\theta} \; \sum_{(w,c) \in D} \log p(z = 1|w, c) + \sum_{(w,r) \in D'} \log(1 - p(z = 1|w, r))

= \text{maximize}_{\theta} \; \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c^T v_w}} + \sum_{(w,r) \in D'} \log \frac{1}{1 + e^{v_r^T v_w}}

= \text{maximize}_{\theta} \; \sum_{(w,c) \in D} \log \sigma(v_c^T v_w) + \sum_{(w,r) \in D'} \log \sigma(-v_r^T v_w)

where \sigma(x) = \frac{1}{1 + e^{-x}}
46/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
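A minimal sketch of the final expression above, written as a loss for a single positive pair (w, c) and its sampled negatives (i.e., the negative of the quantity being maximized). The vector dimension, the number of negatives, and the random vectors are placeholders, and the parameter-update loop is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_w, v_c, v_rs):
    """Negative of: log sigma(v_c . v_w) + sum_r log sigma(-v_r . v_w)."""
    pos = np.log(sigmoid(v_c @ v_w))
    neg = sum(np.log(sigmoid(-v_r @ v_w)) for v_r in v_rs)
    return -(pos + neg)

rng = np.random.default_rng(0)
v_w  = rng.normal(size=50)         # representation of the word w
v_c  = rng.normal(size=50)         # representation of a true context word c
v_rs = rng.normal(size=(2, 50))    # representations of k = 2 sampled negative contexts r
print(sgns_loss(v_w, v_c, v_rs))
```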
[Figure: the sigmoid unit σ over u_r and v_w producing P(z = 0|w, r)]
In the original paper, Mikolov et al. sample k negative (w, r) pairs for every positive (w, c) pair
The size of D' is thus k times the size of D
The random context word r is drawn from a modified unigram distribution:

r \sim p(r)^{3/4}

i.e., r \sim \frac{count(r)^{3/4}}{N}

N = total number of words in the corpus
47/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
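One way to implement this sampling step is sketched below: raise each word count to the power 3/4 and normalise. Normalising by the sum of the smoothed counts (rather than by N) is my assumption, made so that the probabilities sum to one; the toy corpus and the choice k = 5 are also illustrative.

```python
import numpy as np
from collections import Counter

corpus = "he sat on a chair he sat on a mat".split()
counts = Counter(corpus)
words = list(counts)

probs = np.array([counts[w] ** 0.75 for w in words], dtype=float)
probs /= probs.sum()                       # count(r)^{3/4}, normalised to a proper distribution

def sample_negatives(k=5):
    """Draw k candidate negative context words r ~ count(r)^{3/4}."""
    return list(np.random.choice(words, size=k, p=probs))

print(sample_negatives())
```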
Module 10.6: Contrastive estimation
48/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
49/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Positive: He sat on a chair        Negative: He sat abracadabra a chair
[Figure: two copies of a scoring network; each takes the context and word representations (v_c, v_w), passes them through a hidden layer W_h ∈ R^{2d×h} and an output layer W_out ∈ R^{h×1}, and produces a score s for the positive window and s_r for the corrupted one]
50/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
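The slides only show the two copies of the scoring network (score s for the true window and s_r for the corrupted one). A standard way to train such a scorer, as in Collobert and Weston's pairwise ranking setup, is a hinge loss that pushes s above s_r by a margin; the sketch below assumes that objective, a margin of 1, and a tanh hidden layer, none of which are stated on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 50, 20
W_h   = rng.normal(scale=0.1, size=(2 * d, h))    # hidden layer W_h in R^{2d x h}, as in the figure
W_out = rng.normal(scale=0.1, size=(h, 1))        # output layer W_out in R^{h x 1}

def score(v_w, v_c):
    """Scalar score for a (context, word) pair: W_out^T tanh(W_h^T [v_c; v_w])."""
    z = np.tanh(np.concatenate([v_c, v_w]) @ W_h)
    return (z @ W_out).item()

def contrastive_loss(v_w, v_c, v_r, margin=1.0):
    """Hinge loss: push the positive score s above the corrupted score s_r by `margin`."""
    s, s_r = score(v_w, v_c), score(v_w, v_r)
    return max(0.0, margin - s + s_r)

v_w, v_c, v_r = rng.normal(size=(3, d))
print(contrastive_loss(v_w, v_c, v_r))
```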
Module 10.7: Hierarchical softmax
51/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
52/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
\max \frac{e^{v_c^T u_w}}{\sum_{w' \in V} e^{v_c^T u_{w'}}}

[Figure: the output layer arranged as a binary tree; input: one-hot vector for "sat", hidden layer h = v_c, internal node vectors u_1, u_2, ..., u_V, leaf "on" reached via π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary
There exists a unique path from the root node to a leaf node
Let l(w_1), l(w_2), ..., l(w_p) be the nodes on the path from root to w
Let π(w) be a binary vector such that:
π(w)_k = 1 if the path branches left at node l(w_k)
       = 0 otherwise
53/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the binary tree; path to "on": π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0; node vectors u_1, u_2, ..., u_V; h = v_c; input: one-hot vector for "sat"]
For a given pair (w, c) we are interested in the probability p(w|v_c)
We model this probability as

p(w|v_c) = \prod_k P(π(w)_k | v_c)

For example,

p(on|v_{sat}) = P(π(on)_1 = 1 | v_{sat}) \cdot P(π(on)_2 = 0 | v_{sat}) \cdot P(π(on)_3 = 0 | v_{sat})

54/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the binary tree with node vectors u_1, u_2, ..., u_V; path to "on": π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0; h = v_c; input: one-hot vector for "sat"]
We model

P(π(on)_i = 1) = \frac{1}{1 + e^{-v_c^T u_i}}

P(π(on)_i = 0) = 1 - P(π(on)_i = 1) = \frac{1}{1 + e^{v_c^T u_i}}

The above model ensures that the representation of a context word v_c will have a high (low) similarity with the representation of the node u_i if u_i appears on the path and the path branches to the left (right) at u_i
Again, transitively, the representations of contexts which appear with the same words will have high similarity
55/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the same binary tree]

P(w|v_c) = \prod_{k=1}^{|π(w)|} P(π(w)_k | v_c)

Computing this product requires only |π(w)| (roughly \log_2 |V| for a balanced tree) sigmoid evaluations, instead of the |V| terms needed by the full softmax
56/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
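A small sketch of this path probability: each leaf word stores the internal-node vectors on its path together with the branch directions π(w), and P(π = 1) is modelled with a sigmoid as above. The tree itself, the toy dimensions, and the random vectors are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_word_given_context(v_c, path_nodes, path_bits):
    """P(w | v_c) = prod_k P(pi(w)_k | v_c), with P(pi(w)_k = 1) = sigma(v_c . u_i)."""
    p = 1.0
    for u_i, bit in zip(path_nodes, path_bits):
        p_left = sigmoid(v_c @ u_i)               # probability of branching left at this node
        p *= p_left if bit == 1 else (1.0 - p_left)
    return p

rng = np.random.default_rng(0)
k = 50
v_c = rng.normal(size=k)                          # h = v_c, the context representation
path_nodes = rng.normal(size=(3, k))              # u_1, u_2, u_3: nodes on the path from root to "on"
path_bits = [1, 0, 0]                             # pi(on) = (1, 0, 0): left, then right, then right
print(p_word_given_context(v_c, path_nodes, path_bits))
```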
Module 10.8: GloVe representations
57/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations
Predict based methods learn word representations using co-occurrence information
Why not combine the two (count and learn)?
58/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

X =
          human   machine  system  for     ...   user
human     2.01    2.01     0.23    2.14    ...   0.43
machine   2.01    2.01     0.23    2.14    ...   0.43
system    0.23    0.23     1.17    0.96    ...   1.29
for       2.14    2.14     0.96    1.87    ...   -0.13
...
user      0.43    0.43     1.29    -0.13   ...   1.71

P(j|i) = \frac{X_{ij}}{\sum_j X_{ij}} = \frac{X_{ij}}{X_i},    X_{ij} = X_{ji}

X_{ij} encodes important global information about the co-occurrence between i and j (global: because it is computed from the entire corpus)
Why not learn word vectors which are faithful to this information?
For example, enforce

v_i^T v_j = \log P(j|i) = \log X_{ij} - \log X_i

Similarly,

v_j^T v_i = \log P(i|j) = \log X_{ij} - \log X_j

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

[X = the co-occurrence matrix from the previous slide; P(j|i) = X_{ij}/X_i, X_{ij} = X_{ji}]

Adding the two equations we get

2 v_i^T v_j = 2 \log X_{ij} - \log X_i - \log X_j

v_i^T v_j = \log X_{ij} - \frac{1}{2} \log X_i - \frac{1}{2} \log X_j

Note that \log X_i and \log X_j depend only on the words i & j and we can think of them as word-specific biases which will be learned

v_i^T v_j = \log X_{ij} - b_i - b_j
v_i^T v_j + b_i + b_j = \log X_{ij}

We can then formulate this as the following optimization problem

\min_{v_i, v_j, b_i, b_j} \sum_{i,j} (\underbrace{v_i^T v_j + b_i + b_j}_{\text{predicted value using model parameters}} - \underbrace{\log X_{ij}}_{\text{actual value computed from the given corpus}})^2

60/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

\min_{v_i, v_j, b_i, b_j} \sum_{i,j} (v_i^T v_j + b_i + b_j - \log X_{ij})^2

61/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
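Below is a minimal sketch that fits this objective with plain stochastic gradient descent on a toy symmetric count matrix. It implements exactly the unweighted squared error shown above (the published GloVe model additionally weights each term by a function f(X_ij)); the learning rate, matrix size, and number of passes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, lr = 6, 10, 0.05                       # toy vocabulary size, embedding size, learning rate
X = rng.integers(1, 10, size=(n, n)).astype(float)
X = (X + X.T) / 2                              # make the co-occurrence counts symmetric: X_ij = X_ji
logX = np.log(X)

V = rng.normal(scale=0.1, size=(n, dim))       # word vectors v_i (one per row)
b = np.zeros(n)                                # word-specific biases b_i

for _ in range(200):                           # plain SGD over all (i, j) cells
    for i in range(n):
        for j in range(n):
            err = V[i] @ V[j] + b[i] + b[j] - logX[i, j]   # predicted - actual
            grad_i, grad_j = err * V[j], err * V[i]
            V[i] -= lr * grad_i
            V[j] -= lr * grad_j
            b[i] -= lr * err
            b[j] -= lr * err

loss = sum((V[i] @ V[j] + b[i] + b[j] - logX[i, j]) ** 2 for i in range(n) for j in range(n))
print(loss)
```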
Module 10.9: Evaluating word representations
62/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
How do we evaluate the learned word representations?
63/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Semantic Relatedness
Ask humans to judge the relatedness between a pair of words
Compute the cosine similarity between the corresponding word vectors learned by the model
Given a large number of such word pairs, compute the correlation between S_model & S_human, and compare different models

S_human(cat, dog) = 0.8
S_model(cat, dog) = \frac{v_{cat}^T v_{dog}}{\|v_{cat}\| \|v_{dog}\|} = 0.7

Model 1 is better than Model 2 if
correlation(S_model1, S_human) > correlation(S_model2, S_human)
64/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
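A sketch of this evaluation: compute the cosine similarity for each word pair and then the correlation between the model scores and the human judgements (Pearson correlation here; published benchmarks often report Spearman). The embeddings and human scores below are made-up placeholders.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["cat", "dog", "car", "banana"]}   # placeholder embeddings
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "banana")]
s_human = np.array([0.8, 0.3, 0.1])               # hypothetical human relatedness judgements

s_model = np.array([cosine(emb[a], emb[b]) for a, b in pairs])
correlation = np.corrcoef(s_model, s_human)[0, 1] # higher correlation = better model
print(correlation)
```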
Synonym Detection
Given: a term and four candidate synonyms
Pick the candidate which has the largest cosine similarity with the term
Compute the accuracy of different models and compare

Term: levied
Candidates: {imposed, believed, requested, correlated}
65/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Analogy
Semantic Analogy: Find the nearest neighbour of v_{brother} - v_{sister} + v_{grandson}
Syntactic Analogy: Find the nearest neighbour of v_{work} - v_{works} + v_{speak}
66/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
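A sketch of how such analogy queries are scored: form the query vector and return the vocabulary word whose vector has the highest cosine similarity with it, excluding the query words themselves. The embeddings here are random placeholders, so the output only demonstrates the mechanics.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(emb, a, b, c):
    """Return the nearest neighbour of v_a - v_b + v_c, excluding a, b, c themselves."""
    query = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))

rng = np.random.default_rng(0)
words = ["brother", "sister", "grandson", "granddaughter", "work", "works", "speak", "speaks"]
emb = {w: rng.normal(size=50) for w in words}     # placeholder vectors
print(analogy(emb, "brother", "sister", "grandson"))   # the semantic analogy query from the slide
```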
So which algorithm gives the best result?
Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks.
Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction based models on similarity tasks but not on analogy tasks.
67/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Module 10.10: Relation between SVD & word2Vec
68/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
69/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
[Figure: the skip-gram network — input x ∈ R^{|V|} (one-hot for "on"), hidden layer h ∈ R^k, weights W_word, W_context ∈ R^{k×|V|}, outputs "he", "sat", "a", "chair"]
Recall that SVD does a matrix factorization of the co-occurrence matrix
Levy et al. [2015] show that word2vec also implicitly does a matrix factorization
What does this mean?
Recall that word2vec gives us W_context & W_word
Turns out that we can also show that

M = W_context^T W_word

where

M_{ij} = PMI(w_i, c_j) - \log(k)
k = number of negative samples

70/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
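To connect the two views, here is a sketch that builds a shifted PMI matrix M_ij = PMI(w_i, c_j) − log k from a toy count matrix and factorizes it with SVD to obtain low-dimensional word vectors. The toy counts, the add-one smoothing, and the clipping of negative entries (the common positive-PMI variant) are assumptions, not part of the slides.

```python
import numpy as np

k_neg = 5                                         # number of negative samples k
rng = np.random.default_rng(0)
counts = rng.integers(0, 10, size=(8, 8)).astype(float) + 1.0   # toy word-context counts (+1 smoothing)

total = counts.sum()
p_wc = counts / total                             # joint probability estimate P(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total   # marginal P(w)
p_c = counts.sum(axis=0, keepdims=True) / total   # marginal P(c)

pmi = np.log(p_wc / (p_w * p_c))
M = np.maximum(pmi - np.log(k_neg), 0.0)          # shifted PMI, clipped at 0 (positive-PMI variant)

U, S, Vt = np.linalg.svd(M)
dim = 4
word_vectors = U[:, :dim] * np.sqrt(S[:dim])      # low-rank factor, playing the role of W_word
print(word_vectors.shape)                         # (8, 4)
```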