Deep Learning
for Natural Language Processing
Lecture 2: Word Vectors
Richard Socher
How do we represent the meaning of a word?
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) 50K (PTB) 500K (big vocab) 13M (Google 1T)
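The vector above is a "one-hot" vector: all zeros with a single 1 at the word's index in the vocabulary. A minimal Python sketch, with a small assumed toy vocabulary, illustrating why one-hot vectors carry no notion of similarity (any two distinct one-hot vectors are orthogonal):

import numpy as np

# Toy vocabulary; each word's index determines where its 1 goes (assumed ordering)
vocab = ["aardvark", "hotel", "motel", "zebra"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# "hotel" and "motel" look exactly as unrelated as "hotel" and "zebra":
print(one_hot("hotel") @ one_hot("motel"))  # 0.0
print(one_hot("hotel") @ one_hot("zebra"))  # 0.0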
Distributional similarity based representations
Example: the word banking represented by its neighbors in context, e.g. "... saying that Europe needs unified banking regulation to replace the hodgepodge ..."
How to make neighbors represent words?
[Figure (from Rohde, Gonnerman & Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence"): the singular value decomposition of matrix X, X = U S V^T, and its truncation to the largest k singular values, X̂ = U_k S_k V_k^T. X̂ is the best rank k approximation to X, in terms of least squares.]
Simple SVD word vectors in Python
Corpus:
I like deep learning. I like NLP. I enjoy flying.
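The code from this slide is not preserved in this extraction; a minimal numpy/matplotlib sketch of the idea, assuming a window size of 1 and co-occurrence counts filled in by hand for the corpus above, might look like:

import numpy as np
import matplotlib.pyplot as plt

# Vocabulary of the toy corpus above
words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]

# Window-based co-occurrence counts (window size 1), filled in by hand
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],   # I
    [2, 0, 0, 1, 0, 1, 0, 0],   # like
    [1, 0, 0, 0, 0, 0, 1, 0],   # enjoy
    [0, 1, 0, 0, 1, 0, 0, 0],   # deep
    [0, 0, 0, 1, 0, 0, 0, 1],   # learning
    [0, 1, 0, 0, 0, 0, 0, 1],   # NLP
    [0, 0, 1, 0, 0, 0, 0, 1],   # flying
    [0, 0, 0, 0, 1, 1, 1, 0],   # .
])

# SVD; the first two columns of U give 2-d word vectors to plot
U, s, Vh = np.linalg.svd(X, full_matrices=False)

for i, word in enumerate(words):
    plt.text(U[i, 0], U[i, 1], word)
plt.xlim(U[:, 0].min() - 0.1, U[:, 0].max() + 0.1)
plt.ylim(U[:, 1].min() - 0.1, U[:, 1].max() + 0.1)
plt.show()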
Hacks to X
[Figure 13 (from "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence", Rohde et al. 2005): Multidimensional scaling for nouns and their associated verbs. Verb inflections cluster together (e.g. STEAL/STOLE/STOLEN/STEALING, SPEAK/SPOKE/SPOKEN/SPEAKING, EAT/ATE/EATEN/EATING), and nouns appear near their associated verbs (e.g. SWIMMER-SWIM, TEACHER-TEACH, PRIEST-PRAY, BRIDE-MARRY).]
Table: The 10 nearest neighbors and their percent correlation similarities for a set of nouns, under the COALS-14K model.
Query nouns: gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet
Problems with SVD
The skip-gram softmax probability of an outside word o given a center word c:

p(o | c) = exp(u_o^T v_c) / Σ_{w=1}^{W} exp(u_w^T v_c)

where o is the outside (or output) word id, c is the center word id, u_o and v_c are the "outside" and "center" vector representations of o and c, and W is the number of words in the vocabulary.

Every word has two vectors!
This is essentially dynamic logistic regression.

(From the word2vec paper: this formulation is impractical because the cost of computing the normalization is proportional to W, which is often large (10^5 to 10^7 terms); a computationally efficient approximation of the full softmax is the hierarchical softmax, first introduced by Morin and Bengio.)
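A minimal numpy sketch of this softmax. The matrices U (one "outside" row vector per word) and Vc (one "center" row vector per word) and the toy sizes are assumptions for illustration, not trained parameters:

import numpy as np

V, d = 10000, 100          # toy vocabulary size and vector dimensionality (assumed)
U = np.random.randn(V, d)  # "outside" vectors u_w, one row per word
Vc = np.random.randn(V, d) # "center" vectors v_w, one row per word

def p_outside_given_center(o, c):
    # p(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
    scores = U @ Vc[c]        # u_w^T v_c for every word w
    scores -= scores.max()    # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()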
Cost/Objective functions
We will optimize (maximize or minimize) our objective/cost functions
For now: minimize → gradient descent
Refresher with trivial example: (from Wikipedia)
Find a local minimum of the function f(x) = x^4 - 3x^3 + 2, with derivative f'(x) = 4x^3 - 9x^2.
Simple example:
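The code from this slide is not preserved in the extraction; a small Python sketch of the gradient-descent loop for this function (the starting point and step size follow the Wikipedia example the slide cites) could look like:

# Gradient descent on f(x) = x**4 - 3*x**3 + 2, using f'(x) = 4*x**3 - 9*x**2
def f_derivative(x):
    return 4 * x**3 - 9 * x**2

x_old = 0.0
x_new = 6.0          # starting point
step_size = 0.01     # learning rate
precision = 0.00001

while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - step_size * f_derivative(x_old)

print("Local minimum occurs at", x_new)   # approaches x = 9/4 = 2.25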
Count based vs direct prediction
Fast training
Scalable to huge corpora
Good performance even with small corpus, and small vectors
GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
Word Analogies
a:b :: c:?
man:woman :: king:?
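A small sketch of how such analogies are typically answered with word vectors: take x_b - x_a + x_c and return the nearest word by cosine similarity. The `vectors` dictionary here is a hypothetical stand-in for trained embeddings:

import numpy as np

def analogy(a, b, c, vectors):
    # Answer a:b :: c:? by finding the word whose vector is closest
    # (by cosine similarity) to x_b - x_a + x_c, excluding a, b, c themselves.
    # `vectors` is a {word: np.ndarray} dict of word embeddings (assumed given).
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = float(vec @ target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. analogy("man", "woman", "king", vectors)  # ideally returns "queen"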
[Figure residue: the word embedding (lookup) matrix L, with one n-dimensional column per vocabulary word (aardvark, a, at, ...), whose columns serve as inputs x1, x2, x3 (plus a bias unit +1) to a neural network.]
Advantages of low dimensional word vectors