
CS224d

Deep Learning
for Natural Language Processing


Lecture 2: Word Vectors


Richard Socher
How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.


How to represent meaning in a computer?

Common answer: use a taxonomy like WordNet that has hypernyms (is-a) relationships and synonym sets, e.g. for "good".

An example hypernym chain:
[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'),
 Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'),
 Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'),
 Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'),
 Synset('entity.n.01')]

WordNet synonym sets for "good":
S: (adj) full, good
S: (adj) estimable, good, honorable, respectable
S: (adj) beneficial, good
S: (adj) good, just, upright
S: (adj) adept, expert, good, practiced, proficient, skillful
S: (adj) dear, good, near
S: (adj) good, right, ripe
S: (adv) well, good
S: (adv) thoroughly, soundly, good
S: (n) good, goodness
S: (n) commodity, trade good, good
Problems with this discrete representation

Great as a resource, but missing nuances, e.g.
- synonyms: adept, expert, good, practiced, proficient, skillful?
- Missing new words (impossible to keep up to date): wicked, badass, nifty, crack, ace, wizard, genius, ninja
- Subjective
- Requires human labor to create and adapt
- Hard to compute accurate word similarity
Problems with this discrete representation

The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk.

In vector space terms, this is a vector with one 1 and a lot of zeroes:

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

Dimensionality: 20K (speech), 50K (PTB), 500K (big vocab), 13M (Google 1T)

We call this a "one-hot" representation. Its problem:

motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
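A minimal sketch (mine, with a toy vocabulary) of why this is a problem: the dot product of any two distinct one-hot vectors is 0, so the representation carries no notion of similarity.

```python
import numpy as np

vocab = ["the", "a", "at", "conference", "walk", "motel", "hotel"]  # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("motel") @ one_hot("hotel"))  # 0.0 -- "motel" and "hotel" look unrelated
print(one_hot("hotel") @ one_hot("hotel"))  # 1.0
```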
Distributional similarity based representations

You can get a lot of value by representing a word by means of its neighbors.

"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)

One of the most successful ideas of modern statistical NLP:

...government debt problems turning into banking crises as has happened in...
...saying that Europe needs unified banking regulation to replace the hodgepodge...

These neighboring words will represent "banking".
How to make neighbors represent words?

Answer: with a co-occurrence matrix X

- 2 options: full document vs. windows
- A word-document co-occurrence matrix will give general topics (all sports terms will have similar entries), leading to Latent Semantic Analysis
- Instead: a window around each word captures both syntactic (POS) and semantic information


Window based co-occurrence matrix

- Window length 1 (more common: 5-10)
- Symmetric (irrelevant whether left or right context)

Example corpus:
I like deep learning.
I like NLP.
I enjoy flying.


Window based co-occurrence matrix

Example corpus:
I like deep learning.
I like NLP.
I enjoy flying.

counts     I  like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0      0        0     0     0
like       2   0     0      1      0        1     0     0
enjoy      1   0     0      0      0        0     1     0
deep       0   1     0      0      1        0     0     0
learning   0   0     0      1      0        0     0     1
NLP        0   1     0      0      0        0     0     1
flying     0   0     1      0      0        0     0     1
.          0   0     0      0      1        1     1     0
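A small sketch (my own, not from the slides) that reproduces this window-1 co-occurrence matrix from the three-sentence corpus:

```python
import numpy as np

corpus = ["I like deep learning .",
          "I like NLP .",
          "I enjoy flying ."]

# Use the same word order as the table above.
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Symmetric window of length 1: count each left/right neighbor once.
window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in (s.split() for s in corpus):
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[word_to_id[w], word_to_id[sent[j]]] += 1

print(X)  # reproduces the counts table above
```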
Problems with simple co-occurrence vectors

- They increase in size with the vocabulary
- Very high dimensional: require a lot of storage
- Subsequent classification models have sparsity issues

As a result, models are less robust.


Solution: Low dimensional vectors

- Idea: store most of the important information in a fixed, small number of dimensions: a dense vector
- Usually around 25-1000 dimensions

How to reduce the dimensionality?


Method 1: Dimensionality reduction on X

Singular Value Decomposition of the co-occurrence matrix X:

  X = U S V^T

Keeping only the k largest singular values (and the corresponding columns of U and V) gives

  X_k = U_k S_k V_k^T

which is the best rank-k approximation to X in terms of least squares. The columns of U_k then serve as k-dimensional word vectors.

[Figure: the singular value decomposition of matrix X, reproduced from Rohde, Gonnerman & Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence"]
Simple SVD word vectors in Python

Corpus: I like deep learning. I like NLP. I enjoy flying.

Print the first two columns of U, corresponding to the 2 biggest singular values.
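The slides' code did not survive extraction; a minimal sketch of what these two slides show, using the co-occurrence matrix from the table above:

```python
import numpy as np
import matplotlib.pyplot as plt

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],   # window-1 co-occurrence counts
              [2, 0, 0, 1, 0, 1, 0, 0],   # (same matrix as the table above)
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Plot each word at the coordinates given by the first two columns of U,
# i.e. the directions associated with the two largest singular values.
for i, word in enumerate(words):
    plt.scatter(U[i, 0], U[i, 1])
    plt.annotate(word, (U[i, 0], U[i, 1]))
plt.show()
```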


Word meaning is defined in terms of vectors

In all subsequent models, including deep learning models, a word is represented as a dense vector, e.g.

linguistics = [0.286  0.792  0.177  0.107  0.109  0.542  0.349  0.271]
Hacks to X

Problem: function words (the, he, has) are too frequent, so syntax has too much impact. Some fixes (see the sketch below):
- Cap the counts: min(X, t), with t ~ 100
- Or ignore the function words entirely
- Use ramped windows that count closer words more
- Use Pearson correlations instead of counts, then set negative values to 0
- And more
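A quick sketch (my own) of the first two fixes applied to a raw count matrix X:

```python
import numpy as np

def apply_hacks(X, stop_ids=(), t=100):
    """Cap large co-occurrence counts and drop rows/columns for function words."""
    X = np.minimum(X, t)                        # min(X, t) with t ~ 100
    X = np.delete(X, list(stop_ids), axis=0)    # remove function-word rows
    X = np.delete(X, list(stop_ids), axis=1)    # ... and columns
    return X
```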


Interesting semantic patterns emerge in the vectors

[Figure from Rohde et al. 2005, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence": multidimensional scaling / hierarchical clustering of three noun classes (body parts such as WRIST and ANKLE, animals such as DOG and CAT, and places such as CHICAGO and FRANCE); members of each class cluster together.]
Interesting syntactic patterns emerge in the vectors

[Figure from Rohde et al. 2005: multidimensional scaling of the present, past, progressive, and past-participle forms for eight verb families (CHOOSE/CHOSE/CHOSEN/CHOOSING, STEAL/STOLE/STOLEN/STEALING, TAKE/TOOK/TAKEN/TAKING, SPEAK/SPOKE/SPOKEN/SPEAKING, THROW/THREW/THROWN/THROWING, SHOW/SHOWED/SHOWN/SHOWING, EAT/ATE/EATEN/EATING, GROW/GREW/GROWN/GROWING); corresponding inflections of different verbs sit in similar relative positions.]
Interesting semantic patterns emerge in the vectors

[Figure from Rohde et al. 2005: multidimensional scaling of nouns (DRIVER, JANITOR, SWIMMER, STUDENT, TEACHER, DOCTOR, BRIDE, PRIEST) and their associated verbs (DRIVE, CLEAN, SWIM, LEARN, TEACH, TREAT, MARRY, PRAY); each noun appears near the verb it typically performs.]
Problems with SVD

- Computational cost scales quadratically for an n x m matrix: O(mn^2) flops (when n < m), which is bad for millions of words or documents
- Hard to incorporate new words or documents
- Different learning regime than other deep learning models

Idea: Directly learn low-dimensional word vectors

An old idea, relevant for this lecture and deep learning:
- Learning representations by back-propagating errors (Rumelhart et al., 1986)
- A neural probabilistic language model (Bengio et al., 2003)
- NLP (almost) from Scratch (Collobert & Weston, 2008)
- A recent, even simpler and faster model: word2vec (Mikolov et al. 2013), introduced now


Main idea of word2vec

- Instead of capturing co-occurrence counts directly, predict the surrounding words of every word
- Both approaches are quite similar; see "GloVe: Global Vectors for Word Representation" by Pennington et al. (2014) and Levy and Goldberg (2014), more on this later
- Faster, and can easily incorporate a new sentence/document or add a word to the vocabulary


Details of word2vec

Predict the surrounding words in a window of length m of every word.

Objective function: maximize the log probability of any context word given the current center word:

  J(θ) = (1/T) Σ_{t=1..T} Σ_{-m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)

where θ represents all the variables we optimize.


Details of word2vec

Predict the surrounding words in a window of length m of every word.

The simplest first formulation defines p(w_{t+j} | w_t) using the softmax function:

  p(o | c) = exp(u_o^T v_c) / Σ_{w=1}^{W} exp(u_w^T v_c)

where o is the outside (or output) word id, c is the center word id, v_c and u_o are the "center" and "outside" vector representations of c and o, and W is the number of words in the vocabulary.

Every word has two vectors! This is essentially dynamic logistic regression.

(As the word2vec paper notes, this full-softmax formulation is impractical because the cost of computing the normalization is proportional to W, which is often large: 10^5 to 10^7 terms. The hierarchical softmax, first introduced by Morin and Bengio, is one cheaper alternative; negative sampling, discussed below, is another.)
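A small numerical sketch (my own, with made-up dimensions and random vectors) of this softmax over center and outside vectors:

```python
import numpy as np

W, d = 10000, 100                         # vocabulary size and vector dimension (made up)
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(W, d))    # center-word vectors v_w
U = rng.normal(scale=0.1, size=(W, d))    # outside-word vectors u_w

def p_outside_given_center(o, c):
    """p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]                     # u_w . v_c for every word w
    scores -= scores.max()                # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=42, c=7))
```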
Cost/objective functions

We will optimize (maximize or minimize) our objective/cost functions.

For now: minimize, via gradient descent.

Refresher with a trivial example (from Wikipedia): find a local minimum of the function f(x) = x^4 - 3x^3 + 2, which has derivative f'(x) = 4x^3 - 9x^2.
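A minimal gradient-descent sketch for this example; the starting point and step size are my own choices:

```python
def f_prime(x):
    """Derivative of f(x) = x**4 - 3*x**3 + 2."""
    return 4 * x**3 - 9 * x**2

x = 6.0        # starting point
step = 0.01    # learning rate
for _ in range(10000):
    x -= step * f_prime(x)   # move against the gradient

print(x)       # approaches 2.25, the local minimum of f
```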


Derivations of gradient

- Whiteboard (see the video if you're not in class ;)
- The basic Lego piece
- Useful basics
- If in doubt: write it out with indices
- Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

  dy/dx = (dy/du) * (du/dx)


Chain rule

Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

  dy/dx = (dy/du) * (du/dx)

Simple example (see below):
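The slide's own example was an image that did not survive extraction; a generic worked example in its place:

```latex
% Worked chain-rule example:
% let y = (3x^2 + 1)^5, i.e. u = 3x^2 + 1 and y = u^5. Then
\[
\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}
              = 5u^4 \cdot 6x
              = 30x\,(3x^2 + 1)^4 .
\]
```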


Interactive whiteboard session!

Let's derive the gradient together, for one example window and one example outside word: differentiate log p(o | c), with p(o | c) the softmax defined above, with respect to the center vector v_c.


Approximations: PSet 1

- With large vocabularies this objective function is not scalable and would train too slowly! Why? (The normalization sums over the whole vocabulary.)
- Idea: approximate the normalization, or
- Define a negative-prediction objective that only samples a few words that do not appear in the context
- Similar to focusing mostly on positive correlations
- You will derive and implement this in PSet 1! (One common formulation is sketched below.)
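For reference, a common formulation of such a sampled objective is the negative-sampling loss of Mikolov et al. (2013); the exact form derived in PSet 1 may differ:

```latex
% Skip-gram with negative sampling (Mikolov et al., 2013), for one
% (center c, outside o) pair: reward the observed pair and penalize k
% "negative" words w_i drawn from a noise distribution P_n(w).
\[
J_{\text{neg}}(o, c) =
  \log \sigma\!\left(u_o^{\top} v_c\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
      \left[ \log \sigma\!\left(-u_{w_i}^{\top} v_c\right) \right],
\qquad \sigma(x) = \tfrac{1}{1 + e^{-x}} .
\]
```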


Linear relationships in word2vec

These representations are very good at encoding dimensions of similarity!

- Analogies testing dimensions of similarity can be solved quite well just by doing vector subtraction in the embedding space
- Syntactically:
    x_apple - x_apples ≈ x_car - x_cars ≈ x_family - x_families
  Similarly for verb and adjective morphological forms
- Semantically (SemEval 2012 task 2):
    x_shirt - x_clothing ≈ x_chair - x_furniture
    x_king - x_man ≈ x_queen - x_woman
Count based vs. direct prediction

Count based: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert)
- Fast training
- Efficient usage of statistics
- Primarily used to capture word similarity
- Disproportionate importance given to large counts

Direct prediction: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al; Collobert & Weston; Huang et al; Mnih & Hinton; Mikolov et al; Mnih & Kavukcuoglu)
- Scales with corpus size
- Inefficient usage of statistics
- Generates improved performance on other tasks
- Can capture complex patterns beyond word similarity


Combining the best of both worlds: GloVe

- Fast training
- Scalable to huge corpora
- Good performance even with a small corpus, and small vectors
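For reference, the GloVe objective from Pennington et al. (2014), cited earlier, fits dot products of word vectors to log co-occurrence counts with a weighted least-squares loss:

```latex
% GloVe objective (Pennington et al., 2014): weighted least-squares fit of
% dot products (plus biases) to log co-occurrence counts X_ij; the weight
% f(x) caps the influence of very frequent pairs.
\[
J = \sum_{i,j=1}^{W} f\!\left(X_{ij}\right)
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
\]
```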
GloVe results

Nearest words to "frog":
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

(The slide shows pictures labeled litoria, leptodactylidae, rana, and eleutherodactylus.)
Word Analogies

Test for linear relationships, examined by Mikolov et al. (2014):

a : b :: c : ?        e.g.  man : woman :: king : ?

  king  [0.30 0.70]
- man   [0.20 0.20]
+ woman [0.60 0.30]
-------------------
        [0.70 0.80] ≈ queen
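A sketch (my own) of answering such analogies by vector arithmetic plus a cosine-similarity nearest-neighbor search; the `vectors` dict below is a toy stand-in for real embeddings:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word d that best completes a : b :: c : d."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):            # skip the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy vectors matching the numbers on the slide:
vectors = {"king":  np.array([0.30, 0.70]),
           "man":   np.array([0.20, 0.20]),
           "woman": np.array([0.60, 0.30]),
           "queen": np.array([0.70, 0.80])}
print(analogy("man", "woman", "king", vectors))  # -> "queen"
```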
GloVe visualizations

GloVe visualizations: Company - CEO

GloVe visualizations: Superlatives


Word embedding matrix

- Initialize most word vectors of future models with our pre-trained embedding matrix L, an n x |V| matrix with one n-dimensional column per vocabulary word (aardvark, a, at, ...)
- Also called a look-up table
- Conceptually you get a word's vector x by left-multiplying a one-hot vector e (of length |V|) by L: x = Le (see the sketch below)
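A small sketch (dimensions made up) of the look-up-table view; in practice one indexes a column directly rather than multiplying by a one-hot vector:

```python
import numpy as np

n, V = 50, 100000                        # embedding size and vocabulary size (made up)
rng = np.random.default_rng(0)
L = rng.normal(scale=0.1, size=(n, V))   # embedding matrix, one column per word

word_id = 123                            # id of some word in the vocabulary
e = np.zeros(V)
e[word_id] = 1.0

x_matmul = L @ e                         # conceptual view: x = L e
x_lookup = L[:, word_id]                 # what is actually done: a column lookup
print(np.allclose(x_matmul, x_lookup))   # True
```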
Advantages of low dimensional word vectors

What is the major benefit of deep learned word vectors?

The ability to also propagate any information into them via neural networks (next lecture).

  P(c | d, λ) = exp(λ^T f(c, d)) / Σ_{c'} exp(λ^T f(c', d))

[Diagram: a small neural network with inputs x1, x2, x3 and a bias +1, hidden units a1, a2, and a softmax output over classes c1, c2, c3]
Advantages of low dimensional word vectors

- Word vectors will form the basis for all subsequent lectures.
- All our semantic representations will be vectors!

Next lecture:
- Some more details about word vectors
- Predict labels for words in context for solving lots of different tasks
