Natural Language Processing With Deep Learning CS224N/Ling284
Christopher Manning
Lecture 2: Word Vectors, Word Senses, and Classifier Review
Lecture Plan
Lecture 2: Word Vectors and Word Senses
1. Finish looking at word vectors and word2vec (10 mins)
2. Optimization basics (8 mins)
3. Can we capture this essence more effectively by counting? (12m)
4. The GloVe model of word vectors (10 min)
5. Evaluating word vectors (12 mins)
6. Word senses (6 mins)
7. Review of classification and how neural nets differ (10 mins)
8. Course advice (2 mins)
Goal: be able to read word embeddings papers by the end of class
1. Review: Main idea of word2vec
• Start with random word vectors
• Iterate through each word in the whole corpus
• Try to predict surrounding words using word vectors
• e.g. $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$
• $P(o \mid c) = \dfrac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$
• Update vectors so you can predict better
• This algorithm learns word vectors that capture word similarity and meaningful directions in the word space
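To make the formula concrete, here is a minimal NumPy sketch of the naive-softmax probability $P(o \mid c)$; the toy vocabulary size, dimensionality, and the matrices U (outside vectors) and V (center vectors) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 4                     # toy sizes, for illustration only
U = rng.normal(size=(vocab_size, dim))     # rows are "outside" vectors u_w
V = rng.normal(size=(vocab_size, dim))     # rows are "center" vectors v_w

def p_outside_given_center(o, c):
    """P(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]                      # dot product of every u_w with v_c
    scores -= scores.max()                 # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=3, c=5))    # a probability between 0 and 1
```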
Word2vec parameters and computations
[Figure: outside-vector matrix U and center-vector matrix V; the dot products $U v_c$ are turned into probabilities by softmax$(U v_c)$]
2. Optimization: Gradient Descent
• We have a cost function $J(\theta)$ we want to minimize
• Gradient Descent is an algorithm to minimize $J(\theta)$
• Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.
Note: Our objectives may not be convex like this [figure of a convex function]
Gradient Descent
• Update equation (in matrix notation): $\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the step size (learning rate)
• Algorithm:
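A minimal runnable sketch of this loop on a toy quadratic objective (the objective, starting point, and step count are illustrative stand-ins, not the word2vec cost):

```python
import numpy as np

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
def J(theta):
    return 0.5 * np.sum(theta ** 2)

def grad_J(theta):
    return theta

theta = np.array([3.0, -2.0, 1.0])          # current parameter values
alpha = 0.1                                 # step size (learning rate)
for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # small step along the negative gradient
print(theta, J(theta))                      # theta approaches the minimizer at 0
```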
Stochastic Gradient Descent
• Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions!)
• So $\nabla_\theta J(\theta)$ is very expensive to compute
• You would wait a very long time before making a single update!
• Solution: Stochastic Gradient Descent (SGD): repeatedly sample windows, and update after each one
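A minimal sketch of the stochastic variant on a toy least-squares problem: the "corpus" is a large set of (x, y) pairs, and each update uses the gradient of just one sampled example rather than the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(size=100_000)                 # stand-in for "all windows in the corpus"
ys = 2.0 * xs + 0.1 * rng.normal(size=xs.size)

theta, alpha = 0.0, 0.01
for _ in range(20_000):
    i = rng.integers(xs.size)                 # sample ONE example per update
    grad_i = (theta * xs[i] - ys[i]) * xs[i]  # gradient of 0.5 * (theta*x - y)^2
    theta -= alpha * grad_i                   # cheap, noisy step
print(theta)                                  # close to 2.0, without full-corpus gradients
```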
Stochastic gradients with word vectors!
• Iteratively take gradients at each such window for SGD
• But in each window, we only have at most 2m + 1 words, so $\nabla_\theta J_t(\theta)$ is very sparse!
Stochastic gradients with word vectors!
• We might only update the word vectors that actually appear!
[Figure: a $|V| \times d$ gradient matrix that is zero except for the rows of words appearing in the window]
1b. Word2vec: More details
Why two vectors? → Easier optimization. Average both at the end
• But can do algorithm with just one vector per word
Two model variants:
1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center
word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
We presented: Skip-gram model
• The normalization factor in $P(o \mid c) = \dfrac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$ is too computationally expensive, hence negative sampling:
• Main idea: train binary logistic regressions for a true pair (center
word and word in its context window) versus several noise pairs
(the center word paired with a random word)
The skip-gram model with negative sampling (HW2)
• $P(w) = U(w)^{3/4} / Z$,
the unigram distribution U(w) raised to the 3/4 power
(We provide this function in the starter code).
• The power makes less frequent words be sampled more often
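As an illustration of the effect of the 3/4 power, here is a small sketch (with made-up unigram counts, not the starter-code function) that builds the sampling distribution $P(w) = U(w)^{3/4}/Z$ and draws negative samples from it.

```python
import numpy as np

# Made-up unigram counts; in practice U(w) comes from the training corpus.
counts = {"the": 1000, "learning": 50, "aardvark": 2}
words = list(counts)
u = np.array([counts[w] for w in words], dtype=float)

p_unigram = u / u.sum()        # plain unigram distribution U(w)
p_neg = u ** 0.75              # U(w)^{3/4}
p_neg /= p_neg.sum()           # divide by Z so it sums to 1

print(dict(zip(words, p_unigram.round(4))))  # rare words almost never sampled
print(dict(zip(words, p_neg.round(4))))      # rare words get boosted probability

rng = np.random.default_rng(0)
print(rng.choice(words, size=5, p=p_neg))    # 5 sampled "negative" words
```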
3. Why not capture co-occurrence counts directly?
Example: Window based co-occurrence matrix
• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
Window based co-occurrence matrix
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts    I  like  enjoy  deep  learning  NLP  flying  .
I         0   2     1      0      0        0     0     0
like      2   0     0      1      0        1     0     0
enjoy     1   0     0      0      0        0     1     0
deep      0   1     0      0      1        0     0     0
learning  0   0     0      1      0        0     0     1
NLP       0   1     0      0      0        0     0     1
flying    0   0     1      0      0        0     0     1
.         0   0     0      0      1        1     1     0
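A short illustrative sketch (not from the slides) that builds this matrix from the example corpus with window length 1:

```python
import numpy as np

corpus = ["I like deep learning .".split(),
          "I like NLP .".split(),
          "I enjoy flying .".split()]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1   # symmetric: left and right context
print(X)    # reproduces the counts in the table above
```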
Problems with simple co-occurrence vectors
• The vectors increase in size with vocabulary and are very high dimensional: they require a lot of storage
• Subsequent classification models have sparsity issues, so models are less robust
Solution: Low dimensional vectors
• Idea: store “most” of the important information in a fixed, small
number of dimensions: a dense vector
Method: Dimensionality Reduction on X (HW1)
Singular Value Decomposition of the co-occurrence matrix X
Factorizes X into $U \Sigma V^\top$, where U and V are orthonormal
[Figure: SVD diagram; retain only the first k singular values, in order to generalize]
Simple SVD word vectors in Python
Corpus: I like deep learning. I like NLP. I enjoy flying.
Printing first two columns of U corresponding to the 2 biggest singular values
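One way to do this, as a sketch, is to run NumPy's SVD on the co-occurrence matrix built above (variable names are my own):

```python
import numpy as np

vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X)                # X = U Σ V^T, singular values in descending order
for word, coords in zip(vocab, U[:, :2]):  # first two columns of U ↔ 2 biggest singular values
    print(f"{word:>9}: {coords}")
```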
Hacks to X (several used in Rohde et al. 2005)
[Figure (Rohde, Gonnerman & Plaut, Modeling Word Meaning Using Lexical Co-Occurrence): verb inflections cluster together in the vector space, e.g. STEAL/STOLE/STOLEN/STEALING, TAKE/TOOK/TAKEN/TAKING, SPEAK/SPOKE/SPOKEN/SPEAKING, THROW/THREW/THROWN/THROWING, SHOW/SHOWED/SHOWN/SHOWING, EAT/ATE/EATEN/EATING, GROW/GREW/GROWN/GROWING]
Interesting semantic patterns emerge in the vectors
[Figure 13 from Rohde et al. (ms., 2005), An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence: multidimensional scaling for nouns and their associated verbs under the COALS model, e.g. DRIVER-DRIVE, SWIMMER-SWIM, TEACHER-TEACH, STUDENT-LEARN, DOCTOR-TREAT, PRIEST-PRAY, BRIDE-MARRY, JANITOR-CLEAN. The paper's Table 10 lists the 10 nearest neighbors and their percent correlation similarities for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet) under the COALS-14K model.]
4. Towards GloVe: Count based vs. direct prediction
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]
[Table: ratios of co-occurrence probabilities, e.g. $P(x \mid \text{ice}) / P(x \mid \text{steam})$, are large for words related only to ice, small for words related only to steam, and ~1 for words related to both or to neither]
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]
Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A: Log-bilinear model: $w_i \cdot w_j = \log P(i \mid j)$, so vector differences capture ratios: $w_x \cdot (w_a - w_b) = \log \dfrac{P(x \mid a)}{P(x \mid b)}$
• Fast training
• Scalable to huge corpora
• Good performance even with small corpus and small vectors
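For reference, the GloVe objective from Pennington et al. (2014) is a weighted least-squares loss that pushes dot products of word vectors toward log co-occurrence counts:

$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

where $X_{ij}$ is the co-occurrence count and $f$ is a weighting function that caps the influence of very frequent pairs.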
GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
[Images of litoria, leptodactylidae, rana, and eleutherodactylus]
5. How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear whether the subsystem is the problem, its interaction, or other subsystems
• If replacing exactly one subsystem with another improves accuracy → Winning!
Intrinsic word vector evaluation
• Word Vector Analogies
a:b :: c:?
man:woman :: king:?
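A sketch of the standard vector-offset way such analogies are evaluated (cosine similarity to $x_b - x_a + x_c$); the dictionary word_vectors, mapping words to NumPy arrays, is assumed rather than provided here.

```python
import numpy as np

def analogy(a, b, c, word_vectors):
    """Solve a:b :: c:? by maximizing cos(x_b - x_a + x_c, x_d), excluding a, b, c."""
    target = word_vectors[b] - word_vectors[a] + word_vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in word_vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# analogy("man", "woman", "king", word_vectors)  ->  ideally "queen"
```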
GloVe Visualizations
GloVe Visualizations: Company - CEO
GloVe Visualizations: Superlatives
Analogy evaluation and hyperparameters
Results on the word analogy task (percent accuracy). SG and CBOW results are from Mikolov et al. (2013a,b); SG† and CBOW† were trained using the word2vec tool. See text for details and a description of the SVD models.

Model   Dim.  Size  Sem.  Syn.  Tot.
ivLBL   100   1.5B  55.9  50.1  53.2
HPCA    100   1.6B   4.2  16.4  10.8
GloVe   100   1.6B  67.5  54.3  60.3
SG      300   1B    61    61    61
CBOW    300   1.6B  16.1  52.6  36.1
vLBL    300   1.5B  54.2  64.8  60.0
ivLBL   300   1.5B  65.2  63.0  64.0
GloVe   300   1.6B  80.8  61.5  70.3
SVD     300   6B     6.3   8.1   7.3
SVD-S   300   6B    36.7  46.6  42.1
SVD-L   300   6B    56.6  63.0  60.1
CBOW†   300   6B    63.6  67.4  65.7
SG†     300   6B    73.0  66.0  69.1
GloVe   300   6B    77.4  67.0  71.7
CBOW    1000  6B    57.3  68.9  63.7
SG      1000  6B    66.1  65.1  65.6
SVD-L   300   42B   38.4  58.2  49.2
Analogy evaluation and hyperparameters
[Plots: analogy accuracy (Semantic, Syntactic, Overall) as a function of vector dimension (0-600) and of training corpus: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Wiki2014 + Gigaword5 (6B tokens), Common Crawl (42B tokens)]
Extrinsic evaluation example: named entity recognition, i.e. finding a person, organization or location. F1 scores with different word vectors:

Model     Dev   Test  ACE   MUC7
Discrete  91.0  85.4  77.4  73.4
SVD       90.8  85.7  77.3  73.7
SVD-S     91.0  85.5  77.6  74.3
SVD-L     90.5  84.8  73.6  71.5
6. Word senses and word sense ambiguity
• Example: pike
Improving Word Representations Via Global Context
And Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each
word assigned to multiple different clusters bank1, bank2, etc
Linear Algebraic Structure of Word Senses, with
Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted
sum) in standard word embeddings like word2vec
• $v_{\text{pike}} = \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \alpha_3 v_{\text{pike}_3}$
• Where $\alpha_1 = \dfrac{f_1}{f_1 + f_2 + f_3}$, etc., for frequency $f_i$ of each sense
• Surprising result:
• Because of ideas from sparse coding you can actually separate out
the senses (providing they are relatively common)
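A tiny numeric sketch of the superposition idea; the sense vectors and frequencies below are entirely made up, just to show how the α weights combine the senses.

```python
import numpy as np

# Hypothetical sense vectors for "pike" (fish, weapon, road), made up for illustration.
v_senses = np.array([[1.0, 0.2, 0.0],
                     [0.1, 1.0, 0.3],
                     [0.0, 0.4, 1.0]])
f = np.array([60.0, 25.0, 15.0])     # made-up frequencies of the three senses

alpha = f / f.sum()                  # alpha_i = f_i / (f_1 + f_2 + f_3)
v_pike = alpha @ v_senses            # the single "pike" embedding is their weighted sum
print(alpha)                         # [0.6  0.25 0.15]
print(v_pike)
```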
7. Classification review and notation
• Generally we have a training dataset consisting of samples $\{x_i, y_i\}_{i=1}^{N}$
• $x_i$ are inputs (e.g. words, sentences, documents) and $y_i$ are labels (one of C classes) we try to predict
Classification intuition
• Training data: $\{x_i, y_i\}_{i=1}^{N}$
Details of the softmax classifier
• For each example $x$, the softmax classifier computes $p(y \mid x) = \dfrac{\exp(W_y \cdot x)}{\sum_{c=1}^{C} \exp(W_c \cdot x)}$
Background: What is “cross entropy” loss/error?
• Concept of “cross entropy” is from information theory
• Let the true probability distribution be p
• Let our computed model probability be q
• The cross entropy is: $H(p, q) = -\sum_{c=1}^{C} p(c) \log q(c)$
• Because the true distribution p is one-hot, the only nonzero term in the sum is the negative log probability of the true class
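A small sketch tying the softmax classifier and cross entropy together; the weight matrix and the example input are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c)."""
    return -np.sum(p * np.log(q))

W = np.array([[ 0.5, -0.2],             # one weight row per class (3 classes, 2-d inputs)
              [-0.3,  0.8],
              [ 0.1,  0.1]])
x = np.array([1.0, 2.0])                # a single input example
q = softmax(W @ x)                      # model distribution over the 3 classes

p = np.array([0.0, 1.0, 0.0])           # one-hot true distribution: class 1 is correct
print(cross_entropy(p, q))              # equals the negative log prob of the true class:
print(-np.log(q[1]))
```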
Traditional ML optimization
• For general machine learning, $\theta$ usually consists only of the columns of W: $\theta = [W_{\cdot 1}, \ldots, W_{\cdot d}] \in \mathbb{R}^{Cd}$
Neural Network Classifiers
• Softmax (≈ logistic regression) alone is not very powerful
• Softmax gives only linear decision boundaries
• This can be quite limiting → unhelpful when a problem is complex
• Wouldn't it be cool to get these correct?
Neural Nets for the Win!
• Neural networks can learn much more complex functions and nonlinear decision boundaries, in the original input space!
Classification difference with word vectors
• Commonly in NLP deep learning:
• We learn both W and word vectors x
• We learn both conventional parameters and representations
• The word vectors re-represent one-hot vectors, moving them around in an intermediate-layer vector space, for easy classification with a (linear) softmax classifier, via the layer x = Le
• This means a very large number of parameters!
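A sketch of the x = Le layer; the embedding matrix L, the word id, and the dimensions are made up, and in practice the matrix product is implemented as a simple column (or row) lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4
L = rng.normal(size=(d, vocab_size))    # word embedding matrix, one column per word (learned!)

word_id = 7
e = np.zeros(vocab_size)
e[word_id] = 1.0                        # one-hot vector for the word

x = L @ e                               # x = Le: the dense re-representation of the word
assert np.allclose(x, L[:, word_id])    # equivalent to just selecting column word_id
print(x)
```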
8. The course
A note on your experience 😀