CS11-747 Neural Networks for NLP
A Simple (?) Exercise:
Predicting the Next Word
Graham Neubig
Site
https://phontron.com/class/nn4nlp2019/
Are These Sentences OK?
• Jane went to the store.
• store to Jane went the.
• Jane went store.
• Jane goed to the store.
• The store went to Jane.
• The food truck went to Jane.
Calculating the Probability of
a Sentence
    P(X) = ∏_{i=1}^{I} P(x_i | x_1, …, x_{i-1})
           (next word: x_i    context: x_1, …, x_{i-1})
The big problem: How do we predict
    P(x_i | x_1, …, x_{i-1})
?!?!
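As a tiny illustration of this chain-rule decomposition, a minimal Python sketch (the `word_prob` function and the uniform model are hypothetical stand-ins, not part of the course code):

```python
import math

def sentence_log_prob(sentence, word_prob):
    """Score a sentence by summing log P(x_i | x_1, ..., x_{i-1}).

    `word_prob(word, context)` is any function that returns the
    conditional probability of `word` given the preceding words.
    """
    log_p = 0.0
    for i, word in enumerate(sentence):
        log_p += math.log(word_prob(word, sentence[:i]))
    return log_p

# e.g. a (hypothetical) uniform model over a 10,000-word vocabulary:
uniform = lambda word, context: 1.0 / 10000
print(sentence_log_prob(["Jane", "went", "to", "the", "store", "."], uniform))
```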
Review: Count-based
Language Models
Count-based Language
Models
•Count up the frequency and divide:
    P_ML(x_i | x_{i-n+1}, …, x_{i-1}) := c(x_{i-n+1}, …, x_i) / c(x_{i-n+1}, …, x_{i-1})
• Add smoothing, to deal with zero counts:
    P(x_i | x_{i-n+1}, …, x_{i-1}) = λ P_ML(x_i | x_{i-n+1}, …, x_{i-1})
                                     + (1 - λ) P(x_i | x_{i-n+2}, …, x_{i-1})
• More sophisticated smoothing methods exist, e.g. Modified Kneser-Ney smoothing
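A minimal sketch of the count-and-interpolate recipe above for a bigram model (not Kneser-Ney; the `<s>` padding token and the interpolation weight `lam` are illustrative assumptions):

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] + sent
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def interp_prob(cur, prev, unigrams, bigrams, lam=0.9):
    """lam * P_ML(cur | prev) + (1 - lam) * P_ML(cur):
    the lower-order (unigram) model smooths over zero bigram counts."""
    p_bi = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[cur] / sum(unigrams.values())
    return lam * p_bi + (1 - lam) * p_uni

counts = train_bigram_counts([["Jane", "went", "to", "the", "store", "."]])
print(interp_prob("store", "the", *counts))
```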
A Refresher on Evaluation
• Log-likelihood:
    LL(E_test) = Σ_{E ∈ E_test} log P(E)
• Per-word Log Likelihood:
    WLL(E_test) = (1 / Σ_{E ∈ E_test} |E|) Σ_{E ∈ E_test} log P(E)
• Per-word (Cross) Entropy:
    H(E_test) = (1 / Σ_{E ∈ E_test} |E|) Σ_{E ∈ E_test} -log_2 P(E)
• Perplexity:
    ppl(E_test) = 2^{H(E_test)} = e^{-WLL(E_test)}
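These metrics are mechanical to compute once you have per-sentence log probabilities; a small sketch (the example numbers are made up):

```python
import math

def evaluate(sent_log_probs, sent_lengths):
    """Compute the metrics above from natural-log sentence probabilities."""
    total_words = sum(sent_lengths)
    ll = sum(sent_log_probs)                     # log-likelihood
    wll = ll / total_words                       # per-word log likelihood
    entropy = -ll / (total_words * math.log(2))  # per-word cross entropy (bits)
    ppl = math.exp(-wll)                         # perplexity = 2 ** entropy
    return ll, wll, entropy, ppl

# e.g. two hypothetical sentences of 5 and 7 words:
print(evaluate([-12.3, -20.1], [5, 7]))
```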
What Can we Do w/ LMs?
• Score sentences:
Jane went to the store . → high
store to Jane went the . → low
(same as calculating loss for training)
• Generate sentences:
while didn’t choose end-of-sentence symbol:
calculate probability
sample a new word from the probability distribution
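A sketch of this sampling loop in Python; `next_word_probs` is a hypothetical interface returning a word-to-probability dict, and the `</s>` symbol and length cap are illustrative choices:

```python
import random

def generate(next_word_probs, eos="</s>", max_len=50):
    """Sample words until the end-of-sentence symbol is chosen."""
    history = []
    while len(history) < max_len:
        probs = next_word_probs(history)                  # calculate probability
        words, weights = zip(*probs.items())
        word = random.choices(words, weights=weights)[0]  # sample a new word
        if word == eos:
            break
        history.append(word)
    return history
```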
Problems and Solutions?
• Cannot share strength among similar words
she bought a car she bought a bicycle
she purchased a car she purchased a bicycle
→ solution: class based language models
• Cannot condition on context with intervening words
Dr. Jane Smith Dr. Gertrude Smith
→ solution: skip-gram language models
• Cannot handle long-distance dependencies
for tennis class he wanted to buy his own racquet
for programming class he wanted to buy his own computer
→ solution: cache, trigger, topic, syntactic models, etc.
An Alternative:
Featurized Log-Linear Models
An Alternative:
Featurized Models
• Calculate features of the context
• Based on the features, calculate probabilities
• Optimize feature weights using gradient descent,
etc.
Example:
Previous words: “giving a"
Word    b      w_{1,a}   w_{2,giving}   s
a        3.0   -6.0      -0.2           -3.2
the      2.5   -5.1      -0.3           -2.9
talk    -0.2    0.2       1.0            1.0
gift     0.1    0.1       2.0            2.2
hat      1.2    0.5      -1.2            0.6
…        …      …         …              …

b: how likely is each word overall?
w_{1,a}: how likely given the previous word is "a"?
w_{2,giving}: how likely given the second-previous word is "giving"?
s: total score (b + w_{1,a} + w_{2,giving})
Softmax
• Convert scores into probabilities by taking the
exponent and normalizing (softmax)
    P(x_i | x_{i-n+1}^{i-1}) = exp(s(x_i | x_{i-n+1}^{i-1})) / Σ_{x̃_i} exp(s(x̃_i | x_{i-n+1}^{i-1}))

    s = [-3.2, -2.9, 1.0, 2.2, 0.6, …]  →  p = [0.002, 0.003, 0.329, 0.444, 0.090, …]
A Computation Graph View
For the context "giving a":
    scores = lookup2(giving) + lookup1(a) + bias
    probs = softmax(scores)
Each vector is the size of the output vocabulary
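A NumPy sketch of this computation graph for the log-linear model (the vocabulary size and random initialization are illustrative assumptions, not the setup of loglin-lm.py):

```python
import numpy as np

V = 1000  # hypothetical vocabulary size

# Each parameter vector is the size of the output vocabulary.
bias = np.zeros(V)
W1 = np.random.randn(V, V) * 0.01  # one row per possible previous word
W2 = np.random.randn(V, V) * 0.01  # one row per possible second-previous word

def log_linear_probs(prev_id, prev2_id):
    """scores = lookup2 + lookup1 + bias; probs = softmax(scores)."""
    scores = bias + W1[prev_id] + W2[prev2_id]
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

print(log_linear_probs(3, 7).sum())  # sums to 1.0
```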
A Note: “Lookup”
• Lookup can be viewed as “grabbing” a single
vector from a big matrix of word embeddings
(figure: an embedding matrix with one column of size "vector size" per word; lookup(2) grabs the vector for word 2)
• Similarly, can be viewed as multiplying by a "one-hot" vector (all zeros, with a 1 in position 2)
• Former tends to be faster
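The equivalence (and the speed difference) is easy to see in NumPy; the matrix here is a made-up example:

```python
import numpy as np

vocab_size, emb_size = 1000, 64
E = np.random.randn(vocab_size, emb_size)  # one embedding vector per word (row)

# "Grabbing" a single vector:
v1 = E[2]

# Multiplying by a one-hot vector gives the same result, more slowly:
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0
v2 = one_hot @ E

assert np.allclose(v1, v2)
```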
Training a Model
• Reminder: to train, we calculate a “loss
function” (a measure of how bad our predictions
are), and move the parameters to reduce the loss
• The most common loss function for probabilistic
models is “negative log likelihood”
    p = [0.002, 0.003, 0.329, 0.444, 0.090, …]
If element 3 (or zero-indexed, 2) is the correct answer:
    loss = -log p[2] = -log 0.329 ≈ 1.112
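The same calculation in code, using the numbers from the slide:

```python
import numpy as np

p = np.array([0.002, 0.003, 0.329, 0.444, 0.090])
correct = 2  # zero-indexed position of the correct next word

loss = -np.log(p[correct])
print(loss)  # ~1.112
```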
Parameter Update
• Back propagation allows us to calculate the
derivative of the loss with respect to the parameters
    ∂ℓ/∂θ
• Simple stochastic gradient descent optimizes
parameters according to the following rule
    θ ← θ - α ∂ℓ/∂θ
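The update rule itself is one line per parameter; a sketch assuming the gradients have already been computed by backpropagation:

```python
import numpy as np

def sgd_update(params, grads, alpha=0.1):
    """theta <- theta - alpha * d(loss)/d(theta), for each parameter array."""
    for theta, grad in zip(params, grads):
        theta -= alpha * grad  # in-place update of a NumPy parameter array

w, b = np.ones(3), np.zeros(3)
sgd_update([w, b], [np.array([0.5, -0.5, 0.0]), np.array([0.1, 0.1, 0.1])])
print(w, b)
```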
Choosing a Vocabulary
Unknown Words
• Necessity for UNK words
• We won’t have all the words in the world in training data
• Larger vocabularies require more memory and
computation time
• Common ways:
• Frequency threshold (usually replace words with count <= 1 with UNK)
• Rank threshold
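A sketch of the frequency-threshold approach (the `<unk>` symbol and `min_count` default are illustrative choices):

```python
from collections import Counter

def build_vocab(corpus, min_count=2):
    """Keep words appearing at least `min_count` times; map the rest to <unk>."""
    counts = Counter(word for sent in corpus for word in sent)
    vocab = {"<unk>": 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def to_ids(sent, vocab):
    return [vocab.get(word, vocab["<unk>"]) for word in sent]
```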
Evaluation and Vocabulary
• Important: the vocabulary must be the same over
models you compare
• Or more accurately, all models must be able to
generate the test set (it’s OK if they can generate
more than the test set, but not less)
• e.g. Comparing a character-based model to a
word-based model is fair, but not vice-versa
Let’s try it out!
(loglin-lm.py)
What Problems are Handled?
• Cannot share strength among similar words
she bought a car she bought a bicycle
she purchased a car she purchased a bicycle
→ not solved yet 😞
• Cannot condition on context with intervening words
Dr. Jane Smith Dr. Gertrude Smith
→ solved! 😀
• Cannot handle long-distance dependencies
for tennis class he wanted to buy his own racquet
for programming class he wanted to buy his own computer
→ not solved yet 😞
Beyond Linear Models
Linear Models can’t Learn
Feature Combinations
farmers eat steak → high cows eat steak → low
farmers eat hay → low cows eat hay → high
• These can’t be expressed by linear features
• What can we do?
• Remember combinations as features (individual
scores for “farmers eat”, “cows eat”)
→ Feature space explosion!
• Neural nets
Neural Language Models
• (See Bengio et al. 2004)
For the context "giving a":
    h = tanh(W1 [lookup(giving); lookup(a)] + b1)
    scores = W h + bias
    probs = softmax(scores)
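A NumPy sketch of this forward pass (layer sizes and initialization are illustrative assumptions, not the configuration of nn-lm.py):

```python
import numpy as np

V, emb, hid = 1000, 64, 128  # hypothetical sizes

E = np.random.randn(V, emb) * 0.1         # word embeddings
W1 = np.random.randn(hid, 2 * emb) * 0.1  # hidden layer weights
b1 = np.zeros(hid)
W = np.random.randn(V, hid) * 0.1         # output (softmax) weights
bias = np.zeros(V)

def nn_lm_probs(prev2_id, prev_id):
    """h = tanh(W1 [e(prev2); e(prev)] + b1); probs = softmax(W h + bias)."""
    x = np.concatenate([E[prev2_id], E[prev_id]])
    h = np.tanh(W1 @ x + b1)
    scores = W @ h + bias
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```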
Where is Strength Shared?
• Word embeddings: similar input words get similar vectors
• Similar contexts get similar hidden states
• Similar output words get similar rows in the softmax matrix
What Problems are Handled?
• Cannot share strength among similar words
she bought a car she bought a bicycle
she purchased a car she purchased a bicycle
→ solved, and similar contexts as well! 😀
• Cannot condition on context with intervening words
Dr. Jane Smith Dr. Gertrude Smith
→ solved! 😀
• Cannot handle long-distance dependencies
for tennis class he wanted to buy his own racquet
for programming class he wanted to buy his own computer
→ not solved yet 😞
Let’s Try it Out!
(nn-lm.py)
Tying Input/Output Embeddings
• We can share parameters between the input and output embeddings (Press et al. 2016, inter alia)
Want to try? Delete the input embeddings, and
instead pick a row from the softmax matrix.
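A sketch of that suggestion, continuing the hypothetical NumPy model above: the input embeddings are simply rows of the softmax matrix W, which requires the embedding size to equal the hidden size:

```python
import numpy as np

V, hid = 1000, 128
W = np.random.randn(V, hid) * 0.1         # softmax matrix, reused as embeddings
W1 = np.random.randn(hid, 2 * hid) * 0.1
b1 = np.zeros(hid)
bias = np.zeros(V)

def tied_nn_lm_probs(prev2_id, prev_id):
    """Pick input embeddings as rows of the output softmax matrix W."""
    x = np.concatenate([W[prev2_id], W[prev_id]])
    h = np.tanh(W1 @ x + b1)
    scores = W @ h + bias
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```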
Training Tricks
Shuffling the Training Data
• Stochastic gradient methods update the
parameters a little bit at a time
• What if we have the sentence “I love this
sentence so much!” at the end of the training
data 50 times?
• To train correctly, we should randomly shuffle the
order of the training data (typically at each epoch)
Other Optimization Options
• SGD with Momentum: Remember gradients from past
time steps to prevent sudden changes
• Adagrad: Adapt the learning rate to reduce learning
rate for frequently updated parameters (as measured
by the variance of the gradient)
• Adam: Like Adagrad, but keeps a running average of
momentum and gradient variance
• Many others: RMSProp, Adadelta, etc.
(See Ruder 2016 reference for more details)
Early Stopping, Learning
Rate Decay
• Neural nets have tons of parameters: we want to
prevent them from over-fitting
• We can do this by monitoring our performance on
held-out development data and stopping training
when it starts to get worse
• It also sometimes helps to reduce the learning rate
and continue training
Which One to Use?
• Adam is usually fast to converge and stable
• But simple SGD tends to do very well in terms of
generalization (Wilson et al. 2017)
• You should also use learning rate decay (e.g. see the machine
translation results of Denkowski & Neubig 2017)
Dropout
(Srivastava+ 14)
• Neural nets have lots of parameters, and are prone
to overfitting
• Dropout: randomly zero-out nodes in the hidden
layer with probability p at training time only
• Because the number of nodes at training/test is different, scaling is
necessary:
• Standard dropout: scale outputs by (1-p) at test time (since each node was kept with probability 1-p during training)
• Inverted dropout: scale by 1/(1-p) at training time
• An alternative: DropConnect (Wan+ 2013) instead zeros out
weights in the NN
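A sketch of inverted dropout on a hidden vector (the drop probability and interface are illustrative):

```python
import numpy as np

def inverted_dropout(h, p=0.5, train=True):
    """Zero each element with probability p at training time, scaling the
    survivors by 1/(1-p) so no rescaling is needed at test time."""
    if not train:
        return h
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(8)
print(inverted_dropout(h, p=0.5))        # roughly half zeros, survivors = 2.0
print(inverted_dropout(h, train=False))  # unchanged at test time
```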
Let’s Try it Out!
(nn-lm-optim.py)
Efficiency Tricks:
Operation Batching
Efficiency Tricks:
Mini-batching
• On modern hardware 10 operations of size 1 is
much slower than 1 operation of size 10
• Minibatching combines together smaller operations
into one big one
Minibatching
Manual Mini-batching
• Group together similar operations (e.g. loss calculations
for a single word) and execute them all together
• In the case of a feed-forward language model, each
word prediction in a sentence can be batched
• For recurrent neural nets, etc., more complicated
• How this works depends on toolkit
• Most toolkits require you to add an extra
dimension representing the batch size
• DyNet has special minibatch operations for lookup
and loss functions, everything else automatic
Mini-batched Code Example
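A hedged NumPy sketch of what a mini-batched loss computation for the feed-forward model might look like (all sizes and names are illustrative assumptions, not the contents of nn-lm-batch.py); the point is that each layer becomes one big matrix operation over the whole batch:

```python
import numpy as np

V, emb, hid, batch = 1000, 64, 128, 32  # hypothetical sizes

E = np.random.randn(V, emb) * 0.1
W1 = np.random.randn(hid, 2 * emb) * 0.1
b1 = np.zeros(hid)
W = np.random.randn(V, hid) * 0.1
bias = np.zeros(V)

def batched_nll(prev2_ids, prev_ids, correct_ids):
    """One matrix multiply per layer instead of `batch` small ones."""
    x = np.concatenate([E[prev2_ids], E[prev_ids]], axis=1)  # (batch, 2*emb)
    h = np.tanh(x @ W1.T + b1)                               # (batch, hid)
    scores = h @ W.T + bias                                  # (batch, V)
    scores -= scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(correct_ids)), correct_ids].sum()

p2, p1, y = (np.random.randint(V, size=batch) for _ in range(3))
print(batched_nll(p2, p1, y))
```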
Let’s Try it Out!
(nn-lm-batch.py)
Automatic Mini-batching!
• TensorFlow Fold, DyNet Autobatching (see Neubig et al.
2017)
• Try it with the --dynet-autobatch command line option
Autobatching Usage
• for each minibatch:
• for each data point in mini-batch:
• define/add data
• sum losses
• forward (autobatch engine does magic!)
• backward
• update
Speed Improvements
A Case Study:
Regularizing and Optimizing LSTM
Language Models (Merity et al. 2017)
Regularizing and Optimizing LSTM
Language Models (Merity et al. 2017)
• Uses LSTMs as a backbone (discussed later)
• A number of tricks to improve stability and prevent overfitting:
• DropConnect regularization
• SGD w/ averaging triggered when model is close to
convergence
• Dropout on recurrent connections and embeddings
• Weight tying
• Independently tuned embedding and hidden layer sizes
• Regularization of activations of the network
• Strong baseline for language modeling, SOTA at the time
(without special model, just training methods)
Questions?