
CS11-747 Neural Networks for NLP

A Simple (?) Exercise:



Predicting the Next Word
Graham Neubig

Site
https://phontron.com/class/nn4nlp2019/
Are These Sentences OK?
• Jane went to the store.

• store to Jane went the.

• Jane went store.

• Jane goed to the store.

• The store went to Jane.

• The food truck went to Jane.


Calculating the Probability of
a Sentence
$$P(X) = \prod_{i=1}^{I} P(x_i \mid x_1, \ldots, x_{i-1})$$

(here $x_i$ is the next word and $x_1, \ldots, x_{i-1}$ is the context)

The big problem: how do we predict

$$P(x_i \mid x_1, \ldots, x_{i-1})$$ ?!?!
Review: Count-based
Language Models
Count-based Language
Models
• Count up the frequency and divide:

$$P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) := \frac{c(x_{i-n+1}, \ldots, x_i)}{c(x_{i-n+1}, \ldots, x_{i-1})}$$

• Add smoothing, to deal with zero counts:

$$P(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) = \lambda P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) + (1 - \lambda) P(x_i \mid x_{i-n+2}, \ldots, x_{i-1})$$

• Modified Kneser-Ney smoothing
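
As a rough illustration of the two formulas above, here is a minimal Python sketch of an interpolated bigram model (the toy corpus and the lambda value are assumptions made for the sketch, not from the slides):

from collections import Counter

# toy corpus and interpolation weight: illustrative assumptions only
corpus = "jane went to the store . the store went to jane .".split()
lam = 0.9

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_ml_unigram(w):
    return unigrams[w] / N

def p_interp(w, prev):
    # P(w | prev) = lam * c(prev, w) / c(prev) + (1 - lam) * P_ML(w)
    p_ml_bigram = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_ml_bigram + (1 - lam) * p_ml_unigram(w)

print(p_interp("store", "the"))  # high: "the store" was observed
print(p_interp("jane", "the"))   # low: falls back to the unigram estimate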


A Refresher on Evaluation
• Log-likelihood:

$$LL(\mathcal{E}_{test}) = \sum_{E \in \mathcal{E}_{test}} \log P(E)$$

• Per-word log likelihood:

$$WLL(\mathcal{E}_{test}) = \frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} \log P(E)$$

• Per-word (cross) entropy:

$$H(\mathcal{E}_{test}) = -\frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} \log_2 P(E)$$

• Perplexity:

$$ppl(\mathcal{E}_{test}) = 2^{H(\mathcal{E}_{test})} = e^{-WLL(\mathcal{E}_{test})}$$
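
As a sanity check on how these quantities relate, here is a small numpy sketch (the log probabilities and sentence lengths below are made-up values):

import numpy as np

log_probs = np.array([-12.3, -8.7, -20.1])  # natural-log P(E) for three made-up sentences
lengths = np.array([5, 3, 8])                # |E| for each sentence

ll = log_probs.sum()                                      # log-likelihood
wll = ll / lengths.sum()                                  # per-word log likelihood
entropy = -(log_probs / np.log(2)).sum() / lengths.sum()  # per-word cross entropy (bits)
ppl = 2 ** entropy                                        # perplexity

assert np.isclose(ppl, np.exp(-wll))                      # the two expressions agree
print(ll, wll, entropy, ppl)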
What Can we Do w/ LMs?
• Score sentences:

Jane went to the store . → high


store to Jane went the . → low
(same as calculating loss for training)
• Generate sentences:
  while the end-of-sentence symbol has not been chosen:
    calculate the probability distribution over the next word
    sample a new word from that distribution
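
A minimal sketch of this generation loop in Python; next_word_distribution is a hypothetical stand-in for whatever model computes P(x_i | x_1, ..., x_{i-1}):

import numpy as np

vocab = ["<s>", "</s>", "jane", "went", "to", "the", "store", "."]

def next_word_distribution(context):
    # hypothetical stand-in for a real language model; returns an arbitrary
    # fixed distribution here, just so the loop below runs
    probs = np.ones(len(vocab))
    probs[0] = 0.0                     # never re-generate the start symbol
    return probs / probs.sum()

def generate(max_len=20):
    sent = ["<s>"]
    while sent[-1] != "</s>" and len(sent) < max_len:
        probs = next_word_distribution(sent)             # calculate probability
        sent.append(np.random.choice(vocab, p=probs))    # sample a new word
    return sent

print(generate())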
Problems and Solutions?
• Cannot share strength among similar words:
  she bought a car / she bought a bicycle
  she purchased a car / she purchased a bicycle
  → solution: class-based language models
• Cannot condition on context with intervening words:
  Dr. Jane Smith / Dr. Gertrude Smith
  → solution: skip-gram language models
• Cannot handle long-distance dependencies:
  for tennis class he wanted to buy his own racquet
  for programming class he wanted to buy his own computer
  → solution: cache, trigger, topic, syntactic models, etc.
An Alternative:

Featurized Log-Linear Models
An Alternative:

Featurized Models

• Calculate features of the context

• Based on the features, calculate probabilities

• Optimize feature weights using gradient descent, etc.
Example:
Previous words: "giving a"

Word | b (how likely is it?) | w_{1,a} (given prev. word "a") | w_{2,giving} (given 2nd-prev. word "giving") | s (total score)
a    | 3.0                   | -6.0                           | -0.2                                         | -3.2
the  | 2.5                   | -5.1                           | -0.3                                         | -2.9
talk | -0.2                  | 0.2                            | 1.0                                          | 1.0
gift | 0.1                   | 0.1                            | 2.0                                          | 2.2
hat  | 1.2                   | 0.5                            | -1.2                                         | 0.6
…    | …                     | …                              | …                                            | …
Softmax
• Convert scores into probabilities by taking the
exponent and normalizing (softmax)
$$P(x_i \mid x_{i-n+1}^{i-1}) = \frac{e^{s(x_i \mid x_{i-n+1}^{i-1})}}{\sum_{\tilde{x}_i} e^{s(\tilde{x}_i \mid x_{i-n+1}^{i-1})}}$$

s = [-3.2, -2.9, 1.0, 2.2, 0.6, …]  →  p = [0.002, 0.003, 0.329, 0.444, 0.090, …]
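
A minimal numpy version of the softmax above (subtracting the maximum score is a standard numerical-stability trick, not something from the slide; the slide's probabilities also include the vocabulary items hidden behind "…", so the values printed here differ):

import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))  # exponentiate (shifted for stability)
    return exp / exp.sum()                 # normalize

s = np.array([-3.2, -2.9, 1.0, 2.2, 0.6])
print(softmax(s))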
A Computation Graph View
The context words ("giving" and "a") are each looked up to get a weight vector; these are added to the bias vector to give the scores, and a softmax over the scores gives the probabilities:

  lookup2("giving") + lookup1("a") + bias = scores
  softmax(scores) = probs

Each of these vectors is the size of the output vocabulary.
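
A minimal numpy sketch of this forward pass; the vocabulary, indices, and random parameter values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["giving", "a", "the", "talk", "gift", "hat"]
V = len(vocab)

b = rng.normal(size=V)         # bias: how likely is each word overall?
W1 = rng.normal(size=(V, V))   # column j: scores for each word given previous word j
W2 = rng.normal(size=(V, V))   # column j: scores for each word given 2nd-previous word j

def softmax(scores):
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def loglin_probs(prev2, prev1):
    scores = W2[:, prev2] + W1[:, prev1] + b   # lookup2 + lookup1 + bias = scores
    return softmax(scores)                     # softmax(scores) = probs

print(loglin_probs(vocab.index("giving"), vocab.index("a")))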


A Note: “Lookup”
• Lookup can be viewed as "grabbing" a single vector from a big matrix of word embeddings (e.g. lookup(2) pulls out one column of a vector-size × num.-words matrix)

• Similarly, it can be viewed as multiplying that matrix by a "one-hot" vector (all zeros except a 1 in the word's position)

• The former tends to be faster
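
The equivalence of the two views is easy to check; the matrix sizes here are arbitrary:

import numpy as np

num_words, vec_size = 10, 4
E = np.random.default_rng(0).normal(size=(vec_size, num_words))  # embedding matrix

idx = 2
grabbed = E[:, idx]              # "grab" a single column directly

one_hot = np.zeros(num_words)
one_hot[idx] = 1.0
multiplied = E @ one_hot         # same vector via a matrix-vector product

assert np.allclose(grabbed, multiplied)  # identical results; the direct lookup is faster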
Training a Model
• Reminder: to train, we calculate a “loss
function” (a measure of how bad our predictions
are), and move the parameters to reduce the loss

• The most common loss function for probabilistic models is "negative log likelihood"
If element 3 (or, zero-indexed, 2) of p = [0.002, 0.003, 0.329, 0.444, 0.090, …] is the correct answer, the loss is −log(0.329) ≈ 1.112.

Parameter Update
• Back propagation allows us to calculate the
derivative of the loss with respect to the parameters
$$\frac{\partial \ell}{\partial \theta}$$

• Simple stochastic gradient descent optimizes parameters according to the following rule:

$$\theta \leftarrow \theta - \alpha \frac{\partial \ell}{\partial \theta}$$
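
For the log-linear model above, the gradient of the negative log likelihood with respect to the scores is simply the predicted probabilities minus a one-hot vector for the correct word, so one manual SGD step can be sketched as follows (a standard derivation with made-up sizes, not code from the course):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V = 5
b = rng.normal(size=V)        # bias over the output vocabulary
W1 = rng.normal(size=(V, V))  # weights for the previous word
prev, target = 1, 3           # illustrative word indices
alpha = 0.1                   # learning rate

scores = b + W1[:, prev]
probs = softmax(scores)
loss = -np.log(probs[target])           # negative log likelihood

grad = probs.copy()
grad[target] -= 1.0                     # d loss / d scores = probs - onehot(target)
b -= alpha * grad                       # theta <- theta - alpha * d loss / d theta
W1[:, prev] -= alpha * grad
print(loss)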
Choosing a Vocabulary
Unknown Words
• Necessity for UNK words:
  • We won't have all the words in the world in training data
  • Larger vocabularies require more memory and computation time
• Common ways to choose the vocabulary:
  • Frequency threshold (usually words with frequency <= 1 become UNK)
  • Rank threshold
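
A minimal sketch of building a frequency-thresholded vocabulary and replacing rare words with UNK (the toy corpus and threshold are assumptions):

from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the log"]
tokens = [w for sent in corpus for w in sent.split()]

counts = Counter(tokens)
threshold = 1                                           # words with count <= 1 become <unk>
vocab = {w for w, c in counts.items() if c > threshold} | {"<unk>"}

unked = [w if w in vocab else "<unk>" for w in tokens]
print(sorted(vocab))
print(unked)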
Evaluation and Vocabulary

• Important: the vocabulary must be the same over models you compare

• Or more accurately, all models must be able to generate the test set (it's OK if they can generate more than the test set, but not less)

• e.g. Comparing a character-based model to a word-based model is fair, but not vice-versa
Let’s try it out!
(loglin-lm.py)
What Problems are Handled?
• Cannot share strength among similar words:
  she bought a car / she bought a bicycle
  she purchased a car / she purchased a bicycle
  → not solved yet 😞
• Cannot condition on context with intervening words:
  Dr. Jane Smith / Dr. Gertrude Smith
  → solved! 😀
• Cannot handle long-distance dependencies:
  for tennis class he wanted to buy his own racquet
  for programming class he wanted to buy his own computer
  → not solved yet 😞
Beyond Linear Models
Linear Models can’t Learn
Feature Combinations
  farmers eat steak → high        cows eat steak → low
  farmers eat hay → low           cows eat hay → high

• These can't be expressed by linear features

• What can we do?
  • Remember combinations as features (individual scores for "farmers eat", "cows eat")
    → feature space explosion!
  • Neural nets
Neural Language Models
• (See Bengio et al. 2004)

• The context words ("giving" and "a") are looked up as word embeddings and concatenated into h; a hidden layer computes tanh(W1*h + b1); multiplying by the output matrix W and adding the bias gives the scores, and a softmax turns the scores into probabilities.
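
A minimal numpy sketch of this feed-forward neural LM; the sizes and random parameters are placeholders:

import numpy as np

rng = np.random.default_rng(0)
V, emb, hid = 1000, 64, 128                   # vocab, embedding, hidden sizes (assumptions)

E = rng.normal(0, 0.1, size=(V, emb))         # input word embeddings
W1 = rng.normal(0, 0.1, size=(hid, 2 * emb))  # hidden-layer weights
b1 = np.zeros(hid)
W = rng.normal(0, 0.1, size=(V, hid))         # output (softmax) weights
b = np.zeros(V)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def nlm_probs(prev2, prev1):
    h = np.concatenate([E[prev2], E[prev1]])  # look up and concatenate the context words
    hidden = np.tanh(W1 @ h + b1)             # tanh(W1*h + b1)
    return softmax(W @ hidden + b)            # W * hidden + bias = scores; softmax = probs

print(nlm_probs(17, 42).sum())                # ~1.0 for arbitrary context word ids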


Where is Strength Shared?
• Word embeddings: similar input words get similar vectors

• Similar contexts get similar hidden states

• Similar output words get similar rows in the softmax matrix
What Problems are Handled?
• Cannot share strength among similar words:
  she bought a car / she bought a bicycle
  she purchased a car / she purchased a bicycle
  → solved, and similar contexts as well! 😀
• Cannot condition on context with intervening words:
  Dr. Jane Smith / Dr. Gertrude Smith
  → solved! 😀
• Cannot handle long-distance dependencies:
  for tennis class he wanted to buy his own racquet
  for programming class he wanted to buy his own computer
  → not solved yet 😞
Let’s Try it Out!
(nn-lm.py)
Tying Input/Output Embeddings

• We can share parameters between the input and output embeddings (Press et al. 2016, inter alia)

Want to try? Delete the input embeddings, and instead pick a row from the softmax matrix.
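
Following that suggestion, here is a sketch of the same model with tied embeddings, where rows of the softmax matrix double as input vectors (this requires the input embedding size to equal the hidden size; all sizes are placeholders):

import numpy as np

rng = np.random.default_rng(0)
V, dim = 1000, 128                      # tying requires embedding size == hidden size

W = rng.normal(0, 0.1, size=(V, dim))   # softmax matrix, reused as input embeddings
W1 = rng.normal(0, 0.1, size=(dim, 2 * dim))
b1, b = np.zeros(dim), np.zeros(V)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def tied_nlm_probs(prev2, prev1):
    h = np.concatenate([W[prev2], W[prev1]])  # "pick a row from the softmax matrix"
    hidden = np.tanh(W1 @ h + b1)
    return softmax(W @ hidden + b)            # the same W scores the output words

print(tied_nlm_probs(17, 42).sum())           # ~1.0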
Training Tricks
Shuffling the Training Data
• Stochastic gradient methods update the parameters a little bit at a time

• What if we have the sentence "I love this sentence so much!" at the end of the training data 50 times?

• To train correctly, we should randomly shuffle the order at each time step
Other Optimization Options
• SGD with Momentum: remember gradients from past time steps to prevent sudden changes

• Adagrad: adapt the learning rate, reducing it for frequently updated parameters (as measured by the variance of the gradient)

• Adam: like Adagrad, but keeps a running average of momentum and gradient variance

• Many others: RMSProp, Adadelta, etc.

(See Ruder 2016 reference for more details)
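
For instance, SGD with momentum keeps a running velocity of past gradients; a textbook-style sketch on a toy quadratic loss (the numbers are made up):

import numpy as np

theta = np.array([1.0, -2.0])      # made-up parameters
velocity = np.zeros_like(theta)
alpha, mu = 0.1, 0.9               # learning rate and momentum coefficient

def grad(theta):
    return theta                   # gradient of the toy loss 0.5 * ||theta||^2

for _ in range(20):
    velocity = mu * velocity - alpha * grad(theta)  # remember past gradients
    theta = theta + velocity                        # smoother updates than plain SGD
print(theta)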
Early Stopping, Learning Rate Decay
• Neural nets have tons of parameters: we want to prevent them from over-fitting

• We can do this by monitoring our performance on held-out development data and stopping training when it starts to get worse

• It also sometimes helps to reduce the learning rate and continue training
Which One to Use?
• Adam is usually fast to converge and stable

• But simple SGD tends to do very well in terms of generalization (Wilson et al. 2017)

• You should use learning rate decay (e.g. see the machine translation results of Denkowski & Neubig 2017)
Dropout
(Srivastava+ 14)
• Neural nets have lots of parameters, and are prone to overfitting

• Dropout: randomly zero out nodes in the hidden layer with probability p at training time only

• Because the number of active nodes at training/test time is different, scaling is necessary:
  • Standard dropout: scale by the keep probability (1−p) at test time
  • Inverted dropout: scale by 1/(1−p) at training time

• An alternative: DropConnect (Wan+ 2013) instead zeros out weights in the NN
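
A minimal sketch of inverted dropout on a hidden vector, with p the drop probability as defined above (the vector itself is a placeholder):

import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(h, p, train=True):
    if not train:
        return h                                      # no scaling needed at test time
    mask = (rng.random(h.shape) >= p) / (1.0 - p)     # drop with prob p, scale survivors
    return h * mask

h = rng.normal(size=8)                                # placeholder hidden layer
print(inverted_dropout(h, p=0.5))
print(inverted_dropout(h, p=0.5, train=False))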
Let’s Try it Out!
(nn-lm-optim.py)
Efficiency Tricks:

Operation Batching
Efficiency Tricks:

Mini-batching

• On modern hardware, 10 operations of size 1 are much slower than 1 operation of size 10

• Minibatching combines smaller operations together into one big one
Minibatching
Manual Mini-batching
• Group together similar operations (e.g. loss calculations for a single word) and execute them all together
• In the case of a feed-forward language model, each word prediction in a sentence can be batched
• For recurrent neural nets, etc., it is more complicated
• How this works depends on the toolkit
  • Most toolkits require you to add an extra dimension representing the batch size
  • DyNet has special minibatch operations for lookup and loss functions; everything else is automatic
Mini-batched Code Example
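
The original slide shows the example as an image; as a rough stand-in, here is a numpy sketch where a whole batch of feed-forward LM word predictions is computed with a few matrix operations (all names, shapes, and values are assumptions):

import numpy as np

rng = np.random.default_rng(0)
V, emb, hid, B = 1000, 64, 128, 32            # vocab, embedding, hidden, batch sizes

E = rng.normal(0, 0.1, size=(V, emb))
W1 = rng.normal(0, 0.1, size=(hid, 2 * emb))
b1 = np.zeros(hid)
W = rng.normal(0, 0.1, size=(V, hid))
b = np.zeros(V)

# a whole minibatch of (2nd-prev, prev, target) word ids, predicted at once
prev2 = rng.integers(V, size=B)
prev1 = rng.integers(V, size=B)
target = rng.integers(V, size=B)

H = np.concatenate([E[prev2], E[prev1]], axis=1)  # (B, 2*emb): batched lookup
hidden = np.tanh(H @ W1.T + b1)                   # one matmul instead of B small ones
scores = hidden @ W.T + b                         # (B, V)

m = scores.max(axis=1, keepdims=True)             # batched log-softmax + NLL loss
log_probs = scores - m - np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(B), target].mean()
print(loss)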
Let’s Try it Out!
(nn-lm-batch.py)
Automatic Mini-batching!

• TensorFlow Fold, DyNet Autobatching (see Neubig et al. 2017)

• Try it with the --dynet-autobatch command line option


Autobatching Usage
• for each minibatch:
  • for each data point in the mini-batch:
    • define/add data
  • sum losses
  • forward (autobatch engine does magic!)
  • backward
  • update
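
Concretely, the loop above might look roughly like this DyNet sketch (run with the --dynet-autobatch option; the toy task, sizes, and random data are assumptions, and only the loop structure follows the slide):

import random
import dynet as dy   # pass --dynet-autobatch 1 on the command line

VOCAB, EMB, BATCH = 100, 32, 16                     # toy sizes (assumptions)
data = [(random.randrange(VOCAB), random.randrange(VOCAB)) for _ in range(1000)]

pc = dy.ParameterCollection()
E = pc.add_lookup_parameters((VOCAB, EMB))          # input embeddings
W = pc.add_parameters((VOCAB, EMB))                 # output weights
b = pc.add_parameters((VOCAB,))                     # output bias
trainer = dy.SimpleSGDTrainer(pc)

def example_loss(ctx, nxt):
    h = dy.lookup(E, ctx)
    scores = dy.parameter(W) * h + dy.parameter(b)
    return dy.pickneglogsoftmax(scores, nxt)

for i in range(0, len(data), BATCH):
    dy.renew_cg()                                   # new graph for each minibatch
    losses = [example_loss(c, n) for c, n in data[i:i + BATCH]]   # define/add data
    batch_loss = dy.esum(losses)                    # sum losses
    batch_loss.forward()                            # autobatch engine does the magic here
    batch_loss.backward()
    trainer.update()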
Speed Improvements
A Case Study:
Regularizing and Optimizing LSTM Language Models (Merity et al. 2017)
• Uses LSTMs as a backbone (discussed later)
• A number of tricks to improve stability and prevent overfitting:
• DropConnect regularization
• SGD w/ averaging triggered when model is close to
convergence
• Dropout on recurrent connections and embeddings
• Weight tying
• Independently tuned embedding and hidden layer sizes
• Regularization of activations of the network
• Strong baseline for language modeling, SOTA at the time
(without special model, just training methods)
Questions?
