CSCI 5832
Natural Language Processing
Jim Martin
Lecture 9
Today 2/12
• Review
GT example
• HMMs and Viterbi
POS tagging
Good-Turing Intuition
• Notation: Nc is the frequency-of-frequency-c
(the number of word types seen exactly c times)
So N10 = 1, N1 = 3, etc.
• To estimate counts/probabilities for unseen species,
use the number of species (words) we’ve seen once:
c* = (c+1) N(c+1)/Nc, so p0 = N1/N
• All other estimates are adjusted (down) to allow for
increased probabilities for unseen events
HW 0 Results
• Favorite color: 21 events
Blue 8, Green 3, Red 2, Black 2, White 2,
Periwinkle 1, Gamboge 1, Eau-de-Nil 1, Brown 1
• Count of counts
N1 = 4, N2 = 3, N3 = 1, N4–7 = 0, N8 = 1
GT for a New Color
• Count of counts: N1 = 4, N2 = 3, N3 = 1, N4–7 = 0, N8 = 1
• Treat the 0s as 1s, so c0* = N1 = 4 and
P(new color) = N1/N = 4/21 = .19
If we knew the number of colors out there, we would
divide .19 by the number of colors not seen.
• Otherwise
c1* = (1+1) N2/N1 = 2 · 3/4 = 1.5
P*(Periwinkle) = 1.5/21 = .07
c2* = (2+1) N3/N2 = 3 · 1/3 = 1
P*(Black) = 1/21 = .047
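To make the arithmetic concrete, here is a minimal Python sketch (not from the lecture) that reproduces these numbers from the raw color counts:

```python
from collections import Counter

# Color counts from the HW 0 slide (21 observations total)
counts = {"Blue": 8, "Green": 3, "Red": 2, "Black": 2, "White": 2,
          "Periwinkle": 1, "Gamboge": 1, "Eau-de-Nil": 1, "Brown": 1}
N = sum(counts.values())          # 21
Nc = Counter(counts.values())     # count of counts: {8: 1, 3: 1, 2: 3, 1: 4}

# Probability mass reserved for unseen colors: p0 = N1 / N
p0 = Nc[1] / N                    # 4/21 ~ 0.19

def c_star(c):
    """Good-Turing adjusted count: c* = (c+1) * N_{c+1} / N_c."""
    return (c + 1) * Nc[c + 1] / Nc[c]

print(p0)                          # 0.190...
print(c_star(1), c_star(1) / N)    # 1.5, ~0.07  -> P*(Periwinkle)
print(c_star(2), c_star(2) / N)    # 1.0, ~0.047 -> P*(Black)
```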
GT for New Color
• Count of counts: N1 = 4, N2 = 3, N3 = 1, N4–7 = 0, N8 = 1
• But 2 twists
Treat the high flyers as trusted,
so P(Blue) should stay 8/21
Use interpolation to smooth the bin counts before
re-estimation, to deal with zeros in the count of counts, e.g.
c3* = (3+1) N4/N3 = 4 · 0/1 = 0
Why Logs?
Simple Good-Turing does linear
interpolation in log-space. Why?
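Because the count-of-counts bins are sparse (here N4–N7 = 0), Simple Good-Turing fits a straight line to the nonzero (log c, log Nc) points and reads smoothed bin counts off that line. A minimal sketch using numpy and the color data above; the published SGT procedure has more detail (e.g., when to switch from raw to smoothed counts), which this omits:

```python
import numpy as np

# Nonzero count-of-counts bins from the color example
cs  = np.array([1, 2, 3, 8])
Ncs = np.array([4, 3, 1, 1])

# Fit log(Nc) = intercept + slope * log(c); zeros can't be logged,
# which is why the fit uses only the nonzero bins
slope, intercept = np.polyfit(np.log(cs), np.log(Ncs), 1)

def S(c):
    """Smoothed count-of-counts read off the fitted line."""
    return np.exp(intercept) * c ** slope

# c* = (c+1) * S(c+1) / S(c) is now defined even where raw N_{c+1} = 0,
# fixing the c3* = (3+1) * N4/N3 = 0 problem from the previous slide
for c in [1, 2, 3]:
    print(c, (c + 1) * S(c + 1) / S(c))
```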
Part of Speech tagging
• Part of speech tagging
Parts of speech
What’s POS tagging good for anyhow?
Tag sets
Rule-based tagging
Statistical tagging
Simple most-frequent-tag baseline
Important Ideas
Training sets and test sets
Unknown words
HMM tagging
Parts of Speech
• 8 (ish) traditional parts of speech
Noun, verb, adjective, preposition, adverb,
article, interjection, pronoun, conjunction, etc.
Also called: parts-of-speech, lexical categories,
word classes, morphological classes, lexical
tags, POS
Lots of debate in linguistics about the number,
nature, and universality of these
We’ll completely ignore this debate.
POS examples
• N noun chair, bandwidth, pacing
• V verb study, debate, munch
• ADJ adjective purple, tall, ridiculous
• ADV adverb unfortunately, slowly
• P preposition of, by, to
• PRO pronoun I, me, mine
• DET determiner the, a, that, those
POS Tagging example
WORD tag
the DET
koala N
put V
the DET
keys N
on P
the DET
table N
POS Tagging
• Words often have more than one POS:
back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
• The POS tagging problem is to determine
the POS tag for a particular instance of a
word.
These examples are from Dekang Lin
How hard is POS tagging?
Measuring ambiguity
2 methods for POS tagging
1. Rule-based tagging
(ENGTWOL)
2. Stochastic (=Probabilistic) tagging
HMM (Hidden Markov Model) tagging
Hidden Markov Model Tagging
• Using an HMM to do POS tagging
• Is a special case of Bayesian inference
Foundational work in computational linguistics
Bledsoe 1959: OCR
Mosteller and Wallace 1964: authorship
identification
• It is also related to the “noisy channel”
model that’s the basis for ASR, OCR and
MT
POS Tagging as Sequence
Classification
• We are given a sentence (an “observation” or
“sequence of observations”)
Secretariat is expected to race tomorrow
• What is the best sequence of tags which
corresponds to this sequence of observations?
• Probabilistic view:
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag
sequence which is most probable given the
observation sequence of n words w1…wn.
Road to HMMs
• We want, out of all sequences of n tags t1…tn, the single
tag sequence such that P(t1…tn|w1…wn) is highest:
t̂1…n = argmax over t1…tn of P(t1…tn | w1…wn)
• Hat ^ means “our estimate of the best one”
• Argmax_x f(x) means “the x such that f(x) is maximized”
Road to HMMs
• This equation is guaranteed to give us the
best tag sequence
• But how to make it operational? How to
compute this value?
• Intuition of Bayesian classification:
Use Bayes rule to transform into a set of other
probabilities that are easier to compute
Using Bayes Rule
• P(t1…tn | w1…wn) = P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)
• The denominator is the same for every candidate tag sequence,
so we can drop it:
t̂1…n = argmax over t1…tn of P(w1…wn | t1…tn) P(t1…tn)
Likelihood and Prior
• Likelihood: assume each word depends only on its own tag
P(w1…wn | t1…tn) ≈ Π P(wi | ti)
• Prior: assume each tag depends only on the previous tag
P(t1…tn) ≈ Π P(ti | ti−1)
• So: t̂1…n = argmax over t1…tn of Π P(wi | ti) P(ti | ti−1)
Two Sets of Probabilities (1)
• Tag transition probabilities p(ti|ti-1)
Determiners likely to precede adjs and nouns
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
Compute P(NN|DT) by counting in a labeled
corpus:
P(NN|DT) = Count(DT, NN) / Count(DT)
Two Sets of Probabilities (2)
• Word likelihood probabilities p(wi|ti)
VBZ (3sg Pres verb) likely to be “is”
Compute P(is|VBZ) by counting in a
labeled corpus:
P(is|VBZ) = Count(VBZ, is) / Count(VBZ)
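As a rough illustration (not the lecture’s code), both kinds of probabilities can be estimated from any tagged corpus; this sketch uses NLTK’s Penn Treebank sample, which it assumes is installed:

```python
from collections import Counter
import nltk

nltk.download("treebank", quiet=True)   # small Penn Treebank sample

tag_unigrams, tag_bigrams, word_tag = Counter(), Counter(), Counter()
for sent in nltk.corpus.treebank.tagged_sents():
    prev = "<s>"                         # sentence-start pseudo-tag
    for word, tag in sent:
        tag_bigrams[(prev, tag)] += 1    # C(t_{i-1}, t_i)
        tag_unigrams[tag] += 1           # C(t)
        word_tag[(word, tag)] += 1       # C(t, w)
        prev = tag

# Transition probability: P(NN | DT) = C(DT, NN) / C(DT)
print(tag_bigrams[("DT", "NN")] / tag_unigrams["DT"])
# Word likelihood: P(is | VBZ) = C(VBZ, is) / C(VBZ)
print(word_tag[("is", "VBZ")] / tag_unigrams["VBZ"])
```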
An Example: the verb “race”
• Secretariat/NNP is/VBZ expected/VBN to/TO
race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB
the/DT reason/NN for/IN the/DT race/NN
for/IN outer/JJ space/NN
• How do we pick the right tag?
Disambiguating “race”
Example
• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO)P(NR|VB)P(race|VB) = .00000027
• P(NN|TO)P(NR|NN)P(race|NN)=.00000000032
• So we (correctly) choose the verb reading.
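A two-line check of this computation, with the numbers taken straight from the slide:

```python
# Transition probs P(t_i | t_{i-1}), likelihoods P(w | t), from the slide
verb = 0.83    * 0.0027 * 0.00012   # P(VB|TO) * P(NR|VB) * P(race|VB)
noun = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{verb:.2e} vs {noun:.2e}")  # ~2.7e-07 vs ~3.2e-10 -> VB wins
```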
Hidden Markov Models
• What we’ve described with these two
kinds of probabilities is a Hidden Markov
Model
• Let’s just spend a bit of time tying this into
the model
• First some definitions.
Definitions
• A weighted finite-state automaton adds
probabilities to the arcs
The probabilities on the arcs leaving any state
must sum to one
• A Markov chain is a special case in which the
input sequence uniquely determines which
states the automaton will go through
• Markov chains can’t represent inherently
ambiguous problems
Useful for assigning probabilities to unambiguous
sequences
Markov chain for weather
Markov chain for words
Markov chain = “First-order
Observable Markov Model”
• A set of states
Q = q1, q2…qN; the state at time t is qt
• Transition probabilities:
a set of probabilities A = a01, a02, …, an1, …, ann
Each aij represents the probability of transitioning from
state i to state j
The set of these is the transition probability matrix A
• Current state only depends on previous state
P(qi | q1 ...qi−1) = P(qi | qi−1 )
Markov chain for weather
• What is the probability of 4 consecutive
rainy days?
• Sequence is rainy-rainy-rainy-rainy
• I.e., state sequence is 3-3-3-3
• P(3,3,3,3) =
π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432
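A minimal sketch of this computation. Only π(rainy) = 0.2 and P(rainy|rainy) = 0.6 come from the slide; the other entries of the chain below are made-up placeholders so the matrix is complete:

```python
import numpy as np

# States 0 and 1 are placeholders; state 2 = rainy (from the slide)
pi = np.array([0.5, 0.3, 0.2])        # pi[2] = P(start rainy) = 0.2
A  = np.array([[0.6, 0.3, 0.1],       # placeholder rows; each sums to 1
               [0.3, 0.3, 0.4],
               [0.2, 0.2, 0.6]])      # A[2, 2] = P(rainy -> rainy) = 0.6

def seq_prob(states):
    """P(q1..qn) = pi[q1] * prod of A[q_{t-1}, q_t] for a Markov chain."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(seq_prob([2, 2, 2, 2]))   # 0.2 * 0.6**3 = 0.0432
```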
HMM for Ice Cream
• You are a climatologist in the year 2799
• Studying global warming
• You can’t find any records of the weather
in Baltimore, MD for the summer of 2007
• But you find Jason Eisner’s diary
• Which lists how many ice creams Jason
ate every day that summer
• Our job: figure out how hot it was
Hidden Markov Model
• For Markov chains, the output symbols are the same
as the states.
See hot weather: we’re in state hot
• But in part-of-speech tagging (and other things)
The output symbols are words
But the hidden states are part-of-speech tags
• So we need an extension!
• A Hidden Markov Model is an extension of a Markov
chain in which the output symbols are not the same as
the states.
• This means we don’t know which state we are in.
Hidden Markov Models
• States Q = q1, q2 … qN
• Observations O = o1, o2 … oN
Each observation is a symbol from a vocabulary
V = {v1, v2, …, vV}
• Transition probabilities
Transition probability matrix A = {aij}
aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
• Observation likelihoods
Output probability matrix B = {bi(k)}
bi(k) = P(Xt = ok | qt = i)
• Special initial probability vector π
πi = P(q1 = i), 1 ≤ i ≤ N
Eisner task
• Given
Ice Cream Observation Sequence:
1,2,3,2,2,2,3…
• Produce:
Weather Sequence: H,C,H,H,H,C…
HMM for ice cream
Transitions between the hidden
states of HMM, showing A probs
B observation likelihoods for
POS HMM
The A matrix for the POS HMM
The B matrix for the POS HMM
Viterbi intuition: we are looking
for the best ‘path’
(Lattice: states S1–S5, one column of candidate tags per word of
“promised to back the bill”; candidates include VB, VBD, VBN,
NNP, TO, JJ, RB, DT, NN)
The Viterbi Algorithm
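As a concrete sketch of the algorithm on the ice-cream HMM (the transition and emission numbers below are illustrative, not necessarily the lecture’s):

```python
import numpy as np

# Hidden states 0=Hot, 1=Cold; observations are ice creams eaten (1-3)
pi = np.array([0.8, 0.2])            # initial state probabilities
A  = np.array([[0.7, 0.3],           # P(next state | Hot)
               [0.4, 0.6]])          # P(next state | Cold)
B  = np.array([[0.2, 0.4, 0.4],      # P(1|Hot), P(2|Hot), P(3|Hot)
               [0.5, 0.4, 0.1]])     # P(1|Cold), P(2|Cold), P(3|Cold)

def viterbi(obs):
    """Return the most probable hidden state sequence for obs (values 1-3)."""
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))                 # v[t, j]: best path prob ending in j at t
    back = np.zeros((T, N), dtype=int)   # backpointers
    v[0] = pi * B[:, obs[0] - 1]
    for t in range(1, T):
        for j in range(N):
            scores = v[t - 1] * A[:, j] * B[j, obs[t] - 1]
            back[t, j] = np.argmax(scores)
            v[t, j] = scores[back[t, j]]
    # Trace the backpointers from the best final state
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return ["H" if s == 0 else "C" for s in path]

print(viterbi([3, 1, 3]))   # ['H', 'H', 'H'] with these numbers
```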
Viterbi example
Error Analysis
• Look at a confusion matrix
• See what errors are causing problems
Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
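A confusion matrix is just a table of (gold tag, predicted tag) counts. A tiny sketch with made-up tag sequences, showing the kinds of confusions listed above:

```python
from collections import Counter

gold = ["NN", "NNP", "JJ", "VBD", "VBN", "NN"]
pred = ["NN", "NN",  "JJ", "VBN", "VBN", "JJ"]

# Count each (gold, predicted) pair; off-diagonal cells are the errors
for (g, p), n in Counter(zip(gold, pred)).items():
    if g != p:
        print(f"gold {g} tagged as {p}: {n}")
```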
Evaluation
• The result is compared with a manually
coded “Gold Standard”
Typically accuracy reaches 96-97%
This may be compared with the result for a
baseline tagger (one that uses no context).
• Important: 100% is impossible even for
human annotators.
Summary
• HMM Tagging
Markov Chains
Hidden Markov Models