CSCI 5832

Natural Language Processing

Jim Martin
Lecture 9

Today 2/12

• Review
  – GT example
• HMMs and Viterbi
  – POS tagging
Good-Turing Intuition

• Notation: Nx is the frequency-of-frequency-x (the number of types seen exactly x times)
  – So N10 = 1, N1 = 3, etc.
• To estimate counts/probs for unseen species:
  – Use the number of species (words) we’ve seen once
  – c0* = c1; p0 = N1/N
• All other estimates are adjusted (down) to allow for increased probabilities for unseen events
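For reference, the general Good-Turing re-estimate that the following slides apply (the standard formula; only the unseen-mass case appears on the extracted slide):

$$c^* = (c+1)\,\frac{N_{c+1}}{N_c}, \qquad p_0 = \frac{N_1}{N}$$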
HW 0 Results

• Favorite color (21 events):
  – Blue 8
  – Green 3
  – Red 2
  – Black 2
  – White 2
  – Periwinkle 1
  – Gamboge 1
  – Eau-de-Nil 1
  – Brown 1
• Count of counts:
  – N1 = 4
  – N2 = 3
  – N3 = 1
  – N4,5,6,7 = 0
  – N8 = 1
GT for a New Color

• Count of counts: N1 = 4, N2 = 3, N3 = 1, N4,5,6,7 = 0, N8 = 1
• Treat the 0s as 1s, so:
  – N0 = 4; P(new color) = 4/21 = .19
  – If we knew the number of colors out there, we would divide .19 by the number of colors not seen.
• Otherwise:
  – N*1 = (1+1) × 3/4 = 6/4 = 1.5
  – P*(Periwinkle) = 1.5/21 = .07
  – N*2 = (2+1) × 1/3 = 1
  – P*(Black) = 1/21 = .047
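A minimal Python sketch of this computation, using the HW 0 color counts (the variable names and the gt_count helper are mine, not from the lecture):

```python
from collections import Counter

# Raw color counts from the HW 0 example (21 events total).
counts = {"Blue": 8, "Green": 3, "Red": 2, "Black": 2, "White": 2,
          "Periwinkle": 1, "Gamboge": 1, "Eau-de-Nil": 1, "Brown": 1}
N = sum(counts.values())                 # 21

# Count of counts: Nc[c] = number of types seen exactly c times.
Nc = Counter(counts.values())            # {1: 4, 2: 3, 3: 1, 8: 1}

# Probability mass reserved for unseen colors: p0 = N1 / N.
p_unseen = Nc[1] / N                     # 4/21 ≈ 0.19

def gt_count(c):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * Nc[c + 1] / Nc[c]

print(p_unseen)              # ≈ 0.19
print(gt_count(1) / N)       # P*(Periwinkle) = 1.5/21 ≈ 0.07
print(gt_count(2) / N)       # P*(Black) = 1/21 ≈ 0.047
```

Note that `Counter` returns 0 for missing bins, so `gt_count(3)` comes out 0 — exactly the problem the "twists" on the next slide address.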
GT for a New Color (continued)

• Count of counts: N1 = 4, N2 = 3, N3 = 1, N4,5,6,7 = 0, N8 = 1
• But 2 twists:
  – Treat the high flyers as trusted, so P(Blue) should stay 8/21
  – Use interpolation to smooth the bin counts before re-estimation, to deal with zero bins such as N*3 = (3+1) × 0/1 = 0
Why Logs?

• Simple Good-Turing does linear interpolation in log-space. Why?
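For context (this is the standard Simple Good-Turing recipe of Gale & Sampson, which the lost figure presumably plotted): fit a straight line to the count-of-counts in log-log space,

$$\log N_c = a + b \log c$$

and use the smoothed values in the re-estimate, so that zero bins like N4–N7 above no longer break it.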
Part of Speech tagging

• Part of speech tagging
  – Parts of speech
  – What’s POS tagging good for anyhow?
  – Tag sets
  – Rule-based tagging
  – Statistical tagging
    - Simple most-frequent-tag baseline
  – Important ideas
    - Training sets and test sets
    - Unknown words
  – HMM tagging
Parts of Speech

• 8 (ish) traditional parts of speech
  – Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  – Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS
  – Lots of debate in linguistics about the number, nature, and universality of these
  – We’ll completely ignore this debate.
POS examples

• N    noun         chair, bandwidth, pacing
• V    verb         study, debate, munch
• ADJ  adjective    purple, tall, ridiculous
• ADV  adverb       unfortunately, slowly
• P    preposition  of, by, to
• PRO  pronoun      I, me, mine
• DET  determiner   the, a, that, those
POS Tagging example

WORD    TAG

the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
POS Tagging

• Words often have more than one POS: back
  – The back door = JJ
  – On my back = NN
  – Win the voters back = RB
  – Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.

(These examples are from Dekang Lin.)
How hard is POS tagging? Measuring ambiguity
2 methods for POS tagging

1. Rule-based tagging
   – e.g., ENGTWOL
2. Stochastic (= probabilistic) tagging
   – HMM (Hidden Markov Model) tagging
Hidden Markov Model Tagging

• Using an HMM to do POS tagging is a special case of Bayesian inference
  – Foundational work in computational linguistics:
    - Bledsoe 1959: OCR
    - Mosteller and Wallace 1964: authorship identification
• It is also related to the “noisy channel” model that’s the basis for ASR, OCR, and MT
POS Tagging as Sequence Classification

• We are given a sentence (an “observation” or “sequence of observations”)
  – Secretariat is expected to race tomorrow
• What is the best sequence of tags that corresponds to this sequence of observations?
• Probabilistic view:
  – Consider all possible sequences of tags
  – Out of this universe of sequences, choose the tag sequence that is most probable given the observation sequence of n words w1…wn
Road to HMMs

• We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest (see the equation below)
• The hat ^ means “our estimate of the best one”
• argmaxx f(x) means “the x such that f(x) is maximized”
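The equation itself was an image in the original; reconstructed from the surrounding text, it presumably read:

$$\hat{t}_1^{\,n} = \underset{t_1^n}{\mathrm{argmax}}\; P(t_1^n \mid w_1^n)$$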
Road to HMMs

• This equation is guaranteed to give us the best tag sequence
• But how do we make it operational? How do we compute this value?
• Intuition of Bayesian classification:
  – Use Bayes’ rule to transform the expression into a set of other probabilities that are easier to compute
Using Bayes Rule

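The equation on this slide was also an image. Applying Bayes’ rule as the previous slide describes (the standard derivation), it presumably read:

$$\hat{t}_1^{\,n} = \underset{t_1^n}{\mathrm{argmax}}\; \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)} = \underset{t_1^n}{\mathrm{argmax}}\; P(w_1^n \mid t_1^n)\,P(t_1^n)$$

The denominator can be dropped because it is the same for every candidate tag sequence.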
Likelihood and Prior

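The equations for this slide were images as well. The standard HMM simplifying assumptions they cover are that the likelihood factors word by word and the prior factors into tag bigrams:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i), \qquad P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

giving

$$\hat{t}_1^{\,n} \approx \underset{t_1^n}{\mathrm{argmax}}\; \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$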
Two Sets of Probabilities (1)

• Tag transition probabilities P(ti|ti-1)
  – Determiners likely to precede adjectives and nouns
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
    - So we expect P(NN|DT) and P(JJ|DT) to be high
  – Compute P(NN|DT) by counting in a labeled corpus (formula below)
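The counting formula (an image in the original) is the usual maximum-likelihood estimate:

$$P(NN \mid DT) = \frac{C(DT,\,NN)}{C(DT)}$$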
Two Sets of Probabilities (2)

• Word likelihood probabilities P(wi|ti)
  – VBZ (3sg pres verb) likely to be “is”
  – Compute P(is|VBZ) by counting in a labeled corpus (formula below)
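Again reconstructing the lost image, the maximum-likelihood estimate is:

$$P(\textit{is} \mid VBZ) = \frac{C(VBZ,\,\textit{is})}{C(VBZ)}$$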
An Example: the verb “race”

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag?
Disambiguating “race”

Example

• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
• P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
• So we (correctly) choose the verb reading.
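A quick arithmetic check of the two readings (a sketch; the probability values are the ones on the slide, the variable names are mine):

```python
# Probabilities from the slide (estimated from a tagged corpus).
p_vb_given_to = .83      # P(VB|TO)
p_nn_given_to = .00047   # P(NN|TO)
p_race_vb    = .00012    # P(race|VB)
p_race_nn    = .00057    # P(race|NN)
p_nr_vb      = .0027     # P(NR|VB)
p_nr_nn      = .0012     # P(NR|NN)

verb_reading = p_vb_given_to * p_nr_vb * p_race_vb   # ≈ 2.7e-07
noun_reading = p_nn_given_to * p_nr_nn * p_race_nn   # ≈ 3.2e-10

print(verb_reading > noun_reading)   # True: the verb reading wins
```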
Hidden Markov Models

• What we’ve described with these two kinds of probabilities is a Hidden Markov Model
• Let’s just spend a bit of time tying this into the model
• First, some definitions
Definitions

• A weighted finite-state automaton adds probabilities to the arcs
  – The probabilities on the arcs leaving any given state must sum to one
• A Markov chain is a special case in which the input sequence uniquely determines which states the automaton will go through
• Markov chains can’t represent inherently ambiguous problems
  – Useful for assigning probabilities to unambiguous sequences
Markov chain for weather

Markov chain for words

Markov chain = “First-order Observable Markov Model”

• A set of states
  – Q = q1, q2, …, qN; the state at time t is qt
• Transition probabilities:
  – A set of probabilities A = a01, a02, …, an1, …, ann
  – Each aij represents the probability of transitioning from state i to state j
  – The set of these is the transition probability matrix A
• The current state depends only on the previous state:
  P(qi | q1…qi−1) = P(qi | qi−1)
Markov chain for weather

• What is the probability of 4 consecutive rainy days?
• The sequence is rainy-rainy-rainy-rainy
• I.e., the state sequence is 3-3-3-3
• P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432
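A small sketch of how a state-sequence probability like this is computed. Only π(rainy) = 0.2 and a(rainy→rainy) = 0.6 come from the slide; the remaining numbers and the dictionary layout are assumptions made to complete the chain:

```python
# Illustrative weather Markov chain; "rainy" plays the role of state 3.
pi = {"rainy": 0.2, "sunny": 0.8}                        # initial probs (0.8 assumed)
A  = {("rainy", "rainy"): 0.6, ("rainy", "sunny"): 0.4,  # transition probs
      ("sunny", "rainy"): 0.3, ("sunny", "sunny"): 0.7}  # (0.4/0.3/0.7 assumed)

def sequence_prob(states):
    """P(s1..sn) = pi[s1] * product of A[s_{i-1}, s_i]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[(prev, cur)]
    return p

print(sequence_prob(["rainy"] * 4))   # 0.2 * 0.6**3 = 0.0432
```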
HMM for Ice Cream

• You are a climatologist in the year 2799
• Studying global warming
• You can’t find any records of the weather in Baltimore, MD for the summer of 2007
• But you find Jason Eisner’s diary
• Which lists how many ice creams Jason ate every day that summer
• Our job: figure out how hot it was
Hidden Markov Model

• For Markov chains, the output symbols are the same as the states
  – See hot weather: we’re in state hot
• But in part-of-speech tagging (and other things):
  – The output symbols are words
  – But the hidden states are part-of-speech tags
• So we need an extension!
• A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
• This means we don’t know which state we are in
Hidden Markov Models

• States: Q = q1, q2, …, qN
• Observations: O = o1, o2, …, oN
  – Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
• Transition probabilities:
  – Transition probability matrix A = {aij}, where aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
• Observation likelihoods:
  – Output probability matrix B = {bi(k)}, where bi(k) = P(Xt = ok | qt = i)
• A special initial probability vector π, where πi = P(q1 = i), 1 ≤ i ≤ N
Eisner task

• Given:
  – Ice cream observation sequence: 1,2,3,2,2,2,3…
• Produce:
  – Weather sequence: H,C,H,H,H,C…
HMM for ice cream

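The figure for this slide did not survive extraction. As a stand-in, here is a sketch of an ice-cream HMM in the same shape; the probability values are illustrative placeholders, not necessarily those of the lost figure:

```python
states = ["HOT", "COLD"]
observations = [1, 2, 3]            # ice creams eaten per day

pi = {"HOT": 0.8, "COLD": 0.2}      # initial state probs (placeholder values)

A = {("HOT", "HOT"): 0.7, ("HOT", "COLD"): 0.3,     # transition probs
     ("COLD", "HOT"): 0.4, ("COLD", "COLD"): 0.6}   # (placeholder values)

B = {"HOT":  {1: 0.2, 2: 0.4, 3: 0.4},              # observation likelihoods
     "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}              # (placeholder values)
```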
Transitions between the hidden states of the HMM, showing the A probabilities
B observation likelihoods for the POS HMM
The A matrix for the POS HMM

The B matrix for the POS HMM

Viterbi intuition: we are looking for the best ‘path’

[Figure: a lattice for “promised to back the bill” with one column (S1–S5) per word; each column lists that word’s candidate tags (drawn from VB, VBD, VBN, NNP, TO, JJ, DT, NN, RB), and tagging amounts to finding the best path through the lattice.]
The Viterbi Algorithm

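The slide’s figure (the textbook’s Viterbi pseudocode) was lost in extraction. Here is a compact Python sketch of the algorithm over HMM dictionaries shaped like the ice-cream example above (function and variable names are mine):

```python
def viterbi(obs, states, pi, A, B):
    """Return the most probable hidden state sequence for obs, and its probability."""
    # V[t][s]: probability of the best path ending in state s at time t.
    V = [{s: pi[s] * B[s][obs[0]] for s in states}]
    backpointer = [{}]

    for t in range(1, len(obs)):
        V.append({})
        backpointer.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prev = max(states, key=lambda r: V[t - 1][r] * A[(r, s)])
            V[t][s] = V[t - 1][prev] * A[(prev, s)] * B[s][obs[t]]
            backpointer[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path)), V[-1][last]
```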
Viterbi example

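For instance, running the sketch above with the placeholder ice-cream parameters (states, pi, A, B) from the earlier snippet:

```python
path, prob = viterbi([3, 1, 3], states, pi, A, B)
print(path)   # ['HOT', 'HOT', 'HOT'] under these placeholder parameters
print(prob)   # 0.012544 — probability of that single best path
```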
Error Analysis

• Look at a confusion matrix
• See what errors are causing problems:
  – Noun (NN) vs. proper noun (NNP) vs. adjective (JJ)
  – Preterite (VBD) vs. participle (VBN) vs. adjective (JJ)
Evaluation

• The result is compared with a manually coded “gold standard”
  – Typically, accuracy reaches 96–97%
  – This may be compared with the result for a baseline tagger (one that uses no context)
• Important: 100% is impossible even for human annotators
Summary

• HMM tagging
  – Markov chains
  – Hidden Markov Models
