
AAU

AAiT
SiTE
Course Title: Natural Language Processing (TSC-7261)
Credit Hour: 3
Instructor: Fantahun B. (PhD)  [email protected]
Office: NB #

3-Parts of Speech Tagging and Sequence Labeling


2023/2024, AA
POS Tagging and Sequence Labeling
Contents
 POS Tagging
 Lexical syntax
 Hidden Markov Models
 Maximum Entropy Models

11/13/2023

NLP

Fantahun B.(PhD)

2
POS Tagging and Sequence Labeling
Objectives:
After completing this chapter, students will be able to:

11/13/2023

NLP

Fantahun B.(PhD)

3
POS Tagging
 From the earliest linguistic traditions (Yaska and Panini 5th C. BCE,
Aristotle 4th C. BCE), the idea that words can be classified into
grammatical categories
 part of speech, word classes, POS, POS tags, morphological classes,
or lexical tags

 8 parts of speech attributed to Dionysius Thrax of Alexandria (c.


1st C. BCE):
 Noun,
 Verb,
 Pronoun,
 Preposition,

 Adverb,
 Conjunction,
 Participle,
 Article

 These categories are relevant for NLP today.


11/13/2023

NLP

Fantahun B.(PhD)

From the earliest linguistic traditions (the Sanskrit grammarians Yaska and Panini in India, and Aristotle and the Stoics in Greece) came the idea that words can be classified into grammatical categories.
POS Tagging
 Part-of-speech tagging is the process of assigning a part-

of-speech to each word in a text.

 Tagging is a disambiguation task. Why?

  Words are ambiguous: they can have more than one possible part-of-speech, and the goal is to find the correct tag for the situation.

 Example book:
 VERB: (Book that flight)
 NOUN: (Hand me that book).
 Maps from sequence x1,…,xn of words to

y1,…,yn of POS tags

11/13/2023

NLP

Fantahun B.(PhD)

5
POS Tagging

A sketch of part of speech tagging.


 Input: a sequence x1,x2,...,xn of (tokenized) words and a tagset
 Output: a sequence y1,y2,...,yn of tags, each output yi corresponding exactly to one input xi
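
As a quick, hedged illustration of this input/output mapping (not part of the original slides), an off-the-shelf tagger such as NLTK's can be used; this assumes the nltk package and its tokenizer/tagger resources are installed:

```python
# A minimal sketch of the tagging interface: tokens in, (token, tag) pairs out.
# Assumes `pip install nltk` and that the punkt and averaged_perceptron_tagger
# resources have been downloaded.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Book that flight.", "Hand me that book."]:
    tokens = nltk.word_tokenize(sentence)   # x1, ..., xn
    print(nltk.pos_tag(tokens))             # [(x1, y1), ..., (xn, yn)], Penn Treebank tags
```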

11/13/2023

NLP

Fantahun B.(PhD)

6
POS Tagging
 Map from sequence x1,…,xn of words to y1,…,yn of POS tags

11/13/2023

NLP

Fantahun B.(PhD)

7
POS Tagging: Significance
What do you think is the significance of POS?
 The significance of parts-of-speech is the large amount of

information they give about a word and its neighbors.

 Useful in a language model for speech recognition.


 Eg. tagsets distinguish between possessive pronouns (my, your, his,

her, its) and personal pronouns (I, you, he, me).


• possessive pronouns  likely to be followed by a noun,
• personal pronouns  likely to be followed by a verb.

11/13/2023

NLP

Fantahun B.(PhD)

8
POS Tagging: Significance
 Speech synthesis system: a word’s part-of-speech can tell us

something about how the word is pronounced.

 Example, the word content can be a noun or an adjective.

They are pronounced differently


• As noun  pronounced CONtent,
• As adjective  pronounced conTENT.

 Thus knowing the POS can produce more natural pronunciations in

a speech synthesis system and more accuracy in a speech


recognition system.

 More examples
• Object: OBject (noun) and obJECT (verb)
• Discount: DIScount (noun) and disCOUNT (verb)
• INsult vs. inSULT? OVERflow vs. overFLOW? DIScount vs. disCOUNT?
11/13/2023

NLP

Fantahun B.(PhD)

9
POS Tagging: Significance
 Information Retrieval: POS can also be used in stemming for information retrieval (IR), since knowing a word’s POS can help tell us which morphological affixes it can take. POS tags can also enhance an IR application by selecting out nouns or other important words from a document.
 Parsing, WSD: Automatic assignment of POS plays a role in parsing,

in word-sense disambiguation algorithms, and in shallow parsing of


texts to quickly find names, times, dates, or other named entities
for the information extraction applications.
 Linguistic research: corpora that have been marked for POS are

very useful for linguistic research. For example, they can be used
to help find instances or frequencies of particular constructions.

11/13/2023

NLP

Fantahun B.(PhD)

10
POS Tagging: POS Categories: 1-closed classes
 Closed class words:
 Closed classes are those that have relatively fixed membership.
 For example, prepositions are a closed class because there is a
fixed set of them in English; new prepositions are rarely coined.
 Usually function words: short, frequent words with grammatical
function
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …

11/13/2023

NLP

Fantahun B.(PhD)

11
POS Tagging: POS Categories: 2-open classes
 Open class words: nouns and verbs are open classes
because new nouns and verbs are continually coined or
borrowed from other languages.
 content words: Nouns, Verbs, Adjectives, Adverbs
 Interjections: oh, ouch, uh-huh, yes, hello
o New nouns and verbs like iPhone or to fax

 There are four major open classes that occur in the

languages of the world;


 nouns, verbs, adjectives, and adverbs.
 English has all four of these, although not every language does.
11/13/2023

NLP

Fantahun B.(PhD)

12
POS Tagging and Sequence Labeling
Open class ("content") words
 Nouns
  Proper: Janet, Italy
  Common: cat, cats, mango
 Verbs
  Main: eat, went
 Adjectives: old, green, tasty
 Adverbs: slowly, yesterday
 Numbers: 122,312, one
 … more

Closed class ("function") words
 Determiners: the, a, some
 Conjunctions: and, or
 Pronouns: they, its
 Auxiliary: be, can, do, had
 Prepositions: to, with
 Particles: off, up
 Interjections: Ow, hello
 … more

11/13/2023

NLP

Fantahun B.(PhD)

See Section-5.1 page-124-130


13
POS Tagging: Tagsets for English
 There are a small number of popular tagsets for English, many

of which evolved from


 the 87-tag tagset used for the Brown corpus (Francis, 1979; Francis and Kučera, 1982).


 This corpus was tagged with POS by first applying the TAGGIT

program and then hand-correcting the tags.

 Besides this original Brown tagset, other most common tagsets:


 the small 45-tag Penn Treebank tagset (Marcus et al., 1993), and
 the medium-sized 61-tag C5 tagset used by the Lancaster UCREL project’s CLAWS (Constituent Likelihood Automatic Word-tagging System) tagger to tag the British National Corpus (BNC) (Garside et al., 1997).

11/13/2023

NLP

Fantahun B.(PhD)

14
POS Tagging: Tagsets for English

[Figure: the 45-tag Penn Treebank tagset]

11/13/2023

NLP

Fantahun B.(PhD)

15
POS Tagging: Universal dependencies tagset
[Figure: the Universal Dependencies tagset]

11/13/2023

NLP

Fantahun B.(PhD)

16
POS Tagging: Some tagged English sentences
Example:
There/PRO were/VERB 70/NUM children/NOUN there/ADV ./PUNC
Preliminary/ADJ findings/NOUN were/AUX reported/VERB in/ADP
today/NOUN ’s/PART New/PROPN England/PROPN Journal/PROPN
of/ADP Medicine/PROPN
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN
of/IN other/JJ topics/NNS ./.
There/EX are/VBP 70/CD children/NNS there/RB

11/13/2023

NLP

Fantahun B.(PhD)

17
POS Tagging: Difficulties in tagging
 Some tagging distinctions are quite hard for both humans

and machines to make.

1) Eg. prepositions (IN), particles (RP), and adverbs (RB) can

have a large overlap.

 Words like around can be all three:


 Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO
joining/VBG
 All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT
corner/NN
 Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

11/13/2023

NLP

Fantahun B.(PhD)

18
POS Tagging: Difficulties in tagging
 Making these decisions requires sophisticated knowledge of syntax;

tagging manuals (Santorini, 1990) give various heuristics that can help
human coders make these decisions, and that can also provide useful
features for automatic taggers. Eg. two heuristics from Santorini (1990):
 Prepositions are generally associated with a following noun phrase (although they may also be followed by prepositional phrases), and the word around is tagged as an adverb when it means “approximately”.
 Particles often can either precede or follow a noun phrase object:
• She told off/RP her friends  particle
• She told her friends off/RP.

 Prepositions, unlike particles, cannot follow their noun phrase (* is used here to mark an ungrammatical sentence):

• She stepped off/IN the train  preposition


• *She stepped the train off/IN.
11/13/2023

NLP

Fantahun B.(PhD)

19
POS Tagging: Difficulties in tagging
2) Another difficulty is labeling the words that can modify nouns.
 Sometimes the modifiers preceding nouns are common nouns

like cotton below,

 other times the Treebank tagging manual specifies that

modifiers be tagged as adjectives (for example if the modifier is


a hyphenated common noun like income-tax) and

 other times as proper nouns (for modifiers which are

hyphenated proper nouns like Gramm-Rudman):


o cotton/NN sweater/NN
o income-tax/JJ return/NN
o the/DT Gramm-Rudman/NP Act/NP

11/13/2023

NLP

Fantahun B.(PhD)

20
POS Tagging: Difficulties in tagging
 Some words that can be adjectives, common nouns, or

proper nouns, are tagged in the Treebank as common nouns


when acting as modifiers:

 Chinese/NN cooking/NN
 Pacific/NN waters/NNS

11/13/2023

NLP

Fantahun B.(PhD)

21
POS Tagging: Difficulties in tagging
3) A third known difficulty in tagging is distinguishing past

participles (VBN) from adjectives (JJ).

 A word like married is a past participle when it is being used

in an eventive, verbal way, as below, and is an adjective


when it is being used to express a property, as below:
 They were married/VBN by the Justice of the Peace

yesterday at 5:00.
 At the time, she was already married/JJ.

11/13/2023

NLP

Fantahun B.(PhD)

22
POS Tagging: Algorithms
 Many algorithms have been applied to this problem,

including
 Rule-based tagging
 Probabilistic / Stochastic methods
o HMM tagging
o Maximum entropy tagging

 Transformation based tagging and


 memory-based tagging.

11/13/2023

NLP

Fantahun B.(PhD)

23
POS Tagging: Algorithms - Rule-Based POS Tagging
 The earliest algorithms for automatically assigning POS were

based on a two-stage architecture (Harris, 1962; Klein and


Simmons, 1963; Greene and Rubin, 1971).
1. Use a dictionary to assign each word a list of potential POS.
2. Use large lists of hand-written disambiguation rules to winnow

down this list to a single POS for each word.

 One of the most comprehensive rule-based approaches is the Constraint Grammar approach (Karlsson et al., 1995a). In this section we describe a tagger based on this approach, the EngCG tagger (Voutilainen, 1995, 1999).

11/13/2023

NLP

Fantahun B.(PhD)

24
POS Tagging: Algorithms - Rule-Based POS Tagging
 The ENGTWOL lexicon used by EngCG is based on two-level morphology and has about 56,000 entries for English word stems (Heikkilä, 1995),
 counting a word with multiple POS (e.g., nominal and verbal senses of

hit) as separate entries, and


 not counting inflected and many derived forms.

 Each entry is annotated with a set of morphological and

syntactic features.
 Fig. 5.11 shows some selected words, together with a slightly

simplified listing of their features; these features are used in


rule writing.
11/13/2023

NLP

Fantahun B.(PhD)

25
POS Tagging: Algorithms - Rule-Based POS Tagging

11/13/2023

NLP

Fantahun B.(PhD)

26
POS Tagging: Algorithms - Rule-Based POS Tagging
 Most of the features in Fig. 5.11 are relatively self-explanatory;
 SG for singular, -SG3 for other than third-person-singular.
 ABSOLUTE  non-comparative and non-superlative for an adjective,

NOMINATIVE  non-genitive, and PCP2 means past participle.

 PRE, CENTRAL, and POST are ordering slots for determiners

(predeterminers (all) come before determiners (the): all the president’s


men).

 NOINDEFDETERMINER  words like furniture do not appear with the

indefinite determiner a.

 SV, SVO, and SVOO specify the subcategorization or complementation

pattern for the verb. SV means the verb appears solely with a subject
(nothing occurred); SVO with a subject and an object (I showed the film);
SVOO with a subject and two complements: She showed her the ball.

11/13/2023

NLP

Fantahun B.(PhD)

27
POS Tagging: Algorithms - Rule-Based POS Tagging
 In the first stage of the tagger, each word is run through the two-level

lexicon transducer and the entries for all possible parts-of-speech are
returned.

 For example the phrase “Pavlov had shown that salivation . . .“ would

return the following list (one line per possible tag, with the correct tag
shown in boldface):
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
...

11/13/2023

NLP

Fantahun B.(PhD)

28
POS Tagging: Algorithms - Rule-Based POS Tagging
 EngCG then applies a large set of constraints (as many as 3,744

constraints in the EngCG-2 system) to the input sentence to rule out


incorrect parts-of-speech.

 The boldfaced entries in the table above show the desired result, in

which the simple past tense tag (rather than the past participle tag) is
applied to had, and the complementizer (CS) tag is applied to that.

 The constraints are used in a negative way, to eliminate tags that are

inconsistent with the context.

 For example one constraint eliminates all readings of that except the

ADV (adverbial intensifier) sense (this is the sense in the sentence it


isn’t that odd).

 Here’s a simplified version of the constraint. . .

11/13/2023

NLP

Fantahun B.(PhD)

29
POS Tagging: Algorithms - Rule-Based POS Tagging
 Here’s a simplified version of the constraint:

ADVERBIAL-THAT RULE
Given input: “that”
  if  (+1 A/ADV/QUANT);   /* next word is an adjective, adverb, or quantifier */
      (+2 SENT-LIM);      /* and the word after that is a sentence boundary */
      (NOT -1 SVOC/A);    /* and the previous word is not a verb like consider
                             that takes an adjective complement */
  then eliminate non-ADV tags
  else eliminate ADV tag
11/13/2023

NLP

Fantahun B.(PhD)

30
POS Tagging: Algorithms - Rule-Based POS Tagging
 The first two clauses of this rule check to see that the that

directly precedes a sentence-final adjective, adverb, or


quantifier. In all other cases the adverb reading is eliminated.

 The last clause eliminates cases preceded by verbs like consider

or believe which can take a noun and an adjective; this is to


avoid tagging the following instance of that as an adverb:

I consider that odd.

11/13/2023

NLP

Fantahun B.(PhD)

31
POS Tagging: Algorithms - Rule-Based POS Tagging
 Another rule is used to express the constraint that the

complementizer sense of that is most likely to be used if the


previous word is a verb which expects a complement (like
believe, think, or show), and if that is followed by the beginning
of a noun phrase and finite verb.

 This description oversimplifies the EngCG architecture; the

system also includes probabilistic constraints, and also makes


use of other syntactic information we haven’t discussed. The
interested reader should consult Karlsson et al. (1995b) and
Voutilainen (1999).

11/13/2023

NLP

Fantahun B.(PhD)

32
POS Tagging: Markov Chains
 An HMM is nothing more than a probabilistic function of a

Markov process.

 Markov processes/chains/models were first developed by

Andrei A. Markov (a student of Chebyshev).

 We will refer to vanilla Markov models as Visible Markov Models

(VMMs) when we want to be careful to distinguish them from


HMMs.

 Markov models can be used whenever one wants to model the

probability of a linear sequence of events.

11/13/2023

NLP

Fantahun B.(PhD)

33
POS Tagging: Markov Chains
 Markov chains and Hidden Markov Models are both extensions

of the finite automata.

 A finite automaton is defined by a set of states and a set of transitions between states that are taken based on the input observations.

 A weighted finite-state automaton is a simple augmentation of

the finite automaton in which each arc is associated with a


probability, indicating how likely that path is to be taken.

 The probability on all the arcs leaving a node must sum to 1.


 A Markov chain is a special case of a weighted automaton in

which the input sequence uniquely determines which states the


automaton will go through.

11/13/2023

NLP

Fantahun B.(PhD)

34
POS Tagging: Markov Chains
 Fig. 6.1a shows a Markov chain for assigning a probability to a

sequence of weather events, where the vocabulary consists of


HOT, COLD, and RAINY.

11/13/2023

NLP

Fantahun B.(PhD)

35
POS Tagging: Markov Chains
 Fig. 6.1b shows another simple example of a Markov chain for

assigning a probability to a sequence of words w1, ..., wn.

 A Markov chain is specified by the following components:
  Q = q1 q2 ... qN: a set of N states
  A = a11 a12 ... aNN: a transition probability matrix, each aij representing the probability of moving from state i to state j, with Σj aij = 1 for every i
  an initial probability distribution over states (or, equivalently, a special start state q0)
11/13/2023

NLP

Fantahun B.(PhD)

36
POS Tagging: Markov Chains
 First-order Markov chain assumption: the probability of a state depends only on the immediately preceding state,
  P(qi | q1 ... qi−1) = P(qi | qi−1)
 Each aij expresses the probability P(qj | qi); hence the probabilities on all arcs leaving a state sum to 1:
  Σj aij = 1
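
A small runnable sketch, under assumed transition probabilities (the actual numbers in Fig. 6.1a are not reproduced in these slides), of how a Markov chain assigns a probability to a sequence of weather events:

```python
# Probability of a state sequence under a first-order Markov chain.
# The transition probabilities below are illustrative, not the values from Fig. 6.1a.
START = "<s>"
A = {
    ("<s>", "HOT"): 0.5, ("<s>", "COLD"): 0.3, ("<s>", "RAINY"): 0.2,
    ("HOT", "HOT"): 0.6, ("HOT", "COLD"): 0.3, ("HOT", "RAINY"): 0.1,
    ("COLD", "HOT"): 0.3, ("COLD", "COLD"): 0.5, ("COLD", "RAINY"): 0.2,
    ("RAINY", "HOT"): 0.2, ("RAINY", "COLD"): 0.4, ("RAINY", "RAINY"): 0.4,
}

def sequence_probability(states):
    """P(q1, ..., qn) = P(q1 | <s>) * prod over i of P(qi | qi-1)."""
    prob = 1.0
    prev = START
    for q in states:
        prob *= A[(prev, q)]
        prev = q
    return prob

print(sequence_probability(["HOT", "HOT", "COLD"]))   # 0.5 * 0.6 * 0.3 = 0.09
```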

11/13/2023

NLP

Fantahun B.(PhD)

37
POS Tagging: Hidden Markov Models
 In an HMM, you don't know the state sequence that the model

passes through, but only some probabilistic function of it.

 A Markov chain is useful when we need to compute a probability

for a sequence of events that we can observe in the world.

 In many cases, however, the events we are interested in may not

be directly observable in the world.


 Example: in POS tagging, we didn’t observe POS tags in the world;
we saw words, and had to infer the correct tags from the word
sequence. We call the POS tags hidden because they are not
observed.
 In speech recognition; in that case we’ll see acoustic events in the
world, and have to infer the presence of ‘hidden’ words that are
the underlying causal source of the acoustics.

11/13/2023

NLP

Fantahun B.(PhD)

38
POS Tagging: Hidden Markov Models
 An HMM allows us to talk about both observed events (like words that we see in the input) and hidden events (like POS tags) that we think of as causal factors in our probabilistic model.

 To exemplify these models, we’ll use a task conceived of by Jason Eisner

(2002a).

 Imagine that you are a climatologist in the year 2799 studying the history of

global warming. You cannot find any records of the weather in Baltimore,
Maryland, for the summer of 2007, but you do find Jason Eisner’s diary, which
lists how many ice creams Jason ate every day that summer. Our goal is to
use these observations to estimate the temperature every day. We’ll simplify
this weather task by assuming there are only two kinds of days: cold (C) and
hot (H).

 So the Eisner task is as follows: Given a sequence of observations O,

each observation an integer corresponding to the number of ice creams


eaten on a given day, figure out the correct ‘hidden’ sequence Q of
weather states (H or C) which caused Jason to eat the ice cream.

11/13/2023

NLP

Fantahun B.(PhD)

39
POS Tagging: Hidden Markov Models
Formal definition of an HMM
 An HMM is specified by the following components:
  Q = q1 q2 ... qN: a set of N hidden states
  A = a11 a12 ... aNN: a transition probability matrix, each aij representing the probability of moving from state i to state j, with Σj aij = 1 for every i
  O = o1 o2 ... oT: a sequence of T observations, each drawn from a vocabulary V
  B = bi(ot): a sequence of observation likelihoods (emission probabilities), each expressing the probability of observation ot being generated from state i
  an initial probability distribution over states (equivalently, special start and end states q0 and qF that are not associated with observations)

11/13/2023

NLP

Fantahun B.(PhD)

40
POS Tagging: Hidden Markov Models
A first-order Hidden Markov Model makes two simplifying assumptions.
1) As with a first-order Markov chain, the probability of a particular state is dependent only on the previous state:
   P(qi | q1 ... qi−1) = P(qi | qi−1)
2) The probability of an output observation oi is dependent only on the state that produced the observation, qi, and not on any other states or any other observations:
   P(oi | q1 ... qi, o1 ... oi−1) = P(oi | qi)

Fig. 6.3 shows a sample HMM for the ice cream task. The two hidden states (H and C)
correspond to hot and cold weather, while the observations (drawn from the
alphabet O = {1,2,3}) correspond to the number of ice creams eaten by Jason on a
given day.
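
As a concrete sketch of these components, the ice-cream HMM can be written down as a pair of tables; the numbers below are illustrative placeholders, since the actual values of Fig. 6.3 are not reproduced in these slides:

```python
# Ice-cream HMM: hidden states, transition matrix A, emission matrix B,
# and initial distribution pi. All probabilities are illustrative placeholders.
states = ["HOT", "COLD"]
observations_vocab = [1, 2, 3]    # number of ice creams eaten in a day

pi = {"HOT": 0.8, "COLD": 0.2}    # P(q1)
A = {                             # P(q_t | q_{t-1})
    "HOT":  {"HOT": 0.6, "COLD": 0.4},
    "COLD": {"HOT": 0.5, "COLD": 0.5},
}
B = {                             # P(o_t | q_t)
    "HOT":  {1: 0.2, 2: 0.4, 3: 0.4},
    "COLD": {1: 0.5, 2: 0.4, 3: 0.1},
}
```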
11/13/2023

NLP

Fantahun B.(PhD)

41
POS Tagging: Hidden Markov Models

11/13/2023

NLP

Fantahun B.(PhD)

42
POS Tagging: Hidden Markov Models
 Notice that in the HMM in Fig. 6.3, there is a (non-zero) probability

of transitioning between any two states. Such an HMM is called a


fully-connected or ergodic HMM.

 Sometimes, however, we have HMMs in which many of the

transitions between states have zero probability.

 For example, in left-to-right ( Bakis HMMs ), the state transitions

proceed from left to right, as shown in Fig. 6.4.

 There are no transitions going from a higher-numbered state to a

lower-numbered state

 (or, more accurately, any transitions from a higher-numbered state

to a lower-numbered state have zero probability).

 Bakis HMMs are generally used to model temporal processes like

speech.

11/13/2023

NLP

Fantahun B.(PhD)

43
POS Tagging: Hidden Markov Models

11/13/2023

NLP

Fantahun B.(PhD)

44
POS Tagging: Hidden Markov Models
 Hidden Markov Models are characterized by three fundamental problems:
 Problem 1 (Likelihood): given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
 Problem 2 (Decoding): given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
 Problem 3 (Learning): given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

11/13/2023

NLP

Fantahun B.(PhD)

45
HMMs: Computing Likelihood: The Forward Algorithm

 For example, given the HMM in Fig. 6.2b, what is the probability of

the sequence 3 1 3?

 For a Markov chain, where the surface observations are the same as the hidden events, we could compute the probability of 3 1 3 just by following the states labeled 3 1 3 and multiplying the probabilities along the arcs.

 For an HMM, things are not so simple. We want to determine the probability of an ice-cream observation sequence like 3 1 3, but we don’t know what the hidden state sequence is!

11/13/2023

NLP

Fantahun B.(PhD)

46
HMMs: Computing Likelihood: The Forward Algorithm
 Simpler case: we already knew the weather, and wanted to

predict how much ice cream Jason would eat.

 First, recall that for Hidden Markov Models, each hidden state

produces only a single observation. Thus the sequence of hidden


states and the sequence of observations have the same length.

 Given this one-to-one mapping and the Markov assumptions expressed in Eq. 6.6, for a particular hidden state sequence Q = q0, q1, q2, ..., qT and an observation sequence O = o1, o2, ..., oT, the likelihood of the observation sequence is:

  P(O | Q) = Π(i=1..T) P(oi | qi)
11/13/2023

NLP

Fantahun B.(PhD)

47
HMMs: Computing Likelihood: The Forward Algorithm

 But of course, we don’t actually know what the hidden state

(weather) sequence was.

 We’ll need to compute the probability of ice-cream events 3 1 3

instead by summing over all possible weather sequences, weighted


by their probability.

11/13/2023

NLP

Fantahun B.(PhD)

48
HMMs: Computing Likelihood: The Forward Algorithm
 First, let’s compute the joint probability of being in a

particular weather sequence Q and generating a particular


sequence O of ice-cream events.

 In general, this is:

  P(O, Q) = P(O | Q) × P(Q) = Π(i=1..T) P(oi | qi) × Π(i=1..T) P(qi | qi−1)

 The computation of the joint probability of our ice-cream observation 3 1 3 and one possible hidden state sequence hot hot cold is as follows (Fig. 6.6 shows a graphic representation of this):

  P(3 1 3, hot hot cold) = P(hot|start) × P(hot|hot) × P(cold|hot) × P(3|hot) × P(1|hot) × P(3|cold)

11/13/2023

NLP

Fantahun B.(PhD)

49
HMMs: Computing Likelihood: The Forward Algorithm

11/13/2023

NLP

Fantahun B.(PhD)

50
HMMs: Computing Likelihood: The Forward Algorithm
 Now, we can compute the total probability of the observations just by summing over all possible hidden state sequences:

  P(O) = Σ(Q) P(O, Q) = Σ(Q) P(O | Q) P(Q)

 For our particular case, we would sum over the 8 (= 2³) three-event sequences:

  P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot)
           + P(3 1 3, hot hot cold) + ...      (6.13)

 What is the problem with this approach?


 For an HMM with N hidden states and an observation sequence of T observations, there are N^T possible hidden sequences. For real tasks, where N and T are both large, N^T is far too large a number to enumerate.

11/13/2023

NLP

Fantahun B.(PhD)

51
HMMs: Computing Likelihood: The Forward Algorithm
 Solution: we use the Forward Algorithm, which is efficient (O(N²T)).

 The forward algorithm is a kind of dynamic programming

algorithm, i.e., an algorithm that uses a table to store


intermediate values as it builds up the probability of the
observation sequence.

 The forward algorithm computes the observation probability

by summing over the probabilities of all possible hidden


state paths that could generate the observation sequence,
but it does so efficiently by implicitly folding each of these
paths into a single forward trellis.

11/13/2023

NLP

Fantahun B.(PhD)

52
HMMs: Computing Likelihood: The Forward Algorithm
 Each cell of the forward algorithm trellis, αt(j), represents the probability of being in state j after seeing the first t observations, given the automaton λ.

 The value of each cell αt(j) is computed by summing over the probabilities of every path that could lead us to this cell.

 Formally, each cell expresses the following probability:

  αt(j) = P(o1, o2, ..., ot, qt = j | λ)

 Fig. 6.7 shows an example of the forward trellis for computing the likelihood of the ice-cream observations 3 1 3.

11/13/2023

NLP

Fantahun B.(PhD)

53
HMMs: Computing Likelihood: The Forward Algorithm
[Figure 6.7: the forward trellis for the ice-cream observations 3 1 3]

11/13/2023

NLP

Fantahun B.(PhD)

54
HMMs: Computing Likelihood: The Forward Algorithm
 For a given state qj at time t, the value αt(j) is computed as:

  αt(j) = Σ(i=1..N) αt−1(i) aij bj(ot)      (6.15)

 The three factors that are multiplied in Eq. 6.15 in extending the previous paths to compute the forward probability at time t are:
  αt−1(i): the previous forward path probability from the previous time step
  aij: the transition probability from previous state qi to current state qj
  bj(ot): the state observation likelihood of the observation symbol ot given the current state j

11/13/2023

NLP

Fantahun B.(PhD)

55
HMMs: Computing Likelihood: The Forward Algorithm

11/13/2023

NLP

Fantahun B.(PhD)

56
HMMs: Computing Likelihood: The Forward Algorithm
 We give two formal definitions of the forward algorithm: the pseudocode (refer to Fig. 6.9) and a statement of the definitional recursion here:
 1. Initialization:  α1(j) = a0j bj(o1),  1 ≤ j ≤ N
 2. Recursion:       αt(j) = Σ(i=1..N) αt−1(i) aij bj(ot),  1 ≤ j ≤ N, 1 < t ≤ T
 3. Termination:     P(O|λ) = Σ(i=1..N) αT(i)
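
A runnable sketch of this forward recursion for the ice-cream task; the HMM parameters are illustrative placeholders, not the values of Fig. 6.3:

```python
# Forward algorithm: computes P(O | lambda) in O(N^2 * T) time via dynamic
# programming. Parameters are illustrative placeholders for the ice-cream HMM.
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}
A = {"HOT": {"HOT": 0.6, "COLD": 0.4}, "COLD": {"HOT": 0.5, "COLD": 0.5}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(observations):
    # alpha[t][j]: probability of the first t+1 observations, ending in state j
    alpha = [{j: pi[j] * B[j][observations[0]] for j in states}]     # initialization
    for o_t in observations[1:]:                                     # recursion
        prev = alpha[-1]
        alpha.append({
            j: sum(prev[i] * A[i][j] for i in states) * B[j][o_t]
            for j in states
        })
    return sum(alpha[-1][j] for j in states)                         # termination

print(forward([3, 1, 3]))   # total likelihood of the observation sequence 3 1 3
```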

11/13/2023

NLP

Fantahun B.(PhD)

57
HMMs: Decoding: The Viterbi Algorithm

 For any model, such as an HMM, that contains hidden variables,

the task of determining which sequence of variables is the


underlying source of some sequence of observations is called the
decoding task.

 In the ice cream domain, given a sequence of ice cream

observations 3 1 3 and an HMM, the task of the decoder is to find


the best hidden weather sequence (H H H).

11/13/2023

NLP

Fantahun B.(PhD)

58
HMMs: Decoding: The Viterbi Algorithm
 We might propose to find the best sequence as follows:
1. for each possible hidden state sequence (HHH, HHC, HCH, etc.),

we could run the forward algorithm and compute the likelihood of


the observation sequence given that hidden state sequence.

2. then we could choose the hidden state sequence with the max

observation likelihood.

 Problem: exponentially large number of state sequences!


 Solution: the most common decoding algorithms for HMMs, the

Viterbi Algorithm.

 Like the forward algorithm, Viterbi is a kind of dynamic

programming, and makes uses of a dynamic programming trellis.


Viterbi also strongly resembles another dynamic programming
variant, the minimum edit distance algorithm.

11/13/2023

NLP

Fantahun B.(PhD)

59
HMMs: Decoding: The Viterbi Algorithm
 Fig. 6.10 shows an example of the Viterbi trellis for computing the

best hidden state sequence for the observation sequence 3 1 3.

 The idea is to process the observation sequence left to right, filling

out the trellis. Each cell of the Viterbi trellis, vt(j) represents the
probability that the HMM is in state j after seeing the first t
observations and passing through the most probable state
sequence q0,q1, ...,qt−1, given the automaton λ .

 The value of each cell vt(j) is computed by recursively taking the

most probable path that could lead us to this cell. Formally, each cell expresses the following probability:

  vt(j) = max(q0,q1,...,qt−1) P(q0, q1, ..., qt−1, o1, o2, ..., ot, qt = j | λ)
 Note that we represent the most probable path by taking the

maximum over all possible previous state sequences.

11/13/2023

NLP

Fantahun B.(PhD)

60
HMMs: Decoding: The Viterbi Algorithm
[Figure 6.10: the Viterbi trellis for computing the best hidden state sequence for the observations 3 1 3]

11/13/2023

NLP

Fantahun B.(PhD)

61
HMMs: Decoding: The Viterbi Algorithm
 For a given state qj at time t, the value vt(j) is computed as:

  vt(j) = max(i=1..N) vt−1(i) aij bj(ot)      (6.20)

 The three factors that are multiplied in Eq. 6.20 for extending the previous paths to compute the Viterbi probability at time t are:
  vt−1(i): the previous Viterbi path probability from the previous time step
  aij: the transition probability from previous state qi to current state qj
  bj(ot): the state observation likelihood of the observation symbol ot given the current state j

11/13/2023

NLP

Fantahun B.(PhD)

62
HMMs: Decoding: The Viterbi Algorithm
 Note that the Viterbi algorithm is identical to the forward algorithm

except that it takes the max over the previous path probabilities
where the forward algorithm takes the sum.

 Note also that the Viterbi algorithm has one component that the

forward algorithm doesn’t have: backpointers.

 This is because while the forward algorithm needs to produce an

observation likelihood, the Viterbi algorithm must produce a


probability and also the most likely state sequence.

 We compute this best state sequence by keeping track of the path

of hidden states that led to each state, as suggested in Fig. 6.12,


and then at the end tracing back the best path to the beginning
(the Viterbi backtrace ).

11/13/2023

NLP

Fantahun B.(PhD)

63
HMMs: Decoding: The Viterbi Algorithm
 Finally, we can give a formal definition of the Viterbi recursion as follows:
 1. Initialization:  v1(j) = a0j bj(o1);  bt1(j) = 0
 2. Recursion:       vt(j) = max(i=1..N) vt−1(i) aij bj(ot);  btt(j) = argmax(i=1..N) vt−1(i) aij bj(ot)
 3. Termination:     the best path probability is max(i=1..N) vT(i); the best state sequence is recovered by following the backpointers btt(j) back from the best final state.
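
A runnable sketch of the Viterbi recursion with backpointers, again using illustrative placeholder parameters for the ice-cream HMM:

```python
# Viterbi decoding: most likely hidden state sequence for an observation
# sequence. Parameters are illustrative placeholders for the ice-cream HMM.
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}
A = {"HOT": {"HOT": 0.6, "COLD": 0.4}, "COLD": {"HOT": 0.5, "COLD": 0.5}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(observations):
    v = [{j: pi[j] * B[j][observations[0]] for j in states}]   # initialization
    backpointer = [{j: None for j in states}]
    for o_t in observations[1:]:                                # recursion: max instead of sum
        prev = v[-1]
        col, bp = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * A[i][j])
            col[j] = prev[best_i] * A[best_i][j] * B[j][o_t]
            bp[j] = best_i
        v.append(col)
        backpointer.append(bp)
    last = max(states, key=lambda j: v[-1][j])                  # termination
    path = [last]
    for bp in reversed(backpointer[1:]):                        # follow the backpointers
        path.append(bp[path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 3]))   # best hidden state path and its Viterbi probability
```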

11/13/2023

NLP

Fantahun B.(PhD)

64
Training HMMs: The Forward-Backward Algorithm

 Problem: the third problem for HMMs:


o learning the parameters of an HMM, i.e., the A and B matrices.
 Input:
o an unlabeled sequence of observations O and a vocabulary of

potential hidden states Q.


o Thus for the ice cream task, we would start with a sequence of
observations O = {1,3,2, ...,}, and the set of hidden states H and C.
o For the POS tagging task we would start with a sequence of
observations O = {w1,w2,w3 . . .} and a set of hidden states NN, NNS,
VBD, IN,... And so on.
 Algorithm: the forward-backward or Baum-Welch algorithm (Baum, 1972),
a special case of the Expectation-Maximization or EM algorithm (Dempster
et al., 1977).
11/13/2023

NLP

Fantahun B.(PhD)

66
Training HMMs: The Forward-Backward Algorithm
 Simpler case of training a Markov chain rather than HMM:
 Since the states in a Markov chain are observed, we can run the

model on the observation sequence and directly see which path


we took through the model, and which state generated each
observation symbol.

 A Markov chain of course has no emission probabilities B

(alternatively we could view a Markov chain as a degenerate


Hidden Markov Model where all the b probabilities are 1.0 for the
observed symbol and 0 for all other symbols.).

 Thus the only probabilities we need to train are the transition

probability matrix A.

11/13/2023

NLP

Fantahun B.(PhD)

67
Training HMMs: The Forward-Backward Algorithm
 We get the maximum likelihood estimate of the probability aij of a particular transition between states i and j by counting the number of times the transition was taken, which we could call C(i → j), and then normalizing by the total count of all times we took any transition from state i:

  aij = C(i → j) / Σ(q∈Q) C(i → q)      (6.27)

 We can directly compute this probability in a Markov chain because

we know which states we were in.
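
For the fully observed (Markov chain) case, Eq. 6.27 is just relative-frequency counting; a minimal sketch over a toy observed state sequence (the data here is made up for illustration):

```python
# Maximum likelihood estimate of transition probabilities from an observed
# state sequence: a_ij = C(i -> j) / sum over q of C(i -> q). Toy data only.
from collections import Counter, defaultdict

observed_states = ["HOT", "HOT", "COLD", "COLD", "HOT", "COLD", "COLD"]

counts = defaultdict(Counter)
for prev, curr in zip(observed_states, observed_states[1:]):
    counts[prev][curr] += 1

A_hat = {
    i: {j: c / sum(row.values()) for j, c in row.items()}
    for i, row in counts.items()
}
print(A_hat)   # e.g. {'HOT': {'HOT': ..., 'COLD': ...}, 'COLD': {...}}
```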

 For an HMM we cannot compute these counts directly from an

observation sequence since we don’t know which path of states was


taken through the machine for a given input.

 The Baum-Welch algorithm uses two neat intuitions to solve this

problem.

11/13/2023

NLP

Fantahun B.(PhD)

68
Training HMMs: The Forward-Backward Algorithm
 The Baum-Welch algorithm uses two neat intuitions to solve this

problem.

1) iteratively estimate the counts.


o We will start with an estimate for the transition and observation

probabilities, and then use these estimated probabilities to derive better


and better probabilities.

2) Get estimated probabilities by computing the forward probability for

an observation and then dividing that probability mass among all the
different paths that contributed to this forward probability.

 In order to understand the algorithm, we need to define a useful

probability related to the forward probability, called the backward


probability.

11/13/2023

NLP

Fantahun B.(PhD)

69
Training HMMs: The Forward-Backward Algorithm
 The backward probability β is the probability of seeing the observations from time t+1 to the end, given that we are in state i at time t (and of course given the automaton λ):

  βt(i) = P(ot+1, ot+2, ..., oT | qt = i, λ)
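
A sketch of the backward pass, mirroring the forward code shown earlier (placeholder parameters again, since the slides do not reproduce the actual values):

```python
# Backward probabilities: beta[t][i] = P(o_{t+1}, ..., o_T | q_t = i).
# Placeholder parameters for the ice-cream HMM.
states = ["HOT", "COLD"]
A = {"HOT": {"HOT": 0.6, "COLD": 0.4}, "COLD": {"HOT": 0.5, "COLD": 0.5}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def backward(observations):
    beta = [{i: 1.0 for i in states}]             # initialization at the final time step
    for o_next in reversed(observations[1:]):     # recursion, from t = T-1 down to 1
        nxt = beta[0]
        beta.insert(0, {
            i: sum(A[i][j] * B[j][o_next] * nxt[j] for j in states)
            for i in states
        })
    return beta                                   # beta[t][i] for t = 0 .. T-1

print(backward([3, 1, 3])[0])
```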

11/13/2023

NLP

Fantahun B.(PhD)

70
Training HMMs: The Forward-Backward Algorithm

11/13/2023

NLP

Fantahun B.(PhD)

71
Training HMMs: The Forward-Backward Algorithm
 We are now ready to understand how the forward and backward

probabilities can help us compute the transition probability aij and


observation probability bi(ot) from an observation sequence, even
though the actual path taken through the machine is hidden.

 How do we compute the numerator?


 Consult your textbook

11/13/2023

NLP

Fantahun B.(PhD)

72
POS Tagging: Hidden Markov Models
Sources of information in Tagging
 tags of other words in the context of the word we are interested

in.

 Syntagmatic structural information


 Not very successful,
• eg. Greene and Rubin (1971), an early deterministic rule-based tagger that

used such information about syntagmatic patterns correctly tagged only 77%
of words.

 Just knowing the word involved gives a lot of information about

the correct tag

 Charniak et al. (1993) showed that a `dumb' tagger that simply assigns the most common tag to each word performs at the surprisingly high level of 90% correct (a minimal sketch of this baseline follows below).
 As a result, the performance of such a `dumb' tagger has been used to give a baseline performance level in subsequent studies.
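
A minimal sketch of this most-frequent-tag baseline, trained from a toy tagged corpus (the data below is made up for illustration):

```python
# Most-frequent-tag baseline ("dumb" tagger): tag every word with the tag it
# was seen with most often in training. Toy training data for illustration.
from collections import Counter, defaultdict

training = [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
            ("to", "TO"), ("race", "VB"), ("the", "DT"), ("book", "NN")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def baseline_tag(word, default="NN"):
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else default

print([(w, baseline_tag(w)) for w in "the race to race".split()])
```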
11/13/2023

NLP

Fantahun B.(PhD)

73
POS Tagging: Hidden Markov Models
Sources of information in Tagging
 And all modern taggers in some way make use of a

combination of

 syntagmatic information (looking at information about tag

sequences) and
 lexical information (predicting a tag based on the word
concerned).

11/13/2023

NLP

Fantahun B.(PhD)

74
POS Tagging: Hidden Markov Models
Computing the most-likely tag sequence
 HMM tagging algorithm chooses as the most likely tag

sequence the one that maximizes the product of two terms;

 the probability of the sequence of tags, and


 the probability of each tag generating a word.

 For this example, we will use the 87-tag Brown corpus tagset,

because it has a specific tag for to, TO, used only when to is an
infinitive; prepositional uses of to are tagged as IN.

 Example:
(5.36) Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow/NR
(5.37) People/NNS continue/VB to/TO inquire/VB the/AT reason/NN for/IN
the/AT race/NN for/IN outer/JJ space/NN
11/13/2023

NLP

Fantahun B.(PhD)

75
POS Tagging: Hidden Markov Models
Computing the most-likely tag sequence
 Let’s look at how race can be correctly tagged as a VB instead

of an NN in (5.36).

11/13/2023

NLP

Fantahun B.(PhD)

76
POS Tagging: Hidden Markov Models
Computing the most-likely tag sequence
 Almost all the probabilities in these two sequences are identical; in

Fig. 5.12 we have highlighted in boldface the three probabilities that


differ. Let’s consider two of these, corresponding to P(ti|ti−1) and
P(wi|ti) .
Fig. 5.12a P(ti|ti−1) = P(VB|TO),
Fig. 5.12b P(ti|ti−1) = P(NN|TO).

 The tag transition probabilities P(NN|TO) and P(VB|TO) give us the

answer to the question “How likely are we to expect a verb (noun)


given the previous tag?”

 A look at the (87-tag) Brown corpus gives us the following probabilities, showing that verbs are far more likely than nouns to occur after TO:

  P(NN|TO) = .00047
  P(VB|TO) = .83

11/13/2023

NLP

Fantahun B.(PhD)

77
POS Tagging: Hidden Markov Models
Computing the most-likely tag sequence
 Let’s now turn to P(wi|ti), the lexical likelihood of the word race given

a part-of-speech tag.

 For the two possible tags VB and NN, these correspond to the

probabilities: P(race|VB) and P(race|NN).

 Here are the lexical likelihoods from Brown:

P(race|NN) = .00057
P(race|VB) = .00012
 Finally, we need to represent the tag sequence probability for the

following tag (in this case the tag NR for tomorrow):


P(NR|VB) = .0027
P(NR|NN) = .0012

11/13/2023

NLP

Fantahun B.(PhD)

78
POS Tagging: Hidden Markov Models
Computing the most-likely tag sequence
 If we multiply the lexical likelihoods with the tag sequence

probabilities, we see that the probability of the sequence with


the VB tag is higher.
 Hence, the HMM tagger correctly tags race as a VB in Fig. 5.12
despite the fact that it is the less likely sense of race:
P(VB|TO)P(NR|VB)P(race|VB) = .00000027
P(NN|TO)P(NR|NN)P(race|NN) = .00000000032
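
These two products can be checked directly from the Brown-corpus estimates quoted above:

```python
# Comparing the two candidate tags for "race" using the Brown-corpus
# estimates quoted in the slides.
p_vb = 0.83    * 0.0027 * 0.00012   # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"VB path: {p_vb:.2e}")       # ~2.7e-07
print(f"NN path: {p_nn:.2e}")       # ~3.2e-10
print("Chosen tag:", "VB" if p_vb > p_nn else "NN")
```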

11/13/2023

NLP

Fantahun B.(PhD)

79
POS Tagging: Algorithms - Transformation-Based Tagging
 Transformation-Based Tagging, sometimes called Brill tagging,

is an instance of the Transformation-Based Learning (TBL)


approach to machine learning (Brill, 1995), and draws
inspiration from both the rule-based and stochastic taggers.
 Like the rule-based taggers, TBL is based on rules that specify what tags should be assigned to what words.


 Like the stochastic taggers, TBL is a machine learning

technique, in which rules are automatically induced from


the data.
 Like some but not all of the HMM taggers, TBL is a supervised

learning technique; it assumes a pre-tagged training


corpus.
11/13/2023

NLP

Fantahun B.(PhD)

80
POS Tagging: Algorithms - Transformation-Based Tagging
 Imagine an artist painting a picture of a white house with green trim

against a blue sky. Suppose most of the picture was sky, and hence
most of the picture was blue.
 The artist might begin by using a very broad brush and painting the

entire canvas blue.


 Next she might switch to a somewhat smaller white brush, and paint the

entire house white. She would just color in the whole house, not worrying
about the brown roof, or the blue windows or the green gables.
 Next she takes a smaller brown brush and colors over the roof.
 Now she takes up the blue paint on a small brush and paints in the blue

windows on the house.


 Finally she takes a very fine green brush and does the trim on the gables.

11/13/2023

NLP

Fantahun B.(PhD)

81
POS Tagging: Algorithms - Transformation-Based Tagging
 The painter starts with a broad brush that covers a lot of the canvas but colors a lot of areas that will have to be repainted. The next layer colors less of the canvas but also makes fewer “mistakes”. Each new layer uses a finer brush that corrects less of the picture but makes fewer mistakes.
 TBL uses somewhat the same method as this painter.
 The TBL algorithm has a set of tagging rules. A corpus is first tagged

using the broadest rule, that is, the one that applies to the most
cases. Then a slightly more specific rule is chosen, which changes
some of the original tags. Next an even narrower rule, which
changes a smaller number of tags (some of which might be
previously changed tags).
11/13/2023

NLP

Fantahun B.(PhD)

82
POS Tagging: Transformation-Based Tagging
How TBL rules are applied
 Let’s look at one of the rules used by Brill’s (1995) tagger. Before

the rules apply, the tagger labels every word with its most-likely
tag. We get these most-likely tags from a tagged corpus. For
example, in the Brown corpus, race is most likely to be a noun:
P(NN|race) = .98
P(VB|race) = .02
 This means that the two examples of race that we saw above will

both be coded as NN.

11/13/2023

NLP

Fantahun B.(PhD)

83
POS Tagging: Transformation-Based Tagging
How TBL rules are applied
 In the first case, this is a mistake, as NN is the incorrect tag:
  (5.36) Secretariat/NNP is/BEZ expected/VBN to/TO race/NN tomorrow/NR
 In the second case, this race is correctly tagged as an NN:
  (5.37) People/NNS continue/VB to/TO inquire/VB the/AT reason/NN for/IN the/AT race/NN for/IN outer/JJ space/NN

11/13/2023

NLP

Fantahun B.(PhD)

84
POS Tagging: Transformation-Based Tagging
How TBL rules are applied
 After selecting the most-likely tag, Brill’s tagger applies its

transformation rules.
 As it happens, Brill’s tagger learned a rule that applies exactly

to this mistagging of race:


Change NN to VB when the previous tag is TO
 This rule would change race/NN to race/VB in exactly the

following situation, since it is preceded by to/TO.
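
A minimal sketch of applying this single Brill transformation to a most-likely-tag output (the sentence and initial tags below are illustrative):

```python
# Apply one Brill transformation: "Change NN to VB when the previous tag is TO".
# The initial tags come from a most-likely-tag pass (illustrative here).
tagged = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NR")]

def apply_rule(tags, from_tag="NN", to_tag="VB", trigger_prev="TO"):
    out = list(tags)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == trigger_prev:
            out[i] = (word, to_tag)
    return out

print(apply_rule(tagged))   # race/NN becomes race/VB because it follows to/TO
```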

11/13/2023

NLP

Fantahun B.(PhD)

85
POS Tagging: Transformation-Based Tagging
How TBL rules are learned
 Brill’s TBL algorithm has three major stages.
1. It labels every word with its most-likely tag.
2. It examines every possible transformation, and selects the one

that results in the most improved tagging.


3. It re-tags the data according to this rule.
 The last two stages are repeated until some stopping criterion is

reached, such as insufficient improvement over the previous pass.


 Note that stage two requires that TBL knows the correct tag of

each word; that is, TBL is a supervised learning algorithm.


11/13/2023

NLP

Fantahun B.(PhD)

86
POS Tagging: Transformation-Based Tagging
How TBL rules are learned
 The output of the TBL process is an ordered list of transformations;

these then constitute a “tagging procedure” that can be applied


to a new corpus.

 In principle the set of possible transformations is infinite, since we

could imagine transformations such as “transform NN to VB if the


previous word was “IBM” and the word “the” occurs between 17
and 158 words before that”.

 But TBL needs to consider every possible transformation, in order to

pick the best one on each pass through the algorithm. Thus the
algorithm needs a way to limit the set of transformations. This is
done by designing a small set of templates (abstracted
transformations). Every allowable transformation is an instantiation
of one of the templates.

11/13/2023

NLP

Fantahun B.(PhD)

87
POS Tagging: Transformation-Based Tagging
How TBL rules are learned
 Brill’s set of templates is listed in Fig. 5.20. Fig. 5.21 gives the details

of this algorithm for learning transformations. (refer on page 169)

11/13/2023

NLP

Fantahun B.(PhD)

89
POS Tagging: Maximum Entropy Models
 A second probabilistic machine learning framework called

Maximum Entropy modeling, MaxEnt for short.

 MaxEnt is more widely known as multinomial logistic regression.


 Our goal in this chapter is to introduce the use of MaxEnt for

sequence classification.

 Recall that the task of sequence classification or sequence

labelling is to assign a label to each element in some sequence,


such as assigning a part-of-speech tag to a word.

 The most common MaxEnt sequence classifier is the Maximum

Entropy Markov Model or MEMM, to be introduced in Sec. 6.8.


But before we see this use of MaxEnt as a sequence classifier,
we need to introduce non-sequential classification.

11/13/2023

NLP

Fantahun B.(PhD)

90
POS Tagging: Maximum Entropy Models
 The task of classification is to take a single observation, extract some

useful features describing the observation, and then based on these


features, to classify the observation into one of a set of discrete
classes.
 A probabilistic classifier does slightly more than this; in addition to
assigning a label or class, it gives the probability of the observation
being in that class; indeed, for a given observation a probabilistic
classifier gives a probability distribution over all classes.
 Such non-sequential classification tasks occur throughout speech and
language processing.
 text classification (spam/ham)
 sentiment analysis (positive/negative opinion).
 sentence boundaries (a period (‘.’) as either a sentence boundary or

not).

11/13/2023

NLP

Fantahun B.(PhD)

91
POS Tagging: Maximum Entropy Models
 MaxEnt belongs to the family of classifiers known as the exponential or

log-linear classifiers.

 MaxEnt works by extracting some set of features from the input,

combining them linearly (meaning that we multiply each by a weight


and then add them up), and then, for reasons we will see below,
using this sum as an exponent.

 Let’s flesh out this intuition just a bit more. Assume that we have some

input x (perhaps it is a word that needs to be tagged, or a document


that needs to be classified) from which we extract some features. A
feature for tagging might be this word ends in -ing or the previous
word was ‘the’. For each such feature fi, we have some weight wi.

11/13/2023

NLP

Fantahun B.(PhD)

92
POS Tagging: Maximum Entropy Models
 Given the features and weights, our goal is to choose a class (for example a POS tag) for the word. MaxEnt does this by choosing the most probable class; the probability of a particular class c given the observation x is:

  p(c|x) = (1/Z) exp( Σi wi fi )

 Here Z is a normalizing factor, used to make the probabilities correctly sum to 1; and as usual exp(x) = e^x.

11/13/2023

NLP

Fantahun B.(PhD)

93
POS Tagging: Maximum Entropy Models
 Multinomial logistic regression is called MaxEnt in speech and

language processing (see Sec. 6.7.1 on the intuition behind the


name ‘maximum entropy’)
 Adding some details to this equation, first we’ll flesh out the normalization factor Z, specify the number of features as N, and make the value of the weight dependent on the class c. The final equation is:

  p(c|x) = exp( Σ(i=1..N) wci fi ) / Σ(c′∈C) exp( Σ(i=1..N) wc′i fi )

 Note that the normalization factor Z is just used to make the

exponential into a true probability;

11/13/2023

NLP

Fantahun B.(PhD)

94
POS Tagging: Maximum Entropy Models
 We need to make one more change to see the final MaxEnt equation.

So far we’ve been assuming that the features fi are real-valued. It is


more common in speech and language processing, however, to use
binary-valued features. A feature that only takes on the values 0 and 1 is
also called an indicator function.
 In general, the features we use are indicator functions of some property
of the observation and the class we are considering assigning. Thus in
MaxEnt, instead of the notation fi, we will often use the notation fi(c,x),
meaning a feature i for a particular class c for a given observation x.
 The final equation for computing the probability of y being of class c given x in MaxEnt is:

  p(c|x) = exp( Σ(i=1..N) wci fi(c, x) ) / Σ(c′∈C) exp( Σ(i=1..N) wc′i fi(c′, x) )

11/13/2023

NLP

Fantahun B.(PhD)

95
POS Tagging: Maximum Entropy Models
 Example features: POS

(6.81) Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/


 We are doing classification, not sequence classification
 We would like to know whether to assign the class VB to race (or instead

assign some other class like NN).


 One useful feature, we’ll call it f1, would be the fact that the current
word is race. We can thus add a binary feature which is true if this is the
case:
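
The textbook’s exact feature definitions are not reproduced in these slides; the indicator functions below are an illustrative sketch of what such binary features look like in code (the class pairings are assumptions, not the original figures):

```python
# Illustrative MaxEnt indicator features f_i(c, x): each returns 1 or 0.
# x is a small dict describing the observation; the class pairings are
# assumptions for illustration, not the textbook's exact features.
def f1(c, x):   # current word is "race" and the candidate class is VB
    return 1 if x["word"] == "race" and c == "VB" else 0

def f2(c, x):   # current word ends in -ing and the candidate class is VBG
    return 1 if x["word"].endswith("ing") and c == "VBG" else 0

def f3(c, x):   # previous tag is TO and the candidate class is VB
    return 1 if x["prev_tag"] == "TO" and c == "VB" else 0

x = {"word": "race", "prev_tag": "TO"}
print([f(c, x) for f in (f1, f2, f3) for c in ("VB", "NN")])
```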

11/13/2023

NLP

Fantahun B.(PhD)

96
POS Tagging: Maximum Entropy Models
 Two more part-of-speech tagging features might focus on aspects of a

word’s spelling and case:

11/13/2023

NLP

Fantahun B.(PhD)

97
POS Tagging: Maximum Entropy Models
 Since each feature is dependent on both a property of the observation

and the class being labeled, we would need to have separate feature
for, e.g, the link between race and VB, or the link between a previous
TO and NN:

11/13/2023

NLP

Fantahun B.(PhD)

98
POS Tagging: Maximum Entropy Models
 Each of these features has a corresponding weight. Thus the weight

w1(c,x) would indicate how strong a cue the word race is for the tag VB,
the weight w2(c,x) would indicate how strong a cue the previous tag TO
is for the current word being a VB, and so on.

11/13/2023

NLP

Fantahun B.(PhD)

99
POS Tagging: Maximum Entropy Models
 Let’s assume that the feature weights for the two classes VB and NN are as shown in Fig. 6.19.


 Let’s call the current input observation (where the current word is
race) x.
 We can now compute P(NN|x) and P(VB|x), using Eq. 6.80
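
Fig. 6.19’s weights are not reproduced in these slides, so the weights and feature values below are hypothetical; the sketch only shows how the normalized exponential is computed for the two classes:

```python
# MaxEnt classification for the word "race": P(c|x) is a normalized exponential
# of a weighted feature sum. Weights and feature values are hypothetical,
# since Fig. 6.19 is not reproduced in the slides.
import math

weights = {                       # hypothetical w_{c,i} for three features
    "VB": [0.8, 0.01, 0.1],
    "NN": [0.8, -0.01, 0.0],
}
features = {                      # hypothetical f_i(c, x) values for this x
    "VB": [1, 1, 0],
    "NN": [1, 0, 1],
}

def maxent_probs(features, weights):
    scores = {c: math.exp(sum(w * f for w, f in zip(weights[c], features[c])))
              for c in weights}
    z = sum(scores.values())      # normalizing factor Z
    return {c: s / z for c, s in scores.items()}

print(maxent_probs(features, weights))   # probability distribution over {VB, NN}
```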

11/13/2023

NLP

Fantahun B.(PhD)

100
POS Tagging: Maximum Entropy Models
 Notice that when we use MaxEnt to perform classification, MaxEnt

naturally gives us a probability distribution over the classes.


 If we want to do a hard-classification and choose the single-best
class, we can choose the class that has the highest probability, i.e.:

 Classification in MaxEnt is thus a generalization of classification in

(boolean) logistic regression.


 In boolean logistic regression, classification involves building one
linear expression which separates the observations in the class from
the observations not in the class.
 Classification in MaxEnt, by contrast, involves building a separate
linear expression for each of C classes.
11/13/2023

NLP

Fantahun B.(PhD)

101
POS Tagging: Maximum Entropy Models
 But as we’ll see later in Sec. 6.8, we generally don’t use MaxEnt for

hard classification.

 Usually we want to use MaxEnt as part of sequence classification,

where we want not the best single class for one unit, but the best
total sequence.

 For this task, it’s useful to exploit the entire probability distribution for

each individual unit, to help find the best sequence.

 Indeed even in many non-sequence applications a probability

distribution over the classes is more useful than a hard choice.

11/13/2023

NLP

Fantahun B.(PhD)

102
POS Tagging: Maximum Entropy Models
 The features we have described so far express a single binary

property of an observation.

 But it is often useful to create more complex features that express

combinations of properties of a word. Some kinds of machine


learning models, like Support Vector Machines (SVMs), can
automatically model the interactions between primitive properties,
but in MaxEnt any kind of complex feature has to be defined by
hand.

For example a word starting with a capital letter (eg. Day) is more likely to be a
proper noun (NNP) than a common noun (eg. in United Nations Day). But a word
which is capitalized but which occurs at the beginning of the sentence (the
previous word is <s>), as in Day after day...., is not more likely to be a proper
noun.

 Even if each of these properties were already a primitive feature,

MaxEnt would not model their combination, so this boolean


combination of properties would need to be encoded as a feature by
hand:

11/13/2023

NLP

Fantahun B.(PhD)

103
POS Tagging: Maximum Entropy Models
 A key to successful use of MaxEnt is thus the design of appropriate

features and feature combinations.

11/13/2023

NLP

Fantahun B.(PhD)

104
POS Tagging: Maximum Entropy Models (Learning)
 Learning a MaxEnt model can be done via a generalization of the logistic regression learning algorithms described in Sec. 6.6.4; as we saw in (6.73), we want to find the parameters w which maximize the log likelihood of the M training samples:

  ŵ = argmax(w) Σ(j=1..M) log P(y(j) | x(j))

 As with binary logistic regression, we use some convex optimization algorithm to find the weights which maximize this function.

 Regularized version:

  ŵ = argmax(w) Σ(j=1..M) log P(y(j) | x(j)) − α Σ(i) wi²
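
In practice this optimization is rarely hand-rolled; a hedged sketch using scikit-learn’s multinomial logistic regression (assumes scikit-learn is installed; the feature vectors and labels below are toy stand-ins for real tagging features):

```python
# Training a MaxEnt (multinomial logistic regression) classifier with
# L2 regularization via scikit-learn. Features and labels are toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a binary feature vector f(x); each label is a POS class.
X = np.array([[1, 1, 0],    # e.g. word=race, previous tag TO
              [1, 0, 1],    # e.g. word=race, previous tag DT
              [0, 1, 0],
              [0, 0, 1]])
y = np.array(["VB", "NN", "VB", "NN"])

clf = LogisticRegression(C=1.0, max_iter=1000)   # C controls L2 regularization strength
clf.fit(X, y)
print(clf.classes_)
print(clf.predict_proba([[1, 1, 0]]))            # probability distribution over classes
```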

11/13/2023

NLP

Fantahun B.(PhD)

105
POS Tagging
Bibliography
 D. Jurafsky and J. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd edition).
 C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing.

11/13/2023

NLP

Fantahun B.(PhD)

106
