Morphology parsing
Informatics 2A: Lecture 15
Shay Cohen
School of Informatics
University of Edinburgh
[email protected]
22 October 2018
Formal languages
Natural languages
Essential epistemology
               Exact sciences        Empirical sciences       Engineering
Deals in ...   Axioms and theorems   Facts and theories       Artifacts
Truth is ...   Forever               Temporary                It works!
Examples ...   Maths, CS theory      Physics, Biology,        Applied CS
                                     Linguistics              and NLP
Morphology parsing: the problem
Finite-state transducers
FSTs for morphology parsing and generation
(This lecture is based on Jurafsky & Martin chapter 3, sections
1–7.)
Morphology in other languages
Morphology is the study of the structure of words.
English has relatively impoverished morphology.
Languages with rich morphology: Turkish, Arabic, Hungarian, Korean, and many more. (In fact, English is rather unusual in having such simple morphology.)
For example, Turkish is an agglutinative language, and words are
constructed by concatenating morphemes together without
changing them much:
evlerinizden: “from your houses” (morphemes: ev-ler-iniz-den)
This lecture will mostly discuss how to build an English
morphological analyser.
Morphology
Stems and affixes (prefixes, suffixes, infixes and circumfixes)
combine together
Four ways of combining them:
▶ Inflection (stem + grammatical affix): the word does not change its grammatical class (walk → walking)
▶ Derivation (stem + grammatical affix): the word changes its grammatical class (computerize → computerization)
▶ Compounding (stems combined): doghouse
▶ Cliticization: I’ve, we’re, he’s
Morphology can be concatenative or non-concatenative (e.g.
templatic morphology as in Arabic)
WALS - Suffixing versus Prefixing
Examples: Inflection in different languages
In English, nouns are inflected for number, while verbs are inflected for person and tense.
Number: book / books
Person: you read / she reads
Tense: we talk / we talked
In German, nouns are inflected for number and case, e.g. house becomes:
             Singular               Plural
Nominative   das Haus               die Häuser
Genitive     des Hauses             der Häuser
Dative       dem Haus / dem Hause   den Häusern
Accusative   das Haus               die Häuser
In Spanish, inflection depends on gender, e.g. el sol / la luna.
In Luganda, nouns have ten genders, each with different inflections for singular/plural!
Examples: Agglutination and compounding
ostoskeskuksessa
ostos#keskus+N+Sg+Loc:in
shopping#center+N+Sg+Loc:in
in the shopping center (Finnish)
qangatasuukkuvimmuuriaqalaaqtunga
“I’ll have to go to the airport” (Inuktitut)
Avrupalılaştıramadıklarımızdanmışsınızcasına
“as if you are reportedly of those of ours that we were unable to
Europeanize” (Turkish)
In the most extreme examples, the meaning of the word is the
meaning of a sentence!
Morphological parsing: the problem
English has concatenative morphology. Words can be made up of a
main stem (carrying the basic dictionary meaning) plus one or
more affixes carrying grammatical information. E.g.:
Surface form: cats walking smoothest
Lexical form: cat+N+PL walk+V+PresPart smooth+Adj+Sup
Morphological parsing is the problem of extracting the lexical form
from the surface form. (For speech processing, this includes
identifying the word boundaries.)
We should take account of:
▶ Irregular forms (e.g. goose → geese)
▶ Systematic rules (e.g. ‘e’ inserted before suffix ‘s’ after s,x,z,ch,sh: fox → foxes, watch → watches)
Why bother?
▶ Any NLP task involving grammatical parsing will typically involve morphology parsing as a prerequisite.
▶ Search engines: e.g. a search for ‘fox’ should return documents containing ‘foxes’, and vice versa.
▶ Even a humble task like spell checking can benefit: e.g. is ‘walking’ a possible word form?
But why not just list all derived forms separately in our wordlist
(e.g. walk, walks, walked, walking)?
▶ Might be OK for English, but not for a morphologically rich language: in Turkish, one can pile up to 10 suffixes on a verb stem, leading to 40,000 possible forms for some verbs!
▶ Even for English, morphological parsing makes adding/learning new words easier.
▶ In speech processing, word breaks aren’t known in advance.
How expressive is morphology?
Morphemes are tacked together in a rather “regular” way.
This means that finite-state machines are a good way to model
morphology. There is no need for “unbounded memory” to model
it (there are no long range dependencies).
This is as opposed to syntax, the study of the order of words in a sentence, which we will learn about in a later lecture.
Parsing and generation
Parsing here means going from the surface to the lexical form.
E.g. foxes → fox +N +PL.
Generation is the opposite process: fox +N +PL → foxes. It’s
helpful to consider these two processes together.
Either way, it’s often useful to proceed via an intermediate form,
corresponding to an analysis in terms of morphemes (= minimal
meaningful units) before orthographic rules are applied.
Surface form: foxes
Intermediate form: foxˆs#
Lexical form: fox +N +PL
(ˆ marks a morpheme boundary, # a word boundary.)
N.B. The translation between surface and intermediate form is
exactly the same if ‘foxes’ is a 3rd person singular verb!
Finite-state transducers
We can consider ε-NFAs (over an alphabet Σ) in which transitions may also (optionally) produce output symbols (over a possibly different alphabet Π).
E.g. consider the following machine with input alphabet {a, b} and
output alphabet {0, 1}:
[Diagram: an FST whose transitions are labelled a:0, a:1 and b:ε.]
Such a thing is called a finite state transducer.
In effect, it specifies a (possibly multi-valued) translation from one
regular language to another.
Quick exercise
(Same FST as on the previous slide.)
What output will this produce, given the input aabaaabbab?
1. 001110
2. 001111
3. 0011101
4. More than one output is possible.
Formal definition
Formally, a finite state transducer T with inputs from Σ and
outputs from Π consists of:
▶ sets Q, S, F as in ordinary NFAs,
▶ a transition relation ∆ ⊆ Q × (Σ ∪ {ε}) × (Π ∪ {ε}) × Q.
From this, one can define a many-step transition relation ∆ˆ ⊆ Q × Σ∗ × Π∗ × Q, where (q, x, y, q′) ∈ ∆ˆ means “starting from state q, the input string x can be translated into the output string y, ending up in state q′.” (Details omitted.)
Note that a finite state transducer can be run in either direction!
From T as above, we can obtain another transducer T⁻¹ just by swapping the roles of inputs and outputs.
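To make the definition concrete, here is a minimal Python sketch (not from the slides) of a nondeterministic FST and the many-step relation ∆ˆ: it enumerates every output string the machine can produce on a given input. The 4-tuple encoding, with '' standing in for ε, and the example machine at the end are illustrative assumptions.

    from collections import deque

    def transduce(delta, starts, finals, x):
        """Enumerate the outputs an FST can produce on input string x.

        delta:  set of transitions (q, inp, out, q2), where inp and out
                are single symbols or '' (standing in for epsilon).
        Assumes no epsilon-input cycle that emits output (otherwise the
        set of possible outputs would be infinite).
        """
        results, seen = set(), set()
        agenda = deque((q, 0, '') for q in starts)  # (state, position in x, output so far)
        while agenda:
            q, i, y = agenda.popleft()
            if (q, i, y) in seen:
                continue
            seen.add((q, i, y))
            if i == len(x) and q in finals:
                results.add(y)                      # (q0, x, y, q) is in Delta-hat
            for (q1, a, b, q2) in delta:
                if q1 != q:
                    continue
                if a == '':                         # epsilon input: consume nothing
                    agenda.append((q2, i, y + b))
                elif i < len(x) and x[i] == a:      # consume one input symbol
                    agenda.append((q2, i + 1, y + b))
        return results

    # A guessed two-state machine in the spirit of the earlier example
    # (the slide's exact topology is an assumption):
    delta = {(1, 'a', '0', 1), (1, 'b', '', 2),
             (2, 'a', '1', 2), (2, 'b', '', 1)}
    print(transduce(delta, {1}, {1, 2}, 'aab'))     # {'00'}

    # Reversing the transducer is literally swapping inputs and outputs
    # (though this particular inverse has epsilon-input transitions that
    # emit output, so it falls under the caveat above):
    inverse = {(q, b, a, q2) for (q, a, b, q2) in delta}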
FSTs and FSAs
Formally, a finite state transducer T with inputs from Σ and
outputs from Π consists of:
▶ sets Q, S, F as in ordinary NFAs,
▶ a transition relation ∆ ⊆ Q × (Σ ∪ {ε}) × (Π ∪ {ε}) × Q.
Reminder: Formally, an NFA with alphabet Σ consists of:
▶ A finite set Q of states,
▶ A transition relation ∆ ⊆ Q × (Σ ∪ {ε}) × Q,
▶ A set S ⊆ Q of possible starting states,
▶ A set F ⊆ Q of accepting states.
Stage 1: From lexical to intermediate form
Consider the problem of translating a lexical form like ‘fox+N+PL’
into an intermediate form like ‘fox ˆ s # ’, taking account of
irregular forms like goose/geese.
We can do this with a transducer of the following schematic form:
[Diagram: three branches from the start state. Regular nouns are copied to the output, then +N:ε followed by +SG:# or +PL:ˆs#. Irregular singular nouns are copied to the output, then +N:ε followed by +SG:#. Irregular plural nouns are replaced by their plural form, then +N:ε followed by +PL:#.]
We treat each of +N, +SG, +PL as a single symbol.
The ‘transition’ labelled +PL:ˆs# abbreviates three transitions: +PL:ˆ, ε:s, ε:#.
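As a concrete illustration, here is a toy Python sketch of the Stage 1 mapping, written as a lexicon lookup rather than as a transducer; the two-word lexicon and the function name stage1 are assumptions for illustration only.

    # Toy lexicon (an assumption; the real machine encodes it as
    # letter-by-letter transitions, as the next slide shows).
    REGULAR_NOUNS = {'fox', 'cat'}
    IRREGULAR_PLURALS = {'goose': 'geese'}

    def stage1(lexical):
        """Lexical to intermediate form, e.g. 'fox+N+PL' -> 'fox^s#',
        'goose+N+PL' -> 'geese#'."""
        stem, pos, num = lexical.split('+')          # e.g. ('fox', 'N', 'PL')
        assert pos == 'N' and num in ('SG', 'PL')
        if stem in REGULAR_NOUNS:
            return stem + ('^s#' if num == 'PL' else '#')
        if stem in IRREGULAR_PLURALS:
            return (IRREGULAR_PLURALS[stem] if num == 'PL' else stem) + '#'
        raise ValueError('stem not in lexicon: ' + stem)

    print(stage1('fox+N+PL'))    # fox^s#
    print(stage1('goose+N+PL'))  # geese#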
The Stage 1 transducer fleshed out
The left hand part of the preceding diagram is an abbreviation for
something like this (only a small sample shown):
[Diagram: letter-by-letter transitions spelling f-o-x, c-a-t and g-o-o-s-e, plus a branch g, o:e, o:e, s, e mapping ‘goose’ to ‘geese’.]
Here, for simplicity, a single label u abbreviates u : u.
Stage 1 in full
[Diagram: the Stage 1 transducer in full.]
Stage 2: From intermediate to surface form
To convert a sequence of morphemes to surface form, we apply a
number of orthographic rules such as the following.
▶ E-insertion: insert e after s,z,x,ch,sh before a word-final morpheme -s (fox → foxes)
▶ E-deletion: delete e before a suffix beginning with e,i (love → loving)
▶ Consonant doubling: single consonants b,s,g,k,l,m,n,p,r,t,v are doubled before suffix -ed or -ing (beg → begged)
We shall consider a simplified form of E-insertion, ignoring ch,sh.
(Note that this rule is oblivious to whether -s is a plural noun suffix
or a 3rd person verb suffix.)
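The simplified E-insertion rule can equally be phrased as a single rewrite over the intermediate form. A minimal Python sketch using a regular expression instead of the transducer built on the next slide (the function name is an assumption):

    import re

    def e_insertion(intermediate):
        """Simplified E-insertion (ignoring ch, sh): insert e between
        z,s,x and a word-final morpheme -s, then erase the ^ and #
        boundary markers to obtain the surface form."""
        s = re.sub(r'(?<=[zsx])\^(?=s#)', 'e', intermediate)  # fox^s# -> foxes#
        return s.replace('^', '').replace('#', '')            # drop boundaries

    print(e_insertion('fox^s#'))  # foxes
    print(e_insertion('cat^s#'))  # cats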
A transducer for E-insertion (adapted from J+M)
[Diagram: E-insertion FST with states 0, 0', 1, 2, 3, 4 and 5; its transitions are labelled with ?, z,s,x, s, #, ˆ:ε and ε:e.]
Here ? may stand for any symbol except z,s,x,ˆ,#.
(Treat # as a ‘visible space character’.)
At a morpheme boundary following z,s,x, we arrive in State 2.
If the ensuing input sequence is s#, our only option is to go via
states 3 and 4. Note that there’s no #-transition out of State 5.
State 5 allows e.g. ‘exˆserviceˆmen#’ to be translated to
‘exservicemen’.
Putting it all together
FSTs can be cascaded: output from one can be input to another.
To go from lexical to surface form, use ‘Stage 1’ transducer
followed by a bunch of orthographic rule transducers like the
above. (Made more efficient by back-end compilation into one
single transducer.)
The results of this generation process are typically deterministic
(each lexical form gives a unique surface form), even though our
transducers make use of non-determinism along the way.
Running the same cascade backwards lets us do parsing (surface to
lexical form). Because of ambiguity, this process is frequently
non-deterministic: e.g. ‘foxes’ might be analysed as fox+N+PL or
fox+V+Pres+3SG.
Such ambiguities are not resolved by morphological parsing itself; they are left to a later processing stage.
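Reusing the stage1 and e_insertion sketches from the earlier slides, a toy version of the cascade looks as follows. The generate-and-test parse below is only an illustration (a real system runs the composed transducer backwards), and the candidate list is an assumption:

    def generate(lexical):
        """Lexical -> surface form, cascading the two sketches above."""
        return e_insertion(stage1(lexical))

    def parse(surface, candidates=('fox+N+SG', 'fox+N+PL', 'goose+N+PL')):
        """Naive surface -> lexical parsing by generate-and-test."""
        return [lex for lex in candidates if generate(lex) == surface]

    print(parse('foxes'))  # ['fox+N+PL']
    print(parse('geese'))  # ['goose+N+PL']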
Quick exercise 2
(Same E-insertion FST as on the previous slides.)
Apply this backwards to translate from surface to intermediate form.
Starting from state 0, how many sequences of transitions are compatible with the input string ‘asses’?
1. 1
2. 2
3. 3
4. 4
5. More than 4
Solution
(Same E-insertion FST as above.)
On the input string ‘asses’, 10 transition sequences are possible!
(Each arrow is labelled with the surface symbol consumed; ε marks steps that consume no input.)
▶ 0 -a-> 0' -s-> 1 -s-> 1 -ε-> 2 -e-> 3 -s-> 4, output assˆs
▶ 0 -a-> 0' -s-> 1 -s-> 1 -ε-> 2 -e-> 0' -s-> 1, output assˆes
▶ 0 -a-> 0' -s-> 1 -s-> 1 -e-> 0' -s-> 1, output asses
▶ 0 -a-> 0' -s-> 1 -ε-> 2 -s-> 5 -ε-> 2 -e-> 3 -s-> 4, output asˆsˆs
▶ 0 -a-> 0' -s-> 1 -ε-> 2 -s-> 5 -ε-> 2 -e-> 0' -s-> 1, output asˆsˆes
▶ 0 -a-> 0' -s-> 1 -ε-> 2 -s-> 5 -e-> 0' -s-> 1, output asˆses
▶ Four of these can also be followed by 1 -ε-> 2 (output ˆ).
The Porter Stemmer
The lexicon can be quite large with finite-state transducers.
Sometimes we need to extract the stem in a very efficient fashion (such as in IR).
The Porter stemmer: a lexicon-free method for getting the stem of a given word.
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Makes errors:
organization → organ
doing → doe
numerical → numerous
policy → police
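A minimal Python sketch of the lexicon-free idea, using only the three rules quoted above (a toy, not the full Porter algorithm):

    def toy_stem(word):
        """Apply a few Porter-style suffix rewrites to a lowercase word."""
        if word.endswith('ational'):
            return word[:-7] + 'ate'  # relational -> relate
        if word.endswith('sses'):
            return word[:-2]          # grasses -> grass
        if word.endswith('ing') and any(v in word[:-3] for v in 'aeiou'):
            return word[:-3]          # motoring -> motor
        return word

    print(toy_stem('relational'))  # relate
    print(toy_stem('grasses'))     # grass
    print(toy_stem('motoring'))    # motor
    print(toy_stem('doing'))       # do (the real Porter stemmer errs: doe)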
Current methods for morphological analysis
A vibrant area of study.
Mostly done by learning from data, just like many other NLP problems. Two main paradigms: unsupervised and supervised morphological parsing.
NLP systems are not perfect! They can make mistakes, and sometimes ambiguity cannot be resolved at all. BUT, for English, morphological analysis is highly accurate. For other languages, there is still a long way to go.
Finite state transducers in NLP
One of the basic tools used for many applications:
▶ Speech recognition
▶ Machine translation
▶ Part-of-speech tagging
▶ ... and many more
Next class
Part-of-speech tagging:
▶ What are parts of speech?
▶ What are they useful for?
▶ Zipf’s law and the ambiguity of POS tagging
▶ One problem NLP solves really well (... for English)