Morphology parsing
Informatics 2A: Lecture 15
Shay Cohen
School of Informatics
University of Edinburgh
[email protected]
22 October 2018
Formal languages
Natural languages
Essential epistemology
               Exact sciences        Empirical sciences       Engineering
Deals in ...   Axioms and theorems   Facts and theories       Artifacts
Truth is ...   Forever               Temporary                It works!
Examples ...   Maths, CS theory      Physics, Biology,        Applied CS
                                     Linguistics              and NLP
Morphology parsing: the problem
Finite-state transducers
FSTs for morphology parsing and generation
(This lecture is based on Jurafsky & Martin chapter 3, sections
1–7.)
Morphology in other languages
Morphology is the study of the structure of words.
English has relatively impoverished morphology.
Languages with rich morphology: Turkish, Arabic, Hungarian, Korean, and many more. (In fact, English is rather unusual in having such simple morphology.)
For example, Turkish is an agglutinative language, and words are
constructed by concatenating morphemes together without
changing them much:
evlerinizden: “from your houses” (morphemes: ev-ler-iniz-den)
This lecture will mostly discuss how to build an English
morphological analyser.
Morphology
Stems and affixes (prefixes, suffixes, infixes and circumfixes)
combine together
Four ways of combining them:
▶ Inflection (stem + grammatical affix): the word does not change its grammatical class (walk → walking)
▶ Derivation (stem + grammatical affix): the word changes its grammatical class (computerize → computerization)
▶ Compounding (stems combined): doghouse
▶ Cliticization: I’ve, we’re, he’s
Morphology can be concatenative or non-concatenative (e.g.
templatic morphology as in Arabic)
WALS - Suffixing versus Prefixing
Examples: Inflection in different languages
In English, nouns are inflected for number, while verbs are inflected for person and tense.
Number: book / books
Person: you read / she reads
Tense: we talk / we talked
In German, nouns are inflected for number and case, e.g. house becomes:
             Singular               Plural
Nominative   das Haus               die Häuser
Genitive     des Hauses             der Häuser
Dative       dem Haus / dem Hause   den Häusern
Accusative   das Haus               die Häuser
In Spanish, inflection depends on gender, e.g. el sol / la luna.
In Luganda, nouns have ten genders, each with different inflections for singular/plural!
Examples: Agglutination and compounding
ostoskeskuksessa
ostos#keskus+N+Sg+Loc:in
shopping#center+N+Sg+Loc:in
in the shopping center (Finnish)
qangatasuukkuvimmuuriaqalaaqtunga
“I’ll have to go to the airport” (Inuktitut)
Avrupalılaştıramadıklarımızdanmışsınızcasına
“as if you are reportedly of those of ours that we were unable to
Europeanize” (Turkish)
In the most extreme examples, the meaning of the word is the
meaning of a sentence!
Morphological parsing: the problem
English has concatenative morphology. Words can be made up of a
main stem (carrying the basic dictionary meaning) plus one or
more affixes carrying grammatical information. E.g.:
Surface form: cats walking smoothest
Lexical form: cat+N+PL walk+V+PresPart smooth+Adj+Sup
Morphological parsing is the problem of extracting the lexical form
from the surface form. (For speech processing, this includes
identifying the word boundaries.)
We should take account of:
▶ Irregular forms (e.g. goose → geese)
▶ Systematic rules (e.g. ‘e’ inserted before suffix ‘s’ after s,x,z,ch,sh: fox → foxes, watch → watches)
Why bother?
▶ Any NLP task involving grammatical parsing will typically involve morphology parsing as a prerequisite.
▶ Search engines: e.g. a search for ‘fox’ should return documents containing ‘foxes’, and vice versa.
▶ Even a humble task like spell checking can benefit: e.g. is ‘walking’ a possible word form?
But why not just list all derived forms separately in our wordlist
(e.g. walk, walks, walked, walking)?
▶ Might be OK for English, but not for a morphologically rich language: in Turkish, one can pile up to 10 suffixes on a verb stem, leading to 40,000 possible forms for some verbs!
▶ Even for English, morphological parsing makes adding/learning new words easier.
▶ In speech processing, word breaks aren’t known in advance.
How expressive is morphology?
Morphemes are tacked together in a rather “regular” way.
This means that finite-state machines are a good way to model
morphology. There is no need for “unbounded memory” to model
it (there are no long range dependencies).
This is as opposed to syntax, the study of the order of words in a sentence, which we will learn about in a later lecture.
Parsing and generation
Parsing here means going from the surface to the lexical form.
E.g. foxes → fox +N +PL.
Generation is the opposite process: fox +N +PL → foxes. It’s
helpful to consider these two processes together.
Either way, it’s often useful to proceed via an intermediate form,
corresponding to an analysis in terms of morphemes (= minimal
meaningful units) before orthographic rules are applied.
Surface form: foxes
Intermediate form: foxˆs#
Lexical form: fox +N +PL
(ˆ marks a morpheme boundary, # a word boundary.)
N.B. The translation between surface and intermediate form is
exactly the same if ‘foxes’ is a 3rd person singular verb!
Finite-state transducers
We can consider ε-NFAs (over an alphabet Σ) in which transitions may also (optionally) produce output symbols (over a possibly different alphabet Π).
E.g. consider the following machine with input alphabet {a, b} and
output alphabet {0, 1}:
[Diagram: an FST whose transitions are labelled a:0, a:1 and b:ε.]
Such a thing is called a finite state transducer.
In effect, it specifies a (possibly multi-valued) translation from one
regular language to another.
Quick exercise
(Same FST as on the previous slide.)
What output will this produce, given the input aabaaabbab?
1. 001110
2. 001111
3. 0011101
4. More than one output is possible.
Formal definition
Formally, a finite state transducer T with inputs from Σ and
outputs from Π consists of:
▶ sets Q, S, F as in ordinary NFAs,
▶ a transition relation ∆ ⊆ Q × (Σ ∪ {ε}) × (Π ∪ {ε}) × Q.
From this, one can define a many-step transition relation ∆ˆ ⊆ Q × Σ∗ × Π∗ × Q, where (q, x, y, q′) ∈ ∆ˆ means “starting from state q, the input string x can be translated into the output string y, ending up in state q′.” (Details omitted.)
Note that a finite state transducer can be run in either direction!
From T as above, we can obtain another transducer T⁻¹ just by swapping the roles of inputs and outputs.
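To make the definition concrete, here is a minimal Python sketch (not from the slides) of a nondeterministic FST and the many-step relation ∆ˆ: it enumerates every output string the machine can produce on a given input. The 4-tuple encoding, with '' standing in for ε, and the example machine at the end are illustrative assumptions.

    from collections import deque

    def transduce(delta, starts, finals, x):
        """Enumerate the outputs an FST can produce on input string x.

        delta:  set of transitions (q, inp, out, q2), where inp and out
                are single symbols or '' (standing in for epsilon).
        Assumes no epsilon-input cycle that emits output (otherwise the
        set of possible outputs would be infinite).
        """
        results, seen = set(), set()
        agenda = deque((q, 0, '') for q in starts)  # (state, position in x, output so far)
        while agenda:
            q, i, y = agenda.popleft()
            if (q, i, y) in seen:
                continue
            seen.add((q, i, y))
            if i == len(x) and q in finals:
                results.add(y)                      # (q0, x, y, q) is in Delta-hat
            for (q1, a, b, q2) in delta:
                if q1 != q:
                    continue
                if a == '':                         # epsilon input: consume nothing
                    agenda.append((q2, i, y + b))
                elif i < len(x) and x[i] == a:      # consume one input symbol
                    agenda.append((q2, i + 1, y + b))
        return results

    # A guessed two-state machine in the spirit of the earlier example
    # (the slide's exact topology is an assumption):
    delta = {(1, 'a', '0', 1), (1, 'b', '', 2),
             (2, 'a', '1', 2), (2, 'b', '', 1)}
    print(transduce(delta, {1}, {1, 2}, 'aab'))     # {'00'}

    # Reversing the transducer is literally swapping inputs and outputs
    # (though this particular inverse has epsilon-input transitions that
    # emit output, so it falls under the caveat above):
    inverse = {(q, b, a, q2) for (q, a, b, q2) in delta}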
FSTs and FSAs
Formally, a finite state transducer T with inputs from Σ and
outputs from Π consists of:
▶ sets Q, S, F as in ordinary NFAs,
▶ a transition relation ∆ ⊆ Q × (Σ ∪ {ε}) × (Π ∪ {ε}) × Q.
Reminder: Formally, an NFA with alphabet Σ consists of:
▶ A finite set Q of states,
▶ A transition relation ∆ ⊆ Q × (Σ ∪ {ε}) × Q,
▶ A set S ⊆ Q of possible starting states,
▶ A set F ⊆ Q of accepting states.
Stage 1: From lexical to intermediate form
Consider the problem of translating a lexical form like ‘fox+N+PL’
into an intermediate form like ‘fox ˆ s # ’, taking account of
irregular forms like goose/geese.
We can do this with a transducer of the following schematic form:
[Diagram: three branches from the start state. Regular nouns are copied to the output, then +N:ε followed by +SG:# or +PL:ˆs#. Irregular singular nouns are copied to the output, then +N:ε followed by +SG:#. Irregular plural nouns are replaced by their plural form, then +N:ε followed by +PL:#.]
We treat each of +N, +SG, +PL as a single symbol.
The ‘transition’ labelled +PL:ˆs# abbreviates three transitions: +PL:ˆ, ε:s, ε:#.
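As a concrete illustration, here is a toy Python sketch of the Stage 1 mapping, written as a lexicon lookup rather than as a transducer; the two-word lexicon and the function name stage1 are assumptions for illustration only.

    # Toy lexicon (an assumption; the real machine encodes it as
    # letter-by-letter transitions, as the next slide shows).
    REGULAR_NOUNS = {'fox', 'cat'}
    IRREGULAR_PLURALS = {'goose': 'geese'}

    def stage1(lexical):
        """Lexical to intermediate form, e.g. 'fox+N+PL' -> 'fox^s#',
        'goose+N+PL' -> 'geese#'."""
        stem, pos, num = lexical.split('+')          # e.g. ('fox', 'N', 'PL')
        assert pos == 'N' and num in ('SG', 'PL')
        if stem in REGULAR_NOUNS:
            return stem + ('^s#' if num == 'PL' else '#')
        if stem in IRREGULAR_PLURALS:
            return (IRREGULAR_PLURALS[stem] if num == 'PL' else stem) + '#'
        raise ValueError('stem not in lexicon: ' + stem)

    print(stage1('fox+N+PL'))    # fox^s#
    print(stage1('goose+N+PL'))  # geese#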
The Stage 1 transducer fleshed out
The left hand part of the preceding diagram is an abbreviation for
something like this (only a small sample shown):
[Diagram: letter-by-letter transitions spelling f-o-x, c-a-t and g-o-o-s-e, plus a branch g, o:e, o:e, s, e mapping ‘goose’ to ‘geese’.]
Here, for simplicity, a single label u abbreviates u : u.
Stage 1 in full
[Diagram: the Stage 1 transducer in full.]
Stage 2: From intermediate to surface form
To convert a sequence of morphemes to surface form, we apply a
number of orthographic rules such as the following.
▶ E-insertion: insert e after s,z,x,ch,sh before a word-final morpheme -s (fox → foxes)
▶ E-deletion: delete e before a suffix beginning with e,i (love → loving)
▶ Consonant doubling: single consonants b,s,g,k,l,m,n,p,r,t,v are doubled before suffix -ed or -ing (beg → begged)
We shall consider a simplified form of E-insertion, ignoring ch,sh.
(Note that this rule is oblivious to whether -s is a plural noun suffix
or a 3rd person verb suffix.)
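The simplified E-insertion rule can equally be phrased as a single rewrite over the intermediate form. A minimal Python sketch using a regular expression instead of the transducer built on the next slide (the function name is an assumption):

    import re

    def e_insertion(intermediate):
        """Simplified E-insertion (ignoring ch, sh): insert e between
        z,s,x and a word-final morpheme -s, then erase the ^ and #
        boundary markers to obtain the surface form."""
        s = re.sub(r'(?<=[zsx])\^(?=s#)', 'e', intermediate)  # fox^s# -> foxes#
        return s.replace('^', '').replace('#', '')            # drop boundaries

    print(e_insertion('fox^s#'))  # foxes
    print(e_insertion('cat^s#'))  # cats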
A transducer for E-insertion (adapted from J+M)
[Diagram: E-insertion FST with states 0, 0', 1, 2, 3, 4 and 5; its transitions are labelled with ?, z,s,x, s, #, ˆ:ε and ε:e.]
Here ? may stand for any symbol except z,s,x,ˆ,#.
(Treat # as a ‘visible space character’.)
At a morpheme boundary following z,s,x, we arrive in State 2.
If the ensuing input sequence is s#, our only option is to go via
states 3 and 4. Note that there’s no #-transition out of State 5.
State 5 allows e.g. ‘exˆserviceˆmen#’ to be translated to
‘exservicemen’.
Putting it all together
FSTs can be cascaded: output from one can be input to another.
To go from lexical to surface form, use ‘Stage 1’ transducer
followed by a bunch of orthographic rule transducers like the
above. (Made more efficient by back-end compilation into one
single transducer.)
The results of this generation process are typically deterministic
(each lexical form gives a unique surface form), even though our
transducers make use of non-determinism along the way.
Running the same cascade backwards lets us do parsing (surface to
lexical form). Because of ambiguity, this process is frequently
non-deterministic: e.g. ‘foxes’ might be analysed as fox+N+PL or
fox+V+Pres+3SG.
Such ambiguities are not resolved by morphological parsing itself; they are left to a later processing stage.
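Reusing the stage1 and e_insertion sketches from the earlier slides, a toy version of the cascade looks as follows. The generate-and-test parse below is only an illustration (a real system runs the composed transducer backwards), and the candidate list is an assumption:

    def generate(lexical):
        """Lexical -> surface form, cascading the two sketches above."""
        return e_insertion(stage1(lexical))

    def parse(surface, candidates=('fox+N+SG', 'fox+N+PL', 'goose+N+PL')):
        """Naive surface -> lexical parsing by generate-and-test."""
        return [lex for lex in candidates if generate(lex) == surface]

    print(parse('foxes'))  # ['fox+N+PL']
    print(parse('geese'))  # ['goose+N+PL']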
Quick exercise 2
(Same E-insertion FST as on the previous slides.)
Apply this backwards to translate from surface to intermediate form.
Starting from state 0, how many sequences of transitions are compatible with the input string ‘asses’?
1. 1
2. 2
3. 3
4. 4
5. More than 4
Solution
(Same E-insertion FST as above.)
On the input string ‘asses’, 10 transition sequences are possible!
(Each arrow is labelled with the surface symbol consumed; ε marks steps that consume no input.)
▶ 0 -a-> 0' -s-> 1 -s-> 1 -ε-> 2 -e-> 3 -s-> 4, output assˆs
▶ 0 -a-> 0' -s-> 1 -s-> 1 -ε-> 2 -e-> 0' -s-> 1, output assˆes
▶ 0 -a-> 0' -s-> 1 -s-> 1 -e-> 0' -s-> 1, output asses
▶ 0 -a-> 0' -s-> 1 -ε-> 2 -s-> 5 -ε-> 2 -e-> 3 -s-> 4, output asˆsˆs
▶ 0 -a-> 0' -s-> 1 -ε-> 2 -s-> 5 -ε-> 2 -e-> 0' -s-> 1, output asˆsˆes
▶ 0 -a-> 0' -s-> 1 -ε-> 2 -s-> 5 -e-> 0' -s-> 1, output asˆses
▶ Four of these can also be followed by 1 -ε-> 2 (output ˆ).
The Porter Stemmer
The lexicon can be quite large with finite-state transducers.
Sometimes we need to extract the stem in a very efficient fashion (such as in IR).
The Porter stemmer: a lexicon-free method for getting the stem of a given word.
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Makes errors:
organization → organ
doing → doe
numerical → numerous
policy → police
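A minimal Python sketch of the lexicon-free idea, using only the three rules quoted above (a toy, not the full Porter algorithm):

    def toy_stem(word):
        """Apply a few Porter-style suffix rewrites to a lowercase word."""
        if word.endswith('ational'):
            return word[:-7] + 'ate'  # relational -> relate
        if word.endswith('sses'):
            return word[:-2]          # grasses -> grass
        if word.endswith('ing') and any(v in word[:-3] for v in 'aeiou'):
            return word[:-3]          # motoring -> motor
        return word

    print(toy_stem('relational'))  # relate
    print(toy_stem('grasses'))     # grass
    print(toy_stem('motoring'))    # motor
    print(toy_stem('doing'))       # do (the real Porter stemmer errs: doe)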
Current methods for morphological analysis
A vibrant area of study.
Mostly done by learning from data, just like many other NLP problems. Two main paradigms: unsupervised and supervised morphological parsing.
NLP systems are not perfect! They can make mistakes, and sometimes ambiguity cannot be resolved at all. BUT, for English, morphological analysis is highly accurate. For other languages, there is still a long way to go.
Finite state transducers in NLP
One of the basic tools used for many applications:
▶ Speech recognition
▶ Machine translation
▶ Part-of-speech tagging
▶ ... and many more
Next class
Part-of-speech tagging:
▶ What are parts of speech?
▶ What are they useful for?
▶ Zipf’s law and the ambiguity of POS tagging
▶ One problem NLP solves really well (... for English)