Chapter 2
Morphology
Morphology is the study of word formation – how words are built up from smaller pieces. It covers the
identification, analysis, and description of the structure of a given language's MORPHEMES and
other linguistic units, such as root words, affixes, parts of speech, intonation and stress, or
implied context.
Morphological analysis:
Token = lemma/stem + part of speech + grammatical features
Examples:
cats = cat+N+plur
played = play+V+past
katternas = katt+N+plur+def+gen
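The analyses above can be written out directly as data. The sketch below is a toy illustration (the table and function names are mine, not a real analyzer): each token is stored as a lemma plus a list of grammatical features, and the output string follows the lemma+POS+features format of the examples.

```python
# Toy lookup table of morphological analyses as (lemma, features) pairs,
# mirroring the examples above. Hand-built for illustration only.
ANALYSES = {
    "cats":      ("cat",  ["N", "plur"]),
    "played":    ("play", ["V", "past"]),
    "katternas": ("katt", ["N", "plur", "def", "gen"]),  # Swedish: "the cats'"
}

def analyze(token):
    """Return the analysis as a string like 'cat+N+plur'."""
    lemma, feats = ANALYSES[token]
    return "+".join([lemma] + feats)

print(analyze("cats"))    # cat+N+plur
print(analyze("played"))  # play+V+past
```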
Types of Morphology
Inflectional morphology: modification of a word to express different grammatical categories.
Examples: cats, men, etc.
Derivational morphology: creation of a new word from an existing word, often by changing its
grammatical category.
Examples: happiness, brotherhood, etc.
There are some differences between inflectional and derivational morphemes. First, inflectional
morphemes never change the grammatical category (part of speech) of a word. For example, tall
Natural Language Processing Notes By Prof. Suresh R. Mestry
and taller are both adjectives. The inflectional morpheme -er (comparative marker) simply
produces a different version of the adjective tall.
However, derivational morphemes often change the part of speech of a word. Thus, the verb read
becomes the noun reader when we add the derivational morpheme -er.
It is simply that read is a verb, but reader is a noun.
For example, such derivational prefixes as re- and un- in English generally do not change the
category of the word to which they are attached. Thus, both happy and unhappy are adjectives,
and both fill and refill are verbs, for example. The derivational suffixes -hood and -dom, as in
neighborhood and kingdom, are also typical examples of derivational morphemes that do not
change the grammatical category of the word to which they are attached.
Second, when a derivational suffix and an inflectional suffix are added to the same word, they
always appear in a certain relative order within the word. That is, inflectional suffixes follow
derivational suffixes. Thus, the derivational (-er) is added to read, then the inflectional (-s) is
attached to produce readers.
Similarly, in organize – organizes, the inflectional -s comes after the derivational -ize. When an
inflectional suffix is added to a verb, as with organizes, then we cannot add any further
derivational suffixes. It is impossible to have a form like organizesable, with inflectional -s
after derivational -able because inflectional morphemes occur outside derivational morphemes and
attach to the base or stem.
A third point worth emphasizing is that certain derivational morphemes serve to create new base
forms or new stems to which we can attach other derivational or inflectional affixes. For example,
we use the derivational -atic to create adjectives from nouns, as in words like systematic and
problematic.
Inflectional affixes always have a regular meaning, whereas derivational affixes may have irregular
meanings. If we consider an inflectional affix like the plural -s in word-forms like bicycles, dogs, shoes,
tins, trees, and so on, the difference in meaning between the base and the affixed form is always the same:
'more than one'. If, however, we consider the change in meaning caused by a derivational affix like -age in
words like bandage, peerage, shortage, spillage, and so on, it is difficult to sort out any fixed change in
meaning, or even a small set of meaning changes.
Approaches to Morphology
There are three principal approaches to morphology:
Morpheme-based morphology
Lexeme-based morphology
Word-based morphology
Reducing words to a common base form can be achieved through two possible methods: stemming and
lemmatization. The aim of both processes is the same: reducing the inflectional forms of each word to a
common base or root. However, these two methods are not exactly the same.
Stemming
Stemming algorithms work by cutting off the end or the beginning of the word, taking into
account a list of common prefixes and suffixes that can be found in an inflected word. This
indiscriminate cutting can be successful on some occasions, but not always, which is why this
approach has some limitations.
Lemmatization
Lemmatization, on the other hand, takes into consideration the morphological analysis of the
words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through
to link the form back to its lemma.
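To make the contrast concrete, here is a minimal sketch (my own toy code, not a production stemmer such as Porter's): the stemmer blindly strips a known suffix with no dictionary, so it can overcut, while the "lemmatizer" is a dictionary lookup that links each form back to its lemma.

```python
# Toy suffix-stripping stemmer: chops a known suffix off the end of a word
# without consulting any dictionary, so it can overcut ("caring" -> "car").
SUFFIXES = ["ing", "ies", "ed", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# Toy lemmatizer: a (tiny, hand-built) dictionary linking forms to lemmas.
LEMMA_DICT = {"caring": "care", "feet": "foot", "cats": "cat"}

def lemmatize(word):
    return LEMMA_DICT.get(word, word)

print(stem("caring"))       # car  (indiscriminate cutting)
print(lemmatize("caring"))  # care (dictionary lookup)
```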
Once we have defined a regular expression, it can be implemented via a finite-state automaton.
The finite-state automaton is not only the mathematical device used to implement regular expressions, but
also one of the most significant tools of computational linguistics. Variations of automata such as finite-
state transducers, Hidden Markov Models, and N-gram grammars are important components of speech
recognition and synthesis, spell-checking, and information-extraction applications.
Disjunction: Regular expressions are case sensitive; lower-case /s/ is distinct from upper-case /S/.
This can be solved with square brackets [ and ]. The string of characters inside the brackets specifies
a disjunction of characters to match.
Caret ˆ: The square brackets can also be used to specify what a single character cannot be, by use
of the caret ˆ. If the caret ˆ is the first symbol after the open square bracket [, the resulting pattern
is negated.
For the woodchuck and woodchucks cases we use the question mark /?/, which means ‘the preceding
character or nothing’.
Ranges:
More disjunction: Another word for raccoon is coon; the pipe | is used for disjunction, as in /raccoon|coon/.
Anchors:
o Beginning of string ˆ
o End of string $
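The operators above can be exercised with Python's re module; the patterns below are my own illustrations of bracket disjunction, the optional /?/, caret negation, and anchors.

```python
import re

# Disjunction with square brackets: [wW] matches either case of the first
# letter, and ? makes the preceding character optional (woodchuck(s)).
woodchuck = re.compile(r"[wW]oodchucks?")

# Negation: [^A-Z] matches any single character that is NOT upper-case.
not_upper = re.compile(r"[^A-Z]")

# Anchors: ^ pins the match to the start of the string, $ to the end.
the_line = re.compile(r"^The .* end\.$")

print(bool(woodchuck.fullmatch("Woodchucks")))   # True
print(bool(not_upper.fullmatch("a")))            # True
print(bool(not_upper.fullmatch("A")))            # False
print(bool(the_line.match("The very end.")))     # True
```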
The FST is a multi-function device and can be viewed in the following ways:
Translator: It reads one string on one tape and outputs another string.
Recognizer: It takes a pair of strings as two tapes and accepts/rejects based on whether they match.
Generator: It outputs a pair of strings on two tapes along with a yes/no result based on whether they
match.
Relater: It computes the relation between two sets of strings available on two tapes.
The objective of morphological parsing is to produce output lexicons for a single input lexicon,
e.g., as given in Table 4.1.
The second column in the table contains the stem of the corresponding word (lexicon) in the first column,
along with its morphological features: +N means the word is a noun, +SG means it is singular, +PL
means it is plural, +V marks a verb, and pres-part marks a present participle.
We achieve this through two-level morphology, which represents a word as a correspondence between
a lexical level (a simple concatenation of lexicons, as shown in column 2 of Table 4.1) and a surface
level (as shown in column 1). These are represented on the two tapes of a finite-state transducer.
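A heavily simplified picture of the two levels can be given in code. A real system composes finite-state transducers rather than using a table, but the toy lookup below (my own example entries, in the +N/+PL feature notation used above) shows the surface-to-lexical correspondence the transducer computes, and that it can be run in both directions.

```python
# Toy two-level correspondence: surface form (what is written) on one
# "tape", lexical form (stem plus features) on the other. A real system
# computes this with composed finite-state transducers, not a table.
SURFACE_TO_LEXICAL = {
    "cats":    "cat+N+PL",
    "cat":     "cat+N+SG",
    "playing": "play+V+pres-part",
}

def parse(surface):
    """Surface -> lexical: morphological parsing."""
    return SURFACE_TO_LEXICAL.get(surface)

def generate(lexical):
    """Lexical -> surface: running the machine in the other direction."""
    for surface, lex in SURFACE_TO_LEXICAL.items():
        if lex == lexical:
            return surface
    return None

print(parse("cats"))                   # cat+N+PL
print(generate("play+V+pres-part"))    # playing
```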
N-Gram Models
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)
• Number of parameters required grows exponentially with the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future behavior of a dynamical system depends
only on its recent history. In particular, in a kth-order Markov model, the next state depends only
on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
• N-gram approximation:
P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k−N+1}^{k−1})
Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of
word sequences.
P(w_n | w_{n−1}) = C(w_{n−1} w_n) / C(w_{n−1})

P(w_n | w_{n−N+1}^{n−1}) = C(w_{n−N+1}^{n−1} w_n) / C(w_{n−N+1}^{n−1})
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional words.
Example:
Let’s work through an example using a mini-corpus of three sentences
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus:
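These bigram estimates can be computed mechanically. The sketch below (plain Python, counts taken only from the three sentences above) applies the relative-frequency formula P(w_n | w_{n−1}) = C(w_{n−1} w_n) / C(w_{n−1}):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count unigrams and bigrams over the whitespace-tokenized corpus,
# treating <s> and </s> as ordinary words.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(w, prev):
    """Relative-frequency estimate P(w | prev) = C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))     # 0.666... : <s> is followed by I in 2 of 3 sentences
print(p("am", "I"))      # 0.666...
print(p("Sam", "am"))    # 0.5
print(p("</s>", "Sam"))  # 0.5
```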