Chapter 2
Morphology
Morphology is the study of word formation – how words are built up from smaller pieces. It covers the
identification, analysis, and description of the structure of a given language's MORPHEMES and
other linguistic units, such as root words, affixes, parts of speech, intonation and stress, or
implied context.
Morphological analysis:
Token = lemma/stem + part of speech + grammatical features
Examples:
cats = cat+N+plur
played = play+V+past
katternas = katt+N+plur+def+gen
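The analyses above can be written out directly as data. The sketch below is a toy illustration (the table and function names are mine, not a real analyzer): each token is stored as a lemma plus a list of grammatical features, and the output string follows the lemma+POS+features format of the examples.

```python
# Toy lookup table of morphological analyses as (lemma, features) pairs,
# mirroring the examples above. Hand-built for illustration only.
ANALYSES = {
    "cats":      ("cat",  ["N", "plur"]),
    "played":    ("play", ["V", "past"]),
    "katternas": ("katt", ["N", "plur", "def", "gen"]),  # Swedish: "the cats'"
}

def analyze(token):
    """Return the analysis as a string like 'cat+N+plur'."""
    lemma, feats = ANALYSES[token]
    return "+".join([lemma] + feats)

print(analyze("cats"))    # cat+N+plur
print(analyze("played"))  # play+V+past
```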
Types of Morphology
Inflectional morphology: modification of a word to express different grammatical categories.
Examples: cats, men, etc.
Derivational morphology: creation of a new word from an existing word, often by changing its
grammatical category.
Examples: happiness, brotherhood, etc.
There are some differences between inflectional and derivational morphemes. First, inflectional
morphemes never change the grammatical category (part of speech) of a word. For example, tall
Natural Language Processing Notes By Prof. Suresh R. Mestry
and taller are both adjectives. The inflectional morpheme -er (comparative marker) simply
produces a different version of the adjective tall.
However, derivational morphemes often change the part of speech of a word. Thus, the verb read
becomes the noun reader when we add the derivational morpheme -er.
It is simply that read is a verb, but reader is a noun.
For example, such derivational prefixes as re- and un- in English generally do not change the
category of the word to which they are attached. Thus, both happy and unhappy are adjectives,
and both fill and refill are verbs, for example. The derivational suffixes -hood and -dom, as in
neighborhood and kingdom, are also typical examples of derivational morphemes that do not
change the grammatical category of the word to which they are attached.
Second, when a derivational suffix and an inflectional suffix are added to the same word, they
always appear in a certain relative order within the word. That is, inflectional suffixes follow
derivational suffixes. Thus, the derivational (-er) is added to read, then the inflectional (-s) is
attached to produce readers.
Similarly, in organize – organizes, the inflectional -s comes after the derivational -ize. When an
inflectional suffix is added to a verb, as with organizes, then we cannot add any further
derivational suffixes. It is impossible to have a form like organizesable, with inflectional -s
after derivational -able because inflectional morphemes occur outside derivational morphemes and
attach to the base or stem.
A third point worth emphasizing is that certain derivational morphemes serve to create new base
forms or new stems to which we can attach other derivational or inflectional affixes. For example,
we use the derivational -atic to create adjectives from nouns, as in words like systematic and
problematic.
Inflectional affixes always have a regular meaning, whereas derivational affixes may have irregular
meanings. If we consider an inflectional affix like the plural -s in word-forms like bicycles, dogs, shoes,
tins, trees, and so on, the difference in meaning between the base and the affixed form is always the same:
'more than one'. If, however, we consider the change in meaning caused by a derivational affix like -age in
words like bandage, peerage, shortage, spillage, and so on, it is difficult to sort out any fixed change in
meaning, or even a small set of meaning changes.
Approaches to Morphology
There are three principal approaches to morphology:
Morpheme-based morphology
Lexeme-based morphology
Word-based morphology
Reducing words to a common base form can be achieved through two possible methods: stemming and
lemmatization. The aim of both processes is the same: reducing the inflectional forms of each word to a
common base or root. However, these two methods are not exactly the same.
Stemming
Stemming algorithms work by cutting off the end or the beginning of the word, taking into
account a list of common prefixes and suffixes that can be found in an inflected word. This
indiscriminate cutting can be successful on some occasions, but not always, which is why this
approach has some limitations.
Lemmatization
Lemmatization, on the other hand, takes into consideration the morphological analysis of the
words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through
to link the form back to its lemma.
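To make the contrast concrete, here is a minimal sketch (my own toy code, not a production stemmer such as Porter's): the stemmer blindly strips a known suffix with no dictionary, so it can overcut, while the "lemmatizer" is a dictionary lookup that links each form back to its lemma.

```python
# Toy suffix-stripping stemmer: chops a known suffix off the end of a word
# without consulting any dictionary, so it can overcut ("caring" -> "car").
SUFFIXES = ["ing", "ies", "ed", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# Toy lemmatizer: a (tiny, hand-built) dictionary linking forms to lemmas.
LEMMA_DICT = {"caring": "care", "feet": "foot", "cats": "cat"}

def lemmatize(word):
    return LEMMA_DICT.get(word, word)

print(stem("caring"))       # car  (indiscriminate cutting)
print(lemmatize("caring"))  # care (dictionary lookup)
```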
Once we have defined a regular expression, it can be implemented via a finite-state automaton.
The finite-state automaton is not only the mathematical device used to implement regular expressions, but
also one of the most significant tools of computational linguistics. Variations of automata such as finite-
state transducers, Hidden Markov Models, and N-gram grammars are important components of speech
recognition and synthesis, spell-checking, and information-extraction applications.
Disjunction: Regular expressions are case sensitive; lower-case /s/ is distinct from upper-case /S/.
This can be solved with square brackets [ and ]. The string of characters inside the brackets specifies
a disjunction of characters to match.
Caret ˆ: The square brackets can also be used to specify what a single character cannot be, by use
of the caret ˆ. If the caret ˆ is the first symbol after the open square bracket [, the resulting pattern
is negated.
For the woodchuck and woodchucks cases we use the question mark /?/, which means ‘the preceding
character or nothing’.
Ranges:
More disjunction: Another word for raccoon is coon; the pipe | is used for disjunction, as in /raccoon|coon/.
Anchors:
o Beginning of string ˆ
o End of string $
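The operators above can be exercised with Python's re module; the patterns below are my own illustrations of bracket disjunction, the optional /?/, caret negation, and anchors.

```python
import re

# Disjunction with square brackets: [wW] matches either case of the first
# letter, and ? makes the preceding character optional (woodchuck(s)).
woodchuck = re.compile(r"[wW]oodchucks?")

# Negation: [^A-Z] matches any single character that is NOT upper-case.
not_upper = re.compile(r"[^A-Z]")

# Anchors: ^ pins the match to the start of the string, $ to the end.
the_line = re.compile(r"^The .* end\.$")

print(bool(woodchuck.fullmatch("Woodchucks")))   # True
print(bool(not_upper.fullmatch("a")))            # True
print(bool(not_upper.fullmatch("A")))            # False
print(bool(the_line.match("The very end.")))     # True
```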
The FST is a multi-function device and can be viewed in the following ways:
Translator: It reads one string on one tape and outputs another string.
Recognizer: It takes a pair of strings as two tapes and accepts/rejects based on whether they match.
Generator: It outputs a pair of strings on two tapes along with a yes/no result based on whether they
match.
Relater: It computes the relation between two sets of strings available on two tapes.
The objective of morphological parsing is to produce output lexicons for a single input lexicon,
e.g., as given in Table 4.1.
The second column in the table contains the stem of the corresponding word (lexicon) in the first column,
along with its morphological features: +N means the word is a noun, +SG means it is singular, +PL
means it is plural, +V marks a verb, and pres-part marks a present participle.
We achieve this through two-level morphology, which represents a word as a correspondence between
a lexical level (a simple concatenation of lexicons, as shown in column 2 of Table 4.1) and a surface
level (as shown in column 1). These are represented on the two tapes of a finite-state transducer.
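A heavily simplified picture of the two levels can be given in code. A real system composes finite-state transducers rather than using a table, but the toy lookup below (my own example entries, in the +N/+PL feature notation used above) shows the surface-to-lexical correspondence the transducer computes, and that it can be run in both directions.

```python
# Toy two-level correspondence: surface form (what is written) on one
# "tape", lexical form (stem plus features) on the other. A real system
# computes this with composed finite-state transducers, not a table.
SURFACE_TO_LEXICAL = {
    "cats":    "cat+N+PL",
    "cat":     "cat+N+SG",
    "playing": "play+V+pres-part",
}

def parse(surface):
    """Surface -> lexical: morphological parsing."""
    return SURFACE_TO_LEXICAL.get(surface)

def generate(lexical):
    """Lexical -> surface: running the machine in the other direction."""
    for surface, lex in SURFACE_TO_LEXICAL.items():
        if lex == lexical:
            return surface
    return None

print(parse("cats"))                   # cat+N+PL
print(generate("play+V+pres-part"))    # playing
```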
N-Gram Models
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)
• Number of parameters required grows exponentially with the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future behavior of a dynamical system depends
only on its recent history. In particular, in a kth-order Markov model, the next state depends only
on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
• N-gram approximation:
P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k−N+1}^{k−1})
Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of
word sequences.
P(w_n | w_{n−1}) = C(w_{n−1} w_n) / C(w_{n−1})

P(w_n | w_{n−N+1}^{n−1}) = C(w_{n−N+1}^{n−1} w_n) / C(w_{n−N+1}^{n−1})
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional words.
Example:
Let’s work through an example using a mini-corpus of three sentences
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus:
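These bigram estimates can be computed mechanically. The sketch below (plain Python, counts taken only from the three sentences above) applies the relative-frequency formula P(w_n | w_{n−1}) = C(w_{n−1} w_n) / C(w_{n−1}):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count unigrams and bigrams over the whitespace-tokenized corpus,
# treating <s> and </s> as ordinary words.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(w, prev):
    """Relative-frequency estimate P(w | prev) = C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))     # 0.666... : <s> is followed by I in 2 of 3 sentences
print(p("am", "I"))      # 0.666...
print(p("Sam", "am"))    # 0.5
print(p("</s>", "Sam"))  # 0.5
```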