NLP Unit 5

Unit 5

KOE088: NATURAL LANGUAGE PROCESSING

DETAILED SYLLABUS

Unit V: Ambiguity Resolution: Statistical Methods, Probabilistic Language Processing, Estimating Probabilities, Part-of-Speech Tagging, Obtaining Lexical Probabilities, Probabilistic Context-Free Grammars, Best-First Parsing. Semantics and Logical Form, Word Senses and Ambiguity, Encoding Ambiguity in Logical Form.

Lecture 39

Ambiguity Resolution: Statistical Methods, Probabilistic Language Processing

Natural Language Processing (NLP) is a field of research and application that studies how machines can comprehend and manipulate natural language. NLP encompasses many computational techniques for the automated analysis and representation of human language. The fundamental (atomic) terms of a language play an important role in NLP; examples include bad, somewhat, old, fantastic, extremely, and so on. A combination of such fundamental terms is called a composite term; examples include very good movie, young man, not extremely surprised, and so on. In simple terms, atomic terms are words and composite terms are phrases. Words are the basic building blocks of language: human language, whether spoken or written, is composed of words. NLP approaches at the word level are therefore among the first steps towards comprehending a language.

The performance of NLP systems such as machine translation, automatic question answering, and information retrieval depends on recovering the correct meaning of the text. The biggest challenge is ambiguity, i.e., meaning that is unclear or open and depends on the context of usage.
Types of Ambiguity are:

Lexical Ambiguity
Syntactic Ambiguity

The lexical ambiguity of a word or phrase means that the word has more than one meaning in the language. "Meaning" here refers to the definitions captured by a good dictionary. For example, in Hindi the word "aam" means common as well as mango. Another example, in English, is treating the word silver as a noun, an adjective, or a verb: She bagged two silver medals (noun); She made a silver speech (adjective); His worries had silvered his hair (verb). Syntactic ambiguity is a situation where a sentence may be interpreted in more than one way due to ambiguous sentence structure. For example: John saw the man on the hill with a telescope. Many questions arise: Who is on the hill? John, the man, or both? Who has the telescope? John, the man, or the hill?

Ambiguity in Natural Language Processing can be removed using:

Word Sense Disambiguation

Part of Speech Tagger

HMM (Hidden Markov Model) Tagger

Hybrid combination of taggers with machine learning techniques.

Word sense disambiguation (WSD) aims to identify the intended meanings of words (word senses) in a given context. For a given word and its possible meanings, WSD categorizes an occurrence of the word in context into one or more of its sense classes. The features of the context (such as neighboring words) provide the evidence for classification. Statistical methods for ambiguity resolution include part-of-speech (POS) tagging and the use of probabilistic grammars in parsing. Statistical methods for diminishing ambiguity in sentences have been formulated by deploying large corpora (e.g., the Brown Corpus, WordNet, SentiWordNet) of word usage. The words and sentences in these corpora are pre-tagged with the POS, grammatical structure, and frequencies of usage for words and sentences taken from a huge sample of written language. POS tagging is the mechanism of selecting the most likely part of speech from among the alternatives for each word in a sentence. Probabilistic grammars work on the concept of grammar rules associated with part-of-speech tags. A probabilistic grammar has a probability associated with each rule, based on its frequency of use in the corpus. This approach assists in choosing the best alternative when the text is syntactically ambiguous (that is, there is more than one parse tree for the text).
In practice, only about 10% of the distinct words in a million-word corpus have two or more parts of speech, as shown by the tag-count summary for the complete Brown Corpus (DeRose 1988).

The one word in the Brown corpus that has 7 different part of speech tags is "still."  Presumably, those tags include at
least n, v, adj, adv, and conj.  In practice, there are a small number of different sets of tags (or "tagsets") in use today.  For
example, the tagset used by the Brown corpus has 87 different tags.  In the following discussion, we shall use a very small
4-tag set, {n, v, det, prep} in order to maintain clarity of the ideas behind the statistical approach to part of speech tagging.

Given n and v as the two candidate part-of-speech tags (T) for the word W = flies, the choice between them can be guided by the probability of each alternative, based on the frequency of occurrence of the word in a large corpus of representative text.

Compare prob(T = v | W = flies), which reads "the probability of a verb (v) choice, given that the word is "flies,"
with prob(T = n | W = flies).

In statistics, the following holds for any two events X and Y (with prob(Y) > 0):

prob(X | Y) = prob(X & Y) / prob(Y).

That is, the probability of event X occurring, after knowing that Y has occurred, is the quotient of the probability of events
X and Y both occurring and the probability of Y occurring alone.  Thus,

prob(T = n | W = flies) = prob(T = n & W = flies) / prob(W = flies)

How are these probabilities estimated?  The collected experience of using these words in a large corpus of known (and pre-tagged) text is used as an estimator for these probabilities.

Suppose, for example, that we have a corpus of 1,273,000 words (approximately the size of the Brown Corpus) which has 1000 occurrences of the word flies, 400 pre-tagged as nouns (n) and 600 pre-tagged as verbs (v).  Then prob(W = flies), the probability that a randomly selected word in a new text is flies, is 1000/1,273,000 ≈ 0.0008.  Similarly, the probability that the word is flies and it is a noun (n) or a verb (v) can be estimated by:

prob(T = n & W = flies) = 400/1,273,000 ≈ 0.0003

prob(T = v & W = flies) = 600/1,273,000 ≈ 0.0005

So prob(T = v | W = flies) = prob(T = v & W = flies) / prob(W = flies) ≈ 0.0005/0.0008 = 0.625 with these rounded estimates; using the raw counts directly, the estimate is 600/1000 = 0.6.  In other words, the prediction that an arbitrary occurrence of the word flies is a verb will be correct about 60% of the time.
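The same calculation can be written out directly from the counts. The following is a minimal sketch using the illustrative counts from the example above; note that the corpus size cancels out, so the conditional probability reduces to the ratio of the tag-specific count to the total count for the word:

# Minimal sketch: estimating lexical tag probabilities from corpus counts.
# The counts are the illustrative figures used in the example above.
corpus_size = 1_273_000                              # total word tokens in the corpus
count_flies = 1_000                                  # occurrences of the word "flies"
count_flies_verb = 600                               # occurrences of "flies" pre-tagged as a verb

p_flies = count_flies / corpus_size                  # prob(W = flies), about 0.0008
p_flies_and_verb = count_flies_verb / corpus_size    # prob(T = v & W = flies), about 0.0005

# prob(T = v | W = flies) = prob(T = v & W = flies) / prob(W = flies)
p_verb_given_flies = p_flies_and_verb / p_flies
print(round(p_verb_given_flies, 3))                  # 0.6 -- the corpus size cancels, leaving 600/1000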

Lecture 40

Estimating Probabilities

A language model is something that specifies the following two quantities, for all words in the vocabulary of a language:

1. The probability of a sentence or sequence, Pr(w1, w2, …, wn)

2. The probability of the next word in a sequence, Pr(wk+1 | w1, …, wk)

Note on notation: Pr(w1, w2, …, wn) is short for Pr(W1 = w1, W2 = w2, …, Wn = wn), the random variable W1 taking on value w1, and so on. E.g., Pr(I, love, fish) = Pr(W1 = I, W2 = love, W3 = fish).

We are free to model these probabilities however we want to, which usually means making assumptions. If you make no independence assumptions about the sequence, then one way to estimate its probability is as the fraction of times you have seen it:

Pr(w1, w2, …, wn) = #(w1, w2, …, wn) / N, where N is the total number of observed sequences.

But how many times would you have seen any particular sentence? Estimating from such sparse observations is unreliable, and we have no estimate at all for a sequence we have never seen.
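As a minimal illustration (the tiny "corpus" below is made up purely for this sketch), the following estimates whole-sentence probabilities as relative frequencies and shows how an unseen sequence receives probability zero:

# Minimal sketch: Pr(w1, ..., wn) estimated as #(w1, ..., wn) / N over whole sequences.
from collections import Counter

corpus = [
    ("I", "love", "fish"),
    ("I", "love", "fish"),
    ("I", "love", "pasta"),
    ("you", "love", "fish"),
]
counts = Counter(corpus)
N = len(corpus)                                   # total number of observed sequences

def sentence_prob(sentence):
    return counts[tuple(sentence)] / N            # relative frequency of the whole sequence

print(sentence_prob(["I", "love", "fish"]))       # 0.5
print(sentence_prob(["I", "love", "sushi"]))      # 0.0 -- unseen sequences get zero probability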

Lecture 41

Part-of-Speech Tagging


Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Here the descriptor is called a tag, which may represent a part of speech, semantic information, and so on.
Part-of-Speech (PoS) tagging may then be defined as the process of assigning one of the parts of speech to a given word; it is generally called POS tagging. In simple words, POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their sub-categories.
Most POS tagging approaches fall under rule-based POS tagging, stochastic POS tagging, or transformation-based tagging.

Rule-based POS Tagging

One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct one. Disambiguation is performed by analyzing the linguistic features of the word together with its preceding and following words. For example, if the preceding word is an article, then the word in question is likely a noun.
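The following is a minimal sketch of this idea, with a tiny hypothetical lexicon and a single hand-written rule (not a real tagger):

# Minimal sketch of rule-based tag disambiguation: lexicon lookup plus one hand-written rule.
lexicon = {
    "the": ["det"],
    "a": ["det"],
    "flies": ["n", "v"],        # ambiguous word
    "man": ["n", "v"],
}

def tag(words):
    tags = []
    for i, word in enumerate(words):
        candidates = lexicon.get(word, ["n"])     # default unknown words to noun
        if len(candidates) > 1 and i > 0 and "det" in lexicon.get(words[i - 1], []):
            tags.append("n")                      # rule: a word following an article is tagged as a noun
        else:
            tags.append(candidates[0])
    return tags

print(tag(["the", "flies"]))                      # ['det', 'n']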

Working of Transformation Based Learning(TBL)

In order to understand the working and concept of transformation-based taggers, we need to understand the working of
transformation-based learning. Consider the following steps to understand the working of TBL −
 Start with the solution − The TBL usually starts with some solution to the problem and works in cycles.
 Most beneficial transformation chosen − In each cycle, TBL will choose the most beneficial
transformation.
 Apply to the problem − The transformation chosen in the last step will be applied to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds value, or when there are no more transformations to select. This kind of learning is best suited to classification tasks.
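A minimal sketch of this cycle is given below. It assumes a deliberately simplified transformation template (change tag A to tag B when the previous word is w) and scores each candidate by the number of training errors it removes; real TBL taggers use much richer templates:

# Minimal sketch of the TBL cycle under simplifying assumptions.
def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def apply_rule(words, tags, rule):
    from_tag, to_tag, prev_word = rule
    return [to_tag if t == from_tag and i > 0 and words[i - 1] == prev_word else t
            for i, t in enumerate(tags)]

def tbl(words, gold, initial_tags, candidate_rules):
    tags = list(initial_tags)                     # start with some initial solution
    learned = []
    while True:
        scored = [(errors(tags, gold) - errors(apply_rule(words, tags, r), gold), r)
                  for r in candidate_rules]
        gain, rule = max(scored)
        if gain <= 0:                             # stop when no transformation adds value
            return tags, learned
        tags = apply_rule(words, tags, rule)      # apply the most beneficial transformation
        learned.append(rule)

words, gold = ["the", "flies"], ["det", "n"]
print(tbl(words, gold, ["det", "v"], [("v", "n", "the")]))   # (['det', 'n'], [('v', 'n', 'the')])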

Advantages of Transformation-based Learning (TBL)

The advantages of TBL are as follows −


 We learn a small set of simple rules, and these rules are enough for tagging.
 Development as well as debugging is very easy in TBL because the learned rules are easy to understand.
 Complexity in tagging is reduced because TBL interlaces machine-learned and human-generated rules.
 A transformation-based tagger is much faster than a Markov-model tagger.

Disadvantages of Transformation-based Learning (TBL)


The disadvantages of TBL are as follows −
 Transformation-based learning (TBL) does not provide tag probabilities.
 In TBL, the training time is very long, especially on large corpora.

Hidden Markov Model (HMM) POS Tagging

Before digging deep into HMM POS tagging, we must understand the concept of Hidden Markov Model (HMM).

Hidden Markov Model

An HMM may be defined as a doubly embedded stochastic model, in which the underlying stochastic process is hidden. This hidden process can only be observed through another set of stochastic processes that produces the sequence of observations.
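For illustration, the sketch below runs the Viterbi algorithm over a toy HMM for POS tagging; the states and the transition, emission, and start probabilities are hypothetical values, not estimates from a real corpus:

# Minimal Viterbi sketch for HMM POS tagging with hypothetical toy probabilities.
states = ["n", "v", "det"]
start_p = {"n": 0.3, "v": 0.1, "det": 0.6}
trans_p = {"det": {"n": 0.9, "v": 0.05, "det": 0.05},
           "n":   {"n": 0.2, "v": 0.7,  "det": 0.1},
           "v":   {"n": 0.3, "v": 0.1,  "det": 0.6}}
emit_p  = {"det": {"the": 0.9, "flies": 0.0, "fly": 0.0},
           "n":   {"the": 0.0, "flies": 0.4, "fly": 0.3},
           "v":   {"the": 0.0, "flies": 0.3, "fly": 0.4}}

def viterbi(words):
    # V[t][s] = probability of the best tag sequence for words[:t+1] that ends in state s
    V = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({}); back.append({})
        for s in states:
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s].get(words[t], 0.0), p)
                             for p in states)
            V[t][s], back[t][s] = prob, prev
    best = max(V[-1], key=V[-1].get)              # best final state
    tags = [best]
    for t in range(len(words) - 1, 0, -1):        # follow the back-pointers
        tags.append(back[t][tags[-1]])
    return list(reversed(tags))

print(viterbi(["the", "flies", "fly"]))           # ['det', 'n', 'v'] with these toy numbers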

Lecture 42

Obtaining Lexical Probabilities, Probabilistic Context-Free Grammars

A PCFG is a probabilistic version of a CFG: each production has a probability.

Simple PCFG for English

Sentence Probability

Assume the production used at each node is chosen independently.

• The probability of a derivation (parse tree) is then the product of the probabilities of its productions.

Syntactic Disambiguation

Resolve ambiguity by picking the most probable parse tree.

The probability of a sentence is the sum of the probabilities of all of its derivations.

P(“book the flight through Houston”) = P(D1 ) + P(D2 ) = 0.0000216 + 0.00001296 = 0.00003456
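A minimal sketch of these two computations is given below. The rule probabilities and the two derivations are hypothetical (the grammar table itself is not reproduced in these notes), and lexical rules are omitted for brevity; the point is only that a derivation's probability is a product of rule probabilities and the sentence probability is a sum over derivations:

# Minimal sketch: derivation probability = product of rule probabilities,
# sentence probability = sum over derivations. All numbers are hypothetical.
from math import prod

rule_p = {
    ("S", "VP"): 0.05,
    ("VP", "Verb NP"): 0.4,
    ("VP", "Verb NP PP"): 0.2,
    ("NP", "Det Nominal"): 0.3,
    ("Nominal", "Noun"): 0.5,
    ("Nominal", "Nominal PP"): 0.2,
    ("PP", "Prep NP"): 1.0,
}

derivation_1 = [("S", "VP"), ("VP", "Verb NP"), ("NP", "Det Nominal"),      # PP attached to the NP
                ("Nominal", "Nominal PP"), ("Nominal", "Noun"), ("PP", "Prep NP")]
derivation_2 = [("S", "VP"), ("VP", "Verb NP PP"), ("NP", "Det Nominal"),   # PP attached to the VP
                ("Nominal", "Noun"), ("PP", "Prep NP")]

p1 = prod(rule_p[r] for r in derivation_1)
p2 = prod(rule_p[r] for r in derivation_2)
print(p1, p2, p1 + p2)                            # the sentence probability is the sum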

Three Useful PCFG Tasks

Observation likelihood: To classify and order sentences.

Most likely derivation: To determine the most likely parse tree for a sentence.

Maximum likelihood training: To train a PCFG to fit empirical training data.

Lecture 43

Best-First Parsing

By themselves, probabilistic context-free grammars do nothing to improve the efficiency of the parser. However, algorithms can be developed that attempt to explore the high-probability constituents first. These are called best-first parsing algorithms. The hope is that the best parse can be found quickly and that much of the search space, containing lower-rated possibilities, is never explored.

It turns out that all the chart parsing algorithms can be modified fairly easily to consider the most likely constituents first. The central idea is to make the agenda a priority queue - a structure where the highest-rated elements are always first in the queue. The parser then operates by always removing the highest-ranked constituent from the agenda and adding it to the chart.

It might seem that this one change in search strategy is all that is needed to modify the algorithms, but there is a
complication. The previous chart parsing algorithms all depended on the fact that the parser systematically worked from
left to right, completely processing constituents occurring earlier in the sentence before considering later ones. With the
modified algorithm, this is not the case.

If the last word in the sentence has the highest score, it will be added to the chart first. The problem this causes is that you cannot simply add active arcs to the chart and depend on later steps in the algorithm to extend them.

In fact, the constituent needed to extend a particular active arc may already be on the chart. So whenever an active arc is added to the chart, you must check whether it can be extended immediately, given the current chart; this requires modifying the arc extension algorithm.

Adopting a best-first strategy makes a significant improvement in the efficiency of the parser. For instance, using a grammar and lexicon trained from the corpus, the sentence "The man put a bird in the house" is parsed correctly after generating 65 constituents with the best-first parser. The standard bottom-up algorithm generates 158 constituents on the same sentence, only to obtain the same result. If the standard algorithm were modified to terminate when the first complete S interpretation is found, it would still generate 106 constituents for the same sentence. So best-first strategies can lead to significant improvements in efficiency.
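A minimal sketch of such an agenda as a priority queue (using Python's heapq, with hypothetical constituents and scores) is shown below; because heapq is a min-heap, probabilities are negated so that the highest-rated constituent is removed first:

# Minimal sketch of a best-first agenda implemented as a priority queue.
import heapq

agenda = []

def add(constituent, prob):
    heapq.heappush(agenda, (-prob, constituent))  # negate: highest probability pops first

def next_best():
    neg_prob, constituent = heapq.heappop(agenda)
    return constituent, -neg_prob

add(("NP", 0, 2), 0.25)                           # hypothetical constituents: (category, start, end)
add(("V", 2, 3), 0.80)
add(("PP", 3, 6), 0.40)
print(next_best())                                # (('V', 2, 3), 0.8) -- the highest-rated constituent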

Lecture 44

Semantics and Logical Form

Semantics is the study of the meaning of linguistic expressions – the meaning of morphemes, of words, and of phrases.

Steps for determining the meaning of a sentence:

– Compute a context-independent notion of meaning in logical form (semantic interpretation)
– Interpret the logical form in context to produce the final meaning representation (contextual interpretation)

The study of language in context is called pragmatics.

Verifiability – use the meaning representation to determine the relationship between the meaning of a sentence and the world as we know it.

Query: "Does Maharani serve vegetarian food?" → Serves(Maharani, VegetarianFood)

The straightforward way:

• Make it possible for a system to compare, or match, the meaning representation of an input against stored representations of facts.
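A minimal sketch of this matching step, using a hypothetical fact set that mirrors the Maharani example above:

# Minimal sketch of verifiability: match a query's logical form against stored facts.
facts = {
    ("Serves", "Maharani", "VegetarianFood"),
    ("Serves", "AyCaramba", "MexicanFood"),       # hypothetical additional fact
}

def holds(predicate, *args):
    return (predicate, *args) in facts

print(holds("Serves", "Maharani", "VegetarianFood"))   # True  -> answer "yes"
print(holds("Serves", "Maharani", "MexicanFood"))      # False -> answer "no" (or "unknown")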

Unambiguous representations – a single linguistic input may have different meaning representations assigned to it based on the circumstances in which it occurs.

Ambiguity vs. vagueness

• It's not always easy to distinguish ambiguity from vagueness.

• E.g., "I have two kids and George has three"; "I have one horse and George has two".

The denotation of a natural language sentence is the set of conditions that must hold in the (model) world for the
sentence to be true. This is called the logical form of the sentence.
A logical form is less ambiguous:
• We can check a sentence's truth value by querying a database
• If we know the sentence is true, we can update the database
• Questions become queries on the database
• Comprehending a document amounts to chaining such inferences

Ambiguity
• Lexical (word sense) ambiguity
• Syntactic (structural) ambiguity
• Disambiguation – uses structural information of the sentence and word co-occurrence constraints

Vagueness
• Makes it difficult to determine what to do with a particular input based on its meaning representation
• Some word senses are more specific than others

First Order Predicate Calculus (FOPC)

• Also called First Order Logic (FOL)

We make use of FOPC as the representational framework because it is

– Flexible, well understood, and computationally tractable
– Produced directly from the syntactic structure of a sentence
– Able to specify the sentence meaning without having to refer back to the natural language itself
– Context-independent: it does not contain the results of any analysis that requires interpretation of the sentence in context.

FOPC allows

– The analysis of truth conditions
• Allows us to answer yes/no questions
– The use of variables
• Allows us to answer questions through variable binding
– Inference
• Allows us to answer questions that go beyond what we know explicitly, by determining the truth of propositions that are not literally (explicitly) present in the KB

Elements of FOPC

Terms: the device for representing objects

– Variables
• Make assertions and draw inferences about objects without having to make reference to any particular named object (anonymous objects)
• Depicted as single lower-case letters
– Constants
• Refer to specific objects in the world being described
• Depicted as single capitalized letters or single capitalized words
– Functions
• Refer to unique objects without having to associate a named constant with them
• Syntactically the same as single-argument predicates, but they denote objects (terms) rather than relations

Predicates: symbols that refer to the relations holding among some fixed number of objects in a given domain
– Or symbols that refer to the properties of a single object
• Encode category membership
– The arguments to a predicate must be terms, not other predicates
A CFG specification of the syntax of FOPC
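The grammar itself did not survive in these notes; the following is a reconstructed sketch of a typical textbook-style CFG for the syntax of FOPC, which may differ in detail from the original figure:

Formula → AtomicFormula | Formula Connective Formula | Quantifier Variable Formula | ¬ Formula | (Formula)
AtomicFormula → Predicate(Term, …)
Term → Function(Term, …) | Constant | Variable
Connective → ∧ | ∨ | ⇒
Quantifier → ∀ | ∃
Constant → Maharani | VegetarianFood | …
Predicate → Serves | Have | …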

The logical connectives and negation are used to form larger composite representations. Example: "I only have five dollars and I don't have a lot of time" → Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)

Lecture 45
Word Senses and Ambiguity

Word sense disambiguation (WSD) in Natural Language Processing (NLP) is the problem of identifying which "sense" (meaning) of a word is activated by the use of the word in a particular context or scenario. In people, this appears to be a largely unconscious process. Correctly identifying word senses is a common challenge for NLP systems, and determining the specific usage of a word in a sentence has many applications. Applications of Word Sense Disambiguation include information retrieval, question answering systems, chatbots, etc.

Word Sense Disambiguation (WSD) is a subtask of Natural Language Processing that deals with the problem of
identifying the correct sense of a word in context. Many words in natural language have multiple meanings, and WSD
aims to disambiguate the correct sense of a word in a particular context. For example, the word “bank” can have
different meanings in the sentences “I deposited money in the bank” and “The boat went down the river bank”.

WSD is a challenging task because it requires understanding the context in which the word is used and the different
senses in which the word can be used. Some common approaches to WSD include:

1. Supervised learning: This involves training a machine learning model on a dataset of annotated examples,
where each example contains a target word and its sense in a particular context. The model then learns to
predict the correct sense of the target word in new contexts.
2. Unsupervised learning: This involves clustering words that appear in similar contexts together, and then
assigning senses to the resulting clusters. This approach does not require annotated data, but it is less
accurate than supervised learning.
3. Knowledge-based: This involves using a knowledge base, such as a dictionary or ontology, to map words
to their different senses. This approach relies on the availability and accuracy of the knowledge base.
4. Hybrid: This involves combining multiple approaches, such as supervised and knowledge-based methods,
to improve accuracy.

WSD has many practical applications, including machine translation, information retrieval, and text-to-speech
systems. Improvements in WSD can lead to more accurate and efficient natural language processing systems.

Word Sense Disambiguation (WSD) is a subfield of Natural Language Processing (NLP) that deals with determining
the intended meaning of a word in a given context. It is the process of identifying the correct sense of a word from a set
of possible senses, based on the context in which the word appears. WSD is important for natural language
understanding and machine translation, as it can improve the accuracy of these tasks by providing more accurate word
meanings. Some common approaches to WSD include using WordNet, supervised machine learning, and unsupervised
methods such as clustering.
The noun 'star' has eight different meanings or senses, and a distinct idea can be mapped to each sense of the word. For example,

 "He always wanted to be a Bollywood star." Here 'star' means "a famous singer, performer, sports player, actor, personality, etc."
 "The Milky Way galaxy contains between 200 and 400 billion stars." Here 'star' means "a big ball of burning gas in space that we view as a point of light in the night sky."

Approaches for Word Sense Disambiguation


There are many approaches to Word Sense Disambiguation. The three main approaches are given below:  
1. Supervised: The assumption behind supervised approaches is that the context can supply enough evidence to
disambiguate words on its own (hence, world knowledge and reasoning are deemed unnecessary).  
Supervised methods for Word Sense Disambiguation (WSD) involve training a model using a labeled dataset of word
senses. The model is then used to disambiguate the sense of a target word in new text. Some common techniques used
in supervised WSD include:
1. Decision list: A decision list is a set of rules that are used to assign a sense to a target word based on the
context in which it appears.
2. Neural Network: Neural networks such as feedforward networks, recurrent neural networks, and
transformer networks are used to model the context-sense relationship.
3. Support Vector Machines: SVM is a supervised machine learning algorithm used for classification and
regression analysis.
4. Naive Bayes: Naive Bayes is a probabilistic algorithm that uses Bayes’ theorem to classify text into
predefined categories.
5. Decision Trees: Decision Trees are a flowchart-like structure in which an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents an outcome.

2. Unsupervised: The underlying assumption is that similar senses occur in similar contexts, and thus senses
can be induced from the text by clustering word occurrences using some measure of similarity of context.
Using fixed-size dense vectors (word embeddings) to represent words in context has become one of the most
fundamental blocks in several NLP systems. Traditional word embedding approaches can still be utilized to
improve WSD, despite the fact that they conflate words with many meanings into a single vector
representation. Lexical databases (e.g., WordNet, ConceptNet, BabelNet) can also help unsupervised systems
map words and their senses as dictionaries, in addition to word embedding techniques.

3. Knowledge-Based: This is built on the idea that words used in a text are related to one another, and that this relationship can be seen in the definitions of the words and their meanings. The pair of dictionary senses having the highest word overlap in their dictionary definitions is used to disambiguate two (or more) words. The Lesk algorithm is the classical knowledge-based WSD algorithm. It assumes that words in a given "neighborhood" (a portion of text) will share a common theme. In a simplified version of the Lesk algorithm, the dictionary definition of an ambiguous word is compared to the terms in its neighborhood.
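A minimal sketch of the simplified Lesk idea, using NLTK's WordNet interface (this assumes NLTK and its wordnet data are installed, and is an illustration rather than a full implementation):

# Minimal sketch of simplified Lesk: pick the sense whose definition (plus examples)
# overlaps most with the words of the surrounding context.
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_sentence):
    context = set(w.lower() for w in context_sentence.split())
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(w.lower() for w in sense.definition().split())
        for example in sense.examples():
            signature |= set(w.lower() for w in example.split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", "I deposited money in the bank")
print(sense, sense.definition() if sense else None)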

Lecture 46

Encoding Ambiguity in Logical Form

Ambiguity in Natural Language Processing


Ambiguity is an intrinsic characteristic of human conversation, and one that is particularly challenging in natural language understanding (NLU) scenarios. By ambiguity, we are essentially referring to sentences that have multiple alternative interpretations.
Ambiguity is one of those areas of cognitive science that doesn't have a well-defined solution. The spectrum of what can be considered ambiguous in any language varies greatly depending on the speaker. From a technical standpoint, any sentence in a language with a large enough grammar can have alternative interpretations. However, most native speakers only recognize the primary interpretation when hearing a phrase, while alternative readings may be more obvious to non-native speakers who, cognitively speaking, need to rewire their brains in order to learn a new language. If humans find it difficult to deal with ambiguity in conversations, just imagine the challenge for NLU systems.

Types of Ambiguity

Defining ambiguity technically can be, well, ambiguous. However, there are different forms of ambiguity that are relevant in natural language and, consequently, in artificial intelligence (AI) systems.

Lexical Ambiguity: This type of ambiguity arises when a word can have multiple senses or parts of speech. For instance, in English, the word "back" can be a noun (back stage), an adjective (back door), or an adverb (back away).

Syntactic Ambiguity: This type of ambiguity represents sentences that can be parsed in multiple syntactic forms. Take the following sentence: "I heard his cell phone ring in my office". The prepositional phrase "in my office" can be parsed in a way that modifies the noun or in another way that modifies the verb.

Semantic Ambiguity: This type of ambiguity is typically related to the interpretation of a sentence. For instance, the sentence used in the previous point can be interpreted as if I was physically present in the office or as if the cell phone was in the office.

Metonymy: Arguably the most difficult type of ambiguity, metonymy deals with phrases in which the literal meaning is different from the figurative assertion. For instance, when we say "Samsung is screaming for new management", we don't really mean that the company is literally screaming (although you never know with Samsung these days ;) ).

Metaphors
Metaphors are a specific type of metonymy in which a phrase with one literal meaning is used as an analogy to suggest a different meaning. For example, if we say "Roger Clemens was painting the corners", we are not referring to the former NY Yankees star working as a painter.
Metaphors are particularly difficult to handle as they typically include references to historical or fictitious elements that are hard to place in the context of the conversation. From a conceptual standpoint, metaphors can be seen as a type of metonymy in which the relationship is based on similarity.
