NLP Unit 5
Lecture -39
Natural Language Processing (NLP) is a field of research and application that studies how, with the help of machines, we can comprehend and manipulate natural language. NLP comprises many computational techniques for the automated analysis and representation of human language. The fundamental terms of a language play an important role in NLP. Examples of such fundamental (atomic) terms are bad, somewhat, old, fantastic, extremely, and so on. A combination of these atomic terms is called a composite term; examples are very good movie, young man, not extremely surprised, and so on. In simple terms, atomic terms are words and composite terms are phrases. Words are the basic building blocks of language: human language, whether spoken or written, is composed of words. NLP approaches that operate at the word level are among the first steps towards comprehending a language.
The performance of NLP systems, including machine translation, automatic question answering, and information retrieval, depends on recovering the correct meaning of the text. The biggest challenge is ambiguity, i.e., a word or sentence whose meaning is unclear or open to multiple interpretations depending on the context of usage.
Types of Ambiguity are:
Lexical Ambiguity
Syntactic Ambiguity
Lexical ambiguity means that a word or phrase has more than one meaning in the language. "Meaning" here refers to the definition captured by a good dictionary. For example, in the Hindi language, "Aam" means both common and mango. Another example, in English, is the word silver, which can be treated as a noun, an adjective, or a verb: She bagged two silver medals (noun); She made a silver speech (adjective); His worries had silvered his hair (verb). Syntactic ambiguity is a situation where a sentence may be interpreted in more than one way due to ambiguous sentence structure. For example: John saw the man on the hill with a telescope. Many questions arise: Who is on the hill? John, the man, or both? Who has the telescope? John, the man, or the hill?
Word sense disambiguation (WSD) aims to identify the intended meanings of words (word senses) in a given context. For a given word and its possible meanings, WSD categorizes an occurrence of the word in context into one or more of its sense classes; features of the context (such as neighboring words) provide the evidence for classification. Statistical methods for ambiguity resolution include part-of-speech (POS) tagging and the use of probabilistic grammars in parsing. Statistical methods for reducing ambiguity in sentences have been formulated using large corpora and lexical resources (e.g., the Brown Corpus, WordNet, SentiWordNet) that record word usage. The words and sentences in these corpora are pre-tagged with parts of speech, grammatical structure, and frequencies of usage, taken from a huge sample of written language. POS tagging is the process of selecting the most likely part of speech from among the alternatives for each word in a sentence. Probabilistic grammars build on ordinary grammar rules, i.e., rules stated over part-of-speech tags. A probabilistic grammar has a probability associated with each rule, based on its frequency of use in the corpus. This approach assists in choosing the best alternative when the text is syntactically ambiguous (that is, when there is more than one parse tree for the text).
In practice, only about 10% of the distinct words in a million-word corpus have two or more parts of speech, as shown by DeRose's (1988) summary of the complete Brown Corpus.
The one word in the Brown Corpus that has 7 different part-of-speech tags is "still"; presumably those tags include at least n, v, adj, adv, and conj. In practice, a small number of different sets of tags (or "tagsets") are in use today. For example, the tagset used by the Brown Corpus has 87 different tags. In the following discussion we shall use a very small 4-tag set, {n, v, det, prep}, in order to keep the ideas behind the statistical approach to part-of-speech tagging clear.
Given n and v as the two candidate part-of-speech tags (T) for the word W = flies, the choice is partially governed by the probability of each alternative, based on the frequency of occurrence of the word in a large corpus of representative text. Compare prob(T = v | W = flies), which reads "the probability of a verb (v) choice, given that the word is flies," with prob(T = n | W = flies).
That is, the probability of event X occurring, given that Y has occurred, is the quotient of the probability of events X and Y both occurring and the probability of Y occurring alone. Thus,

prob(X | Y) = prob(X & Y) / prob(Y)
How are these probabilities estimated? The collected experience of using these words in a large corpus of known (and pre-
tagged) text, is used as an estimator for these probabilities.
Suppose, for example, that we have a corpus of 1,273,000 words (approximately the size of the Brown Corpus) which has 1000 occurrences of the word flies, 400 pre-tagged as nouns (n) and 600 pre-tagged as verbs (v). Then prob(W = flies), the probability that a randomly selected word in a new text is flies, is 1000/1,273,000 = 0.0008. Similarly, the probability that the word is flies and it is a noun (n) or a verb (v) can be estimated by:

prob(T = n & W = flies) = 400/1,273,000 = 0.0003
prob(T = v & W = flies) = 600/1,273,000 = 0.0005
So prob(T = v | W = flies) = prob(T = v & W = flies) / prob(W = flies) = 0.0005/0.0008 = 0.625. In other words, the
prediction that an arbitrary occurrence of the word flies is a verb will be correct 62.5% of the time.
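These estimates can be computed directly from corpus counts. The sketch below is a minimal illustration in Python, assuming a plain list of (word, tag) pairs as the pre-tagged corpus rather than any particular corpus reader.

from collections import Counter

def tag_probabilities(tagged_corpus, word):
    # Estimate prob(T = t | W = word) from a pre-tagged corpus.
    #   tagged_corpus: iterable of (word, tag) pairs.
    # Returns a dict mapping each tag observed with `word` to its conditional probability.
    word_count = 0
    tag_counts = Counter()
    for w, t in tagged_corpus:
        if w == word:
            word_count += 1
            tag_counts[t] += 1
    if word_count == 0:
        return {}
    # prob(T = t | W = word) = prob(T = t & W = word) / prob(W = word)
    #                        = (count(word, t) / N) / (count(word) / N)
    #                        = count(word, t) / count(word)
    return {t: c / word_count for t, c in tag_counts.items()}

# Toy corpus mirroring the flies example: 600 verb and 400 noun occurrences.
corpus = [("flies", "v")] * 600 + [("flies", "n")] * 400 + [("time", "n")] * 1000
print(tag_probabilities(corpus, "flies"))   # {'v': 0.6, 'n': 0.4}

Note that the exact ratio is 600/1000 = 0.6; the 62.5% figure above arises only because the intermediate probabilities were rounded to 0.0005 and 0.0008 before dividing.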
Lecture -40
Estimating Probabilities

A language model assigns probabilities to words and word sequences over the vocabulary (of a language).
• We are free to model these probabilities however we want to, which usually means we have to make assumptions.
• If we make no independence assumptions about the sequence, then one way to estimate the probability of a sequence is the fraction of times we have seen it among N observed sequences:

Pr(w1, w2, …, wn) = #(w1, w2, …, wn) / N

• But how many times would we have seen any particular sentence? Estimating from such sparse observations is unreliable, and it gives no answer at all for a new sequence.
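As a concrete sketch of why this estimator breaks down, the snippet below (with an invented three-sentence corpus) computes the raw fraction of times a sequence was observed; any unseen sequence gets probability zero.

from collections import Counter

def mle_sequence_probability(corpus_sentences, sentence):
    # Maximum-likelihood estimate with no independence assumptions:
    # Pr(w1, ..., wn) = #(w1, ..., wn) / N, the fraction of observed sequences
    # that are exactly this sentence.
    counts = Counter(tuple(s) for s in corpus_sentences)
    n = len(corpus_sentences)
    return counts[tuple(sentence)] / n if n else 0.0

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "sat"],
]
print(mle_sequence_probability(corpus, ["the", "cat", "sat"]))    # 0.666...
print(mle_sequence_probability(corpus, ["the", "cat", "slept"]))  # 0.0 -- unseen sequence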
Lecture -41
One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, the tagger uses hand-written rules to identify the correct one. Disambiguation in rule-based tagging is performed by analyzing the linguistic features of a word along with its preceding and following words. For example, if the preceding word is an article, then the word in question must be a noun.
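As a toy illustration of this idea, the sketch below applies one hand-written rule (a determiner followed by an ambiguous word selects the noun reading) over a small invented lexicon; real rule-based taggers use far larger lexicons and rule sets.

# Hypothetical mini-lexicon: each word maps to its possible tags.
LEXICON = {
    "the": ["det"],
    "a": ["det"],
    "flies": ["n", "v"],
    "man": ["n", "v"],
    "old": ["adj", "n"],
}

def rule_based_tag(words):
    # Tag each word, using a hand-written rule to resolve ambiguity.
    tags = []
    for i, w in enumerate(words):
        candidates = LEXICON.get(w.lower(), ["n"])  # default unknown words to noun
        if len(candidates) == 1:
            tags.append(candidates[0])
        elif i > 0 and tags[i - 1] == "det" and "n" in candidates:
            # Rule: if the preceding word is an article/determiner, choose the noun reading.
            tags.append("n")
        else:
            tags.append(candidates[0])  # fall back to the first listed tag
    return list(zip(words, tags))

print(rule_based_tag(["the", "flies"]))   # [('the', 'det'), ('flies', 'n')]
print(rule_based_tag(["flies", "fast"]))  # [('flies', 'n'), ('fast', 'n')] -- fallback tags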
In order to understand the working of transformation-based taggers, we need to understand transformation-based learning (TBL). Consider the following steps (a sketch of the main loop follows the list):

Start with a solution − TBL starts with an initial solution to the problem and works in cycles.

Choose the most beneficial transformation − In each cycle, TBL chooses the transformation that most improves the current solution.

Apply it to the problem − The transformation chosen in the previous step is applied to the problem.

The algorithm stops when the transformation selected in step 2 no longer adds value, or when there are no more transformations to select. This kind of learning is best suited to classification tasks.
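The skeleton below illustrates the TBL cycle in the abstract: the initial solution, the scoring of candidate transformations, and the stopping condition. The transformation representation and scoring are placeholders, not the actual Brill-tagger rule templates.

def tbl_train(initial_tags, gold_tags, candidate_transforms, min_gain=1):
    # Generic transformation-based learning loop.
    #   initial_tags: the starting solution (e.g. each word tagged with its most frequent tag)
    #   gold_tags: the correct tags, used to score candidate transformations
    #   candidate_transforms: callables mapping a tag list to a new tag list (must not mutate input)
    # Returns the ordered list of transformations that were learned.
    def errors(tags):
        return sum(1 for t, g in zip(tags, gold_tags) if t != g)

    current = list(initial_tags)
    learned = []
    while candidate_transforms:
        # Step 2: score every candidate by how many errors it removes from the current solution.
        scored = [(errors(current) - errors(t(current)), t) for t in candidate_transforms]
        gain, best = max(scored, key=lambda x: x[0])
        if gain < min_gain:        # stop: the best transformation no longer adds value
            break
        current = best(current)    # Step 3: apply the chosen transformation
        learned.append(best)
    return learned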
Before digging deeper into HMM POS tagging, we must understand the concept of the Hidden Markov Model (HMM). An HMM may be defined as a doubly-embedded stochastic model in which the underlying stochastic process is hidden. This hidden process can only be observed through another set of stochastic processes that produces the sequence of observations. In POS tagging, the hidden states are the tags and the observations are the words.
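The sketch below is a minimal Viterbi decoder for such an HMM tagger; the transition and emission probabilities are toy numbers chosen for illustration, not estimates from any real corpus.

def viterbi(words, tags, start_p, trans_p, emit_p):
    # Most likely tag sequence for `words` under a simple HMM.
    #   start_p[t]      = P(first tag is t)
    #   trans_p[t1][t2] = P(next tag is t2 | current tag is t1)
    #   emit_p[t][w]    = P(word w | tag t)
    # best[i][t] = probability of the best tag sequence for words[:i+1] that ends in tag t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 1e-6), p)
                for p in tags
            )
            best[i][t], back[i][t] = prob, prev
    # Trace back from the best final state to recover the tag sequence.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

# Toy model with two tags; probabilities are illustrative only.
tags = ["n", "v"]
start_p = {"n": 0.6, "v": 0.4}
trans_p = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.6, "v": 0.4}}
emit_p = {"n": {"flies": 0.4, "time": 0.6}, "v": {"flies": 0.7, "like": 0.3}}
print(viterbi(["time", "flies"], tags, start_p, trans_p, emit_p))  # ['n', 'v']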
Lecture -42
Sentence Probability

Under a probabilistic context-free grammar, the probability of a sentence is the sum of the probabilities of all of its derivations (parse trees). For the ambiguous sentence "book the flight through Houston", with two derivations D1 and D2:

P("book the flight through Houston") = P(D1) + P(D2) = 0.0000216 + 0.00001296 = 0.00003456

Syntactic Disambiguation

Most likely derivation: to disambiguate, determine the single most likely parse tree (derivation) for the sentence.
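To show how such figures arise, the sketch below multiplies rule probabilities to get each derivation's probability and sums over derivations for the sentence probability. The grammar rules and numbers are invented for illustration and are not the grammar behind the figures above.

from functools import reduce
from operator import mul

# Hypothetical PCFG rule probabilities (invented for illustration).
RULE_PROB = {
    ("S", ("VP",)): 0.05,
    ("VP", ("V", "NP")): 0.40,
    ("VP", ("VP", "PP")): 0.15,
    ("NP", ("Det", "N")): 0.20,
    ("NP", ("NP", "PP")): 0.10,
    ("PP", ("P", "NP")): 1.00,
}

def derivation_probability(rules_used):
    # Probability of one derivation = product of the probabilities of the rules it uses.
    return reduce(mul, (RULE_PROB[r] for r in rules_used), 1.0)

# Two alternative derivations (PP attached to the NP vs. to the VP),
# given simply as the lists of rules they use.
d1 = [("S", ("VP",)), ("VP", ("V", "NP")), ("NP", ("NP", "PP")),
      ("NP", ("Det", "N")), ("PP", ("P", "NP")), ("NP", ("Det", "N"))]
d2 = [("S", ("VP",)), ("VP", ("VP", "PP")), ("VP", ("V", "NP")),
      ("NP", ("Det", "N")), ("PP", ("P", "NP")), ("NP", ("Det", "N"))]

p1, p2 = derivation_probability(d1), derivation_probability(d2)
print(p1, p2, p1 + p2)  # the sentence probability is the sum over its derivations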
Lecture -43
Probabilistic context-free grammars by themselves do nothing to improve the efficiency of the parser. However, algorithms can be developed that attempt to explore the high-probability constituents first. These are called best-first parsing algorithms. The hope is that the best parse can be found quickly, so that much of the search space, containing lower-rated possibilities, is never explored.
It turns out that all the chart parsing algorithms can be modified fairly easily to consider the most likely constituents first. The central idea is to make the agenda a priority queue: a structure where the highest-rated elements are always first in the queue. The parser then operates by always removing the highest-ranked constituent from the agenda and adding it to the chart. A minimal sketch of such an agenda appears below.
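Here is a minimal sketch of a priority-queue agenda using Python's heapq; the scores are assumed to be constituent probabilities supplied by the grammar, and the constituent representation is a placeholder.

import heapq

class Agenda:
    # Priority-queue agenda: pop() always returns the highest-scoring constituent.

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, score, constituent):
        # heapq is a min-heap, so store the negated score to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, constituent))
        self._counter += 1

    def pop(self):
        _, _, constituent = heapq.heappop(self._heap)
        return constituent

    def __bool__(self):
        return bool(self._heap)

# Constituents are pushed with their probabilities and popped best-first.
agenda = Agenda()
agenda.push(0.002, ("NP", 0, 2))
agenda.push(0.010, ("VP", 2, 5))
while agenda:
    print(agenda.pop())  # ('VP', 2, 5) first, then ('NP', 0, 2)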
It might seem that this one change in search strategy is all that is needed to modify the algorithms, but there is a
complication. The previous chart parsing algorithms all depended on the fact that the parser systematically worked from
left to right, completely processing constituents occurring earlier in the sentence before considering later ones. With the
modified algorithm, this is not the case.
If the last word in the sentence has the highest score, it will be added to the chart first. The problem this causes is that you
cannot simply add active arcs to the chart (and depend on later steps in the algorithm to extend them).
In fact, the constituent needed to extend a particular active arc may already be on the chart. Thus, whenever an active arc is added to the chart, you must check to see if it can be extended immediately, given the current chart. This means the arc extension algorithm must be modified.
Adopting a best-first strategy makes a significant improvement in the efficiency of the parser. For instance, using a grammar and lexicon trained from a corpus, the sentence "The man put a bird in the house" is parsed correctly after generating 65 constituents with the best-first parser. The standard bottom-up algorithm generates 158 constituents on the same sentence, only to obtain the same result. Even if the standard algorithm were modified to terminate when the first complete S interpretation is found, it would still generate 106 constituents for the same sentence. So best-first strategies can lead to significant improvements in efficiency.
Lecture -44
The study of the meaning of linguistic expressions:
– Meaning of morphemes
– Meaning of words
– Meaning of phrases

Verifiability
– Use the meaning representation to determine the relationship between the meaning of a sentence and the world as we know it

Unambiguous Representations
– A single linguistic input may have different meaning representations assigned to it, based on the circumstances in which it occurs

The denotation of a natural language sentence is the set of conditions that must hold in the (model) world for the sentence to be true. This is called the logical form of the sentence.

A logical form is less ambiguous:
• We can check a truth value by querying a database
• If we know a sentence is true, we can update the database
• Questions become queries on the database
• Comprehending a document amounts to chaining inferences over such representations

Ambiguity
• Lexical (word sense) ambiguity
• Syntactic (structural) ambiguity
• Disambiguation draws on structural information about the sentence and on word co-occurrence constraints

Vagueness
• Makes it difficult to determine what to do with a particular input based on its meaning representation
• Some word senses are more specific than others
FOPC allows:
– The analysis of truth conditions
• Allows us to answer yes/no questions
– Supports the use of variables
• Allows us to answer questions through the use of variable binding
– Supports inference
• Allows us to answer questions that go beyond what we know explicitly, i.e., to determine the truth of propositions that are not literally (explicitly) present in the KB

Elements of FOPC
– Functions
• Refer to unique objects without having to associate a named constant with them
• Syntactically the same as single-argument predicates
– Predicates
• Symbols that refer to the relations holding among some fixed number of objects in a given domain, or to the properties of a single object
• Encode category membership
• The arguments to a predicate must be terms, not other predicates
– Logical connectives
• Used to form larger composite representations
• Example: "I only have five dollars and I don't have a lot of time" → Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)
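To make the database connection concrete, here is a minimal sketch in which FOPC-style ground facts are stored as tuples and yes/no and wh-questions are answered by querying them; the predicate and constant names are invented for illustration.

# A tiny knowledge base of ground facts: (Predicate, arg1, arg2)
KB = {
    ("Have", "Speaker", "FiveDollars"),
    ("Serves", "Maharani", "VegetarianFood"),
}

def holds(*fact):
    # Yes/no question: is this proposition literally present in the KB?
    return tuple(fact) in KB

def find(predicate, first_arg):
    # Variable binding: return all x such that Predicate(first_arg, x) is in the KB.
    return [f[2] for f in KB if f[0] == predicate and f[1] == first_arg]

print(holds("Have", "Speaker", "FiveDollars"))  # True  (truth conditions)
print(holds("Have", "Speaker", "LotOfTime"))    # False (absence stands in for the negation)
print(find("Serves", "Maharani"))               # ['VegetarianFood'] (answering a wh-question)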
Lecture -45
Word Senses and Ambiguity
Word sense disambiguation (WSD) in Natural Language Processing (NLP) is the problem of identifying which "sense" (meaning) of a word is activated by its use in a particular context or scenario. In people, this appears to be a largely unconscious process. Correctly identifying word senses is a common challenge for NLP systems, and determining the specific usage of a word in a sentence has many applications. Applications of word sense disambiguation include information retrieval, question answering systems, chatbots, and more.
Word Sense Disambiguation (WSD) is a subtask of Natural Language Processing that deals with the problem of
identifying the correct sense of a word in context. Many words in natural language have multiple meanings, and WSD
aims to disambiguate the correct sense of a word in a particular context. For example, the word “bank” can have
different meanings in the sentences “I deposited money in the bank” and “The boat went down the river bank”.
WSD is a challenging task because it requires understanding the context in which the word is used and the different
senses in which the word can be used. Some common approaches to WSD include:
1. Supervised learning: This involves training a machine learning model on a dataset of annotated examples, where each example contains a target word and its sense in a particular context. The model then learns to predict the correct sense of the target word in new contexts (a minimal sketch of this approach follows the list).
2. Unsupervised learning: This involves clustering words that appear in similar contexts together, and then
assigning senses to the resulting clusters. This approach does not require annotated data, but it is less
accurate than supervised learning.
3. Knowledge-based: This involves using a knowledge base, such as a dictionary or ontology, to map words
to their different senses. This approach relies on the availability and accuracy of the knowledge base.
4. Hybrid: This involves combining multiple approaches, such as supervised and knowledge-based methods,
to improve accuracy.
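Below is a minimal sketch of the supervised approach using scikit-learn: the words of the context sentence become bag-of-words features and a linear classifier predicts the sense label. The training sentences and sense labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: contexts for the target word "bank",
# each annotated with the intended sense.
contexts = [
    "I deposited money in the bank yesterday",
    "the bank approved my loan application",
    "the boat drifted toward the river bank",
    "we had a picnic on the grassy bank of the stream",
]
senses = ["financial", "financial", "river", "river"]

# Bag-of-words features over the context plus a simple linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(contexts, senses)

print(model.predict(["I deposited cash in the bank"]))      # likely ['financial']
print(model.predict(["the boat was near the river bank"]))  # likely ['river']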
WSD has many practical applications, including machine translation, information retrieval, and text-to-speech
systems. Improvements in WSD can lead to more accurate and efficient natural language processing systems.
In short, WSD is the process of identifying the correct sense of a word from a set of possible senses, based on the context in which the word appears. It is important for natural language understanding and machine translation, since more accurate word meanings improve the accuracy of these tasks. Common resources and approaches include WordNet, supervised machine learning, and unsupervised methods such as clustering.
The noun "star" has eight different meanings or senses, and an idea can be mapped to each sense of the word. For example:

"He always wanted to be a Bollywood star." Here the word "star" means "a famous and good singer, performer, sports player, actor, personality, etc."

"The Milky Way galaxy contains between 200 and 400 billion stars." Here the word "star" means "a big ball of burning gas in space that we view as a point of light in the night sky."

In more detail, the main approaches to WSD are:

1. Supervised: A machine learning model is trained on sense-annotated examples of the target word in context, as described above.
2. Unsupervised: The underlying assumption is that similar senses occur in similar contexts, and thus senses
can be induced from the text by clustering word occurrences using some measure of similarity of context.
Using fixed-size dense vectors (word embeddings) to represent words in context has become one of the most fundamental building blocks in many NLP systems. Traditional word embedding approaches can still be used to improve WSD, despite the fact that they conflate a word's different meanings into a single vector representation. In addition to word embedding techniques, lexical databases (e.g., WordNet, ConceptNet, BabelNet) can also help unsupervised systems map words to their senses.
3. Knowledge-Based: This is built on the idea that words used in a text are related to one another, and that this relationship can be seen in the definitions of the words and their senses. The pair of dictionary senses having the highest word overlap in their dictionary definitions is used to disambiguate two (or more) words. The Lesk algorithm is the classical knowledge-based WSD algorithm. It assumes that words in a given "neighborhood" (a portion of text) will share a common theme. In a simplified version of the Lesk algorithm, the dictionary definition of an ambiguous word is compared to the terms in its neighborhood (a sketch follows).
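A minimal sketch of simplified Lesk, with two hand-written sense glosses standing in for a real dictionary such as WordNet:

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "that", "i", "to"}

def content_words(text):
    # Lowercase, split on whitespace, and drop a few function words.
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, sense_definitions):
    # Pick the sense whose gloss shares the most content words with the context.
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_definitions.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hand-written glosses for two senses of "bank" (illustrative only).
bank_senses = {
    "bank#financial": "an institution that accepts deposits of money and makes loans",
    "bank#river": "the sloping land alongside a river or other body of water",
}

print(simplified_lesk("I deposited money in the bank", bank_senses))      # 'bank#financial'
print(simplified_lesk("the boat went down the river bank", bank_senses))  # 'bank#river'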
Lecture -46
Types of Ambiguity
Technically defining ambiguity can be, well, ambiguous. However, there are different forms of ambiguity that are relevant in natural language and, consequently, in artificial intelligence (AI) systems.

Lexical Ambiguity: This type of ambiguity arises when a word admits multiple interpretations. For instance, in English, the word "back" can be a noun (back stage), an adjective (back door), or an adverb (back away).

Syntactic Ambiguity: This type of ambiguity arises when a sentence can be parsed in multiple syntactic forms. Take the following sentence: "I heard his cell phone ring in my office". The prepositional phrase "in my office" can be parsed in a way that modifies the noun or in another way that modifies the verb.

Semantic Ambiguity: This type of ambiguity is related to the interpretation of a sentence. For instance, the sentence used in the previous point can be interpreted as if I was physically present in the office or as if the cell phone was in the office.

Metonymy: Arguably the most difficult type of ambiguity, metonymy deals with phrases in which the literal meaning is different from the figurative assertion. For instance, when we say "Samsung is screaming for new management", we don't really mean that the company is literally screaming (although you never know with Samsung these days ;) ).
Metaphors
Metaphors are a specific type of metonymy in which a phrase with one literal meaning is used as an analogy to suggest a different meaning. For example, if we say "Roger Clemens was painting the corners", we are not referring to the former NY Yankee star working as a painter.

Metaphors are particularly difficult to handle, as they typically include references to historical or fictitious elements which are hard to place in the context of the conversation. From a conceptual standpoint, metaphors can be seen as a type of metonymy in which the relationship between sentences is based on similarity.