NLP - AI2214601: Unit 1 to Unit 5 Notes
UNIT I INTRODUCTION 9
Origins and challenges of NLP – Language Modeling: Grammar-based LM, Statistical LM – Regular Expressions, Finite-State Automata – English Morphology, Transducers for lexicon and rules, Tokenization, Detecting and Correcting Spelling Errors, Minimum Edit Distance
The origins of Natural Language Processing (NLP) can be traced back to the 1950s and
1960s with the development of early computational linguistics and machine translation
systems. Some
key milestones include the development of the Georgetown-IBM experiment in 1954, which
translated Russian sentences into English, and the creation of the first chatbot, ELIZA, in the
mid-1960s.
Challenges in NLP stem from the complexity and ambiguity inherent in natural language.
Here are some of the key challenges:
1. Ambiguity: Natural language is highly ambiguous, with words and phrases often
having multiple meanings depending on context. Resolving this ambiguity is a major
challenge in tasks such as parsing, word sense disambiguation, and machine translation.
2. Syntax and Semantics: Understanding the syntactic and semantic structure of
sentences is crucial for NLP tasks. However, natural language exhibits complex
syntactic and semantic patterns that can be difficult for machines to parse and
understand accurately.
3. Context Dependency: The meaning of a word or phrase can vary depending on the
surrounding context. Capturing and modeling context dependencies is essential for
tasks like sentiment analysis, named entity recognition, and question answering.
4. Lack of Annotated Data: Many NLP tasks require large amounts of annotated data for
training machine learning models. However, creating high-quality annotated datasets
can be time-consuming and expensive, especially for languages with limited resources.
5. Domain Specificity: Natural language varies greatly across different domains and
genres (e.g., medical texts, legal documents, social media posts). Building NLP
systems that perform well across diverse domains is challenging due to the need for
domain adaptation and specialized knowledge.
6. Commonsense Reasoning: Understanding and reasoning about commonsense
knowledge is essential for many NLP tasks, such as language understanding and
generation. However, capturing and representing commonsense knowledge in a
machine-readable format is still an ongoing research challenge.
7. Ethical and Bias Concerns: NLP systems can inadvertently perpetuate biases present
in the data they are trained on, leading to issues such as algorithmic bias and fairness
concerns. Addressing these ethical considerations is crucial for the responsible
development and deployment of NLP technologies.
Despite these challenges, significant progress has been made in NLP in recent years, driven
by advances in machine learning, deep learning, and computational linguistics. Ongoing
research continues to push the boundaries of what is possible in natural language
understanding and generation.
2. Language Modeling
Language modeling is a fundamental task in natural language processing (NLP) that involves
predicting the next word in a sequence of words. The goal is to capture the statistical
structure of language and generate coherent and contextually relevant text. A language model
typically operates in the following steps:
1. Input Sequence: A language model takes as input a sequence of words or tokens. This
sequence can be a sentence, paragraph, or longer text.
2. Context Encoding: The input sequence is encoded into a numerical representation
that can be processed by the language model. This encoding captures the contextual
information of the input, such as the meaning of words and their relationships within
the sequence.
3. Prediction: Based on the encoded context, the language model predicts the probability
distribution over the vocabulary of possible next words. This distribution indicates the
likelihood of each word occurring given the context provided by the input sequence.
4. Sampling: To generate text, the language model can either select the word with the
highest probability (greedy decoding) or sample from the probability distribution to
introduce randomness and generate diverse text.
Overall, language modeling plays a crucial role in various NLP tasks and continues to be an
active area of research and development.
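As a rough illustration of the prediction and sampling steps above, the sketch below assumes we already have a next-word probability distribution for some context (the words and probabilities are made up) and contrasts greedy decoding with sampling. It is only a sketch of the decoding step, not a trained language model.

import random

# Hypothetical next-word distribution for the context "the cat sat on the"
# (the words and probabilities are invented for illustration).
next_word_probs = {"mat": 0.6, "floor": 0.25, "roof": 0.1, "banana": 0.05}

# Greedy decoding: pick the single most probable next word.
greedy_choice = max(next_word_probs, key=next_word_probs.get)

# Sampling: draw a word according to the probability distribution,
# which introduces randomness and produces more diverse continuations.
words = list(next_word_probs)
weights = list(next_word_probs.values())
sampled_choice = random.choices(words, weights=weights, k=1)[0]

print("greedy:", greedy_choice)
print("sampled:", sampled_choice)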
Grammar-based language models (LMs) are a class of language models that rely on explicit
grammar rules to generate or understand natural language text. These models are based on
linguistic theories and formal grammars, which define the syntax and structure of a language.
Here's how grammar-based language models typically work:
1. Grammar Rules: Grammar-based LMs start with a set of grammar rules that describe
the syntactic structure of the language. These rules define how words and phrases can
be combined to form grammatically correct sentences.
2. Parsing: When generating or understanding text, the input is parsed according to the
grammar rules to identify the syntactic structure of the input sentence. This involves
breaking down the input into its constituent parts, such as words, phrases, and clauses,
and determining how they relate to each other.
3. Rule Application: The grammar rules are then applied to the parsed input to generate
or interpret text. These rules govern how words and phrases can be combined to form
valid sentences according to the grammar of the language.
4. Constraints: Grammar-based LMs may incorporate additional constraints to ensure
that the generated text adheres to specific criteria, such as style, domain-specific
vocabulary, or semantic coherence.
5. Evaluation: The generated text is evaluated based on its grammaticality and
coherence according to the rules of the grammar. This evaluation may involve
checking for violations of grammar rules, semantic inconsistencies, or other linguistic
criteria.
For example, suppose we have a small context-free grammar with the rules S -> NP VP,
NP -> Det N, VP -> V NP PP, PP -> P NP, together with lexical rules such as Det -> "the" | "a",
N -> "cat" | "dog" | "ball", V -> "chased", and P -> "on". Starting from the start symbol S, we
can derive a sentence step by step:
1. S
2. NP VP (using the rule S -> NP VP)
3. Det N VP (using the rule NP -> Det N)
4. Det N V NP PP (using the rule VP -> V NP PP)
5. Det N V Det N PP (using the rule NP -> Det N)
6. Det N V Det N P NP (using the rule PP -> P NP)
7. Det N V Det N P Det N (using the rule NP -> Det N)
Replacing each pre-terminal with a word from the lexical rules gives a complete sentence: "the cat chased the dog on a ball."
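The same toy grammar can be used to parse the sentence programmatically. The sketch below is a minimal illustration assuming NLTK is installed; the grammar is only the small example grammar above, not a grammar of English.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'cat' | 'dog' | 'ball'
V -> 'chased'
P -> 'on'
""")

parser = nltk.ChartParser(grammar)
sentence = "the cat chased the dog on a ball".split()

# Print every parse tree the grammar licenses for the sentence.
for tree in parser.parse(sentence):
    print(tree)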
1. Training Data: Statistical language models are trained on large amounts of text data,
known as a corpus. This corpus contains sequences of words along with their
frequencies of occurrence.
2. n-gram Models: One of the simplest approaches to statistical language modeling is
the n-gram model, where the probability of a word sequence is estimated based on the
frequencies of occurrence of n-length sequences of words (n-grams) in the training
data. For example, a bigram model (n=2) estimates the probability of a word given its
preceding word, while a trigram model (n=3) estimates the probability of a word given
its two preceding words.
3. Estimating Probabilities: Given a sequence of words w1, w2, ..., wn, the chain rule
expresses the probability of the entire sequence as the product of the conditional
probabilities of each word given its preceding context:
P(w1, w2, ..., wn) = P(w1) * P(w2|w1) * P(w3|w1, w2) * ... * P(wn|w1, ..., wn−1)
An n-gram model approximates each conditional probability by limiting the history to the
previous n−1 words; for example, a bigram model uses P(wi|wi−1) and a trigram model uses
P(wi|wi−2, wi−1). These conditional probabilities are estimated from the frequencies of
n-grams in the training data using techniques such as maximum likelihood estimation (MLE) or
smoothed estimation methods like add-one smoothing or Kneser-Ney smoothing.
4. Backoff and Interpolation: To address data sparsity issues and improve the
robustness of n-gram models, techniques like backoff and interpolation are often
employed. Backoff involves using lower-order n-grams when higher-order n-grams
have zero counts, while interpolation combines probabilities from different n-gram
orders to smooth the probability estimates.
5. Application: Once trained, a statistical language model can be used for various NLP
tasks. For example, in speech recognition, the language model helps to recognize the
most likely sequence of words given the input speech signal. In machine translation, it
guides the generation of fluent and grammatically correct translations.
Statistical language modeling provides a simple yet effective framework for capturing the
statistical properties of natural language. However, it has limitations such as the inability to
capture long-range dependencies and the need for large amounts of training data to achieve
good performance. More sophisticated approaches, such as neural language models, have
been developed to address these limitations and achieve state-of-the-art results in many NLP
tasks.
Here is a simple example of statistical language modeling using a bigram model. Suppose we have a
small corpus consisting of the following sentences:
"i like to eat apples"
"apples are delicious"
"i like to eat bananas"
We can use this corpus to build a bigram language model, which estimates the probability of
each word given its preceding word. Counting the bigrams in the corpus gives:
("i", "like"): 2
("like", "to"): 2
("to", "eat"): 2
("eat", "apples"): 1
("apples", "are"): 1
("are", "delicious"): 1
("eat", "bananas"): 1
Now, we have a bigram language model that can estimate the probability of word
sequences. For example, if we want to compute the probability of the sentence "I like
to eat bananas," we can multiply the probabilities of the bigrams:
P("i") * P("like" | "i") * P("to" | "like") * P("eat" | "to") * P("bananas" | "eat")
= 1.0 * 1.0 * 1.0 * 1.0 * 0.5
= 0.5
This shows that according to our bigram model, the probability of the sentence "I like
to eat bananas" is 0.5.
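The same computation can be reproduced in a few lines of code. The sketch below (a minimal illustration, not a production language model) counts bigrams in the toy corpus and multiplies the conditional probabilities for the test sentence, treating the probability of the first word as 1.0, as in the worked example above.

from collections import Counter

corpus = [
    "i like to eat apples",
    "apples are delicious",
    "i like to eat bananas",
]

# Count unigrams and bigrams over the corpus.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    """MLE estimate of P(word | prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

test = "i like to eat bananas".split()
prob = 1.0  # P("i") is taken as 1.0, as in the worked example above.
for prev_word, word in zip(test, test[1:]):
    prob *= bigram_prob(prev_word, word)

print(prob)  # 0.5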
Regular expressions (regex) are powerful tools used in natural language processing (NLP)
for pattern matching and text processing tasks. They allow for efficient searching, extraction,
and manipulation of text based on specified patterns. Here are some common applications of
regular expressions in NLP:
1. Tokenization: Regular expressions can be used to split a text into tokens, such as
words or sentences. For example, \w+ matches one or more word characters,
effectively tokenizing words in a sentence.
2. Text Cleaning: Regular expressions are useful for cleaning and preprocessing text
data by removing unwanted characters, punctuation, or formatting. For instance, \W
matches any non-word character, which can be used to remove punctuation marks
from text.
3. Pattern Matching: Regular expressions enable the extraction of specific patterns or
entities from text data. For example, \b\d{3}-\d{3}-\d{4}\b matches phone numbers in
the format XXX-XXX-XXXX.
4. Named Entity Recognition (NER): Regular expressions can be used as simple rules
for identifying named entities such as dates, emails, or URLs in text. For example, a
regex pattern can match strings that resemble email addresses
(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b).
5. Information Extraction: Regular expressions can aid in extracting structured
information from unstructured text, such as dates, addresses, or numerical data. For
instance, \b\d{2}/\d{2}/\d{4}\b matches dates in the format MM/DD/YYYY.
6. Text Normalization: Regular expressions can be used to normalize text by converting
it to a standard format. For example, \b[A-Z]+\b matches all uppercase words, which
can be converted to lowercase for normalization.
7. Text Segmentation: Regular expressions can help in segmenting text into meaningful
units, such as paragraphs or sections. For example, \n\n matches two consecutive
newline characters, which can be used to split text into paragraphs.
While regular expressions are powerful, they also have limitations. They may not handle
complex patterns or variations in text well, and writing and maintaining complex regex
patterns can be challenging. Additionally, regular expressions are often not robust to noisy or
ambiguous text data. In such cases, more advanced techniques, such as rule-based systems or
machine learning models, may be more suitable.
Example: the short sketch below extracts e-mail addresses from a string with re.findall, using
an e-mail pattern like the one above (the addresses in the sample text are made up for
illustration).
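import re

text = "Contact alice@example.com for sales and bob@test.org for support."

# Pattern for substrings that look like e-mail addresses (same idea as above).
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

emails = re.findall(pattern, text)
print(emails)  # ['alice@example.com', 'bob@test.org']

In this example, the pattern matches substrings that look like e-mail addresses, and re.findall returns all non-overlapping matches as a list.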
Finite State Automata (FSA) or Finite State Machines (FSM), which are models used in
computer science and mathematics to represent systems that can be in only a finite number of
states at any given time. These automata are widely used in various fields, including natural
language processing, compiler design, and digital circuit design.
1. Definition: A Finite State Automaton is defined by a finite set of states, a finite set of
input symbols, a transition function that describes how the automaton transitions
between states based on input symbols, a start state, and a set of accept states.
2. Types of FSAs:
o Deterministic Finite Automaton (DFA): In a DFA, for each state and input
symbol, there is exactly one transition leading to a next state. DFAs are
commonly used in lexical analysis and pattern matching.
o Nondeterministic Finite Automaton (NFA): In an NFA, there can be
multiple transitions for a given state and input symbol, or there can be ε-
transitions (transitions without consuming an input symbol). NFAs are often
used in regular expression matching.
3. Operations on FSAs:
o Union, intersection, and complementation of automata.
o Concatenation and Kleene star (closure) of automata.
o Minimization of DFAs to reduce the number of states while preserving the
language recognized by the automaton.
4. Applications:
o Regular expression matching: FSAs are used to implement regular expression
engines.
o Lexical analysis: DFAs are used to recognize tokens in programming languages.
o Pattern recognition: FSAs can be used to model and recognize patterns in data.
5. Limitations:
o FSAs are limited in their expressive power compared to more complex
automata models like pushdown automata and Turing machines.
o They can only recognize regular languages, which are a subset of the
languages recognized by context-free grammars.
Let's consider a simple example of a deterministic finite automaton (DFA) that recognizes
strings over the alphabet {0, 1} that end with "01".
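A small sketch of this DFA in code is given below (the state names q0, q1, and q2 are arbitrary labels): q0 is the start state, q1 records that the last symbol read was "0", and q2, the accepting state, records that the last two symbols read were "01".

# Transition table: transitions[state][symbol] -> next state
transitions = {
    "q0": {"0": "q1", "1": "q0"},
    "q1": {"0": "q1", "1": "q2"},
    "q2": {"0": "q1", "1": "q0"},
}
START, ACCEPT = "q0", {"q2"}

def accepts(string):
    """Return True if the DFA accepts the string (i.e. it ends with '01')."""
    state = START
    for symbol in string:
        state = transitions[state][symbol]
    return state in ACCEPT

for s in ["01", "1101", "0", "010", "111"]:
    print(s, accepts(s))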
English morphology is an essential aspect of natural language processing (NLP) that deals
with the structure and formation of words in the English language. It encompasses various
morphological processes, such as inflection, derivation, compounding, and others.
Understanding English morphology is crucial for tasks like tokenization, stemming,
lemmatization, and part-of-speech tagging. Here's a brief overview of some key concepts in
English morphology and their relevance in NLP:
In NLP, algorithms and models are developed to handle these morphological processes
efficiently, enabling tasks such as text normalization, syntactic analysis, semantic analysis,
and more. Proper handling of English morphology enhances the accuracy and effectiveness of
NLP systems across a wide range of applications.
English morphology in natural language processing (NLP) involves analyzing the structure
and formation of words in the English language. Morphology deals with the internal structure
of words and how they are formed from smaller meaningful units called morphemes. Here's
an example illustrating English morphology: the word "unhappiness" is built from the prefix
"un-", the root "happy", and the suffix "-ness"; recognizing these morphemes lets a system
relate "unhappiness" to "happy" and "happiness".
Understanding English morphology helps NLP systems better comprehend and process text,
enabling tasks such as sentiment analysis, machine translation, information retrieval, and
more.
Transducers for Lexicon and Rules:
1. Lexical Transduction:
o Lexical transduction refers to the process of mapping words from one form to
another based on specific rules or patterns. This could involve transformations
such as stemming or lemmatization, where words are reduced to their base or
dictionary forms.
o For example, in English morphology, converting the word "running" to its base
form "run" involves a lexical transduction rule that removes the suffix "-ing."
2. Rules for Lexical Transduction:
o Lexical transduction rules are typically based on linguistic knowledge and
patterns observed in the language. These rules define how words are
transformed from one form to another.
o Rules can involve the application of affix stripping, suffix removal, or applying
irregular transformation patterns.
o Example lexical transduction rule: "If a word ends with '-ing', remove the
suffix to obtain the base form." (A short code sketch of this rule appears after this list.)
3. Grammatical Transduction:
o Grammatical transduction refers to the process of transforming sentences or
phrases from one grammatical form to another. This could involve tasks such
as converting active voice to passive voice, changing tense, or altering
sentence structure.
o Example: Converting the sentence "The cat chased the mouse" from active
voice to passive voice results in "The mouse was chased by the cat."
4. Rules for Grammatical Transduction:
o Grammatical transduction rules are based on syntactic and grammatical
structures. These rules define how sentences or phrases are transformed while
preserving their meaning.
o Rules can involve rearranging word order, changing verb conjugation, or
altering grammatical features.
o Example grammatical transduction rule: "To convert active voice to passive
voice, move the object of the active sentence to the subject position and change
the verb form to the passive voice."
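As referenced in the lexical transduction rules above, the sketch below implements the naive "-ing" stripping rule and compares it with NLTK's WordNet lemmatizer (assuming NLTK and its WordNet data are installed). The naive rule also shows why real systems need more than simple suffix removal.

from nltk.stem import WordNetLemmatizer

def strip_ing(word):
    """Naive lexical transduction rule: if a word ends with '-ing', remove the suffix."""
    return word[:-3] if word.endswith("ing") else word

lemmatizer = WordNetLemmatizer()

for word in ["running", "sitting", "sing"]:
    # Compare the naive rule with a dictionary-based lemmatizer.
    print(word, strip_ing(word), lemmatizer.lemmatize(word, pos="v"))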
Tokenization:
1. Word Tokenization:
o Word tokenization, also known as word segmentation or word splitting,
involves dividing a text into individual words based on whitespace or
punctuation boundaries.
o Example: The sentence "Tokenization is an important NLP task" can be
tokenized into ["Tokenization", "is", "an", "important", "NLP", "task"].
2. Sentence Tokenization:
o Sentence tokenization involves splitting a text into individual sentences based
on punctuation marks like periods, exclamation marks, and question marks.
o Example: The paragraph "This is the first sentence. This is the second
sentence! And this is the third sentence?" can be tokenized into ["This is the
first sentence.", "This is the second sentence!", "And this is the third
sentence?"].
3. Subword Tokenization:
o Subword tokenization involves dividing words into smaller units, such as
morphemes or character n-grams. This approach is commonly used in
languages with complex morphology or for handling out-of-vocabulary words.
o Example: In subword tokenization, the word "tokenization" can be split into
["to", "ken", "iza", "tion"] or ["token", "iza", "tion"].
4. Tokenization Challenges:
o Tokenization can be challenging for languages with complex word boundaries
or agglutinative morphology.
o Ambiguity in tokenization can arise due to punctuation marks, abbreviations,
contractions, and compound words.
5. Tokenization Libraries:
o Various NLP libraries provide built-in functions for tokenization, including
NLTK (Natural Language Toolkit), spaCy, and the tokenization module in the
TensorFlow and PyTorch frameworks.
6. Preprocessing:
o Tokenization is typically the first step in text preprocessing, followed by tasks
such as lowercasing, stemming, lemmatization, and stop word removal.
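The sketch below illustrates word and sentence tokenization (items 1 and 2 above) with NLTK, assuming NLTK and its "punkt" tokenizer models are installed.

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# nltk.download("punkt")  # uncomment on first run to fetch the tokenizer models

text = ("This is the first sentence. This is the second sentence! "
        "And this is the third sentence?")

print(sent_tokenize(text))   # list of the three sentences
print(word_tokenize("Tokenization is an important NLP task"))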
Detecting and correcting spelling errors is an important task in natural language processing
(NLP) and can significantly improve the accuracy and readability of text. Here's an overview
of how spelling errors are detected and corrected:
1. Spell Checking:
o Spell checking involves identifying words in a text that are not found in a
dictionary or known vocabulary.
o Spell checkers compare each word in the text against a dictionary or a list of
known words to determine if it is spelled correctly.
o Words that are not found in the dictionary are flagged as potential spelling
errors.
2. Candidate Generation:
o Once spelling errors are detected, candidate words are generated as potential
replacements for the misspelled words.
o Candidate generation techniques may involve:
Generating possible corrections by applying operations such as
insertion, deletion, substitution, or transposition of characters.
Using statistical language models to suggest the most likely
replacements based on context.
3. Candidate Ranking:
o After generating candidate replacements, a ranking algorithm is applied to
score and rank the candidate corrections.
o Ranking algorithms consider factors such as:
Edit distance: How many edits are required to transform the misspelled
word into each candidate.
Language model probabilities: How likely each candidate is based on
the surrounding context.
Frequency of occurrence: How frequently each candidate appears in a
large corpus of text.
4. Correction Selection:
o The correction selection process involves choosing the highest-ranked
candidate as the replacement for the misspelled word.
o In some cases, multiple candidate corrections may be suggested to the user for
manual selection.
5. Contextual Spelling Correction:
o Contextual spelling correction takes surrounding context into account when
detecting and correcting spelling errors.
o Contextual information, such as adjacent words, grammar, syntax, and
semantics, can help improve the accuracy of spelling correction.
6. Evaluation and Feedback:
o Spell checkers are often evaluated using manually annotated datasets or user
feedback to assess their accuracy and effectiveness.
o Continuous improvement based on user feedback helps refine and enhance
spelling correction algorithms over time.
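To make the candidate generation step (item 2 above) concrete, the sketch below generates all strings one edit away from a word (insertions, deletions, substitutions, and transpositions) and keeps only those found in a dictionary. It is a minimal illustration in the spirit of classic spell checkers, and the word list used as the dictionary is made up.

import string

def edits1(word):
    """All strings that are one edit (insert, delete, substitute, transpose) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

dictionary = {"spelling", "spewing", "spell"}  # toy word list

misspelled = "speling"
candidates = edits1(misspelled) & dictionary
print(sorted(candidates))  # ['spelling', 'spewing']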
In natural language processing (NLP), the minimum edit distance (also known as
Levenshtein distance) is a metric used to quantify the similarity between two strings by
measuring the minimum number of single-character edits (insertions, deletions, or
substitutions) required to transform one string into the other. It's a fundamental concept used
in various NLP tasks, such as spell checking, text correction, and approximate string
matching. Here's how it works:
1. Definition:
o Given two strings, A of length m and B of length n, the minimum edit distance
between them, denoted as D(A, B), is the minimum number of edits required to
transform string A into string B.
2. Operations:
o Insertion: Add a character to string A.
o Deletion: Remove a character from string A.
o Substitution: Replace a character in string A with another character.
3. Dynamic Programming Algorithm:
o The minimum edit distance can be efficiently computed using dynamic
programming.
o The algorithm fills in a matrix where each cell (i, j) represents the minimum
edit distance between the substrings A[0:i] and B[0:j].
o The algorithm iterates through each position in the matrix, updating the values
based on the minimum cost of the possible edit operations.
o The final value in the bottom-right corner of the matrix represents the
minimum edit distance between the two strings.
4. Applications:
o Spell Checking: Determine the closest words to a misspelled word by
computing the minimum edit distance between the misspelled word and all
words in a dictionary.
o Approximate String Matching: Find strings in a database that are similar to a
given query string by computing the minimum edit distance between the query
string and database strings.
o OCR (Optical Character Recognition): Correct errors in OCR output by
comparing the recognized text with the original text using minimum edit
distance.
5. Example:
o For example, consider the strings "kitten" and "sitting":
The minimum edit distance between them is 3.
One possible sequence of edit operations is: substitute 'k' with 's',
substitute 'e' with 'i', and insert 'g' at the end.
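A compact sketch of the dynamic programming algorithm described above is shown below; it fills an (m+1) x (n+1) matrix of subproblem distances and returns the value in the bottom-right corner.

def min_edit_distance(a, b):
    """Levenshtein distance between a and b (insert/delete/substitute each cost 1)."""
    m, n = len(a), len(b)
    # dp[i][j] = minimum edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all characters of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(min_edit_distance("kitten", "sitting"))  # 3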
Unsmoothed N-grams:
1. Definition:
o An n-gram is a contiguous sequence of n items (words, characters, or tokens)
within a larger sequence of text.
o Unsmoothed n-grams involve calculating the probability of observing each n-
gram in the training data directly from the counts of those n-grams, without
any adjustments for unseen or rare events.
2. Probability Estimation:
o Given a corpus of text, the probability of a word sequence is estimated by
counting the occurrences of each n-gram in the training data and dividing by
the total count of all n-grams.
o For example, the probability of observing the word sequence "the cat sat"
using trigrams would be estimated by counting the number of occurrences of
the trigram "the cat sat" and dividing by the total count of all trigrams in the
corpus.
3. Challenges:
o Unsmoothed n-grams can suffer from data sparsity issues, especially for
higher-order n-grams or in corpora with limited training data.
o If an n-gram is not observed in the training data, its probability will be zero,
which can lead to severe underestimation of the likelihood of unseen word
sequences.
4. Usage:
o Despite their limitations, unsmoothed n-grams can still be useful in certain
contexts, particularly for small or specialized corpora where data sparsity is
less of an issue.
o Unsmoothed n-grams can serve as a baseline model for comparison with more
sophisticated language models that incorporate smoothing techniques.
5. Evaluation:
o The performance of unsmoothed n-gram models can be evaluated using
standard metrics such as perplexity or accuracy on a held-out test set.
o Perplexity measures how well the model predicts the test data and can indicate
the effectiveness of the language model in capturing the distribution of word
sequences in the training corpus.
Unsmoothed N-grams
Language modeling is the way of determining the probability of any sequence of words.
Language modeling is used in a wide variety of applications such as speech recognition,
spam filtering, etc. In fact, language modeling is the key aim behind the implementation of
many state-of-the-art Natural Language Processing models.
Methods of Language Modeling:
There are two types of language modeling:
• Statistical Language Modeling: Statistical language modeling is the development of
probabilistic models that are able to predict the next word in a sequence given the
words that precede it. N-gram language modeling is an example.
• Neural Language Modeling: Neural network methods are achieving better results
than classical methods, both as standalone language models and when incorporated
into larger models for challenging tasks like speech recognition and machine
translation. One way of building a neural language model is through word
embeddings.
N-gram
N-gram can be defined as the contiguous sequence of n items from a given sample of text
or speech. The items can be letters, words, or base pairs according to the application. The
N-grams typically are collected from a text or speech corpus (A long text dataset).
N-gram Language Model:
An N-gram language model predicts the probability of a given N-gram within any
sequence of words in the language. A good N-gram model can predict the next word in a
sentence, i.e., the value of p(w|h).
Examples of N-grams: unigrams ("This", "article", "is", "on", "NLP") or bigrams
("This article", "article is", "is on", "on NLP").
Now, we will establish how to find the next word in a sentence using an N-gram model.
We need to calculate p(w|h), where w is the candidate for the next word and h is the history of
preceding words. For example, in the sentence above, suppose we want to calculate the
probability of the last word being "NLP" given the previous words "This article is on":
• For a unigram model: p("NLP") = count("NLP") / (total number of words in the corpus); the history is ignored.
• For a bigram model: p("NLP" | "on") = count("on NLP") / count("on").
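The sketch below (a minimal illustration using NLTK's ngrams helper and the collections module; the tiny corpus is made up) collects unigram and bigram counts and uses them to estimate p("nlp" | "on").

from collections import Counter
from nltk.util import ngrams

corpus = "this article is on nlp . this article is about n-gram models".split()

unigrams = Counter(ngrams(corpus, 1))
bigrams = Counter(ngrams(corpus, 2))

# MLE estimate of p("nlp" | "on") = count("on nlp") / count("on")
p = bigrams[("on", "nlp")] / unigrams[("on",)]
print(p)  # 1.0 in this tiny corpus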
Evaluating N-grams
When different n-gram orders are combined by interpolation, each order gets a weight (lambda).
The values of these lambdas can be calculated if we have a dev set (held-out data): the lambdas
are set to maximize the likelihood of that data under the interpolated model.
Open vs Closed Vocabulary tasks: If we know all the words in advance, then it is a closed
vocabulary task. But often we do not know all the words, and the test set may contain OOV
(Out Of Vocabulary) words. One way to deal with them is to create a special token called
<UNK>. Any word which has low probability during training is changed to <UNK>, and the
model is then trained as if <UNK> were an ordinary word. At decoding time, we change any
unseen word to <UNK> and compute its probability with respect to the language model.
Smoothing for Large N-Grams (Web-scale N-Grams): We use the Stupid Backoff technique.
P(Wi | Wi-k ... Wi-1) is given by the usual relative-frequency estimate, but if
count(Wi-k ... Wi) is below some threshold we back off and use
α * P(Wi | Wi-k+1 ... Wi-1), where α is a fixed fraction whose best value can be chosen using
dev-set data. Stupid Backoff produces scores rather than probabilities, and it works quite
well for large-scale N-grams.
Add-k smoothing does not work well for language modeling, while Stupid Backoff works well
for large N-gram models. Let us now look at Good-Turing smoothing.
Good-Turing Smoothing
Let Nc be the number of n-gram types that occur exactly c times in the training data, and let N
be the total number of observed n-gram tokens. For example, in a toy corpus we might have
N1 = 1 and N2 = 3. The intuition behind Good-Turing smoothing can be derived by leave-one-out
validation or by using a held-out set.
P(seeing a new, previously unseen event) = N1 / N. Generalizing, the adjusted count is
c* = (c + 1) * Nc+1 / Nc, and the smoothed probability of an event seen c times is P = c* / N.
For unseen events the raw count c is 0, since we have not observed them before. After the first
few counts (N0, N1, ..., N10), the Nc values become sparse and some may be zero (say, N127
could be zero), so in practice the observed Nc values are replaced by a smooth best-fit function
before applying the formula.
Kneser-Ney Smoothing
First do absolute discounting (motivated by Good-Turing smoothing): the discount suggested by
Good-Turing is nearly constant, roughly d = 0.75. So we apply absolute discounting and then
interpolate: discounted bigram estimate + ( L(Wi-1) * P(w) ).
Instead of P(w) (the unigram probability, "how likely is w"), Pcontinuation(w) ("how likely is w
to appear as a novel continuation") is a better estimate.
Pcontinuation(w) = (number of distinct words that precede w) / (total number of distinct word bigram types).
A frequent word like "Francisco" will have a low continuation probability because it appears
almost only after "San".
Pkn(Wi | Wi-1) = max(count(Wi-1 Wi) − d, 0) / count(Wi-1) + L(Wi-1) * Pcontinuation(Wi)
WORD CLASSES:
Words can be grouped into classes referred to as Part of Speech (PoS) or morphological
classes
Traditional grammar is based on a few types of PoS (noun, verb, adjective, preposition, adverb,
conjunction, etc.). More recent models are based on a larger number of classes:
45 tags: Penn Treebank
87 tags: Brown corpus
146 tags: C7 tagset
The PoS of a word provides crucial information for determining the role of the word itself and of
the words close to it in the sentence.
Knowing whether a word is a personal pronoun (I, you, he/she, ...) or a possessive pronoun (my,
your, his/her, ...) allows a more accurate selection of the most probable words that appear in
its neighborhood, since syntactic rules are often based on the PoS of words,
e.g. possessive pronoun + noun vs. personal pronoun + verb.
The 4 largest open classes of words, present in most languages, are nouns, verbs, adjectives, and adverbs.
Stochastic POS Tagging
Another technique of tagging is Stochastic POS Tagging. Now, the question that arises here
is which model can be stochastic. The model that includes frequency or probability
(statistics) can be called stochastic. Any number of different approaches to the problem of
part-of-speech tagging can be referred to as stochastic tagger.
The simplest stochastic tagger applies the following approaches for POS tagging −
Word Frequency Approach
In this approach, the stochastic taggers disambiguate the words based on the probability that a
word occurs with a particular tag. We can also say that the tag encountered most frequently
with the word in the training set is the one assigned to an ambiguous instance of that word.
The main issue with this approach is that it may yield an inadmissible sequence of tags.
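A rough sketch of the word frequency approach is shown below: for each word in a tagged corpus, remember the tag it occurs with most often, and assign that tag at tagging time. It assumes NLTK and its Penn Treebank sample corpus are installed.

import nltk
from nltk.corpus import treebank

# nltk.download("treebank")  # uncomment on first run to fetch the corpus sample

# For every word, count how often it appears with each tag.
cfd = nltk.ConditionalFreqDist(
    (word.lower(), tag) for word, tag in treebank.tagged_words()
)

def most_frequent_tag(word, default="NN"):
    """Assign the tag seen most often with this word in the training corpus."""
    freqs = cfd[word.lower()]
    return freqs.max() if freqs else default

for w in ["the", "dog", "runs"]:
    print(w, most_frequent_tag(w))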
Tag Sequence Probabilities
It is another approach to stochastic tagging, where the tagger calculates the probability of a
given sequence of tags occurring. It is also called the n-gram approach, because the best tag
for a given word is determined by the probability with which it occurs with the n previous tags.
Properties of Stochastic POST Tagging
Stochastic POS taggers possess the following properties −
• This POS tagging is based on the probability of a tag occurring.
• It requires a training corpus.
• There is no probability for words that do not exist in the training corpus.
• It uses a different testing corpus (other than the training corpus).
• It is the simplest form of POS tagging because it chooses the most frequent tag
associated with a word in the training corpus.
Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of
transformation-based learning (TBL), which is a rule-based algorithm for automatic tagging
of POS in a given text. TBL allows us to have linguistic knowledge in a readable form and
transforms one state to another state by using transformation rules.
It draws inspiration from both of the previously explained taggers − rule-based and stochastic.
Like a rule-based tagger, it is based on rules that specify what tags need to be assigned to
what words. On the other hand, like a stochastic tagger, it is a machine learning technique in
which rules are automatically induced from the data.
Working of Transformation Based Learning(TBL)
In order to understand the working and concept of transformation-based taggers, we need to
understand the working of transformation-based learning. Consider the following steps to
understand the working of TBL −
• Start with the solution − The TBL usually starts with some solution to the
problem and works in cycles.
• Most beneficial transformation chosen − In each cycle, TBL will choose the most
beneficial transformation.
• Apply to the problem − The transformation chosen in the last step will be applied
to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds value or when
there are no more transformations to be selected. This kind of learning is best suited to
classification tasks.
Advantages of Transformation-based Learning (TBL)
The advantages of TBL are as follows −
• We learn a small set of simple rules, and these rules are enough for tagging.
• Development as well as debugging is very easy in TBL because the learned rules are
easy to understand.
• Complexity in tagging is reduced because TBL interlaces machine-learned and
human-generated rules.
• A transformation-based tagger is much faster than a Markov-model tagger.
Disadvantages of Transformation-based Learning (TBL)
The disadvantages of TBL are as follows −
• Transformation-based learning (TBL) does not provide tag probabilities.
• In TBL, the training time is very long especially on large corpora.
UNIT-3 Syntactic Analysis
Examples of auxiliary verbs include “be,” “do,” “have,” “will,” “shall,” “may,” “can,” “must,” “ought,”
“should,” “could,” and “would.”
The sentence "That cold, empty sky was full of fire and light" is broken down
into its grammatical components. The subject of the sentence is the noun
phrase "That cold, empty sky," and the predicate is "was full of fire and
light," with "full" as the predicate adjective and "of fire and light" as the
prepositional phrase that modifies "full."
Sentence:
"That cold, empty sky was full of fire and light."
Structure:
1. S (Sentence): The root node, representing the entire sentence.
2. NP-SBJ (Noun Phrase - Subject): This is the subject of the sentence. It contains:
o DT (Determiner): "That"
o JJ (Adjective): "cold"
o , (Comma): Separates adjectives.
o JJ (Adjective): "empty"
o NN (Noun): "sky"
3. VP (Verb Phrase): This represents the action or predicate in the sentence. It contains:
o VBD (Verb, Past Tense): "was" (the linking verb in past tense).
o ADJP-PRD (Adjective Phrase - Predicate): The predicate describing the
subject.
JJ (Adjective): "full" (describing the state of the subject).
PP (Prepositional Phrase): Begins with a preposition and includes:
IN (Preposition): "of"
NP (Noun Phrase): The object of the preposition "of," consisting of:
NN (Noun): "fire"
CC (Coordinating Conjunction): "and"
NN (Noun): "light"
Example:
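The analysis above corresponds to a Penn Treebank style bracketing, which can be built and displayed with NLTK's Tree class; the sketch below is a minimal illustration assuming NLTK is installed.

from nltk import Tree

bracketing = """(S
  (NP-SBJ (DT That) (JJ cold) (, ,) (JJ empty) (NN sky))
  (VP (VBD was)
    (ADJP-PRD (JJ full)
      (PP (IN of)
        (NP (NN fire) (CC and) (NN light))))))"""

tree = Tree.fromstring(bracketing)
tree.pretty_print()    # draws the tree as ASCII art
print(tree.leaves())   # the words of the sentence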
Dynamic Programming Parsing
Shallow Parsing
3.7 Probabilistic Cocke-Younger-Kasami (PCYK/PCKY)
CYK another Example:
PCYK /PCKY Example:
Example of PCYK:
https://youtu.be/fairHU-DEVY?si=mIxsNEpCXb9wtIRV
3.8 Lexicalized PCFGs or Probabilistic Lexicalized CFGs:
https://www.youtube.com/watch?v=LuXv9T6KdV4
Second Assumption
The second probability in equation (1) above can be approximated by assuming that a word
appears in a category independent of the words in the preceding or succeeding categories
which can be explained mathematically as follows −
PROB (W1,..., WT | C1,..., CT) = Πi=1..T PROB (Wi|Ci)
Now, on the basis of the above two assumptions, our goal reduces to finding a sequence C
which maximizes
Πi=1...T PROB(Ci|Ci-1) * PROB(Wi|Ci)
Now the question that arises here is has converting the problem to the above form really
helped us. The answer is - yes, it has. If we have a large tagged corpus, then the two
probabilities in the above formula can be calculated as −
PROB (Ci=VERB|Ci-1=NOUN) = (# of instances where Verb follows Noun) / (# of
instances where Noun appears) (2)
PROB (Wi|Ci) = (# of instances where Wi appears in Ci) /(# of instances where Ci appears)
(3)
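The two quantities in equations (2) and (3) can be estimated directly from a tagged corpus by counting. The sketch below (assuming NLTK and its Penn Treebank sample are installed) estimates tag-transition probabilities, as in equation (2), and word-emission probabilities, as in equation (3), from simple counts.

from collections import Counter
from nltk.corpus import treebank

tagged = treebank.tagged_words()          # sequence of (word, tag) pairs
tags = [tag for _, tag in tagged]

tag_counts = Counter(tags)
transition_counts = Counter(zip(tags, tags[1:]))          # (previous tag, tag)
emission_counts = Counter((tag, word.lower()) for word, tag in tagged)

def transition_prob(prev_tag, tag):
    """Estimate of PROB(Ci = tag | Ci-1 = prev_tag), as in equation (2)."""
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def emission_prob(word, tag):
    """Estimate of PROB(Wi = word | Ci = tag), as in equation (3)."""
    return emission_counts[(tag, word.lower())] / tag_counts[tag]

print(transition_prob("DT", "NN"))   # how often a noun follows a determiner
print(emission_prob("the", "DT"))    # how often "the" realizes the DT tag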
We can see any problems in natural language processing as linguistic classification problems
in which linguistic contexts are used to predict linguistic classes. Maximum entropy models
are a clean way to combine various pieces of contextual evidence to estimate the probability
of a particular linguistic class occurring with a specific linguistic context.
Maximum entropy classification is a method that generalizes logistic regression to multiclass
problems. The Maximum Entropy model is a type of log-linear model.
If we are given some data and told to make a decision, we could think of attributes of the
data, i.e., features. Some of these features might be more important than others.
We apply a weight to each feature found in the data, and we add up all of the features.
Finally, the weighted sum is normalized to give a fraction between 0 and 1. We can use this
fraction to tell us the score of how confident we might be in making a decision.
Maximum Likelihood
The principle of maximum likelihood says that we have to find the parameter values w that
model the input data x with the maximum probability. The aim is to find the weight
parameters that maximize the likelihood of the training data.
Assume we have a random sample with a training set of n examples. We assume the input
values to be independent, so the probability function f(x, w) is the product of the
probabilities of each input.
As in maximum likelihood estimation, for the conditional probability we choose the parameter
estimate w_hat that maximizes the product of f(yi | xi, w) over the training examples.
The function f(x, y) is a feature function that can account for relations between the data and
the labels. It expresses some characteristic of the data point and takes the value 0 or 1
depending on the absence or presence of that characteristic. The weight wj of a feature
function captures how closely the feature is related to the given label. In the training
process, wj is initialized randomly, and the training process then learns the weights through
gradient descent with some optimization method.
Approach
In the training phase, we have to find the weights w. Let us start with the log-likelihood function:
L(w) = Σi log P(yi | xi; w)
This function L(w) measures how well w explains the labeled data. The higher the value of
P(y|x; w), the greater the value of L(w). The maximum-likelihood estimate uses the argmax
function to find the best values for the parameter w:
w_hat = argmax_w Σi log P(yi | xi; w)
The maximum entropy model is log-linear. MaxEnt handles multinomial distributions. The
maximum entropy principle states that we have to model the given set of data by choosing, among
all distributions that satisfy the constraints of our prior knowledge, the one with the highest entropy.
To find the probability of each class y given input x, Maximum Entropy is defined as:
P(y | x; w) = exp( Σj wj fj(x, y) ) / Σy' exp( Σj wj fj(x, y') )
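As a concrete illustration, a maximum entropy classifier can be trained as multinomial logistic regression with scikit-learn. The sketch below is a toy sentiment example on a handful of made-up sentences (assuming scikit-learn is installed); real applications would use far more data and richer features.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "I love this movie", "What a great film", "Fantastic acting",
    "I hate this movie", "What a terrible film", "Awful acting",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Bag-of-words features play the role of the feature functions f(x, y).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Logistic regression is the MaxEnt / log-linear classifier.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["great fantastic movie"])))  # expected: ['pos']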
Applications
MaxEnt classification is a more classical machine learning task and solves problems beyond
natural language processing. Here are a few:
• Sentiment analysis (e.g., given a product review, the reviewer likes and dislikes about
the product).
• Preferences (e.g., Given a person's demographics, who will a person vote for? Would
they prefer Superman, Batman, or the Teenage Mutant Ninja Turtles? etc.).
• Diagnosis (e.g., Given characteristics of several medical images and patient history,
what medical condition is a person at risk of having?).
Maximum Entropy Markov Model
There are many systems where there is a time or state dependency. These systems evolve
through a sequence of states, and past states influence the current state. For example, stock
prices, DNA sequencing, human speech, or words in a sentence.
A Maximum Entropy Markov Model makes use of state-time dependencies, i.e., it uses
predictions about the past and the current observation to make the current prediction.
In image analysis, we're required to classify the object into one of many classes. We estimate
the probability for each class. Rather than take a hard decision on one of the outcomes, it's
better to output probabilities, which will benefit downstream tasks. Multinomial logistic
regression is also called softmax regression or Maximum Entropy (MaxEnt) classifier.
Entropy's related to the disorder. Higher the disorder, less predictable the outcomes, and
hence more information. For example, an unbiased coin has more information (and entropy)
than one that mostly lands up heads.
MaxEnt is about picking a probability distribution that maximizes the entropy.
Then, there's Markov Chain. It models a system as a set of states with probabilities assigned
to state transitions. While MaxEnt computes probabilities for each input independently, the
Markov chain recognizes a dependency from one state to the next. Thus, MEMM
maximizes entropy plus using state dependencies (Markov Model).
The MEMM has explicit dependencies between each state and the full observation sequence.
The MEMM has only one transition probability matrix: it maps pairs of the previous state
y(i−1) and the current observation x(i) seen in the training data to the current state y(i).
Our goal is to find P(y1, y2, ..., yn | x1, x2, ..., xn). This is given by:
P(y1, ..., yn | x1, ..., xn) = Πi=1..n P(yi | yi−1, xi)
As in an HMM, the current state depends only on the previous state, so we can limit the
conditioning of y(i) to y(i−1) together with the current observation x(i). This is the Markov
independence assumption.
Shortcomings Of MEMM
MEMM suffers from what's called the label bias problem. Once we're in a state or
label, the following observation will select one of the transitions leaving that state.
However, the model as a whole has many more transitions. If a state has only one
outgoing transition, the observation has no influence on the choice. Simply put, transition
scores are normalized on a per-state basis.
UNIT 4 SEMANTICS AND PRAGMATICS
Requirement for representation, First-Order Logic, Description Logics, Syntax-Driven Semantic Analysis,
Semantic Attachments, Word Senses, Relations between Senses, Thematic Roles, Selectional Restrictions,
Word Sense Disambiguation, WSD using Supervised Methods, Word Similarity using Thesaurus and Distributional Methods.
What is Semantics?
Study of words, phrases and sentences in a language.
Explores how words and grammatical structures contribute to the meaning of sentences, and how
meaning is composed and interpreted.
focuses on the literal meaning of language, and aims to understand how meaning is derived from the
linguistic form.
What is Pragmatics?
Study of how language is used in context.
Investigates how meaning is affected by factors such as the speaker's intentions, the listener's knowledge,
and the communicative situation.
focuses on the non-literal or implied meaning of language, and aims to understand how meaning
is derived from the use of language in social interaction.
Requirements for Meaning Representation:
● Compositionality: The meaning of a sentence is composed of the meanings of each word and the way
they are combined. This means that the meaning of a sentence can be derived from the meanings of its
parts.
● Truth conditions: A representation must specify the truth conditions for a sentence, i.e., the
conditions under which the sentence would be true or false.
● Context sensitivity: The meaning of a sentence may depend on the context in which it is used.
Therefore, a representation must be able to account for the effects of context on meaning.
● Pragmatic relevance: A representation must be relevant to the communicative situation. This means
that it should take into account the speaker's intended meaning and the listener's interpretation.
● Consistency: A representation should be consistent with other linguistic and cognitive theories, and
should not lead to contradictions or inconsistencies.
First-order Logic:
● First-order logic (FOL) is a formal language that has been used in semantics and pragmatics to represent
the meaning of sentences in a structured and logical way.
● FOL allows us to represent the relationships between objects, properties, and events in a precise and
formal manner, which can be useful for analyzing and understanding the meaning of natural language
expressions.
● In semantics:
○ FOL allows us to represent the meanings of words and sentences regarding their truth
conditions.
■ Example: The sentence "John is a doctor" can be represented in FOL as
"Doctor(John)", which means that the object "John" has the property of being a
doctor.
○ FOL allows us to represent the logical structure of sentences, including their subject-
predicate structure and the relationships between different parts of the sentence.
■ Example: The sentence "All dogs bark" can be represented in FOL as "For all x,
if x is a dog, then x barks", where "For all x" is a quantifier that means "for
every x", "if x is a dog" is a predicate that describes the property of being a dog,
and "x barks" is a predicate that describes the action of barking.
○ FOL representations can help to avoid ambiguity and inconsistency in
interpretation.
■ Example: The sentence "Every student passed the exam" can be represented in
FOL as "For all x, if x is a student, then x passed the exam", which avoids the
ambiguity of the sentence "Every student passed the exam with flying colours",
since the latter may suggest that all students scored exceptionally well, which is
not necessarily implied by the former.
● In pragmatics:
○ FOL allows us to represent the speaker's intended meaning and the listener's
interpretation of a sentence.
■ Example: The sentence "I need help with my homework" can be represented in
FOL as a request for assistance, such as "Request(Assistance, Speaker,
Homework)", where "Request" is a predicate that describes the communicative
act of making a request, "Assistance" is a variable that represents the object
being requested, "Speaker" is a variable that represents the speaker, and
"Homework" is a variable that represents the object for which assistance is
needed.
○ FOL allows us to represent the context in which a sentence is used, including the
speaker's beliefs, intentions, and assumptions, and the listener's knowledge and
expectations.
■ Example: The sentence "Do you have the time?" can be represented in FOL as
a request for information, such as "Request(Time, Listener)", where "Request"
is a predicate that describes the communicative act of making a request,
"Time" is a variable that represents the object being requested, and "Listener"
is a variable that represents the person being addressed.
○ FOL representations can help to capture the communicative function of a
sentence and its relationship to other expressions in the discourse.
■ Example: The sentence "I'm sorry, I can't come to your party tonight" can be
represented in FOL as a polite refusal, such as "Refusal(Party, Speaker,
Listener)", where "Refusal" is a predicate that describes the
communicative act of refusing an invitation, "Party" is a variable that represents the event being refused,
"Speaker" is a variable that represents the person making the refusal, and "Listener" is a variable that represents
the person being addressed.
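First-order logic formulas like the ones above can also be written and manipulated programmatically. The sketch below is a minimal illustration using NLTK's logic package (assuming NLTK is installed) and parses two of the FOL representations mentioned above.

from nltk.sem import Expression

read_expr = Expression.fromstring

# "John is a doctor"  ->  Doctor(John)
e1 = read_expr("Doctor(John)")

# "All dogs bark"  ->  for all x, if x is a dog then x barks
e2 = read_expr("all x.(dog(x) -> bark(x))")

print(e1)
print(e2)
print(e2.free())   # no free variables: x is bound by the quantifier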
● Description Logics (DLs) are a family of formal knowledge representation languages used to
represent and reason complex concepts and relationships in a structured and logical manner.
● DLs are a subset of first-order logic (FOL) specifically designed for representing knowledge in a way
that is both expressive and computationally tractable.
● Provides formal semantics for natural language expressions, allowing us to represent their meaning
in a structured and logical way.
● Used to construct ontologies, which are structured representations of knowledge in a particular domain.
● Allows us to reason about the relationships between concepts and instances, and to infer new
knowledge based on existing knowledge.
● Often uses inference engines to perform reasoning tasks, such as consistency checking, classification,
and query answering. DL operates under an open-world assumption, which allows for more flexible and
incremental development of ontologies.
Syntax-Driven Semantic Analysis:
● A type of DL-based approach.
● The syntax of natural language expression is used to drive the process of semantic analysis.
● Involves mapping the syntax of a sentence onto a formal logical structure, such as a DL
ontology, to derive its meaning.
● This approach allows for a more efficient and accurate analysis of natural language expressions, as the
syntactic structure can provide important cues for determining the meaning of ambiguous or complex
expressions.
● The grammar of the language works as a guide.
● The assumption is that the grammatical structure of a sentence reflects its underlying meaning and that
by analyzing this structure, we can infer the meaning of the sentence.
● For example, the sentence "John is a doctor who specializes in cardiology" can be analyzed
syntactically to identify the subclauses "John is a doctor" and "who specializes in cardiology", which
can then be mapped onto corresponding concepts in a DL ontology to derive the overall meaning of the
sentence.
Semantic Attachments:
● Semantic attachments, also known as semantic roles or theta roles, are a linguistic concept that
describes the relationship between the semantic content of a sentence and its syntactic structure.
In other words, they represent the different roles that words or phrases play in a sentence based on their
meaning.
● For example, in the sentence "John ate the pizza with a fork," the word "John" is the agent who
performs the action of eating, "pizza" is the patient that undergoes the action of being eaten, and "fork" is the instrument
that John uses to eat the pizza. These different roles are represented as semantic attachments associated
with each word or phrase in the sentence.
Word Senses:
● A word sense is a specific meaning of a word that is determined by its context. Semantic
attachments are a way of representing the meaning of a word in context by linking it to the concepts
or entities that it refers to.
● For example, consider the word "bank." Depending on the context in which it appears, it could refer to
a financial institution or the side of a river. In semantic attachments, we might represent these two senses
of the word as follows:
○ For the financial institution sense:
■ Word: "bank"
■ Sense: "financial institution"
■ Attachment: links to the concept of a financial institution, such as a bank
account, loans, or mortgages.
○ For the side of a river sense:
■ Word: "bank"
■ Sense: "river bank"
○ Attachment: links to the concept of a river, such as water, shore, or sediment.
● By representing word senses in this way, we can better understand the meaning of words in context
and use this information for various NLP tasks, such as information retrieval, machine translation, and
sentiment analysis.
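The different senses of a word such as "bank" can be inspected through WordNet, as in the sketch below (assuming NLTK and its WordNet data are installed).

from nltk.corpus import wordnet as wn

# Each synset is one sense of the word "bank".
for synset in wn.synsets("bank")[:5]:
    print(synset.name(), "-", synset.definition())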
Relations between senses:
● The relationship between senses is typically represented by semantic relations or roles.
● These relations capture the semantic relationships between the different senses of a word, as well as
the relationships between different words in a sentence or discourse.
● Some common types of semantic relations include:
○ Hyponymy/Hypernymy:
■ Captures the relationship between a specific instance of a concept
(hyponym) and its more general category (hypernym).
■ For example, "dog" is a hyponym of "animal" and "animal" is a hypernym of
"dog".
○ Synonymy:
■ Captures the relationship between different words or senses that have the same or
similar meaning.
■ For example, "car" and "automobile" are synonyms.
○ Antonymy:
■ Captures the relationship between words or senses that are opposite in
meaning.
■ For example, "hot" and "cold" are antonyms.
○ Meronymy/Holonymy:
■ Captures the relationship between a part and a whole.
■ For example, "wheel" is a meronym of "car" and "car" is a holonym of
"wheel".
○ Troponymy:
■ Captures the relationship between a verb and a more specific way in which
the action is carried out.
■ For example, "walk" is a troponym of "move" and "stroll" is a troponym of
"walk".
● These relations can be used to build a network of interconnected senses and concepts, which can be
used for various NLP tasks such as word sense disambiguation, information retrieval, and machine
translation.
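Several of these relations are encoded directly in WordNet and can be queried as in the sketch below (assuming NLTK and its WordNet data are installed).

from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
car = wn.synset("car.n.01")

print(dog.hypernyms())          # more general categories of "dog" (hypernymy)
print(dog.hyponyms()[:3])       # more specific kinds of dog (hyponymy)
print(car.part_meronyms()[:3])  # parts of a car (meronymy)
print(car.lemma_names())        # synonymous lemma names, e.g. 'car', 'automobile'

hot = wn.lemma("hot.a.01.hot")
print(hot.antonyms())           # antonymy, e.g. the lemma for "cold"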
Thematic roles:
Selectional Restrictions:
Word Sense Disambiguation (WSD) Approaches:
Supervised Approach:
● This approach uses labelled examples to train a machine learning model to predict the correct
sense of a word in context.
● A popular algorithm for supervised WSD is the Naive Bayes classifier.
● Example: In the sentence "I went to the bank to deposit my paycheck," the word "bank" could
refer to a financial institution or a river bank. A supervised WSD model would use labelled
examples to learn how to predict the correct sense based on the context of the sentence.
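A rough sketch of such a supervised WSD classifier is shown below, using a bag-of-words representation of the context and a Naive Bayes model from scikit-learn (assuming it is installed). The tiny sense-labelled examples for "bank" are made up for illustration; a real system would be trained on a sense-annotated corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy sense-labelled contexts for the ambiguous word "bank".
contexts = [
    "went to the bank to deposit my paycheck",
    "the bank approved my loan application",
    "opened a savings account at the bank",
    "sat on the bank of the river fishing",
    "the river overflowed its bank after the storm",
    "walked along the grassy bank of the stream",
]
senses = ["finance", "finance", "finance", "river", "river", "river"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts)

clf = MultinomialNB()
clf.fit(X, senses)

test = ["the bank charged a fee on my account"]
print(clf.predict(vectorizer.transform(test)))  # expected: ['finance']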
Dictionary-Based Approach:
● This approach uses a dictionary or lexical database that provides information on the
different senses of a word.
● When presented with a word in context, the approach looks up the word in the dictionary and
chooses the sense that best fits the context.
● Example: In the sentence "I love to play the bass guitar," the word "bass" could refer to a fish or a
low-pitched musical instrument. A dictionary-based WSD approach would look up
"bass" and choose the sense that matches the context of the sentence.
Thesaurus-Based Approach:
● This approach uses a thesaurus or semantic network that groups words based on their semantic
similarity.
● When presented with a word in context, the approach identifies related words in the
thesaurus and chooses the sense that best fits the context.
● Example: In the sentence "The company's revenue has been steadily increasing," the word
"increase" could refer to an uptick in profits or a general upward trend. A thesaurus-based WSD
approach would identify related words like "growth" and "improvement" and choose the sense that
matches the context of the sentence.
Bootstrapping methods in NLP:
1. Self-training:
a. Definition: A method where a model is iteratively trained on a small labelled dataset and
then uses its predictions on a larger unlabeled dataset to expand the training set.
b. Example: In part-of-speech tagging, a model might be trained on a small labelled set of
sentences with their corresponding POS tags, and then use its predictions on a larger set
of unlabeled sentences to identify and add new POS-tagged examples to the training set.
2. Co-training:
a. Definition: A method where two or more models are trained independently on
different views of the same data, and then use their predictions on a larger unlabeled
dataset to improve each other's performance.
b. Example: In sentiment analysis, one model might be trained on a small labelled dataset
of tweets, while another is trained on a small labelled dataset of news articles. The
models can then use their predictions on a larger set of unlabeled social media data to
improve each other's accuracy.
3. Active learning:
a. Definition: A method where a model is trained on a small labelled dataset and
then used to select the most informative examples from an unlabeled dataset for
annotation. These examples are then added to the labelled dataset, and the model is
retrained on the expanded dataset.
b. Example: In named entity recognition, a model might be trained on a small labelled
dataset of sentences with named entities, and then use active learning to select the most
uncertain examples from a larger set of unlabeled sentences for manual annotation. These
labelled examples can then be used to improve the model's performance.
4. Semi-supervised learning:
a. Definition: A method where a model is trained on a small labelled dataset and a larger
unlabeled dataset, to leverage the structure of the unlabeled data to improve
performance on the labelled data.
b. Example: In machine translation, a model might be trained on a small labelled
dataset of sentence pairs in two languages, then use a larger set of unlabeled sentences
in both languages to improve the model's accuracy.
Word similarity using a thesaurus and distributional methods:
Word similarity is an important task in natural language processing (NLP) that involves measuring the degree
of relatedness between pairs of words. There are various approaches to measuring word similarity, including
the use of thesaurus and distributional methods.
Thesaurus-based method:
1. Thesaurus-based methods use a pre-existing vocabulary of synonyms, antonyms, and other lexical
relations to measure word similarity.
2. These methods rely on the idea that words with similar meanings tend to have similar or related
entries in a thesaurus.
3. These methods often involve mapping words to a set of synsets, which are groups of words that share a
common meaning.
4. Synsets are typically organized in a hierarchical or network structure that reflects the
relationships between words at different levels of abstraction or specificity.
5. To measure word similarity using a thesaurus-based method, one common approach is to compute
the distance or similarity between pairs of synsets.
6. This can be done using various metrics, such as the shortest path distance between synsets in a graph or
the amount of shared information between synsets based on their properties.
7. Example: One commonly used thesaurus for NLP is WordNet, which provides a hierarchical network
of synonym sets (synsets) for thousands of words. Word similarity can be computed using the shortest
path between two synsets in the WordNet graph (see the sketch below).
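The WordNet example in point 7 can be tried directly with NLTK, assuming the WordNet corpus has been
downloaded via nltk.download('wordnet'); the word pair below is only illustrative.

from nltk.corpus import wordnet as wn

dog = wn.synsets('dog', pos=wn.NOUN)[0]   # Synset('dog.n.01')
cat = wn.synsets('cat', pos=wn.NOUN)[0]   # Synset('cat.n.01')

# Shortest-path similarity in the hypernym hierarchy (1.0 means identical synsets)
print("path similarity :", dog.path_similarity(cat))

# Wu-Palmer similarity, based on the depth of the lowest common hypernym
print("wup similarity  :", dog.wup_similarity(cat))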
Distributional-based method:
1. Distributional methods, on the other hand, use statistical information about the co-occurrence
patterns of words in large text corpora to estimate their similarity.
2. The intuition behind distributional methods is that words that occur in similar
contexts are likely to have similar meanings.
3. To apply a distributional method for measuring word similarity, one first needs to
represent words as high-dimensional vectors that capture their distributional
patterns in a corpus.
4. There are various ways to do this, such as counting the frequency of words in a
fixed-size window of text, using neural network models that learn dense
embeddings of words, or applying matrix factorization techniques that
decompose the co-occurrence matrix of words into low-rank components.
5. Once words are represented as vectors, their similarity can be computed using
various distance or similarity metrics, such as cosine similarity, Euclidean
distance, or Mahalanobis distance.
6. These metrics capture the degree of overlap or distance between the vectors,
which reflects the degree of similarity or dissimilarity between the
corresponding words.
7. Example: One popular distributional method for word similarity is the cosine
similarity between word vectors. Word vectors are high-dimensional representations
of words that capture their semantic properties based on their distributional patterns
in a corpus. The cosine similarity between two word vectors measures the degree
of similarity between the corresponding words (a sketch follows below).
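As a small illustration of points 3 to 7 above, the sketch below builds window-based co-occurrence
vectors from a toy corpus and compares two words with cosine similarity. The corpus, window size, and
word pair are illustrative assumptions.

# Window-based co-occurrence counts followed by cosine similarity
import numpy as np
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
window = 2
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[word][tokens[j]] += 1

vocab = sorted(counts)

def vector(word):
    # Row of the co-occurrence matrix for this word
    return np.array([counts[word][context] for context in vocab], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words that appear in similar contexts get a high cosine similarity
print(cosine(vector("cat"), vector("dog")))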
Both thesaurus-based and distributional methods have their strengths and weaknesses, and their
choice depends on the specific application and resources available. Thesaurus-based methods can
be more precise in measuring word similarity, but they rely on a fixed vocabulary that may not
be exhaustive or flexible enough for some tasks. Distributional methods, on the other hand, can
capture more nuanced and context-dependent meanings, but they may require more data and
computational resources to train and apply.
UNIT V
DISCOURSE ANALYSIS AND LEXICAL RESOURCES
Discourse segmentation, Coherence – Reference Phenomena, Anaphora Resolution using
Hobbs and Centering Algorithm – Coreference Resolution – Resources: Porter Stemmer,
Lemmatizer, Penn Treebank, Brill's Tagger, WordNet, PropBank, FrameNet, Brown Corpus,
British National Corpus (BNC).
Result
It infers that the state asserted by S0 could cause the state asserted by S1. For example, the
following two statements show the result relation: Ram was caught in the fire. His skin burned.
Explanation
It infers that the state asserted by S1 could cause the state asserted by S0. For example, two
statements show the relationship − Ram fought with Shyam’s friend. He was drunk.
Parallel
It infers p(a1, a2, …) from the assertion of S0 and p(b1, b2, …) from the assertion of S1, where ai and bi
are similar for all i. For example, the following two statements are parallel: Ram wanted a car. Shyam
wanted money.
Elaboration
It infers the same proposition P from both assertions S0 and S1. For example, the following two
statements show the elaboration relation: Ram is from Chandigarh. He comes from the capital of
Punjab and Haryana.
Occasion
It happens when a change of state can be inferred from the assertion of S0, final state of
which can be inferred from S1 and vice-versa. For example, the two statements show the
relation occasion: Ram picked up the book. He gave it to Shyam.
Building Hierarchical Discourse Structure
The coherence of an entire discourse can also be considered in terms of a hierarchical structure built
from coherence relations between its segments. For example, a passage beginning as follows can be
represented as a hierarchical structure:
S1 − Ram went to the bank to deposit money.
Reference Resolution
Interpreting the sentences of a discourse requires knowing who or what entity is being talked
about; here, the interpretation of references is the key element. A reference may be defined as a
linguistic expression used to denote an entity or individual. For example, in the passage Ram, the
manager of ABC bank, saw his friend Shyam at a shop. He went to meet him, the linguistic
expressions Ram, his, and He are references.
On the same note, reference resolution may be defined as the task of determining what
entities are referred to by which linguistic expression.
Terminology Used in Reference Resolution
We use the following terminologies in reference resolution −
Referring expression − The natural language expression that is used to perform reference is
called a referring expression. For example, Ram, his, and He in the passage above are referring expressions.
Referent − It is the entity that is referred. For example, in the last given example Ram is a
referent.
Corefer − When two expressions are used to refer to the same entity, they are called corefers.
For example, Ram and he are corefers.
Antecedent − The term that licenses the use of another referring expression. For example, Ram is the
antecedent of the reference he.
Anaphora & Anaphoric − Anaphora is reference to an entity that has been previously introduced
into the discourse; the referring expression used in this way is called anaphoric.
Discourse model − The model that contains the representations of the entities that have been
referred to in the discourse and the relationship they are engaged in.
Types of Referring Expressions
Let us now see the different types of referring expressions. The five types of referring
expressions are described below −
Indefinite Noun Phrases
Such a reference introduces entities that are new to the hearer into the discourse context. For
example, in the sentence Ram had gone around one day to bring him some food, the phrase some
food is an indefinite reference.
Definite Noun Phrases
In contrast to the above, this kind of reference represents entities that are not new, i.e. that are
already identifiable to the hearer in the discourse context. For example, in the sentence I used to
read The Times of India, The Times of India is a definite reference.
Pronouns
It is a form of definite reference. For example, Ram laughed as loud as he
could. The word he represents pronoun referring expression.
Demonstratives
These point to entities and behave differently from simple definite noun phrases. For example,
this and that are demonstrative pronouns.
Names
This is the simplest type of referring expression. It can be the name of a person, organization,
or location. For example, in the examples above, Ram is a name referring expression.
Reference Resolution Tasks
The two reference resolution tasks are described below.
Coreference Resolution
It is the task of finding referring expressions in a text that refer to the same entity. In simple
words, it is the task of finding corefer expressions. A set of coreferring expressions is called a
coreference chain. For example, in the passage given earlier, the expressions Ram, the manager of
ABC bank, his, and He all refer to the same entity and form a coreference chain.
Constraint on Coreference Resolution
In English, the main problem for coreference resolution is the pronoun it, because it has many
uses. It can refer to an entity much as he and she do, but it can also be used without referring to
any specific entity at all, as in It's raining or It is really good.
Pronominal Anaphora Resolution
Unlike coreference resolution, pronominal anaphora resolution may be defined as the task
of finding the antecedent of a single pronoun. For example, given the pronoun his in the passage
above, the task is to find its antecedent, Ram.
Stemming is the process of reducing morphological variants of a word to a common root/base form.
Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming
algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word
“chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”. Stemming
is an important part of the pipelining process in Natural language processing. The input to
the stemmer is tokenized words. How do we get these tokenized words? Well, tokenization
involves breaking down the document into different words.
Stemming is a natural language processing technique that is used to reduce words to their
base form, also known as the root form. The process of stemming is used to normalize text
and make it easier to process. It is an important step in text pre-processing, and it is
commonly used in information retrieval and text mining applications.
There are several different algorithms for stemming, including the Porter stemmer, Snowball
stemmer, and the Lancaster stemmer. The Porter stemmer is the most widely used algorithm,
and it is based on a set of heuristics that are used to remove common suffixes from words.
The Snowball stemmer is a more advanced algorithm that is based on the Porter stemmer, but
it also supports several other languages in addition to English. The Lancaster stemmer is a
more aggressive stemmer and it is less accurate than the Porter stemmer and Snowball
stemmer.
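All three stemmers mentioned above are available in NLTK; the short comparison below shows how
their outputs can differ (the word list is illustrative).

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# The Lancaster stemmer is typically the most aggressive of the three
for word in ["chocolates", "retrieval", "retrieved", "happily", "running"]:
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))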
Stemming can be useful for several natural language processing tasks such as text
classification, information retrieval, and text summarization. However, stemming can also
have some negative effects such as reducing the readability of the text, and it may not always
produce the correct root form of a word.
It is important to note that stemming is different from lemmatization. Lemmatization is
the process of reducing a word to its base form while taking into account the context of the
word, and it produces a valid word, whereas stemming may produce a non-word as the
root form.
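A minimal sketch of this difference, using NLTK's WordNetLemmatizer (this assumes the WordNet
corpus has been downloaded; the example words are illustrative):

from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Stemming may produce a non-word ("studi"); lemmatization returns a valid word ("study")
print(stemmer.stem("studies"), lemmatizer.lemmatize("studies"))

# With the part of speech supplied, the lemmatizer can map "better" (adjective) to "good"
print(stemmer.stem("better"), lemmatizer.lemmatize("better", pos="a"))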
Under-stemming occurs when two words that should be reduced to the same root are instead
stemmed to different roots; it can be interpreted as a false negative.
Under-stemming is a problem that can occur when using stemming algorithms in
natural language processing. It refers to the situation where a stemmer does not produce the
correct root form of a word or does not reduce a word to its base form. This can happen
when the stemmer is not aggressive enough in removing suffixes or when it is not designed
for the specific task or language.
Automatically interpreting and analysing the meaning of words and pre-processing
textual input can be complex in Natural Language Processing (NLP).
We frequently employ lexicons to help with this. A lexicon (also called a word-hoard,
wordbook, or word-stock) is the vocabulary of a person, a language, or a branch of
knowledge. We frequently link the text in our data to a lexicon, which helps us
understand the relationships between those terms.
Wordnet
WordNet is a massive lexical database of English words. Nouns, verbs, adjectives, and
adverbs are arranged into 'synsets', which are collections of cognitive synonyms, each
expressing a distinct concept. Synsets are interlinked by conceptual-semantic and lexical
relations such as hyponymy and antonymy.
WordNet is similar to a thesaurus in that it groups words according to their meanings.
# Inspecting a synset's definition and part-of-speech (POS) tag with NLTK's WordNet interface
# ('hello' and 'beautiful' are illustrative noun and adjective examples)
from nltk.corpus import wordnet

# Word definition
syn = wordnet.synsets('hello')[0]
print("Meaning of Synset : ", syn.definition())

# POS tag of the first synset of each word
print("Syntax : ", wordnet.synsets('hello')[0].pos())
print("Syntax : ", wordnet.synsets('doing')[0].pos())
print("Syntax : ", wordnet.synsets('beautiful')[0].pos())
print("Syntax : ", wordnet.synsets('quickly')[0].pos())
Output:
Syntax : n
Syntax : v
Syntax : a
Syntax : r
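The hyponymy and antonymy links mentioned above can also be inspected directly from the synsets;
a short sketch (the words chosen are illustrative):

from nltk.corpus import wordnet as wn

car = wn.synsets('car')[0]                 # Synset('car.n.01')
print(car.hypernyms())                     # more general synsets, e.g. motor_vehicle.n.01
print(car.hyponyms()[:3])                  # a few more specific synsets

good = wn.synsets('good', pos=wn.ADJ)[0]
print(good.lemmas()[0].antonyms())         # antonymy is a lemma-level relation, e.g. 'bad'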
The FrameNet corpus is a lexical database of English that is both human- and machine-
readable, based on annotating examples of how words are used in actual texts. FrameNet is
based on a theory of meaning called Frame Semantics, deriving from the work of Charles J.
Fillmore and colleagues. The basic idea is straightforward: that the meanings of most words
can best be understood on the basis of a semantic frame: a description of a type of event,
relation, or entity and the participants in it. For example, the concept of cooking typically
involves a person doing the cooking (Cook), the food that is to be cooked (Food), something
to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the
FrameNet project, this is represented as a frame called Apply_heat, and the Cook, Food,
Heating_instrument and Container are called frame elements (FEs). Words that evoke this
frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat
frame. The job of FrameNet is to define the frames and to annotate sentences to show how
the FEs fit syntactically around the word that evokes the frame.
A Frame is a script-like conceptual structure that describes a particular type of situation,
object, or event along with the participants and props that are needed for that Frame. For
example, the “Apply_heat” frame describes a common situation involving a Cook, some
Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil,
brown, simmer, steam, etc.
We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called
“lexical units” (LUs).
FrameNet includes relations between Frames. Several types of relations are defined, of which
the most important are:
• Inheritance: An IS-A relation. The child frame is a subtype of the parent frame,
and each FE in the parent is bound to a corresponding FE in the child. An example is the
“Revenge” frame which inherits from the “Rewards_and_punishments” frame.
• Using: The child frame presupposes the parent frame as background, e.g. the “Speed”
frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be
bound to child FEs.
• Subframe: The child frame is a subevent of a complex event represented by the
parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”,
and “Sentencing”.
• Perspective_on: The child frame provides a particular perspective on an un-perspectivized
parent frame. A pair of examples consists of the “Hiring” and
“Get_a_job” frames, which perspectivize the “Employment_start” frame from the
Employer’s and the Employee’s point of view, respectively.
To get a list of all of the Frames in FrameNet, you can use the frames() function. If you
supply a regular expression pattern to the frames() function, you will get a list of all Frames
whose names match that pattern:
>>> from nltk.corpus import framenet as fn
>>> from operator import itemgetter
>>> x = fn.frames(r'(?i)crim')
>>> x.sort(key=itemgetter('ID'))
>>> x
To get the details of a particular Frame, you can use the frame() function passing in the frame
number:
>>> from pprint import pprint
>>> f = fn.frame(202)
>>> f.ID
202
>>> f.name
'Arrest'
>>> f.definition
"Authorities charge a Suspect, who is under suspicion of having committed a
crime..."
>>> pprint(sorted(f.FE.keys()))
[...,
'Co-participant',
'Manner',
'Means',
'Offense',
'Place',
'Purpose',
'Source_of_legal_authority',
'Suspect',
'Time',
'Type']
The frame() function shown above returns a dict object containing detailed information about
the Frame. See the documentation on the frame() function for the specifics.
You can also search for Frames by their Lexical Units (LUs).
The frames_by_lemma() function returns a list of all frames that contain LUs in which the
‘name’ attribute of the LU matches the given regular expression. Note that LU names are
composed of “lemma.POS”, where the “lemma” part can be made up of either a single
lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’) (see below).
>>> from nltk.corpus.reader.framenet import PrettyList
>>> PrettyList(sorted(fn.frames_by_lemma(r'(?i)a little'), key=itemgetter('ID')))
A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat”
Frame describes a common situation involving a Cook, some Food, and a Heating
Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer,
steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of
a polysemous word is a different LU.
We have used the word “word” in talking about LUs. The reality is actually rather complex.
When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which
has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different
frames:
• Apply_heat: “Michelle baked the potatoes for 45 minutes.”
• Cooking_creation: “Michelle baked her mother a cake for her birthday.”
• Absorb_heat: “The potatoes have to bake for more than 30 minutes.”
These constitute three different LUs, with different definitions.
Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also
be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also
defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively),
and their internal structure is not analyzed.
Framenet provides multiple annotated examples of each sense of a word (i.e. each LU).
Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial
possibilities of the lexical unit.
Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This
makes the FrameNet database similar to a thesaurus, grouping together semantically similar
words.
In the simplest case, frame-evoking words are verbs such as “fried” in:
“Matilde fried the catfish in a heavy iron skillet.”
Sometimes event nouns may evoke a Frame. For example, “reduction” evokes
“Cause_change_of_scalar_position” in:
“…the reduction of debt levels to $665 million from $2.6 billion.”
Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as
in:
“They were asleep for hours.”
Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents
rather than clearly evoking their own frames.
Details for a specific lexical unit can be obtained using this class’s lus() function, which takes
an optional regular expression pattern that will be matched against the name of the lexical
unit:
>>> from pprint import pprint
>>> PrettyList(sorted(fn.lus(r'(?i)a little'), key=itemgetter('ID')))
[<lu ID=14733 name=a little.n>, <lu ID=14743 name=a little.adv>, ...]
You can obtain detailed information on a particular LU by calling the lu() function and
passing in an LU's 'ID' number:
>>> from pprint import pprint
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a
lemma precedes the “.” and a part of speech (POS) follows the dot. The lemma may be
composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). The list of
POSs used in the LUs is:
v - verb, n - noun, a - adjective, adv - adverb, prep - preposition, num - numbers, intj -
interjection, art - article, c - conjunction, scon - subordinating conjunction
For more detailed information about the info that is contained in the dict that is returned by
the lu() function, see the documentation on the lu() function.
Annotated Documents
The FrameNet corpus contains a small set of annotated documents. A list of these documents
can be obtained by calling the docs() function:
>>> from pprint import pprint
>>> d = fn.docs('BellRinging')[0]
>>> d.corpname
'PropBank'
>>> d.sentence[49]
full-text sentence (...) in BellRinging:
[POS_tagset] PENN
[text] + [annotationSet]
[PT] 2 phrases
Brown Corpus
The Brown corpus consists of one million words of American English texts printed in 1961. The
texts for the corpus were sampled from 15 different text categories to make the corpus a good
standard reference. Today, this corpus is considered small, and slightly dated. The corpus is,
however, still used. Much of its usefulness lies in the fact that the Brown corpus lay-out has
been copied by other corpus compilers. The LOB corpus (British English) and the Kolhapur
Corpus (Indian English) are two examples of corpora made to match the Brown corpus. The
availability of corpora which are so similar in structure is a valuable resource for researchers
interested in comparing different language varieties, for example.
For a long time, the Brown and LOB corpora were almost the only easily available computer
readable corpora. Much research within the field of corpus linguistics has therefore been
made using these data. By studying the same data from different angles, in different kinds of
studies, researchers can compare their findings without
having to take into consideration possible variation caused by the use of different data.
At the University of Freiburg, Germany, researchers are compiling new versions of the LOB
and Brown corpora with texts from 1991. This will undoubtedly be a valuable resource for
studies of language change in a near diachronic perspective.
The Brown corpus consists of 500 texts, each consisting of just over 2,000 words. The texts
were sampled from 15 different text categories. The number of texts in each category varies
(see below).
More comprehensive information about the Brown corpus can be found in the
Brown Corpus Manual.
British National Corpus (BNC)
The BNC is distributed in a format which makes possible almost any kind of computer-based
research on the nature of the language. Obvious application areas include lexicography,
natural language understanding (NLU) systems, and all branches of applied and
theoretical linguistics.
Uses of the BNC
Large language corpora can help provide answers to questions about what words mean and how
they are used -- if only
because they encourage linguists, lexicographers, and all who work with language to ask
them. The purpose of a language corpus is to provide language workers with evidence of
how language is really used, evidence that can then be used to inform and substantiate
individual theories about what words might or should mean. Traditional grammars and
dictionaries tell us what a word ought to mean, but only experience can tell us what a word
is used to mean. This is why dictionary publishers, grammar writers, language teachers, and
developers of natural language processing software alike have been turning to corpus
evidence as a means of extending and organizing that experience.