NLP Notes
Introduction To NLP
Challenges/Open Problems of NLP
Characteristics of NLP
Application of NLP
Word Segmentation
Parsing – Parsing Tree, Top down parsing and Bottom up parsing
Chunking
NER
Sentiment Analysis
Web 2.0 application
Chapter 3
HMM
CRF
Naïve Bayes
Chapter 4
POS Tagging – Difficulty
Morphology Fundamentals - Types
Automatic Morphology Learning
Finite State Machine Based Morphology
Shallow Parsing
Chapter 5
Dependency Parsing
Malt Parser
Chapter 6
Chapter 1 and 2
Introduction To NLP:
1. Natural language processing (NLP) can be defined as the automatic (or semi-automatic) processing of
human language.
2. Natural Language processing (NLP) is a field of computer science and linguistics concerned with the
interactions between computers and human (natural) languages.
3. In theory, natural-language processing is a very attractive method of human-computer interaction.
4. Natural language processing is the task of analyzing and generating, by computer, the languages that humans speak, read and write.
5. NLP is concerned with questions involving three dimensions: language, algorithm and problem.
6. Figure 1 expresses this point. On the language axis are different natural languages and linguistics.
7. The problem axis mentions different NLP tasks like morphology, part-of-speech tagging, etc.
8. The algorithm axis depicts mechanisms like HMM, MEMM, CRF, etc. for solving problems.
9. The goal of natural language analysis is to produce knowledge representation structures like predicate calculus expressions, semantic graphs, or frames. This processing makes use of foundational tasks like morphology analysis, part-of-speech tagging, named entity recognition, both shallow and deep parsing, semantics extraction, and pragmatics and discourse processing.
Characteristics of NLP
Application of NLP
The applications can be divided into two major classes: Text-based applications and
Dialogue-based applications.
Text-based applications:
Text-based applications involve the processing of written text, such as books, newspapers,
reports, manuals, e-mail messages, and so on. These are all reading-based tasks. Text-based
natural language research is ongoing in applications such as
finding appropriate documents on certain topics from a database of texts (for example,
finding relevant books in a library)
extracting information from messages or articles on certain topics (for example, building a
database of all stock transactions described in the news on a given day)
translating documents from one language to another (for example, producing automobile
repair manuals in many different languages)
summarizing texts for certain purposes (for example, producing a 3-page summary of a
1000-page government report)
One very attractive domain for text-based research is story understanding. In this task the
system processes a story and then must answer questions about it. This is similar to the
type of reading comprehension tests used in schools and provides a very rich method for
evaluating the depth of understanding the system is able to achieve.
Dialogue-based applications:
Dialogue-based applications involve human-machine communication. Most naturally this involves spoken language, but it also includes interaction using keyboards.
Typical potential applications include
question-answering systems, where natural language is used to query a database (for
example, a query system to a personnel database)
automated customer service over the telephone (for example, to perform banking
transactions or order items from a catalogue)
tutoring systems, where the machine interacts with a student (for example, an
automated mathematics tutoring system)
spoken language control of a machine (for example, voice control of a VCR or
computer)
general cooperative problem-solving systems (for example, a system that helps a person
plan and schedule freight shipments)
The following list is not complete, but useful systems have been built for:
spelling and grammar checking
optical character recognition (OCR)
screen readers for blind and partially sighted users
augmentative and alternative communication (i.e., systems to aid people who have
difficulty communicating because of disability)
machine aided translation (i.e., systems which help a human translator, e.g., by storing
translations of phrases and providing online dictionaries integrated with word
processors, etc)
lexicographers' tools
information retrieval
document classification (filtering, routing)
document clustering
information extraction
question answering
summarization
text segmentation
exam marking
report generation (possibly multilingual)
machine translation
natural language interfaces to databases
email understanding
dialogue systems
Word Segmentation
Word segmentation is the problem of dividing a string of written language into its
component words.
In English and many other languages using some form of the Latin alphabet, the space
is a good approximation of a word divider (word delimiter).
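For languages written without spaces (e.g., Chinese), a classic baseline is greedy maximum matching: scan left to right, always taking the longest dictionary word that matches. A minimal sketch with a small hypothetical vocabulary (not from these notes):

def max_match(text, vocab, max_len=10):
    # Greedy left-to-right segmentation: prefer the longest dictionary
    # word starting at position i; fall back to a single character.
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

vocab = {"the", "table", "down", "there"}
print(max_match("thetabledownthere", vocab))
# -> ['the', 'table', 'down', 'there']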
Parsing – Parse Tree
Example 1: Bottom-Up Parsing – starting from the lexical categories of the words (e.g., NAME V ART N for a sentence like "John ate the cat"), the parser rewrites constituents step by step until it reaches the start symbol S:
NAME V ART N
NP V ART N
NP V NP
NP VP
S
Example 2: Construct the parse tree for the following sentence:
“All the morning flights from Denver to Tampa leaving before 10.”
Top-Down Parsing – Construct the Parse Tree – "Book that flight"
Top-down parsing is a strategy of analyzing unknown data relationships by
hypothesizing general parse tree structures and then considering whether the
known fundamental structures are compatible with the hypothesis. It occurs in
the analysis of both natural languages and computer languages.
A top-down parser searches for a parse tree by trying to build from the root
node S down to the leaves.
The top-down strategy never wastes time exploring trees that cannot result in
an S, since it begins by generating just those trees.
Difference between top-down parsing and bottom-up parsing
Top-down parsing never explores options that will not lead to a full parse, but it can explore many options that never connect to the actual sentence.
Bottom-up parsing never explores options that do not connect to the actual sentence, but it can explore options that can never lead to a full parse.
The relative amounts of wasted search depend on how much the grammar branches in each direction; the sketch below illustrates both strategies.
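The contrast can be demonstrated with NLTK's two teaching parsers: RecursiveDescentParser searches top-down from S, while ShiftReduceParser searches bottom-up from the words. A minimal sketch, assuming NLTK is installed, with a toy grammar for "Book that flight":

import nltk

grammar = nltk.CFG.fromstring("""
    S -> VP
    VP -> V NP
    NP -> Det Nominal
    Nominal -> N
    V -> 'book'
    Det -> 'that'
    N -> 'flight'
""")
tokens = "book that flight".split()

top_down = nltk.RecursiveDescentParser(grammar)   # expands S toward the words
bottom_up = nltk.ShiftReduceParser(grammar)       # shift-reduces words toward S

for tree in top_down.parse(tokens):
    print(tree)   # bottom_up.parse(tokens) yields the same tree here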
Chunking
Chunking (also called shallow parsing) divides a sentence into non-overlapping phrases, such as noun-phrase chunks, without building a full parse tree.
NER (Named-entity recognition)
It is also known as entity identification, entity chunking and entity extraction.
Named-entity recognition is the problem of segmenting and classifying proper names, such as names of people and organizations, in text.
An entity is an individual person, place, or thing in the world, while a mention is a
phrase of text that refers to an entity using a proper name.
The problem of named-entity recognition is in part one of segmentation because
mentions in English are often multi-word.
It is a subtask of information extraction that seeks to locate and classify elements in text
into pre-defined categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
Most research on NER systems has been structured as taking an unannotated block of
text, such as this one:
Example –
Jim bought 300 shares of Acme Corp. in 2006.
And producing an annotated block of text that highlights the names of entities:
[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
In this example, a person name consisting of one token, a two-token company name and
a temporal expression have been detected and classified.
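NLTK provides a pre-trained chunker that produces this kind of annotation. A minimal sketch (assuming NLTK plus its punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources are installed):

import nltk

sentence = "Jim bought 300 shares of Acme Corp. in 2006."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# ne_chunk groups POS-tagged tokens into named-entity subtrees
tree = nltk.ne_chunk(tagged)
print(tree)  # subtrees labelled PERSON, ORGANIZATION, etc. mark mentions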
Sentiment Analysis
Sentiment analysis (also known as opinion mining) refers to the use of natural language
processing, text analysis and computational linguistics to identify and extract subjective
information in source materials.
Sentiment analysis is widely applied to reviews and social media for a variety of
applications, ranging from marketing to customer service.
Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to
some topic or the overall contextual polarity of a document.
Types of Sentiment Analysis –
1. Subjectivity/objectivity identification –
This task is commonly defined as classifying a given text (usually a sentence)
into one of two classes: objective or subjective.
The subjectivity of words and phrases may depend on their context and an
objective document may contain subjective sentences (e.g., a news article
quoting people's opinions).
2. Feature/aspect-based sentiment analysis –
It refers to determining the opinions or sentiments expressed on different
features or aspects of entities, e.g., of a cell phone, a digital camera, or a bank.
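The simplest approach to polarity classification is lexicon-based: count positive and negative words and compare. A minimal sketch with a tiny hypothetical lexicon (real systems use large resources such as SentiWordNet or trained classifiers):

# Hypothetical mini-lexicon for illustration only
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "angry"}

def polarity(text):
    # Positive minus negative word count; the sign gives the polarity
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("great camera but terrible battery life"))  # -> 'neutral'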
Web 2.0 Application
Syndication – Users can "subscribe" to RSS feed-enabled websites so that they are automatically notified of any changes or updates in content via an aggregator.
Chapter 3
HMM (Hidden Markov Model)
Consider the classic weather example: each day the weather is either 'Rainy' or 'Sunny' (the hidden states), and Bob's activity that day (walk, shop, or clean) is the observation. Alice knows the general weather trends in the area, and what Bob likes to do on average. In other words, the parameters of the HMM are known.
They can be represented as follows in Python:
states = ('Rainy', 'Sunny')                # hidden states
observations = ('walk', 'shop', 'clean')   # observation alphabet

# Alice's belief about the weather on the first day
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}

# P(tomorrow's weather | today's weather)
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

# P(Bob's activity | the weather that day)
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
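Given these parameters, the Viterbi algorithm computes the most likely sequence of hidden weather states behind a sequence of observed activities. A minimal sketch of the decoder (not part of the original example, but it works directly with the dictionaries above):

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: probability of the best state sequence that ends in
    # state s after the first t+1 observations; path[s]: that sequence
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, best = max((V[-1][s], s) for s in states)
    return prob, path[best]

print(viterbi(observations, states, start_probability,
              transition_probability, emission_probability))
# -> (0.01344, ['Sunny', 'Rainy', 'Rainy'])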
Naïve Bayes
Naive Bayes has been studied extensively since the 1950s. It was introduced under a
different name into the text retrieval community in the early 1960s.
Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the
number of variables (features/predictors) in a learning problem.
Naive Bayes is a simple technique for constructing classifiers: models that assign class
labels to problem instances, represented as vectors of feature values, where the class labels
are drawn from some finite set.
It is not a single algorithm for training such classifiers, but a family of algorithms based on
a common principle: all naive Bayes classifiers assume that the value of a particular feature
is independent of the value of any other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in
diameter.
A naive Bayes classifier considers each of these features to contribute independently to the
probability that this fruit is an apple, regardless of any possible correlations between the
color, roundness and diameter features.
For some types of probability models, naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many practical applications, parameter
estimation for naive Bayes models uses the method of maximum likelihood; in other
words, one can work with the naive Bayes model without accepting Bayesian
probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes
classifiers have worked quite well in many complex real-world situations.
An advantage of naive Bayes is that it only requires a small amount of training data to
estimate the parameters necessary for classification.
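For text classification, the multinomial (bag-of-words) variant of naive Bayes is the standard choice. A minimal sketch using scikit-learn with a hypothetical toy training set (both the library and the data are assumptions, not part of these notes):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, hand-labelled training data
texts = ["great product, works well", "terrible, broke in a day",
         "excellent value", "very poor quality"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts feed the multinomial naive Bayes model;
# each word count is treated as independent given the class label.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["works great"]))  # expected: ['pos']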
Chapter 4
POS Tagging – Difficulty
The process of assigning one of the parts of speech to a given word is called part-of-speech tagging, commonly referred to as POS tagging. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their sub-categories.
Example:
Word: Paper, Tag: Noun
Word: Go, Tag: Verb
Word: Famous, Tag: Adjective
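In practice, a pre-trained tagger can be applied directly. A minimal sketch using NLTK (assuming NLTK and its punkt and averaged_perceptron_tagger models are installed):

import nltk

tokens = nltk.word_tokenize("The famous paper may go far")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('famous', 'JJ'), ('paper', 'NN'),
#       ('may', 'MD'), ('go', 'VB'), ('far', 'RB')]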
POS tagging exemplifies some general issues in NLP evaluation:
Training data and test data: The assumption in NLP is always that a system should work on novel data, therefore test data must be kept unseen. For machine learning approaches, such as stochastic POS tagging, the usual technique is to split a data set into 90% training and 10% test data. Care needs to be taken that the test data is representative. For an approach that relies on significant hand-coding, the test data should be literally unseen by the researchers. Development cycles involve looking at some initial data, developing the algorithm, testing on unseen data, revising the algorithm, and testing on a new batch of data. The seen data is kept for regression testing.
Baselines: Evaluation should be reported with respect to a baseline, which is normally what could be achieved with a very basic approach, given the same training data. For instance, the baseline for POS tagging with training data is to choose the most common tag for a particular word on the basis of the training data (and to simply choose the most frequent tag of all for unseen words), as sketched below.
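A minimal sketch of this most-frequent-tag baseline (the data format, a list of sentences of (word, tag) pairs, is an assumption):

from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    # Count how often each word carries each tag in the training data
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]   # fallback for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
    return lambda word: lexicon.get(word, default)

tagger = train_baseline([[('the', 'DT'), ('dog', 'NN')],
                         [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]])
print(tagger('dog'), tagger('xylophone'))  # -> NN DT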
Ceiling: It is often useful to try to compute some sort of ceiling for the performance of an application. This is usually taken to be human performance on that task, where the ceiling is the percentage agreement found between two annotators (inter-annotator agreement). For POS tagging, this has been reported as 96% (which makes existing POS taggers look impressive). However, this raises lots of questions: relatively untrained human annotators working independently often have quite low agreement, but trained annotators discussing results can achieve much higher performance (approaching 100% for POS tagging). Human performance varies considerably between individuals. In any case, human performance may not be a realistic ceiling on relatively unnatural tasks, such as POS tagging.
Error analysis: The error rate on a particular problem will be distributed very unevenly. For instance, a POS tagger will never confuse the tag PUN (punctuation) with the tag VVN (past participle), but might confuse VVN with AJ0 (adjective) because there is a systematic ambiguity for many forms (e.g., given). For a particular application, some errors may be more important than others. For instance, if one is looking for relatively low-frequency cases of denominal verbs (that is, verbs derived from nouns, e.g., canoe, tango, fork used as verbs), then POS tagging is not directly useful in general, because a verbal use without a characteristic affix is likely to be mistagged. This makes POS tagging less useful for lexicographers, who are often specifically interested in finding examples of unusual word uses. Similarly, in text categorization, some errors are more important than others: e.g., treating an incoming order for an expensive product as junk email is a much worse error than the converse.
Reproducibility: If at all possible, evaluation should be done on a generally available corpus so that other researchers can replicate the experiments.
Chapter 5
Dependency Parsing
The dependency approach has a number of advantages over full phrase-structure parsing:
It deals well with free word order languages, where the constituent structure is quite fluid.
Parsing is much faster than with CFG-based parsers.
Dependency structure often captures the syntactic relations needed by later applications; CFG-based approaches often extract this same information from trees anyway.
Example:
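A minimal sketch using spaCy's dependency parser (the notes do not name a tool at this point, so spaCy and its en_core_web_sm model, installed separately, are assumptions):

import spacy

nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline
doc = nlp("Alice booked a morning flight to Tampa")

# Each token points to its syntactic head via a labelled dependency
for token in doc:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")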
Malt Parser
Chapter 6
WordNet Theory
There are several electronic dictionaries, thesauri, lexical databases, and so forth today.
WordNet is one of the largest and most widely used of these.
It has been used for many natural language processing tasks, including word sense
disambiguation and question answering.
This is an attempt to explore and understand the structure of WordNet, how it is used and for what applications, and also to see where its strengths and weaknesses lie.
WordNet is the main resource for lexical semantics for English that is used in NLP, primarily because of its very large coverage and the fact that it is freely available.
WordNets are under development for many other languages, though so far none are as
extensive as the original.
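WordNet can be queried directly from NLTK. A minimal sketch (assuming NLTK with the wordnet corpus downloaded):

from nltk.corpus import wordnet as wn

# A word has one synset per sense; each synset carries a gloss
for synset in wn.synsets("bass")[:3]:
    print(synset.name(), "-", synset.definition())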
Semantic Roles
Once the computer has arrived at an analysis of the input sentence's syntactic
structure, a semantic analysis is needed to ascertain the meaning of the sentence.
The basic or primitive unit of meaning for semantics will be not the word but the sense, because words may have different senses, like those listed in the dictionary for the same word.
Semantics is concerned with what words mean and how these meanings combine in sentences to form sentence meanings.
Metaphors
Word Sense – Application
Word sense disambiguation is needed for many applications, but is problematic for large domains. It assumes that we have a standard set of word senses (e.g., WordNet). Useful knowledge sources include (illustrated in the sketch after this list):
frequency: e.g., for diet, the food sense (or senses) is much more frequent than the parliament sense (the Diet of Worms)
collocations: e.g., striped bass (the fish) vs. bass guitar: syntactically related or in a window of words (the latter is sometimes called 'co-occurrence'); generally 'one sense per collocation'
selectional restrictions/preferences: e.g., in Kim eats bass, bass must refer to the fish
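These knowledge sources motivate classic dictionary-based disambiguation such as the Lesk algorithm, which picks the sense whose gloss overlaps most with the context. A minimal sketch using NLTK's implementation (assuming NLTK and the wordnet corpus are installed):

from nltk.wsd import lesk

context = "I caught a huge bass while fishing in the lake".split()
sense = lesk(context, "bass")   # simplified Lesk over WordNet glosses
print(sense, "-", sense.definition() if sense else "no sense found")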
A combination of unsupervised knowledge-based and supervised machine learning techniques can provide a high-precision system that is able to tag running text with word senses. Such a system typically involves:
a system that acquires a huge number of examples per word from the web
the use of sophisticated linguistic information, such as syntactic relations, semantic classes, selectional restrictions, subcategorization information, domain, etc.
efficient margin-based machine learning algorithms
novel algorithms that combine tagged examples with huge amounts of untagged examples in order to increase the precision of the system