Unit 4 NLP
What is Natural Language Processing
(NLP)
• The process of computer analysis of input
provided in a human language (natural language),
and conversion of this input into a useful form of
representation.
• The field of NLP is primarily concerned with
getting computers to perform useful and
interesting tasks with human languages.
• The field of NLP is secondarily concerned with
helping us come to a better understanding of
human language.
Forms of Natural Language
• The input/output of a NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not
speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
• To process spoken language, we need everything
required to process written text, plus the
challenges of speech recognition and speech
synthesis.
Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL Understanding is much harder than NL Generation, but
both are still hard.
Why is NL Understanding Hard?
• Natural language is extremely rich in form and structure,
and very ambiguous.
– How to represent meaning,
– Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can
be at different levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect
the meaning of that sentence.
• Many inputs can mean the same thing.
• Interaction among components of the input is not clear.
Knowledge of Language
• Phonology – concerns how words are related to the sounds
that realize them.
• Morphology – concerns how words are constructed from
more basic meaning units called morphemes. A
morpheme is the primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form
correct sentences, and determines what structural role each
word plays in the sentence and what phrases are subparts of
other phrases.
• Semantics – concerns what words mean and how these
meanings combine in sentences to form sentence meanings.
The study of context-independent meaning.
Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in
different situations and how use affects the
interpretation of the sentence.
BİL711 Natural Language Processing
[Figure: the Turing test setting — a human judge converses with a computer and a human.]

[Figure: the NLP pipeline.
Understanding: Words → Morphological Analysis → morphologically analyzed words (another step: POS tagging) → Syntactic Analysis → syntactic structure → Semantic Analysis → context-independent meaning representation → Discourse Processing → final meaning representation.
Generation: Utterance Planning → meaning representations for sentences → Sentence Generation → morphologically analyzed words → Morphological Generation → Words.]
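The understanding side of the pipeline can be sketched in a few lines of Python. The stage functions here (`morphological_analysis`, `syntactic_analysis`, `semantic_analysis`) are illustrative placeholders for the real analyses, not a working toolkit:

```python
# Minimal sketch of the understanding pipeline above.
# All stage implementations are toy stand-ins for illustration only.

def morphological_analysis(words):
    # Toy rule: split a final "s" off as a suffix (stem, suffix) pair.
    return [(w[:-1], "+s") if w.endswith("s") else (w, "") for w in words]

def syntactic_analysis(analyzed):
    # Stand-in for a parser: wrap the analyzed words in a flat "S" node.
    return ("S", analyzed)

def semantic_analysis(tree):
    # Stand-in for building a context-independent meaning representation.
    return {"predicate": tree[1][1][0], "args": [tree[1][0][0]]}

def understand(words):
    return semantic_analysis(syntactic_analysis(morphological_analysis(words)))

print(understand(["she", "talks"]))  # {'predicate': 'talk', 'args': ['she']}
```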
Lexicons
• How is a word composed?
• Ambiguity
Parsing Requirements
• Requires a defined grammar
• Requires a big dictionary (10K words)
• Requires that sentences follow the defined grammar
• Requires the ability to deal with words not in the dictionary
Parsing (from Section 22.4)
• Goal: understand a single sentence by syntactic analysis.
• Methods:
– Bottom-up
– Top-down
• A more efficient (and more complicated) algorithm is given in
Section 23.2.
A Parsing Example
Rules:
S → NP VP
NP → Article N | Proper
VP → Verb NP
N → home | boy | store
Proper → Betty | John
Verb → go | give | see
Article → the | an | a
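The grammar above is small enough to parse directly. This top-down sketch is illustrative, not the book's algorithm; it returns a parse tree on success and None on failure:

```python
# A minimal top-down parser for the toy grammar above.

GRAMMAR = {
    "N": {"home", "boy", "store"},
    "Proper": {"Betty", "John"},
    "Verb": {"go", "give", "see"},
    "Article": {"the", "an", "a"},
}

def parse_np(words, i):
    # NP -> Article N | Proper
    if i < len(words) and words[i] in GRAMMAR["Article"]:
        if i + 1 < len(words) and words[i + 1] in GRAMMAR["N"]:
            return ("NP", words[i], words[i + 1]), i + 2
    if i < len(words) and words[i] in GRAMMAR["Proper"]:
        return ("NP", words[i]), i + 1
    return None

def parse_vp(words, i):
    # VP -> Verb NP
    if i < len(words) and words[i] in GRAMMAR["Verb"]:
        np = parse_np(words, i + 1)
        if np:
            tree, j = np
            return ("VP", words[i], tree), j
    return None

def parse_s(words):
    # S -> NP VP; succeed only if all words are consumed.
    np = parse_np(words, 0)
    if np:
        tree, i = np
        vp = parse_vp(words, i)
        if vp and vp[1] == len(words):
            return ("S", tree, vp[0])
    return None

print(parse_s("Betty see the boy".split()))
```

A real parser must also handle the ambiguity and unknown-word requirements listed earlier; this sketch deterministically tries one rule expansion at each step.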
N-gram language models:
– Uni-gram: P(s) = ∏i=1..n P(wi)
– Bi-gram: P(s) = ∏i=1..n P(wi | wi-1)
– Tri-gram: P(s) = ∏i=1..n P(wi | wi-2 wi-1)
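These product formulas can be applied with one generic function, assuming the conditional probability tables have already been estimated (the toy bigram numbers below are illustrative):

```python
# Sketch: scoring a sentence as a product of conditional probabilities.
# cond_prob maps (context_tuple, word) -> probability; '#' marks sentence start.

def sentence_prob(words, cond_prob, order):
    """P(s) = product over i of P(w_i | previous order-1 words)."""
    p = 1.0
    padded = ["#"] * (order - 1) + words
    for i in range(order - 1, len(padded)):
        context = tuple(padded[i - order + 1:i])
        p *= cond_prob.get((context, padded[i]), 0.0)  # unseen n-gram -> 0
    return p

# Toy bigram table: P(w | previous word)
bigram = {(("#",), "I"): 0.008, (("I",), "talk"): 0.2}
print(sentence_prob(["I", "talk"], bigram, order=2))  # 0.008 * 0.2
```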
A simple example
(corpus = 10 000 words, 10 000 bi-grams)

wi         P(wi)               wi-1         wi-1 wi            P(wi|wi-1)
I (10)     10/10 000 = 0.001   # (1000)     (# I) (8)          8/1000 = 0.008
                               that (10)    (that I) (2)       0.2
talk (8)   0.0008              I (10)       (I talk) (2)       0.2
                               we (10)      (we talk) (1)      0.1
talks (8)  0.0008              he (5)       (he talks) (2)     0.4
                               she (5)      (she talks) (2)    0.4
she (5)    0.0005              says (4)     (says she) (2)     0.5
                               laughs (2)   (laughs she) (1)   0.5
                               listens (2)  (listens she) (2)  1.0

Uni-gram: P(I talk)  = P(I) × P(talk)  = 0.001 × 0.0008
          P(I talks) = P(I) × P(talks) = 0.001 × 0.0008
Bi-gram:  P(I talk)  = P(I | #) × P(talk | I)  = 0.008 × 0.2
          P(I talks) = P(I | #) × P(talks | I) = 0.008 × 0 = 0
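The estimates in the table can be reproduced directly from the counts. This sketch uses only the counts given above (plain MLE, no smoothing):

```python
# Reproducing the example estimates from raw counts.
# '#' marks sentence start; (I talks) never occurs in the corpus.

unigram_counts = {"#": 1000, "I": 10, "talk": 8, "talks": 8}
bigram_counts = {("#", "I"): 8, ("I", "talk"): 2}
N = 10_000  # corpus size in words

def p_uni(w):
    # P(w) = c(w) / N
    return unigram_counts[w] / N

def p_bi(w, prev):
    # P(w | prev) = c(prev w) / c(prev); unseen bigrams get 0
    return bigram_counts.get((prev, w), 0) / unigram_counts[prev]

# Uni-gram: both sentences get the same probability.
print(p_uni("I") * p_uni("talk"))    # 0.001 * 0.0008
print(p_uni("I") * p_uni("talks"))   # same value

# Bi-gram: the unseen (I talks) drives the second probability to 0.
print(p_bi("I", "#") * p_bi("talk", "I"))   # 0.008 * 0.2
print(p_bi("I", "#") * p_bi("talks", "I"))  # 0.008 * 0
```

The zero probability for the grammatical sentence "I talks" being equal to... rather, the zero for unseen bigrams is exactly the problem the smoothing methods below address.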
Smoothing
[Figure: MLE vs. smoothed probability distribution over words.]
Smoothing methods
• Change the frequency of occurrences of n-grams:
– Laplace smoothing (add-one):
P_add-one(wi | C) = (|wi| + 1) / Σ_{wi∈V} (|wi| + 1)
where |wi| is the frequency of wi in corpus C and V is the vocabulary.
– Good-Turing:
change the frequency r to r* = (r + 1) × n_{r+1} / n_r
where n_r = number of n-grams of frequency r.
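Both methods can be sketched on toy unigram counts (the counts and vocabulary here are made up for illustration):

```python
# Sketch of the two smoothing methods above on toy unigram counts.
from collections import Counter

counts = Counter({"the": 5, "boy": 2, "store": 1})
vocab = ["the", "boy", "store", "home"]   # "home" is unseen
N = sum(counts.values())                  # 8 tokens total

def p_add_one(w):
    # Laplace (add-one): P(w) = (c(w) + 1) / (N + |V|)
    return (counts[w] + 1) / (N + len(vocab))

# n_r = number of word types occurring exactly r times
n = Counter(counts.values())              # n_1 = 1, n_2 = 1, n_5 = 1

def good_turing_r_star(r):
    # Good-Turing adjusted frequency: r* = (r + 1) * n_{r+1} / n_r
    return (r + 1) * n[r + 1] / n[r] if n[r] else 0.0

print(p_add_one("home"))        # unseen word still gets 1/12
print(good_turing_r_star(1))    # r = 1 -> 2 * n_2 / n_1 = 2.0
```

Note how add-one gives every unseen vocabulary word a small nonzero probability, which is exactly what the bigram example needed for the unseen (I talks).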
PageRank
• The PageRank algorithm is designed to weight links
from high-quality sites more heavily. What is a high-
quality site? One that is linked to by other high-quality
sites. The definition is recursive, but we will see that
the recursion bottoms out properly.
• The PageRank for a page p is defined as:
PR(p) = (1 − d) / N + d × Σi PR(in_i) / C(in_i)
where N is the total number of pages, the in_i are the pages that
link to p, C(in_i) is the number of out-links from page in_i, and d
is a damping factor (typically around 0.85).
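A minimal power-iteration sketch of this formula on a toy three-page web (d = 0.85 is the conventional damping factor; the graph is made up for illustration):

```python
# Minimal PageRank iteration for the formula above.

def pagerank(links, d=0.85, iters=50):
    """links[p] = list of pages that p links to."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}          # start uniform
    for _ in range(iters):
        new = {}
        for p in pages:
            in_links = [q for q in pages if p in links[q]]
            # PR(p) = (1 - d)/N + d * sum of PR(in_i)/C(in_i)
            new[p] = (1 - d) / N + d * sum(pr[q] / len(links[q]) for q in in_links)
        pr = new
    return pr

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # "C" — linked to by both A and B
```

The recursion "bottoms out" because the iteration converges: each pass redistributes a fixed total amount of rank, damped by d.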
HITS
HITS differs from PageRank in several ways:
• First, it is a query-dependent measure: it rates
pages with respect to a query. That means that it
must be computed anew for each query—a
computational burden that most search engines
have elected not to take on.
• Given a query, HITS first finds a set of pages that
are relevant to the query. It does that by
intersecting hit lists of query words, and then
adding pages in the link neighborhood of these
pages—pages that link to or are linked from one
of the pages in the original relevant set.
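Once that relevant set (plus its link neighborhood) is built, the standard HITS iteration alternates hub and authority updates over it. A sketch on a toy link graph, with the relevant-set construction assumed already done:

```python
# Sketch of the standard HITS hub/authority iteration.

def hits(links, iters=50):
    """links[p] = pages that p links to, restricted to the base set."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # A good authority is linked to by good hubs.
        for p in pages:
            auth[p] = sum(hub[q] for q in pages if p in links[q])
        # A good hub links to good authorities.
        for p in pages:
            hub[p] = sum(auth[q] for q in links[p])
        # Normalize so the scores stay bounded.
        na = sum(auth.values()) or 1.0
        nh = sum(hub.values()) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

base_set = {"A": ["C"], "B": ["C"], "C": []}
hub, auth = hits(base_set)
print(max(auth, key=auth.get))  # "C" — linked to by both A and B
```

Because the scores depend only on the query's base set, this whole computation repeats for every query, which is the computational burden noted above.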
Question Answering
• Information retrieval is the task of finding documents
that are relevant to a query, where the query may be a
question, or just a topic area or concept.
• Question answering is a somewhat different task, in
which the query really is a question, and the answer is
not a ranked list of documents but rather a short
response—a sentence, or even just a phrase.
• There have been question-answering NLP (natural
language processing) systems since the 1960s, but only
since 2001 have such systems used Web information
retrieval to radically increase their breadth of coverage.
Information Extraction
• Information extraction is the process of acquiring
knowledge by skimming a text and looking for
occurrences of a particular class of object and for
relationships among objects.
• A typical task is to extract instances of addresses
from Web pages, with database fields for street,
city, state, and zip code; or instances of storms
from weather reports, with fields for
temperature, wind speed, and precipitation.
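The address task above can be sketched with a regular expression; the pattern here is illustrative and far cruder than a real extractor, which must handle many more address formats:

```python
# Sketch: extracting (city, state, zip) fields with a regular expression.
import re

# One capitalized word, a 2-letter state code, and a 5-digit zip code.
ADDRESS = re.compile(r"([A-Z][a-z]+),\s*([A-Z]{2})\s+(\d{5})")

text = ("Ship to 12 Oak St, Springfield, IL 62704. "
        "Office: 9 Elm Ave, Berkeley, CA 94704.")

for city, state, zipcode in ADDRESS.findall(text):
    print({"city": city, "state": state, "zip": zipcode})
```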
• One approach is a cascaded finite-state transducer, which
processes text in five stages:
• 1. Tokenization
• 2. Complex-word handling
• 3. Basic-group handling
• 4. Complex-phrase handling
• 5. Structure merging