Unit 4 NLP

Natural Language Processing (NLP) involves computer analysis and representation of human language input. The field aims to perform useful tasks with human languages and improve understanding of language. NLP involves understanding language through morphological, syntactic, semantic and discourse analysis, and generating language. It is an interdisciplinary field that draws from linguistics, computer science, engineering, psychology and philosophy.

Uploaded by

Kumar Sumit

AI: NLP

What is Natural Language Processing
(NLP)
• The process of computer analysis of input
provided in a human language (natural language),
and conversion of this input into a useful form of
representation.
• The field of NLP is primarily concerned with
getting computers to perform useful and
interesting tasks with human languages.
• The field of NLP is secondarily concerned with
helping us come to a better understanding of
human language.
Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not
speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
• To process spoken language, we need everything
required to process written text, plus the
challenges of speech recognition and speech
synthesis.

Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different levels of analysis required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different levels of synthesis required:
deep planning (what to say),
syntactic generation
• NL understanding is much harder than NL generation, but
both of them are hard.

Why NL Understanding is hard?
• Natural language is extremely rich in form and structure,
and very ambiguous.
– How to represent meaning,
– Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can
be at different levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect
the meaning of that sentence.
• Many inputs can mean the same thing.
• Interaction among components of the input is not clear.

Knowledge of Language
• Phonology – concerns how words are related to the sounds
that realize them.
• Morphology – concerns how words are constructed from
more basic meaning units called morphemes. A
morpheme is the primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form
correct sentences, what structural role each word
plays in the sentence, and what phrases are subparts of
other phrases.
• Semantics – concerns what words mean and how these
meanings combine in sentences to form sentence meanings.
The study of context-independent meaning.

Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in
different situations and how use affects the
interpretation of the sentence.

• Discourse – concerns how the immediately preceding


sentences affect the interpretation of the next
sentence. For example, interpreting pronouns and
interpreting the temporal aspects of the information.

• World Knowledge – includes general knowledge about


the world. What each language user must know about
the other’s beliefs and goals.

Language and Intelligence
Turing Test

(diagram: a human judge holds tele-typed conversations with both a computer and a human)

• The human judge asks tele-typed questions to the computer and
the human.
• The computer's job is to act like a human.
• The human's job is to convince the judge that he is not a machine.
• The computer is judged "intelligent" if it can fool the judge.
• Judgment of intelligence is linked to appropriate answers to
questions from the system.

BİL711 Natural Language Processing 14


NLP - an inter-disciplinary Field
• NLP borrows techniques and insights from several disciplines.
• Linguistics: How do words form phrases and sentences? What
constrains the possible meanings of a sentence?
• Computational Linguistics: How is the structure of sentences
identified? How can knowledge and reasoning be modeled?
• Computer Science: Algorithms for automatons, parsers.
• Engineering: Stochastic techniques for ambiguity resolution.
• Psychology: What linguistic constructions are easy or difficult for
people to learn to use?
• Philosophy: What is the meaning, and how do words and sentences
acquire it?

Some NLP Applications
• Machine Translation – Translation between two natural
languages.
– See the Babel Fish translation system on AltaVista.
• Information Retrieval – Web search (uni-lingual or
multi-lingual).
• Query Answering/Dialogue – Natural language
interface with a database system, or a dialogue system.
• Report Generation – Generation of reports such as
weather reports.
• Some Small Applications –
– Grammar Checking, Spell Checking, Spell Corrector

Brief History of NLP
• 1940s –1950s: Foundations
– Development of formal language theory (Chomsky, Backus, Naur,
Kleene)
– Probabilities and information theory (Shannon)
• 1957 – 1970s:
– Use of formal grammars as basis for natural language processing
(Chomsky, Kaplan)
– Use of logic and logic based programming (Minsky, Winograd,
Colmerauer, Kay)
• 1970s – 1983:
– Probabilistic methods for early speech recognition (Jelinek, Mercer)
– Discourse modeling (Grosz, Sidner, Hobbs)
• 1983 – 1993:
– Finite state models (morphology) (Kaplan, Kay)
• 1993 – present:
– Strong integration of different techniques, different areas.

Natural Language Understanding

Words

Morphological Analysis
Morphologically analyzed words (another step: POS tagging)

Syntactic Analysis
Syntactic Structure

Semantic Analysis
Context-independent meaning representation

Discourse Processing
Final meaning representation

Natural Language Generation
Meaning representation

Utterance Planning
Meaning representations for sentences

Sentence Planning and Lexical Choice


Syntactic structures of sentences with lexical choices

Sentence Generation
Morphologically analyzed words

Morphological Generation
Words

Morphological Analysis
• Analyzing words into their linguistic components (morphemes).
• Morphemes are the smallest meaningful units of language.
cars       → car+PLU
giving     → give+PROG
geliyordum → gel+PROG+PAST+1SG  ("I was coming")
• Ambiguity: more than one alternative analysis
flies → fly(VERB)+PROG
        fly(NOUN)+PLU
adamı → adam+ACC       ("the man", accusative)
        adam+P1SG      ("my man")
        ada+P1SG+ACC   ("my island", accusative)

Morphological Analysis (cont.)
• Relatively simple for English. But for some languages
such as Turkish, it is more difficult.
uygarlaştıramadıklarımızdanmışsınızcasına
uygar-laş-tır-ama-dık-lar-ımız-dan-mış-sınız-casına
uygar +BEC +CAUS +NEGABLE +PPART +PL +P1PL +ABL +PAST +2PL +AsIf
“(behaving) as if you are among those whom we could not civilize/cause to become civilized”
+BEC is “become” in English
+CAUS is the causative voice marker on a verb
+PPART marks a past participle form
+P1PL is 1st person plural possessive marker
+2PL is 2nd person plural
+ABL is the ablative (from/among) case marker
+AsIf is a derivational marker that forms an adverb from a finite verb form
+NEGABLE is “not able” in English

• Inflectional and Derivational Morphology.


• Common tools: Finite-state transducers
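The suffix-stripping idea behind analyses like cars → car+PLU and giving → give+PROG can be sketched in a few lines; the suffix table and tag names below are illustrative stand-ins for a real finite-state transducer with a full lexicon and spelling rules:

```python
# Toy suffix-stripping morphological analyzer. The suffix inventory and
# tags are illustrative; real systems use finite-state transducers.
SUFFIXES = [
    ("ies", "y", "+PLU"),   # flies  -> fly+PLU
    ("ing", "e", "+PROG"),  # giving -> give+PROG (restore the dropped 'e')
    ("s",   "",  "+PLU"),   # cars   -> car+PLU
]

def analyze(word):
    """Return stem + morpheme tag for the first matching suffix."""
    for suffix, restore, tag in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + restore + tag
    return word  # no known suffix: treat the word as its own stem

print(analyze("cars"))    # car+PLU
print(analyze("giving"))  # give+PROG
```

Note how crude the rules are: the 'e'-restoring rule fires for every "-ing" word, which is exactly the kind of overgeneration that real spelling rules in an FST prevent.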
Part-of-Speech (POS) Tagging
• Each word has a part-of-speech tag to describe its category.
• Part-of-speech tag of a word is one of major word groups
(or its subgroups).
– open classes -- noun, verb, adjective, adverb
– closed classes -- prepositions, determiners, conjunctions,
pronouns, participles
• POS taggers try to find POS tags for the words.
• Is duck a verb or a noun? (A morphological analyzer alone
cannot decide.)
• A POS tagger may make that decision by looking at the
surrounding words.
– Duck! (verb)
– Duck is delicious for dinner. (noun)
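The "duck" decision can be sketched as a context rule; this is a deliberately tiny stand-in for a real statistical tagger, and the rule set is an assumption for illustration:

```python
# Toy contextual disambiguation for "duck": noun after a determiner,
# verb when it stands alone as an imperative ("Duck!").
DETERMINERS = {"the", "a", "an"}

def tag_duck(tokens, i):
    """Guess the POS of tokens[i] == 'duck' from its context."""
    if i > 0 and tokens[i - 1].lower() in DETERMINERS:
        return "NOUN"                    # "the duck"
    if i == 0 and len(tokens) == 1:
        return "VERB"                    # bare imperative "Duck!"
    return "NOUN" if i == 0 else "VERB"  # sentence-initial subject -> noun

print(tag_duck(["Duck"], 0))                      # VERB
print(tag_duck(["Duck", "is", "delicious"], 0))   # NOUN
```

A real tagger would learn such preferences from tagged corpora rather than hard-code them.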

Lexical Processing
• The purpose of lexical processing is to determine meanings of
individual words.
• The basic method is to look the word up in a database of meanings -- a lexicon.
• We should also identify non-words such as punctuation marks.
• Word-level ambiguity -- words may have several meanings, and the
correct one cannot be chosen based solely on the word itself.
– bank in English
– yüz in Turkish
• Solution -- resolve the ambiguity on the spot by POS tagging
(if possible) or pass the ambiguity on to the other levels.

Syntactic Processing
• Parsing -- converting a flat input sentence into a hierarchical
structure that corresponds to the units of meaning in the sentence.
• There are different parsing formalisms and algorithms.
• Most formalisms have two main components:
– grammar -- a declarative representation describing the syntactic
structure of sentences in the language.
– parser -- an algorithm that analyzes the input and outputs its
structural representation (its parse) consistent with the grammar
specification.
• CFGs are at the center of many parsing mechanisms, but they
are complemented by some additional features that make the
formalism more suitable for handling natural languages.

Semantic Analysis
• Assigning meanings to the structures created by
syntactic analysis.
• Mapping words and structures to particular domain
objects in a way consistent with our knowledge of the
world.
• Semantics can play an important role in selecting among
competing syntactic analyses and discarding illogical
analyses.
– I robbed the bank -- bank is a river bank or a
financial institution
• We have to decide which formalism will be used for the
meaning representation.

Knowledge Representation for NLP
• Which knowledge representation will be used depends
on the application -- Machine Translation, Database
Query System.
• Requires the choice of a representational framework, as
well as the specific meaning vocabulary (what the
concepts are and the relationships between these concepts --
an ontology)
• Must be computationally effective.
• Common representational formalisms:
– first order predicate logic
– conceptual dependency graphs
– semantic networks
– Frame-based representations

How to get there
NLP applications are all similar in that they
require some level of understanding:
understand the query, understand the
document, understand the data being
communicated…
Understanding Sentences: Overview
Parsing and Grammar
How is a sentence composed?

Lexicons
How is a word composed?

Ambiguity
Parsing Requirements
Requires a defined Grammar
Requires a big dictionary (10K words)
Requires that sentences follow the defined
grammar
Requires the ability to deal with words not in
the dictionary
Parsing (from Section 22.4)
Goal:
Understand a single sentence by syntax analysis
Methods
– Bottom-up
– Top-down
A more efficient (and more complicated) algorithm
is given in Section 23.2
A Parsing Example
S  NP VP
NP  Article N | Proper
Rules: VP  Verb NP
N  home | boy | store
Proper  Betty | John
Verb  go|give|see
Article  the | an | a

The Sentence: The boy went home.
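A top-down parser for this toy grammar fits in a few lines. Note the slide's example word "went" is not in the Verb rule (go | give | see), so the demo below uses "see"; everything else follows the rules as given:

```python
# Minimal top-down (recursive-descent) recognizer for the toy grammar.
GRAMMAR = {
    "N": {"home", "boy", "store"},
    "Proper": {"Betty", "John"},
    "Verb": {"go", "give", "see"},
    "Article": {"the", "an", "a"},
}

def parse_np(tokens, i):
    """NP -> Article N | Proper; return the next index, or None."""
    if i < len(tokens) and tokens[i] in GRAMMAR["Proper"]:
        return i + 1
    if (i + 1 < len(tokens) and tokens[i] in GRAMMAR["Article"]
            and tokens[i + 1] in GRAMMAR["N"]):
        return i + 2
    return None

def parse_s(tokens):
    """S -> NP VP, VP -> Verb NP; accept iff the whole input parses."""
    j = parse_np(tokens, 0)
    if j is None or j >= len(tokens) or tokens[j] not in GRAMMAR["Verb"]:
        return False
    return parse_np(tokens, j + 1) == len(tokens)

print(parse_s("the boy see the store".split()))  # True
print(parse_s("boy the see store".split()))      # False
```

A real parser would also build the tree (which nonterminal covers which words) instead of just accepting or rejecting.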


n-grams

• Limit the history hi to the n−1 preceding words.

Most used cases:
– Uni-gram:  P(s) = ∏ i=1..n  P(wi)
– Bi-gram:   P(s) = ∏ i=1..n  P(wi | wi-1)
– Tri-gram:  P(s) = ∏ i=1..n  P(wi | wi-2 wi-1)
A simple example
(corpus = 10 000 words, 10 000 bi-grams)
wi         P(wi)                wi-1         (wi-1 wi)          P(wi | wi-1)
I (10)     10/10 000 = 0.001    # (1000)     (# I) (8)          8/1000 = 0.008
                                that (10)    (that I) (2)       0.2
talk (8)   0.0008               I (10)       (I talk) (2)       0.2
                                we (10)      (we talk) (1)      0.1
talks (8)  0.0008               he (5)       (he talks) (2)     0.4
                                she (5)      (she talks) (2)    0.4
she (5)    0.0005               says (4)     (she says) (2)     0.5
                                laughs (2)   (she laughs) (1)   0.5
                                listens (2)  (she listens) (2)  1.0

Uni-gram:  P(I talk)  = P(I) · P(talk)          = 0.001 × 0.0008
           P(I talks) = P(I) · P(talks)         = 0.001 × 0.0008
Bi-gram:   P(I talk)  = P(I | #) · P(talk | I)  = 0.008 × 0.2
           P(I talks) = P(I | #) · P(talks | I) = 0.008 × 0
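The conditional probabilities in the example are just relative frequencies; a sketch of the maximum-likelihood estimation, using the counts from the example:

```python
# MLE bigram estimates: P(wi | wi-1) = count(wi-1 wi) / count(wi-1).
# Counts are taken from the example above.
unigram_counts = {"#": 1000, "I": 10}
bigram_counts = {("#", "I"): 8, ("I", "talk"): 2}

def p_mle(prev, w):
    return bigram_counts.get((prev, w), 0) / unigram_counts[prev]

print(p_mle("#", "I"))     # 8/1000 = 0.008
print(p_mle("I", "talk"))  # 2/10   = 0.2
print(p_mle("I", "talks")) # unseen bigram -> 0.0
```

The last line shows why P(I talks) comes out as zero under the bigram model even though the sentence is grammatical.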
Smoothing

• Goal: assign a low probability to words or
n-grams not observed in the training corpus.

(figure: for observed words, the smoothed estimate lies below the MLE estimate, reserving probability mass for unseen words)
Smoothing methods
• Change the frequency of occurrences:
– Laplace smoothing (add-one):

  P_add-one(wi | C) = (|wi| + 1) / Σ wj∈V (|wj| + 1)

– Good-Turing:
  change the frequency r to  r* = (r + 1) · n(r+1) / n(r)
  where n(r) = no. of n-grams of frequency r
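Add-one smoothing is easy to make concrete; the tiny vocabulary and counts below are illustrative:

```python
# Add-one (Laplace) smoothing: every word's count is incremented by one,
# so never-observed words get a small nonzero probability.
counts = {"the": 3, "cat": 1}   # observed counts (illustrative)
vocab = {"the", "cat", "dog"}   # "dog" was never observed

def p_laplace(w):
    total = sum(counts.get(v, 0) + 1 for v in vocab)
    return (counts.get(w, 0) + 1) / total

print(p_laplace("the"))  # (3+1)/7
print(p_laplace("dog"))  # (0+1)/7 -- unseen, but no longer zero
```

The denominator is the original token count plus |V|, so the smoothed probabilities still sum to one over the vocabulary.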
Smoothing (cont’d)

• Combine a model with a lower-order model:
– Backoff (Katz):

  P_Katz(wi | wi-1) = P_GT(wi | wi-1)         if |wi-1 wi| > 0
                      α(wi-1) · P_Katz(wi)    otherwise

– Interpolation (Jelinek-Mercer):

  P_JM(wi | wi-1) = λ(wi-1) · P_ML(wi | wi-1) + (1 − λ(wi-1)) · P_JM(wi)
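The interpolation formula can be sketched with a fixed mixing weight; in practice λ would be tuned on held-out data, and the probability tables here are illustrative:

```python
# Jelinek-Mercer interpolation: mix the bigram MLE with the unigram
# estimate so unseen bigrams still get some probability mass.
def p_interp(w, prev, p_bigram, p_unigram, lam=0.7):
    return (lam * p_bigram.get((prev, w), 0.0)
            + (1 - lam) * p_unigram.get(w, 0.0))

p_bigram = {("I", "talk"): 0.2}
p_unigram = {"talk": 0.0008, "talks": 0.0008}

print(p_interp("talk", "I", p_bigram, p_unigram))   # 0.7*0.2 + 0.3*0.0008
print(p_interp("talks", "I", p_bigram, p_unigram))  # 0.7*0   + 0.3*0.0008
```

"I talks" now gets a small nonzero probability via the unigram term, instead of the hard zero the pure bigram model assigns.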
Information Retrieval
Now the main focus of Natural Language
Processing

There are four types:


1. Query answering
2. Text categorization
3. Text summary
4. Data extraction
Information Retrieval: The task
Choose, from some set of documents, the ones
that are related to my query

Ex. Internet search


Information Retrieval
Methods
Boolean: “(Natural AND Language) OR
(Computational AND Linguistics)”
• too confusing for most users

Vector: Assign different weights to each term in
the query. Rank documents by distance from the
query and report the ones that are close.
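A minimal version of the vector method, using raw term frequencies and cosine similarity (TF-IDF weighting and proper tokenization are the obvious refinements; the documents are made up for illustration):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (dicts)."""
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": "natural language processing of natural language",
    "d2": "computational linguistics and parsing",
}
query = Counter("natural language".split())
ranked = sorted(docs, key=lambda d: cosine(query, Counter(docs[d].split())),
                reverse=True)
print(ranked)  # d1 shares terms with the query, so it ranks first
```

Unlike the Boolean method, the ranking degrades gracefully: documents matching only some query terms still appear, just lower in the list.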
Information Retrieval
Mostly implemented using simple statistical
models on the words only
More advanced NLP techniques have not
yielded significantly better results

Information in a text is mostly in its words


Text Categorization
Once upon a time… this was done by humans
Computers are now much better at it (and more consistent),
and automated systems perform most of the work
Best success for NLP so far (90+ % accuracy)
NLP works better for TC than for IR because the categories
are fixed.
Text Summarization
Main task: understand the main meaning and
describe it in a shorter way
Common Systems: Microsoft
How:
– Sentence/paragraph extraction (find the most
important sentences/paragraphs and string them
together for a summary)
– Statistical methods are more common
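Sentence extraction by statistics can be sketched as frequency scoring: rank sentences by how many high-frequency document words they contain and keep the top ones. The sentences below are made up for illustration:

```python
from collections import Counter

def summarize(sentences, k=1):
    """Return the k sentences whose words are most frequent overall."""
    freq = Counter(w.lower() for s in sentences for w in s.split())
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w.lower()] for w in s.split()),
                    reverse=True)
    return scored[:k]

sents = [
    "NLP systems analyze language",
    "Language analysis helps NLP systems understand language",
    "The weather was nice",
]
print(summarize(sents))  # the sentence sharing the most frequent words wins
```

This crude scorer favors long sentences and ignores stopwords; real extractive summarizers normalize for length and weight terms, but the principle is the same.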
The PageRank Algo
• PageRank was one of the two original ideas
that set Google's search apart from other Web
search engines when it was introduced in
1997.
• The other innovation was the use of anchor
text (the underlined text in a hyperlink) to
index a page, even though the anchor text was
on a different page than the one being
indexed.

• The PageRank algorithm is designed to weight links
from high-quality sites more heavily. What is a high-
quality site? One that is linked to by other high-quality
sites. The definition is recursive, but we will see that
the recursion bottoms out properly.
• The PageRank for a page p is defined as:

  PR(p) = (1 − d)/N + d · Σ i  PR(in_i) / C(in_i)

• where PR(p) is the PageRank of page p, N is the total
number of pages in the corpus, in_i are the pages that
link in to p, and C(in_i) is the count of the total number
of out-links on page in_i.
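The recursion "bottoms out" in practice via iteration: start every page at 1/N and re-apply the definition until the scores settle. A sketch on a tiny made-up link graph:

```python
# Power iteration for the PageRank recurrence.
# links[p] = list of pages that p links out to; d = damping factor.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            in_pages = [q for q in pages if p in links[q]]
            new[p] = (1 - d) / n + d * sum(pr[q] / len(links[q])
                                           for q in in_pages)
        pr = new
    return pr

links = {"A": ["B"], "B": ["C"], "C": ["A"]}  # a simple 3-page cycle
print(pagerank(links))  # by symmetry, each page converges to 1/3
```

Because every page in this graph has out-links, the scores form a probability distribution (they sum to 1), matching the random-surfer reading described below.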
• The constant d is a damping factor. It can be
understood through the random surfer model: imagine
a Web surfer who starts at some random page and
begins exploring.
The HITS Algo

• The Hyperlink-Induced Topic Search algorithm,


also known as “Hubs and Authorities” or HITS,
is another influential link-analysis algorithm.
• Both PageRank and HITS played important
roles in developing our understanding of Web
information retrieval.

HITS differs from PageRank in several ways:
• First, it is a query-dependent measure: it rates
pages with respect to a query. That means that it
must be computed anew for each query—a
computational burden that most search engines
have elected not to take on.
• Given a query, HITS first finds a set of pages that
are relevant to the query. It does that by
intersecting hit lists of query words, and then
adding pages in the link neighborhood of these
pages—pages that link to or are linked from one
of the pages in the original relevant set.
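Once that relevant set is built, HITS alternately updates two scores per page: a page's authority grows with the hub scores of pages linking to it, and its hub score grows with the authority of the pages it links to. A sketch on a made-up link graph:

```python
import math

def hits(links, iters=50):
    """links[p] = pages that p links out to; returns (hub, auth) scores."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority(p) = sum of hub scores of pages linking to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # hub(p) = sum of authority scores of pages p links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        for scores in (auth, hub):      # normalize so values stay bounded
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"A": ["C"], "B": ["C"], "C": []}  # A and B both point at C
hub, auth = hits(links)
print(max(auth, key=auth.get))  # C: the page everyone links to
```

Note the query dependence discussed above enters through the choice of `links`: this graph would be rebuilt from the relevant set for each new query.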

Question Answering
• Information retrieval is the task of finding documents
that are relevant to a query, where the query may be a
question, or just a topic area or concept.
• Question answering is a somewhat different task, in
which the query really is a question, and the answer is
not a ranked list of documents but rather a short
response—a sentence, or even just a phrase.
• There have been question-answering NLP (natural
language processing) systems since the 1960s, but only
since 2001 have such systems used Web information
retrieval to radically increase their breadth of coverage.

Information Extraction
• Information extraction is the process of acquiring
knowledge by skimming a text and looking for
occurrences of a particular class of object and for
relationships among objects.
• A typical task is to extract instances of addresses
from Web pages, with database fields for street,
city, state, and zip code; or instances of storms
from weather reports, with fields for
temperature, wind speed, and precipitation.

Information-extraction systems often process a text in a cascade of stages:
• 1. Tokenization
• 2. Complex-word handling
• 3. Basic-group handling
• 4. Complex-phrase handling
• 5. Structure merging
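The address-extraction task mentioned above can be sketched with a single pattern; this regular expression is an illustrative assumption, far less robust than a real cascaded extractor:

```python
import re

# Pull (city, state, zip) template fields out of free text.
ADDRESS = re.compile(r"([A-Z][a-z]+),\s*([A-Z]{2})\s+(\d{5})")

def extract_addresses(text):
    return [{"city": c, "state": s, "zip": z}
            for c, s, z in ADDRESS.findall(text)]

text = "Offices are in Austin, TX 73301 and Boston, MA 02201."
print(extract_addresses(text))
# [{'city': 'Austin', 'state': 'TX', 'zip': '73301'},
#  {'city': 'Boston', 'state': 'MA', 'zip': '02201'}]
```

Each stage of the cascade generalizes this idea: earlier stages group tokens into larger units, and later stages merge partially filled templates.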
