
Syntactic AI in Natural Language Processing

The document provides an overview of Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP), detailing their definitions and interrelations. It discusses the importance of NLP in processing large volumes of textual data and highlights various applications such as sentiment analysis, machine translation, and text summarization. Additionally, it covers regular expressions, finite automata, and their roles in computational linguistics, along with references for further reading.

Uploaded by

fpar570

Speech & Natural Language Processing

Curated by
Dr. Tohida Rehman
Assistant Professor
Department of Information Technology
Jadavpur University
Contents
1. Introduction to AI, ML, DL
2. Overview of NLP
3. Applications of NLP
4. Regular expressions in detail
5. References
Introduction[1/2]
What is AI?
• Artificial intelligence (AI) describes machines that can mimic or simulate human thinking capabilities and behavior.
What is ML?
• Machine learning (ML) is a subset of AI that enables machines to learn from data and predict outcomes more accurately without being explicitly programmed to do so.
What is DL?
• Deep learning (DL) is a subset of ML that uses multi-layered artificial neural networks, trained on large amounts of data, to model human-like behavior.
What is NLP?
• Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and written, referred to as natural language.
• NLP ⊆ AI; NLP ∩ ML ≠ ∅; NLP ∩ DL ≠ ∅; and DL ⊆ ML ⊆ AI. In words: NLP lies inside AI and overlaps with ML and DL without being contained in either.
Introduction[2/2]
What is Linguistics?
Linguistics is the scientific study of language. Its focus is the systematic investigation of the
properties of particular languages as well as the characteristics of language in general.
Important subfields of linguistics include:
• Phonetics - the study of how speech sounds are produced and perceived
• Phonology - the study of sound patterns and changes
• Morphology - the study of how words are formed from smaller meaningful units called morphemes
• Syntax - the study of the rules that govern sentence structure and word order
• Semantics - the study of the meaning of words, phrases, and sentences
• Pragmatics - the study of how language is used in context (interpretation of meaning)
• Historical Linguistics - the study of language change
• Sociolinguistics - the study of the relation between language and society
• Computational Linguistics - the study of how computers can process human language
• Psycholinguistics - the study of how language is learned, understood, and produced by the human mind
Why is NLP important?
• There is a large volume of textual data on the web and social media.
• NLP enables computers to communicate with humans in their native language.
• NLP helps make sense of a highly unstructured data source.
• Human language is incredible in its complexity and variety.
• We can express ourselves verbally and in writing in many ways.
• We need machine learning methods that capture syntactic and semantic context in order to model our languages.
Information Extraction
Email:
Subject: NLP Meeting
Date: January 15, 20
To: T Rehman

Hi Rehman, we’ve now scheduled the NLP meeting.
It will be in the SMCC building tomorrow from 9:00-10:00 a.m.
-Poly

Create new Calendar entry:
Event: NLP Meeting
Date: Nov-13-2022
Start: 09:00am
End: 10:00am
Where: SMCC building
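The extraction step above can be sketched with two hand-written patterns. This is only a minimal illustration: the pattern names and the `extract_event` helper are hypothetical, and practical information extraction systems are usually learned rather than hard-coded.

```python
import re

# Email body from the example above.
email_body = (
    "Hi Rehman, we've now scheduled the NLP meeting.\n"
    "It will be in the SMCC building tomorrow from 9:00-10:00 a.m."
)

# Hypothetical patterns written for this one email, not a general solution.
TIME_RANGE = re.compile(r"from (\d{1,2}:\d{2})-(\d{1,2}:\d{2})")
LOCATION = re.compile(r"in the (\w+ building)")

def extract_event(body):
    """Return start, end, and where as a dict, or None if a field is missing."""
    time_match = TIME_RANGE.search(body)
    place_match = LOCATION.search(body)
    if not (time_match and place_match):
        return None
    return {
        "start": time_match.group(1),
        "end": time_match.group(2),
        "where": place_match.group(1),
    }

entry = extract_event(email_body)
print(entry)
```

Resolving "tomorrow" to a calendar date needs the date the email was sent, which this sketch deliberately leaves out.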
Information Extraction & Sentiment Analysis
Attributes:
• zoom
• affordability
• size and weight
• flash
• ease of use
Size and weight
✓ nice and compact to carry!
✓ since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
✗ the camera feels flimsy, is plastic and very light in weight you have to be very delicate in the handling of this camera


Machine Translation
• Automatically translate text from one language to another without human involvement.

Source Text (English)

Which free courses will help you to learn English?

Translated Text (Bengali)

কোন ফ্রি কোর্স আপনাকে ইংরেজি শিখতে সাহায্য করবে?


Text Summarization
• People nowadays use search engines like Google, Yahoo, and Bing to find information on the Internet.
• Due to the explosion in data, it helps users to be given relevant summaries of the search results.
• Text summarization has become a vital approach to help consumers swiftly grasp vast amounts of information.
• Given a long text, humans have a natural tendency to remember its most important points in summary form. The volume of data around us is growing to the point that we need a solution that delivers accurate and timely summaries.
• This requires a tool or approach for extracting an accurate summary from a large amount of data.
Extractive vs Abstractive Summarization
• Extractive and abstractive summarization are two types of text summarization methods.
• Extractive summarization: select essential sentences or paragraphs from the source text and condense them into a shorter text.
• Abstractive summarization: express the text's primary idea in natural language without the verbatim use of terms from the text.
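As a rough illustration of the extractive approach, the sketch below scores each sentence by the average frequency of its words in the whole text and keeps the top-scoring sentences in their original order. The scoring scheme and the `extractive_summary` name are assumptions chosen for brevity (no stop-word removal, no learned model), not a method prescribed by these slides.

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word frequencies over the whole text (lowercased).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:k])
    return " ".join(sentences[i] for i in keep)

sample = "NLP is fun. NLP helps computers process language. Cats sleep."
print(extractive_summary(sample, k=1))
```

An abstractive summarizer would instead generate new sentences, which typically requires a trained sequence-to-sequence model.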
Language Technology

Mostly solved
• Spam detection: "Let’s go to Shimla!" ✓ / "Buy V1AGRA …" ✗
• Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV.
• Named entity recognition (NER): Einstein (PERSON) met with UN (ORG) officials in Princeton (LOC)

Making good progress
• Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
• Coreference resolution: "Carter told Mubarak he shouldn’t run again."
• Word sense disambiguation (WSD): "I need new batteries for my mouse."
• Parsing: "I can see Alcatraz from the window!"
• Machine translation (MT): "We drink coffee every morning." → "আমরা প্রতিদিন সকালে কফি পান করি।"
• Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" → Party, May 27 (add)

Still really hard? (Good progress in 2025...)
• Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
• Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
• Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"
• Dialog: "Where is Citizen Kane playing in SF?" / "Castro Theatre at 7:30. Do you want a ticket?"
Ambiguity makes NLP hard: "Crash blossoms"
• Ambiguity is an intrinsic characteristic of human conversations.
• Ambiguity makes natural language understanding (NLU) scenarios very challenging.
• This is because a single word, phrase, or sentence can have multiple meanings in different contexts.
• Crash blossom (plural crash blossoms): a sentence, often a news headline, that is subject to incorrect interpretation due to syntactic and/or lexical ambiguity.
Examples: "Teacher strikes idle kids" and "Hospitals Are Sued by 7 Foot Doctors"
• "Teacher strikes idle kids" has two meanings: "Teacher hits lazy kids" and "Teacher walkouts leave kids idle."
Identification and explanation: "strikes" can occur as either a verb meaning to hit or a noun meaning a refusal to work. Meanwhile, "idle" can occur as either a verb or an adjective.
• "Hospitals Are Sued by 7 Foot Doctors" has two meanings: "Seven foot doctors (podiatrists) are suing the hospitals" or "Doctors who are 7 feet in height are suing the hospitals."
• More information can be found in the New York Times article.
Assignment
• Find at least 10 crash blossoms and identify the type of ambiguity in each.
Skills prerequisite
• Simple linear algebra (vectors, matrices)
• Basic probability theory
• Java or Python programming
• Packages and tools such as NLTK, spaCy, and many more
Regular Expression
• One of the unsung successes in standardization in computer science has been the regular expression (RE), a language for specifying text search strings.
• Regular expression search requires a pattern that we want to search for and a corpus of texts to search through.
• Regular expressions are case sensitive.
• The American mathematician Stephen Cole Kleene formalized the regular expression language.
• Regular expressions play a surprisingly large role in text processing.
• Sophisticated sequences of regular expressions are often the first model for any text processing task.
• For many hard tasks, we use machine learning classifiers.
• But regular expressions are used as features in the classifiers.
• They can be very useful in capturing generalizations.
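One way to read "regular expressions as features" is to turn each pattern into a 0/1 indicator that a classifier can consume. The sketch below assumes a spam-detection setting; the three patterns and their names are hypothetical examples, not features taken from these slides.

```python
import re

# Hypothetical feature patterns for a spam classifier.
FEATURE_PATTERNS = {
    "has_money": re.compile(r"\$\d+"),            # e.g. $100
    "has_url": re.compile(r"https?://\S+"),       # e.g. http://example.com
    "has_shouting": re.compile(r"\b[A-Z]{3,}\b"),  # e.g. WIN, FREE
}

def regex_features(text):
    """Map a text to a dict of binary features, one per pattern."""
    return {name: int(bool(pattern.search(text)))
            for name, pattern in FEATURE_PATTERNS.items()}

print(regex_features("WIN $100 now at http://example.com"))
print(regex_features("Let's go to Shimla!"))
```

The resulting dictionaries could be passed to any classifier that accepts feature vectors; the learning step itself is outside this sketch.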
DESCRIPTION OF A FINITE AUTOMATON[1/2]
Analytically, a deterministic finite automaton (DFA) can be represented by a 5-tuple (Q, Σ, δ, q0, F), where
(i) Q is a finite nonempty set of states.
(ii) Σ is a finite nonempty set of inputs called the input alphabet.
(iii) δ (delta) is a function that maps Q × Σ into Q and is usually called the direct
transition function. It describes the change of states during a transition. This mapping is
usually represented by a transition table or a transition diagram.
(iv) q0 ∈ Q is the initial state.
(v) F ⊆ Q is the set of final states. It is assumed here that there may be more than one final
state.
N.B.: The behavior of a deterministic finite automaton (DFA) is fully determined by the state it is in, together with the input symbol it reads.
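The 5-tuple above maps directly onto a small simulator. The automaton used here is a hypothetical one (it accepts binary strings containing an even number of 1s) chosen only for illustration; it is not the automaton from the textbook example, whose diagram is not reproduced in this text.

```python
def run_dfa(delta, start, finals, string):
    """Return (sequence of visited states, accepted?) for the input string."""
    state = start
    trace = [state]
    for symbol in string:
        state = delta[(state, symbol)]  # exactly one next state per (state, symbol)
        trace.append(state)
    return trace, state in finals

# Hypothetical DFA: Q = {even, odd}, alphabet = {0, 1}, q0 = even, F = {even}.
delta = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

trace, accepted = run_dfa(delta, "even", {"even"}, "110001")
print(trace, accepted)
```

The transition table is a plain dictionary keyed by (state, symbol), which mirrors δ: Q × Σ → Q.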
DESCRIPTION OF A FINITE AUTOMATON[2/2]
Analytically, a nondeterministic finite automaton (NDFA) can be represented by a 5-tuple
(Q, Σ, δ, q0, F), where
(i) Q is a finite nonempty set of states.
(ii) Σ is a finite nonempty set of inputs called the input alphabet.
(iii) δ (delta) is a function that maps Q × Σ into 2^Q (the set of all subsets of Q) and is
usually called the direct transition function. It describes the change of states during a
transition. This mapping is usually represented by a transition table or a transition diagram.
(iv) q0 ∈ Q is the initial state.
(v) F ⊆ Q is the set of final states. It is assumed here that there may be more than one final
state.
N.B.: The only difference between deterministic and nondeterministic automata is in δ. For a deterministic
automaton (DFA), the outcome is a state, i.e. an element of Q; for a nondeterministic automaton, the outcome is a
subset of Q.
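Because δ now returns a subset of Q, a simulator can track the set of all states the automaton might be in. The sketch below uses a hypothetical NDFA that accepts binary strings ending in 01; the state names and transition table are assumptions made for illustration.

```python
def run_nfa(delta, start, finals, string):
    """Return True if some sequence of choices ends in a final state."""
    current = {start}
    for symbol in string:
        # Union of delta(q, symbol) over every state we might be in.
        current = set().union(*(delta.get((q, symbol), set()) for q in current))
    return bool(current & finals)

# Hypothetical NDFA accepting strings that end in "01".
nfa_delta = {
    ("q0", "0"): {"q0", "q1"},  # nondeterministic choice on reading 0
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}

print(run_nfa(nfa_delta, "q0", {"q2"}, "110001"))
print(run_nfa(nfa_delta, "q0", {"q2"}, "0100"))
```

Missing entries in the table are treated as the empty set, i.e. that branch of the computation dies.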
Give the entire sequence of states for the input string 110001.

Example taken from K. L. P. Mishra

Solution
H.W.
Check whether the input string 0100 is accepted or not.

Example taken from K. L. P. Mishra

Finite Automata, Regular Grammars, Regular Expressions
• The theoretical basis of computational work: finite state automata.
• For description: regular expressions.
• A regular expression can be implemented as a finite state automaton (FSA), and an FSA can be described with a regular expression.
• The regular expression is more than just a convenient metalanguage for text searching.
• A regular expression is one way of characterizing a particular kind of formal language called a regular language.
• Both regular expressions and finite-state automata can be used to describe regular languages.
Regular Expression
We give a formal recursive definition of regular expressions over Σ as follows:
1. Any terminal symbol (i.e. an element of Σ), the empty string (denoted by ε or
sometimes Λ), and the empty set (denoted by ∅) are regular expressions. When we view a in Σ
as a regular expression, we denote it by a (in boldface).
2. The union of two regular expressions R1 and R2, written as R1 + R2, is also a regular
expression.
3. The concatenation of two regular expressions R1 and R2, written as R1.R2, is also a
regular expression.
4. The iteration (or closure) of a regular expression R, written as R*, is also a regular
expression.
5. If R is a regular expression, then (R) is also a regular expression.
6. The regular expressions over Σ are precisely those obtained recursively by the application
of the rules 1-5 once or several times.
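To connect the formal notation with practical syntax, the sketch below writes the formal expression (a + b)*.a.b.b in Python's `re` syntax, where union + becomes |, concatenation is written by juxtaposition, and closure stays *. The expression itself is an arbitrary example chosen for illustration.

```python
import re

# Formal: (a + b)* . a . b . b   ->   Python: (a|b)*abb
PATTERN = re.compile(r"(a|b)*abb")

def in_language(s):
    """True if the whole string s belongs to the language of (a + b)*abb."""
    return PATTERN.fullmatch(s) is not None

for s in ["abb", "babb", "ab", ""]:
    print(repr(s), in_language(s))
```

`fullmatch` is used because membership in a regular language concerns the entire string, whereas `search` would accept any string that merely contains a match.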
Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks
Regular Expressions: Disjunctions
• Letters inside square brackets []
Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit
• Ranges [A-Z]
Pattern   Matches                 Example text
[A-Z]     An upper case letter    Drenched Blossoms
[a-z]     A lower case letter     my beans were impatient
[0-9]     A single digit          Chapter 1: Down the Rabbit Hole
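The bracket patterns in the tables above can be tried directly in Python. The `first_match` helper is a hypothetical convenience wrapper around `re.search`; the sample strings are the ones from the tables.

```python
import re

def first_match(pattern, text):
    """Return the first substring matching pattern, or None if there is none."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?"))
print(re.findall(r"[1234567890]", "Room 42"))
print(first_match(r"[A-Z]", "Drenched Blossoms"))
print(first_match(r"[a-z]", "my beans were impatient"))
print(first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))
```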
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when first in []
Pattern   Matches                   Example text
[^A-Z]    Not an upper case letter  Oyfn pripetchik
[^Ss]     Neither ‘S’ nor ‘s’       “I have no exquisite reason”
a^b       The pattern a caret b     Look up a^b now
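The negation patterns can be checked in Python as sketched below. One caveat that applies to Python's `re` module specifically: outside square brackets, an unescaped ^ is always treated as an anchor, so the literal "a caret b" has to be written with a backslash. The `first_match` helper is a hypothetical wrapper.

```python
import re

def first_match(pattern, text):
    """Return the first substring matching pattern, or None if there is none."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r"[^A-Z]", "Oyfn pripetchik"))           # first non-uppercase char
print(first_match(r"[^Ss]", "I have no exquisite reason"))  # first char that is not S/s
print(first_match(r"a\^b", "Look up a^b now"))              # escaped caret: literal match
print(first_match(r"a^b", "Look up a^b now"))               # unescaped: no match in Python
```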
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck
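A short Python check of the pipe operator, using a made-up sample sentence, might look like this:

```python
import re

# Made-up sample sentence containing all four spellings.
text = "A groundhog is a Woodchuck, and a woodchuck is a Groundhog."

lowercase_only = re.findall(r"groundhog|woodchuck", text)
any_case_initial = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)
print(lowercase_only)
print(any_case_initial)

# a|b|c and [abc] match the same single characters.
print(re.findall(r"a|b|c", "abcd") == re.findall(r"[abc]", "abcd"))
```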
Regular Expressions: ? * + . {}
Pattern      Description                                       Matches
colou?r      Optional previous char                            color colour
oo*h!        0 or more of previous char                        oh! ooh! oooh! ooooh!
o+h!         1 or more of previous char                        oh! ooh! oooh! ooooh!
baa+         1 or more of previous char                        baa baaa baaaa baaaaa
beg.n        . matches any single char                         begin begun began beg3n
ma.{2}als    {}: exactly the specified number of occurrences
# Search for a sequence that starts with "ma", followed by exactly 2 (any) characters, and then "als"
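The quantifier table can be verified with whole-string matching. The `matches` helper is a hypothetical wrapper around `re.fullmatch`, and "mammals" is an illustrative word chosen to fit ma.{2}als; it does not come from the slides.

```python
import re

def matches(pattern, s):
    """True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))
print(matches(r"oo*h!", "oh!"), matches(r"oo*h!", "ooooh!"))
print(matches(r"o+h!", "oh!"), matches(r"o+h!", "h!"))
print(matches(r"baa+", "baa"), matches(r"baa+", "ba"))
print(matches(r"beg.n", "beg3n"), matches(r"beg.n", "begn"))
print(matches(r"ma.{2}als", "mammals"), matches(r"ma.{2}als", "mals"))
```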
Regular Expressions: Anchors ^ $
• The most common anchors are the caret ^ and the dollar sign $.
• The caret ^ has three uses: to match the start of a line, to indicate a negation inside of square brackets, and just to mean a caret.
• The dollar sign $ matches the end of a line. So the pattern ␣$ (a space followed by $) is useful for matching a space at the end of a line, and /^The dog\.$/ matches a line that contains only the phrase The dog.
• We have to use the backslash here since we want the . to mean “period” and not the wildcard.
Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1   “Hello”
\.$           The end.
.$            The end?   The end!
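The anchor patterns can be tried in Python as sketched below. By default Python's `re` anchors ^ and $ to the whole string; the `re.MULTILINE` flag makes them apply to each line instead. The `first_match` helper is a hypothetical wrapper.

```python
import re

def first_match(pattern, text, flags=0):
    """Return the first substring matching pattern, or None if there is none."""
    m = re.search(pattern, text, flags)
    return m.group() if m else None

print(first_match(r"^[A-Z]", "Palo Alto"))         # capital letter at the start
print(first_match(r"^[^A-Za-z]", '"Hello"'))       # non-letter at the start
print(first_match(r"\.$", "The end."))             # literal period at the end
print(first_match(r"\.$", "The end?"))             # no period at the end
print(first_match(r".$", "The end!"))              # any character at the end
print(first_match(r"^The dog\.$", "The dog."))     # the whole line
print(first_match(r" $", "trailing space "))       # space at the end of the line

# With MULTILINE, ^ matches at the start of every line.
print(re.findall(r"^[A-Z]", "Palo Alto\nSan Jose", re.MULTILINE))
```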
More operators
Example
• Find me all instances of the word “the” in a text.
Pattern                     Problem
the                         Misses capitalized examples
[tT]he                      Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]

• Regular expressions can be used in many programming languages, such as Java, Python, Ruby,
Swift, Scala, Groovy, C#, PHP, and JavaScript.
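The successive refinements of the search for "the" can be compared on a made-up sentence. The last pattern uses the word-boundary operator \b, which is an addition that does not appear in the surviving text of these slides; it is shown only as one common way to avoid consuming the neighbouring characters.

```python
import re

# Made-up sample sentence.
text = "The other day the theology student saw the cat"

plain = re.findall(r"the", text)
either_case = re.findall(r"[tT]he", text)
non_letter_context = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)
word_boundary = re.findall(r"\b[tT]he\b", text)

print(len(plain), plain)                 # misses "The", hits "other" and "theology"
print(len(either_case), either_case)     # finds "The", still hits "other", "theology"
print(non_letter_context)                # needs a character on both sides
print(word_boundary)                     # whole words only
```

On this sentence the third pattern misses the initial "The" because no character precedes it, which is one reason a boundary-based pattern is often preferred.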
RegEx Functions (you may try these using Python)
Function   Description
findall    Returns a list containing all matches
search     Returns a Match object if there is a match anywhere in the string
split      Returns a list where the string has been split at each match
sub        Replaces one or many matches with a string
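The four functions listed above belong to Python's standard `re` module. A minimal demonstration on a made-up sample string:

```python
import re

# Made-up sample string.
text = "The rain in Spain"

all_ai = re.findall(r"ai", text)          # every occurrence of "ai"
first_space = re.search(r"\s", text)      # Match object for the first whitespace
pieces = re.split(r"\s", text)            # split at each whitespace character
replaced = re.sub(r"\s", "_", text)       # replace each whitespace with "_"

print(all_ai)
print(first_space.start())
print(pieces)
print(replaced)
```

`re.search` returns None when nothing matches, so real code should check the result before calling methods such as `start()` or `group()`.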
Reference Books
1. Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing. 3rd Edition.
2. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical
Natural Language Processing. MIT Press.
3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, Harshit Surana. 2020. Practical
Natural Language Processing. O'Reilly.
4. NPTEL NLP course.
5. [Link]
6. Coursera course - Natural Language Processing