Speech & Natural Language Processing
Curated by
Dr. Tohida Rehman
Assistant Professor
Department of Information Technology
Jadavpur University
Contents
1. Introduction to AI, ML, DL
2. Overview of NLP
3. Discussion about the application of NLP
4. Details of regular expression
5. References
Introduction[1/2]
What is AI?
The term artificial intelligence (AI) describes the ability of machines to mimic or simulate human
thinking capabilities and behavior.
What is ML?
Machine learning is a subset of AI that enables machines to learn from data and predict
outcomes more accurately without being explicitly programmed to do so.
What is DL?
Deep learning is a subset of ML that uses multi-layered artificial neural networks
to model complex, human-like behavior from a large amount of data.
What is NLP?
Natural language processing (NLP) is the ability of a computer program to understand
human language as it is spoken and written -- referred to as natural language.
NLP ⊆ AI; NLP ∩ ML ≠ ∅; NLP ∩ DL ≠ ∅; and DL ⊆ ML.
Introduction[2/2]
What is Linguistics?
Linguistics is the scientific study of language; it focuses on the systematic investigation of the
properties of particular languages as well as the characteristics of language in general.
Important subfields of linguistics include:
• Phonetics - the study of how speech sounds are produced and perceived
• Phonology - the study of sound patterns and changes
• Morphology - studies how words are formed from smaller meaningful units called morphemes.
• Syntax - focuses on the rules that govern sentence structure and word order.
• Semantics - deals with the meaning of words, phrases, and sentences.
• Pragmatics - the study of how language is used in context (interpretation of meaning)
• Historical Linguistics - the study of language change
• Sociolinguistics - the study of the relation between language and society
• Computational Linguistics - the study of how computers can process human language
• Psycholinguistics - investigates how language is learned, understood, and produced by the human mind.
Why is NLP important?
There is a large volume of textual data on the web and social media.
NLP enables computers to communicate with humans in their native language.
It helps make sense of a highly unstructured data source.
Human language is incredible in its complexity and variety.
We can express ourselves verbally and in writing in many ways.
We need machine learning methods that capture syntactic and semantic context
to model our languages.
Information Extraction
Input email:
Subject: NLP Meeting
Date: January 15, 20
To: T Rehman
Hi Rehman, we’ve now scheduled the NLP meeting.
It will be in the SMCC building tomorrow from 9:00-10:00 a.m.
-Poly
Extracted output (Create new Calendar entry):
Event: NLP Meeting
Date: Nov-13-2022
Start: 09:00am
End: 10:00am
Where: SMCC building
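As a rough illustration of the extraction step on this slide, the sketch below pulls the start time, end time, and place out of the email body with hand-written patterns. The patterns and the names `email_body` and `entry` are assumptions made for this example; practical information extraction systems usually rely on trained models rather than a few fixed rules.

```python
import re

# Email body from the slide.
email_body = (
    "Hi Rehman, we've now scheduled the NLP meeting. "
    "It will be in the SMCC building tomorrow from 9:00-10:00 a.m."
)

# Illustrative hand-written patterns (assumptions, not a general solution).
time_match = re.search(r"from (\d{1,2}:\d{2})-(\d{1,2}:\d{2})", email_body)
place_match = re.search(r"in the (\w+ building)", email_body)

entry = {
    "Event": "NLP Meeting",           # copied from the Subject line
    "Start": time_match.group(1),     # first captured time
    "End": time_match.group(2),       # second captured time
    "Where": place_match.group(1),    # captured place name
}
print(entry)
```

Resolving "tomorrow" to a calendar date needs the date the email was sent, which this sketch does not attempt.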
Information Extraction & Sentiment Analysis
Attributes:
• zoom
• affordability
• size and weight
• flash
• ease of use
Size and weight
✓ nice and compact to carry!
✓ since the camera is small and light, I won't need to carry around those heavy, bulky
professional cameras either!
✗ the camera feels flimsy, is plastic and very light in weight you have to be very
delicate in the handling of this camera
Machine Translation
Automatically translate text from one language to another without human involvement.
Source Text
Which free courses will help you to learn English?
Translated Text
কোন ফ্রি কোর্স আপনাকে ইংরেজি শিখতে সাহায্য করবে?
Text Summarization
People nowadays use search engines like Google, Yahoo, and Bing to find information on
the Internet.
Due to the explosion of data, it is helpful for users if they are provided with relevant summaries of
the search results.
Text summarization has become a vital approach to help consumers swiftly grasp vast
amounts of information.
Given a long text, humans have a natural tendency to remember its most important points in
a summary form. The volume of data around us is growing to the point that we need to find
a solution that will deliver accurate and timely summary information.
It requires a tool or approach for extracting an accurate summary from a large amount of
data.
Extractive vs Abstractive Summarization
Extractive and abstractive summarization are two types of text summarization methods.
Extractive summarization: extracts essential sentences or paragraphs from the source text and
condenses them into a shorter text.
Abstractive summarization: captures the text’s primary idea in newly generated natural language, without
reusing the wording of the source verbatim.
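To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words in the whole text and keeps the top `k` sentences unchanged. The scoring rule, the sentence splitter, and the function name `extractive_summary` are assumptions chosen for brevity; this is one simple heuristic, not a standard or recommended algorithm, and it does no stop-word removal.

```python
import re
from collections import Counter


def extractive_summary(text: str, k: int = 1) -> str:
    """Return the k highest-scoring sentences, in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return " ".join(sentences[i] for i in chosen)


document = "NLP is fun. NLP helps computers read text. Cats sleep."
print(extractive_summary(document, k=1))
```

Every sentence in the output is copied verbatim from the input, which is what distinguishes extractive from abstractive summarization.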
Language Technology
mostly solved
• Spam detection: "Let’s go to Shimla!" ✓ / "Buy V1AGRA …" ✗
• Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV.
• Named entity recognition (NER): Einstein [PERSON] met with UN [ORG] officials in Princeton [LOC]
making good progress
• Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
• Coreference resolution: "Carter told Mubarak he shouldn’t run again."
• Word sense disambiguation (WSD): "I need new batteries for my mouse."
• Parsing: "I can see Alcatraz from the window!"
• Machine translation (MT): "We drink coffee every morning." → "আমরা প্রতিদিন সকালে কফি পান করি।"
• Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" → Party, May 27 (add)
still really hard? (Good progress in 2025...)
• Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
• Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
• Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"
• Dialog: "Where is Citizen Kane playing in SF?" / "Castro Theatre at 7:30. Do you want a ticket?"
Ambiguity makes NLP hard: “Crash blossoms”
Ambiguity is an intrinsic characteristic of human conversations.
Ambiguity makes natural language understanding (NLU) very challenging,
because a word, phrase, or sentence can have multiple meanings depending on the context.
Crash blossom (plural crash blossoms): a sentence, often a news headline, that is subject to
incorrect interpretation due to syntactic and/or lexical ambiguity.
Headline 1: Teacher strikes idle kids
It has two meanings: “Teacher hits lazy kids” and “Teacher walkouts leave kids idle.”
Identification and explanation: "strikes" can occur as either a verb meaning to hit or a noun
meaning a refusal to work. Meanwhile, "idle" can occur as either a verb or an adjective.
Headline 2: Hospitals Are Sued by 7 Foot Doctors
It has two meanings: “Seven foot doctors (podiatrists) are suing the hospitals” or “Doctors who are 7 feet in
height are suing the hospitals.”
More information can be found in the New York Times article.
Assignment
Find at least 10 crash blossoms and identify the type of ambiguity in each.
Skills prerequisite
Simple linear algebra (vectors, matrices)
Basic probability theory
Java or Python programming
Different packages/tools such as NLTK, spaCy, and many more
Regular Expression
One of the unsung successes in standardization in computer science has been the regular
expression (RE), a language for specifying text search strings.
Regular expression search requires a pattern that we want to search for, and a corpus of
texts to search through.
Regular expressions are case sensitive by default.
The American mathematician Stephen Cole Kleene formalized the regular expression
language.
Regular expressions play a surprisingly large role.
Sophisticated sequences of regular expressions are often the first model for any text
processing task.
For many hard tasks, we use machine learning classifiers,
but regular expressions are used as features in the classifiers.
They can be very useful in capturing generalizations.
DESCRIPTION OF A FINITE AUTOMATON[1/2]
Analytically, a deterministic finite automaton (DFA) can be represented by a 5-tuple (Q, ∑, δ,
q0, F), where
(i) Q is a finite nonempty set of states.
(ii) ∑ is a finite nonempty set of inputs called the input alphabet.
(iii) δ (delta) is a function that maps Q × ∑ into Q and is usually called the direct
transition function. This is the function which describes the change of states during the
transition. This mapping is usually represented by a transition table or a transition diagram.
(iv) q0 ∈ Q is the initial state.
(v) F ⊆ Q is the set of final states. It is assumed here that there may be more than one final
state.
N.B.: For a given input symbol, the behavior of a DFA is fully determined by the state it is in.
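The 5-tuple maps directly onto a few lines of code. The sketch below stores δ as a dictionary keyed by (state, symbol) and records every state visited. The automaton itself is hypothetical (it accepts binary strings containing an even number of 1s) and is not the automaton from the K L P Mishra example; the names `run_dfa` and `DELTA` are likewise assumptions for this illustration.

```python
from typing import Dict, List, Set, Tuple


def run_dfa(
    delta: Dict[Tuple[str, str], str],
    start: str,
    finals: Set[str],
    text: str,
) -> Tuple[List[str], bool]:
    """Return the sequence of visited states and whether the input is accepted."""
    state = start
    trace = [state]
    for symbol in text:
        state = delta[(state, symbol)]  # exactly one next state: deterministic
        trace.append(state)
    return trace, state in finals


# Hypothetical DFA over {0, 1}: q0 = even number of 1s seen, q1 = odd.
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}

trace, accepted = run_dfa(DELTA, "q0", {"q0"}, "110001")
print(" -> ".join(trace), "| accepted:", accepted)
```

Because δ returns a single state, the loop never has to branch or backtrack.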
DESCRIPTION OF A FINITE AUTOMATON[2/2]
Analytically, a nondeterministic finite automaton (NDFA) can be represented by a 5-tuple
(Q, ∑, δ, q0, F), where
(i) Q is a finite nonempty set of states.
(ii) ∑ is a finite nonempty set of inputs called the input alphabet.
(iii) δ (delta) is a function that maps Q × ∑ into 2^Q (the set of all subsets of Q) and is usually called the direct
transition function. This is the function which describes the change of states during the
transition. This mapping is usually represented by a transition table or a transition diagram.
(iv) q0 ∈ Q is the initial state.
(v) F ⊆ Q is the set of final states. It is assumed here that there may be more than one final
state.
N.B.: The difference between deterministic and nondeterministic automata is only in δ. For a deterministic
automaton (DFA), the outcome is a state, i.e. an element of Q; for a nondeterministic automaton, the outcome is a
subset of Q.
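Since δ now returns a subset of Q, a simulation can track the set of all states the automaton could be in after each symbol. The sketch below does this for a hypothetical NDFA that accepts binary strings ending in 01; the automaton and the names `run_nfa` and `NFA_DELTA` are assumptions for illustration, and ε-transitions are not handled because the definition above does not include them.

```python
from typing import Dict, Set, Tuple


def run_nfa(
    delta: Dict[Tuple[str, str], Set[str]],
    start: str,
    finals: Set[str],
    text: str,
) -> bool:
    """Accept if at least one possible run ends in a final state."""
    current = {start}
    for symbol in text:
        next_states: Set[str] = set()
        for state in current:
            # A missing entry means delta(state, symbol) is the empty set.
            next_states |= delta.get((state, symbol), set())
        current = next_states
    return bool(current & finals)


# Hypothetical NDFA over {0, 1} accepting strings that end in "01".
NFA_DELTA = {
    ("q0", "0"): {"q0", "q1"},  # either stay in q0 or guess that "01" starts here
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}

print(run_nfa(NFA_DELTA, "q0", {"q2"}, "1101"))
print(run_nfa(NFA_DELTA, "q0", {"q2"}, "0100"))
```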
Give the entire sequence of states for the input string 110001
Example Taken from K L P Mishra
Solution
H.W.
Check whether the input string 0100 is accepted or not.
Example Taken from K L P Mishra
Finite Automata, Regular Grammars, Regular Expressions
The theoretical basis of computational work: finite state automata
For description: regular expressions
A regular expression can be implemented as a finite-state automaton (FSA), and an FSA can
be described with an RE.
The regular expression is more than just a convenient metalanguage for text searching.
A regular expression is one way of characterizing a particular kind of formal language
called a regular language.
Both regular expressions and finite-state automata can be used to describe regular
languages.
Regular Expression
We give a formal recursive definition of regular expressions over ∑ as follows:
1. Any terminal symbol (i.e. an element of ∑), the empty string (denoted by ε or
sometimes Λ), and the empty set (denoted by ∅) are regular expressions. When we view a in ∑
as a regular expression, we denote it by a in boldface.
2. The union of two regular expressions R1 and R2, written as R1 + R2, is also a regular
expression.
3. The concatenation of two regular expressions R1 and R2, written as R1.R2, is also a
regular expression.
4. The iteration (or closure) of a regular expression R, written as R*, is also a regular
expression.
5. If R is a regular expression, then (R) is also a regular expression.
6. The regular expressions over ∑ are precisely those obtained recursively by the application
of the rules 1-5 once or several times.
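The formal operators carry over to practical regex syntax with small notational changes. As a sketch, assuming Python's `re` module: union R1 + R2 is written `R1|R2`, concatenation is plain juxtaposition, and closure keeps the `*`. The expression (a + b)*abb below is an illustrative choice, not taken from the slides.

```python
import re

# Formal notation: (a + b)* a b b  ->  Python notation: (a|b)*abb
pattern = r"(a|b)*abb"

# fullmatch requires the whole string to belong to the language.
in_language = re.fullmatch(pattern, "aababb") is not None
not_in_language = re.fullmatch(pattern, "abab") is not None

# The empty regular expression matches only the empty string (epsilon).
matches_epsilon = re.fullmatch(r"", "") is not None

print(in_language, not_in_language, matches_epsilon)
```

Note that `+` in Python regex syntax means "one or more", not union, so the two notations should not be mixed.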
Regular expressions
A formal language for specifying text strings
How can we search for any of these?
woodchuck
woodchucks
Woodchuck
Woodchucks
Regular Expressions: Disjunctions
Letters inside square brackets []
Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit
Ranges [A-Z]
Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
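The bracket patterns in these tables can be tried directly. The sketch below assumes Python's `re` module; `re.search` returns the first match in the string, so each result is the first character of the requested class.

```python
import re

# Either capitalization of the first letter.
woodchucks = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# First character of each class in the example strings from the table.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(woodchucks, first_upper, first_lower, first_digit)
```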
Regular Expressions: Negation in Disjunction
Negations [^Ss]
Caret means negation only when it is first inside []
Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason
a^b        The pattern a caret b       Look up a^b now
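A short check of the negation patterns, assuming Python's `re` module. One caveat specific to Python (and many other engines): outside square brackets an unescaped `^` is always treated as a start anchor, so the literal pattern a-caret-b has to be written `a\^b`.

```python
import re

# First character that is NOT an upper case letter.
first_non_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# First character that is neither 'S' nor 's'.
first_non_s = re.search(r"[^Ss]", "I have no exquisite reason").group()

# Inside brackets, a caret that is not first is an ordinary character.
a_or_caret = re.findall(r"[a^]", "a^b")

# Outside brackets, escape the caret to match it literally in Python.
escaped = re.search(r"a\^b", "Look up a^b now").group()
unescaped = re.search(r"a^b", "Look up a^b now")  # caret acts as an anchor here

print(first_non_upper, first_non_s, a_or_caret, escaped, unescaped)
```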
Regular Expressions: More Disjunction
Woodchuck is another name for groundhog!
The pipe | for disjunction
Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck
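The pipe can be combined with bracket classes, as the last pattern shows. A small sketch with Python's `re` module (the sample sentence is an assumption for the example):

```python
import re

sentence = "Woodchuck is another name for groundhog!"

# Lower-case alternatives only: the capitalized "Woodchuck" is missed.
lower_only = re.findall(r"groundhog|woodchuck", sentence)

# Brackets inside each alternative cover both capitalizations.
either_case = re.findall(r"[gG]roundhog|[Ww]oodchuck", sentence)

# For single characters, a|b|c and [abc] find the same matches.
same_matches = re.findall(r"a|b|c", "cab") == re.findall(r"[abc]", "cab")

print(lower_only, either_case, same_matches)
```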
Regular Expressions: ? * + . {}
Pattern      Meaning                                  Matches
colou?r      Optional previous char                   color colour
oo*h!        0 or more of previous char               oh! ooh! oooh! ooooh!
o+h!         1 or more of previous char               oh! ooh! oooh! ooooh!
baa+         1 or more of previous char               baa baaa baaaa baaaaa
beg.n        Any single char in place of the .        begin begun beg3n
ma.{2}als    {n}: exactly n occurrences of the        "ma", followed by exactly 2 (any) characters,
             previous item                            and then "als", e.g. mammals
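Each quantifier can be checked against a few candidate words. The sketch below assumes Python's `re` module and uses `re.fullmatch`, so a word counts only if the pattern covers it completely; the helper name `words_matching` and the candidate lists are assumptions for the example.

```python
import re
from typing import List


def words_matching(pattern: str, words: List[str]) -> List[str]:
    """Keep only the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]


optional_u = words_matching(r"colou?r", ["color", "colour", "colouur"])
star = words_matching(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
plus = words_matching(r"baa+", ["ba", "baa", "baaa"])
dot = words_matching(r"beg.n", ["begin", "begun", "beg3n", "begn"])
exactly_two = words_matching(r"ma.{2}als", ["mammals", "marsupials", "maals"])

print(optional_u, star, plus, dot, exactly_two)
```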
Regular Expressions: Anchors ^ $
The most common anchors are the caret ^ and the dollar sign $.
The caret ^ has three uses: to match the start of a line, to indicate a negation inside of square
brackets, and just to mean a caret.
The dollar sign $ matches the end of a line. So the pattern ␣$ (a space followed by $) is a useful pattern for
matching a space at the end of a line, and /^The dog\.$/ matches a line that contains only the
phrase The dog.
We have to use the backslash here since we want the . to mean “period” and not the
wildcard.
Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1 “Hello”
\.$           The end.
.$            The end? The end!
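The anchor patterns can be verified on the same example strings. This is a small sketch with Python's `re` module, where `re.search` returns `None` when nothing matches.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto") is not None
starts_non_letter = re.search(r"^[^A-Za-z]", '1 "Hello"') is not None

# Escaped dot: only a literal period at the end of the line.
ends_with_period = re.search(r"\.$", "The end.") is not None
question_is_not_period = re.search(r"\.$", "The end?") is not None

# Unescaped dot: any final character.
last_char = re.search(r".$", "The end?").group()

# Both anchors together: the line must contain only "The dog."
only_the_dog = re.search(r"^The dog\.$", "The dog.") is not None
not_only_the_dog = re.search(r"^The dog\.$", "The dog. Woof") is not None

print(starts_upper, starts_non_letter, ends_with_period, last_char, only_the_dog)
```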
More operators
Example
Find me all instances of the word “the” in a text.
the                         Misses capitalized examples
[tT]he                      Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]    Requires a non-letter on both sides of the word
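Running the three patterns on one sentence shows each failure mode. The sketch assumes Python's `re` module and a made-up test sentence. The last pattern, `\b[tT]he\b`, is not on the slide: it is a swapped-in alternative using the word-boundary anchor `\b`, shown because the bracketed version needs an actual character on each side and therefore misses "The" at the very start of the text.

```python
import re

text = "The other day the theology student left"

plain = re.findall(r"the", text)                            # misses "The", hits inside words
either_case = re.findall(r"[tT]he", text)                   # still hits "other", "theology"
non_letters = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text) # needs a character on both sides
word_boundary = re.findall(r"\b[tT]he\b", text)             # alternative (not from the slide)

print(len(plain), len(either_case), non_letters, word_boundary)
```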
Regular expressions can be used in many programming languages, such as Java, Python, Ruby,
Swift, Scala, Groovy, C#, PHP, and JavaScript.
RegEx Functions (you may try these using Python)
findall    Returns a list containing all matches
search     Returns a Match object if there is a match anywhere in the string
split      Returns a list where the string has been split at each match
sub        Replaces one or many matches with a string
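The four functions listed here all live in Python's built-in `re` module. A minimal sketch on one sample sentence (the sentence and variable names are assumptions for the example):

```python
import re

text = "The rain in Spain"

all_ai = re.findall(r"ai", text)       # list of every match
first_space = re.search(r"\s", text)   # Match object for the first whitespace
parts = re.split(r"\s", text)          # split the string at each whitespace
replaced = re.sub(r"\s", "9", text)    # replace each whitespace with "9"

print(all_ai)
print(first_space.start())
print(parts)
print(replaced)
```

`re.search` returns `None` when there is no match, so real code should check the result before calling methods such as `.start()` or `.group()`.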
Reference Books
1. Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing. 3rd Edition.
2. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical
Natural Language Processing. MIT Press.
3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, Harshit Surana. 2020. Practical
Natural Language Processing. O'Reilly.
4. NPTEL NLP course.
5. [Link]
6. Coursera course - Natural Language Processing