Noida Institute of Engineering and Technology, Greater Noida
Introduction
Unit I
NLP: ACSA10712
Ankit Kumar Sharma
(Assistant Professor)
Course Details
B. Tech. 7th Semester
Evaluation Scheme
NLP Syllabus
[Link]. FOURTH YEAR
Course Code: ACSAI0712    L-T-P: 3-0-0    Credits: 3
Course Title: Natural Language Processing
Course objective:
The course aims to provide an understanding of the foundational concepts and techniques in NLP. The focus is on providing application-based knowledge.
Pre-requisites: Programming Skills, Data Structures, Algorithms, Probability and Statistics, Machine Learning.
Course Contents/Syllabus
UNIT-I Overview of Natural Language Processing    8 Hours
Definition, applications and emerging trends in NLP, challenges, ambiguity. NLP tasks using NLTK: tokenization, stemming, lemmatization, stop-word removal, POS tagging, parsing, named entity recognition, coreference resolution.
UNIT-II Regular Expressions and Data Preprocessing    8 Hours
Data preprocessing: convert to lower case; handle email IDs, HTML tags, URLs, emojis, repeated characters; normalization of data (contractions, standardization), etc. Vocabulary, corpora, and linguistic resources. Linguistic foundations: morphology, syntax, semantics and pragmatics. Language models: unigram, bigram, n-grams.
UNIT-III Text Analysis and Similarity    8 Hours
Text vectorization: Bag-of-Words model and vector space models, term presence, term frequency, TF-IDF. Textual similarity: cosine similarity, Word Mover's distance. Word embeddings: Word2Vec, GloVe.
UNIT-IV Text Classification & NLP Applications    8 Hours
Text classification: implementation of NLP applications using text classification (sentiment analysis, topic modeling, spam detection). High-level NLP applications: machine translation (rule-based and statistical approaches), text summarization, dialog systems, conversational agents and chatbots.
Subject Code: ACSA10712
Subject Name: Natural Language Processing
Unit 1: Overview of Natural Language Processing
Definition, applications and emerging trends in NLP, challenges, ambiguity; NLP tasks using NLTK: tokenization, stemming, lemmatization, stop-word removal, POS tagging, parsing, named entity recognition, coreference resolution.
Natural Language Processing
What is NLP?
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics focused on enabling computers to understand, interpret, and respond to human language in a way that is both meaningful and useful.
Key Areas in NLP
• Text Analysis:
– Tokenization: Breaking down text into smaller units, such as words or
sentences.
– Part-of-Speech Tagging: Identifying the grammatical parts of speech
(nouns, verbs, adjectives, etc.) in a sentence.
– Named Entity Recognition (NER): Identifying and classifying named
entities (people, organizations, locations, etc.) in text.
– Sentiment Analysis: Determining the sentiment or emotion expressed in a
piece of text, such as positive, negative, or neutral.
• Speech Recognition:
– Converting spoken language into text, enabling voice-activated systems like
virtual assistants (e.g., Siri, Alexa) to understand and process voice
commands.
Key Areas of NLP
• Machine Translation:
– Automatically translating text or speech from one language to another, as
seen in services like Google Translate.
• Text Generation:
– Automatically generating coherent and contextually relevant text, such as
in chatbots, content creation, or language models like GPT.
• Question Answering:
– Building systems that can answer questions posed in natural language by
retrieving and summarizing relevant information from a large dataset.
• Text Summarization:
– Condensing a large piece of text into a shorter version while preserving its
meaning and key information.
Applications of NLP
• Virtual Assistants: NLP is used in virtual assistants like Siri, Alexa, and
Google Assistant to understand voice commands and respond
appropriately.
• Chatbots: Many businesses use NLP-based chatbots to provide
customer support and answer common queries.
• Search Engines: NLP helps search engines like Google understand user
queries and deliver relevant search results.
• Translation Services: NLP powers translation tools that convert text
from one language to another.
• Sentiment Analysis: Companies use sentiment analysis to monitor
social media, reviews, and other forms of user feedback to gauge public
opinion.
Challenges in NLP
• Ambiguity: Natural language is often ambiguous, with words having
multiple meanings or sentences that can be interpreted in different ways.
• Context: Understanding the context of a conversation or text is essential
for accurate processing, which can be challenging.
• Cultural Differences: Language use varies across cultures, making it
difficult to create models that work universally.
• Evolving Language: Language constantly evolves, with new slang,
idioms, and usage patterns emerging regularly.
Challenges in NLP - Ambiguity
Natural language ambiguity refers to situations where a word, phrase, or
sentence has multiple meanings, making it challenging to interpret correctly.
Some common forms of ambiguity include
1. Lexical Ambiguity
2. Syntactic Ambiguity
3. Semantic Ambiguity
4. Referential (Anaphoric) Ambiguity
5. Contextual (Pragmatic) Ambiguity
Challenges in NLP - Ambiguity
1. Lexical ambiguity
• Lexical means relating to the words of a language.
• During lexical analysis, the given text is broken down into words or tokens, and each token has a specific meaning.
There can be instances where a single word can be interpreted in multiple ways. Ambiguity caused by the word alone, rather than by the context, is known as lexical ambiguity.
Example: "Give me the bat!"
Here "bat" has more than one meaning: the animal or a cricket bat.
Challenges in NLP – Ambiguity (continued)
Lexical ambiguity divides into two categories:
1. Polysemy
○ One word has many meanings
○ Determining the sense of a word in a particular context
■ He sat on the bank of a river/Withdraw money from the bank
■ Maruti has built a plant to manufacture cars/A man was planted in the
audience to raise anti-political slogans
2. Homonymy
o It refers to a single word having multiple but unrelated meanings.
Examples: bear, left, Pole
Challenges in NLP - Ambiguity
A bear (the animal) can bear (tolerate) very cold temperatures.
The driver turned left (opposite of right) and left (departed from) the main
road.
Pole and pole — the first "Pole" refers to a citizen of Poland, who could be called Polish or a Pole; the second "pole" refers to a bamboo pole or any other wooden pole.
Challenges in NLP - Ambiguity
2. Syntactic Ambiguity/ Structural ambiguity
Syntactic meaning refers to the grammatical structure and rules that define how
words should be combined to form sentences and phrases. A sentence can be
interpreted in more than one way due to its structure or syntax such ambiguity is
referred to as Syntactic Ambiguity.
Example 1: “Old men and women”
There are two possible meanings:
old men, and women (of any age);
old men and old women.
Example 2: "John saw the boy with a telescope."
Two possible meanings:
John saw the boy through his telescope.
John saw the boy who was holding a telescope.
Challenges in NLP - Ambiguity
3. Semantic Ambiguity
Semantics is nothing but "meaning".
• The semantics of a word or phrase refers to the way it is typically
understood or interpreted by people.
• Syntax describes the rules by which words can be combined into sentences, while
semantics describes what they mean.
This type of ambiguity occurs when a sentence has more than one interpretation or
meaning.
Example 1 “The chicken is ready to eat.”
The chicken (as food) is cooked and ready to be eaten.
The chicken (the animal) is hungry and ready to eat something.
Example 2: "Seema loves her mother and Sriya does too."
There are two interpretations: Sriya loves Seema's mother, or Sriya loves her own mother.
Challenges in NLP - Ambiguity
4. Anaphoric (Referential) Ambiguity — when a pronoun stands in for a noun.
A word that gets its meaning from a preceding word or phrase is called an anaphor; ambiguity arises when the anaphor has more than one possible antecedent.
Example 1: "The house is on the longest street. It is very dirty."
Here it is unclear whether "It" refers to the house or to the street.
Example 2: "I went to the hospital, and they told me to go home and rest."
Here "they" does not literally refer to the hospital; it refers to the doctor or staff who attended to the patient.
5. Pragmatic Ambiguity — pragmatics focuses on the real-time usage of language: what the speaker wants to convey and how the listener infers it.
Example: "Do you know what time it is?"
This can be a genuine request for the time, or an expression of annoyance that someone is late.
Natural Language ToolKit (NLTK)
The Natural Language Toolkit (NLTK) is a Python library for building applications for statistical natural language processing (NLP).
1. Tokenization
2. Sentence Segmentation
3. Corpus and Vocabulary
4. Stop words
5. Stemming and Lemmatization
6. Named Entity Recognition
7. Co-referencing Resolution
8. POS tagging
9. Parsing
Natural Language ToolKit (NLTK) – 1. Tokenization
● Tokenization method is used to split a sentence, paragraph, or
full-text document into smaller units - tokens
I love NLP.
['I', 'love', 'NLP', '.']
○ The basic unit of a language
○ It helps to interpret the meaning of the text by analysing the words
present in the text
○ Count the number and frequency of words in the text
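The word/punctuation split described above can be sketched with a single regular expression. This toy tokenizer is an illustrative assumption, not NLTK's implementation (nltk.word_tokenize is far more robust):

```python
import re

def tokenize(text):
    # Match runs of word characters, or any single non-space symbol,
    # so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I love NLP."))  # → ['I', 'love', 'NLP', '.']
```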
Natural Language ToolKit (NLTK) – 2. Sentence Segmentation
You first need to break the entire document down into its constituent sentences. You can do this by segmenting the text at punctuation marks such as full stops and question marks.
● Splitting the given input text into sentences
● Characters used for defining sentence end - ‘!’, ‘?’, ‘.’
● Ambiguities:
○ The yearly results of Yahoo! are promising.
○ We are using the .NET framework for our project.
○ Susan scored 78.5% marks in her exams
○ Mr. Mehta is doing a great job.
● Disambiguation rules are needed, e.g. checking whether digits surround the '.'
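One way to encode such rules is to split after sentence-ending punctuation and then re-attach pieces that end in a known abbreviation. A minimal sketch (the abbreviation list and function name are illustrative assumptions):

```python
import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Ms."}  # tiny hand-picked list (assumption)

def split_sentences(text):
    # Split after '.', '!' or '?' only when followed by whitespace,
    # so decimals like 78.5 are never split.
    pieces = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for piece in pieces:
        # Re-attach a piece if the previous one ended in an abbreviation.
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

print(split_sentences("Mr. Mehta is doing a great job. Susan scored 78.5% marks."))
```

Cases like "Yahoo!" or ".NET" would still need extra rules, which is why nltk.sent_tokenize uses a trained model instead of hand-written patterns.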
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary
● text corpus
○ The set of text documents used for the model.
○ e.g. for a model that analyzes movie reviews, the corpus is the set of documents, each containing one movie review.
○ The document set is divided into training/testing subsets for the model.
● Vocabulary
○ The unique set of words in the entire corpus
○ Usually, the feature vector is based on the vocabulary of the corpus
○ Vocabulary size – number of words in the vocabulary
● Freely available corpus
○ Links to some freely available corpora - [Link]
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary (continued)
Some popular corpora available:
● Movie review dataset (IMDB dataset)
○ Consisting of 1000 negative and 1000 positive labeled movie reviews
○ [Link]
● Amazon product review datasets
○ DVD dataset, Sports and Outdoor datasets
○ Each consisting of 1000 negative and 1000 positive labeled product reviews.
○ [Link]
Natural Language ToolKit (NLTK) – 4. Stop words
Very common words in a language that carry little useful information.
● Articles, prepositions, pronouns, conjunctions, etc.
○ e.g. "the", "in", "of", "his", "and", etc.
● Removal of stop-word tokens
○ Focuses attention on the important information
○ Reduces the dataset size and training time
● Stop words may still be needed for
○ Relational queries – "flights to Tokyo"
○ Phrases like "To be or not to be"
○ Prediction of sentiment – "I told you that she was not happy" → "told, happy" loses the negation
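Stop-word removal is a simple filter over the token list. A sketch with a tiny hand-picked stop list (an assumption for illustration; NLTK ships a fuller list in nltk.corpus.stopwords after nltk.download('stopwords')):

```python
STOP_WORDS = {"the", "in", "of", "his", "and", "is", "to", "a"}  # illustrative subset

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "new", "policy", "of", "the", "government", "is", "good"]))
# → ['new', 'policy', 'government', 'good']
```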
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization
● Reduces the form of a word to the common base form.
e.g. (go, going, gone) → go; (running, ran, runs, run) → run
● To prepare text, words, and documents for further processing.
● When we search for a word on the web it also retrieves
variations of the word. If we search for say ‘kill’, it may also
return words like killer, killing, killed, kills.
● Here kill is the stem for killer, killed, killing, kills. It conveys
that each of these has the idea of ‘kill’.
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization
● Stemming uses heuristics that
○ chop off letters from the end of the word
○ transform the end letters
● Lemmatization groups together the inflected forms of a word into its base dictionary form, called the lemma.
Stemming examples:
Word | Suffix removed | Stem
Was | – | was
Cats | s | cat
Changing | ing | chang
Studies | es | studi
Studying | ing | study

Lemmatization examples:
Words | Lemma
Is, was, were | be
Cats | cat
Changing, changed, change | change
Studies, studying | study
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization
Stemming vs Lemmatization:
Stemming | Lemmatization
Fast and simple, pattern-based | Needs POS tagging and dictionaries
Tools: Snowball, Porter | Tools: LemmaGen, Morpha
Returns the stem of a word, which may not be in the vocabulary | Returns a proper word, the lemma
Crude, less useful | More informative
● Stemming and lemmatization are methods used by search engines and
chatbots to analyze the meaning behind a word.
● Stemming uses the stem of the word, while lemmatization uses the context in
which the word is being used.
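The "chops off letters from the end" idea behind stemming can be sketched with a few suffix rules. This is an illustrative toy, not the Porter algorithm (use nltk.stem.PorterStemmer for real work); the suffix list and length check are assumptions:

```python
SUFFIXES = ["ing", "es", "ed", "s"]  # checked in order, longer matches first

def crude_stem(word):
    w = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if a reasonably long stem remains.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

for w in ["cats", "changing", "studies", "killed"]:
    print(w, "->", crude_stem(w))  # cat, chang, studi, kill
```

These match the stems shown in the table above; a lemmatizer (nltk.stem.WordNetLemmatizer) would instead return dictionary words such as "study".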
Natural Language ToolKit (NLTK) – 6. Named Entity Recognition
● Named Entities (NEs) are proper names in text, i.e. the names of people, organizations, locations, times and quantities.
● NER processes a text and identifies the named entities in it.
● Applications:
○ Helps identify the key elements in a text
■ Helps sort unstructured data and detect important information
○ Useful in question-answering systems
■ "Where was Mahatma Gandhi born?"
Example (entity labels in brackets): "Hi, my name is Shubhangi Deb [PERSON]. I am from Australia [GPE]. I want to work with Amazon [ORG]. Jeff Bezos [PERSON] is my inspiration."
(Figure: named entity recognition with machine learning)
Natural Language ToolKit (NLTK) – 6. Named Entity Recognition
● Applications
○ Processing résumés
■ Extracting information from résumés that are formatted differently
■ Personal information, experience, skills, degrees, etc.
○ Gaining insights from customer feedback
■ Organize customer feedback and pinpoint recurring problem areas
■ Identify what customers like, dislike, and want improved
○ Content recommendation
■ If you watch a lot of comedies on Netflix, you'll get more recommendations that have been classified under the entity 'Comedy'.
Natural Language ToolKit (NLTK) – 7. Coreference Resolution
● Identify all expressions that refer to the same object.
○ Mohan went to McDonald's to buy a burger. He visits the store very often and loves its food. ("He" = Mohan; "the store", "its" = McDonald's)
"I voted for Biden because he was most aligned to my principles", Jenna said.
(original sentence)
"Jenna voted for Biden because Biden was most aligned to Jenna's principles", Jenna said.
(sentence with resolved coreferences)
● Uses – it is an important step for many higher-level NLP tasks that involve natural language understanding.
Natural Language ToolKit (NLTK) – 7. Coreference Resolution
● Uses
○ Document summarization
○ Question answering
○ Machine translation
● Anaphora (backward references)
○ Refers to any reference that “points backward” to information that was
presented earlier in the text
“The apple on the table was rotten. It had been there for three days.”
● Cataphora (forward references)
○ Refers to any reference that “points forward” to information that will be
presented later in the text
"It has four legs. The cow is a domestic animal."
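NLTK itself does not ship a coreference resolver; real systems use learned models (e.g. spaCy extensions or AllenNLP). Purely to illustrate why resolution is hard, here is a naive "most recent noun" heuristic — the pronoun and noun lists are hand-made assumptions for this demo:

```python
PRONOUNS = {"it", "he", "she", "they"}
NOUNS = {"apple", "table", "cow", "house", "street"}  # hand-listed (assumption)

def resolve(tokens):
    last_noun = None
    out = []
    for tok in tokens:
        low = tok.lower().strip(".")
        if low in PRONOUNS and last_noun:
            out.append(last_noun)      # substitute the most recent noun seen
        else:
            out.append(tok)
            if low in NOUNS:
                last_noun = tok
    return out

print(resolve("The apple on the table was rotten . It had been there".split()))
```

The heuristic resolves "It" to "table" (the most recent noun) even though the intended antecedent is "apple" — exactly the anaphoric ambiguity discussed above, and why real resolvers need far more than recency.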
Natural Language ToolKit (NLTK) – 8. POS tagging
● Assign a part of speech to each word in the text.
○ Nouns: define an object or entity
○ Verbs: define an action
○ Pronouns: can replace a noun – she, him
○ Adjectives: describe a noun/pronoun
(Figure: the parts of speech – noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection)
● In a sentence, every word is associated with a proper POS tag:
Puja (Proper Noun, NNP) bought (Verb, VBD) a (Determiner, DT) new (Adjective, JJ) phone (Noun, NN) from (Preposition, IN) Samsung (Proper Noun, NNP) Store (Noun, NN)
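Tagging assigns one label per token. In practice this is nltk.pos_tag (after downloading the tagger data); the dictionary-lookup version below is only a sketch of the idea, with a hand-made lexicon and a default tag as assumptions:

```python
TAG_LEXICON = {  # hand-made for the example sentence (assumption)
    "Puja": "NNP", "bought": "VBD", "a": "DT", "new": "JJ",
    "phone": "NN", "from": "IN", "Samsung": "NNP", "Store": "NN",
}

def tag(tokens):
    # Unknown words default to NN, a common baseline choice.
    return [(t, TAG_LEXICON.get(t, "NN")) for t in tokens]

print(tag("Puja bought a new phone from Samsung Store".split()))
```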
Natural Language ToolKit (NLTK) – 9. NLTK
NLTK (Natural Language Toolkit) is a library for NLP in Python.
A powerful tool to preprocess text data for further analysis, e.g. as input to machine learning algorithms.
Tokenization, POS tagging, stemming, etc.
It includes many corpora and lexical resources (such as WordNet).
Natural Language ToolKit (NLTK) – 9. NLTK
Installation
pip install nltk
Downloading the datasets:
import nltk
nltk.download()
Choose from the screen whatever packages you want to download.
Source:
[Link]
Natural Language ToolKit (NLTK) – 9. NLTK
Operations using NLTK:
Tokenization
import nltk
text = "First sentence. Second sentence."
nltk.word_tokenize(text)
Output: ['First', 'sentence', '.', 'Second', 'sentence', '.']
Sentence splitting
nltk.sent_tokenize(text)
Output: ['First sentence.', 'Second sentence.']
Natural Language ToolKit (NLTK) – 9. NLTK
Accessing Text Corpora in NLTK:
● Corpus: a set of text documents, usually sharing some common characteristic. "Corpora" is the plural of "corpus".
a set of online books
a set of movie reviews
a collection of tweets
• The NLTK corpus collection is a set of natural language datasets in the nltk_data directory.
Some corpora included in NLTK:
Gutenberg corpus (online books)
Brown corpus (categorized text – news, humor, etc.)
Reuters corpus (news)
WordNet – the most advanced; contains words, synonyms, antonyms, etc.
from nltk.corpus import gutenberg
Natural Language ToolKit (NLTK) – NLTK
Accessing Gutenberg corpus:
● NLTK includes a small selection of texts from the Project Gutenberg electronic text archive,
which contains some 25,000 free electronic books
See website for full list - [Link]
To download the corpus – nltk.download('gutenberg')
To list the Gutenberg files downloaded with the NLTK package – gutenberg.fileids()
To get the words in a text file – gutenberg.words(fileid)
To get the sentences in a file – gutenberg.sents(fileid)
To get the text of the file as a string – gutenberg.raw(fileid)
Searching for patterns in text:
To check whether the phone number "0120-4543466" occurs in the string "His contact number is 0120-4543466", Python's membership test works:
"0120-4543466" in "His contact number is 0120-4543466"
But if we only know the format – ####-####### (phone number) or ##/##/#### (date) – a literal substring test is not enough: we need regular expressions to search for these patterns.
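The pattern search just described is exactly what Python's re module provides. A sketch (variable names are illustrative; the date value is made up for the demo):

```python
import re

text = "His contact number is 0120-4543466 and the date is 22/09/2025."

phone = re.search(r"\d{4}-\d{7}", text)       # the ####-####### pattern
date = re.search(r"\d{2}/\d{2}/\d{4}", text)  # the ##/##/#### pattern
print(phone.group())  # → 0120-4543466
print(date.group())   # → 22/09/2025
```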
Natural Language ToolKit (NLTK) – NLTK
Accessing the Brown corpus:
● Contains text categorized by genre, such as news, humor, etc.
● To find the categories:
from nltk.corpus import brown
brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
To find the files in the category 'humor':
brown.fileids(categories='humor')
['cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09']
To find the words in the file 'cr01':
brown.words(fileids=['cr01'])
['It', 'was', 'among', 'these', 'that', 'Hinkle', ...]
For details refer : [Link]
Text Preprocessing in NLP
● Convert raw text into a set of tokens that the computer can understand and use, ready for feature extraction.
Data cleaning and pre-processing are critical for the quality of further analysis.
Pre-processing steps depend on:
a. The data – structured (movie reviews) or unstructured (tweets)
b. The application for which the data needs to be used.
Text Preprocessing in NLP
Steps in Text Preprocessing:
● Convert all characters to lower case
○ e.g. Hello, HELLO, hello, hellO → hello
● Remove HTML tags, URLs, email IDs
○ a URL such as [Link]
○ an email address such as anuj123@[Link]
○ text between HTML tags
● Convert data to a standard form
○ 2mrw, tmrw → tomorrow; btwn, btw → between; b4 → before
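Standardization is usually done with a lookup table mapping informal spellings to canonical words. A sketch (this small map is an assumption; production systems use curated slang and contraction dictionaries):

```python
NORMALIZE = {  # informal form -> standard form (hand-made for illustration)
    "2mrw": "tomorrow", "tmrw": "tomorrow",
    "btwn": "between", "btw": "between", "b4": "before",
}

def standardize(tokens):
    # Replace a token if it appears in the map; otherwise keep it unchanged.
    return [NORMALIZE.get(t.lower(), t) for t in tokens]

print(standardize("see you 2mrw b4 noon".split()))
# → ['see', 'you', 'tomorrow', 'before', 'noon']
```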
Text Preprocessing in NLP
● Emojis
○ Remove them / replace with a word / use them for sentiment analysis
● Replace characters repeated for emphasis (common on Twitter)
○ e.g. it was ssssoooo nice → it was so nice
● Replace contractions (expand short forms to full words)
○ I'm → I am, didn't → did not
● Removal of punctuation
● Removal of rare/very frequent words
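Repeated-character cleanup is a one-line regular expression. This sketch assumes runs of three or more identical letters are noise, so legitimate doubles like "too" survive:

```python
import re

def collapse_repeats(text):
    # (\w)\1{2,} matches a letter followed by 2+ copies of itself;
    # the whole run is replaced by a single copy.
    return re.sub(r"(\w)\1{2,}", r"\1", text)

print(collapse_repeats("it was ssssoooo nice"))  # → it was so nice
```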
Text Preprocessing in NLP
● Tokenization
'the new policy of the government is good'
● Sentence segmentation (if required)
text = "First in class. Last in class."
Output: ['First in class.', 'Last in class.']
● Remove stop words
Tokens as input – ['the', 'new', 'policy', 'of', 'the', 'government', 'is', 'good']
Output – ['new', 'policy', 'government', 'good']
Text Preprocessing in NLP
● Parts of Speech (POS) tagging
Tokens as input – ['he', 'loves', 'to', 'play', 'with', 'toys', 'in', 'morning']
Output – [('he', 'PRP'), ('loves', 'VBZ'), ('to', 'TO'), ('play', 'VB'), ('with', 'IN'), ('toys', 'NN'), ('in', 'IN'), ('morning', 'NN')]
● Stemming
Tokens as input – ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root', 'format']
Output – ['stem', 'us', 'tri', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'form']
● Lemmatization
Tokens as input – ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root', 'format']
Output – ['Stemming', 'usually', 'try', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'format']
Text Preprocessing in NLP
● Named Entity Recognition
String 'text' as input:
text = "Tom is good at playing football and stays in London."
tokens = word_tokenize(text)
pos_text = nltk.pos_tag(tokens)

nes = nltk.ne_chunk(pos_text, binary=True)
OUTPUT:
(NE Tom/NNP) ('is', 'VBZ') ('good', 'JJ') ('at', 'IN') ('playing', 'VBG') ('football', 'NN') ('and', 'CC') ('stays', 'NNS') ('in', 'IN') (NE London/NNP) ('.', '.')

nes = nltk.ne_chunk(pos_text, binary=False)
OUTPUT:
(PERSON Tom/NNP) ('is', 'VBZ') ('good', 'JJ') ('at', 'IN') ('playing', 'VBG') ('football', 'NN') ('and', 'CC') ('stays', 'NNS') ('in', 'IN') (GPE London/NNP) ('.', '.')

With binary=True all entities are tagged simply as NE; with binary=False they get specific labels such as PERSON and GPE.
Text Preprocessing in NLP
How do we implement these NLP concepts:
Language: Python
● Why Python?
○ Has simple syntax
○ Has extensive collection of NLP tools and libraries
● Python Libraries used in this course
○ Numpy
○ Pandas
○ Matplotlib
○ SciKit-Learn
○ NLTK
Text Preprocessing in NLP
Summary
● Now you have an idea of what NLP is, its applications and the emerging
trends and challenges in using it.
● You also learnt about the basic concepts of NLP like
○ Corpus and Vocabulary
○ Text Normalization (Tokenization , Stemming and Lemmatization,
Stop words, Sentence segmentation )
○ POS tagging, Named Entity Recognition, Co-referencing Resolution
○ Parsing
● Implementation of the above concepts using NLTK