21CSE356T
NATURAL LANGUAGE PROCESSING
UNIT- 1
S.PRABU
Assistant Professor
C-TECH-SRM-IST-KTR
UNIT-1
Overview and Word Level Analysis (9 Hours): Introduction to Natural Language Processing, Applications of NLP, Levels of NLP, Regular Expressions, Morphological Analysis, Tokenization, Stemming, Lemmatization, Feature extraction: Term Frequency (TF), Inverse Document Frequency (IDF), Modeling using TF-IDF, Parts of Speech Tagging, Named Entity Recognition, N-grams, Smoothing.
What is NLP?
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI)
that helps computers understand, interpret, and respond to human language (text or
speech). It bridges the gap between human communication and machine understanding.
Natural Language Processing (NLP) plays a crucial role in bridging the gap
between human communication and machine understanding. It enables computers to
process large volumes of text or speech data, extracting meaningful insights and
patterns. NLP is widely used in industries such as healthcare (for analyzing patient
records), finance (for detecting fraudulent transactions), and customer service (for
automated support systems). Modern NLP systems rely on deep learning models that
learn from vast datasets to improve their accuracy and understanding. As technology
advances, NLP continues to enhance human-computer interaction, enabling more
personalized and intelligent digital experiences.
Challenges in NLP
• Ambiguity – Words can have multiple meanings (e.g., “bank” as a financial
institution or riverbank)
• Context Understanding – Computers struggle with sarcasm and emotions
• Grammar & Syntax Variations – Different languages and dialects
Applications of NLP

3. Sentiment Analysis
• Analyzes emotions in social media, product reviews, and customer feedback.
Categorizes sentiments as positive, negative, or neutral.
• Helps businesses improve products by understanding customer opinions.
• Used in brand reputation monitoring and political opinion analysis.
4. Speech Recognition
• Converts spoken words into written text (e.g., Google Voice, Siri, Cortana).
• Used in voice-controlled systems, medical transcription, and virtual assistants.
• Helps in accessibility for people with disabilities (e.g., voice-to-text software).
• Powers real-time meeting transcription tools like Otter.ai and Zoom captions.
5. Text Summarization
• Generates concise summaries of long documents or articles.
• Used in news aggregation platforms (e.g., Inshorts, AI-generated news
summaries).
• Helps researchers and professionals quickly scan lengthy reports.
7. Spam Detection
• Filters spam emails, fraudulent messages, and phishing attacks.
• Used by Gmail, Outlook, and email service providers to detect suspicious
messages.
• Employs machine learning models to classify emails as spam or legitimate.
• Helps in cybersecurity by identifying fake and harmful content.
Levels of NLP

NLP systems typically analyze language in a sequence of phases: lexical analysis, syntactic analysis, semantic analysis, and discourse integration. Each phase plays a crucial role in the overall understanding and processing of natural language.
Lexical Analysis breaks the whole chunk of text (the lexicon) into components, based on what the user sets as parameters – paragraphs, phrases, words, or characters.
For example, irrationally can be broken into ir (prefix), rational (root) and -
ly (suffix). Lexical Analysis finds the relation between these morphemes and converts
the word into its root form. A lexical analyzer also assigns the possible Part-Of-Speech
(POS) to the word. It takes into consideration the dictionary of the language.
Syntax Analysis ensures that a given piece of text has the correct structure. It tries to parse the sentence to check for correct grammar at the sentence level. Given the possible
POS generated from the previous step, a syntax analyzer assigns POS tags based on the
sentence structure.
For example:
Correct Syntax: Sun rises in the east.
Incorrect Syntax: Rise in sun the east.
Semantic Analysis is performed by mapping the syntactic structure and checking for logic in the relationships between entities, words, phrases and sentences in the text. There are a couple of important functions of semantic analysis, which allow for natural language understanding:
• To ensure that the data types are used in a way that’s consistent with their
definition.
• To ensure that the flow of the text is consistent.
• Identification of synonyms, antonyms, homonyms, and other lexical items.
• Overall word sense disambiguation.
• Relationship extraction from the different entities identified from the text.
Consider the sentence: “The apple ate a banana”. Although the sentence is syntactically
correct, it doesn’t make sense because apples can’t eat. Semantic analysis looks for
meaning in the given sentence. It also deals with combining words into phrases.
Discourse Integration considers the text as a whole: the surrounding text provides context for any smaller part of natural language structure (e.g. a phrase, word or sentence).
During this phase, it's important to ensure that each phrase, word, and entity is mentioned within the appropriate context. This analysis involves considering not only sentence structure and semantics, but also how sentences combine and what the text means as a whole. When analyzing the structure of a text, sentences are therefore not only broken up and analyzed individually, but also considered in the context of the sentences that precede and follow them, and the impact they have on the text overall. Some common tasks in this phase include information extraction, conversation analysis, text summarization, and discourse analysis.
Discourse deals with the effect of a previous sentence on the sentence under consideration. In the text, "Jack is a bright student. He spends most of his time in the library.", discourse analysis resolves "he" as referring to "Jack".
Morphological Analysis studies the internal structure of words and how they are built from morphemes.

Components of Morphology
1. Inflectional Morphology
o Changes the tense, number, or gender of a word without altering its
meaning.
o Example:
▪ "run" → "running" (present participle)
▪ "book" → "books" (plural)
2. Derivational Morphology
o Creates new words by adding prefixes or suffixes, changing the word’s
meaning.
o Example:
▪ "happy" → "unhappy" (prefix changes meaning)
▪ "teach" → "teacher" (suffix changes word type)
Topic 6: TOKENIZATION
Tokenization is the act of breaking down text into individual units, usually words or phrases. These fragments, called tokens, enable machines to navigate and understand the complexities of human language.
Types of Tokenization
1. Word Tokenization
o The text is divided into individual words.
o Example:
▪ Text: "I love programming."
▪ Tokens: ["I", "love", "programming"]
2. Sentence Tokenization
o The text is divided into sentences.
o Example:
▪ Text: "I love programming. It's fun!"
▪ Tokens: ["I love programming.", "It's fun!"]
3. Subword Tokenization
o The text is broken down into smaller units than words, typically used in
modern NLP models like BERT or GPT.
o Example:
▪ Text: "unhappiness"
▪ Tokens: ["un", "happiness"]
▪ This is useful for dealing with out-of-vocabulary words or rare
words.
Tokenization Techniques
1. Whitespace Tokenization
o The text is split based on spaces between words.
o Example: "I love programming" → ["I", "love", "programming"]
o Simple but not ideal for handling punctuation.
2. Punctuation-Based Tokenization
o Treats punctuation marks as individual tokens.
o Example: "Hello, world!" → ["Hello", ",", "world", "!"]
3. Regular Expression Tokenization
o Uses regular expressions (regex) to split text based on patterns, allowing
for more control.
o Example: A regex pattern can capture words and punctuation marks separately (see the sketch after this list).
4. Byte Pair Encoding (BPE)
o A subword tokenization technique that splits rare words into more
frequent subwords.
o Example: "unhappiness" → ["un", "happiness"]
o Commonly used in neural machine translation and large language
models.
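As a rough illustration of techniques 1–3 above, the sketch below uses only Python's standard re module; the sample sentence and the regex pattern are illustrative choices, not prescribed by these notes.

import re

text = "Hello, world! I love programming."

# 1. Whitespace tokenization: split on spaces only, so punctuation stays attached.
whitespace_tokens = text.split()
# ['Hello,', 'world!', 'I', 'love', 'programming.']

# 2/3. Punctuation-aware / regex tokenization: capture runs of word characters
# or single punctuation marks as separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['Hello', ',', 'world', '!', 'I', 'love', 'programming', '.']

print(whitespace_tokens)
print(regex_tokens)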
Topic 7 : STEMMING
Stemming is the process of reducing words to their root or base form by
removing suffixes and prefixes. It helps in text normalization, making different forms
of a word comparable in NLP tasks.
Examples of Stemming
Running → Run
Happily → Happi
Studies → Studi
Flying → Fli
Note: Stemming may not always return valid words (e.g., "happily" → "happi"), which
is a limitation.
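The exact stemmed form depends on the algorithm used (Porter, Snowball, Lancaster, etc.), so the table above should be read as approximate. A minimal sketch using NLTK's Porter stemmer (assuming the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "happily", "studies", "flying", "university"]:
    # Porter stemming strips common suffixes; results are not always valid words,
    # e.g. "running" -> "run", "studies" -> "studi", "flying" -> "fli".
    print(word, "->", stemmer.stem(word))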
Limitations of Stemming
• Over-Stemming: Reduces words too much, making them hard to understand.
o Example: "university" → "univers".
• Under-Stemming: Does not stem enough, leaving similar words ungrouped.
o Example: "running" and "runner" remain different.
• Not Language-Specific: Most stemming algorithms work best for English and
may not handle other languages effectively.
Topic 8 : LEMMATIZATION
Lemmatization is the process of reducing a word to its base or dictionary form
(called a lemma) using linguistic rules. Unlike stemming, lemmatization ensures that
the root word is a valid word.
Examples of Lemmatization
Running → Run
Better → Good
Studies → Study
Mice → Mouse
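A minimal sketch using NLTK's WordNet lemmatizer (an assumed choice for this example; it requires downloading the WordNet data). The part of speech is supplied by hand here, which is what lets "better" map to "good":

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet dictionary
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))  # expected: run
print(lemmatizer.lemmatize("better", pos="a"))   # expected: good
print(lemmatizer.lemmatize("studies", pos="n"))  # expected: study
print(lemmatizer.lemmatize("mice", pos="n"))     # expected: mouse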
Types of Lemmatization
1. Dictionary-Based Lemmatization: Uses predefined word lists to find the
lemma.
2. Rule-Based Lemmatization: Uses grammar rules to determine the lemma.
3. POS-Aware Lemmatization: Considers the part of speech before lemmatizing
(e.g., "saw" as a noun stays "saw", but as a verb becomes "see").
Challenges in Lemmatization
• Context Sensitivity: "Left" could mean past tense of "leave" or a direction.
• Computational Cost: Slower than stemming due to dictionary lookups.
• Language-Specific Rules: Lemmatization must be adapted for different
languages.
Formula for TF
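TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)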
Example of TF Calculation
Consider a document:
"Natural Language Processing (NLP) is a field of AI. NLP helps machines understand
human language."
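Assuming simple whitespace tokenization (ignoring punctuation and case), the document contains 15 words. The word "NLP" appears twice, so TF("NLP") = 2 / 15 ≈ 0.13; likewise TF("language") = 2 / 15 ≈ 0.13, while a word such as "machines" that appears once has TF = 1 / 15 ≈ 0.07.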
Why is TF Important?
• Highlights frequently used words in a document.
• Helps in text analysis and classification.
• Used in TF-IDF (Term Frequency-Inverse Document Frequency) to rank words
in search engines.
Limitations of TF
• Common words (e.g., "the", "is", "a") may have high TF but add little meaning.
• TF alone does not consider how important a word is in a larger collection of
documents (handled by IDF).
TF-IDF = TF × IDF, where:
• TF (Term Frequency): Measures how often a word appears in a document.
• IDF (Inverse Document Frequency): Reduces the weight of common words
across documents.
• The term frequency (TF) is a measure of how frequently a term appears in a
document. It is calculated by dividing the number of times a term appears in a
document by the total number of words in the document. The resulting value is a
number between 0 and 1.
• The inverse document frequency (IDF) is a measure of how important a term is
across all documents in the corpus. It is calculated by taking the logarithm of the
total number of documents in the corpus divided by the number of documents in
which the term appears. The resulting value is a number greater than or equal to
0.
The TF-IDF score is calculated by multiplying the term frequency and the inverse
document frequency. The higher the TF-IDF score, the more important the term is in
the document.
TF-IDF = TF × IDF = TF × log(N / DF)
Where:
• TF is the term frequency of a word in a document
• N is the total number of documents in the corpus
• DF is the document frequency of a word in the corpus (i.e., the number of
documents that contain the word)
Now, let’s say we want to calculate the TF-IDF scores for the word “fox” in each document of a five-document corpus (Doc1–Doc5).
TF = (Number of times word appears in the document) / (Total number of words in the
document)
Doc1: 1 / 9
Doc2: 0 / 8
Doc3: 1 / 7
Doc4: 2 / 8
Doc5: 1 / 6
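One way to complete the calculation (using a base-10 logarithm, which is a convention rather than a requirement): “fox” appears in four of the five documents (all except Doc2), so DF = 4 and IDF = log(5 / 4) ≈ 0.097. Multiplying TF by IDF gives approximately 0.011 for Doc1, 0 for Doc2, 0.014 for Doc3, 0.024 for Doc4, and 0.016 for Doc5.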
Therefore, the TF-IDF score for the word “fox” is highest in Doc4 indicating that
this word is relatively important in this document compared to the rest of the corpus.
On the other hand, the TF-IDF score is zero in Doc2, indicating that the word “fox” is
not relevant in this document.
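The calculation above can be reproduced with a short, self-contained Python sketch. The original five documents are not reproduced in these notes, so the corpus below is a hypothetical one constructed only to match the stated word counts and "fox" frequencies:

import math

docs = [
    "the quick brown fox jumps over the lazy dog",   # Doc1: 9 words, "fox" once
    "a stitch in time saves nine every day",         # Doc2: 8 words, no "fox"
    "the fox ran into the dark forest",              # Doc3: 7 words, "fox" once
    "one fox chased another fox across the field",   # Doc4: 8 words, "fox" twice
    "the fox slept under a tree",                    # Doc5: 6 words, "fox" once
]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length.
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log of (total documents / documents containing the term).
    df = sum(1 for d in corpus if term in d.split())
    return math.log10(len(corpus) / df) if df else 0.0

for i, d in enumerate(docs, start=1):
    print(f"Doc{i}: TF-IDF('fox') = {tf('fox', d) * idf('fox', docs):.4f}")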
Advantages of TF-IDF
Some of the advantages of using TF-IDF include:
1. Measures relevance: TF-IDF measures the importance of a term in a document,
based on the frequency of the term in the document and the inverse document
frequency (IDF) of the term across the entire corpus. This helps to identify which
terms are most relevant to a particular document.
2. Handles large text corpora: TF-IDF is scalable and can be used with large text
corpora, making it suitable for processing and analyzing large amounts of text
data.
3. Handles stop words: TF-IDF automatically down-weights common words (stop words) that occur frequently across the corpus but carry little meaning or importance, making it a more accurate measure of term importance.
4. Can be used for various applications: TF-IDF can be used for various natural
language processing tasks, such as text classification, information retrieval, and
document clustering.
5. Interpretable: The scores generated by TF-IDF are easy to interpret and
understand, as they represent the importance of a term in a document relative to
its importance across the entire corpus.
6. Works well with different languages: TF-IDF can be used with different
languages and character encodings, making it a versatile technique for
processing multilingual text data.
Limitations of TF-IDF
• Ignores context and word order
• Assumes terms are independent of one another
• Vocabulary size grows with the corpus, producing large, sparse vectors
• Sensitive to how stopwords are handled
Named Entity Recognition (NER) identifies and classifies entities such as people, organizations, and locations mentioned in text.

Example of NER
Sentence:
"Elon Musk is the CEO of Tesla, which is headquartered in California."
Elon Musk → PERSON
Tesla → ORGANIZATION
California → LOCATION
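A minimal sketch using the spaCy library (an assumed choice; it requires the en_core_web_sm model to be downloaded). Note that spaCy's built-in label set uses ORG and GPE rather than ORGANIZATION and LOCATION, and predictions can vary with the model version:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk is the CEO of Tesla, which is headquartered in California.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typically: Elon Musk -> PERSON, Tesla -> ORG, California -> GPE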
Topic 15 : N-grams
What are N-grams?
N-grams are continuous sequences of N words from a given text. They help analyze
word patterns and relationships in Natural Language Processing (NLP).
Types of N-grams
• Unigram (N=1) → Single words
• Bigram (N=2) → Two-word sequences
• Trigram (N=3) → Three-word sequences
• 4-gram, 5-gram, etc. → Longer sequences
Examples of N-grams
Sentence:
"Natural Language Processing is amazing!"
Unigrams (N=1): ["Natural", "Language", "Processing", "is", "amazing"]
Bigrams (N=2): ["Natural Language", "Language Processing", "Processing is", "is amazing"]
Trigrams (N=3): ["Natural Language Processing", "Language Processing is", "Processing is amazing"]
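The same lists can be generated programmatically; a minimal sketch in plain Python (no external libraries):

def ngrams(tokens, n):
    # Slide a window of length n over the token list and join each window into a string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Natural Language Processing is amazing".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams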
Limitations of N-grams
• Ignores long-range dependencies – Can't capture meaning across entire sentences.
• Data Sparsity – Higher N-grams require large datasets for accuracy.
• Memory Intensive – Large N-grams need more storage & processing power.
Topic 16 : SMOOTHING
Smoothing is a technique used in Natural Language Processing (NLP) and
Probability Estimation to handle zero probabilities in language models. It helps prevent
unseen words or N-grams from having a probability of zero, improving the robustness
of NLP models.
Why is Smoothing Needed?
• When using N-gram language models, some word sequences might not appear
in the training data.
• If an unseen N-gram has zero probability, it can completely disrupt probability-
based text generation or classification.
• Smoothing assigns small nonzero probabilities to unseen N-grams, ensuring better model performance (a minimal sketch of one such technique follows below).
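As a concrete illustration, the sketch below applies add-one (Laplace) smoothing, one common smoothing technique, to bigram probabilities; the three-sentence corpus is purely illustrative, and sentence boundaries are ignored for simplicity:

from collections import Counter

corpus = ["i love nlp", "i love programming", "nlp is fun"]
tokens = [w for sentence in corpus for w in sentence.split()]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

def smoothed_prob(w1, w2):
    # Add-one smoothing: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(smoothed_prob("i", "love"))    # seen bigram -> relatively high probability
print(smoothed_prob("love", "fun"))  # unseen bigram -> small but nonzero probability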
3. Good-Turing Smoothing
• Adjusts probabilities of unseen N-grams based on low-frequency words.
• If a word appears only once, its adjusted count is estimated from the number of words appearing twice, three times, and so on.
• Used in speech recognition and language modeling.
Formula:
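In its standard form, the Good-Turing adjusted count for an N-gram seen c times is c* = (c + 1) × N(c+1) / N(c), where N(c) is the number of distinct N-grams that occur exactly c times; the total probability mass reserved for unseen N-grams is N(1) / N, the proportion of singletons among all N observed N-grams.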