Unsmoothed N-grams in NLP Analysis
VIVEKANANDHA COLLEGE OF ARTS AND SCIENCES FOR WOMEN (Autonomous)
Course Material
Department: Computer Science and Applications
Class & Section: I CS, II Sem
Academic Year: 2024-25
Unit-II
Word Level Analysis : Unsmoothed N-grams
In Natural Language Processing (NLP), unsmoothed N-grams refer to sequences of n words
analyzed without applying any smoothing techniques to handle zero probabilities for unseen word
sequences. This raw approach can provide insights into the exact frequency and co-occurrence of words in a
text corpus, making it foundational for many tasks in NLP.
Key Concepts in Unsmoothed N-grams
1. N-grams:
o An N-gram is a contiguous sequence of n words from a given text.
 Unigram: n = 1 (single words).
 Bigram: n = 2 (word pairs).
 Trigram: n = 3 (three-word sequences).
 And so on.
2. Unsmoothing:
o In unsmoothed N-grams, the probabilities of word sequences are calculated directly from
their observed frequencies in the corpus:
P(w_1, w_2, ..., w_n) = Count(w_1, w_2, ..., w_n) / (Total Count of N-grams in the Corpus)
o If a sequence does not occur in the training data, its probability is zero.
3. Challenges:
o Data Sparsity: Many word sequences may not appear in the training corpus, resulting in zero
probabilities.
o Unseen Events: The model cannot generalize to unseen sequences, limiting its applicability
in real-world scenarios.
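For the conditional form usually used in language modeling, the unsmoothed (maximum-likelihood) estimate is P(w2 | w1) = Count(w1, w2) / Count(w1). A minimal sketch in Python, using a toy corpus invented here for illustration:

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized text.
tokens = "the cat sat on the mat the cat ran".split()

# Count unigrams and bigrams directly from the observed data.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    """Unsmoothed MLE: P(w2 | w1) = Count(w1, w2) / Count(w1)."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("the", "cat"))   # 2/3: "the" occurs 3 times, "the cat" twice
print(bigram_prob("cat", "flew"))  # 0.0: an unseen bigram gets zero probability
```

The second call shows the data-sparsity problem directly: any sequence absent from the corpus receives probability zero.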
Applications of Unsmoothed N-grams
1. Language Modeling:
o N-grams are used to model the probability distribution of word sequences, forming the basis
of many simple language models.
2. Text Analysis:
o Analyze word frequency, co-occurrence, and patterns in text without adjusting probabilities
for rare events.
3. Machine Translation and Summarization:
o Evaluate the structure of phrases and sentences in a corpus.
4. Information Retrieval:
o Identify key phrases or sequences in a text to match user queries.
Advantages of Unsmoothed N-grams
1. Simplicity:
o Easy to calculate and understand.
2. Exact Representation:
o Provides raw insights into word distribution and sequence occurrence.
3. Interpretability:
o No additional assumptions (e.g., smoothing) obscure the results.
Limitations
1. Zero Probabilities:
o Any sequence not in the training data has a probability of zero, making the model unusable in
those cases.
2. Data Dependency:
o Requires large corpora to cover a wide range of word sequences.
3. Scalability:
o As n increases, the number of possible N-grams grows exponentially, leading to the "curse
of dimensionality."
When to Use Unsmoothed N-grams?
1. Exploratory Analysis:
o When analyzing raw word-level patterns in text data.
2. Baseline Models:
o For establishing a benchmark before applying advanced techniques like smoothing or neural
models.
3. Specific NLP Tasks:
o When zero probabilities are not critical, or the focus is on observed patterns only.
Smoothing techniques address the problem of zero probabilities in N-gram models by redistributing some
probability mass from seen N-grams to unseen ones. Evaluating the performance of smoothing methods is
crucial to assess how well a language model generalizes to unseen data and avoids overfitting.
1. Why Smoothing Is Important in N-gram Models
Unseen N-grams: Without smoothing, any unseen N-gram will have a probability of zero, making
the model unusable for predicting sequences containing those N-grams.
Better Generalization: Smoothing ensures the model can handle rare or unseen word sequences
effectively.
Improved Perplexity: By redistributing probability mass, smoothing generally leads to lower
perplexity on test data, indicating better predictions.
2. Common Smoothing Techniques
1. Laplace (Add-One) Smoothing:
o Adds 1 to the count of every possible N-gram to avoid zero probabilities.
o Formula: P(w_n | w_{n-1}) = (Count(w_{n-1}, w_n) + 1) / (Count(w_{n-1}) + V),
where V is the vocabulary size.
2. Add-k Smoothing:
o Generalizes Laplace smoothing by adding a smaller constant k > 0.
o Reduces the overestimation of probabilities for unseen N-grams compared to Laplace
smoothing.
3. Good-Turing Smoothing:
o Adjusts the probability of seen and unseen N-grams based on the counts of N-grams with
similar frequencies.
o Effective for redistributing probability mass to unseen events.
4. Kneser-Ney Smoothing:
o Combines absolute discounting with backing off to lower-order models.
o Specifically designed for language modeling and is often the most effective for N-grams.
o Captures the diversity of contexts where a word appears.
5. Backoff and Interpolation:
o Backoff: Uses lower-order N-grams when higher-order N-grams are unavailable.
o Interpolation: Combines probabilities from higher- and lower-order N-grams.
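The add-k formula above (with k = 1 giving Laplace smoothing) can be sketched in a few lines of Python; the toy corpus is invented for illustration:

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
V = len(set(tokens))  # vocabulary size

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def laplace_bigram_prob(w1, w2, k=1):
    """Add-k smoothed bigram probability:
    P(w2 | w1) = (Count(w1, w2) + k) / (Count(w1) + k * V)."""
    return (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * V)

print(laplace_bigram_prob("the", "cat"))  # 2/7: seen bigram, count boosted by 1
print(laplace_bigram_prob("cat", "mat"))  # 1/6: unseen bigram, small but non-zero
```

Note how the unseen bigram ("cat", "mat") now gets a small positive probability instead of zero, at the cost of slightly deflating the seen bigrams.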
3. Metrics for Evaluating Smoothing Techniques
3.1 Perplexity
Measures how well the model predicts a test dataset.
Lower perplexity indicates better predictions.
Use: Compare perplexity across different smoothing methods to determine the most effective one.
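Perplexity is the exponentiated average negative log-probability the model assigns to the test words: PP = exp(-(1/N) Σ log p_i). A minimal sketch, using hypothetical per-word probabilities:

```python
import math

# Hypothetical probabilities a language model assigns to each word of a test sentence.
probs = [0.2, 0.1, 0.25, 0.05]

# Perplexity = exp(-(1/N) * sum(log p_i)); lower means better predictions.
N = len(probs)
perplexity = math.exp(-sum(math.log(p) for p in probs) / N)
print(round(perplexity, 2))  # ≈ 7.95
```

Equivalently, perplexity is the inverse geometric mean of the probabilities, so a model that is "less surprised" by the test data scores lower.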
3.2 Coverage
Measures how many N-grams in the test set have non-zero probabilities after smoothing.
Higher coverage: Indicates that the smoothing method successfully handles unseen N-grams.
3.3 Precision and Recall
Evaluate how accurately the smoothed model predicts N-grams compared to reference sequences.
Use: Helpful in tasks like machine translation or text generation.
3.4 BLEU/ROUGE Scores
Evaluate the impact of smoothing on downstream tasks like machine translation (BLEU) or
summarization (ROUGE).
Higher scores: Indicate that the smoothing method improves the quality of generated text.
4. Practical Considerations
1. Dataset Size:
o Smaller datasets often require more aggressive smoothing techniques like Laplace or Add-k.
o Larger datasets can benefit from advanced methods like Kneser-Ney smoothing.
2. Vocabulary Size:
o Larger vocabularies increase the number of unseen N-grams, making effective smoothing
essential.
3. Higher-Order N-grams:
o Higher n (e.g., trigrams, 4-grams) suffer more from data sparsity, making advanced
smoothing methods critical.
4. Task-Specific Requirements:
o Some tasks (e.g., ASR, machine translation) may benefit more from sophisticated smoothing
techniques like Kneser-Ney due to their contextual sensitivity.
5. Interpreting Results
Perplexity: Use test data to compare perplexity scores across smoothing methods.
Probabilities: Compare how each method redistributes probability mass to unseen or rare N-grams.
Task Performance: Evaluate BLEU/ROUGE scores or other task-specific metrics to determine how
smoothing impacts downstream tasks.
1. Interpolation
Definition
 Interpolation combines probabilities from higher-order and lower-order N-grams, rather than relying
solely on the highest available N-gram.
 The idea is to use information from all N-gram levels, weighting them appropriately.
Mathematical Representation
For a trigram model (n = 3):
P(w_i | w_{i-2}, w_{i-1}) = λ1 · P(w_i | w_{i-2}, w_{i-1}) + λ2 · P(w_i | w_{i-1}) + λ3 · P(w_i),
where λ1 + λ2 + λ3 = 1.
Characteristics
All levels (unigram, bigram, trigram, etc.) contribute to the final probability.
Weights (λ) can be determined through techniques like grid search or expectation-
maximization (EM) based on a held-out dataset.
Advantages
Smoother probability distribution compared to relying solely on higher-order N-grams.
Reduces the impact of data sparsity by leveraging lower-order N-grams.
Applications
Language modeling (e.g., speech recognition, machine translation).
Predictive text generation.
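The interpolation formula for a trigram model is simple to sketch; the component probabilities and λ weights below are invented for illustration (in practice the λs are tuned on held-out data):

```python
# Hypothetical component probabilities estimated from a corpus.
p_trigram = 0.0    # P(w | w1, w2): the trigram was never seen
p_bigram = 0.3     # P(w | w2)
p_unigram = 0.05   # P(w)

# Interpolation weights; must sum to 1.
l1, l2, l3 = 0.6, 0.3, 0.1

p_interp = l1 * p_trigram + l2 * p_bigram + l3 * p_unigram
print(p_interp)  # 0.095: non-zero even though the trigram count is zero
```

Because every level contributes, an unseen trigram no longer forces a zero probability; the bigram and unigram terms carry it.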
2. Backoff
Definition
Backoff uses lower-order N-grams only when higher-order N-grams are unavailable or have zero
probability.
Unlike interpolation, backoff does not combine probabilities; it falls back to lower-order
probabilities as needed.
Mathematical Representation
For a trigram model:
P(w_i | w_{i-2}, w_{i-1}) =
 P(w_i | w_{i-2}, w_{i-1})   if the trigram exists
 α · P(w_i | w_{i-1})        otherwise (back off)
 α: Backoff weight to ensure proper normalization of probabilities.
Characteristics
Probabilities from lower-order N-grams are used only when necessary.
A normalization factor (α) ensures the model’s probabilities sum to 1.
Advantages
Simpler implementation compared to interpolation.
Efficient when the corpus contains sufficient higher-order N-grams.
Applications
Language modeling in applications like text-to-speech (TTS) and auto-completion.
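A minimal sketch of the backoff idea, in the style of "stupid backoff": the fixed α = 0.4 here is an illustrative constant, whereas true Katz backoff computes α per context so that the probabilities stay normalized. The toy corpus is invented:

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def backoff_prob(w1, w2, w3, alpha=0.4):
    """Use the trigram estimate if the trigram was seen;
    otherwise back off to the (weighted) bigram, then unigram."""
    if trigram_counts[(w1, w2, w3)] > 0:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts[(w2, w3)] > 0:
        return alpha * bigram_counts[(w2, w3)] / unigram_counts[w2]
    return alpha * alpha * unigram_counts[w3] / len(tokens)

print(backoff_prob("the", "cat", "sat"))  # 1.0: trigram seen, used directly
print(backoff_prob("cat", "the", "mat"))  # 0.2: unseen trigram, backs off to bigram
```

Unlike interpolation, only one level's estimate is used per query; lower orders are consulted strictly as a fallback.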
Word Classes
In Natural Language Processing (NLP), word classes (also referred to as parts of speech
(POS) or syntactic categories) are used to group words based on their grammatical roles, syntactic behavior,
and function within a sentence. Word classes help in analyzing, understanding, and generating human
language computationally.
Part-of-Speech Tagging
Part-of-Speech Tagging is the process of assigning word classes or grammatical categories
(e.g., noun, verb, adjective) to each word in a given text based on its context. It is a fundamental step in
many NLP applications as it helps in understanding the syntactic structure and meaning of a sentence.
1. Why POS Tagging is Important
1. Syntactic Analysis:
o Identifies the grammatical structure of sentences for parsing and sentence analysis.
2. Semantic Understanding:
o Determines word meaning based on context (e.g., "book" as a noun or verb).
3. Downstream Applications:
o Named Entity Recognition (NER): Identifies proper nouns.
o Machine Translation: Ensures grammatically correct output.
o Text Summarization: Extracts key phrases based on POS.
o Sentiment Analysis: Leverages adjectives and adverbs to detect sentiment.
2. How POS Tagging Works
Steps in POS Tagging:
1. Tokenization:
o Split the input text into individual words or tokens.
o Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat"]
2. Assigning POS Tags:
o Each token is assigned a tag based on:
Rule-Based Methods: Grammar rules.
Statistical Models: Probabilities derived from training data.
Deep Learning Models: Neural networks that learn contextual relationships.
POS Tags
Commonly used POS tagging schemes:
1. Penn Treebank POS Tag Set (for English):
NN: Noun (singular)
VB: Verb (base form)
JJ: Adjective
RB: Adverb
IN: Preposition
2. Universal POS Tag Set:
NOUN, VERB, ADJ, ADV, PRON, etc.
3. Techniques for POS Tagging
1. Rule-Based Tagging
Relies on manually defined grammar rules.
Example:
o "If a word ends with '-ing', tag it as a verb (VB)."
Limitation:
o Cannot handle complex or ambiguous contexts effectively.
2. Statistical Tagging
Uses probabilistic models trained on labeled data.
Examples:
1. NLTK:
o The Natural Language Toolkit provides pos_tag for statistical POS tagging:
import nltk
from nltk import word_tokenize, pos_tag
sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(pos_tags)
2. SpaCy:
o Provides fast and efficient POS tagging.
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = "The quick brown fox jumps over the lazy dog."
doc = nlp(sentence)
for token in doc:
    print(f"{token.text}: {token.pos_}")
3. Stanford CoreNLP:
o A highly accurate library for POS tagging, using statistical models.
4. Flair:
o A deep learning library specialized in POS tagging and other NLP tasks.
5. BERT-based Models:
o Pre-trained transformer models like BERT can perform POS tagging with fine-tuning.
Rule-Based NLP
Rule-Based NLP involves the use of hand-crafted linguistic rules and patterns to process,
analyze, and generate human language. It is one of the oldest approaches in NLP and relies on predefined
rules, lexicons, and grammar to achieve language understanding or generation.
1. What is Rule-Based NLP?
In a rule-based system, language processing is based on a set of manually defined rules. These rules are
created by linguists or domain experts and are used to identify patterns in text or to define how language
elements interact.
Example:
o Rule: If a word ends with "-ing," it is likely a verb.
o Rule: If "not" appears before an adjective, classify it as negative sentiment.
Key Components:
1. Lexicons: Word lists or dictionaries with associated features (e.g., part of speech, polarity).
2. Grammar Rules: Syntax and morphology rules (e.g., subject-verb agreement, noun phrase
structure).
3. Pattern Matching: Matching text to specific patterns (e.g., regular expressions).
4. Rule Engine: A system that applies rules to text.
2. Applications of Rule-Based NLP
1. Part-of-Speech (POS) Tagging
Rule-based taggers assign POS tags to words using linguistic rules.
Example Rule:
o If the preceding word is a determiner (e.g., "the"), tag the current word as a noun.
2. Named Entity Recognition (NER)
Detect entities like names, dates, and locations using patterns.
Example Rule:
o If a word starts with a capital letter and is followed by "Inc." or "Ltd.," classify it as an
organization.
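A rule like this can be expressed directly as a regular expression; the sample sentence and company names are invented for illustration:

```python
import re

# Hypothetical rule: a capitalized word followed by "Inc." or "Ltd." is an organization.
text = "She works at Acme Inc. and visited Globex Ltd. last year."
orgs = re.findall(r'\b([A-Z][a-zA-Z]+)\s+(?:Inc|Ltd)\.', text)
print(orgs)  # ['Acme', 'Globex']
```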
3. Sentiment Analysis
Identify positive or negative sentiment using sentiment word lexicons and negation rules.
Example Rule:
o If "not" appears before a positive word (e.g., "not good"), classify it as negative.
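The negation rule above can be sketched as a toy lexicon-based classifier; the word list and scoring scheme are simplified assumptions, not a production sentiment system:

```python
positive_words = {"good", "great", "excellent"}

def simple_sentiment(sentence):
    """Toy lexicon rule: a positive word counts as negative if 'not' precedes it."""
    tokens = sentence.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        if tok in positive_words:
            if i > 0 and tokens[i - 1] == "not":
                score -= 1
            else:
                score += 1
    return "negative" if score < 0 else "positive" if score > 0 else "neutral"

print(simple_sentiment("The food was good"))      # positive
print(simple_sentiment("The food was not good"))  # negative
```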
4. Text Normalization
Handle text preprocessing tasks like stemming and lemmatization using rules.
Example Rule:
o If a word ends in "ing," remove "ing" (e.g., "running" → "run").
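As a sketch, the "-ing" rule can be written as a naive suffix stripper; the consonant-doubling check is an added simplification, not real morphological analysis like a lemmatizer performs:

```python
def strip_ing(word):
    """Toy normalization rule: drop a trailing 'ing'."""
    if word.endswith("ing") and len(word) > 4:
        stem = word[:-3]
        # Undo consonant doubling: "running" -> "runn" -> "run".
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return word

print(strip_ing("running"))  # run
print(strip_ing("reading"))  # read
print(strip_ing("sing"))     # sing (too short to be a suffix)
```

The last case shows why such rules need guards: blindly stripping "ing" would turn "sing" into "s".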
5. Spell Checking
Correct spelling errors by comparing against a dictionary and applying transformation rules.
6. Information Extraction
Extract structured information from unstructured text using templates and rules.
Example:
o Extract dates in the format "DD-MM-YYYY" using regex patterns.
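The DD-MM-YYYY template maps directly onto a regex; the sentence is invented for illustration:

```python
import re

text = "The meeting moved from 03-01-2025 to 15-02-2025."
dates = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)
print(dates)  # ['03-01-2025', '15-02-2025']
```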
7. Question Answering
Use rules to detect question types and retrieve relevant information.
Example Rule:
o If a question starts with "Who," retrieve entities tagged as "Person."
3. Advantages of Rule-Based NLP
1. Interpretability:
o Rules are explicit and easy to understand.
o Useful in domains where decisions need to be explainable.
2. Domain Adaptability:
o Rules can be customized for specific languages, industries, or tasks.
3. Low Data Dependency:
o Does not require large labeled datasets for training.
4. Deterministic Behavior:
o Outputs are predictable and consistent.
4. Limitations of Rule-Based NLP
1. Scalability:
o Creating and maintaining a large number of rules is time-consuming and labor-intensive.
2. Coverage:
o Rules may fail to handle edge cases, ambiguities, or new language patterns.
3. Context Sensitivity:
o Difficult to account for context or nuances of natural language effectively.
4. Maintenance:
o Rules need to be updated frequently to keep up with evolving language and domain-specific
terms.
5. Generalization:
o Rule-based systems often struggle with unseen data or out-of-vocabulary words.
5. Examples of Rule-Based NLP Techniques
1. Regular Expressions (Regex)
Used for pattern matching in text.
Example:
o Extract email addresses:
import re
text = "Contact us at support@example.com."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails)
Output: ['support@example.com']
2. Rule-Based POS Tagging
Example using NLTK:
from nltk.tag import RegexpTagger
patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*', 'NN')]
tagger = RegexpTagger(patterns)
print(tagger.tag(["running", "jumped", "dog"]))
6. Context Sensitivity
POS tags often depend on the broader sentence or paragraph context, which simple models may fail
to capture.
o Example:
"He likes to fish" ("fish" as a verb)
"The fish is fresh" ("fish" as a noun)
7. Inconsistent Annotation Standards
Different corpora use different POS tagsets or annotation guidelines, leading to inconsistency in
tagging models.
o Example:
Universal POS Tagset (simpler): "book" → VERB
Penn Treebank Tagset (granular): "book" → VB
8. Polysemy and Homonymy
Polysemy: Words with multiple related meanings.
o Example: "run" → a physical action (VB) or a race (NN).
Homonymy: Words with unrelated meanings but identical spelling.
o Example: "bank" → a financial institution (NN) or the side of a river (NN).
9. Noisy Text
Tagging becomes difficult in non-standard or informal text formats, such as:
o Social Media Text: Contains abbreviations, emojis, and slang.
Example: "u r gr8" → "you are great."
o Speech Transcriptions: May include disfluencies and fillers.
Example: "Um, I think I like, uh, coffee."
10. Compound Words
Words like "ice-cream" or "well-being" can be misinterpreted as separate tokens or misclassified.
11. Handling Code-Switching
In multilingual contexts, speakers often switch between languages mid-sentence.
o Example:
"I need to book a taxi जल्दी से" (English + Hindi; "जल्दी से" means "quickly").
12. Evaluation and Metrics
Evaluating POS taggers is challenging due to:
o Different annotation schemes.
o Disagreement between annotators in ambiguous cases.
o Metric limitations: Precision, recall, and F1 may not always reflect real-world performance.
13. Dependency on Training Data
Quality of Training Data:
o Poorly annotated corpora result in models learning incorrect patterns.
Bias in Data:
o Models trained on biased datasets may perform poorly in diverse contexts.
14. Memory and Computational Constraints
Resource-heavy models like CRFs or neural networks may not work well on devices with limited
computational power.
1. Hidden Markov Models (HMM)
An HMM tagger treats tags as hidden states and words as observations, using transition and
emission probabilities learned from a tagged corpus.
 Viterbi Algorithm:
o A dynamic programming algorithm to compute the most probable tag sequence efficiently.
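A compact sketch of the Viterbi algorithm over a two-tag toy model; all tag names and probabilities below are invented for illustration, not learned from data:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under an HMM, via dynamic programming."""
    # V[t] maps each tag to (best path probability ending in that tag, that path).
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            # Best predecessor tag for reaching `t` at this position.
            best_prev = max(tags, key=lambda p: V[-1][p][0] * trans_p[p][t])
            prob = V[-1][best_prev][0] * trans_p[best_prev][t] * emit_p[t].get(w, 0.0)
            col[t] = (prob, V[-1][best_prev][1] + [t])
        V.append(col)
    return max(V[-1].values(), key=lambda x: x[0])[1]

# Hypothetical two-tag model.
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.2, "VERB": 0.8},
           "VERB": {"NOUN": 0.7, "VERB": 0.3}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1},
          "VERB": {"dogs": 0.1, "bark": 0.6}}

print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))  # ['NOUN', 'VERB']
```

By keeping only the best path into each tag at each position, Viterbi avoids enumerating all tag sequences, which would be exponential in sentence length.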
Advantages of HMM
Simplicity: Easy to implement and interpret.
Probabilistic Framework: Provides a natural way to handle uncertainty in language.
Disadvantages of HMM
1. Strong Independence Assumptions:
o Assumes that the current state depends only on the previous state and the current word.
2. Data Sparsity:
o Struggles with unseen words or rare transitions.
3. Fixed Features:
o Cannot incorporate rich contextual features easily.
2. Maximum Entropy Models (MaxEnt)
Maximum Entropy Models, also known as log-linear models, are discriminative models used for
classification tasks. MaxEnt models predict the conditional probability of a class (e.g., a tag) given an input
feature vector.
Key Concepts in MaxEnt
1. Feature Representation:
o Captures contextual information, e.g., surrounding words, word suffixes, and capitalization.
o Example Features:
Current Word: "cat"
Previous Word: "The"
Is Capitalized: False
2. Conditional Probability:
o Computes the probability of a class (tag) given the features.
3. Training:
o Maximize the log-likelihood of the training data using optimization algorithms (e.g., gradient
descent).
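The feature representation described above can be sketched as a simple extractor function; the feature names here are hypothetical, chosen to mirror the examples in the list:

```python
def extract_features(tokens, i):
    """Build a feature dict for token i: identity, left context, shape, suffix."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "is_capitalized": word[0].isupper(),
        "suffix_2": word[-2:],
    }

tokens = ["The", "cat", "sat"]
print(extract_features(tokens, 1))
# {'word': 'cat', 'prev_word': 'the', 'is_capitalized': False, 'suffix_2': 'at'}
```

Dictionaries like this are what a MaxEnt (log-linear) classifier consumes: each key-value pair becomes a weighted feature in the model, and the features may freely overlap, unlike in an HMM.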
Advantages of MaxEnt
1. Rich Features:
o Can incorporate arbitrary, overlapping, and non-independent features.
2. Flexibility:
o No need for independence assumptions (unlike HMM).
3. Discriminative:
o Directly models the conditional probability P(y | x).
Disadvantages of MaxEnt
1. Computational Cost:
o Training can be expensive, especially with many features.
2. Overfitting:
o Requires regularization to avoid overfitting on the training data.
3. Data Dependence:
o Performance depends heavily on feature engineering and quality of labeled data.
3. Comparison of HMM and MaxEnt
Aspect      | HMM                                           | MaxEnt
Model Type  | Generative: models joint probability P(x, y)  | Discriminative: models conditional probability P(y | x)