VIVEKANANDHA COLLEGE OF ARTS AND SCIENCES FOR WOMEN
(Autonomous)

Course Material

Department: Computer Science and Applications
Programme: M.Sc. (CS)
Course Title: Natural Language Processing
Course Code: 24P2CSDE05
Class & Section: I M.Sc. (CS)
Semester & Academic Year: II Sem, 2024-25
Handling Staff: [Link] (Associate Professor)
Staff Incharge    HoD    Principal



Unit-II
Word Level Analysis : Unsmoothed N-grams
In Natural Language Processing (NLP), unsmoothed N-grams refer to sequences of n words
analyzed without applying any smoothing techniques to handle zero probabilities for unseen word
sequences. This raw approach can provide insights into the exact frequency and co-occurrence of words in a
text corpus, making it foundational for many tasks in NLP.
Key Concepts in Unsmoothed N-grams
1. N-grams:
o An N-gram is a contiguous sequence of n words from a given text.
 Unigram: n = 1 (single words).
 Bigram: n = 2 (word pairs).
 Trigram: n = 3 (three-word sequences).
 And so on.
2. Unsmoothing:
o In unsmoothed N-grams, the probabilities of word sequences are calculated directly from
their observed frequencies in the corpus:
P(w1, w2, ..., wn) = Count(w1, w2, ..., wn) / (Total count of N-grams in the corpus)
o If a sequence does not occur in the training data, its probability is zero.
3. Challenges:
o Data Sparsity: Many word sequences may not appear in the training corpus, resulting in zero
probabilities.
o Unseen Events: The model cannot generalize to unseen sequences, limiting its applicability
in real-world scenarios.
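The definitions above can be sketched directly in Python. The toy corpus and function name below are illustrative, not from the course material; the estimate shown is the conditional maximum-likelihood form Count(w_prev, w) / Count(w_prev):

```python
from collections import Counter

# Toy corpus: each sentence is a list of tokens (illustrative data)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p_unsmoothed(w_prev, w):
    """Relative-frequency (MLE) bigram probability: Count(w_prev, w) / Count(w_prev)."""
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_unsmoothed("the", "cat"))    # 2/3: "cat" follows "the" in two of three sentences
print(p_unsmoothed("the", "mouse"))  # 0.0: an unseen bigram gets exactly zero probability
```

The second call illustrates the data-sparsity problem: any sequence absent from the training corpus is assigned probability zero.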
Applications of Unsmoothed N-grams
1. Language Modeling:
o N-grams are used to model the probability distribution of word sequences, forming the basis
of many simple language models.
2. Text Analysis:
o Analyze word frequency, co-occurrence, and patterns in text without adjusting probabilities
for rare events.
3. Machine Translation and Summarization:
o Evaluate the structure of phrases and sentences in a corpus.
4. Information Retrieval:
o Identify key phrases or sequences in a text to match user queries.
Advantages of Unsmoothed N-grams
1. Simplicity:
o Easy to calculate and understand.
2. Exact Representation:
o Provides raw insights into word distribution and sequence occurrence.
3. Interpretability:
o No additional assumptions (e.g., smoothing) obscure the results.

Limitations
1. Zero Probabilities:
o Any sequence not in the training data has a probability of zero, making the model unusable in
those cases.
2. Data Dependency:
o Requires large corpora to cover a wide range of word sequences.
3. Scalability:
o As n increases, the number of possible N-grams grows exponentially, leading to the "curse
of dimensionality."
When to Use Unsmoothed N-grams?
1. Exploratory Analysis:
o When analyzing raw word-level patterns in text data.
2. Baseline Models:
o For establishing a benchmark before applying advanced techniques like smoothing or neural
models.
3. Specific NLP Tasks:
o When zero probabilities are not critical, or the focus is on observed patterns only.

Evaluating N-grams Smoothing

Smoothing techniques address the problem of zero probabilities in N-gram models by redistributing some
probability mass from seen N-grams to unseen ones. Evaluating the performance of smoothing methods is
crucial to assess how well a language model generalizes to unseen data and avoids overfitting.
1. Why Smoothing Is Important in N-gram Models
 Unseen N-grams: Without smoothing, any unseen N-gram will have a probability of zero, making
the model unusable for predicting sequences containing those N-grams.
 Better Generalization: Smoothing ensures the model can handle rare or unseen word sequences
effectively.
 Improved Perplexity: By redistributing probability mass, smoothing generally leads to lower
perplexity on test data, indicating better predictions.
2. Common Smoothing Techniques
1. Laplace (Add-One) Smoothing:
o Adds 1 to the count of every possible N-gram to avoid zero probabilities.
o Formula (bigram case): P(wn | wn−1) = (Count(wn−1, wn) + 1) / (Count(wn−1) + V), where V is the vocabulary size.
2. Add-k Smoothing:
o Generalizes Laplace smoothing by adding a smaller constant k > 0 instead of 1.
o Reduces the overestimation of probabilities for unseen N-grams compared to Laplace
smoothing.
3. Good-Turing Smoothing:
o Adjusts the probability of seen and unseen N-grams based on the counts of N-grams with
similar frequencies.
o Effective for redistributing probability mass to unseen events.
4. Kneser-Ney Smoothing:
o Combines absolute discounting with backing off to lower-order models.

o Specifically designed for language modeling and is often the most effective for N-grams.
o Captures the diversity of contexts where a word appears.
5. Backoff and Interpolation:
o Backoff: Uses lower-order N-grams when higher-order N-grams are unavailable.
o Interpolation: Combines probabilities from higher- and lower-order N-grams.
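As a sketch of how add-one (Laplace) smoothing removes zeros from the bigram estimate, using a toy corpus and assumed names:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # illustrative data
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size (4 here: the, cat, dog, sat)

def p_laplace(w_prev, w):
    # Add-one smoothing: (Count(w_prev, w) + 1) / (Count(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_laplace("the", "cat"))  # seen bigram: (1 + 1) / (2 + 4) = 1/3
print(p_laplace("the", "sat"))  # unseen bigram: (0 + 1) / (2 + 4) = 1/6, no longer zero
```

Note how probability mass is shifted from the seen bigram (which would be 1/2 unsmoothed) to unseen ones.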
3. Metrics for Evaluating Smoothing Techniques
3.1 Perplexity
 Measures how well the model predicts a test dataset.
 Lower perplexity indicates better predictions.
 Use: Compare perplexity across different smoothing methods to determine the most effective one.
3.2 Coverage
 Measures how many N-grams in the test set have non-zero probabilities after smoothing.
 Higher coverage: Indicates that the smoothing method successfully handles unseen N-grams.
3.3 Precision and Recall
 Evaluate how accurately the smoothed model predicts N-grams compared to reference sequences.
 Use: Helpful in tasks like machine translation or text generation.
3.4 BLEU/ROUGE Scores
 Evaluate the impact of smoothing on downstream tasks like machine translation (BLEU) or
summarization (ROUGE).
 Higher scores: Indicate that the smoothing method improves the quality of generated text.
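Perplexity as described in 3.1 can be computed from per-bigram probabilities as follows; `prob_fn` is an assumed interface standing in for any (smoothed or unsmoothed) model:

```python
import math

def perplexity(tokens, prob_fn):
    """Perplexity over a test sequence: exp(-(1/N) * sum of log P(w_i | w_{i-1}))."""
    log_sum, n = 0.0, 0
    for w_prev, w in zip(tokens, tokens[1:]):
        p = prob_fn(w_prev, w)
        if p == 0.0:
            return float("inf")  # one zero-probability N-gram makes perplexity infinite
        log_sum += math.log(p)
        n += 1
    return math.exp(-log_sum / n)

# A model that always assigns probability 0.5 has perplexity 2
print(perplexity(["a", "b", "a", "b"], lambda prev, w: 0.5))  # ≈ 2.0
```

The early return makes the point in section 1 concrete: an unsmoothed model with a single unseen bigram in the test set scores infinite perplexity, which is why smoothing is evaluated this way.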
4. Practical Considerations
1. Dataset Size:
o Smaller datasets often require more aggressive smoothing techniques like Laplace or Add-k.
o Larger datasets can benefit from advanced methods like Kneser-Ney smoothing.
2. Vocabulary Size:
o Larger vocabularies increase the number of unseen N-grams, making effective smoothing
essential.
3. Higher-Order N-grams:
o Higher n (e.g., trigrams, 4-grams) suffers more from data sparsity, making advanced
smoothing methods critical.
4. Task-Specific Requirements:
o Some tasks (e.g., ASR, machine translation) may benefit more from sophisticated smoothing
techniques like Kneser-Ney due to their contextual sensitivity.
5. Interpreting Results
 Perplexity: Use test data to compare perplexity scores across smoothing methods.
 Probabilities: Compare how each method redistributes probability mass to unseen or rare N-grams.
 Task Performance: Evaluate BLEU/ROUGE scores or other task-specific metrics to determine how
smoothing impacts downstream tasks.

Interpolation and Backoff


Interpolation and Backoff are two commonly used techniques for handling data sparsity in N-gram
models. These methods aim to improve language models' ability to generalize and assign probabilities to
unseen word sequences.
1. Interpolation
Definition

 Interpolation combines probabilities from higher-order and lower-order N-grams, rather than relying
solely on the highest available N-gram.
 The idea is to use information from all N-gram levels, weighting them appropriately.

Mathematical Representation
For a trigram model (n = 3):
P(wi | wi−2, wi−1) = λ1·P(wi) + λ2·P(wi | wi−1) + λ3·P(wi | wi−2, wi−1), where λ1 + λ2 + λ3 = 1

Characteristics
 All levels (unigram, bigram, trigram, etc.) contribute to the final probability.
 Weights (λ) can be determined through techniques like grid search or expectation-maximization (EM) based on a held-out dataset.
Advantages
 Smoother probability distribution compared to relying solely on higher-order N-grams.
 Reduces the impact of data sparsity by leveraging lower-order N-grams.
Applications
 Language modeling (e.g., speech recognition, machine translation).
 Predictive text generation.
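A minimal sketch of linear interpolation for a trigram model, assuming the three component distributions are given as functions; the lambda weights and probability values below are illustrative, not estimated from data:

```python
def interp_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Linearly interpolated trigram probability:
    P(w | w2, w1) = l1*P(w) + l2*P(w | w1) + l3*P(w | w2, w1), with l1 + l2 + l3 = 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w1) + l3 * p_tri(w, w1, w2)

# Toy fixed distributions standing in for corpus-estimated models
p = interp_prob("sat", "cat", "the",
                p_uni=lambda w: 0.1,
                p_bi=lambda w, w1: 0.4,
                p_tri=lambda w, w1, w2: 0.7)
print(p)  # 0.2*0.1 + 0.3*0.4 + 0.5*0.7 = 0.49
```

Because every level contributes, the result is nonzero even when the trigram estimate alone would be zero.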
2. Backoff
Definition
 Backoff uses lower-order N-grams only when higher-order N-grams are unavailable or have zero
probability.
 Unlike interpolation, backoff does not combine probabilities; it falls back to lower-order
probabilities as needed.
Mathematical Representation
For a trigram model:
P(wi | wi−2, wi−1) = P(wi | wi−2, wi−1), if the trigram exists; else, back off to α · P(wi | wi−1)
 α: Backoff weight to ensure proper normalization of probabilities.
Characteristics
 Probabilities from lower-order N-grams are used only when necessary.
 A normalization factor (α) ensures the model’s probabilities sum to 1.
Advantages
 Simpler implementation compared to interpolation.
 Efficient when the corpus contains sufficient higher-order N-grams.
Applications
 Language modeling in applications like text-to-speech (TTS) and auto-completion.
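A simplified backoff score in the "stupid backoff" style can illustrate the fall-back behavior; note that a true Katz backoff computes α so the distribution normalizes, which this sketch does not, and all counts below are illustrative:

```python
def backoff_prob(w, w1, w2, tri, bi, uni, alpha=0.4):
    """Use the trigram relative frequency if the trigram was seen;
    otherwise fall back to alpha * bigram estimate, then alpha^2 * unigram estimate."""
    if tri.get((w2, w1, w), 0) > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi.get((w1, w), 0) > 0:
        return alpha * bi[(w1, w)] / uni[w1]
    return alpha * alpha * uni.get(w, 0) / sum(uni.values())

# Hand-set toy counts
tri = {("the", "cat", "sat"): 1}
bi = {("the", "cat"): 1, ("cat", "sat"): 1}
uni = {"the": 1, "cat": 1, "sat": 1}

print(backoff_prob("sat", "cat", "the", tri, bi, uni))  # trigram seen: 1.0
print(backoff_prob("sat", "cat", "a", tri, bi, uni))    # backs off: 0.4 * P(sat | cat) = 0.4
```

Unlike interpolation, only one level's estimate is used per prediction; lower orders are consulted only when the higher order fails.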

3. Interpolation vs. Backoff


Feature | Interpolation | Backoff
Combination | Combines probabilities from all N-gram levels. | Uses lower-order N-grams only when necessary.
Weighting | Requires weights (λ) for each N-gram level. | Normalizes probabilities with a backoff factor (α).
Sparsity Handling | Distributes probability across levels smoothly. | Falls back to lower levels when higher ones fail.
Complexity | Computationally more complex. | Simpler to implement.
Accuracy | Typically provides better generalization. | Depends on the size and quality of the corpus.
4. Advanced Techniques
Katz Backoff
 A combination of backoff and smoothing.
 High-order N-grams are used when available, and lower-order N-grams are backed off to with
discounted probabilities.
 Probability adjustment ensures unused mass from higher-order N-grams is redistributed to lower-
order ones.
Linear Interpolation
 A specific form of interpolation where weights are pre-determined or trained on a development
dataset.
 Each N-gram level contributes to the final probability, weighted by fixed or learned factors.

Word Classes
In Natural Language Processing (NLP), word classes (also referred to as parts of speech
(POS) or syntactic categories) are used to group words based on their grammatical roles, syntactic behavior,
and function within a sentence. Word classes help in analyzing, understanding, and generating human
language computationally.

1. Common Word Classes in NLP


Here are the primary word classes typically used in NLP:

Word Class | Definition | Examples
Noun | Represents people, places, things, or ideas. | dog, city, happiness
Pronoun | Substitutes for nouns. | he, she, it, they
Verb | Denotes actions, states, or events. | run, is, think
Adjective | Describes or modifies nouns. | happy, red, tall
Adverb | Modifies verbs, adjectives, or other adverbs. | quickly, very, tomorrow
Preposition | Shows relationships between a noun/pronoun and another word in the sentence. | in, on, by, with
Conjunction | Connects words, phrases, or clauses. | and, but, or
Determiner | Modifies nouns to clarify reference. | the, a, some, this
Interjection | Expresses emotion or exclamation. | oh, wow, ouch
Numeral | Represents numbers or quantities. | one, two, third
Particle | Adds meaning or emphasis, often functioning as part of a phrasal verb. | not, up, off
Auxiliary Verb | Helps the main verb express tense, mood, or voice. | is, have, can
2. Importance of Word Classes in NLP
 Syntactic Analysis: Identifying word classes is essential for parsing sentences into meaningful
structures.

 Semantic Understanding: Helps in understanding the meaning and relationships of words in a sentence.
 Applications:
o Machine Translation: Determines correct grammar and structure in the target language.
o Text-to-Speech (TTS): Improves pronunciation and prosody.
o Question Answering: Helps identify entities and relationships in text.
3. Word Class Tagging
Part-of-Speech (POS) Tagging
 POS tagging involves assigning word classes to each word in a text.
 Example: Sentence: She is reading a book. POS Tags: She (PRON), is (AUX), reading (VERB), a
(DET), book (NOUN)
POS Tagging Tools
1. NLTK (Natural Language Toolkit):
o Uses pre-trained POS taggers based on the Penn Treebank tag set.
2. spaCy:
o Provides efficient, pre-trained models for POS tagging.
3. Stanford NLP:
o A highly accurate POS tagging library.
Popular Tag Sets
1. Penn Treebank POS Tag Set: Common in English NLP tasks.
o Example tags: NN (Noun, singular), VB (Verb, base form), DT (Determiner).
2. Universal POS Tag Set: A cross-linguistic standard.
o Example tags: NOUN, VERB, ADJ.
4. Challenges in Word Classes for NLP
1. Ambiguity:
o Some words can belong to multiple classes depending on context.
o Example: book (Noun: a book) vs. book (Verb: to book a ticket).
2. Idiomatic Expressions:
o Words may lose their standard class roles in idioms.
o Example: kick the bucket (phrase meaning to die).
3. Morphologically Rich Languages:
o Languages like Finnish or Turkish have complex inflection systems, making word class
tagging challenging.
4. Domain-Specific Vocabulary:
o Technical or slang words may not fit neatly into standard word classes.
5. Applications of Word Classes in NLP
1. Machine Translation:
o Ensures grammatical correctness in translations by tagging words with their appropriate
classes.
2. Named Entity Recognition (NER):
o Distinguishes between nouns (e.g., dog) and proper nouns (e.g., Google).
3. Text Summarization:
o Identifies keywords and key phrases based on nouns, verbs, and adjectives.
4. Sentiment Analysis:
o Adjectives and adverbs often carry sentiment, aiding polarity detection.
5. Information Retrieval:
o Helps identify relevant words or phrases in search queries.

Part-of-Speech Tagging
Part-of-Speech Tagging is the process of assigning word classes or grammatical categories
(e.g., noun, verb, adjective) to each word in a given text based on its context. It is a fundamental step in
many NLP applications as it helps in understanding the syntactic structure and meaning of a sentence.
1. Why POS Tagging is Important
1. Syntactic Analysis:
o Identifies the grammatical structure of sentences for parsing and sentence analysis.
2. Semantic Understanding:
o Determines word meaning based on context (e.g., "book" as a noun or verb).
3. Downstream Applications:
o Named Entity Recognition (NER): Identifies proper nouns.
o Machine Translation: Ensures grammatically correct output.
o Text Summarization: Extracts key phrases based on POS.
o Sentiment Analysis: Leverages adjectives and adverbs to detect sentiment.
2. How POS Tagging Works
Steps in POS Tagging:
1. Tokenization:
o Split the input text into individual words or tokens.
o Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat"]
2. Assigning POS Tags:
o Each token is assigned a tag based on:
 Rule-Based Methods: Grammar rules.
 Statistical Models: Probabilities derived from training data.
 Deep Learning Models: Neural networks that learn contextual relationships.
POS Tags
 Commonly used POS tagging schemes:
1. Penn Treebank POS Tag Set (for English):
 NN: Noun (singular)
 VB: Verb (base form)
 JJ: Adjective
 RB: Adverb
 IN: Preposition
2. Universal POS Tag Set:
 NOUN, VERB, ADJ, ADV, PRON, etc.
3. Techniques for POS Tagging
1. Rule-Based Tagging
 Relies on manually defined grammar rules.
 Example:
o "If a word ends with '-ing', tag it as a verb (VB)."
 Limitation:
o Cannot handle complex or ambiguous contexts effectively.
2. Statistical Tagging
 Uses probabilistic models trained on labeled data.
 Examples:

o Hidden Markov Models (HMMs):


 Computes the most likely sequence of tags based on transition and emission
probabilities.
o Conditional Random Fields (CRFs):
 Incorporates contextual features and learns the sequence of tags.
3. Neural Network-Based Tagging
 Uses deep learning to capture context and dependencies in sentences.
 Examples:
o Recurrent Neural Networks (RNNs):
 Learn sequential data, such as sentences, for POS tagging.
o Bidirectional LSTMs (Bi-LSTMs):
 Use context from both directions (previous and next words) for better tagging
accuracy.
o Transformers:
 Models like BERT pre-trained on large corpora excel at contextual tagging.
4. Challenges in POS Tagging
1. Ambiguity:
o Words can have multiple POS tags depending on context.
o Example: book (Noun: "a book", Verb: "to book a room").
2. Out-of-Vocabulary Words:
o Words not seen during training can be challenging to tag.
3. Complex Sentence Structures:
o Long or syntactically ambiguous sentences may reduce accuracy.
4. Domain-Specific Text:
o Jargon or technical terms require domain-specific models.
5. Applications of POS Tagging
1. Named Entity Recognition (NER):
o Identifies entities like names, locations, or dates based on POS tags.
2. Dependency Parsing:
o Establishes syntactic relationships between words.
3. Text Summarization:
o Extracts key phrases by focusing on nouns, verbs, and adjectives.
4. Machine Translation:
o Ensures grammatical correctness in translations.
5. Question Answering Systems:
o Identifies the role of words in questions (e.g., subject, object).
6. POS Tagging Libraries
1. NLTK (Natural Language Toolkit):
o A Python library with pre-trained POS taggers.
import nltk
from nltk import pos_tag, word_tokenize

# First run may require: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(pos_tags)
2. spaCy:
o Provides fast and efficient POS tagging.

import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
sentence = "The quick brown fox jumps over the lazy dog."
doc = nlp(sentence)
for token in doc:
    print(f"{token.text}: {token.pos_}")
3. Stanford CoreNLP:
o A highly accurate library for POS tagging, using statistical models.
4. Flair:
o A deep learning library specialized in POS tagging and other NLP tasks.
5. BERT-based Models:
o Pre-trained transformer models like BERT can perform POS tagging with fine-tuning.

Rule-Based NLP

Rule-Based NLP involves the use of hand-crafted linguistic rules and patterns to process,
analyze, and generate human language. It is one of the oldest approaches in NLP and relies on predefined
rules, lexicons, and grammar to achieve language understanding or generation.
1. What is Rule-Based NLP?
In a rule-based system, language processing is based on a set of manually defined rules. These rules are
created by linguists or domain experts and are used to identify patterns in text or to define how language
elements interact.
 Example:
o Rule: If a word ends with "-ing," it is likely a verb.
o Rule: If "not" appears before an adjective, classify it as negative sentiment.
Key Components:
1. Lexicons: Word lists or dictionaries with associated features (e.g., part of speech, polarity).
2. Grammar Rules: Syntax and morphology rules (e.g., subject-verb agreement, noun phrase
structure).
3. Pattern Matching: Matching text to specific patterns (e.g., regular expressions).
4. Rule Engine: A system that applies rules to text.
2. Applications of Rule-Based NLP
1. Part-of-Speech (POS) Tagging
 Rule-based taggers assign POS tags to words using linguistic rules.
 Example Rule:
o If the preceding word is a determiner (e.g., "the"), tag the current word as a noun.
2. Named Entity Recognition (NER)
 Detect entities like names, dates, and locations using patterns.
 Example Rule:

o If a word starts with a capital letter and is followed by "Inc." or "Ltd.," classify it as an
organization.
3. Sentiment Analysis
 Identify positive or negative sentiment using sentiment word lexicons and negation rules.
 Example Rule:
o If "not" appears before a positive word (e.g., "not good"), classify it as negative.
4. Text Normalization
 Handle text preprocessing tasks like stemming and lemmatization using rules.
 Example Rule:
o If a word ends in "ing," remove "ing" (e.g., "running" → "run").
5. Spell Checking
 Correct spelling errors by comparing against a dictionary and applying transformation rules.
6. Information Extraction
 Extract structured information from unstructured text using templates and rules.
 Example:
o Extract dates in the format "DD-MM-YYYY" using regex patterns.
7. Question Answering
 Use rules to detect question types and retrieve relevant information.
 Example Rule:
o If a question starts with "Who," retrieve entities tagged as "Person."
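The sentiment and negation rules above (applications 3 and 7's pattern style) can be sketched as follows; the lexicon and rule set are deliberately tiny and illustrative:

```python
import re

POSITIVE = {"good", "great", "happy"}  # toy sentiment lexicon
NEGATIVE = {"bad", "awful", "sad"}

def rule_sentiment(sentence):
    """Lexicon lookup plus one negation rule: "not"/"never" before a
    sentiment word flips its polarity. A sketch, not a full system."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    for i, tok in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in {"not", "never"}
        if tok in POSITIVE:
            score += -1 if negated else 1
        elif tok in NEGATIVE:
            score += 1 if negated else -1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(rule_sentiment("The movie was not good"))  # negative
print(rule_sentiment("The food was great"))      # positive
```

This shows both the strength (fully interpretable) and the weakness (the single-word negation window misses "not very good") of hand-written rules.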
3. Advantages of Rule-Based NLP
1. Interpretability:
o Rules are explicit and easy to understand.
o Useful in domains where decisions need to be explainable.
2. Domain Adaptability:
o Rules can be customized for specific languages, industries, or tasks.
3. Low Data Dependency:
o Does not require large labeled datasets for training.
4. Deterministic Behavior:
o Outputs are predictable and consistent.
4. Limitations of Rule-Based NLP
1. Scalability:
o Creating and maintaining a large number of rules is time-consuming and labor-intensive.
2. Coverage:
o Rules may fail to handle edge cases, ambiguities, or new language patterns.
3. Context Sensitivity:
o Difficult to account for context or nuances of natural language effectively.
4. Maintenance:
o Rules need to be updated frequently to keep up with evolving language and domain-specific
terms.
5. Generalization:
o Rule-based systems often struggle with unseen data or out-of-vocabulary words.
5. Examples of Rule-Based NLP Techniques
1. Regular Expressions (Regex)
 Used for pattern matching in text.
 Example:
o Extract email addresses:
import re
# example.com is a stand-in for the address lost in extraction
text = "Contact us at support@example.com."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails)
Output: ['support@example.com']
2. Rule-Based POS Tagging
 Example using NLTK:

import nltk
from nltk import RegexpTagger

# Define POS tagging rules (patterns are tried in order; the last is a default)
rules = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # past tense verbs
    (r'.*es$', 'VBZ'),   # 3rd person singular verbs
    (r'.*ly$', 'RB'),    # adverbs
    (r'.*', 'NN')        # default: noun
]

# Apply the rule-based tagger
tagger = RegexpTagger(rules)
sentence = "The cat is running quickly."
tokens = nltk.word_tokenize(sentence)
tags = tagger.tag(tokens)
print(tags)

Output:
[('The', 'NN'), ('cat', 'NN'), ('is', 'NN'), ('running', 'VBG'), ('quickly', 'RB'), ('.', 'NN')]
3. Named Entity Recognition
 Extract dates using regex:

import re
text = "The meeting is scheduled for 14-01-2025."
dates = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)
print(dates)
Output: ['14-01-2025']

6. Rule-Based vs. Statistical/ML-Based NLP

Aspect | Rule-Based NLP | Statistical/ML-Based NLP
Interpretability | High (rules are explicit). | Low (complex models like neural networks).
Data Dependency | Low (works without large datasets). | High (requires labeled data for training).
Flexibility | Limited (hard to adapt to new patterns). | High (generalizes better with enough data).
Scalability | Difficult to scale as rules increase. | Scales well with more data and computational power.
Performance | Good for small, well-defined tasks. | Better for large, complex, and ambiguous tasks.
7. Hybrid Systems
Modern NLP systems often combine rule-based and ML-based approaches to leverage the strengths of
both:
 Example: Use ML models for initial tagging and apply rule-based post-processing for domain-
specific corrections.
8. Use Cases for Rule-Based NLP
 Domains with Low Data Availability:
o Legal or healthcare text analysis where labeled datasets are limited.
 Critical Applications:
o Applications requiring high interpretability (e.g., financial compliance).
 Text Preprocessing:
o Tokenization, normalization, and filtering in NLP pipelines.

Stochastic and Transformation-Based Tagging


POS tagging is a crucial task in NLP, and there are multiple approaches to achieving it, including
stochastic methods and transformation-based learning (TBL). Here's a breakdown of these two
approaches:
1. Stochastic Tagging
Stochastic Tagging relies on probability and statistics to assign part-of-speech (POS) tags to words. It
involves using statistical models trained on labeled data to compute the most likely sequence of tags for a
sentence.
Key Techniques in Stochastic Tagging
1. Hidden Markov Models (HMM)
 Overview:
o Assumes a sequence of words is generated by a hidden sequence of states (POS tags).
o Uses transition probabilities (from one tag to another) and emission probabilities
(probability of a word given a tag) to find the most likely sequence of tags.
 Key Steps:
o Calculate probabilities from a tagged corpus.
o Use the Viterbi Algorithm to find the most probable sequence of tags.
 Example:
o Given the sentence "The cat sleeps":
 Transition Probability: P(NN → VB) = 0.2 (probability of a noun followed by a verb).
 Emission Probability: P(sleeps | VB) = 0.8 (probability of the word "sleeps" given the
tag VB).
2. N-gram Models

 Uses n-grams (sequences of n words or tags) to compute probabilities.


 For POS tagging, bigrams and trigrams are commonly used.
 Example:
o If the bigram probability P(DT → NN) is high, the word "dog" is likely to be a noun when
preceded by a determiner like "the."
3. Maximum Entropy Models
 Probabilistic models that consider a wider range of contextual features.
 Predict the POS tag that maximizes the conditional probability given the features.
4. Conditional Random Fields (CRFs)
 Discriminative models that predict the sequence of tags directly, using both current and surrounding
context.
 Example:
o Predicts tags for the sentence "The quick brown fox" by considering the relationship between
neighboring words and their tags.
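The HMM decoding step (technique 1) can be sketched with the Viterbi algorithm over a tiny hand-set model; every probability below is illustrative rather than estimated from a corpus:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most likely tag sequence given transition probabilities P(tag_i | tag_{i-1})
    and emission probabilities P(word | tag)."""
    # table[i][t] = (best probability of a sequence ending in tag t at word i, that sequence)
    table = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prev = max(tags, key=lambda s: table[-1][s][0] * trans_p[s][t])
            p = table[-1][prev][0] * trans_p[prev][t] * emit_p[t].get(w, 0.0)
            row[t] = (p, table[-1][prev][1] + [t])
        table.append(row)
    return max(table[-1].values())[1]

tags = ["DT", "NN", "VB"]
start = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
         "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
         "VB": {"DT": 0.4, "NN": 0.3, "VB": 0.3}}
emit = {"DT": {"the": 0.9}, "NN": {"cat": 0.5, "sleeps": 0.1}, "VB": {"sleeps": 0.6}}

print(viterbi(["the", "cat", "sleeps"], tags, start, trans, emit))  # ['DT', 'NN', 'VB']
```

The dynamic-programming table avoids enumerating all tag sequences, which is what makes HMM tagging tractable for long sentences.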
Advantages of Stochastic Tagging
1. Can handle ambiguity using probabilities.
2. Generalizes well to unseen data with sufficient training.
3. Scalable for large datasets.
Challenges of Stochastic Tagging
1. Requires a large, annotated corpus for training.
2. Struggles with domain-specific text or out-of-vocabulary words.
3. Complex models like CRFs and HMMs may be computationally expensive.
2. Transformation-Based Tagging (TBL)
Transformation-Based Tagging, also known as Brill Tagging, is a hybrid approach that combines
rule-based and stochastic methods. It learns rules from data iteratively to correct initial tagging errors.
How TBL Works
1. Initialization:
o Start with a baseline tagger (e.g., assign the most frequent tag for each word).
o Example: Tag "book" as NN (noun) because it's most commonly a noun.
2. Rule Generation:
o Identify contexts where the initial tag is incorrect and generate rules to correct these errors.
o Example Rule:
 If the previous word is "to" and the current word is "book," change the tag from NN
(noun) to VB (verb).
3. Rule Application:
o Apply the learned rules iteratively, refining the tagging process in each iteration.
4. Stopping Condition:
o Stop when no further improvements are made or when a predefined number of iterations is
reached.
Example of TBL
Input Sentence:
 "I want to book a flight."
Initial Tags (Baseline):
 I/PRP want/VB to/TO book/NN a/DT flight/NN
Transformation Rule:
 Rule: If the previous word is "to" and the current word is tagged as NN, change the tag to VB.
Final Tags:

 I/PRP want/VB to/TO book/VB a/DT flight/NN
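The baseline-plus-transformation cycle in this example can be sketched as follows; the lexicon and rule are hand-written for illustration, whereas real TBL learns such rules from error counts:

```python
def baseline_tag(tokens, lexicon):
    """Baseline tagger: most frequent tag per word (NN if unknown)."""
    return [(w, lexicon.get(w.lower(), "NN")) for w in tokens]

def apply_rule(tagged, prev_word, from_tag, to_tag):
    """One Brill-style transformation: change from_tag to to_tag
    when the previous word matches prev_word."""
    out = list(tagged)
    for i in range(1, len(out)):
        w, t = out[i]
        if t == from_tag and out[i - 1][0].lower() == prev_word:
            out[i] = (w, to_tag)
    return out

lexicon = {"i": "PRP", "want": "VB", "to": "TO", "book": "NN", "a": "DT", "flight": "NN"}
tagged = baseline_tag("I want to book a flight".split(), lexicon)
tagged = apply_rule(tagged, prev_word="to", from_tag="NN", to_tag="VB")
print(tagged)  # [('I', 'PRP'), ('want', 'VB'), ('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('flight', 'NN')]
```

The single rule corrects exactly the baseline error shown in the example ("book" retagged NN → VB after "to") while leaving the other tags untouched.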


Advantages of Transformation-Based Tagging
1. Interpretable Rules:
o Rules are human-readable and explainable.
2. Domain Adaptability:
o Rules can be adapted for specific domains or tasks.
3. Efficiency:
o Does not require probabilistic computations during inference.
Challenges of TBL
1. Dependency on Baseline:
o The quality of the baseline tagger impacts performance.
2. Rule Creation:
o Iterative rule learning can be slow for large datasets.
3. Error Propagation:
o Errors in early iterations may propagate to later stages.
3. Comparison of Stochastic and TBL

Aspect | Stochastic Tagging | Transformation-Based Tagging (TBL)
Methodology | Relies on probabilities and statistical models. | Learns explicit transformation rules iteratively.
Training Data | Requires large labeled datasets. | Requires labeled data but learns interpretable rules.
Interpretability | Low (statistical models are complex). | High (rules are human-readable).
Adaptability | Generalizes well with sufficient data. | Can be fine-tuned for specific domains.
Speed | Faster at runtime (once trained). | Slower due to iterative rule application.
Error Handling | Handles ambiguity using probabilities. | Iteratively corrects errors with learned rules.
4. Applications of Stochastic and TBL Tagging
1. Part-of-Speech Tagging:
o Stochastic models (e.g., HMMs, CRFs) are widely used in general-purpose POS taggers.
o TBL is useful in domains where interpretability is crucial.
2. Named Entity Recognition (NER):
o Stochastic models identify entities using probabilistic tagging.
o TBL can refine entity tags by applying domain-specific rules.
3. Spell Checking and Correction:
o TBL can be used to create rules for correcting common spelling errors.
4. Syntactic Parsing:
o Stochastic models like CRFs assist in dependency and constituency parsing.
Issues in POS Tagging
POS tagging, the process of assigning grammatical tags to words in a text, is a fundamental task in
Natural Language Processing (NLP). Despite its importance, several challenges arise during its
implementation due to the complexity and ambiguity of human language.
1. Ambiguity
1.1 Lexical Ambiguity
 A single word can have multiple possible POS tags depending on its context.
o Example:
 "Book a flight" → "book" (Verb)
 "Read the book" → "book" (Noun)
1.2 Structural Ambiguity
 The structure of a sentence can lead to multiple valid tag sequences.
o Example:
 "Visiting relatives can be fun"
 "Visiting" as a Verb (action) or Adjective (modifier).
1.3 Tagging Ambiguity
 Words that can fit into multiple categories even within similar contexts.
o Example:
 "He saw her duck"
 "duck" could be a Noun (animal) or Verb (action).
2. Out-of-Vocabulary (OOV) Words
 Words not present in the training data can lead to incorrect or undefined tags.
o Common cases:
 Neologisms: Newly coined words (e.g., "selfie").
 Technical Terms: Domain-specific jargon.
 Foreign Words: Words borrowed from other languages.
3. Domain-Specific Challenges
 POS tagging systems trained on general-purpose corpora may fail in specific domains.
o Example:
 Medical domain: "MRI scan" (Medical jargon)
 Legal domain: "Hereby declare" (Formal language)
4. Language Variability
4.1 Morphological Richness
 Some languages (e.g., Finnish, Turkish) have rich morphology where a single word can represent an
entire phrase in English.
o Example:
 Turkish: "Evlerinizden" → "From your houses."
4.2 Free Word Order
 In free word order languages (e.g., Sanskrit, Hungarian), the sequence of words does not always
determine their grammatical role, making tagging more complex.
4.3 Lack of Resources
 For low-resource languages, there may be insufficient annotated corpora for training.
5. Multiword Expressions (MWEs)
 Phrases that function as a single unit can cause confusion in tagging.
o Example:
 "New York" should be tagged as a proper noun (NNP), not as two separate entities.
6. Context Sensitivity
 POS tags often depend on the broader sentence or paragraph context, which simple models may fail
to capture.
o Example:
 "He likes to fish"
 "The fish is fresh"
7. Inconsistent Annotation Standards
 Different corpora use different POS tagsets or annotation guidelines, leading to inconsistency in
tagging models.
o Example:
 Universal POS Tagset (simpler): "book" → VERB
 Penn Treebank Tagset (granular): "book" → VB
8. Polysemy and Homonymy
 Polysemy: Words with multiple related meanings.
o Example: "run" → a physical action (VB) or a race (NN).
 Homonymy: Words with unrelated meanings but identical spelling.
o Example: "bank" → a financial institution (NN) or the side of a river (NN).
9. Noisy Text
 Tagging becomes difficult in non-standard or informal text formats, such as:
o Social Media Text: Contains abbreviations, emojis, and slang.
 Example: "u r gr8" → "you are great."
o Speech Transcriptions: May include disfluencies and fillers.
 Example: "Um, I think I like, uh, coffee."
10. Compound Words
 Words like "ice-cream" or "well-being" can be misinterpreted as separate tokens or misclassified.
11. Handling Code-Switching
 In multilingual contexts, speakers often switch between languages mid-sentence.
o Example:
 "I need to book a taxi जल्दी से" (English + Hindi; जल्दी से means "quickly").
12. Evaluation and Metrics
 Evaluating POS taggers is challenging due to:
o Different annotation schemes.
o Disagreement between annotators in ambiguous cases.
o Metric limitations: Precision, recall, and F1 may not always reflect real-world performance.
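To make the precision/recall/F1 limitations concrete, here is a minimal sketch of per-tag scoring from parallel gold and predicted tag lists. The toy data is illustrative only; real evaluation must also contend with the annotation-scheme mismatches noted above.

```python
def tag_scores(gold, pred, tag):
    """Per-tag precision, recall, and F1 over aligned tag sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == tag and p == tag)
    fp = sum(1 for g, p in zip(gold, pred) if g != tag and p == tag)
    fn = sum(1 for g, p in zip(gold, pred) if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["NN", "VB", "NN", "DT"]   # toy gold annotations
pred = ["NN", "NN", "NN", "DT"]   # toy tagger output
print(tag_scores(gold, pred, "NN"))  # precision 2/3, recall 1.0, F1 0.8
```

Note that such token-level scores can look high even when the errors fall on exactly the ambiguous words that matter most downstream.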
13. Dependency on Training Data
 Quality of Training Data:
o Poorly annotated corpora result in models learning incorrect patterns.
 Bias in Data:
o Models trained on biased datasets may perform poorly in diverse contexts.
14. Memory and Computational Constraints
 Resource-heavy models like CRFs or neural networks may not work well on devices with limited
computational power.
Strategies to Overcome Challenges
1. Ambiguity Handling
 Use context-aware models like Transformers (e.g., BERT) to capture sentence context.
 Leverage multi-task learning to integrate syntactic and semantic information.
2. OOV Word Management
 Use subword tokenization (e.g., Byte Pair Encoding or WordPiece).
 Add fallback rules for rare or unseen words.
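A fallback rule for unseen words can be as simple as guessing a tag from the word's suffix. The sketch below is illustrative: the suffix table is hand-picked, not learned from data, and a real system would derive it from corpus statistics or use subword models instead.

```python
# Ordered suffix -> tag guesses (checked first to last); illustrative only
SUFFIX_TAGS = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("tion", "NN"), ("s", "NNS")]

def fallback_tag(word, lexicon):
    """Tag a word from a known lexicon, falling back to suffix rules for OOV words."""
    if word.lower() in lexicon:            # known word: trust the lexicon
        return lexicon[word.lower()]
    for suffix, tag in SUFFIX_TAGS:        # unknown word: guess from suffix
        if word.lower().endswith(suffix):
            return tag
    return "NN"                            # default guess: noun

lexicon = {"the": "DT", "cat": "NN"}
print(fallback_tag("blogging", lexicon))   # OOV, "-ing" suffix → VBG
print(fallback_tag("selfie", lexicon))     # OOV, no matching suffix → NN
```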
3. Domain Adaptation
 Train or fine-tune models on domain-specific corpora.
 Use transfer learning techniques.
4. Handling Noisy Text
 Preprocess text to normalize slang, abbreviations, and spelling errors.
 Use domain-specific embeddings trained on noisy text (e.g., Twitter embeddings).
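A minimal normalization pass for noisy social-media text might look like the sketch below. The slang table is a tiny illustrative sample; production systems use much larger normalization lexicons.

```python
import re

# Illustrative abbreviation/slang table; real lexicons are far larger
SLANG = {"u": "you", "r": "are", "gr8": "great"}

def normalize(text):
    """Expand common slang tokens and squeeze long character repetitions."""
    tokens = [SLANG.get(t, t) for t in text.lower().split()]
    joined = " ".join(tokens)
    # Collapse runs of 3+ identical characters to 2 ("sooooo" -> "soo")
    return re.sub(r"(.)\1{2,}", r"\1\1", joined)

print(normalize("u r gr8"))       # → "you are great"
print(normalize("sooooo cool"))   # → "soo cool"
```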
5. Multilingual Solutions
 Use models trained on Universal Dependencies (UD) to standardize tagging across languages.
 Build language-agnostic embeddings (e.g., mBERT, XLM-R).
6. Resource Scarcity
 Use cross-lingual transfer learning or unsupervised methods for low-resource languages.
 Employ crowdsourcing to create labeled datasets.
7. Incorporating Linguistic Knowledge
 Add linguistic rules or constraints to supplement statistical or neural models.
 Combine rule-based and data-driven approaches for better performance.
Hidden Markov Models (HMM) and Maximum Entropy Models
Hidden Markov Models (HMM) and Maximum Entropy Models (MaxEnt) are two widely used
statistical techniques in Natural Language Processing (NLP). They are often applied to sequence labeling
tasks, such as Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and other text classification
problems.
1. Hidden Markov Models (HMM)
An HMM is a probabilistic model used for modeling sequences, where the system being modeled is
assumed to follow a Markov process with hidden states.
Key Concepts in HMM
1. States:
o Represent hidden variables, e.g., POS tags.
o Example: {NN, VB, DT} (noun, verb, determiner).
2. Observations:
o Represent observable data, e.g., words in a sentence.
o Example: ["The", "cat", "jumps"].
3. Transition Probabilities (P(tag_i | tag_(i-1))):
o Probability of transitioning from one state to another.
o Example: P(VB → NN).
4. Emission Probabilities (P(word | tag)):
o Probability of observing a word given a state.
o Example: P("cat" | NN).
5. Initial Probabilities (P(tag)):
o Probability of starting in a specific state.
o Example: P(DT) = 0.3.
How HMM Works
 HMM assumes that:
1. The current state depends only on the previous state (Markov property).
2. The observed word depends only on the current state.
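These two assumptions give the HMM factorization P(tags, words) = P(t1) x product of P(t_i | t_(i-1)) x product of P(w_i | t_i). A toy computation, with made-up illustrative probabilities:

```python
# Made-up toy probability tables (not from a trained model)
initial = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {("DT", "NN"): 0.7, ("NN", "VB"): 0.5}
emit = {("DT", "The"): 0.4, ("NN", "cat"): 0.1, ("VB", "jumps"): 0.05}

def sequence_prob(words, tags):
    """Joint probability of one (words, tags) sequence under the HMM factorization."""
    p = initial[tags[0]] * emit[(tags[0], words[0])]
    for i in range(1, len(words)):
        p *= trans[(tags[i - 1], tags[i])] * emit[(tags[i], words[i])]
    return p

# 0.6 * 0.4 * 0.7 * 0.1 * 0.5 * 0.05 = 0.00042
print(sequence_prob(["The", "cat", "jumps"], ["DT", "NN", "VB"]))
```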
Decoding with HMM
 The goal is to find the most likely sequence of states (tags) given the observed sequence of words.
 Viterbi Algorithm:
o A dynamic programming algorithm to compute the most probable tag sequence efficiently.
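A compact sketch of the Viterbi algorithm is shown below. The toy probability tables are illustrative, not from a trained model; entries missing from the tables are treated as probability 0.

```python
def viterbi(words, tags, initial, trans, emit):
    """Return the most probable tag sequence for the word sequence."""
    # V[i][t]: probability of the best tag path ending in tag t at position i
    V = [{t: initial.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: V[i - 1][p] * trans.get((p, t), 0.0))
            V[i][t] = V[i - 1][prev] * trans.get((prev, t), 0.0) * emit.get((t, words[i]), 0.0)
            back[i][t] = prev
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):   # follow back-pointers
        path.append(back[i][path[-1]])
    return list(reversed(path))

initial = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {("DT", "NN"): 0.7, ("NN", "VB"): 0.5}
emit = {("DT", "The"): 0.4, ("NN", "cat"): 0.1, ("VB", "jumps"): 0.05}
print(viterbi(["The", "cat", "jumps"], ["DT", "NN", "VB"], initial, trans, emit))
# → ['DT', 'NN', 'VB']
```

Dynamic programming keeps only the best path into each state at each position, so decoding is O(n * |tags|^2) instead of exponential in the sentence length.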
Advantages of HMM
 Simplicity: Easy to implement and interpret.
 Probabilistic Framework: Provides a natural way to handle uncertainty in language.
Disadvantages of HMM
1. Strong Independence Assumptions:
o Assumes the current tag depends only on the previous tag, and each word depends only on its own tag.
2. Data Sparsity:
o Struggles with unseen words or rare transitions.
3. Fixed Features:
o Cannot incorporate rich contextual features easily.
2. Maximum Entropy Models (MaxEnt)
Maximum Entropy Models, also known as log-linear models, are discriminative models used for
classification tasks. MaxEnt models predict the conditional probability of a class (e.g., a tag) given an input
feature vector.
Key Concepts in MaxEnt
1. Feature Representation:
o Captures contextual information, e.g., surrounding words, word suffixes, and capitalization.
o Example Features:
 Current Word: "cat"
 Previous Word: "The"
 Is Capitalized: False
2. Conditional Probability:
o Computes the probability of a class (tag) given the features.
3. Training:
o Maximize the log-likelihood of the training data using optimization algorithms (e.g., gradient
descent).
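The kind of feature vector a MaxEnt tagger consumes can be sketched as follows. The feature names are illustrative conventions, not a fixed standard; real taggers typically use many more feature templates.

```python
def extract_features(words, i):
    """Build a sparse feature dict for the word at position i (illustrative templates)."""
    word = words[i]
    return {
        "word=" + word.lower(): 1.0,
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1.0,
        "next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>"): 1.0,
        "suffix3=" + word[-3:].lower(): 1.0,     # last three characters
        "is_capitalized": 1.0 if word[0].isupper() else 0.0,
    }

print(extract_features(["The", "cat", "jumps"], 1))
```

Because features like "prev_word" and "suffix3" overlap and are not independent, this representation is exactly what an HMM cannot easily accommodate but a MaxEnt model handles directly.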
Advantages of MaxEnt
1. Rich Features:
o Can incorporate arbitrary, overlapping, and non-independent features.
2. Flexibility:
o No need for independence assumptions (unlike HMM).
3. Discriminative:
o Directly models the conditional probability P(y | x).
Disadvantages of MaxEnt
1. Computational Cost:
o Training can be expensive, especially with many features.
2. Overfitting:
o Requires regularization to avoid overfitting on the training data.
3. Data Dependence:
o Performance depends heavily on feature engineering and quality of labeled data.
3. Comparison of HMM and MaxEnt
Aspect | HMM | MaxEnt
Model Type | Generative: models the joint probability P(x, y). | Discriminative: models the conditional probability P(y | x).
Independence Assumptions | Strong independence assumptions. | No independence assumptions.
Features | Limited to emission and transition probabilities. | Can use rich, overlapping contextual features.
Efficiency | Faster to train and decode. | Computationally intensive during training.
Use Cases | Sequence labeling with simple features. | Sequence labeling with complex features.
Robustness | Struggles with sparse data. | Handles sparse data better with proper regularization.
4. Applications of HMM and MaxEnt in NLP
HMM Applications
1. Part-of-Speech Tagging:
o Assign POS tags to words in a sentence.
2. Speech Recognition:
o Model sequences of phonemes or words in audio signals.
3. Machine Translation:
o Generate translation probabilities for word alignments.
MaxEnt Applications
1. Named Entity Recognition (NER):
o Identify entities like names, locations, and organizations.
2. Chunking:
o Identify phrases (e.g., noun or verb phrases) in sentences.
3. Sentiment Analysis:
o Classify text as positive, negative, or neutral.