Minimum Edit Distance.

Minimum Edit Distance, or Levenshtein Distance, quantifies the dissimilarity between two strings by counting the minimum number of operations (insertion, deletion, substitution) needed to transform one into the other, and is widely used in NLP for applications like spelling correction and plagiarism detection. The calculation involves a dynamic programming approach to create a matrix that tracks transformation costs, ultimately yielding the minimum edit distance value. Additionally, spelling correction techniques in NLP utilize various methods, including dictionary-based, phonetic, statistical, and machine learning approaches to enhance text accuracy and readability.

1. Minimum Edit Distance in Natural Language Processing (NLP)

Minimum Edit Distance, also known as Levenshtein Distance, is a measure of how
dissimilar two strings (for example, words or sentences) are by counting the minimum
number of operations required to transform one string into the other. The operations typically
considered are:

1. Insertion: Adding a character to the string.
2. Deletion: Removing a character from the string.
3. Substitution: Replacing one character in the string with another character.

This concept is widely used in various NLP applications, including spelling correction, DNA
sequence analysis, and natural language understanding.

Importance of Minimum Edit Distance

 Spelling Correction: Helps in suggesting corrections for misspelled words by
calculating how closely a candidate word matches the intended word.
 Plagiarism Detection: Measures similarity between texts to identify possible
plagiarism or content duplication.
 Natural Language Processing Tasks: Useful in tasks such as machine translation
and information retrieval.

Calculating Minimum Edit Distance

The minimum edit distance can be calculated using a dynamic programming approach, which
involves creating a matrix to track the costs of transforming one string into another.

Steps to Calculate Minimum Edit Distance:

1. Initialize a Matrix: Create a matrix where the rows represent the characters of the
first string and the columns represent the characters of the second string.
2. Fill in the Matrix:
o The first row and the first column are initialized based on the costs of
converting an empty string to a non-empty string (and vice versa).
o For each cell in the matrix, compute the minimum cost considering the three
possible operations (insertion, deletion, substitution).
3. Traceback: The value in the bottom-right cell of the matrix will give the minimum
edit distance.

Example of Minimum Edit Distance Calculation

Consider the words "kitten" and "sitting":

1. Initialization:

"" s i t t i n g
"" 0 1 2 3 4 5 6 7
k 1
i 2
t 3
t 4
e 5
n 6

2. Fill in the Matrix:
o Fill in the costs based on the minimum of the three operations:

"" s i t t i n g
"" 0 1 2 3 4 5 6 7
k 1 1 2 3 4 5 6 7
i 2 2 1 2 3 4 5 6
t 3 3 2 1 2 3 4 5
t 4 4 3 2 1 2 3 4
e 5 5 4 3 2 2 3 4
n 6 6 5 4 3 3 2 3

3. Minimum Edit Distance:
o The minimum edit distance between "kitten" and "sitting" is 3 (substituting 'k'
with 's', substituting 'e' with 'i', and adding 'g' at the end).
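
The calculation above can be implemented directly as a small dynamic-programming routine. The following is a minimal Python sketch of the steps described in this section (the function name min_edit_distance is illustrative):

def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via the dynamic-programming matrix described above."""
    m, n = len(source), len(target)
    # dp[i][j] = cost of transforming source[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete every character of the source prefix
    for j in range(n + 1):
        dp[0][j] = j          # insert every character of the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # substitution (or match)
            )
    return dp[m][n]            # bottom-right cell = minimum edit distance

print(min_edit_distance("kitten", "sitting"))  # 3, matching the matrix above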

Applications of Minimum Edit Distance

1. Spell Checkers: Suggesting the closest correct spelling for a misspelled word based
on the minimum edit distance from a list of valid words.
2. Plagiarism Detection: Assessing the similarity between two texts to identify potential
plagiarism by measuring the edit distance.
3. Machine Translation: Evaluating translation accuracy by comparing the translated
output with the original text.
4. Search Engines: Enhancing search query results by correcting user input based on the
calculated edit distance.

Conclusion

Minimum Edit Distance is a powerful concept in NLP for measuring the similarity between
strings. By quantifying how many edits are required to convert one string into another, it
facilitates various applications, from spell checking to text similarity detection.
Understanding and implementing minimum edit distance algorithms can significantly
enhance the functionality of NLP systems.
2. Detecting and Correcting Spelling Errors in Natural Language
Processing (NLP)

Detecting and correcting spelling errors is an essential task in Natural Language Processing
(NLP). It ensures the accuracy and readability of text by identifying misspellings and
providing appropriate corrections. This process is vital for various applications, such as text
editors, search engines, chatbots, and automated customer service systems.

Importance of Spelling Correction

1. Improves Communication: Correct spelling enhances the clarity of written
communication, making it easier for readers to understand the intended message.
2. Increases User Satisfaction: Applications that effectively correct spelling errors tend
to provide a better user experience, leading to higher satisfaction and engagement.
3. Enhances Search Accuracy: Search engines that can identify and correct spelling
mistakes improve the chances of returning relevant results for users.

Approaches to Spelling Correction

1. Dictionary-Based Methods:
o These methods rely on a predefined dictionary of correctly spelled words.
When a word is input, the algorithm checks against the dictionary to identify
any misspellings.
o Process:
 Identify the word to be checked.
 Compare it against a list of valid words in the dictionary.
 Suggest the closest matches based on similarity.
2. Phonetic Algorithms:
o Phonetic algorithms, such as Soundex or Metaphone, transform words into
phonetic representations. This approach helps identify misspellings based on
how words sound rather than their exact spelling.
o Example:
 The words "right" and "write" might be detected as potential
corrections based on their phonetic similarity (a simplified Soundex
sketch appears after this list).
3. Statistical Methods:
o These methods use probabilistic models to suggest corrections based on the
frequency of word occurrences in a given corpus. The idea is to suggest the
most likely correct word based on context and usage.
o Process:
 Analyze the frequency of words in a large corpus of text.
 Use this data to predict the most probable correction for a misspelled
word.
4. Machine Learning Approaches:
o Recent advancements in machine learning allow for more sophisticated
models that learn from large datasets to detect and correct spelling errors.
o Example:
 Models trained on extensive corpora can understand context, making
them more effective at suggesting corrections based on surrounding
words.
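
As mentioned under the phonetic approach above, here is a simplified Soundex sketch in Python. It omits the special rule for same-coded consonants separated by 'H' or 'W', so it is illustrative rather than a standards-complete implementation:

def soundex(word: str) -> str:
    """Simplified Soundex: keep the first letter, encode the rest as digits,
    collapse adjacent duplicate codes, drop vowels, and pad/truncate to 4 characters."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}

    def code(ch):
        for letters, digit in groups.items():
            if ch in letters:
                return digit
        return ""  # vowels and h, w, y carry no code in this simplified version

    word = word.lower()
    encoded = [code(ch) for ch in word]
    result = []
    prev = encoded[0]  # code of the first letter, so it is not repeated
    for digit in encoded[1:]:
        if digit and digit != prev:
            result.append(digit)
        prev = digit
    return (word[0].upper() + "".join(result) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both map to R163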
Example of Spelling Correction

Consider the following sentence with a misspelled word:

 Input: "I hav a dreem to become a great scienist."

Using a spelling correction algorithm, the system would identify "hav" and "scienist" as
potential errors.

Correction Process:

1. Identify Misspelled Words:
o "hav" (should be "have")
o "scienist" (should be "scientist")
2. Suggest Corrections:
o Based on dictionary lookup and context, the corrected sentence would be:
o Output: "I have a dream to become a great scientist."
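
A minimal dictionary-based corrector can combine a word list with the edit-distance function from Section 1. The sketch below uses a tiny, purely illustrative dictionary; a real system would load a full lexicon and also weight candidates by word frequency:

def min_edit_distance(source, target):
    # Levenshtein distance (same dynamic-programming idea as in Section 1).
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n]

# Tiny illustrative dictionary; a real spell checker would use a full word list.
DICTIONARY = {"i", "have", "a", "dream", "to", "become", "great", "scientist"}

def correct(word):
    """Return the word itself if it is known, otherwise the closest dictionary entry."""
    if word.lower() in DICTIONARY:
        return word
    return min(DICTIONARY, key=lambda candidate: min_edit_distance(word.lower(), candidate))

sentence = "I hav a dreem to become a great scienist"
print(" ".join(correct(w) for w in sentence.split()))
# -> I have a dream to become a great scientist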

Tools for Spelling Correction in NLP

1. NLTK (Natural Language Toolkit):
o NLTK provides tools for processing text, including spelling correction
functionalities through the use of dictionaries and statistical methods.
2. TextBlob:
o TextBlob is a simple library for processing textual data that includes built-in
functionality for spelling correction.
3. spaCy:
o spaCy is another popular library for text processing; spelling correction is
typically added on top of it through third-party extensions rather than built in.
4. Hunspell:
o Hunspell is an open-source spell checker and morphological analyzer, widely
used in various applications, including web browsers and word processors.
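
As a quick illustration of one of these tools, the snippet below uses TextBlob's built-in correct() method (this assumes the textblob package and its required corpora are installed; the quality of the suggestions depends on TextBlob's internal word-frequency model):

from textblob import TextBlob

text = TextBlob("I hav a dreem to become a great scienist.")
print(text.correct())  # returns a new TextBlob with the most probable spellings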

Conclusion

Detecting and correcting spelling errors is a critical component of NLP applications that
enhances the quality and usability of text. By employing various approaches, including
dictionary-based methods, phonetic algorithms, statistical models, and machine learning
techniques, systems can effectively identify and correct spelling mistakes, leading to
improved communication and user satisfaction. Understanding these methods allows
developers and researchers to implement robust spelling correction features in their NLP
applications.
Challenges in Spelling Correction

1. Ambiguity:
o Example: "Their going" could be corrected to "They're going" or "Their
going" based on context.
2. Language-Specific Rules:
o Handling grammatical rules like conjugation or pluralization.
3. Domain-Specific Vocabulary:
o Correcting terms in specialized fields (e.g., "PyTorch" in machine learning).
4. Real-Word Errors:
o Misspellings that are themselves valid words (e.g., "form" typed instead of
"from"); context is crucial for detecting these errors.

Applications of Spelling Correction

1. Search Engines:
o Example: "recieve results" → "receive results"
2. Autocorrect in Keyboards:
o Real-time correction during typing.
3. Chatbots and Virtual Assistants:
o Improving user interactions by interpreting typos.
4. Text Preprocessing:
o Cleaning data for NLP tasks like sentiment analysis.
3. Unsmoothed N-grams in NLP

Unsmoothed N-grams are a foundational concept in statistical language modeling, used to
estimate the likelihood of word sequences based on observed data. They provide a simple
way to predict the next word in a sequence or calculate the probability of a sentence.

1. What are N-grams?

An N-gram is a contiguous sequence of N items (words, characters, etc.) from a given
text.
 Unigram: N = 1, individual words.
Example: "I am happy" → "I", "am", "happy"
 Bigram: N = 2, pairs of consecutive words.
Example: "I am happy" → "I am", "am happy"
 Trigram: N = 3, triplets of consecutive words.
Example: "I am happy" → "I am happy"
2. Purpose of N-grams
 Estimate the likelihood of a word sequence: P(w_1, w_2, …, w_N)
 For simplicity, assume the Markov Assumption:
o Each word depends only on the previous N-1 words.

By the chain rule, P(w_1, w_2, …, w_N) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) ⋯, and under
the Markov Assumption each conditional is approximated as P(w_i | w_{i-N+1}, …, w_{i-1}).
4. Calculating Probabilities in N-grams
 Frequency-Based Estimation:
Probabilities are estimated based on observed counts in the corpus.

P(w_i | w_{i-N+1}, …, w_{i-1}) = Count(w_{i-N+1}, …, w_i) / Count(w_{i-N+1}, …, w_{i-1})

Example:
Corpus: "I am happy. I am sad."
1. Bigram Probabilities:
o P(am | I) = Count(I am) / Count(I) = 2/2 = 1.0
o P(happy | am) = Count(am happy) / Count(am) = 1/2 = 0.5
2. Sentence Probability:
o For the sentence "I am happy":

P(I am happy) = P(I) · P(am | I) · P(happy | am)

Assuming P(I) = 1:

P(I am happy) = 1.0 · 1.0 · 0.5 = 0.5
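
A minimal Python sketch of this frequency-based estimation for bigrams, using the toy corpus above (the tokenization is deliberately crude and the function name is illustrative):

from collections import Counter

def bigram_probabilities(corpus):
    """Estimate unsmoothed bigram probabilities P(w_i | w_{i-1}) from raw counts."""
    tokens = corpus.replace(".", " .").split()   # crude tokenization; '.' kept as a token
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (prev, word): count / unigram_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

probs = bigram_probabilities("I am happy. I am sad.")
print(probs[("I", "am")])      # Count(I am) / Count(I) = 2/2 = 1.0
print(probs[("am", "happy")])  # Count(am happy) / Count(am) = 1/2 = 0.5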

5. Strengths of Unsmoothed N-grams


1. Simplicity:
o Easy to compute and interpret.
o Requires only basic counting.
2. Efficiency:
o Works well for frequent word sequences.
3. Baseline Model:
o Useful as a benchmark for more advanced models.

6. Weaknesses of Unsmoothed N-grams


1. Zero Probability Issue:
o If a sequence doesn’t occur in the training corpus, its probability is zero.
Example: P(coffee | am) = 0 if "am coffee" is absent from the corpus.
2. Data Sparsity:
o Higher-order N-grams require exponentially more data to estimate
probabilities reliably.
3. Context Limitation:
o Limited to N-1 words of context, ignoring long-range dependencies.
4. Unsmoothed Nature:
o No adjustments for unseen data, making predictions brittle.

7. Applications of N-grams
1. Spell Checking:
o Identify the most probable word sequence.
Example: "I am hapy" → "I am happy".
2. Machine Translation:
o Rank translations based on likelihood.
3. Speech Recognition:
o Predict likely phrases from audio input.
4. Text Generation:
o Generate text by sampling words based on N-gram probabilities.

8. Visualization
Example Sentence: "I am happy"
 Unigram: "I", "am", "happy"
 Bigram: "I am", "am happy"
 Trigram: "I am happy"
4. Evaluating N-grams in NLP
Evaluating N-grams involves determining how well an N-gram model predicts or represents
the language. It includes measuring the model's performance in terms of accuracy,
efficiency, and relevance to the task.

1. Why Evaluate N-grams?


1. Language Modeling Accuracy: Determine how well the model predicts word
sequences.
2. Application-Specific Relevance: Assess how suitable the model is for tasks like
machine translation, text generation, or speech recognition.
3. Compare Models: Evaluate performance differences between unigram, bigram,
trigram, or more complex models.

2. Key Evaluation Metrics


A. Perplexity
 Definition: Perplexity measures how uncertain a language model is when predicting
the next word in a sequence.

 Formula:

Perplexity(Model) = 2^( -(1/N) · Σ_{i=1}^{N} log2 P(w_i | w_{i-N+1}, …, w_{i-1}) )

where P(w_i | w_{i-N+1}, …, w_{i-1}) is the predicted probability of the i-th word given its context.
Interpretation:
 Lower perplexity: The model is more confident and accurate in predicting
sequences.
 Higher perplexity: The model struggles to predict words, indicating a mismatch with
the data.
Example:
 Model A predicts a test set with a perplexity of 120, and Model B predicts it with 90.
Model B is the better model, since lower perplexity indicates better predictions.
 Toy calculation: using the bigram probabilities from the corpus "I am happy. I am sad."
(P(am | I) = 1.0, P(happy | am) = 0.5) and assuming P(I) = 1, the perplexity of
"I am happy" is (1.0 · 1.0 · 0.5)^(-1/3) ≈ 1.26 (the Python example in part 7 below
reproduces this value).
3. Test Set and Cross-Validation
Test Set
 Use a separate dataset (not used during training) to evaluate the N-gram model.
 Example:
o Training set: "I am happy. I am sad."
o Test set: "I am excited."
Cross-Validation
 Split the data into training and validation subsets multiple times.
 Evaluate on each split to ensure robustness.

4. Challenges in Evaluating N-grams


1. Data Sparsity:
o Higher-order N-grams require large datasets to capture all possible word
sequences.
o Example: A trigram model may assign zero probability to unseen sequences.
2. Context Limitation:
o N-grams consider only limited context (N-1 words).
o Example: A bigram model P(w_i | w_{i-1}) ignores dependencies beyond one word.
3. Zero Probabilities:
o Unseen sequences can lead to zero probabilities, causing issues with metrics
like perplexity.
o Solution: Use smoothing techniques.

5. Improving N-gram Evaluation


A. Smoothing Techniques
 Assign small probabilities to unseen sequences.
 Examples:

o Laplace Smoothing: Add 1 to all counts.

P(w_i | w_{i-N+1}, …, w_{i-1}) = (Count(w_{i-N+1}, …, w_i) + 1) / (Count(w_{i-N+1}, …, w_{i-1}) + V)

where V is the vocabulary size.
o Backoff and Interpolation: Combine lower-order and higher-order N-grams.
B. Larger Datasets
 Larger datasets reduce data sparsity and improve performance.
C. Use Advanced Models
 Combine N-grams with machine learning or deep learning techniques (e.g.,
Recurrent Neural Networks).
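
A small Python sketch of Laplace (add-one) smoothing for bigram probabilities, extending the unsmoothed estimate from earlier (the function name and toy corpus are illustrative):

from collections import Counter

def laplace_bigram_probability(prev_word, word, tokens):
    """P(word | prev_word) with add-one smoothing:
    (Count(prev_word, word) + 1) / (Count(prev_word) + V)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocabulary_size = len(unigram_counts)  # V
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocabulary_size)

tokens = "I am happy . I am sad .".split()
print(laplace_bigram_probability("am", "happy", tokens))   # seen bigram: (1+1)/(2+5) ≈ 0.29
print(laplace_bigram_probability("am", "coffee", tokens))  # unseen bigram: (0+1)/(2+5) ≈ 0.14, no longer zero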

6. Applications of Evaluation
1. Speech Recognition:
o Rank possible transcriptions based on N-gram probabilities.
2. Machine Translation:
o Evaluate translation fluency using N-grams (e.g., BLEU score).
3. Text Prediction:
o Test how well the model predicts the next word in a sentence.

7. Python Example: Perplexity


from math import log2

def calculate_perplexity(test_set, ngram_probs, n):
    """Perplexity of an N-gram model over a list of tokens."""
    perplexity = 0
    N = len(test_set)

    for i in range(n - 1, N):  # Start from n-1 so each word has a full context
        context = tuple(test_set[i - n + 1:i])
        word = test_set[i]
        prob = ngram_probs.get(context + (word,), 1e-6)  # Handle unseen cases
        perplexity += -log2(prob)

    perplexity = 2 ** (perplexity / N)
    return perplexity

# Example Test Set
test_set = ["I", "am", "happy"]
ngram_probs = {("I", "am"): 1.0, ("am", "happy"): 0.5}

# Calculate Perplexity for Bigram Model
perplexity = calculate_perplexity(test_set, ngram_probs, 2)
print(f"Perplexity: {perplexity}")  # ≈ 1.26 for this toy bigram model

8. Key Takeaways
 Evaluation metrics like perplexity and log-likelihood measure the quality of N-gram
models.
 Smoothing techniques and larger datasets can improve performance.
 Limitations of N-grams highlight the need for advanced models like neural networks.
