N-Gram Language Modelling with NLTK
Language modeling is the task of determining the probability of any sequence of words. It is used in applications such as speech recognition and spam filtering, and it underlies many state-of-the-art Natural Language Processing models.
Methods of Language Modelling
There are two main approaches to language modelling:
- Statistical Language Modelling: the development of probabilistic models that predict the next word in a sequence given the words that precede it. N-gram language modeling is a classic example.
- Neural Language Modelling: neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger systems for challenging tasks such as speech recognition and machine translation. Word embeddings are a common building block of neural language models.
N-gram
An N-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs depending on the application. N-grams are typically collected from a text or speech corpus (a large text dataset).
For instance, N-grams can be unigrams like ("This", "article", "is", "on", "NLP") or bigrams ("This article", "article is", "is on", "on NLP").
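As a quick illustration, here is a minimal sketch that extracts unigrams and bigrams from the example sentence above using NLTK's `ngrams` helper:
Python
from nltk import ngrams

sentence = "This article is on NLP"
tokens = sentence.split()

# Unigrams are the individual tokens (n = 1)
unigram_list = list(ngrams(tokens, 1))

# Bigrams are pairs of consecutive tokens (n = 2)
bigram_list = list(ngrams(tokens, 2))

print(unigram_list)  # [('This',), ('article',), ('is',), ('on',), ('NLP',)]
print(bigram_list)   # [('This', 'article'), ('article', 'is'), ('is', 'on'), ('on', 'NLP')]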
N-gram Language Model
An N-gram language model predicts the probability of a given N-gram within any sequence of words in a language. A well-crafted N-gram model can effectively predict the next word in a sentence, which is essentially determining the value of p(w∣h), where h is the history or context and w is the word to predict.
Let's explore how to predict the next word in a sentence. We need to calculate p(w|h), where w is the candidate for the next word. Consider the sentence 'This article is on...'. If we want to calculate the probability of the next word being "NLP", it can be expressed as:
p(\text{"NLP"} | \text{"This"}, \text{"article"}, \text{"is"}, \text{"on"})
To generalize, the conditional probability of the fifth word given the first four can be written as:
p(w_5 | w_1, w_2, w_3, w_4) \quad \text{or, in general,} \quad p(w_n | w_1, w_2, \ldots, w_{n-1})
This is calculated using the chain rule of probability:
P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(A \cap B) = P(A|B)P(B)
Now generalize this to sequence probability:
P(X_1, X_2, \ldots, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) \ldots P(X_n | X_1, X_2, \ldots, X_{n-1})
This yields:
P(w_1, w_2, w_3, \ldots, w_n) = \prod_{i} P(w_i | w_1, w_2, \ldots, w_{i-1})
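Written out for the running example sentence, the chain rule expands as:
P(\text{"This article is on NLP"}) = P(\text{"This"}) \cdot P(\text{"article"} | \text{"This"}) \cdot P(\text{"is"} | \text{"This"}, \text{"article"}) \cdot P(\text{"on"} | \text{"This"}, \text{"article"}, \text{"is"}) \cdot P(\text{"NLP"} | \text{"This"}, \text{"article"}, \text{"is"}, \text{"on"})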
By applying the Markov assumption, which states that the probability of a word depends only on a limited window of preceding words rather than on the entire history, we simplify the formula:
P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-k}, \ldots, w_{i-1})
For a unigram model (k=0), this simplifies further to:
P(w_1, w_2, \ldots, w_n) \approx \prod_i P(w_i)
And for a bigram model (k=1):
P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})
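In practice, these conditional probabilities are estimated from corpus counts by maximum likelihood estimation, which is exactly what the counting-and-normalising code below does. For a bigram model, with C(\cdot) denoting a corpus count:
P(w_i | w_{i-1}) = \frac{C(w_{i-1} \, w_i)}{C(w_{i-1})}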
Implementing N-Gram Language Modelling in NLTK
Python
# Import necessary libraries
import nltk
from nltk import bigrams, trigrams
from nltk.corpus import reuters
from collections import defaultdict

# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt')

# Tokenize the text
words = nltk.word_tokenize(' '.join(reuters.words()))

# Create trigrams
tri_grams = list(trigrams(words))

# Build a trigram model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# Function to predict the next word
def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words using the trained trigram model.

    Args:
        w1 (str): The first word.
        w2 (str): The second word.

    Returns:
        str: The predicted next word.
    """
    # Use .get() so an unseen context does not add an empty entry to the defaultdict
    next_word = model.get((w1, w2))
    if next_word:
        # Choose the most likely next word
        predicted_word = max(next_word, key=next_word.get)
        return predicted_word
    else:
        return "No prediction available"

# Example usage
print("Next Word:", predict_next_word('the', 'stock'))
Output:
Next Word: of
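As a quick extension, the same trigram model can generate text by repeatedly feeding its own predictions back in. This is a minimal sketch that reuses the `predict_next_word` function defined above; the seed words 'the' and 'stock' are arbitrary:
Python
# Generate text by repeatedly predicting the next word
def generate_text(w1, w2, length=10):
    """Generate up to `length` additional words starting from the seed pair (w1, w2)."""
    output = [w1, w2]
    for _ in range(length):
        next_word = predict_next_word(output[-2], output[-1])
        if next_word == "No prediction available":
            break  # Stop if the context was never seen in training
        output.append(next_word)
    return ' '.join(output)

print(generate_text('the', 'stock'))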
Metrics for Language Modelling
- Entropy: Entropy is a measure of the amount of information, introduced by Claude Shannon. The formula for entropy is:
H(p) = \sum_{x} p(x)\cdot (-\log p(x))
H(p) is always greater than or equal to 0.
- Cross-Entropy: Cross-entropy measures how well the trained model represents test data W_1^N. It is defined as:
H(p) = \frac{1}{N}\sum_{i=1}^{N} \left(-\log_2 p(w_i | w_1^{i-1})\right)
The cross-entropy is always greater than or equal to the entropy, i.e., the model's uncertainty can be no less than the true uncertainty.
- Perplexity: Perplexity measures how well a probability distribution predicts a sample and can be understood as a measure of uncertainty. It is calculated as 2 raised to the power of the cross-entropy:
\text{Perplexity} = 2^{\text{Cross-Entropy}}
Equivalently, perplexity is the inverse probability of the test set assigned by the language model, normalized by the number of words N (written here for a bigram model):
PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i | w_{i-1})}}
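The same idea can be applied to the NLTK trigram model trained above. The sketch below assumes the `model` dictionary and imports from the earlier example; since unseen trigrams would otherwise receive zero probability, a small floor value is used purely for illustration (an assumption, not a proper smoothing method):
Python
import math

def sentence_perplexity(sentence, floor=1e-6):
    """Compute the perplexity of a sentence under the trigram `model` above.

    Unseen trigrams are given a small floor probability purely for illustration;
    a real system would use a smoothing technique instead.
    """
    tokens = nltk.word_tokenize(sentence)
    log_prob_sum = 0.0
    n_trigrams = 0
    for w1, w2, w3 in trigrams(tokens):
        prob = model.get((w1, w2), {}).get(w3, floor)
        log_prob_sum += -math.log2(prob)   # accumulate cross-entropy terms
        n_trigrams += 1
    if n_trigrams == 0:
        return float('inf')
    return 2 ** (log_prob_sum / n_trigrams)   # perplexity = 2^(cross-entropy)

print(sentence_perplexity("the stock market is expected to rise"))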
For Example:
- Let's take the sentence 'Natural Language Processing' as an example. For predicting the first word, suppose the candidate words have the following probabilities:
| word | P(word \| <start>) |
|------|--------------------|
| The | 0.4 |
| Processing | 0.3 |
| Natural | 0.12 |
| Language | 0.18 |
- Now we know that the probability of the first word being 'Natural' is 0.12. Next, what is the probability of the second word given that the first word is 'Natural'?
| word | P(word \| 'Natural') |
|------|----------------------|
| The | 0.05 |
| Processing | 0.3 |
| Natural | 0.15 |
| Language | 0.5 |
- Having generated 'Natural Language', what is the probability of the third word being 'Processing', given 'Language'?
| word | P(word \| 'Language') |
|------|-----------------------|
| The | 0.1 |
| Processing | 0.7 |
| Natural | 0.1 |
| Language | 0.1 |
- Now, the perplexity can be calculated as:
PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i | w_{i-1})}} = \sqrt[3]{\frac{1}{0.12 \times 0.5 \times 0.7}} \approx 2.876
- From this, the corresponding cross-entropy is:
\text{Cross-Entropy} = \log_2(2.876) \approx 1.524
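The arithmetic above can be verified with a few lines of Python (a minimal sketch; the three probabilities are taken from the tables):
Python
import math

# Probabilities of the three generated words, taken from the tables above
probs = [0.12, 0.5, 0.7]  # P(Natural | <start>), P(Language | Natural), P(Processing | Language)
N = len(probs)

# Perplexity: N-th root of the inverse probability of the sequence
perplexity = (1 / math.prod(probs)) ** (1 / N)

# Cross-entropy: log base 2 of the perplexity
cross_entropy = math.log2(perplexity)

print(f"Perplexity:    {perplexity:.3f}")     # close to the hand-computed 2.876
print(f"Cross-entropy: {cross_entropy:.3f}")  # close to the hand-computed 1.524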
Shortcomings:
- To capture more context, we need higher values of n, but this also increases computational overhead.
- Increasing n also leads to data sparsity, since many longer n-grams never appear in the training corpus.