NLP - BLEU Score for Evaluating Neural Machine Translation - Python
Last Updated :
08 Mar, 2024
Neural Machine Translation (NMT) is a standard task in NLP that involves translating a text from a source language to a target language. BLEU (Bilingual Evaluation Understudy) is a score used to evaluate the translations performed by a machine translator. In this article, we'll see the mathematics behind the BLEU score and its implementation in Python.
What is BLEU Score?
As stated above, the BLEU score is an evaluation metric for machine translation tasks. It is calculated by comparing the n-grams of a machine-translated sentence with the n-grams of one or more human reference translations. In practice, the BLEU score tends to decrease as sentence length increases, although this can vary with the model used for translation. The following is a graph depicting the variation of the BLEU score with sentence length.

Mathematical Expression for BLEU Score
Mathematically, BLEU Score is given as follows:
\text{BLEU Score} = BP \cdot \exp\left(\sum_{i=1}^{N} w_i \ln(p_i)\right)
Here,
- BP stands for Brevity Penalty
- w_i is the weight for the n-gram precision of order i (typically uniform, w_i = 1/N)
- p_i is the n-gram modified precision score of order i.
- N is the maximum n-gram order to consider (usually up to 4)
Modified n-gram precision (p_i)
The modified precision p_i is the ratio of the number of n-grams in the candidate translation that also appear in any reference translation, with each match clipped at that n-gram's maximum count in a single reference, to the total number of n-grams in the candidate translation.
p_i = \frac{\text{CountClip}(\text{matches}_i,\ \text{max-ref-count}_i)}{\text{candidate-n-grams}_i}
Here,
- CountClip is a function that clips the number of matched n-grams (\text{matches}_i) by the maximum count of that n-gram across all reference translations (\text{max-ref-count}_i).
- matches_i is the number of n-grams of order i that match exactly between the candidate translation and any of the reference translations.
- \text{max-ref-count}_i is the maximum number of occurrences of the specific n-gram of order i found in any single reference translation.
- \text{candidate-n-grams}_i is the total number of n-grams of order i present in the candidate translation.
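The definition above can be sketched in a few lines of Python. This is a minimal illustration, not NLTK's implementation; the function name `modified_precision` and the helper `ngrams` are chosen here for clarity.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of order n in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of order n, per the formula above."""
    cand_counts = Counter(ngrams(candidate, n))

    # For each n-gram, the clip is its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the picture the picture by me".split()
references = ["this picture is clicked by me".split(),
              "the picture was clicked by me".split()]
print(modified_precision(candidate, references, 1))   # 2/3 ≈ 0.667
print(modified_precision(candidate, references, 2))   # 2/5 = 0.4
```

The same function reproduces every precision computed by hand in the next section by varying `n` from 1 to 4.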
Brevity Penalty (BP)
Brevity Penalty penalizes translations that are shorter than the reference translations. The mathematical expression for Brevity Penalty is given as follows:
BP = \begin{cases} 1 & \text{if } c \geq r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c < r \end{cases}
Here,
- c is the length of the candidate translation
- r is the effective reference length, i.e., the length of the reference translation closest in length to the candidate.
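In the standard BLEU definition (Papineni et al., 2002), c is the candidate length and r is the effective reference length; the penalty only kicks in when the candidate is shorter. A minimal sketch, with the tie-breaking rule (ties go to the shorter reference) chosen to match common implementations:

```python
import math

def brevity_penalty(candidate_len, reference_lens):
    """Standard BLEU brevity penalty: c is the candidate length, r is the
    length of the reference closest to the candidate (effective length)."""
    c = candidate_len
    # Ties go to the shorter reference.
    r = min(reference_lens, key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c >= r else math.exp(1 - r / c)

print(brevity_penalty(6, [6, 6]))   # 1.0 -- candidate matches reference length
print(brevity_penalty(4, [6, 7]))   # exp(1 - 6/4) ≈ 0.6065
```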
How to Compute BLEU Score?
For a better understanding of the calculation of the BLEU Score, let us take an example. Following is a case for French to English Translation:
- Source Text (French): cette image est cliqué par moi
- Machine Translated Text: the picture the picture by me
- Reference Text-1: this picture is clicked by me
- Reference Text-2: the picture was clicked by me
We can clearly see that the translation done by the machine is not accurate. Let's calculate the BLEU score for the translation.
Unigram Modified Precision
For n = 1, we'll calculate the Unigram Modified Precision:
| Unigram | Count in MT | Max Count in Ref | Clipped Count = min(Count in MT, Max Count in Ref) |
|---|---|---|---|
| the | 2 | 1 | 1 |
| picture | 2 | 1 | 1 |
| by | 1 | 1 | 1 |
| me | 1 | 1 | 1 |
Here, the unigrams (the, picture, by, me) are taken from the machine-translated text. Count in MT refers to the frequency of each unigram in the machine-translated text, and Clipped Count is that frequency clipped at the unigram's maximum count in any single reference text.
P_1 = \frac{\text{Clipped Count}}{\text{Count in MT}} = \frac{1+1+1+1}{2+2+1+1} =\frac{4}{6} = \frac{2}{3}
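These clipped counts can be verified with a few lines of Python using `collections.Counter` (the `|` operator on two Counters keeps the elementwise maximum):

```python
from collections import Counter

candidate = "the picture the picture by me".split()
ref1 = "this picture is clicked by me".split()
ref2 = "the picture was clicked by me".split()

cand_counts = Counter(candidate)         # the: 2, picture: 2, by: 1, me: 1
max_ref = Counter(ref1) | Counter(ref2)  # elementwise max over references

clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
total = sum(cand_counts.values())
print(clipped, total)   # 4 6  ->  P_1 = 4/6 = 2/3
```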
Bigram Modified Precision
For n = 2, we'll calculate the Bigram Modified Precision:
| Bigram | Count in MT | Max Count in Ref | Clipped Count = min(Count in MT, Max Count in Ref) |
|---|---|---|---|
| the picture | 2 | 1 | 1 |
| picture the | 1 | 0 | 0 |
| picture by | 1 | 0 | 0 |
| by me | 1 | 1 | 1 |
P_2 = \frac{\text{Clipped Count}}{\text{Count in MT}} = \frac{1+0+0+1}{2+1+1+1} = \frac{2}{5}
Trigram Modified Precision
For n = 3, we'll calculate the Trigram Modified Precision:
| Trigram | Count in MT | Max Count in Ref | Clipped Count = min(Count in MT, Max Count in Ref) |
|---|---|---|---|
| the picture the | 1 | 0 | 0 |
| picture the picture | 1 | 0 | 0 |
| the picture by | 1 | 0 | 0 |
| picture by me | 1 | 0 | 0 |
P_3 = \frac{0+0+0+0}{1+1+1+1} = 0
4-gram Modified Precision
For n = 4, we'll calculate the 4-gram Modified Precision:
| 4-gram | Count in MT | Max Count in Ref | Clipped Count = min(Count in MT, Max Count in Ref) |
|---|---|---|---|
| the picture the picture | 1 | 0 | 0 |
| picture the picture by | 1 | 0 | 0 |
| the picture by me | 1 | 0 | 0 |
P_4 = \frac{0+0+0}{1+1+1} = 0
Computing Brevity Penalty
Now that all the precision scores are computed, let's find the Brevity Penalty for the translation. Recall that BP = 1 whenever the candidate is at least as long as the effective reference length:
- Candidate (machine translation) length: c = 6 (the picture the picture by me)
- Effective reference length: r = 6 (both references are 6 words long)
Since c \geq r, the Brevity Penalty BP = 1.
Computing BLEU Score
Finally, the BLEU score for the above translation is given by:
\text{BLEU Score} = BP \cdot \exp\left(\sum_{i=1}^{4} w_i \ln(p_i)\right)
With weights (0.25, 0.25, 0, 0), the terms with p_3 = p_4 = 0 receive zero weight and drop out (ln(0) is undefined, which is exactly why those weights are set to zero here). On substituting the values, we get:
\text{BLEU Score} = 1 \cdot \exp(0.25 \ln(2/3) + 0.25 \ln(2/5))
\text{BLEU Score} \approx 0.7186
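This arithmetic can be checked directly in Python:

```python
import math

p1, p2 = 2/3, 2/5
w1, w2 = 0.25, 0.25   # tri-gram and 4-gram weights are zero, so those terms vanish
bp = 1.0

bleu = bp * math.exp(w1 * math.log(p1) + w2 * math.log(p2))
print(round(bleu, 4))   # 0.7186
```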
Finally, we have calculated the BLEU score for the given translation.
BLEU Score Implementation in Python
Having calculated the BLEU score manually, we now understand its mathematical workings. In practice, Python's NLTK library provides a built-in module for BLEU score calculation. Let's compute the score for the same translation example as above, this time using NLTK.
Code:
```python
from nltk.translate.bleu_score import sentence_bleu

# Weights for uni-gram, bi-gram, tri-gram, and 4-gram precision.
# Tri-gram and 4-gram weights are zero because p_3 = p_4 = 0 here.
weights = (0.25, 0.25, 0, 0)

# Reference and candidate texts (same as the worked example above)
references = [["this", "picture", "is", "clicked", "by", "me"],
              ["the", "picture", "was", "clicked", "by", "me"]]
candidate = ["the", "picture", "the", "picture", "by", "me"]

# Calculate the BLEU score with the given weights
score = sentence_bleu(references, candidate, weights=weights)
print(score)
```
Output:
0.7186082239261684
We can see that the BLEU score computed using Python is the same as the one computed manually. Thus, we have successfully calculated the BLEU score and understood the mathematics behind it.
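For scoring a whole test set rather than a single sentence, NLTK also provides `corpus_bleu`, and its `SmoothingFunction` class handles cases where a higher-order precision is zero (which otherwise triggers a warning, as with the tri-gram and 4-gram counts in this example). A brief sketch on the same sentence pair, using smoothing method1 (which adds a small constant to zero counts):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# corpus_bleu takes a list of reference lists, one per candidate sentence
references = [[["this", "picture", "is", "clicked", "by", "me"],
               ["the", "picture", "was", "clicked", "by", "me"]]]
candidates = [["the", "picture", "the", "picture", "by", "me"]]

# method1 keeps ln(p_i) defined when an n-gram count is zero
smoothie = SmoothingFunction().method1
score = corpus_bleu(references, candidates,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smoothie)
print(score)
```

Note that the smoothed score differs from the manual result above, since the zero-count tri-gram and 4-gram precisions now contribute small nonzero terms instead of being dropped via zero weights.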