Natural Language Processing Practical Using Transformers With Python
Tony Snake received a Bachelor of Computer Science and a Bachelor of Business Administration from the American University, USA.
He is a Ph.D. candidate in the Department of Data Informatics, (National) Korea Maritime and Ocean University, Busan 49112, Republic of Korea (South Korea).
His research interests are social network analysis, big data, AI, and robotics.
He received the Best Paper Award at the 15th International Conference on Multimedia Information Technology and Applications (MITA 2019).
Table of Contents
About the Authors
Natural Language Processing Practical using Transformers with Python
CHAPTER 1: Named Entity Recognition using Transformers and Spacy in Python
NER with Transformers
NER with SpaCy
Conclusion
SourceCode:
CHAPTER 2: Fake News Detection in Python
Introduction
How Big is this Problem?
The Solution
Data Exploration
Distribution of Classes
Data Cleaning for Analysis
Explorative Data Analysis
Single-word Cloud
Most Frequent Bigram (Two-word Combination)
Most Frequent Trigram (Three-word combination)
Building a Classifier by Fine-tuning BERT
Data Preparation
Tokenizing the Dataset
Loading and Fine-tuning the Model
Model Evaluation
Appendix: Creating a Submission File for Kaggle
Conclusion
SourceCode:
CHAPTER 3: Paraphrase Text using Transformers in Python
Pegasus Transformer
T5 Transformer
Parrot Paraphraser
Conclusion
SourceCode:
CHAPTER 4: Text Generation with Transformers in Python
Conclusion
SourceCode:
CHAPTER 5: Speech Recognition using Transformers in Python
Getting Started
Preparing the Audio File
Performing Inference
Wrapping up the Code
Conclusion
SourceCode:
CHAPTER 6: Machine Translation using Transformers in Python
Using Pipeline API
Manually Loading the Model
Conclusion
SourceCode:
CHAPTER 7: Train BERT from Scratch using Transformers in Python
Picking a Dataset
Training the Tokenizer
Tokenizing the Dataset
Loading the Model
Training
Using the Model
Conclusion
SourceCode:
CHAPTER 8: Conversational AI Chatbot with Transformers in Python
Generating Responses with Greedy Search
Generating Responses with Beam Search
Generating Responses with Sampling
Nucleus Sampling
Conclusion
SourceCode:
CHAPTER 9: Fine Tune BERT for Text Classification using Transformers in Python
Loading the Dataset
Training the Model
Performing Inference
Conclusion
SourceCode:
CHAPTER 10: Perform Text Summarization using Transformers in Python
Using pipeline API
Using T5 Model
Conclusion
SourceCode:
CHAPTER 11: Sentiment Analysis using VADER in Python
Conclusion
SourceCode:
CHAPTER 12: Translate Languages in Python
Translating Text
Translating List of Phrases
Language Detection
Supported Languages
Conclusion
SourceCode:
CHAPTER 13: Perform Text Classification in Python using Tensorflow 2 and Keras
Data Preparation
Building the Model
Training the Model
Testing the Model
Hyperparameter Tuning
Integrating Custom Datasets
SourceCode:
CHAPTER 14: Build a Text Generator using TensorFlow 2 and Keras in Python
Getting Started
Preparing the Dataset
Building the Model
Training the Model
Generating New Text
Conclusion
SourceCode:
CHAPTER 15: Build a Spam Classifier using Keras and TensorFlow in Python
1. Installing and Importing Dependencies
2. Loading the Dataset
3. Preparing the Dataset
4. Building the Model
5. Training the Model
6. Evaluating the Model
SourceCode:
CHAPTER 1: Named Entity Recognition using Transformers and Spacy in Python
Summary
By the end of this tutorial, you will be able to perform named entity recognition on any given English text with HuggingFace Transformers and SpaCy in Python.
To get started, let's install the required libraries for this tutorial. First, install transformers:
$ pip install --upgrade transformers sentencepiece
Next, we need to install spacy and spacy-transformers . To do that, I've grabbed the
latest .whl file from the spacy-models releases for installation:
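$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl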
Of course, if you're reading this tutorial in the future, make sure to get the
latest release from this page if you encounter any problems regarding the
above command.
Once done with the installation, let's get started with the code:
import spacy
from transformers import *
For this tutorial, we'll be performing NER on this text that I've grabbed from
Wikipedia:
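text = """
Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time.
Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics.
Einstein was born in the German Empire, but moved to Switzerland in 1895, forsaking his German citizenship (as a subject of the Kingdom of Württemberg) the following year.
In 1897, at the age of 17, he enrolled in the mathematics and physics teaching diploma program at the Swiss Federal polytechnic school in Zürich, graduating in 1900
"""
We start with a BERT model fine-tuned for NER, loaded through the pipeline API (the same code as in the SourceCode section below):
# load BERT model fine-tuned for Named Entity Recognition (NER)
ner = pipeline("ner", model="dslim/bert-base-NER")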
Let's extract the entities for our text using this model:
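# perform inference on the transformer model
doc_ner = ner(text)
# print the output
doc_ner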
Output:
[{'end': 7,
'entity': 'B-PER',
'index': 1,
'score': 0.99949145,
'start': 1,
'word': 'Albert'},
{'end': 16,
'entity': 'I-PER',
'index': 2,
'score': 0.998417,
'start': 8,
'word': 'Einstein'},
{'end': 29,
'entity': 'B-MISC',
'index': 5,
'score': 0.99211043,
'start': 23,
'word': 'German'},
{'end': 158,
'entity': 'B-PER',
'index': 28,
'score': 0.99736506,
'start': 150,
'word': 'Einstein'},
{'end': 318,
'entity': 'B-PER',
'index': 55,
'score': 0.9977113,
'start': 310,
'word': 'Einstein'},
{'end': 341,
'entity': 'B-LOC',
'index': 60,
'score': 0.50242233,
'start': 335,
'word': 'German'},
{'end': 348,
'entity': 'I-LOC',
'index': 61,
'score': 0.95330054,
'start': 342,
'word': 'Empire'},
{'end': 374,
'entity': 'B-LOC',
'index': 66,
'score': 0.99978524,
'start': 363,
'word': 'Switzerland'},
{'end': 404,
'entity': 'B-MISC',
'index': 74,
'score': 0.9995827,
'start': 398,
'word': 'German'},
{'end': 460,
'entity': 'B-LOC',
'index': 84,
'score': 0.9994709,
'start': 449,
'word': 'Württemberg'},
{'end': 590,
'entity': 'B-MISC',
'index': 111,
'score': 0.9888771,
'start': 585,
'word': 'Swiss'},
{'end': 627,
'entity': 'B-LOC',
'index': 119,
'score': 0.9977405,
'start': 621,
'word': 'Zürich'}]
As you can see, the output is a list of dictionaries, each containing the start and end positions of the entity in the text, the prediction score, the word itself, the token index, and the entity label.
Next, let's make a function that uses spaCy to visualize this Python
dictionary:
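def get_entities_html(text, ner_result, title=None):
    """Returns a visual version of NER with the help of SpaCy"""
    ents = []
    for ent in ner_result:
        e = {}
        # add the start and end positions of the entity
        e["start"] = ent["start"]
        e["end"] = ent["end"]
        # add the score if you want in the label
        # e["label"] = f"{ent['entity']}-{ent['score']:.2f}"
        e["label"] = ent["entity"]
        if ents and -1 <= ent["start"] - ents[-1]["end"] <= 1 and ents[-1]["label"] == e["label"]:
            # if the current entity is shared with previous entity
            # simply extend the entity end position instead of adding a new one
            ents[-1]["end"] = e["end"]
            continue
        ents.append(e)
    # construct data required for displacy.render() method
    render_data = [
        {
            "text": text,
            "ents": ents,
            "title": title,
        }
    ]
    return spacy.displacy.render(render_data, style="ent", manual=True, jupyter=True)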
The whole purpose of the for loop is to construct a list of dictionaries with the start and end positions and the entity's label. We also check whether the same entity label appears in adjacent tokens, and if so, we combine them into a single span.
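Calling it on the predictions of our first model:
# get HTML representation of NER of our text
get_entities_html(text, doc_ner)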
acknowledged to be one of the greatest and most influential physicists of all time. EinsteinB-
PER is best known for developing the theory of relativity, but he also made important
contributions to the development of the theory of quantum mechanics.EinsteinB-PER was born
in the GermanB-LOC EmpireI-LOC, but moved to SwitzerlandB-LOC in 1895, forsaking
his GermanB-MISC citizenship (as a subject of the Kingdom of WürttembergB-LOC) the
following year. In 1897, at the age of 17, he enrolled in the mathematics and physics teaching
diploma program at the SwissB-MISC Federal polytechnic school in ZürichB-LOC, graduating
in 1900
Next, let's load another relatively larger and better model that is based
on roberta-large :
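# load roberta-large model
ner2 = pipeline("ner", model="xlm-roberta-large-finetuned-conll03-english")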
Performing inference:
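# perform inference on this model
doc_ner2 = ner2(text)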
Visualizing:
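# get HTML representation of NER of our text
get_entities_html(text, doc_ner2)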
As you can see, the result is now improved, with Albert Einstein named as a single entity, as well as the Kingdom of Württemberg.
There are a lot of other models that were fine-tuned on the same dataset.
Here's yet another one:
# load yet another roberta-large model
ner3 = pipeline("ner", model="Jean-Baptiste/roberta-large-ner-english")
# perform inference on this model
doc_ner3 = ner3(text)
# get HTML representation of NER of our text
get_entities_html(text, doc_ner3)
one of the greatest and most influential physicists of all time. EinsteinPER is best known for
developing the theory of relativity, but he also made important contributions to the development
of the theory of quantum mechanics.EinsteinPER was born in the German EmpireLOC, but
moved to SwitzerlandLOC in 1895, forsaking his GermanMISC citizenship (as a subject of
the Kingdom of WürttembergLOC) the following year. In 1897, at the age of 17, he enrolled in
the mathematics and physics teaching diploma program at
the SwissMISC FederalORG polytechnic school in ZürichLOC, graduating in 1900
This model, however, only has PER , MISC , LOC , and ORG entities. SpaCy
automatically colors the familiar entities.
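Let's now switch to spaCy's own pretrained pipelines, starting with the small CPU-optimized English model (download it first with: python -m spacy download en_core_web_sm):
# load the English CPU-optimized pipeline
nlp = spacy.load("en_core_web_sm")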
We're loading the model we've downloaded. Make sure you download the
model you want to use before loading it here. Next, let's generate our
document:
# predict the entities
doc = nlp(text)
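# display the doc with jupyter mode
spacy.displacy.render(doc, style="ent", jupyter=True)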
This one looks much better, and there are a lot more entity types (18) than the previous ones, namely CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, and WORK_OF_ART.
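Next, let's load the transformer-based spaCy pipeline and render its predictions the same way:
# load the English transformer pipeline (roberta-base) using spaCy
nlp_trf = spacy.load('en_core_web_trf')
# perform inference on the model
doc_trf = nlp_trf(text)
# display the doc with jupyter mode
spacy.displacy.render(doc_trf, style="ent", jupyter=True)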
This time Swiss Federal was labeled as an organization, even though it wasn't
complete (it should be Swiss Federal polytechnic school), and quantum
mechanics is no longer an organization.
The en_core_web_trf model performs much better than the previous ones. The spaCy documentation includes a table that lists each English model along with its size and evaluation metrics.
For other languages, spaCy strives to make models available for as many languages as possible. You can check the spaCy models page to see what is available for each language.
SourceCode:
ner.py
# %%
# !pip install --upgrade transformers sentencepiece
# %%
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
# %%
# !python -m spacy download en_core_web_sm
# %%
import spacy
from transformers import *
# %%
# sample text from Wikipedia
text = """
Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest
and most influential physicists of all time.
Einstein is best known for developing the theory of relativity, but he also made important contributions
to the development of the theory of quantum mechanics.
Einstein was born in the German Empire, but moved to Switzerland in 1895, forsaking his German
citizenship (as a subject of the Kingdom of Württemberg) the following year.
In 1897, at the age of 17, he enrolled in the mathematics and physics teaching diploma program at the
Swiss Federal polytechnic school in Zürich, graduating in 1900
"""
# %%
# load BERT model fine-tuned for Named Entity Recognition (NER)
ner = pipeline("ner", model="dslim/bert-base-NER")
# %%
# perform inference on the transformer model
doc_ner = ner(text)
# print the output
doc_ner
# %%
def get_entities_html(text, ner_result, title=None):
    """Returns a visual version of NER with the help of SpaCy"""
    ents = []
    for ent in ner_result:
        e = {}
        # add the start and end positions of the entity
        e["start"] = ent["start"]
        e["end"] = ent["end"]
        # add the score if you want in the label
        # e["label"] = f"{ent['entity']}-{ent['score']:.2f}"
        e["label"] = ent["entity"]
        if ents and -1 <= ent["start"] - ents[-1]["end"] <= 1 and ents[-1]["label"] == e["label"]:
            # if the current entity is shared with previous entity
            # simply extend the entity end position instead of adding a new one
            ents[-1]["end"] = e["end"]
            continue
        ents.append(e)
    # construct data required for displacy.render() method
    render_data = [
        {
            "text": text,
            "ents": ents,
            "title": title,
        }
    ]
    return spacy.displacy.render(render_data, style="ent", manual=True, jupyter=True)
# %%
# get HTML representation of NER of our text
get_entities_html(text, doc_ner)
# %%
# load roberta-large model
ner2 = pipeline("ner", model="xlm-roberta-large-finetuned-conll03-english")
# %%
# perform inference on this model
doc_ner2 = ner2(text)
# %%
# get HTML representation of NER of our text
get_entities_html(text, doc_ner2)
# %%
# load yet another roberta-large model
ner3 = pipeline("ner", model="Jean-Baptiste/roberta-large-ner-english")
# %%
# perform inference on this model
doc_ner3 = ner3(text)
# %%
# get HTML representation of NER of our text
get_entities_html(text, doc_ner3)
# %%
# load the English CPU-optimized pipeline
nlp = spacy.load("en_core_web_sm")
# %%
# predict the entities
doc = nlp(text)
# %%
# display the doc with jupyter mode
spacy.displacy.render(doc, style="ent", jupyter=True)
# %%
# load the English transformer pipeline (roberta-base) using spaCy
nlp_trf = spacy.load('en_core_web_trf')
# %%
# perform inference on the model
doc_trf = nlp_trf(text)
# display the doc with jupyter mode
spacy.displacy.render(doc_trf, style="ent", jupyter=True)
CHAPTER 2: Fake News
Detection in Python
Exploring the fake news dataset, performing data analysis such as word clouds and n-grams, and fine-tuning a BERT transformer to build a fake news detector in Python using the transformers library.
Introduction
Fake news is the intentional broadcasting of false or misleading claims as
news, where the statements are purposely deceitful.
Consumers now have instant access to the latest news through digital media platforms, which have grown in prominence thanks to their easy connectedness to the rest of the world and because they let users discuss and share ideas on topics such as democracy, education, health, research, and history. Fake news items on these platforms are becoming more widespread and are used for profit, such as political and financial gain.
Several studies and experiments are being conducted to detect fake news
across all mediums.
Data Exploration
In this work, we use the fake news dataset from Kaggle to classify untrustworthy news articles as fake news. We have a complete training dataset containing the following attributes: id, title, author, text, and label (1 for unreliable, 0 for reliable).
If you have a Kaggle account, you can simply download the dataset from the
website there and extract the ZIP file.
I also uploaded the dataset into Google Drive, and you can get it here or use
the gdown library to download it in Google Colab or Jupyter notebooks
automatically:
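A possible command, using the file ID visible in the download log below (gdown must be installed first with pip install gdown):
$ gdown "https://drive.google.com/uc?id=178f_VkNxccNidap-5-uffXUW475pAuPy&confirm=t"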
Downloading...
From: https://drive.google.com/uc?id=178f_VkNxccNidap-5-uffXUW475pAuPy&confirm=t
To: /content/fake-news.zip
100% 48.7M/48.7M [00:00<00:00, 74.6MB/s]
$ unzip fake-news.zip
Three files will appear in the current working directory: train.csv, test.csv, and submit.csv. We will be using train.csv for most of the tutorial.
Note: If you're in a local environment, make sure you install PyTorch with GPU support; head to the official PyTorch installation page for proper setup.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
The NLTK corpora and modules must be installed using the standard NLTK
downloader:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
The fake news dataset comprises various authors' original and fictitious
article titles and text. Let's import our dataset:
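A minimal way to load it with pandas (the dataframe name news_d is the one used throughout the rest of the chapter):
news_d = pd.read_csv("train.csv")
news_d.head()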
Output:
id | title | author | text | label
0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1
1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0
2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1
3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1
4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1
We have 20,800 rows and five columns. Let's see some statistics for the text column:
txt_length = news_d.text.str.split().str.len()
txt_length.describe()
Output:
count 20761.000000
mean 760.308126
std 869.525988
min 0.000000
25% 269.000000
50% 556.000000
75% 1052.000000
max 24234.000000
Name: text, dtype: float64
#Title statistics
title_length = news_d.title.str.split().str.len()
title_length.describe()
Output:
count 20242.000000
mean 12.420709
std 4.098735
min 1.000000
25% 10.000000
50% 13.000000
75% 15.000000
max 72.000000
Name: title, dtype: float64
The statistics for the training set are as follows:
The text attribute has a higher word count, with an average of 760 words and a 75th percentile of about 1,052 words.
The title attribute is a short statement, with an average of 12 words and a 75th percentile of 15 words.
Distribution of Classes
Counting plots for both labels:
sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());
Output:
1: Unreliable
0: Reliable
Distribution of labels:
1 10413
0 10387
Name: label, dtype: int64
print(round(news_d.label.value_counts(normalize=True),2)*100);
Output:
1 50.0
0 50.0
Name: label, dtype: float64
# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter
ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
def clean_dataset(df):
    # remove unused column
    df = remove_unused_c(df)
    # impute null values
    df = null_process(df)
    return df
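The helper functions remove_unused_c() and null_process() are not shown in this excerpt; a minimal sketch of what they are assumed to do (drop columns unused for the text analysis and impute missing values) could look like this:
def remove_unused_c(df, column_n=["id", "author"]):
    # drop columns that are not needed for the exploratory text analysis (assumed column list)
    return df.drop(columns=column_n, errors="ignore")

def null_process(feature_df):
    # replace missing values with the string "None"
    for col in feature_df.columns:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df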
The WordCloud class from the wordcloud library will be used, and its generate() method produces the word cloud image:
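A minimal sketch (not the author's exact code), assuming the cleaning function and the stop_words list defined above:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# clean the dataframe as in the SourceCode section
df = clean_dataset(news_d)
# join all article texts into one string and build the word cloud
all_text = " ".join(df["text"].astype(str).tolist())
wc = WordCloud(width=800, height=400, background_color="white", stopwords=stop_words).generate(all_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()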
Building a Classifier by Fine-tuning BERT
import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score  # used later in compute_metrics
import random
def set_seed(seed: int):
    """Helper function for reproducible behavior: sets the seed in random, numpy, torch and/or tf (if installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf
        tf.random.set_seed(seed)

set_seed(1)
Data Preparation
Let's now clean NaN values from text , author , and title columns:
news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]
Next, making a function that takes the dataset as a Pandas dataframe and
returns the train/validation splits of texts and labels as lists:
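A sketch of such a function (the exact implementation may differ; include_title and include_author are the flags described right after, and the default test_size of 0.2 matches the split sizes shown below):
def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
    texts = []
    labels = []
    for i in range(len(df)):
        text = df["text"].iloc[i]
        label = df["label"].iloc[i]
        if include_title:
            # prepend the title to the article body
            text = df["title"].iloc[i] + " - " + text
        if include_author:
            # prepend the author as well
            text = df["author"].iloc[i] + " : " + text
        texts.append(text)
        labels.append(label)
    return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)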
The above function takes the dataset as a dataframe and returns the texts and labels split into training and validation lists. Setting include_title to True means that we prepend the title column to the text we're going to use for training, and setting include_author to True means we add the author as well.
Let's make sure the labels and texts have the same length:
print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))
Output:
14628 14628
3657 3657
Tokenizing the Dataset
Let's use the BERT tokenizer to tokenize our dataset:
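A sketch of the tokenization step (bert-base-uncased and max_length=512 are assumptions):
# the pre-trained checkpoint and maximum sequence length assumed here
model_name = "bert-base-uncased"
max_length = 512
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)
# tokenize the training and validation texts, truncating/padding to max_length
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)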
class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # the Trainer needs __getitem__ to fetch a single tokenized example
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
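We then wrap the encodings into dataset objects and load the pre-trained BERT model with a classification head (the names match the Trainer arguments below; num_labels=2 because we have two classes):
# convert our tokenized data into torch Datasets
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)
# load the model with a binary classification head
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)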
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    save_steps=200,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)
I've set per_device_train_batch_size to 10, but you should set it as high as your GPU memory allows. Setting logging_steps and save_steps to 200 means we perform evaluation and save the model weights every 200 training steps.
You can check this page for more detailed information about the available
training parameters.
trainer = Trainer(
    model=model,                      # the instantiated Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_dataset,      # training dataset
    eval_dataset=valid_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,  # the callback that computes metrics of interest
)
Training the model:
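# train the model
trainer.train()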
The training takes a few hours to finish, depending on your GPU. On the free version of Colab with an NVIDIA Tesla K80, it should take about an hour.
Model Evaluation
Since load_best_model_at_end is set to True , the best weights will be loaded when
the training is completed. Let's evaluate it with our validation set:
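# evaluate the model on the validation set
trainer.evaluate()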
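Next, we save the fine-tuned model and tokenizer (the directory name here is an assumption):
# saving the fine-tuned model & tokenizer
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)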
A new folder containing the model configuration and weights will appear
after running the above cell. If you want to perform prediction, you simply
use the from_pretrained() method we used when we loaded the model, and you're
good to go.
Next, let's make a function that accepts the article text as an argument and returns whether it's fake or not:
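A sketch of such a function (assuming the fine-tuned model and tokenizer from above; label 0 maps to "reliable" and 1 to "fake", matching the dataset labels):
def get_prediction(text, convert_to_label=False):
    # tokenize the text, run it through the model, and take the argmax of the logits
    inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt").to(model.device)
    outputs = model(**inputs)
    probs = outputs.logits.softmax(dim=1)
    d = {0: "reliable", 1: "fake"}
    pred = int(probs.argmax())
    return d[pred] if convert_to_label else pred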
I've taken an example from test.csv, which the model never saw, to perform inference. I checked it, and it's an actual article from The New York Times:
real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The
New York Times",Daniel Victor,"If at first you don’t succeed, try a different
sport. Tim Tebow, who was a Heisman quarterback at the University of
Florida but was unable to hold an N. F. L. job, is pursuing a career in Major
League Baseball. <SNIPPED>
"""
The original text is in the Colab environment if you want to read it, as it's a complete article. Let's pass it to the model and see the results:
get_prediction(real_news, convert_to_label=True)
Output:
reliable
After concatenating the author, title, and article text together, we apply the get_prediction() function to the new column to fill the label column, and then use the to_csv() method to create the submission file for Kaggle (a sketch is shown below). Here's my submission score:
We got 99.78% and 100% accuracy on private and public leaderboards.
That's awesome!
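A sketch of that submission step (the concatenation format mirrors the SourceCode section; the output column layout and file name are assumptions based on Kaggle's sample submission):
# read the test set and build the combined text column
test_df = pd.read_csv("test.csv")
new_df = test_df.copy()
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)
# predict a label for every article and write the submission file
new_df["label"] = new_df["new_text"].apply(get_prediction)
new_df[["id", "label"]].to_csv("submit_final.csv", index=False)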
Conclusion
Alright, we're done with the tutorial. You can check this page to see various
training parameters you can tweak.
If you have a custom fake news dataset for fine-tuning, you simply have to pass a list of samples to the tokenizer as we did; you won't need to change any other code after that.
SourceCode:
fakenews_detection.py
# -*- coding: utf-8 -*-
"""fakenews_seq_classification.ipynb"""
from google.colab import files
files.upload()
!unzip test.csv.zip
!unzip train.csv.zip
!unzip fake-news.zip
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
txt_length = news_d.text.str.split().str.len()
txt_length.describe()
#Title statistics
title_length = news_d.title.str.split().str.len()
title_length.describe()
sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());
print(round(news_d.label.value_counts(normalize=True),2)*100);
# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter
ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
def clean_dataset(df):
    # remove unused column
    df = remove_unused_c(df)
    # impute null values
    df = null_process(df)
    return df
# Perform data cleaning on train and test dataset by calling clean_dataset function
df = clean_dataset(news_d)
# apply preprocessing on text through apply method by calling the function nltk_preprocess
df["text"] = df.text.apply(nltk_preprocess)
# apply preprocessing on title through apply method by calling the function nltk_preprocess
df["title"] = df.title.apply(nltk_preprocess)
import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split
import random
def set_seed(seed: int):
    """Helper function for reproducible behavior: sets the seed in random, numpy, torch and/or tf (if installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf
        tf.random.set_seed(seed)

set_seed(1)
news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]
print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))
class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # the Trainer needs __getitem__ to fetch a single tokenized example
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    save_steps=200,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)
trainer = Trainer(
model=model, # the instantiated Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=valid_dataset, # evaluation dataset
compute_metrics=compute_metrics, # the callback that computes metrics of interest
)
real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times",Daniel
Victor,"If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman
quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in
Major League Baseball. He will hold a workout for M. L. B. teams this month, his agents told ESPN
and other news outlets. “This may sound like a publicity stunt, but nothing could be further from the
truth,” said Brodie Van Wagenen, of CAA Baseball, part of the sports agency CAA Sports, in the
statement. “I have seen Tim’s workouts, and people inside and outside the industry — scouts,
executives, players and fans — will be impressed by his talent. ” It’s been over a decade since
Tebow, 28, has played baseball full time, which means a comeback would be no easy task. But the
former major league catcher Chad Moeller, who said in the statement that he had been training Tebow
in Arizona, said he was “beyond impressed with Tim’s athleticism and swing. ” “I see bat speed and
power and real baseball talent,” Moeller said. “I truly believe Tim has the skill set and potential to
achieve his goal of playing in the major leagues and based on what I have seen over the past two
months, it could happen relatively quickly. ” Or, take it from Gary Sheffield, the former outfielder.
News of Tebow’s attempted comeback in baseball was greeted with skepticism on Twitter. As a junior
at Nease High in Ponte Vedra, Fla. Tebow drew the attention of major league scouts, batting . 494 with
four home runs as a left fielder. But he ditched the bat and glove in favor of pigskin, leading Florida to
two national championships, in 2007 and 2009. Two former scouts for the Los Angeles Angels told
WEEI, a Boston radio station, that Tebow had been under consideration as a high school junior.
“We wanted to draft him, but he never sent back his information card,” said one of
the scouts, Tom Kotchman, referring to a questionnaire the team had sent him. “He had a strong arm
and had a lot of power,” said the other scout, Stephen Hargett. “If he would have been there his senior
year he definitely would have had a good chance to be drafted. ” “It was just easy for him,” Hargett
added. “You thought, If this guy dedicated everything to baseball like he did to football how good
could he be?” Tebow’s high school baseball coach, Greg Mullins, told The Sporting News in 2013 that
he believed Tebow could have made the major leagues. “He was the leader of the team with his
passion, his fire and his energy,” Mullins said. “He loved to play baseball, too. He just had a bigger fire
for football. ” Tebow wouldn’t be the first athlete to switch from the N. F. L. to M. L. B. Bo Jackson
had one season as a Kansas City Royal, and Deion Sanders played several years for the Atlanta Braves
with mixed success. Though Michael Jordan tried to cross over to baseball from basketball as a in
1994, he did not fare as well playing one year for a Chicago White Sox minor league team. As a
football player, Tebow was unable to match his college success in the pros. The Denver Broncos
drafted him in the first round of the 2010 N. F. L. Draft, and he quickly developed a reputation for
clutch performances, including a memorable pass against the Pittsburgh Steelers in the 2011 Wild
Card round. But his stats and his passing form weren’t pretty, and he spent just two years in Denver
before moving to the Jets in 2012, where he spent his last season on an N. F. L. roster. He was cut
during preseason from the New England Patriots in 2013 and from the Philadelphia Eagles in 2015.
"""
get_prediction(real_news, convert_to_label=True)
test_df.head()
# add a new column that contains the author, title and article content
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " +
new_df["text"].astype(str)
new_df.head()
CHAPTER 3: Paraphrase Text using Transformers in Python
Pegasus Transformer
In this section, we'll use the Pegasus transformer architecture model that was
fine-tuned for paraphrasing instead of summarization. To instantiate the
model, we need to use PegasusForConditionalGeneration as it's a form of text
generation:
model = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase")
tokenizer = PegasusTokenizerFast.from_pretrained("tuner007/pegasus_paraphrase")
Next, let's make a general function that takes a model, its tokenizer, the target
sentence and returns the paraphrased text:
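A sketch of that function (the generation parameters are assumptions; num_beams and num_return_sequences control how many paraphrases come back), followed by a call on the example sentence from the SourceCode section:
def get_paraphrased_sentences(model, tokenizer, sentence, num_return_sequences=5, num_beams=5):
    # tokenize the text to be paraphrased
    inputs = tokenizer([sentence], truncation=True, padding="longest", return_tensors="pt")
    # generate candidates with beam search and decode them back to strings
    outputs = model.generate(**inputs, num_beams=num_beams, num_return_sequences=num_return_sequences)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

sentence = "Learning is the process of acquiring new understanding, knowledge, behaviors, skills, values, attitudes, and preferences."
get_paraphrased_sentences(model, tokenizer, sentence, num_return_sequences=5, num_beams=5)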
I highly suggest you check this blog post to learn more about the parameters
of the model.generate() method.
Outstanding results! Most of the generations are accurate and can be used.
You can try different sentences from your mind and see the results yourself.
T5 Transformer
This section will explore the T5 architecture model that was fine-tuned on the
PAWS dataset. PAWS consists of 108,463 human-labeled and 656k noisily
labeled pairs. Let's load the model and the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")
model =
AutoModelForSeq2SeqLM.from_pretrained("Vamsi/T5_Paraphrase_Paws")
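We can reuse the same helper function (this call, including the "paraphrase: " prefix, is taken from the SourceCode section):
get_paraphrased_sentences(model, tokenizer, "paraphrase: " + "One of the best ways to learn is to teach what you've already learned")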
Output:
["One of the best ways to learn is to teach what you've already learned.",
'One of the best ways to learn is to teach what you have already learned.',
'One of the best ways to learn is to teach what you already know.',
'One of the best ways to learn is to teach what you already learned.',
"One of the best ways to learn is to teach what you've already learned."]
These are promising results too. However, if you get some not-so-good paraphrased text, prefix the input text with "paraphrase: ", as T5 was intended for multiple text-to-text NLP tasks, such as machine translation and text summarization, and it was pre-trained and fine-tuned with such task prefixes.
Parrot Paraphraser
Next, let's use the Parrot paraphrasing framework. First, install it:
$ pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
from parrot import Parrot

parrot = Parrot()
This will download the models' weights and the tokenizer, give it some time,
and it'll finish in a few seconds to several minutes, depending on your
Internet connection.
This library uses more than one model. It uses one model for paraphrasing,
one for calculating adequacy, another for calculating fluency, and the last for
diversity.
Let's use the previous sentences and another one and see the results:
phrases = [
    sentence,
    "One of the best ways to learn is to teach what you've already learned",
    "Paraphrasing is the process of coming up with someone else's ideas in your own words",
]
for phrase in phrases:
    print("-"*100)
    print("Input_phrase: ", phrase)
    print("-"*100)
    paraphrases = parrot.augment(input_phrase=phrase)
    for paraphrase in paraphrases:
        print(paraphrase)
With this library, we simply use the parrot.augment() method and pass the sentence as text; it returns several candidate paraphrased texts. Check the output:
---------------------------------------------------------------------------------------------
-------
Input_phrase: Learning is the process of acquiring new understanding,
knowledge, behaviors, skills, values, attitudes, and preferences.
---------------------------------------------------------------------------------------------
-------
('learning is the process of acquiring new knowledge behaviors skills values
attitudes and preferences', 27)
('learning is the process of acquiring new understanding knowledge behaviors
skills values attitudes and preferences', 13)
---------------------------------------------------------------------------------------------
-------
Input_phrase: One of the best ways to learn is to teach what you've already
learned
---------------------------------------------------------------------------------------------
-------
('one of the best ways to learn is to teach what you know', 29)
('one of the best ways to learn is to teach what you already know', 21)
('one of the best ways to learn is to teach what you have already learned', 15)
---------------------------------------------------------------------------------------------
-------
Input_phrase: Paraphrasing is the process of coming up with someone else's
ideas in your own words
---------------------------------------------------------------------------------------------
-------
("paraphrasing is the process of coming up with a person's ideas in your own
words", 23)
("paraphrasing is the process of coming up with another person's ideas in
your own words", 23)
("paraphrasing is the process of coming up with another's ideas in your own
words", 22)
("paraphrasing is the process of coming up with someone's ideas in your own
words", 17)
("paraphrasing is the process of coming up with somebody else's ideas in
your own words", 15)
("paraphrasing is the process of coming up with someone else's ideas in your
own words", 12)
The number accompanied with each sentence is the diversity score. The
higher the value, the more diverse the sentence from the original.
Conclusion
Alright! That's it for the tutorial. Hopefully, you have explored the most
valuable ways to perform automatic text paraphrasing using transformers and
AI in general.
You can get the complete code here or the Colab notebook here.
SourceCode:
paraphrasing_with_transformers.py
# -*- coding: utf-8 -*-
"""Paraphrasing-with-Transformers_PythonCode.ipynb"""
model = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase")
tokenizer = PegasusTokenizerFast.from_pretrained("tuner007/pegasus_paraphrase")
sentence = "Learning is the process of acquiring new understanding, knowledge, behaviors, skills,
values, attitudes, and preferences."
tokenizer = AutoTokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")
model = AutoModelForSeq2SeqLM.from_pretrained("Vamsi/T5_Paraphrase_Paws")
get_paraphrased_sentences(model, tokenizer, "paraphrase: " + "One of the best ways to learn is to teach what you've already learned")
parrot = Parrot()
phrases = [
sentence,
"One of the best ways to learn is to teach what you've already learned",
"Paraphrasing is the process of coming up with someone else's ideas in your own words"
]
CHAPTER 4: Text Generation with Transformers in Python
Unfortunately, we cannot use GPT-3, as OpenAI did not release the model weights, and even if it did, most of us could not afford the hardware needed to load such a large model into memory.
Luckily, EleutherAI did a great job trying to mimic the capabilities of GPT-3 by releasing the GPT-J model. GPT-J has 6 billion parameters and consists of 28 layers with a hidden dimension of 4096; it was pre-trained on the Pile dataset, a large-scale dataset created by EleutherAI itself.
The Pile is a massive dataset of over 825GB, consisting of 22 sub-datasets, including English Wikipedia (6.38GB), GitHub (95.16GB), Stack Exchange (32.2GB), ArXiv (56.21GB), and more. This explains the impressive performance of GPT-J that you'll hopefully see in this tutorial.
In this guide, we're going to perform text generation using GPT-2 as well as
EleutherAI models using the Huggingface Transformers library in Python.
The table below shows some of the useful models along with their number of parameters and size. I suggest you choose the largest model you can fit in your environment's memory:
Model | Number of Parameters | Size
gpt2 | 124M | 523MB
EleutherAI/gpt-neo-125M | 125M | 502MB
EleutherAI/gpt-neo-1.3B | 1.3B | 4.95GB
EleutherAI/gpt-neo-2.7B | 2.7B | 9.94GB
EleutherAI/gpt-j-6B | 6B | 22.5GB
The EleutherAI/gpt-j-6B model is about 22.5GB in size, so make sure you have more than 22.5GB of memory available to perform inference on it. The good news is that Google Colab with the High-RAM option worked for me. If you're not able to load that big a model, you can try smaller versions such as EleutherAI/gpt-neo-2.7B or EleutherAI/gpt-neo-1.3B. The models we're going to use in this tutorial are the highlighted ones in the above table.
In this tutorial, we will only use the pipeline API, as it'll be more than enough
for text generation.
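A sketch of the GPT-2 generation step (the prompt matches the outputs shown below; the sampling parameters are assumptions):
from transformers import pipeline

# download & load the GPT-2 model
gpt2_generator = pipeline("text-generation", model="gpt2")
# generate 3 different completions by sampling from the top 50 candidate tokens
sentences = gpt2_generator("To be honest, neural networks", do_sample=True, top_k=50, temperature=0.6, max_length=128, num_return_sequences=3)
for sentence in sentences:
    print(sentence["generated_text"])
    print("=" * 50)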
Output:
Instead, researchers are using the deep learning technology that's built into
the brain to learn from the other side of the brain. It's called deep learning,
and it's pretty straightforward.
For example, you can do a lot of things that are still not very well understood
by the outside world. For example, you can read a lot of information, and it's
a lot easier to understand what the person
==================================================
To be honest, neural networks are not perfect, but they are pretty good at it.
I've used them to build a lot of things. I've built a lot of things. I'm pretty
good at it.
When we talk about what we're doing with our AI, it's kind of like a
computer is going to go through a process that you can't see. It's going to
have to go through it for you to see it. And then you can see it. And you can
see it. It's going to be there for you. And then it's going to be there for you to
see it
==================================================
To be honest, neural networks are going to be very interesting to study for a
long time. And they're going to be very interesting to read, because they're
going to be able to learn a lot from what we've learned. And this is where the
challenge is, and this is also something that we've been working on, is that we
can take a neural network and make a neural network that's a lot simpler than
a neural network, and then we can do that with a lot more complexity.
So, I think it's possible that in the next few years, we may be able to make a
neural network that
==================================================
Notice that the third generation was cut off before completing; you can always increase max_length to generate more tokens.
Now we have explored GPT-2, it's time to dive into the fascinating GPT-J:
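Loading it is the same one-liner (the variable name gpt_j_generator is the one used in the SourceCode section):
# download & load the GPT-J model (this can take a while)
gpt_j_generator = pipeline("text-generation", model="EleutherAI/gpt-j-6B")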
The model is about 22.5GB, so make sure your environment is capable of loading it into memory. I'm using the High-RAM instance on Google Colab, and it's running quite well. However, it may take a while to generate text, especially when you pass a higher value of max_length.
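The prompt used for the outputs below was "To be honest, robots will"; the sampling parameters in this sketch are assumptions:
# generate 3 completions of the prompt by sampling
sentences = gpt_j_generator("To be honest, robots will", do_sample=True, top_k=50, temperature=0.6, max_length=128, num_return_sequences=3)
for sentence in sentences:
    print(sentence["generated_text"])
    print("=" * 50)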
Output:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
To be honest, robots will never replace humans.
The reason is simple: We humans are far more complex than the average
robot.
The human brain is a marvel of complexity, capable of performing many
thousands of tasks simultaneously. There are over 100 billion neurons in the
human brain, with over 10,000 connections between each neuron, and
neurons are capable of firing over a million times per second.
We have a brain that can reason, learn, and remember things. We can learn
and retain information for a lifetime. We can communicate, collaborate, and
work together to achieve goals. We can learn languages, play instruments
==================================================
To be honest, robots will probably replace many human jobs.
They say that in the future we can expect to see more robots doing jobs that
are tedious, repetitive and dangerous.
The researchers also think that robots will become cheaper and more versatile
over time.
One company has already started a trial by offering a robot to take care of
your home.
But the
==================================================
To be honest, robots will never replace the human workforce. It’s not a
matter of if, but when. I can’t believe I’m writing this, but I’m glad I am.
Let’s start with what robots do. Robots are a form of technology. There’s a
difference between a technology and a machine. A machine is a physical
object designed to perform a specific task. A technology is a system of
machines.
For example, a machine can be used for a specific purpose, but it still
requires humans to operate it. A technology can be
==================================================
Since GPT-J and the other EleutherAI pre-trained models are trained on the Pile dataset, they can generate not only English prose but also code and other formats. Let's try to generate Python code:
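A sketch of the prompt (the generation arguments are assumptions; the low temperature of 0.05 is the value discussed below):
# prompt GPT-J with the beginning of a Python script listing African countries
print(gpt_j_generator(
"""
import os
# make a list of all african countries
""", max_length=96, top_k=15, temperature=0.05, do_sample=True)[0]["generated_text"])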
Output:
import os
# make a list of all african countries
african_countries = ['Algeria', 'Angola', 'Benin', 'Burkina Faso', 'Burundi',
'Cameroon', 'Cape Verde', 'Central African Republic', 'Chad', 'Comoros',
'Congo', 'Democratic Republic of Congo', 'Djibouti', 'Egypt', 'Equatorial
Guinea', 'Eritrea', 'Ethiopia', 'Gabon', 'Gambia', 'Ghana', 'Guinea', 'Guinea-
Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi', 'Mali',
'Mauritania', 'Mauritius', 'Morocco', 'Mozambique', 'Namibia', 'Niger',
'Nigeria', 'Rwanda', 'Sao Tome and Principe', 'Senegal', 'Sierra Leone',
'Somalia', 'South Africa', 'South Sudan', 'Sudan', 'Swaziland', 'Tanzania',
'Togo', 'Tunisia',
I definitely invite you to play around with the model and let me know in the comments if you find anything even more interesting.
Notice that I lowered the temperature to 0.05, as this is not really open-ended generation: I want the African countries, as well as the Python syntax, to be correct. Increasing the temperature in this type of generation led to misleading output.
Next, let's prompt it with some OpenCV code:
print(gpt_j_generator(
"""
import cv2
image = "image.png"
""", max_length=128, top_k=15, temperature=0.05, do_sample=True)[0]["generated_text"])  # generation arguments assumed; only the prompt is from the original

The generated continuation starts by echoing the prompt:
import cv2
image = "image.png"
I also prompted it with a bash snippet that updates the package index using the apt-get command, to see it continue with the commands for installing and starting Nginx.
# Java code!
print(gpt_j_generator(
"""
public class Test {
""", max_length=256, top_k=15, temperature=0.1, do_sample=True)[0]["generated_text"])  # generation arguments assumed; only the prompt is from the original
Extraordinarily, the model added complete Java code for generating Fibonacci numbers, followed by a couple of stray "A:" lines (likely an artifact of the Stack Exchange data in the Pile):
A:
A:
I executed the code before the weird "A:" lines; not only is it working code, but it generates the correct sequence!
Finally, Let's try generating LaTeX code:
# LATEX!
print(gpt_j_generator(
r"""
% list of Asian countries
\begin{enumerate}
""", max_length=128, top_k=15, temperature=0.1, do_sample=True)[0]
["generated_text"])
I tried to begin an ordered list in LaTeX, and before that, I added a comment indicating a list of Asian countries.
If you run the above code snippets, you'll definitely get different results than mine, as we're sampling from the token distribution by setting do_sample to True. Make sure you explore different decoding methods by checking the Huggingface blog post on the model.generate() method parameters.
SourceCode:
textgeneration_transformers.py
# -*- coding: utf-8 -*-
"""TextGeneration-Transformers-PythonCodeTutorial.ipynb"""
print(gpt_j_generator(
"""
import cv2
image = "image.png"
# Java code!
print(gpt_j_generator(
"""
public class Test {
CHAPTER 5: Speech Recognition using Transformers in Python
We'll be using torchaudio for loading audio files. Note that you need to install PyAudio if you're going to run the code in your local environment and PyDub if you're in a Colab environment. We are going to use them for recording from the microphone in Python.
Getting Started
Let's import our libraries:
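These are the installs and imports from the SourceCode section, along with the model name we'll load next (the commented-out line is the smaller checkpoint):
$ pip install transformers==4.11.2 datasets soundfile sentencepiece torchaudio pyaudio

from transformers import *
import torch
import soundfile as sf
# import librosa
import os
import torchaudio

# model_name = "facebook/wav2vec2-base-960h" # 360MB
model_name = "facebook/wav2vec2-large-960h-lv60-self" # 1.18GB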
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
There are two commonly used model architectures and weights for wav2vec2. wav2vec2-base-960h is the base architecture, about 360MB in size; it achieved a 3.4% Word Error Rate (WER) on the clean test set and was trained on 960 hours of LibriSpeech 16kHz sampled speech audio. The larger wav2vec2-large-960h-lv60-self checkpoint (about 1.18GB), which we load above, scores even better.
# audio_url = "https://github.com/x4nth055/pythoncode-
tutorials/raw/master/machine-learning/speech-recognition/16-122828-
0002.wav"
audio_url = "https://github.com/x4nth055/pythoncode-
tutorials/raw/master/machine-learning/speech-recognition/30-4447-
0004.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-
tutorials/raw/master/machine-learning/speech-recognition/7601-291468-
0006.wav"
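Loading it (this code is from the SourceCode section):
# load our wav file
speech, sr = torchaudio.load(audio_url)
speech = speech.squeeze()
# or using librosa
# speech, sr = librosa.load(audio_file, sr=16000)
sr, speech.shape

Output: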
(16000, torch.Size([274000]))
The torchaudio.load() function loads the audio file and returns it as a vector along with the sample rate; it also automatically downloads the file if it's a URL, and loads it directly if it's a path on disk. Note that we also use the squeeze() method to remove dimensions of size 1, i.e., converting the tensor from (1, 274000) to (274000,).
Next, we need to make sure the input audio file to the model has the sample
rate of 16000Hz because wav2vec2 is trained on that:
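# resample from whatever the audio sampling rate to 16000
resampler = torchaudio.transforms.Resample(sr, 16000)
speech = resampler(speech)
speech.shape

Output: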
torch.Size([274000])
Before we make the inference, we pass the audio vector to the wav2vec2
processor:
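# tokenize our wav
input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
input_values.shape

Output: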
torch.Size([1, 274000])
Performing Inference
Let's pass the vector into our model now:
# perform inference
logits = model(input_values)["logits"]
logits.shape
Decoding them back to text, we also lower the text, as it's in all caps:
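# use argmax to get the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
# decode the IDs to text
transcription = processor.decode(predicted_ids[0])
transcription.lower()

Output: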
and missus goddard three ladies almost always at the service of an invitation
from hartfield and who were fetched and carried home so often that mister
woodhouse thought it no hardship for either james or the horses had it taken
place only once a year it would have been a grievance
Wrapping up the Code
Let's wrap everything in a single function that takes an audio path and returns the transcription:
def get_transcription(audio_path):
    # load our wav file
    speech, sr = torchaudio.load(audio_path)
    speech = speech.squeeze()
    # or using librosa
    # speech, sr = librosa.load(audio_file, sr=16000)
    # resample from whatever the audio sampling rate to 16000
    resampler = torchaudio.transforms.Resample(sr, 16000)
    speech = resampler(speech)
    # tokenize our wav
    input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
    # perform inference
    logits = model(input_values)["logits"]
    # use argmax to get the predicted IDs
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode the IDs to text
    transcription = processor.decode(predicted_ids[0])
    return transcription.lower()
get_transcription("http://www0.cs.ucl.ac.uk/teaching/GZ05/samples/lathe.wav"
Note that there are other wav2vec2 weights trained by other people in languages other than English. Check the models page and filter by the language you want to find a suitable model.
SourceCode:
AutomaticSpeechRecognition_PythonCodeTutorial.py
# %%
# !pip install transformers==4.11.2 datasets soundfile sentencepiece torchaudio pyaudio
# %%
from transformers import *
import torch
import soundfile as sf
# import librosa
import os
import torchaudio
# %%
# model_name = "facebook/wav2vec2-base-960h" # 360MB
model_name = "facebook/wav2vec2-large-960h-lv60-self" # 1.18GB
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
# %%
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-
recognition/16-122828-0002.wav"
audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-
recognition/30-4447-0004.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-
recognition/7601-291468-0006.wav"
# %%
# load our wav file
speech, sr = torchaudio.load(audio_url)
speech = speech.squeeze()
# or using librosa
# speech, sr = librosa.load(audio_file, sr=16000)
sr, speech.shape
# %%
# resample from whatever the audio sampling rate to 16000
resampler = torchaudio.transforms.Resample(sr, 16000)
speech = resampler(speech)
speech.shape
# %%
# tokenize our wav
input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
input_values.shape
# %%
# perform inference
logits = model(input_values)["logits"]
logits.shape
# %%
# use argmax to get the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
predicted_ids.shape
# %%
# decode the IDs to text
transcription = processor.decode(predicted_ids[0])
transcription.lower()
# %%
def get_transcription(audio_path):
    # load our wav file
    speech, sr = torchaudio.load(audio_path)
    speech = speech.squeeze()
    # or using librosa
    # speech, sr = librosa.load(audio_file, sr=16000)
    # resample from whatever the audio sampling rate to 16000
    resampler = torchaudio.transforms.Resample(sr, 16000)
    speech = resampler(speech)
    # tokenize our wav
    input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
    # perform inference
    logits = model(input_values)["logits"]
    # use argmax to get the predicted IDs
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode the IDs to text
    transcription = processor.decode(predicted_ids[0])
    return transcription.lower()
# %%
get_transcription(audio_url)
# %%
import pyaudio
import wave
# %%
get_transcription("recorded.wav")
# %%
CHAPTER 6: Machine
Translation using Transformers in
Python
Learn how to use Huggingface transformer models to perform machine translation on various
languages using transformers and PyTorch libraries in Python.
This tutorial will teach you how to perform machine translation without any
training. In other words, we'll be using pre-trained models from Huggingface
transformer models.
The Helsinki-NLP models we will use are primarily trained on the OPUS
dataset, a collection of translated texts from the web; it is free online data.
You can either make a new empty Python notebook or file to get started, or follow along in your own environment. First, let's install the required libraries:
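A minimal install (exact versions are not pinned in this excerpt):
$ pip install transformers sentencepiece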
Importing transformers:
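We import everything from transformers, as in the other chapters, and pick the source and destination languages (English to German here, to match the example output below; both choices are assumptions you can change):
from transformers import *

# source & destination languages
src = "en"
dst = "de"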
task_name = f"translation_{src}_to_{dst}"
model_name = f"Helsinki-NLP/opus-mt-{src}-{dst}"
src and dst are the source and destination languages, respectively; feel free to change them for your needs. We dynamically build task_name and model_name from the source and destination languages and then initialize the pipeline by specifying the model and tokenizer arguments (a minimal sketch follows):
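# initialize the translation pipeline
translator = pipeline(task_name, model=model_name, tokenizer=model_name)
Let's test it out: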
translator("You're a genius.")[0]["translation_text"]
Output:
article = """
Albert Einstein ( 14 March 1879 – 18 April 1955) was a German-born
theoretical physicist, widely acknowledged to be one of the greatest
physicists of all time.
Einstein is best known for developing the theory of relativity, but he also
made important contributions to the development of the theory of quantum
mechanics.
Relativity and quantum mechanics are together the two pillars of modern
physics.
His mass–energy equivalence formula E = mc2, which arises from relativity
theory, has been dubbed "the world's most famous equation".
His work is also known for its influence on the philosophy of science.
He received the 1921 Nobel Prize in Physics "for his services to theoretical
physics, and especially for his discovery of the law of the photoelectric
effect", a pivotal step in the development of quantum theory.
His intellectual achievements and originality resulted in "Einstein" becoming
synonymous with "genius"
"""
translator(article)[0]["translation_text"]
Output:
Albert Einstein (* 14. März 1879 – 18. April 1955) war ein deutscher
theoretischer Physiker, der allgemein als einer der größten Physiker aller
Zeiten anerkannt wurde.
Einstein ist am besten für die Entwicklung der Relativitätstheorie bekannt,
aber er leistete auch wichtige Beiträge zur Entwicklung der
Quantenmechaniktheorie.
Relativität und Quantenmechanik sind zusammen die beiden Säulen der
modernen Physik.
Seine Massenenergieäquivalenzformel E = mc2, die aus der
Relativitätstheorie hervorgeht, wurde als „die berühmteste Gleichung der
Welt" bezeichnet.
Seine Arbeit ist auch für ihren Einfluss auf die Philosophie der Wissenschaft
bekannt.
Er erhielt 1921 den Nobelpreis für Physik „für seine Verdienste um die
theoretische Physik und vor allem für seine Entdeckung des Gesetzes über
den photoelektrischen Effekt", einen entscheidenden Schritt in der
Entwicklung der Quantentheorie.
Seine intellektuellen Leistungen und Originalität führten dazu, dass
„Einstein" zum Synonym für „Genius" wurde.
I have tested this output on Google Translate to get it back in English, and it
seems to be an excellent translation!
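Manually Loading the Model
Instead of the pipeline, we can load the tokenizer and model directly; a sketch using the Auto classes and the same Helsinki-NLP checkpoint:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)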
# encode the text into tensor of integers using the appropriate tokenizer
inputs = tokenizer.encode(article, return_tensors="pt", max_length=512, truncation=True)
print(inputs)
Output:
The tokenizer.encode() method encodes the text into tokens and converts them to IDs; we set return_tensors to "pt" so it returns a PyTorch tensor. We also set max_length to 512 and truncation to True.
Let's now use greedy search to generate the translation for this:
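A sketch of the greedy-search generation:
# generate the translation output using greedy search
greedy_outputs = model.generate(inputs)
# decode the output and ignore special tokens
print(tokenizer.decode(greedy_outputs[0], skip_special_tokens=True))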
We simply use the model.generate() method to get the outputs, and since the
outputs are also tokenized, we need to decode them back to human-readable
format. We also set skip_special_tokens to True so we don't see tokens such
as <pad> , etc. Here is the output:
You can also use beam search instead of greedy search, which may generate
better translations:
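A sketch of the beam-search variant:
# generate the translation output using beam search
beam_outputs = model.generate(inputs, num_beams=3)
# decode the output and ignore special tokens
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))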
We set num_beams to 3 . I suggest reading this blog post or our tutorials on text
summarization and conversational AI chatbots for more information about
beams. The output:
Conclusion
That's it for this tutorial! I suggest you try your own language pair and your own text, and experiment with the parameters of the model.generate() method to see what suits you best.
As stated above, there are a lot of parameters in the model.generate() method; most of them are explained in the Hugging Face blog post or in our tutorials on text summarization and conversational AI chatbots.
Also, there are 1300+ pre-trained models on the Helsinki-NLP page, so your
native language is definitely present there!
SourceCode:
machine_translation.py
# -*- coding: utf-8 -*-
"""MachineTranslation-with-Transformers-PythonCode.ipynb"""
task_name = f"translation_{src}_to_{dst}"
model_name = f"Helsinki-NLP/opus-mt-{src}-{dst}"
translator("You're a genius.")[0]["translation_text"]
article = """
Albert Einstein ( 14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely
acknowledged to be one of the greatest physicists of all time.
Einstein is best known for developing the theory of relativity, but he also made important contributions
to the development of the theory of quantum mechanics.
Relativity and quantum mechanics are together the two pillars of modern physics.
His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed
"the world's most famous equation".
His work is also known for its influence on the philosophy of science.
He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for
his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum
theory.
His intellectual achievements and originality resulted in "Einstein" becoming synonymous with
"genius"
"""
translator(article)[0]["translation_text"]
CHAPTER 7: Train BERT from Scratch using Transformers in Python
If you want to follow along, open up a new notebook or Python file and import the necessary libraries:
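A reasonable set of imports for this chapter (assumed; they cover dataset loading, tokenizer training, and masked-language-model pre-training):
import os
import json
from datasets import load_dataset
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast, BertConfig, BertForMaskedLM
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer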
Picking a Dataset
If you're willing to pre-train a transformer, then you most likely have a custom dataset. But for demonstration purposes in this tutorial, we're going to use the cc_news dataset via the Huggingface datasets library. If you have a custom dataset instead, follow the datasets documentation to get it loaded into the library.
CC-News dataset contains news articles from news sites all over the world. It
contains 708,241 news articles in English published between January 2017
and December 2019.
There is only one split in the dataset, so we need to split it into training and
testing sets:
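The loading and splitting cells aren't shown above; a sketch of them (the variable name d and the 10% test size are assumptions consistent with the output below):
# load the CC-News dataset (it only ships with a single "train" split)
dataset = load_dataset("cc_news", split="train")
# split it into 90% training and 10% testing sets
d = dataset.train_test_split(test_size=0.1)
d["train"], d["test"]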
You can also pass the seed parameter to the train_test_split() method so you get the same split across multiple runs.
Output:
(Dataset({
features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
num_rows: 637416
}), Dataset({
features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
num_rows: 70825
}))
for t in d["train"]["text"][:3]:
print(t)
print("="*50)
Output (stripped):
However, a better way to set up your custom dataset is to split your text file into several chunk files (using the split command or any other tool) and load them using load_dataset() as we did above, like this:
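A minimal sketch of loading plain-text chunk files; the file names here are just placeholders:
# load local text files as a dataset; each line becomes one sample
d = load_dataset("text", data_files={"train": ["train1.txt", "train2.txt"],
                                     "test": ["test1.txt"]})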
If you have your custom data as one massive file, the runtime will crash once it exceeds the available memory, so you should divide it into a handful of text files (for example with the split command on Linux or Colab) before loading them using the load_dataset() function.
# if you want to train the tokenizer from scratch (especially if you have custom
# dataset loaded as datasets object), then run this cell to save it as files
# but if you already have your custom data as text files, there is no point using this
def dataset_to_text(dataset, output_filename="data.txt"):
"""Utility function to save dataset text to disk,
useful for using the texts to train the tokenizer
(as the tokenizer accepts files)"""
with open(output_filename, "w") as f:
for t in dataset["text"]:
print(t, file=f)
The main purpose of the above code cell is to save the dataset object as text
files. If you already have your dataset as text files, then you should skip this
step. Next, let's define some parameters:
special_tokens = [
"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"
]
# if you want to train the tokenizer on both sets
# files = ["train.txt", "test.txt"]
# training the tokenizer on the training set
files = ["train.txt"]
# 30,522 vocab is BERT's default vocab size, feel free to tweak
vocab_size = 30_522
# maximum sequence length, lowering will result in faster training (when increasing batch size)
max_length = 512
# whether to truncate
truncate_longer_samples = False
The files list is the list of files to pass to the tokenizer for training. vocab_size is
the vocabulary size of tokens. max_length is the maximum sequence length.
model_path = "pretrained-bert"
The tokenizer.save_model() method saves the vocabulary file into that path. We also manually save some tokenizer configurations, such as the special tokens:
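The tokenizer-training and saving cells aren't reproduced above; a rough sketch of them, assuming the WordPiece tokenizer from the tokenizers library and the files, special_tokens, vocab_size, max_length, and model_path variables defined earlier:
# initialize and train the WordPiece tokenizer on our text files
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)
# enable truncation up to the maximum sequence length
tokenizer.enable_truncation(max_length=max_length)
# save the vocabulary (and a minimal tokenizer config) into the model folder
os.makedirs(model_path, exist_ok=True)
tokenizer.save_model(model_path)
with open(os.path.join(model_path, "config.json"), "w") as f:
    json.dump({"do_lower_case": True, "model_max_length": max_length}, f)
# reload it as a fast tokenizer compatible with the Trainer API
tokenizer = BertTokenizerFast.from_pretrained(model_path)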
Of course, if you want to use the tokenizer multiple times, you don't have to train it again; simply load it using the cell above.
def encode_with_truncation(examples):
"""Mapping function to tokenize the sentences passed with truncation"""
return tokenizer(examples["text"], truncation=True, padding="max_length",
max_length=max_length, return_special_tokens_mask=True)
def encode_without_truncation(examples):
"""Mapping function to tokenize the sentences passed without truncation"""
return tokenizer(examples["text"], return_special_tokens_mask=True)
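The cells that apply these mapping functions and define group_texts aren't reproduced here. A condensed sketch, following the pattern of the run_mlm.py example mentioned below (the remove_columns argument and the exact grouping logic are assumptions):
from itertools import chain

# tokenize both splits, dropping the raw columns so only token-level fields remain
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation
train_dataset = d["train"].map(encode, batched=True, remove_columns=d["train"].column_names)
test_dataset = d["test"].map(encode, batched=True, remove_columns=d["test"].column_names)

def group_texts(examples):
    """Concatenate all texts and split them into chunks of max_length tokens."""
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # drop the small remainder so every chunk has exactly max_length tokens
    total_length = (total_length // max_length) * max_length
    return {
        k: [t[i: i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated.items()
    }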
# Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a
# remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value
# might be slower to preprocess.
#
# To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
if not truncate_longer_samples:
train_dataset = train_dataset.map(group_texts, batched=True,
desc=f"Grouping texts in chunks of {max_length}")
test_dataset = test_dataset.map(group_texts, batched=True,
desc=f"Grouping texts in chunks of {max_length}")
# convert them from lists to torch tensors
train_dataset.set_format("torch")
test_dataset.set_format("torch")
Most of the above code was adapted from the run_mlm.py script in the Hugging Face Transformers examples, so this is what the library itself actually uses.
If you don't want to concatenate all texts and then split them into chunks of 512 tokens, make sure you set truncate_longer_samples to True, so each line is treated as an individual sample (truncated if it exceeds max_length tokens). If you set truncate_longer_samples to True, the above code cell won't be executed at all.
len(train_dataset), len(test_dataset)
Output:
(643843, 71357)
We initialize the model config using BertConfig , and pass the vocabulary size
as well as the maximum sequence length. We then pass the config
to BertForMaskedLM to initialize the model itself.
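The initialization cell isn't shown above; a minimal sketch consistent with that description:
# model configuration with our vocabulary size and maximum sequence length
model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
# initialize a fresh (untrained) BERT model for masked language modeling
model = BertForMaskedLM(config=model_config)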
Training
Before we start pre-training our model, we need a way to randomly mask
tokens in our dataset for the Masked Language Model (MLM) task.
Luckily, the library makes this easy for us by simply constructing
a DataCollatorForLanguageModeling object:
# initialize the data collator, randomly masking 20% (default is 15%) of the tokens
# for the Masked Language Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
We pass the tokenizer, set mlm to True , and set mlm_probability to 0.2 , so each token is randomly replaced by the [MASK] token with a probability of 20%.
training_args = TrainingArguments(
    output_dir=model_path,           # output directory where the model checkpoints are saved
    evaluation_strategy="steps",     # evaluate each `logging_steps` steps
    overwrite_output_dir=True,
    num_train_epochs=10,             # number of training epochs, feel free to tweak
    per_device_train_batch_size=10,  # the training batch size, put it as high as your GPU memory fits
    gradient_accumulation_steps=8,   # accumulate the gradients before updating the weights
    per_device_eval_batch_size=64,   # evaluation batch size
    logging_steps=1000,              # evaluate, log and save model checkpoints every 1000 steps
    save_steps=1000,
    # load_best_model_at_end=True,   # whether to load the best model (in terms of loss) at the end of training
    # save_total_limit=3,            # if you don't have much disk space, keep only the 3 latest checkpoints
)
We pass our training arguments to the Trainer , as well as the model, data
collator, and the training sets. We simply call train() now to start training:
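A sketch of those two steps, using the objects defined above:
# initialize the trainer and start pre-training
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()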
The training will take several hours to several days, depending on the dataset size, the training batch size (i.e., increase it as much as your GPU memory fits), and GPU speed.
As you can see in the output, the model is still improving and the validation loss is still decreasing. You usually have to cancel the training once the validation loss stops decreasing or decreases very slowly.
Since we have set logging_steps and save_steps to 1000, the trainer will evaluate and save the model after every 1000 steps (i.e., after training on steps x gradient_accumulation_steps x per_device_train_batch_size = 1000 x 8 x 10 = 80,000 samples). As a result, I canceled the training after about 19 hours, or 10,000 steps (that is about 1.27 epochs, or 800,000 samples seen), and started to use the model. In the next section, we'll see how we can use the model for inference.
If you're on Google Colab, then you have to save your checkpoints to Google Drive for later use; you can do that by setting model_path to a Drive path instead of a local path like we did here, just make sure you have enough space there. Alternatively, you can push your model and tokenizer to the Hugging Face Hub; check this useful guide to do it.
We use the simple pipeline API, and pass both the model and the tokenizer . Let's
predict some examples:
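A minimal sketch of that pipeline setup (the fill_mask name matches the prediction loop below):
from transformers import pipeline

# use the fill-mask pipeline with our freshly pre-trained model and tokenizer
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)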
# perform predictions
examples = [
"Today's most trending hashtags on [MASK] is Donald Trump",
"The [MASK] was cloudy yesterday, but today it's rainy.",
]
for example in examples:
for prediction in fill_mask(example):
print(f"{prediction['sequence']}, confidence: {prediction['score']}")
print("="*50)
Output:
That's impressive: I canceled the training early, and the model is still producing interesting results! If your model does not make good predictions, that's a good indicator that it wasn't trained enough.
Conclusion
And there you have it: complete code for pre-training BERT or other transformers using the Hugging Face libraries. Below are some tips:
As mentioned above, the training speed depends on the GPU speed, the number of samples in the dataset, and the batch size. I have set the training batch size to 10, as that's the maximum my GPU memory on Colab can fit. If you have more memory, make sure to increase it so you speed up training significantly.
During training, if you see the validation loss start to increase, make sure to remember the checkpoint with the lowest validation loss so you can load it later for use. You can also set load_best_model_at_end to True if you don't want to keep track of the loss yourself, as it will load the best weights (in terms of loss) when training ends.
The vocabulary size was chosen based on the original BERT configuration, which had a size of 30,522. Feel free to increase it if you feel the language of your dataset has a large vocabulary, or simply experiment with it.
If you set truncate_longer_samples to False, the code assumes you have longer text on a single line (i.e., per sample); you will notice that it takes much longer to process, especially if you set a large batch_size on the map() method. If processing takes many hours, you can either set truncate_longer_samples to True, so sentences exceeding max_length tokens get truncated, or save the dataset after processing using the save_to_disk() method, so you process it once and load it several times.
In a newer version of the transformers library, there is a parameter called auto_find_batch_size in the TrainingArguments() class; you can pass it as True so it finds the optimal batch size for your GPU, avoiding Out-of-Memory errors. Make sure you have the accelerate library installed: pip install accelerate.
SourceCode:
pretraining_bert.py
# -*- coding: utf-8 -*-
"""PretrainingBERT_PythonCodeTutorial.ipynb
for t in d["train"]["text"][:3]:
print(t)
print("="*50)
# if you want to train the tokenizer from scratch (especially if you have custom
# dataset loaded as datasets object), then run this cell to save it as files
# but if you already have your custom data as text files, there is no point using this
def dataset_to_text(dataset, output_filename="data.txt"):
"""Utility function to save dataset text to disk,
useful for using the texts to train the tokenizer
(as the tokenizer accepts files)"""
with open(output_filename, "w") as f:
for t in dataset["text"]:
print(t, file=f)
special_tokens = [
"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"
]
# if you want to train the tokenizer on both sets
# files = ["train.txt", "test.txt"]
# training the tokenizer on the training set
files = ["train.txt"]
# 30,522 vocab is BERT's default vocab size, feel free to tweak
vocab_size = 30_522
# maximum sequence length, lowering will result in faster training (when increasing batch size)
max_length = 512
# whether to truncate
truncate_longer_samples = False
def encode_with_truncation(examples):
"""Mapping function to tokenize the sentences passed with truncation"""
return tokenizer(examples["text"], truncation=True, padding="max_length",
max_length=max_length, return_special_tokens_mask=True)
def encode_without_truncation(examples):
"""Mapping function to tokenize the sentences passed without truncation"""
return tokenizer(examples["text"], return_special_tokens_mask=True)
if truncate_longer_samples:
# remove other columns and set input_ids and attention_mask as PyTorch tensors
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
# remove other columns, and keep them as Python lists
test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
# Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a
# remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value
# might be slower to preprocess.
#
# To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
if not truncate_longer_samples:
train_dataset = train_dataset.map(group_texts, batched=True,
desc=f"Grouping texts in chunks of {max_length}")
test_dataset = test_dataset.map(group_texts, batched=True,
desc=f"Grouping texts in chunks of {max_length}")
# convert them from lists to torch tensors
train_dataset.set_format("torch")
test_dataset.set_format("torch")
len(train_dataset), len(test_dataset)
# initialize the data collator, randomly masking 20% (default is 15%) of the tokens
# for the Masked Language Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
training_args = TrainingArguments(
    output_dir=model_path,           # output directory where the model checkpoints are saved
    evaluation_strategy="steps",     # evaluate each `logging_steps` steps
    overwrite_output_dir=True,
    num_train_epochs=10,             # number of training epochs, feel free to tweak
    per_device_train_batch_size=10,  # the training batch size, put it as high as your GPU memory fits
    gradient_accumulation_steps=8,   # accumulate the gradients before updating the weights
    per_device_eval_batch_size=64,   # evaluation batch size
    logging_steps=1000,              # evaluate, log and save model checkpoints every 1000 steps
    save_steps=1000,
    # load_best_model_at_end=True,   # whether to load the best model (in terms of loss) at the end of training
    # save_total_limit=3,            # if you don't have much disk space, keep only the 3 latest checkpoints
)
# perform predictions
example = "It is known that [MASK] is the capital of Germany"
for prediction in fill_mask(example):
print(prediction)
# perform predictions
examples = [
"Today's most trending hashtags on [MASK] is Donald Trump",
"The [MASK] was cloudy yesterday, but today it's rainy.",
]
for example in examples:
for prediction in fill_mask(example):
print(f"{prediction['sequence']}, confidence: {prediction['score']}")
print("="*50)
!nvidia-smi
CHAPTER 8: Conversational AI Chatbot with Transformers in Python
Learn how to use Huggingface transformers library to generate conversational responses with the
pretrained DialoGPT model in Python.
Chatbots have gained a lot of popularity in recent years. As interest in using chatbots for business has grown, researchers have also done a great job of advancing conversational AI chatbots.
In this tutorial, we'll use the Huggingface transformers library to employ the
pre-trained DialoGPT model for conversational response generation.
This tutorial is about text generation in chatbots and not regular text. If you
want open-ended generation, see this tutorial where I show you how to use
GPT-2 and GPT-J models to generate impressive text.
There are three versions of DialoGPT: small, medium, and large. Of course, the larger, the better, but if you run this on your machine, the small or medium version should fit your memory with no problems. I tried loading the large model, which takes about 5GB of my RAM. You can also use Google Colab to try out the large one.
Let's make code for chatting with our AI using greedy search:
You'll see that the model repeats a lot of responses, as these have the highest probability and greedy search picks them every time.
Learn also: How to Train BERT from Scratch using Transformers in Python.
Now, we set top_k to 100 to sample from the top 100 words sorted in descending order by probability. We also set temperature to 0.75 (the default is 1.0 ) to give a higher chance of picking high-probability words; setting the temperature to 0.0 is the same as greedy search, and setting it to infinity is the same as sampling completely at random.
Nucleus Sampling
Nucleus sampling, or top-p sampling, chooses from the smallest possible set of words whose cumulative probability exceeds the parameter p we set. We set top_k to 0 to disable top-k sampling, but you can use both methods together, which tends to work better. Here is a chat:
Now let's add some code to generate more than one chatbot response, and
then we choose which response to include in the next input:
# chatting 5 times with nucleus & top-k sampling & tweaking temperature & multiple
# sentences
for step in range(5):
    # take user input
    text = input(">> You:")
    # encode the input and add end of string token
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    # concatenate new user input with chat history (if there is)
    bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
    # generate a bot response
    chat_history_ids_list = model.generate(
        bot_input_ids,
        max_length=1000,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        temperature=0.75,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id
    )
    # print the outputs
    for i in range(len(chat_history_ids_list)):
        output = tokenizer.decode(chat_history_ids_list[i][bot_input_ids.shape[-1]:], skip_special_tokens=True)
        print(f"DialoGPT {i}: {output}")
    choice_index = int(input("Choose the response you want for the next input: "))
    chat_history_ids = torch.unsqueeze(chat_history_ids_list[choice_index], dim=0)
Conclusion
And there you go. I hope this tutorial helped you out on how to generate text
on DialoGPT and similar models. For more information on generating text, I
highly recommend you read the How to generate text with
Transformers guide.
I'll leave you to tweak the parameters and see if you can make the bot perform better.
Also, a great and exciting challenge for you is combining this with text-to-
speech and speech-to-text tutorials to build a virtual assistant like Alexa, Siri,
and Cortana!
SourceCode:
dialogpt.py
# -*- coding: utf-8 -*-
"""DialoGPT.ipynb
# model_name = "microsoft/DialoGPT-large"
model_name = "microsoft/DialoGPT-medium"
# model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print("====Greedy search chat====")
# chatting 5 times with greedy search
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
pad_token_id=tokenizer.eos_token_id,
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0],
skip_special_tokens=True)
print(f"DialoGPT: {output}")
print("====Beam search chat====")
# chatting 5 times with beam search
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
num_beams=3,
early_stopping=True,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0],
skip_special_tokens=True)
print(f"DialoGPT: {output}")
print("====Sampling chat====")
# chatting 5 times with sampling
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_k=0,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0],
skip_special_tokens=True)
print(f"DialoGPT: {output}")
print("====Sampling chat with tweaking temperature====")
# chatting 5 times with sampling & tweaking temperature
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_k=0,
temperature=0.75,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0],
skip_special_tokens=True)
print(f"DialoGPT: {output}")
print("====Top-K sampling chat with tweaking temperature====")
# chatting 5 times with Top K sampling & tweaking temperature
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_k=100,
temperature=0.75,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0],
skip_special_tokens=True)
print(f"DialoGPT: {output}")
print("====Nucleus sampling (top-p) chat with tweaking temperature====")
# chatting 5 times with nucleus sampling & tweaking temperature
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_p=0.95,
top_k=0,
temperature=0.75,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0],
skip_special_tokens=True)
print(f"DialoGPT: {output}")
print("====chatting 5 times with nucleus & top-k sampling & tweaking temperature & multiple
sentences====")
# chatting 5 times with nucleus & top-k sampling & tweaking temperature & multiple
# sentences
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids_list = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_p=0.95,
top_k=50,
temperature=0.75,
num_return_sequences=5,
pad_token_id=tokenizer.eos_token_id
)
#print the outputs
for i in range(len(chat_history_ids_list)):
output = tokenizer.decode(chat_history_ids_list[i][bot_input_ids.shape[-1]:],
skip_special_tokens=True)
print(f"DialoGPT {i}: {output}")
choice_index = int(input("Choose the response you want for the next input: "))
chat_history_ids = torch.unsqueeze(chat_history_ids_list[choice_index], dim=0)
CHAPTER 9: Fine Tune BERT for Text Classification using Transformers in Python
Learn how to use HuggingFace transformers library to fine tune BERT and other transformer models
for text classification task in Python.
Transformer models have been showing incredible results in most of the tasks
in the natural language processing field. The power of transfer learning
combined with large-scale transformer language models has become a
standard in state-of-the-art NLP.
One of the most significant milestones in the evolution of NLP is the release
of Google's BERT model in late 2018, which is known as the beginning of a
new era in NLP.
Please note that this tutorial is about fine-tuning the BERT model on a downstream task (such as text classification). If you want to train BERT from scratch, that's called pre-training; the Train BERT from Scratch tutorial in this book will definitely help you with that.
import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
Next, let's make a function to set seed so we'll have the same results in
different runs:
def set_seed(seed: int):
    """Helper function to set the seed in random, numpy, torch and tf (if installed),
    so different runs give the same results.

    Args:
        seed (:obj:`int`): The seed to set.
    """
random.seed(seed)
np.random.seed(seed)
if is_torch_available():
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# ^^ safe to call this function even if cuda is not available
if is_tf_available():
import tensorflow as tf
tf.random.set_seed(seed)
set_seed(1)
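The cell that picks the model and loads its tokenizer isn't reproduced here; a sketch consistent with the rest of the chapter (bert-base-uncased and max_length = 512 are implied by the saved model name and the inference script at the end of this chapter):
# the pre-trained weights we want to fine-tune and the maximum sequence length
model_name = "bert-base-uncased"
max_length = 512
# load the tokenizer of the uncased BERT model
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)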
We also set do_lower_case to True to make sure we lowercase all the text
(remember, we're using the uncased model).
def read_20newsgroups(test_size=0.2):
    # download & load 20newsgroups dataset from sklearn's repos
    dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
    documents = dataset.data
    labels = dataset.target
    # split into training & testing and return the data as well as the label names
    return train_test_split(documents, labels, test_size=test_size), dataset.target_names
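The cells that call this function and tokenize the splits aren't shown; a sketch (the variable names are assumptions consistent with the rest of the chapter):
# call the function above and unpack the splits and label names
(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()
# tokenize the datasets, truncating samples longer than max_length and padding shorter ones
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)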
The below code wraps our tokenized text data into a torch Dataset :
class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
Since we're going to use the Trainer class from the Transformers library, it expects our dataset to be a torch.utils.data.Dataset , so we made a simple class that implements the __len__() method, which returns the number of samples, and the __getitem__() method, which returns a data sample at a specific index.
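Wrapping the encodings into these datasets would then look like this (same assumed names as above):
# convert our tokenized data into torch Datasets
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)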
We also cast our model to our CUDA GPU. If you're on a CPU (not suggested), then just delete the to() method call.
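The model-loading line isn't reproduced in the body, but the inference script at the end of this chapter loads it the same way; as a sketch:
# load BERT with a classification head sized to our labels, and move it to the GPU
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to("cuda")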
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
# calculate accuracy using sklearn's function
acc = accuracy_score(labels, preds)
return {
'accuracy': acc,
}
You're free to include any metric you want, I've included accuracy, but you
can add precision, recall, etc.
The below code uses TrainingArguments class to specify our training arguments,
such as the number of epochs, batch size, and some other parameters:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=400,               # log & save weights each logging_steps
    save_steps=400,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)
You can also tweak other parameters, such as increasing the number of
epochs for better training.
I've set logging_steps and save_steps to 400, which means the trainer will evaluate and save the model every 400 steps. Make sure to increase this value if you decrease the batch size below 8, because otherwise it will save a lot of checkpoints every few steps and may fill up your environment's entire disk space.
trainer = Trainer(
    model=model,                      # the instantiated Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_dataset,      # training dataset
    eval_dataset=valid_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,  # the callback that computes metrics of interest
)
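Starting the fine-tuning is then a single call, exactly as in the full script at the end of this chapter:
# train the model
trainer.train()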
As you can see, the validation loss is gradually decreasing, and the accuracy
increased to over 77.8%.
Remember, we set load_best_model_at_end to True ; this will automatically load the best-performing model when training is finished. Let's verify that with the evaluate() method:
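The call is the same as in the full script below:
# evaluate the current model after training
trainer.evaluate()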
{'epoch': 3.0,
'eval_accuracy': 0.7758620689655172,
'eval_loss': 0.80070960521698}
Now that we trained our model, let's save it for inference later:
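As in the full script below:
# saving the fine-tuned model & tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)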
Performing Inference
Now we have a trained model on our dataset, let's try to have some fun with
it!
The below function takes a text as a string, tokenizes it with our tokenizer,
calculates the output probabilities using softmax function, and returns the
actual label:
def get_prediction(text):
# prepare our text into tokenized sequence
inputs = tokenizer(text, padding=True, truncation=True,
max_length=max_length, return_tensors="pt").to("cuda")
# perform inference to our model
outputs = model(**inputs)
# get output probabilities by doing softmax
probs = outputs[0].softmax(1)
# executing argmax function to get the candidate label
return target_names[probs.argmax()]
Here's an example:
# Example #1
text = """
The first thing is first.
If you purchase a Macbook, you should not encounter performance issues
that will prevent you from learning to code efficiently.
However, in the off chance that you have to deal with a slow computer, you
will need to make some adjustments.
Having too many background apps running in the background is one of the
most common causes.
The same can be said about a lack of drive storage.
For that, it helps if you uninstall xcode and other unnecessary applications, as
well as temporary system junk like caches and old backups.
"""
print(get_prediction(text))
Output:
comp.sys.mac.hardware
As expected, we're talking about Macbooks. Here's a second example:
# Example #2
text = """
A black hole is a place in space where gravity pulls so much that even light
can not get out.
The gravity is so strong because matter has been squeezed into a tiny space.
This can happen when a star is dying.
Because no light can get out, people can't see black holes.
They are invisible. Space telescopes with special tools can help find black
holes.
The special tools can see how stars that are very close to black holes act
differently than other stars.
"""
print(get_prediction(text))
Output:
sci.space
# Example #3
text = """
Coronavirus disease (COVID-19) is an infectious disease caused by a newly
discovered coronavirus.
Most people infected with the COVID-19 virus will experience mild to
moderate respiratory illness and recover without requiring special treatment.
Older people, and those with underlying medical problems like
cardiovascular disease, diabetes, chronic respiratory disease, and cancer are
more likely to develop serious illness.
"""
print(get_prediction(text))
Output:
sci.med
Conclusion
In this tutorial, you've learned how you can train the BERT model
using Huggingface Transformers library on your dataset.
Note that you can also use other transformer models, such as GPT-2 with GPT2ForSequenceClassification , RoBERTa with RobertaForSequenceClassification , DistilBERT with DistilBertForSequenceClassification , and much more. Please head to the official documentation for a list of available models.
Also, if your dataset is in a language other than English, make sure you pick the weights for your language; this will help a lot during training. Check this link and use the filter to get the model weights you need.
SourceCode:
train.py
# !pip install transformers
import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def set_seed(seed: int):
    """Helper function to set the seed in random, numpy, torch and tf (if installed),
    so different runs give the same results.

    Args:
        seed (:obj:`int`): The seed to set.
    """
random.seed(seed)
np.random.seed(seed)
if is_torch_available():
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# safe to call this function even if cuda is not available
if is_tf_available():
import tensorflow as tf
tf.random.set_seed(seed)
set_seed(1)
def read_20newsgroups(test_size=0.2):
# download & load 20newsgroups dataset from sklearn's repos
dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
documents = dataset.data
labels = dataset.target
# split into training & testing a return data as well as label names
return train_test_split(documents, labels, test_size=test_size), dataset.target_names
class NewsGroupsDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __len__(self):
return len(self.labels)
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
# calculate accuracy using sklearn's function
acc = accuracy_score(labels, preds)
return {
'accuracy': acc,
}
training_args = TrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total number of training epochs
per_device_train_batch_size=8, # batch size per device during training
per_device_eval_batch_size=20, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
logging_steps=200, # log & save weights each logging_steps
save_steps=200,
evaluation_strategy="steps", # evaluate each `logging_steps`
)
trainer = Trainer(
model=model, # the instantiated Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=valid_dataset, # evaluation dataset
compute_metrics=compute_metrics, # the callback that computes metrics of interest
)
# train the model
trainer.train()
# evaluate the current model after training
trainer.evaluate()
# saving the fine tuned model & tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
inference.py
from transformers import BertForSequenceClassification, BertTokenizerFast
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
model_path = "20newsgroups-bert-base-uncased"
max_length = 512
def read_20newsgroups(test_size=0.2):
dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
documents = dataset.data
labels = dataset.target
return train_test_split(documents, labels, test_size=test_size), dataset.target_names
model = BertForSequenceClassification.from_pretrained(model_path,
num_labels=len(target_names)).to("cuda")
tokenizer = BertTokenizerFast.from_pretrained(model_path)
def get_prediction(text):
# prepare our text into tokenized sequence
inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length,
return_tensors="pt").to("cuda")
# perform inference to our model
outputs = model(**inputs)
# get output probabilities by doing softmax
probs = outputs[0].softmax(1)
# executing argmax function to get the candidate label
return target_names[probs.argmax()]
# Example #1
text = """With the pace of smartphone evolution moving so fast, there's always something waiting in
the wings.
No sooner have you spied the latest handset, that there's anticipation for the next big thing.
Here we look at those phones that haven't yet launched, the upcoming phones for 2021.
We'll be updating this list on a regular basis, with those device rumours we think are credible and
exciting."""
print(get_prediction(text))
# Example #2
text = """
A black hole is a place in space where gravity pulls so much that even light can not get out.
The gravity is so strong because matter has been squeezed into a tiny space. This can happen when a
star is dying.
Because no light can get out, people can't see black holes.
They are invisible. Space telescopes with special tools can help find black holes.
The special tools can see how stars that are very close to black holes act differently than other stars.
"""
print(get_prediction(text))
CHAPTER 10: Perform Text Summarization using Transformers in Python
Learn how to use Huggingface transformers and PyTorch libraries to summarize long text, using
pipeline API and T5 transformer model in Python.
Text summarization is the task of shortening long pieces of text into a concise
summary that preserves key information content and overall meaning.
There are two approaches that are widely used for text summarization: extractive summarization, which selects the most important sentences from the original text, and abstractive summarization, which generates a new, shorter text that conveys the key information. The transformer models used in this tutorial take the abstractive approach.
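The pipeline cell and the first example article aren't reproduced here; a minimal sketch of that setup (original_text stands in for the long article used in the original example):
from transformers import pipeline

# load the summarization pipeline (downloads a default pretrained model on first use)
summarization = pipeline("summarization")
original_text = "..."  # paste the long article to summarize here
summary_text = summarization(original_text)[0]["summary_text"]
print("Summary:", summary_text)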
Note that the first time you execute this, it'll download the model architecture
and the weights and tokenizer configuration.
We specify the "summarization" task to the pipeline, and then we simply pass
our long text to it. Here is the output:
Summary: Paul Walker died in November 2013 after a car crash in Los
Angeles .
The late actor was one of the nicest guys in Hollywood .
The release of "Furious 7" on Friday offers a chance to grieve again .
There have been multiple tributes to Walker leading up to the film's release .
print("="*50)
# another example
original_text = """
For the first time in eight years, a TV legend returned to doing what he does
best.
Contestants told to "come on down!" on the April 1 edition of "The Price Is
Right" encountered not host Drew Carey but another familiar face in charge
of the proceedings.
Instead, there was Bob Barker, who hosted the TV game show for 35 years
before stepping down in 2007.
Looking spry at 91, Barker handled the first price-guessing game of the
show, the classic "Lucky Seven," before turning hosting duties over to Carey,
who finished up.
Despite being away from the show for most of the past eight years, Barker
didn't seem to miss a beat.
"""
summary_text = summarization(original_text)[0]['summary_text']
print("Summary:", summary_text)
Output:
==================================================
Summary: Bob Barker returns to "The Price Is Right" for the first time in
eight years .
The 91-year-old hosted the show for 35 years before stepping down in 2007 .
Drew Carey finished up hosting duties on the April 1 edition of the game
show .
Barker handled the first price-guessing game of the show .
As you can see, the model generated an entirely new summarized text that
does not belong to the original text.
This is the quickest way to use transformers. In the next section, we will learn
another way to perform text summarization and customize how we want to
generate the output.
Using T5 Model
The following code cell initializes the T5 transformer model along with its
tokenizer:
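A minimal sketch of that initialization, using the t5-base checkpoint mentioned below:
# initialize the T5 model and its tokenizer
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")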
The first time you execute the above code, it will download the t5-base model architecture, weights, tokenizer vocabulary, and configuration.
article = """
Justin Timberlake and Jessica Biel, welcome to parenthood.
The celebrity couple announced the arrival of their son, Silas Randall
Timberlake, in statements to People.
"Silas was the middle name of Timberlake's maternal grandfather Bill Bomar,
who died in 2012, while Randall is the musician's own middle name, as well
as his father's first," People reports.
The couple announced the pregnancy in January, with an Instagram post. It is
the first baby for both.
"""
Now let's encode this text to be suitable for the model as an input:
# encode the text into tensor of integers using the appropriate tokenizer
inputs = tokenizer.encode("summarize: " + article, return_tensors="pt",
max_length=512, truncation=True)
We set the max_length to 512, indicating that we do not want the original text to
bypass 512 tokens; we also set return_tensors to "pt" to get PyTorch tensors as
output.
Notice we prepended the text with "summarize: " text, and that's because T5 isn't
just for text summarization. You can use it for any text-to-text
transformation, such as machine translation or question answering, or
even paraphrasing.
For example, we can use the T5 transformer for machine translation, and you
can set "translate English to German: " instead of "summarize: " and you'll get a German
translation output (more precisely, you'll get a summarized German
translation, as you'll see why in model.generate() ). For more information about
translation, check this tutorial.
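The generation cell isn't reproduced here; as a sketch (the specific generation parameters are illustrative, not necessarily the ones used originally):
# generate the summary and decode it back to human-readable text
outputs = model.generate(inputs, max_length=150, min_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))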
Output:
the couple announced the pregnancy in January. it is the first baby for both of
them.
the baby is the middle name of Timberlake's maternal grandfather, who died
in 2012.
Excellent, the output looks concise and is newly generated with a new
summarizing style.
We then use the decode() method from the tokenizer to convert the tensor back to human-readable text.
Conclusion
There are a lot of other parameters to tweak in model.generate() method. I highly
encourage you to check this tutorial from the HuggingFace blog.
Alright, that's it for this tutorial. You've learned two ways to use
HuggingFace's transformers library to perform text summarization. Check out
the documentation here.
SourceCode:
using_pipeline.py
using_t5.py
from transformers import T5ForConditionalGeneration, T5Tokenizer
Sentences hold a lot of valuable information that can have a huge impact on a company's decision-making process, since sentiment analysis is a way to perform customer analytics, get to know your users better, and offer them better products in the future.
In this tutorial, we will learn how to extract the sentiment score (-1 for negative, 0 for neutral, and 1 for positive) from any given text using the vaderSentiment library.
The nice thing about this library is that you don't have to train anything in order to use it; you'll soon realize that it is pretty straightforward to use. Open up a new Python file and import the SentimentIntensityAnalyzer class:
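The import is the same as in the source file below; the analyzer variable name is an assumption:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# initialize the sentiment analyzer (no training required)
analyzer = SentimentIntensityAnalyzer()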
sentences = [
"This food is amazing and tasty !",
"Exoplanets are planets outside the solar system",
"This is sad to see such bad behavior"
]
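The scoring loop isn't reproduced here; a sketch that matches the output below (the compound score is VADER's normalized overall sentiment in [-1, 1]):
for sentence in sentences:
    # polarity_scores() returns neg/neu/pos scores plus a normalized compound score
    score = analyzer.polarity_scores(sentence)["compound"]
    print(f'The sentiment value of the sentence :"{sentence}" is : {score}')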
Output:
The sentiment value of the sentence :"This food is amazing and tasty !" is : 0.6239
The sentiment value of the sentence :"Exoplanets are planets outside the solar system" is : 0.0
The sentiment value of the sentence :"This is sad to see such bad behavior" is : -0.765
Conclusion
In this tutorial, you have learned how to extract a sentiment score from any text with the vaderSentiment library, without training any model yourself.
sentiment_analysis.py
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentences = [
"This food is amazing and tasty !",
"Exoplanets are planets outside the solar system",
"This is sad to see such bad behavior"
]
Google Translate is a free service that translates words, phrases, and entire web pages into more than 100 languages. You probably already know it and have used it many times in your life.
In this tutorial, you will learn how to perform language translation in Python using the googletrans library. Googletrans is a free and unlimited Python library that makes unofficial Ajax calls to the Google Translate API in order to detect languages and translate text.
Note that googletrans makes unofficial calls to the Google Translate API; if you want reliable use, consider using the official API or building your own machine translation model.
Translating Text
Importing necessary libraries:
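The imports and the translator setup are the same as in the source file at the end of this chapter:
from googletrans import Translator, constants
from pprint import pprint

# init the Google API translator
translator = Translator()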
The Translator() constructor accepts a few optional parameters:
service_urls: This should be a list of strings that are the URLs of the Google Translate API, for example ["translate.google.com", "translate.google.co.uk"].
user_agent: A string that will be included in the User-Agent header of the request.
proxies (dictionary): A Python dictionary that maps a protocol, or protocol and host, to the URL of the proxy, for example {'http': 'example.com:3128', 'http://domain.example': 'example.com:3555'}; more on proxies in this tutorial.
timeout: The timeout of each request you make, expressed in seconds.
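The translation call that produces the output below might look like this (the print format is inferred from that output):
# translate a Spanish text to English (the default destination language)
translation = translator.translate("Hola Mundo")
print(f"{translation.origin} ({translation.src}) --> {translation.text} ({translation.dest})")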
This will print the original text and language along with the translated text
and language:
Hola Mundo (es) --> Hello World (en)
Then you have to uninstall the current googletrans version and install the new
one using the following commands:
Going back to the code: it automatically detects the language and translates to English by default. Let's translate to another language, Arabic for instance:
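As a sketch, the dest parameter takes an ISO language code:
# translate a Spanish text to Arabic instead of the default English
translation = translator.translate("Hola Mundo", dest="ar")
print(f"{translation.origin} ({translation.src}) --> {translation.text} ({translation.dest})")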
Output:
You can also check other translations and some other extra data:
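The attribute that holds this information is extra_data; the dictionary below comes from translating a German phrase, so a sketch of the call would be:
# translate a German phrase and inspect the extra data returned by the API
translation = translator.translate("Wie gehts ?", src="de")
pprint(translation.extra_data)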
{'all-translations': [['interjection',
['How are you doing?', "What's up?"],
[['How are you doing?', ["Wie geht's?"]],
["What's up?", ["Wie geht's?"]]],
"Wie geht's?",
9]],
'confidence': 1.0,
'definitions': None,
'examples': None,
'language': [['de'], None, [1.0], ['de']],
'original-language': 'de',
'possible-mistakes': None,
'possible-translations': [['Wie gehts ?',
None,
[['How are you ?', 1000, True, False],
["How's it going ?", 1000, True, False],
['How are you?', 0, True, False]],
[[0, 11]],
'Wie gehts ?',
0,
0]],
'see-also': None,
'synonyms': None,
'translation': [['How are you ?', 'Wie gehts ?', None, None, 1]]}
That's a lot of data to benefit from: you have all the possible translations, the confidence, definitions, and even examples.
Language Detection
Google Translate API offers us language detection call as well:
# detect a language
detection = translator.detect("नमस्ते दुनिया")
print("Language code:", detection.lang)
print("Confidence:", detection.confidence)
This will print the code of the detected language along with confidence rate
(1.0 means 100% confident):
Language code: hi
Confidence: 1.0
This will return the language code; to get the full language name, you can use the LANGUAGES dictionary provided by googletrans:
print("Language:", constants.LANGUAGES[detection.lang])
Output:
Language: hindi
Supported Languages
As you may know, Google Translate supports more than 100 languages, let's
print all of them:
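A sketch of printing them from the constants module:
# print all languages supported by googletrans along with their codes
print("Total supported languages:", len(constants.LANGUAGES))
for code, language in constants.LANGUAGES.items():
    print(f"{code}: {language}")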
It also isn't guaranteed that the library will work properly at all times; if you want a stable API, you should use the official Google Translate API. If you get HTTP 5xx errors with this library, it means Google has banned your IP address; this can happen if you use the library heavily, since Google Translate may block your IP. You'll need to consider using proxies by passing a proxy dictionary to the proxies parameter of the Translator() class, or use the official API as discussed.
Also, I've written a quick Python script that will allow you to translate text
into sentences as well as in documents in the command line, check it here.
Finally, I encourage you to further explore the library, check out its official
documentation.
Finally, if you're a beginner and want to learn Python, I suggest you take
the Python For Everybody Coursera course, in which you'll learn a lot about
Python. You can also check our resources and courses page to see the Python
resources I recommend!
SourceCode:
translator.py
from googletrans import Translator, constants
from pprint import pprint
# init the Google API translator
translator = Translator()
# detect a language
detection = translator.detect("नमस्ते दुनिया")
print("Language code:", detection.lang)
print("Confidence:", detection.confidence)
# print the detected language
print("Language:", constants.LANGUAGES[detection.lang])
translate_doc.py
from googletrans import Translator
import argparse
import os
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Simple Python script to translate text using Google
Translate API (googletrans wrapper)")
parser.add_argument("target", help="Text/Document to translate")
parser.add_argument("-s", "--source", help="Source language, default is Google Translate's auto
detection", default="auto")
parser.add_argument("-d", "--destination", help="Destination language, default is English",
default="en")
args = parser.parse_args()
target = args.target
src = args.source
dest = args.destination
if os.path.isfile(target):
# translate a document instead
# get basename of file
basename = os.path.basename(target)
# get the path dir
dirname = os.path.dirname(target)
try:
filename, ext = basename.split(".")
except:
# no extension
filename = basename
ext = ""
Usage:
python translate_doc.py --help
Output:
positional arguments:
target Text/Document to translate
optional arguments:
-h, --help show this help message and exit
-s SOURCE, --source SOURCE
Source language, default is Google Translate's auto
detection
-d DESTINATION, --destination DESTINATION
Destination language, default is English
For instance, if you want to translate text in the document wonderland.txt from
english ( en ) to arabic ( ar ), you can use:
python translate_doc.py wonderland.txt --source en --destination ar
A new file wonderland_ar.txt will appear in the current directory that contains the
translated document.
Output:
'Hello'
CHAPTER 13: Perform Text Classification in Python using Tensorflow 2 and Keras
Building deep learning models (using embedding and recurrent layers) for different text classification
problems such as sentiment analysis or 20 news group classification using Tensorflow and Keras in
Python
In this tutorial, we will build a text classifier model using RNNs with TensorFlow in Python; we will be using the IMDB reviews dataset, which has 50K real-world movie reviews along with their sentiment (positive or negative). At the end of this tutorial, I will show you how you can integrate your own dataset so you can train the model on it.
Data Preparation
Before we load our dataset into Python, you need to download the
dataset here; you'll see two files there, reviews.txt , which contains a movie
review in each line, and labels.txt which holds its corresponding label.
labels = []
with open("data/labels.txt") as f:
    for label in f:
        label = label.strip()
        labels.append(label)
# tokenize the dataset corpus, delete uncommon words such as names, etc.
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
tokenizer.fit_on_texts(reviews)
X = tokenizer.texts_to_sequences(reviews)
X, y = np.array(X), np.array(labels)
# pad sequences with 0's
X = pad_sequences(X, maxlen=sequence_length)
# convert labels to one-hot encoded
y = to_categorical(y)
# split data to training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,
random_state=1)
data = {}
data["X_train"] = X_train
data["X_test"]= X_test
data["y_train"] = y_train
data["y_test"] = y_test
data["tokenizer"] = tokenizer
data["int2label"] = {0: "negative", 1: "positive"}
data["label2int"] = {"negative": 0, "positive": 1}
return data
More precisely, we will use pre-trained GloVe word vectors, which map each word to a vector of a specific size. This size parameter is often called the embedding size; GloVe provides vectors with embedding sizes of 50, 100, 200, and 300. We will try all of them in this tutorial and see which performs best. Also, two words with similar meanings tend to have very close vectors.
The second layer will be recurrent, you'll have the choice to choose any
recurrent cell you want, including LSTM, GRU, or even
just SimpleRNN, and again, we'll see which one outperforms the others.
The last layer should be a dense layer with N neurons. N should be the same
number of categories in your dataset. In the case of positive/negative
sentiment analysis, it should be 2.
Now we are going to need a function that creates the model from scratch, given the hyperparameters. I know there are a lot of parameters in this function; to make it easy to test various configurations, the function is flexible with respect to all the parameters provided. Let's explain them:
When you look closely, you'll notice that I'm using the Embedding class
with weights parameter. It specifies the pre-trained weights we just
downloaded, we're also setting trainable to False, so these vectors won't change
during the training process.
If your dataset is in a different language than English, make sure you find
embedding vectors for the language you're using, if not, you shouldn't set
weights parameter at all, and you need to set trainable to True, so you'll train
the parameters of the vector from scratch, check this page for word vectors of
your language.
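The create_model() function itself is only partially reproduced in utils.py below; a rough sketch of the embedding part described above, assuming an embedding_matrix built from the GloVe file and a word_index from the tokenizer (all of these names are assumptions):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential()
# pre-trained GloVe vectors used as fixed (non-trainable) embedding weights
model.add(Embedding(len(word_index) + 1,
                    embedding_size,
                    weights=[embedding_matrix],
                    trainable=False,
                    input_length=sequence_length))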
def get_model_name(dataset_name):
    # construct the unique model name
    model_name = f"{dataset_name}-{RNN_CELL.__name__}-seq-{SEQUENCE_LENGTH}-em-{EMBEDDING_SIZE}-w-{N_WORDS}-layers-{N_LAYERS}-units-{UNITS}-opt-{OPTIMIZER}-BS-{BATCH_SIZE}-d-{DROPOUT}"
    if IS_BIDIRECTIONAL:
        # add 'bid' str if bidirectional
        model_name = "bid-" + model_name
    if OOV_TOKEN:
        # add 'oov' str if OOV token is specified
        model_name += "-oov"
    return model_name
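The data-loading and training cells aren't reproduced here; a minimal sketch of the training step, assuming the data dictionary from the loading function above, the upper-case hyperparameters from parameters.py, and a TensorBoard callback (all of these names are assumptions):
# train the model and save its weights under the unique model name
model.fit(data["X_train"], data["y_train"],
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          validation_data=(data["X_test"], data["y_test"]),
          callbacks=[tensorboard],
          verbose=1)
model.save_weights(os.path.join("results", f"{model_name}.h5"))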
This will take several minutes to train. Here is my execution output after the
training is finished:
def get_predictions(text):
    sequence = data["tokenizer"].texts_to_sequences([text])
    # pad the sequences
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    return prediction, data["int2label"][np.argmax(prediction)]
Output:
The model is pretty sure it's a negative sentiment, with about 92% confidence. Let's be more challenging:
Output:
The model is about 61% sure that it's a positive sentiment; as you can see, it gives interesting results. Spend some time trying to trick the model!
Hyperparameter Tuning
Before I reached 90% accuracy, I experimented with various hyperparameters; here are some of the interesting ones:
These are 4 models, each with a different embedding size; as you can see, the one with 300-dimensional vectors (each word gets a vector of length 300) reached the lowest validation loss.
Here is another comparison where the sequence length is the varying parameter:
The model with a sequence length of 300 (the green one) tends to perform better.
Using TensorBoard, you can see that after epochs 4 to 6 the validation loss starts to increase again, which is clearly overfitting. That's why I set the number of epochs to 6. Try to tweak other parameters, such as the dropout rate, and see if you can decrease the loss further.
Alright, good luck implementing your own text classifier, if you have any
problems integrating one, post your comment below and I'll try to reach you
as soon as possible.
SourceCode:
parameters.py
from tensorflow.keras.layers import LSTM
def get_model_name(dataset_name):
# construct the unique model name
model_name = f"{dataset_name}-{RNN_CELL.__name__}-seq-{SEQUENCE_LENGTH}-em-
{EMBEDDING_SIZE}-w-{N_WORDS}-layers-{N_LAYERS}-units-{UNITS}-opt-{OPTIMIZER}-
BS-{BATCH_SIZE}-d-{DROPOUT}"
if IS_BIDIRECTIONAL:
# add 'bid' str if bidirectional
model_name = "bid-" + model_name
if OOV_TOKEN:
# add 'oov' str if OOV token is specified
model_name += "-oov"
return model_name
utils.py
from tqdm import tqdm
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Dropout, LSTM, Embedding, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
for i in range(n_layers):
if i == n_layers - 1:
# last layer
if bidirectional:
model.add(Bidirectional(cell(units, return_sequences=False)))
else:
model.add(cell(units, return_sequences=False))
else:
# first layer or hidden layers
if bidirectional:
model.add(Bidirectional(cell(units, return_sequences=True)))
else:
model.add(cell(units, return_sequences=True))
model.add(Dropout(dropout))
model.add(Dense(output_length, activation="softmax"))
# compile the model
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
return model
labels = []
with open("data/labels.txt") as f:
    for label in f:
        label = label.strip()
        labels.append(label)
# tokenize the dataset corpus, delete uncommon words such as names, etc.
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
tokenizer.fit_on_texts(reviews)
X = tokenizer.texts_to_sequences(reviews)
X, y = np.array(X), np.array(labels)
data = {}
data["X_train"] = X_train
data["X_test"] = X_test
data["y_train"] = y_train
data["y_test"] = y_test
data["tokenizer"] = tokenizer
data["int2label"] = {0: "negative", 1: "positive"}
data["label2int"] = {"negative": 0, "positive": 1}
return data
X, y = np.array(X), np.array(labels)
data = {}
data["X_train"] = X_train
data["X_test"]= X_test
data["y_train"] = y_train
data["y_test"] = y_test
data["tokenizer"] = tokenizer
return data
sentiment_analysis.py
from tensorflow.keras.callbacks import TensorBoard
import os
if not os.path.isdir("logs"):
os.mkdir("logs")
if not os.path.isdir("data"):
os.mkdir("data")
model.summary()
20_news_group_classification.py
from tensorflow.keras.callbacks import TensorBoard
import os
if not os.path.isdir("logs"):
os.mkdir("logs")
if not os.path.isdir("data"):
os.mkdir("data")
model.summary()
import pickle
import os
model.load_weights(os.path.join("results", f"{model_name}.h5"))
def get_predictions(text):
    sequence = data["tokenizer"].texts_to_sequences([text])
    # pad the sequences
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    print("output vector:", prediction)
    return data["int2label"][np.argmax(prediction)]
while True:
    text = input("Enter your text: ")
    prediction = get_predictions(text)
    print("="*50)
    print("The class is:", prediction)
CHAPTER 14: Build a Text Generator using TensorFlow 2 and Keras in Python
Building a deep learning model to generate human readable text using Recurrent Neural Networks
(RNNs) and LSTM with TensorFlow and Keras frameworks in Python.
Recurrent Neural Networks (RNNs) are very powerful sequence models for
classification problems. However, in this tutorial, we are going to do
something different: we will use RNNs as generative models, which means
they can learn the sequences of a problem and then generate entirely new
sequences for the problem domain.
After reading this tutorial, you will learn how to build an LSTM model that
can generate text (character by character) using TensorFlow and Keras in
Python.
Note that the ultimate goal of this tutorial is to use TensorFlow and Keras to
build LSTM models for text generation. If you want a better text generator,
see the earlier chapter that uses transformer models to generate text.
In text generation, we show the model many training examples so it can learn
a pattern between the input and output. Each input is a sequence of characters
and the output is the next single character. For instance, say we want to train
on the sentence "python is a great language": the input of the first sample
is "python is a great langua" and the output is "g". The second sample's
input is "ython is a great languag" and the output is "e", and so on,
until we have looped over the whole dataset. We need to show the model as
many examples as we can in order to get reasonable predictions.
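To make this windowing concrete, here is a tiny self-contained sketch (independent of the rest of the code) that builds such (input, target) character pairs from a sentence; the window size of 24 is chosen only for this toy example:
sentence = "python is a great language"
window = 24  # length of each input sequence in this toy example
samples = []
for i in range(len(sentence) - window):
    input_chars = sentence[i: i + window]   # 24-character input window
    target_char = sentence[i + window]      # the character that follows it
    samples.append((input_chars, target_char))
print(samples[0])  # ('python is a great langua', 'g')
print(samples[1])  # ('ython is a great languag', 'e')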
Getting Started
Let's install the required dependencies for this tutorial:
pip3 install tensorflow==2.0.1 numpy requests tqdm
Importing everything:
import tensorflow as tf
import numpy as np
import os
import pickle
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from string import punctuation
import requests
content = requests.get("http://www.gutenberg.org/cache/epub/11/pg11.txt").text
open("data/wonderland.txt", "w", encoding="utf-8").write(content)
Just make sure you have a folder called "data" in your current directory.
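If it doesn't exist yet, you can create it first with the os module imported above (this line is not part of the original script):
os.makedirs("data", exist_ok=True)  # create the data/ folder if it doesn't already exist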
Now let's define our parameters and try to clean this dataset:
sequence_length = 100
BATCH_SIZE = 128
EPOCHS = 30
# dataset file path
FILE_PATH = "data/wonderland.txt"
BASENAME = os.path.basename(FILE_PATH)
# read the data
text = open(FILE_PATH, encoding="utf-8").read()
# remove caps, comment this code if you want uppercase characters as well
text = text.lower()
# remove punctuation
text = text.translate(str.maketrans("", "", punctuation))
The above code reduces our vocabulary for better and faster training by
removing uppercase characters and punctuation. If you wish to keep commas,
periods, and colons, just define your own punctuation string variable.
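The statistics shown under "Output:" below were printed by code not included at this point; a minimal sketch that produces equivalent output (the names vocab and n_unique_chars are reused later in the chapter):
# print some stats about the cleaned dataset
n_chars = len(text)
vocab = ''.join(sorted(set(text)))
print("unique_chars:", vocab)
n_unique_chars = len(vocab)
print("Number of characters:", n_chars)
print("Number of unique characters:", n_unique_chars)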
Output:
unique_chars:
0123456789abcdefghijklmnopqrstuvwxyz
Number of characters: 154207
Number of unique characters: 39
Now that we have loaded and cleaned the dataset successfully, we need a way to
convert these characters into integers. There are a lot of Keras and Scikit-Learn
utilities out there for that, but we are going to do it manually in Python.
Since vocab contains all the unique characters of our dataset, we can build two
dictionaries that map each character to an integer and vice versa:
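The dictionary-building code is not shown here; a minimal sketch consistent with the char2int and int2char names used later:
# map each unique character to an integer and vice versa
char2int = {c: i for i, c in enumerate(vocab)}
int2char = {i: c for i, c in enumerate(vocab)}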
Let's save them to a file (to retrieve them later in text generation):
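The saving code is not included either; a sketch that matches the file names loaded later in generate.py:
# save these dictionaries for later use in text generation
pickle.dump(char2int, open(f"{BASENAME}-char2int.pickle", "wb"))
pickle.dump(int2char, open(f"{BASENAME}-int2char.pickle", "wb"))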
Now let's encode our dataset; in other words, we're going to convert each
character into its corresponding integer:
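The encoding step is not included in the excerpt; a minimal sketch, assuming we also wrap the result in a tf.data.Dataset named char_dataset as the following lines expect:
# convert every character to its integer id and build a tf.data.Dataset from it
encoded_text = np.array([char2int[c] for c in text])
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)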
Awesome, the char_dataset object now holds all the characters of the dataset encoded as integers; let's try to print the first few:
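The printing code is not shown; a minimal sketch that produces output in the same format:
# print the first 8 characters along with their integer encoding
for char in char_dataset.take(8):
    print(char.numpy(), int2char[char.numpy()])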
This takes the very first 8 characters and prints them out along with their
integer representation:
38
27 p
29 r
26 o
21 j
16 e
14 c
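The batching step that builds the sequences variable is not included at this point; a minimal sketch, using the 2*sequence_length + 1 window size discussed below:
# group the encoded characters into long windows; split_sample() below turns each
# window into several (input, target) pairs
sequences = char_dataset.batch(2 * sequence_length + 1, drop_remainder=True)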
# print sequences
for sequence in sequences.take(2):
    print(''.join([int2char[i] for i in sequence.numpy()]))
this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever you may it give it away or
Notice that I've converted the integer sequences back into normal text
using the int2char dictionary built earlier.
Now that you know how each sample is represented, let's prepare our inputs and
targets. We need a way to convert a single sample (a sequence of characters)
into multiple (input, target) samples. Fortunately, the flat_map() method is
exactly what we need; it takes a callback function that loops over all our data
samples:
def split_sample(sample):
    # example :
    # sequence_length is 10
    # sample is "python is a great pro" (21 length)
    # ds will equal to ('python is ', 'a') encoded as integers
    ds = tf.data.Dataset.from_tensors((sample[:sequence_length], sample[sequence_length]))
    for i in range(1, (len(sample)-1) // 2):
        # first (input_, target) will be ('ython is a', ' ')
        # second (input_, target) will be ('thon is a ', 'g')
        # third (input_, target) will be ('hon is a g', 'r')
        # and so on
        input_ = sample[i: i+sequence_length]
        target = sample[i+sequence_length]
        # extend the dataset with these samples by concatenate() method
        other_ds = tf.data.Dataset.from_tensors((input_, target))
        ds = ds.concatenate(other_ds)
    return ds
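The line that applies this callback to every sequence is not shown here; given the flat_map() description above, it would be:
# build the final dataset of (input, target) pairs
dataset = sequences.flat_map(split_sample)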
To get a good understanding of how the above code works, let's take an
example. Say we have a sequence length of 10 (too small in practice, but good
for explanation); the sample argument is then a sequence of 21 characters
(remember the 2*sequence_length + 1 batching) encoded as integers. For
convenience, let's imagine it isn't encoded and that it reads "python is a great pro".
The first data sample we are going to generate is the tuple ('python is ', 'a'),
the second is ('ython is a', ' '), the third is ('thon is a ', 'g'), and so on.
We do that for all samples; in the end, we'll have dramatically increased the
number of training samples. We used the ds.concatenate() method to add these
samples together.
After we constructed our samples, let's one-hot encode both the inputs and
the labels (targets):
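The one_hot_samples callback referenced on the next line is not defined in this excerpt; a minimal sketch, assuming n_unique_chars holds the vocabulary size:
def one_hot_samples(input_, target):
    # one-hot encode the input sequence and the target character
    return tf.one_hot(input_, n_unique_chars), tf.one_hot(target, n_unique_chars)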
dataset = dataset.map(one_hot_samples)
We've used the convenient map() method to one-hot encode each sample in our
dataset; tf.one_hot() does exactly what we expect. Let's try to print the
first two data samples along with their shapes:
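The printing loop is listed again in the full train.py at the end of the chapter; for reference:
# print first 2 samples
for element in dataset.take(2):
    print("Input:", ''.join([int2char[np.argmax(char_vector)] for char_vector in element[0].numpy()]))
    print("Target:", int2char[np.argmax(element[1].numpy())])
    print("Input shape:", element[0].shape)
    print("Target shape:", element[1].shape)
    print("="*50, "\n")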
So each input element has the shape (sequence length, vocabulary size); in this
case there are 39 unique characters and the sequence length is 100. The target
is a one-dimensional one-hot encoded vector.
Now let's build the model. The output layer is a fully connected layer with 39
units, where each neuron corresponds to a character (the probability of that
character occurring next).
model = Sequential([
    LSTM(256, input_shape=(sequence_length, n_unique_chars), return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dense(n_unique_chars, activation="softmax"),
])
We're using the Adam optimizer here; I suggest you experiment with different
optimizers.
After we've built our model, let's print the summary and compile it:
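These two calls appear in the full train.py listing at the end of the chapter; for reference:
model.summary()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])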
Next we train the model. We feed it the Dataset object that we prepared earlier,
and since the model object has no idea how many samples there are in the
dataset, we specify the steps_per_epoch parameter, which is set to the number
of training samples divided by the batch size.
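The training call itself is not reproduced at this point. Here is a minimal sketch; the repeat/shuffle/batch preparation, the shuffle buffer size, and using encoded_text to approximate the sample count are assumptions, while the results folder and weight file name match the full listing:
# make the dataset ready for training: repeat, shuffle, and batch
ds = dataset.repeat().shuffle(1024).batch(BATCH_SIZE, drop_remainder=True)
if not os.path.isdir("results"):
    os.mkdir("results")
# steps_per_epoch is roughly the number of training samples divided by the batch size
model.fit(ds, steps_per_epoch=(len(encoded_text) - sequence_length) // BATCH_SIZE, epochs=EPOCHS)
# save the trained weights under the name loaded later by generate.py
model.save_weights(f"results/{BASENAME}-{sequence_length}.h5")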
After running the above code, training should start, and the output will look
something like this:
Epoch 29/30
6473/6473 [==============================] - 486s 75ms/step - loss: 0.8728 - accuracy: 0.7509
Epoch 30/30
2576/6473 [==========>...................] - ETA: 4:56 - loss: 0.8063 - accuracy: 0.7678
After the training is over, a new file should appear in the results folder:
the trained model weights.
import numpy as np
import pickle
import tqdm
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Activation
import os
sequence_length = 100
# dataset file path
FILE_PATH = "data/wonderland.txt"
# FILE_PATH = "data/python_code.py"
BASENAME = os.path.basename(FILE_PATH)
We need some seed text to start generating from. This will depend on your
problem: you can take sentences from the training data, on which the model
will perform better, but I'll try to produce a new chapter of the book the
model was trained on:
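The seed string is not shown in this excerpt; any short phrase made of characters from the vocabulary works, for example (the exact text is an assumption):
# seed text to start the generation from
seed = "chapter xiii"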
Let's load the dictionaries that map each integer to a character and vice
versa, which we saved before in the data preparation phase:
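The loading code appears in the full generate.py listing at the end of the chapter; for reference:
# load vocab dictionaries saved during data preparation
char2int = pickle.load(open(f"{BASENAME}-char2int.pickle", "rb"))
int2char = pickle.load(open(f"{BASENAME}-int2char.pickle", "rb"))
vocab_size = len(char2int)
# the model must also be rebuilt exactly as in training and its weights loaded, e.g.:
# model.load_weights(f"results/{BASENAME}-{sequence_length}.h5")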
s = seed
n_chars = 400
# generate 400 characters
generated = ""
for i in tqdm.tqdm(range(n_chars), "Generating text"):
    # make the input sequence
    X = np.zeros((1, sequence_length, vocab_size))
    for t, char in enumerate(seed):
        X[0, (sequence_length - len(seed)) + t, char2int[char]] = 1
    # predict the next character
    predicted = model.predict(X, verbose=0)[0]
    # converting the vector to an integer
    next_index = np.argmax(predicted)
    # converting the integer to a character
    next_char = int2char[next_index]
    # add the character to results
    generated += next_char
    # shift seed and the predicted character
    seed = seed[1:] + next_char
print("Seed:", s)
print("Generated text:")
print(generated)
All we are doing here is starting with a seed text, constructing the input
sequence, and then predicting the next character. After that, we shift the input
sequence by removing the first character and adding the last character
predicted. This gives us a slightly changed sequence of inputs that still has a
length equal to the size of our sequence length.
We then feed this updated input sequence into the model to predict another
character. Repeating this process N times will generate a text
with N characters.
That is clearly English! But as you may notice, most of the sentences don't
make any sense; that is due to many reasons. One of the main reasons is that
the model was trained on only a small dataset. Also, the model architecture
isn't optimal; other state-of-the-art architectures (such as GPT-2 and BERT)
tend to outperform this one drastically.
Note, though, that this is not limited to English text; you can use whatever
type of text you want. In fact, you can even generate Python code once you
train the model on enough lines of code.
Conclusion
Great, we are done! You now know how to build and train a character-level LSTM text generator with TensorFlow and Keras.
SourceCode:
train.py
import tensorflow as tf
import numpy as np
import os
import pickle
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint
from string import punctuation
sequence_length = 100
BATCH_SIZE = 128
EPOCHS = 3
# dataset file path
FILE_PATH = "data/wonderland.txt"
# FILE_PATH = "data/python_code.py"
BASENAME = os.path.basename(FILE_PATH)
# print sequences
for sequence in sequences.take(2):
    print(''.join([int2char[i] for i in sequence.numpy()]))
def split_sample(sample):
    # example :
    # sequence_length is 10
    # sample is "python is a great pro" (21 length)
    # ds will equal to ('python is ', 'a') encoded as integers
    ds = tf.data.Dataset.from_tensors((sample[:sequence_length], sample[sequence_length]))
    for i in range(1, (len(sample)-1) // 2):
        # first (input_, target) will be ('ython is a', ' ')
        # second (input_, target) will be ('thon is a ', 'g')
        # third (input_, target) will be ('hon is a g', 'r')
        # and so on
        input_ = sample[i: i+sequence_length]
        target = sample[i+sequence_length]
        # extend the dataset with these samples by concatenate() method
        other_ds = tf.data.Dataset.from_tensors((input_, target))
        ds = ds.concatenate(other_ds)
    return ds
dataset = dataset.map(one_hot_samples)
# print first 2 samples
for element in dataset.take(2):
    print("Input:", ''.join([int2char[np.argmax(char_vector)] for char_vector in element[0].numpy()]))
    print("Target:", int2char[np.argmax(element[1].numpy())])
    print("Input shape:", element[0].shape)
    print("Target shape:", element[1].shape)
    print("="*50, "\n")
model.load_weights(f"results/{BASENAME}-{sequence_length}.h5")
model.summary()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
if not os.path.isdir("results"):
os.mkdir("results")
generate.py
import numpy as np
import pickle
import tqdm
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Activation
import os
sequence_length = 100
# dataset file path
FILE_PATH = "data/wonderland.txt"
# FILE_PATH = "data/python_code.py"
BASENAME = os.path.basename(FILE_PATH)
# load vocab dictionaries
char2int = pickle.load(open(f"{BASENAME}-char2int.pickle", "rb"))
int2char = pickle.load(open(f"{BASENAME}-int2char.pickle", "rb"))
sequence_length = 100
vocab_size = len(char2int)
Since we all have the problem of spam emails filling our inboxes, in this
tutorial we're going to build a model in Keras that can distinguish between
spam and legitimate emails.
Table of contents:
import time
import pickle
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# only use GPU memory that we need, not allocate all the GPU memory
tf.config.experimental.set_memory_growth(gpus[0], enable=True)
import tqdm
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.metrics import Recall, Precision
BATCH_SIZE = 64
EPOCHS = 10 # number of epochs
Don't worry if you are not sure what these parameters mean; we'll talk about
them later when we construct our model.
def load_data():
    """
    Loads SMS Spam Collection dataset
    """
    texts, labels = [], []
    with open("data/SMSSpamCollection") as f:
        for line in f:
            split = line.split()
            labels.append(split[0].strip())
            texts.append(' '.join(split[1:]).strip())
    return texts, labels
The dataset comes as a single file in which each line corresponds to a data
sample: the first word is the label and the rest is the actual message content.
That's why we grab the label as split[0] and the content as split[1:].
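The call that actually loads the data is not shown at this point; a minimal sketch, assuming the X and y names used below:
# X holds the raw texts, y holds the string labels ("ham"/"spam")
X, y = load_data()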
# Text tokenization
# vectorizing text, turning each text into sequence of integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
# lets dump it to a file, so we can use it in testing
pickle.dump(tokenizer, open("results/tokenizer.pickle", "wb"))
# convert to sequence of integers
X = tokenizer.texts_to_sequences(X)
In [4]: print(X[0])
[49, 472, 4436, 843, 756, 659, 64, 8, 1328, 87, 123, 352, 1329, 148, 2996,
1330, 67, 58, 4437, 144]
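The padding step (the missing cell between the two outputs) would look like this, using the SEQUENCE_LENGTH parameter defined in utils.py:
# pad all sequences to a fixed length of SEQUENCE_LENGTH
X = pad_sequences(X, maxlen=SEQUENCE_LENGTH)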
In [6]: print(X[0])
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 49 471 4435 842
755 658 64 8 1327 88 123 351 1328 148 2996 1329 67 58
4436 144]
Our labels are also text, but we're going to take a different approach here:
since the labels are only "spam" and "ham", we need to one-hot encode them:
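The encoding code is not included in this excerpt; a minimal sketch using to_categorical and a label mapping consistent with the printed output and with the int2label dictionary imported in test.py (the exact mapping is an assumption):
# convert string labels to integers, then one-hot encode them
label2int = {"ham": 0, "spam": 1}
int2label = {0: "ham", 1: "spam"}
y = [label2int[label] for label in y]
y = to_categorical(y)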
We used keras.utils.to_categorical() here, which does what its name suggests.
Let's print the first sample of the labels:
In [7]: print(y[0])
[1.0, 0.0]
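The train/test split that produced the counts below is not shown; a sketch using train_test_split with the TEST_SIZE ratio from utils.py (the random_state is an arbitrary choice):
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=7)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)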
Cell output:
As you can see, we have a total of 4180 training samples and 1394 validation
samples.
The first layer is the pre-trained GloVe embedding layer, the second layer is
a recurrent layer with LSTM units, and the output layer has 2 neurons, each
corresponding to "spam" or "ham", with a softmax activation function.
word_index = tokenizer.word_index
embedding_matrix = np.zeros((len(word_index)+1, dim))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        # words not found will be 0s
        embedding_matrix[i] = embedding_vector
return embedding_matrix
Note: In order to run this function properly, you need to download GloVe,
extract it, and put it in the "data" folder; we will use the 100-dimensional
vectors here.
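The part of the function that builds embedding_index (the word-to-GloVe-vector mapping used above) is not included in the excerpt; a minimal sketch, assuming the extracted file is data/glove.6B.100d.txt:
# read GloVe vectors into a dictionary: word -> 100-dimensional numpy vector
embedding_index = {}
with open("data/glove.6B.100d.txt", encoding="utf8") as f:
    for line in tqdm.tqdm(f, "Reading GloVe"):
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], dtype="float32")
        embedding_index[word] = vectors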
model.add(LSTM(lstm_units, recurrent_dropout=0.2))
model.add(Dropout(0.3))
model.add(Dense(2, activation="softmax"))
# compile with the rmsprop optimizer
# as well as precision and recall metrics
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy", Precision(), Recall()])
model.summary()
return model
The above function constructs the whole model: we loaded the pre-trained
embedding vectors into the Embedding layer and set trainable=False, which
freezes the embedding weights during training.
After the RNN layer, we added a 30% dropout rate; this randomly drops 30% of
the previous layer's outputs at each iteration, which helps reduce overfitting.
Note that accuracy alone isn't enough to determine whether the model is doing
well, because this dataset is unbalanced: only a small fraction of the samples
are spam. That's why we also use the precision and recall metrics.
Let's call the function:
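The call itself is not shown in the excerpt; based on the get_model name imported in test.py and the 128 LSTM units visible in the summary below, it presumably looks like this (the exact signature is an assumption):
# build the model, passing the fitted tokenizer so the embedding matrix can be constructed
model = get_model(tokenizer=tokenizer, lstm_units=128)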
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 100, 100)          901300
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 1,018,806
Trainable params: 117,506
Non-trainable params: 901,300
_________________________________________________________________
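The training call is not reproduced at this point; a sketch, assuming the ModelCheckpoint and TensorBoard callbacks imported earlier and hypothetical file names for the checkpoint and log directory:
# train the model, keeping the best weights and logging to TensorBoard
model_checkpoint = ModelCheckpoint("results/spam_classifier_{val_loss:.2f}.h5",
                                   save_best_only=True, verbose=1)
tensorboard = TensorBoard(f"logs/spam_classifier_{time.time()}")
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          batch_size=BATCH_SIZE, epochs=EPOCHS,
          callbacks=[tensorboard, model_checkpoint], verbose=1)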
Train on 4180 samples, validate on 1394 samples
Epoch 1/10
66/66 [==============================] - 86s 1s/step - loss: 0.2315 - accuracy: 0.8980 - precision: 0.8980 - recall: 0.8980 - val_loss: 0.1192 - val_accuracy: 0.9555 - val_precision: 0.9555 - val_recall: 0.9555
Epoch 10/10
66/66 [==============================] - 89s 1s/step - loss: 0.0216 - accuracy: 0.9932 - precision: 0.9932 - recall: 0.9932 - val_loss: 0.0546 - val_accuracy: 0.9842 - val_precision: 0.9842 - val_recall: 0.9842
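The evaluation code that produced the output below is not shown; a sketch consistent with the printed format:
# evaluate on the test set; result is [loss, accuracy, precision, recall]
result = model.evaluate(X_test, y_test)
loss, accuracy, precision, recall = result
print(f"[+] Accuracy: {accuracy*100:.2f}%")
print(f"[+] Precision: {precision*100:.2f}%")
print(f"[+] Recall: {recall*100:.2f}%")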
Output:
1394/1394 [==============================] - 1s 569us/step
[+] Accuracy: 98.21%
[+] Precision: 99.16%
[+] Recall: 98.75%
Here is what each metric means: accuracy is the fraction of all messages classified correctly, precision is the fraction of messages flagged as spam that are actually spam, and recall is the fraction of actual spam messages that were correctly flagged.
def get_predictions(text):
    sequence = tokenizer.texts_to_sequences([text])
    # pad the sequence
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    # one-hot encoded vector, revert using np.argmax
    return int2label[np.argmax(prediction)]
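The example messages that produced the outputs below are not shown; hypothetical calls might look like this (both texts are assumptions):
text = "Congratulations! You have won a free prize, claim it now by replying to this message"
print(get_predictions(text))
text = "Hi, are we still meeting for lunch tomorrow?"
print(get_predictions(text))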
Output:
spam
Output:
ham
You can also run TensorBoard to inspect the training logs:
tensorboard --logdir="logs"
SourceCode:
utils.py
import tqdm
import numpy as np
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.metrics import Recall, Precision
SEQUENCE_LENGTH = 100 # the length of all sequences (number of words per sample)
EMBEDDING_SIZE = 100 # Using 100-Dimensional GloVe embedding vectors
TEST_SIZE = 0.25 # ratio of testing set
BATCH_SIZE = 64
EPOCHS = 20 # number of epochs
model.add(LSTM(lstm_units, recurrent_dropout=0.2))
model.add(Dropout(0.3))
model.add(Dense(2, activation="softmax"))
# compile with the rmsprop optimizer
# as well as precision and recall metrics
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy", Precision(), Recall()])
model.summary()
return model
spam_classifier.py
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# only use GPU memory that we need, not allocate all the GPU memory
tf.config.experimental.set_memory_growth(gpus[0], enable=True)
def load_data():
    """
    Loads SMS Spam Collection dataset
    """
    texts, labels = [], []
    with open("data/SMSSpamCollection") as f:
        for line in f:
            split = line.split()
            labels.append(split[0].strip())
            texts.append(' '.join(split[1:]).strip())
    return texts, labels
# Text tokenization
# vectorizing text, turning each text into sequence of integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
# lets dump it to a file, so we can use it in testing
pickle.dump(tokenizer, open("results/tokenizer.pickle", "wb"))
test.py
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# only use GPU memory that we need, not allocate all the GPU memory
tf.config.experimental.set_memory_growth(gpus[0], enable=True)
from utils import get_model, int2label
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pickle
import numpy as np
SEQUENCE_LENGTH = 100
def get_predictions(text):
    sequence = tokenizer.texts_to_sequences([text])
    # pad the sequence
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    # one-hot encoded vector, revert using np.argmax
    return int2label[np.argmax(prediction)]
while True:
    text = input("Enter the mail:")
    # convert to sequences
    print(get_predictions(text))
Summary
This book is dedicated to the readers who take the time to write to me each day.
Every morning I am greeted by various emails: some with requests, a few with
complaints, and then there are the very few that just say thank you. All these
emails encourage and challenge me as an author, pushing me to better both my
books and myself.
Thank you!