DATA PREPROCESSING AND
CLEANING (TOKENIZATION)
Eng Mahmoud Yasser Hammam
TEXT PREPROCESSING FOR SOCIAL
MEDIA ANALYSIS
• Preprocessing is a fundamental step in text analytics and NLP tasks for handling unstructured data. Suitable preprocessing methods such as tokenization, stop-word removal, stemming, and lemmatization are applied to normalize the extracted data.
WHAT IS TOKENIZATION
• Unstructured text data, such as articles, social media posts, or emails, lacks a
predefined structure that machines can readily interpret. Tokenization bridges this
gap by breaking down the text into smaller units called tokens. These tokens can be
words, characters, or even subwords, depending on the chosen tokenization
strategy. By transforming unstructured text into a structured format, tokenization
lays the foundation for further analysis and processing.
WHY WE NEED TOKENIZATION
• One of the primary reasons for tokenization is to convert textual data into a numerical representation that can be processed by machine learning algorithms. With this numerical representation, we can train a model to perform various tasks, such as classification, sentiment analysis, or language generation.
• Tokens not only serve as numeric representations of text but can also be used as
features in machine learning pipelines. These features capture important linguistic
information and can trigger more complex decisions or behaviors. For example, in
text classification, the presence or absence of specific tokens can influence the
prediction of a particular class. Tokenization, therefore, plays a pivotal role in
extracting meaningful features and enabling effective machine learning models.
DIFFERENT STRATEGIES FOR
TOKENIZATION
• The simplest tokenization scheme is to feed each character individually to the
model. In Python, str objects are really arrays under the hood, which allows us to
quickly implement character-level tokenization with just one line of code:
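The slide's code is not included in the extracted text; a minimal sketch, assuming an example sentence chosen so that it has the 38 characters and 20 unique characters referenced later:

    text = "Tokenizing text is a core task of NLP."  # assumed example sentence
    tokenized_text = list(text)  # a str is a sequence, so list() splits it into characters
    print(tokenized_text)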
• Our model expects each character to be converted to an integer, a process sometimes called numericalization. One simple way to do this is to encode each unique token (in this case, each unique character) with a unique integer:
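Continuing the same sketch:

    # Map each unique character to a unique integer index
    token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
    print(token2idx)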
• This gives us a mapping from each character in our vocabulary to a unique integer.
We can now use token2idx to transform the tokenized text to a list of integers:
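Continuing the sketch:

    # Convert the character tokens into their integer IDs
    input_ids = [token2idx[token] for token in tokenized_text]
    print(input_ids)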
• Each token has now been mapped to a unique numerical identifier (hence the name
input_ids). The last step is to convert input_ids to a 2D tensor of one-hot vectors.
One-hot vectors are frequently used in machine learning to encode categorical data.
We can create the one-hot encodings in PyTorch by converting input_ids to a tensor
and applying the one_hot() function as follows:
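A sketch of that step in PyTorch, continuing the same example:

    import torch
    import torch.nn.functional as F

    input_ids = torch.tensor(input_ids)
    one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
    print(one_hot_encodings.shape)  # torch.Size([38, 20]) for the assumed sentence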
• For each of the 38 input tokens we now have a one-hot vector with 20 dimensions,
since our vocabulary consists of 20 unique characters.
• By examining the first vector, we can verify that a 1 appears in the location indicated
by input_ids[0]:
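For example, continuing the sketch:

    print(f"Token: {tokenized_text[0]}")
    print(f"Tensor index: {input_ids[0]}")
    print(f"One-hot: {one_hot_encodings[0]}")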
CHALLENGES OF CHARACTER
TOKENIZATION
• From our simple example we can see that character-level tokenization ignores any
structure in the text and treats the whole string as a stream of characters.
• Although this helps deal with misspellings and rare words, the main drawback is that
linguistic structures such as words need to be learned from the data. This requires
significant compute, memory, and data. For this reason, character tokenization is
rarely used in practice.
• Instead, some structure of the text is preserved during the tokenization step. Word
tokenization is a straightforward approach to achieve this, so let’s take a look at how
it works.
WORD TOKENIZATION
• Instead of splitting the text into characters, we can split it into words and map each
word to an integer. Using words from the outset enables the model to skip the step
of learning words from characters, and thereby reduces the complexity of the
training process.
• One simple class of word tokenizers uses whitespace to tokenize the text. We can do this by applying Python’s split() function directly to the raw text:
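For example, reusing the sentence from the character-level sketch:

    tokenized_text = text.split()  # whitespace tokenization
    print(tokenized_text)
    # ['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']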
CHALLENGES WITH WORD
TOKENIZATION
• 1. The current tokenization method doesn't account for punctuation, treating
phrases like "NLP." as single tokens. This oversight leads to a potentially inflated
vocabulary, particularly considering variations in word forms and possible
misspellings.
• 2. The large vocabulary size poses a challenge for neural networks due to the
substantial number of parameters required. For instance, if there are one million
unique words and the goal is to compress input vectors from one million
dimensions to one thousand dimensions in the first layer of the neural network, the
resulting weight matrix would contain about one billion weights. This is comparable
to the parameter count of the largest GPT-2 model, which has approximately 1.5
billion parameters in total.
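To make the arithmetic concrete, a back-of-the-envelope sketch (hypothetical sizes, not a real model):

    vocab_size = 1_000_000  # hypothetical number of unique words
    hidden_dim = 1_000      # target dimensionality of the first layer
    num_weights = vocab_size * hidden_dim
    print(f"{num_weights:,} weights")  # 1,000,000,000 — on the order of the largest GPT-2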
SUBWORD TOKENIZATION
• The basic idea behind subword tokenization is to combine the best aspects of
character and word tokenization.
• On the one hand, we want to split rare words into smaller units to allow the model
to deal with complex words and misspellings. On the other hand, we want to keep
frequent words as unique entities so that we can keep the length of our inputs to a
manageable size.
• There are several subword tokenization algorithms commonly used in NLP, but let’s start with WordPiece, which is used by the BERT and DistilBERT tokenizers. The easiest way to understand how WordPiece works is to see it in action.
• The Transformers library provides a convenient AutoTokenizer class that allows you to quickly load the tokenizer associated with a pretrained model; we just call its from_pretrained() method, providing the ID of a model on the Hugging Face Hub or a local file path.
SEEING THE TOKENIZER IN ACTION
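The code from these slides is not in the extracted text; a minimal sketch, assuming the distilbert-base-uncased checkpoint on the Hugging Face Hub:

    from transformers import AutoTokenizer

    # Load the WordPiece tokenizer that ships with DistilBERT
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    text = "Tokenizing text is a core task of NLP."
    encoded = tokenizer(text)
    print(encoded.input_ids)
    print(tokenizer.convert_ids_to_tokens(encoded.input_ids))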