DATA PREPROCESSING AND
CLEANING (TOKENIZATION)
Eng Mahmoud Yasser Hammam
TEXT PREPROCESSING FOR SOCIAL
MEDIA ANALYSIS
• Preprocessing is a fundamental step in text analytics and NLP tasks for handling unstructured data. Suitable preprocessing methods such as tokenization, stop-word removal, stemming, and lemmatization are applied to normalize the extracted data.
WHAT IS TOKENIZATION
• Unstructured text data, such as articles, social media posts, or emails, lacks a
predefined structure that machines can readily interpret. Tokenization bridges this
gap by breaking down the text into smaller units called tokens. These tokens can be
words, characters, or even subwords, depending on the chosen tokenization
strategy. By transforming unstructured text into a structured format, tokenization
lays the foundation for further analysis and processing.
WHY WE NEED TOKENIZATION
• One of the primary reasons for tokenization is to convert textual data into a numerical representation that can be processed by machine learning algorithms. With this numerical representation, we can train a model to perform various tasks, such as classification, sentiment analysis, or language generation.
• Tokens not only serve as numeric representations of text but can also be used as
features in machine learning pipelines. These features capture important linguistic
information and can trigger more complex decisions or behaviors. For example, in
text classification, the presence or absence of specific tokens can influence the
prediction of a particular class. Tokenization, therefore, plays a pivotal role in
extracting meaningful features and enabling effective machine learning models.
DIFFERENT STRATEGIES FOR
TOKENIZATION
• The simplest tokenization scheme is to feed each character individually to the
model. In Python, str objects are really arrays under the hood, which allows us to
quickly implement character-level tokenization with just one line of code:
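The slide's code is not included in the extracted text; a minimal sketch, assuming an example sentence chosen so that it has the 38 characters and 20 unique characters referenced later:

    text = "Tokenizing text is a core task of NLP."  # assumed example sentence
    tokenized_text = list(text)  # a str is a sequence, so list() splits it into characters
    print(tokenized_text)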
• Our model expects each character to be converted to an integer, a process sometimes called numericalization. One simple way to do this is to encode each unique token (in this case, each unique character) with a unique integer:
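Continuing the same sketch:

    # Map each unique character to a unique integer index
    token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
    print(token2idx)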
• This gives us a mapping from each character in our vocabulary to a unique integer.
We can now use token2idx to transform the tokenized text to a list of integers:
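Continuing the sketch:

    # Convert the character tokens into their integer IDs
    input_ids = [token2idx[token] for token in tokenized_text]
    print(input_ids)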
• Each token has now been mapped to a unique numerical identifier (hence the name
input_ids). The last step is to convert input_ids to a 2D tensor of one-hot vectors.
One-hot vectors are frequently used in machine learning to encode categorical data.
We can create the one-hot encodings in PyTorch by converting input_ids to a tensor
and applying the one_hot() function as follows:
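A sketch of that step in PyTorch, continuing the same example:

    import torch
    import torch.nn.functional as F

    input_ids = torch.tensor(input_ids)
    one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
    print(one_hot_encodings.shape)  # torch.Size([38, 20]) for the assumed sentence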
• For each of the 38 input tokens we now have a one-hot vector with 20 dimensions,
since our vocabulary consists of 20 unique characters.
• By examining the first vector, we can verify that a 1 appears in the location indicated
by input_ids[0]:
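For example, continuing the sketch:

    print(f"Token: {tokenized_text[0]}")
    print(f"Tensor index: {input_ids[0]}")
    print(f"One-hot: {one_hot_encodings[0]}")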
CHALLENGES OF CHARACTER
TOKENIZATION
• From our simple example we can see that character-level tokenization ignores any
structure in the text and treats the whole string as a stream of characters.
• Although this helps deal with misspellings and rare words, the main drawback is that
linguistic structures such as words need to be learned from the data. This requires
significant compute, memory, and data. For this reason, character tokenization is
rarely used in practice.
• Instead, some structure of the text is preserved during the tokenization step. Word
tokenization is a straightforward approach to achieve this, so let’s take a look at how
it works.
WORD TOKENIZATION
• Instead of splitting the text into characters, we can split it into words and map each
word to an integer. Using words from the outset enables the model to skip the step
of learning words from characters, and thereby reduces the complexity of the
training process.
• One simple class of word tokenizers uses whitespace to tokenize the text. We can do this by applying Python’s split() function directly to the raw text:
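For example, reusing the sentence from the character-level sketch:

    tokenized_text = text.split()  # whitespace tokenization
    print(tokenized_text)
    # ['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']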
CHALLENGES WITH WORD
TOKENIZATION
• 1. The current tokenization method doesn't account for punctuation, treating
phrases like "NLP." as single tokens. This oversight leads to a potentially inflated
vocabulary, particularly considering variations in word forms and possible
misspellings.
• 2. The large vocabulary size poses a challenge for neural networks due to the
substantial number of parameters required. For instance, if there are one million
unique words and the goal is to compress input vectors from one million
dimensions to one thousand dimensions in the first layer of the neural network, the
resulting weight matrix would contain about one billion weights. This is comparable
to the parameter count of the largest GPT-2 model, which has approximately 1.5
billion parameters in total.
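To make the arithmetic concrete, a back-of-the-envelope sketch (hypothetical sizes, not a real model):

    vocab_size = 1_000_000  # hypothetical number of unique words
    hidden_dim = 1_000      # target dimensionality of the first layer
    num_weights = vocab_size * hidden_dim
    print(f"{num_weights:,} weights")  # 1,000,000,000 — on the order of the largest GPT-2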
SUBWORD TOKENIZATION
• The basic idea behind subword tokenization is to combine the best aspects of
character and word tokenization.
• On the one hand, we want to split rare words into smaller units to allow the model
to deal with complex words and misspellings. On the other hand, we want to keep
frequent words as unique entities so that we can keep the length of our inputs to a
manageable size.
• There are several subword tokenization algorithms commonly used in NLP, but let’s start with WordPiece, which is used by the BERT and DistilBERT tokenizers. The easiest way to understand how WordPiece works is to see it in action.
• The Transformers library provides a convenient AutoTokenizer class that allows you to quickly load the tokenizer associated with a pretrained model; we just call its from_pretrained() method, providing the ID of a model on the Hugging Face Hub or a local file path.
SEEING THE TOKENIZER IN ACTION
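The code from these slides is not in the extracted text; a minimal sketch, assuming the distilbert-base-uncased checkpoint on the Hugging Face Hub:

    from transformers import AutoTokenizer

    # Load the WordPiece tokenizer that ships with DistilBERT
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    text = "Tokenizing text is a core task of NLP."
    encoded = tokenizer(text)
    print(encoded.input_ids)
    print(tokenizer.convert_ids_to_tokens(encoded.input_ids))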