Dictionary Based Tokenization in NLP
Natural Language Processing (NLP) is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization, which is the process of splitting text into smaller meaningful units, known as tokens. Dictionary-based tokenization is a common method used in NLP to segment text into tokens based on a pre-defined dictionary.
Dictionary-based tokenization splits a text into individual tokens using a predefined dictionary of multi-word expressions. This is useful when standard word tokenization is not sufficient, for example in sentiment analysis or named entity recognition, where multi-word expressions need to be treated as a single token.
Here, a dictionary is a list of words, phrases, and other linguistic constructions, possibly along with their definitions, parts of speech, and other relevant data. During dictionary-based tokenization, each word in the text is compared against the entries in the dictionary, and the text is split into tokens based on the matches found. By creating a custom dictionary, we can tokenize names and phrases as single units.
A token in natural language processing is a sequence of characters that represents a single unit of meaning. Words, phrases, numbers, and punctuation marks can all serve as tokens. Many NLP tasks, including text classification, sentiment analysis, machine translation, and named entity recognition, depend on tokenization.
The dictionary-based tokenization process can be carried out with several methods, including rule-based tokenization, machine-learning-based tokenization, and hybrid tokenization. Rule-based tokenization splits the text into tokens according to surface features such as punctuation, capitalization, and spacing. Machine-learning-based tokenization trains a model on a set of training data to split text into tokens. Hybrid tokenization combines rule-based and machine-learning-based methods to improve accuracy and efficiency.
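As a quick illustration of the rule-based approach, the sketch below uses a single regular-expression rule (one possible rule set chosen for this illustration, not part of the dictionary-based pipeline built later) that treats runs of word characters and individual punctuation marks as tokens:
Python3
import re

def rule_based_tokenize(text):
    # Rule: a token is either a run of word characters or a single
    # punctuation mark; whitespace only separates tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(rule_based_tokenize("Jammu Kashmir is an integral part of India."))
# ['Jammu', 'Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.']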
Steps needed for implementing Dictionary-based tokenization:
- Step 1: Collect a dictionary of words and their corresponding parts of speech. The dictionary can be created manually or obtained from a pre-existing source such as WordNet or Wikipedia.
- Step 2: Preprocess the text by removing any noise such as punctuation marks, stop words, and HTML tags.
- Step 3: Tokenize the text into words using a whitespace tokenizer or a sentence tokenizer.
- Step 4: Identify the parts of speech of each word in the text using a part-of-speech tagger such as the Stanford POS Tagger.
- Step 5: Segment the text into tokens by comparing each word in the text with the entries in the dictionary. If a match is found, the corresponding dictionary entry is used as a single token; otherwise, the word is split into smaller sub-tokens based on its part of speech. A minimal sketch of this matching step is shown right after this list.
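The matching logic of Step 5 can be sketched in plain Python before turning to NLTK. The helper below, dictionary_tokenize, is a hypothetical function written only for this illustration; it performs a greedy longest match of dictionary entries over an already word-tokenized list:
Python3
def dictionary_tokenize(words, dictionary, separator=' '):
    # Greedy longest match: at each position try the longest dictionary
    # entry first; if nothing matches, keep the single word as a token.
    entries = {tuple(entry) for entry in dictionary}
    max_len = max((len(entry) for entry in entries), default=1)
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            if tuple(words[i:i + n]) in entries:
                tokens.append(separator.join(words[i:i + n]))
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(dictionary_tokenize(['He', 'is', 'from', 'Himachal', 'Pradesh', '.'],
                          [('Himachal', 'Pradesh')]))
# ['He', 'is', 'from', 'Himachal Pradesh', '.']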
For example, consider the following sentence:
Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh.
The steps involved in the dictionary-based tokenization of this sentence are as follows:
Step 1: Import the necessary libraries
Python3
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer
Step 2: Create a custom dictionary of names and phrases
Collect the multi-word expressions, such as names or phrases, that should be treated as single tokens. Let the dictionary contain the following entries.
Python3
dictionary = [("Jammu", "Kashmir"),
              ("Pawan", "Kumar", "Gunjan"),
              ("Himachal", "Pradesh")]
Step 3: Create an instance of MWETokenizer with the dictionary
Python3
Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ')
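If more phrases turn up later, the tokenizer does not have to be rebuilt: MWETokenizer also provides an add_mwe method that registers one additional multi-word expression at a time. The phrase below is a hypothetical extra entry, not used in the rest of this example.
Python3
# Register an additional multi-word expression after construction.
Dictionary_tokenizer.add_mwe(("Uttar", "Pradesh"))  # hypothetical extra entry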
Step 4: Create a text dataset and tokenize with word_tokenize
Python3
text = """Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh."""
tokens = word_tokenize(text)
tokens
Output:
['Jammu',
'Kashmir',
'is',
'an',
'integral',
'part',
'of',
'India',
'.',
'My',
'name',
'is',
'Pawan',
'Kumar',
'Gunjan',
'.',
'He',
'is',
'from',
'Himachal',
'Pradesh',
'.']
Step 5: Apply dictionary-based tokenization with Dictionary_tokenizer
Python3
dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)
dictionary_based_token
Output:
['Jammu Kashmir',
'is',
'an',
'integral',
'part',
'of',
'India',
'.',
'My',
'name',
'is',
'Pawan Kumar Gunjan',
'.',
'He',
'is',
'from',
'Himachal Pradesh',
'.']
We can easily observe the difference between general word tokenization and dictionary-based tokenization. This is useful when we know the phrases or joint words present in the text document and want to assign these joint words as single tokens.
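Note that the space separator was an explicit choice above. By default, MWETokenizer joins matched words with an underscore, which keeps each merged token free of whitespace; a quick sketch of that default behaviour on the same data:
Python3
underscore_tokenizer = MWETokenizer(dictionary)  # default separator is '_'
print(underscore_tokenizer.tokenize(tokens))
# Expected: ['Jammu_Kashmir', 'is', ..., 'Pawan_Kumar_Gunjan', ..., 'Himachal_Pradesh', '.']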
Full code implementation
Python3
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

dictionary = [("Jammu", "Kashmir"),
              ("Pawan", "Kumar", "Gunjan"),
              ("Himachal", "Pradesh")]

Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ')

text = """Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh."""

tokens = word_tokenize(text)
print('General Word Tokenization\n', tokens)

dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)
print('Dictionary based tokenization\n', dictionary_based_token)
Output:
General Word Tokenization
['Jammu', 'Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan', 'Kumar', 'Gunjan', '.', 'He', 'is', 'from', 'Himachal', 'Pradesh', '.']
Dictionary based tokenization
['Jammu Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan Kumar Gunjan', '.', 'He', 'is', 'from', 'Himachal Pradesh', '.']