NLP sequencing is the process of converting sentences into sequences of numeric tokens that a neural network can consume. A tokenizer builds a vocabulary (word index) from a training corpus and then encodes each sentence as the sequence of its words' indices.
Example:
sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4,
'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
If the test set contains words the tokenizer has never seen before, or we need to predict a missing word in a sentence, we can map such words to a simple placeholder token.
Let the test set be:
test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]
We then refit the tokenizer with an additional placeholder token for words it has not seen before. The placeholder is assigned index 1 by default, so every other word's index shifts up by one:
Word Index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4, 'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Since the words 'really' and 'like' were not encountered during fitting, each is replaced by the placeholder token, which has index 1. The test sequences therefore become:
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
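As a quick sanity check, the fitted tokenizer can translate sequences back into text with sequences_to_texts, which makes the placeholder substitutions visible. A minimal, self-contained sketch using the same training and test sentences as above:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=100, oov_token="placeholder")
tokenizer.fit_on_texts(['I love geeksforgeeks',
                        'You love geeksforgeeks',
                        'What do you think about geeksforgeeks?'])

test_seq = tokenizer.texts_to_sequences(['i really love geeksforgeeks',
                                         'Do you like geeksforgeeks'])
# map the token IDs back to words to see where the placeholder landed
print(tokenizer.sequences_to_texts(test_seq))
# ['i placeholder love geeksforgeeks', 'do you placeholder geeksforgeeks']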
Code: Implementation with TensorFlow
Python3
# import the Tokenizer used to build the word index
from tensorflow.keras.preprocessing.text import Tokenizer

# the initial corpus of sentences, i.e. the training set
sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]

tokenizer = Tokenizer(num_words=100)
# fitting lowercases the text and strips punctuation
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("Word Index: ", word_index)
print("Sequences: ", sequences)

# refit with an out-of-vocabulary token named 'placeholder';
# it is assigned index 1, shifting every other index up by one
tokenizer = Tokenizer(num_words=100, oov_token="placeholder")
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("\nWord Index = ", word_index)
print("Sequences = ", sequences)

# test data containing words the tokenizer has not seen before
test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
Output:
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
Word Index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4, 'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
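The resulting sequences have different lengths, while neural networks generally expect fixed-length input. A minimal sketch of padding them to a uniform length with pad_sequences from tensorflow.keras.preprocessing.sequence; the maxlen=6 and padding='post' values here are illustrative assumptions, not part of the original example:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# the test sequences produced above
test_seq = [[5, 1, 3, 2], [7, 4, 1, 2]]

# append zeros so that every sequence has length 6
padded = pad_sequences(test_seq, maxlen=6, padding='post')
print(padded)
# [[5 1 3 2 0 0]
#  [7 4 1 2 0 0]]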