Bag-of-Words Representations in TensorFlow
Last Updated: 12 Feb, 2025
Bag-of-Words (BoW) converts text into numerical vectors based on word occurrences, ignoring grammar and word order. The model represents text as a collection (bag) of words, where each word's frequency or presence is recorded. It follows these steps:
- Tokenization – Splitting text into words.
- Vocabulary Creation – Listing all unique words from the dataset.
- Vectorization – Converting text into a numerical vector where each dimension represents a word's occurrence.
For example, consider two sentences:
- "The cat sat on the mat."
- "The dog lay on the rug."
The vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "lay", "rug"]
Their BoW representation, using word counts (note that "the" appears twice in each sentence):
Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0]
Sentence 2: [2, 0, 0, 1, 0, 1, 1, 1]
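The tokenize → vocabulary → vectorize pipeline above can be sketched in plain Python. This is a minimal illustration (the function name is for illustration only); it uses raw word counts, the same convention as TextVectorization's "count" mode used later in this article, so "the" contributes 2:

```python
import re
from collections import Counter

def bag_of_words(sentences):
    # 1. Tokenization: lowercase and extract word tokens
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # 2. Vocabulary creation: unique words in order of first appearance
    vocab = []
    for tokens in tokenized:
        for word in tokens:
            if word not in vocab:
                vocab.append(word)
    # 3. Vectorization: count each vocabulary word per sentence
    vectors = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["The cat sat on the mat.",
                               "The dog lay on the rug."])
print(vocab)    # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'lay', 'rug']
print(vectors)  # [[2, 1, 1, 1, 1, 0, 0, 0], [2, 0, 0, 1, 0, 1, 1, 1]]
```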
The Bag-of-Words method is widely used in text classification, sentiment analysis, and information retrieval.
Implementing Bag-of-Words in TensorFlow
We will implement BoW using TensorFlow's tf.keras.layers.TextVectorization layer.
1. Install Dependencies
Ensure you have TensorFlow installed:
pip install tensorflow
2. Import Libraries
Python
import tensorflow as tf
import numpy as np
3. Define Sample Text Data
Python
text_data = [
"The cat sat on the mat.",
"The dog lay on the rug."
]
4. Create and Configure the TextVectorization Layer
Python
# Set parameters
max_tokens = 10
output_mode = "count"
# Create the vectorization layer
vectorizer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_mode=output_mode)
# Adapt the vectorizer to the dataset
vectorizer.adapt(text_data)
5. Convert Text into BoW Representation
Python
bow_representation = vectorizer(text_data)
# Display results
print("Vocabulary:", vectorizer.get_vocabulary())
print("Bag-of-Words Representation:\n", bow_representation.numpy())
Output:
Vocabulary: ['[UNK]', 'the', 'on', 'sat', 'rug', 'mat', 'lay', 'dog', 'cat']
Bag-of-Words Representation:
[[0 2 1 1 0 1 0 0 1]
[0 2 1 0 1 0 1 1 0]]
The vectorizer.get_vocabulary() method returns the learned vocabulary, and bow_representation.numpy() provides the BoW vectors for the input sentences.
- The value 2 is the count of "the", which occurs twice in each sentence.
- The remaining values give each word's frequency in the respective sentence, following the vocabulary order above; index 0 is reserved for the out-of-vocabulary token '[UNK]'.
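The output_mode parameter controls the representation. For example, switching to "multi_hot" (another documented TextVectorization mode) produces binary presence vectors, matching the 0/1 style of BoW, instead of counts:

```python
import tensorflow as tf

text_data = [
    "The cat sat on the mat.",
    "The dog lay on the rug."
]

# "multi_hot" records only presence (0/1) per vocabulary word
binary_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10, output_mode="multi_hot"
)
binary_vectorizer.adapt(text_data)
print(binary_vectorizer(text_data).numpy())
# Every non-zero entry from "count" mode becomes 1 here
```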
Applications of BoW
- Text Classification – Used in spam detection, sentiment analysis.
- Information Retrieval – Search engines match queries with documents using BoW.
- Topic Modeling – Helps in clustering similar documents.
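To make the text-classification use case concrete, here is a minimal, hypothetical sketch: a logistic-regression-style Keras model trained on BoW count vectors. The texts and sentiment labels are invented purely for illustration:

```python
import tensorflow as tf

# Tiny made-up sentiment corpus (1 = positive, 0 = negative)
texts = [
    "great movie loved it",
    "terrible plot wasted time",
    "loved the acting great fun",
    "wasted money on a terrible film"
]
labels = tf.constant([1.0, 0.0, 1.0, 0.0])

# Build BoW count features with TextVectorization
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20, output_mode="count")
vectorizer.adapt(texts)
features = tf.cast(vectorizer(texts), tf.float32)

# A single sigmoid unit on top of BoW counts acts like logistic regression
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(features, labels, epochs=20, verbose=0)

probs = model(features).numpy()  # predicted probabilities, shape (4, 1)
```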
Limitations of Bag-of-Words
- Ignores Word Order – Cannot differentiate between "dog bites man" and "man bites dog."
- Sparse Representation – Large vocabularies lead to high-dimensional vectors.
- Lack of Semantic Understanding – Words with similar meanings are treated differently.
Alternatives: TF-IDF, Word Embeddings (Word2Vec, GloVe), Transformer-based models (BERT).
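Of these alternatives, TF-IDF is available directly in the same layer via output_mode="tf_idf", which weights each word's count by its inverse document frequency. A minimal sketch on the same two sentences:

```python
import tensorflow as tf

text_data = [
    "The cat sat on the mat.",
    "The dog lay on the rug."
]

# Same layer as before, but counts are weighted by inverse document frequency
tfidf_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10, output_mode="tf_idf"
)
tfidf_vectorizer.adapt(text_data)
print(tfidf_vectorizer(text_data).numpy())
# Words shared by both sentences (e.g. "the", "on") get lower weights
# than words unique to one sentence
```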
Bag-of-Words is a simple yet effective method for text representation. With TensorFlow’s TextVectorization layer, implementing BoW is efficient and scalable. However, for complex NLP tasks, embeddings and deep learning-based representations are often preferred.