Doc2Vec, also called Paragraph Vector, is a popular technique in Natural Language Processing that represents documents as vectors. It was introduced as an extension of Word2Vec, an approach for representing words as numerical vectors. While Word2Vec learns word embeddings, Doc2Vec learns document embeddings. In this article, we will discuss the Doc2Vec approach in detail.
What is Doc2Vec?
Doc2Vec is a neural network-based approach that learns the distributed representation of documents. It is an unsupervised learning technique that maps each document to a fixed-length vector in a high-dimensional space. The vectors are learned in such a way that similar documents are mapped to nearby points in the vector space. This enables us to compare documents based on their vector representation and perform tasks such as document classification, clustering, and similarity analysis.
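For instance, once two documents have been mapped to vectors, their similarity can be measured with cosine similarity. The toy vectors below are made up purely for illustration; real Doc2Vec vectors are produced by a trained model, as shown later in this article.
Python3
import numpy as np

# Two hypothetical document vectors (values chosen only for illustration)
doc_vec_a = np.array([0.2, 0.8, 0.1])
doc_vec_b = np.array([0.25, 0.75, 0.05])

# Cosine similarity: values close to 1 indicate similar documents
cosine = np.dot(doc_vec_a, doc_vec_b) / (
    np.linalg.norm(doc_vec_a) * np.linalg.norm(doc_vec_b))
print(cosine)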
There are two main variants of the Doc2Vec approach:
- Distributed Memory (DM)
- Distributed Bag of Words (DBOW)
Distributed Memory (DM)
Distributed Memory is a variant of the Doc2Vec model, which is an extension of the popular Word2Vec model. The basic idea behind Distributed Memory is to learn a fixed-length vector representation for each piece of text data (such as a sentence, paragraph, or document) by taking into account the context in which it appears.
DM Architecture
In the DM architecture, the neural network takes two types of inputs: the context words and a unique document ID. The context words are used to predict a target word, and the document ID is used to capture the overall meaning of the document. The network has two main components: the projection layer and the output layer.
The projection layer is responsible for creating the word vectors and document vectors. A unique word vector is learned for each word in the vocabulary, and a unique document vector is learned for each document. These vectors are learned during training by optimizing a loss function that minimizes the difference between the predicted word and the actual target word. The output layer then takes the combined representation of the context words and the document vector and predicts the target word.
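As a quick illustration, the sketch below trains a DM model with Gensim by setting dm=1 (Gensim's flag for the Distributed Memory variant). The tiny corpus and hyperparameters here are placeholders chosen only for brevity.
Python3
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is tagged with a unique string ID
corpus = ["the cat sat on the mat",
          "the dog sat on the log"]
tagged = [TaggedDocument(words=doc.split(), tags=[str(i)])
          for i, doc in enumerate(corpus)]

# dm=1 selects the Distributed Memory architecture
dm_model = Doc2Vec(tagged, dm=1, vector_size=50,
                   window=2, min_count=1, epochs=40)

# First few dimensions of document 0's learned vector
print(dm_model.dv["0"][:5])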
Distributed Bag of Words (DBOW)
DBOW is a simpler version of the Doc2Vec algorithm that focuses on understanding how words are distributed in a text, rather than their meaning. This architecture is preferred when the goal is to analyze the structure of the text, rather than its content.
DBOW Architecture
In the DBOW architecture, a unique vector representation is assigned to each document in the corpus, but there are no separate word vectors. Instead, the algorithm takes in a document and learns to predict the probability of each word in the document given only the document vector.
The model does not take into account the order of the words in the document, treating the document as a collection or "bag" of words. This makes the DBOW architecture faster to train than DM, but potentially less powerful in capturing the meaning of the documents.
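For comparison, here is a minimal DBOW sketch. Setting dm=0 selects the DBOW architecture in Gensim, and the optional dbow_words=1 flag additionally trains word vectors alongside the document vectors; the corpus is the same toy data as above.
Python3
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the cat sat on the mat",
          "the dog sat on the log"]
tagged = [TaggedDocument(words=doc.split(), tags=[str(i)])
          for i, doc in enumerate(corpus)]

# dm=0 selects the Distributed Bag of Words architecture;
# dbow_words=1 also trains word vectors in skip-gram fashion (optional)
dbow_model = Doc2Vec(tagged, dm=0, dbow_words=1, vector_size=50,
                     min_count=1, epochs=40)

# First few dimensions of document 1's learned vector
print(dbow_model.dv["1"][:5])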
Difference between DM and DBOW
DM architecture considers both the word order and document context, making it more powerful for capturing the semantic meaning of documents, while DBOW architecture is simpler and faster to train, and is useful for capturing distributional properties of words in a corpus.
The choice between the two architectures depends on the specific goals of the task at hand, and the two are often used in combination to capture both the semantic meaning and the distributional properties of texts, as sketched below.
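One simple way to combine the two variants is to concatenate the vectors that each model infers for the same document. This sketch assumes the dm_model and dbow_model from the earlier snippets are still in scope and were trained with vector_size=50.
Python3
import numpy as np

# Assumes dm_model and dbow_model from the sketches above
new_doc = "the cat sat on the log".split()
combined_vector = np.concatenate([dm_model.infer_vector(new_doc),
                                  dbow_model.infer_vector(new_doc)])
print(combined_vector.shape)  # (100,) when both models use vector_size=50
With both variants covered, let's write a complete Doc2Vec example using Python's Gensim library.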
Python3
import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# word_tokenize needs the 'punkt' tokenizer models
nltk.download('punkt', quiet=True)

data = ["This is the first document",
        "This is the second document",
        "This is the third document",
        "This is the fourth document"]

# Tag each tokenized document with a unique string ID
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()),
                              tags=[str(i)])
               for i, doc in enumerate(data)]

# Build the vocabulary and train the Doc2Vec model
model = Doc2Vec(vector_size=20, min_count=2, epochs=50)
model.build_vocab(tagged_data)
model.train(tagged_data,
            total_examples=model.corpus_count,
            epochs=model.epochs)

# Infer a vector for each document
document_vectors = [model.infer_vector(word_tokenize(doc.lower()))
                    for doc in data]

for i, doc in enumerate(data):
    print("Document", i + 1, ":", doc)
    print("Vector:", document_vectors[i])
    print()
Output:
Document Vectors generated by Doc2Vec Model
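The trained model's document vectors can also be queried for similarity directly. The snippet below is a sketch that assumes the model and word_tokenize from the example above are still in scope; it infers a vector for a new piece of text and finds the most similar stored documents by cosine similarity.
Python3
# Infer a vector for a new, unseen piece of text
new_vector = model.infer_vector(word_tokenize("this is a new document"))

# Find the stored documents most similar to the inferred vector
most_similar = model.dv.most_similar([new_vector], topn=2)
print(most_similar)  # list of (tag, cosine similarity) pairs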
Advantages of Doc2Vec
- Doc2Vec can capture the semantic meaning of entire documents or paragraphs, unlike traditional bag-of-words models that treat each word independently.
- It can be used to generate document embeddings, which can be used for a variety of downstream tasks such as document classification, clustering, and similarity search.
- Doc2Vec can infer embeddings for new, unseen documents after training (via infer_vector), whereas frequency-based representations such as TF-IDF are tied to the vocabulary and statistics of the original corpus.
- It can be trained on large corpora using parallel processing, making it scalable to big data applications.
- It is flexible and can be easily customized by adjusting various hyperparameters such as the dimensionality of the document embeddings, the number of training epochs, and the training algorithm.