Enhanced Subtitle Generation in
Videos Leveraging Hybrid BERT-
CNN-LSTM Architecture for
Contextual Understanding
Nagendra Babu Rajaboina
21PHD1097
Dept of SCOPE
Objective of the Work
This work presents an end-to-end pipeline for context-aware video subtitle generation.
Building a reliable approach for subtitle generation is the
focus of this study.
Challenges of the Existing Work
Due to complicated backgrounds, a wide variety of fonts, low contrast
between words and backgrounds, and the lack of understanding of the
text's context, existing algorithms face significant difficulties in video
subtitle detection.
The absence of contextual awareness in processing video subtitles often
leads to errors and misinterpretations.
Consequently, the subtitles may fail to convey the intended meaning of
the text, preventing viewers from fully capturing the essence of the
video.
Introduction
In the digital age, the demand for accessible multimedia content has increased,
particularly in the realm of videos.
Subtitles play a pivotal role in making videos accessible to a broader audience,
including those with hearing impairments and individuals who speak different
languages.
Traditional methods of subtitle generation often rely on manual transcription or
simple speech-to-text algorithms, which can be both time-consuming and prone
to inaccuracies.
To address these challenges, we propose an advanced methodology for
automatic subtitle generation that leverages the power of deep learning.
Our approach focuses on generating contextually accurate subtitles by
analyzing the intricate relationship between the video's visual content and its
corresponding text.
Unlike conventional methods, which primarily depend on audio cues, our system
uses a hybrid model to comprehend both the visual and textual elements of a
video.
This ensures that the generated subtitles are not only accurate but also
contextually relevant, enhancing the viewer's experience.
The methodology we employ begins with raw video frames as input, followed by
the application of Tesseract Optical Character Recognition (OCR) to extract any
existing text within the frames.
This step is crucial for recognizing any subtitles already embedded in the video.
Next, we utilize Bidirectional Encoder Representations from Transformers
(BERT) to achieve contextual embedding of the recognized text.
BERT's ability to understand the context of words within a sentence enhances
the accuracy of the textual representation.
In parallel, a Convolutional Neural Network (CNN) is employed to extract
features from the video frames.
The CNN captures important visual elements, including objects, characters,
and background scenes, which are essential for understanding the video's
context.
The outputs from both BERT and the CNN are then fed into a Long Short-
Term Memory (LSTM) network.
The LSTM is adept at handling temporal dependencies, enabling it to
understand the sequence of events within the video.
By combining these advanced techniques, our system can generate
subtitles that are not only accurate but also contextually aligned with the
video's content.
This hybrid approach marks a significant advancement in the field of
automatic subtitle generation, offering a robust solution for creating high-
quality subtitles in various video formats.
Existing System
Paper: Alshawi, A. A. A., Tanha, J., & Balafar, M. A. (2024). An attention-based convolutional recurrent neural network for scene text recognition. IEEE Access.
Technique: Hybrid encoder-decoder networks, in which CNNs of different architectural depths were used to encode the input images and an RNN was used to decode the related character sequences.
Disadvantage: Lack of deep learning neural network parameter tuning.

Paper: Poulos, J., & Valle, R. (2021). Character-based handwritten text transcription with attention networks. Neural Computing and Applications, 33(16), 10563-10573.
Technique: Combined BiLSTM and CNNs to create feature extractors for the scene image datasets; a contour detection technique was applied to speed up the text detection process.
Disadvantage: Identification takes a little longer.

Paper: Kantipudi, M. P., Kumar, S., & Kumar Jha, A. (2021). Scene text recognition based on bidirectional LSTM and deep neural network. Computational Intelligence and Neuroscience, 2021(1), 2676780.
Technique: An RNN encoder-decoder network was used to handle lengthy text transcription sequences; it used a softmax function, which outperformed the sigmoid function.
Disadvantage: The model was trained to identify character sequences rather than entire sentences.

Paper: Yan, Hongyu, and Xin Xu. "End-to-end video subtitle recognition via a deep residual neural network." Pattern Recognition Letters 131 (2020): 368-375.
Technique: Uses the Connectionist Text Proposal Network (CTPN) for subtitle detection and the Residual Network (ResNet), Gated Recurrent Unit (GRU), and Connectionist Temporal Classification (CTC) for Chinese and English subtitle recognition.
Disadvantage: Tested on only particular datasets.

Paper: Carbune, V., Gonnet, P., Deselaers, T., Rowley, H. A., Daryin, A., Calvo, M., ... & Gervais, P. (2020). Fast multi-language LSTM-based online handwriting recognition. International Journal on Document Analysis and Recognition (IJDAR), 23(2), 89-102.
Technique: Integrated Bézier curves with a customized BiLSTM and CTC loss as the input encoder for handwriting recognition.
Disadvantage: High complexity, as the BiLSTM takes care of all the tasks.

Paper: Aswin, V. B., Javed, M., Parihar, P., Aswanth, K., Druval, C. R., Dagar, A., & Aravinda, C. V. (2021). NLP-driven ensemble-based automatic subtitle generation and semantic video summarization technique. In Advances in Artificial Intelligence and Data Engineering: Select Proceedings of AIDE 2019 (pp. 3-13).
Technique: Speech recognition for subtitle generation and Natural Language Processing (NLP) algorithms for content summarization.
Disadvantage: Limitations such as ensuring precise speech recognition, computational cost, and restricted generalizability across various video formats and languages.
Paper: Yao, C., Bai, X., & Liu, W. (2014). A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing, 23(11), 4737-4749.
Technique: Scene Text Recognition (STR) recognizes a text's individual characters in a picture in the right order; in contrast to object recognition, it typically recognizes a single class of objects.
Disadvantage: STR models are more complex, and hand-crafted features were used.

Paper: Lee, J., Park, S., Baek, J., Oh, S. J., Kim, S., & Lee, H. (2020). On recognizing texts of arbitrary shapes with 2D self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 546-547).
Technique: Transformer-based models, namely the no-recurrence sequence-to-sequence text recognizer (NRTR) and the Self-Attention Text Recognition Network (SATRN).
Disadvantage: Accuracy is limited because these models frequently fail to adequately exploit context, which emphasizes the necessity of specialized methods for dynamic video situations when recognizing subtitles.
Previous research has examined the use of NLP and speech-recognition algorithms for content summarization and subtitle production, but has not fully addressed the issues of accuracy, computational cost, varied text formats and styles, and contextual understanding.
In addition, existing work fails to capture the intended meaning of the text.
This gap highlights the need for further studies to improve the resilience and
usability of automated subtitle-generation systems across a range of multimedia
scenarios.
Research Contribution
A comprehensive framework was formed by combining video pre-processing, Tesseract OCR, contextual comprehension with BERT, image feature extraction with CNN, and temporal dependency modelling with LSTM.
The framework preserves context and captures word dependencies in video subtitles, unlike conventional approaches, which only classify words without understanding their context.
Our methodology analyses the temporal dependencies between words using
LSTM and provides deep contextual awareness using BERT.
Why Tesseract OCR?
Most current text recognition techniques focus on English and number recognition.
In text processing, OCR is used as the first step in extracting text from frames in photos or
videos.
Recognized words or characters are the outputs of the OCR. One of OCR's advantages is
that it can precisely extract individual characters or phrases from images.
One drawback of OCR is that it cannot handle complicated layouts, a wide range of font styles, and varying text sizes.
Tesseract OCR typically performs better than generic OCR in terms of accuracy and
flexibility for diverse font styles and text sizes because it is more specialized and optimized
for text recognition.
However, Tesseract OCR alone has many limitations, one of which is the
lack of document processing by understanding the contextual meaning of
words.
This limitation can be reduced by combining Tesseract OCR with a deep
neural network (DNN).
Working together with a DNN improves both feature extraction and
contextual understanding, which helps the combined system overcome
obstacles that a single Tesseract OCR cannot overcome.
Why BERT?
The most common text to numerical conversion techniques are the bag of words, TF-IDF,
Word2Vec, and Doc2Vec.
BOW represents each word as a distinct feature, ignoring word order and syntax.
To generate a numerical vector representation, the frequency of each word in the text was
counted.
Each dimension represents a distinct word in the vocabulary and the value denotes the
frequency of that word in the document.
The BOW model has a few drawbacks while being straightforward and simple to use. It
completely disregards the text's word order, which results in the loss of important semantic
information.
BOW cannot deal with phrases that are not included in the lexicon. These terms are either
disregarded or interpreted as unknown, leading to data loss.
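As a concrete illustration of these bag-of-words properties, here is a minimal sketch using scikit-learn's CountVectorizer; the example sentences are hypothetical:

```python
# Minimal bag-of-words sketch with scikit-learn; the sentences are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the mat sat on the cat"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary: one dimension per distinct word
print(bow.toarray())  # identical rows: word order is lost, so both sentences look the same
```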
The next most common method is TF-IDF.
Terms that are common inside a document but uncommon throughout the corpus are
given greater weights by TF-IDF, highlighting terms that are significant and unique to
the document.
However, TF-IDF has several drawbacks. First, it handles each term separately and
disregards the semantic connections between terms, which can result in
representations that are not ideal when the context is important.
Moreover, TF-IDF misses the position or order of terms inside a document.
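A comparable minimal TF-IDF sketch, again with illustrative documents, showing how terms rare in the corpus receive higher weights while word order is still ignored:

```python
# Minimal TF-IDF sketch with scikit-learn; the documents are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "subtitles appear at the bottom of a frame",
    "subtitles describe the spoken dialogue",
    "a frame shows a city street at night",
]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

# Terms unique to the second document (e.g. "dialogue") receive higher weights
# than terms shared with other documents (e.g. "subtitles", "the").
for term, w in sorted(zip(tfidf.get_feature_names_out(), weights[1]), key=lambda p: -p[1]):
    if w > 0:
        print(f"{term}: {w:.2f}")
```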
The next method is Word2Vec. This utilizes neural networks to acquire distributed
word representations by analysing their co-occurrence patterns over an extensive
text corpus.
By encoding words as vectors and grouping words with similar meanings in vector
space, this method captures the semantic links between words.
However, Word2Vec has trouble producing embeddings for words that were not in the training corpus, owing to out-of-vocabulary (OOV) issues.
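A minimal Word2Vec sketch, assuming gensim 4.x and a toy corpus, illustrating both the co-occurrence-based similarity and the OOV failure mode described above:

```python
# Minimal Word2Vec sketch with gensim; the tiny corpus is illustrative only.
from gensim.models import Word2Vec

sentences = [["subtitle", "text", "appears", "on", "screen"],
             ["caption", "text", "appears", "below", "frame"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv.most_similar("subtitle", topn=2))   # neighbours learned from co-occurrence
# model.wv["unseen_word"]  # would raise KeyError: the classic out-of-vocabulary failure
```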
An expansion of the well-known word embedding method word2vec is
Doc2Vec.
Doc2Vec is intended to produce fixed-size vectors for complete texts or phrases.
Doc2Vec learns vector representations for complete texts while retaining the
semantic meaning of words within the context of the document.
This is achieved by using word embeddings in conjunction with document-level
data during the training phase.
However, learning meaningful document representations successfully requires a
significant quantity of training data, which may be difficult for smaller datasets.
One more powerful technique for converting text to numerical text is the Bidirectional Encoder
Representations from Transformers (BERT).
Instead of depending on fixed-size word embeddings, as classic models such as bag of words, TF-IDF, Word2Vec, and Doc2Vec do, BERT uses a transformer architecture to extract bidirectional context information from the full input text.
This enables BERT to produce word embeddings that are aware of the context and dynamically adjust
to the words that surround them in a sentence or page.
BERT creates dense contextual embeddings that retain not only the meaning of individual words, but
also their relationships with surrounding words.
BERT can construct contextual embeddings for unseen words depending on their surrounding context,
which makes it more effective than Word2Vec and Doc2Vec in handling out-of-vocabulary (OOV)
terms.
BERT is more resistant to changes in vocabulary and terminology specific to a given field. These
advantages motivated us to use the BERT base to convert text into contextual embedding.
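A small sketch of how BERT's sub-word (WordPiece) tokenization handles words that are not in the vocabulary, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the example phrase is arbitrary:

```python
# Sub-word tokenization sketch with Hugging Face transformers.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A word missing from the vocabulary is split into known sub-word pieces
# (roughly along the lines of 'video', '##con', ...) rather than being dropped,
# so BERT can still build a contextual embedding for it.
print(tokenizer.tokenize("videoconferencing subtitles"))
```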
Why Deep Learning?
A better feature extraction technique is important to obtain the best model.
Deep Learning can automatically learn hierarchical data representations from raw inputs.
In contrast to manually engineered features that require domain knowledge, deep learning
algorithms can automatically extract pertinent features from the data, thereby reducing labor
costs and increasing process efficiency.
Furthermore, complex patterns and relationships in the data can be captured by deep learning
models, even in high-dimensional spaces, which may be difficult using traditional feature
extraction methods.
Furthermore, deep learning models can handle large-scale feature extraction tasks quickly by
utilizing distributed training methods and parallel computing architectures, which enables them
to scale effectively as dataset sizes continue to increase.
These best features motivated us to use deep learning for feature extraction in this study.
Why CNN?
Among the many deep learning techniques, Convolutional Neural Networks (CNN) have many
advantages over other deep learning techniques.
CNNs are designed for the purpose of extracting hierarchical spatial characteristics from pictures.
They can identify complex patterns inside video frames, from simple features, such as edges and
textures, to intricate structures, such as object sections and scene compositions, owing to this
hierarchical representation.
CNNs are ideally suited for tasks such as object identification and recognition in dynamic video
situations owing to this feature.
Ultimately, CNNs are a strong option for feature extraction from video frames.
The output of the CNN, which is a feature map, and the output of BERT, which is a contextual embedding of the words, are given to the LSTM.
Why LSTM?
LSTMs are specifically designed to capture and utilize temporal dependencies in
sequential data.
In the context of subtitle generation, this means they can effectively understand the
sequence and timing of video frames and textual elements, leading to more accurate
and contextually appropriate subtitles.
Other deep learning techniques, like traditional feedforward neural networks, lack this
capability to retain information over long sequences.
Traditional Recurrent Neural Networks (RNNs) often suffer from the vanishing gradient
problem, where gradients diminish exponentially during backpropagation through time,
making it difficult to learn long-range dependencies.
LSTMs overcome this issue with their unique cell state and gating mechanisms (input,
output, and forget gates), allowing them to maintain and propagate gradients effectively
over long sequences.
This results in better learning and performance for tasks requiring understanding of
long-term dependencies, such as subtitle generation.
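For reference, the standard LSTM gate equations behind this mechanism, where \sigma is the sigmoid function and \odot denotes element-wise multiplication:

```latex
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```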
LSTMs can selectively remember or forget information through their gating mechanisms,
enabling them to maintain relevant contextual information over time.
This capability is particularly advantageous for subtitle generation, where the context of
previous frames and recognized text can significantly influence the accuracy of the
generated subtitles.
Other models like Convolutional Neural Networks (CNNs) or standard RNNs may not be
as effective in managing and utilizing this contextual information over long video
sequences.
PROPOSED METHOD
The end-to-end system shows the training model, which computes contextual embeddings of the subtitles extracted from the video using BERT and extracts image features from the video frames using the CNN.
The model combines the outputs of BERT and the CNN using an LSTM. The LSTM learns the relationship between the image and the subtitle, which, later in the testing phase, allows it to produce subtitles appropriate to each video frame.
Subtitle detection:
Subtitle detection is an essential step in the process of recognizing video
subtitles
It makes sense to use the OCR approach for these subtitles. However, in
reality, it is difficult to obtain subtitles of the same font, size, and color.
OCR-based subtitle identification may encounter issues, such as incorrect
text recognition, particularly when dealing with complex fonts or
backgrounds.
On the other hand, Tesseract OCR provides better flexibility for various text
formats and styles, which improves its capacity to recognize subtitles with
accuracy.
It can recognize and understand a wide range of font styles, sizes, and even
unusual text orientations owing to its strong training capabilities.
Tesseract OCR was used to process the video frame and
conduct text recognition.
The algorithm usually moves on to the next frame in the video
sequence if Tesseract OCR is unable to identify any subtitles in
the current video frame.
This process was repeated until the video ended.
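A minimal sketch of this frame-by-frame loop, assuming OpenCV and the pytesseract wrapper; the video path and the handling of detected text are placeholders rather than the exact implementation used in this work:

```python
# Frame-by-frame subtitle detection sketch with OpenCV and pytesseract.
import cv2
import pytesseract

cap = cv2.VideoCapture("input_video.mp4")     # placeholder video path
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:                                # the video has ended
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    text = pytesseract.image_to_string(rgb).strip()
    if text:                                  # subtitle found in this frame
        print(f"frame {frame_idx}: {text}")
    # otherwise simply move on to the next frame in the sequence
    frame_idx += 1
cap.release()
```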
After Tesseract OCR (TOCR) successfully finds subtitles in the first step of the process, the pipeline continues with a "Text Segmentation" or "Foreground Extraction" step to extract only the bare text, without any background, from the TOCR output.
Initially, the TOCR output image is binarized, producing a binary image in which black pixels represent the background and white pixels represent the text.
The connected patches of foreground pixels in the binary image that correspond to
individual characters or text components are then determined using Connected
Component Analysis (CCA).
Finally, the binary image is used to mask the original OCR output image, extracting the foreground regions that correspond to the text and rejecting the background regions.
This leaves only the text, isolated from the background.
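A minimal sketch of this binarization, connected-component analysis, and masking sequence, assuming OpenCV; the Otsu threshold and the component-size filter are illustrative choices, not values from this work:

```python
# Foreground (text) extraction sketch: binarize, run connected-component analysis,
# then mask the original frame so that only text pixels remain.
import cv2
import numpy as np

frame = cv2.imread("frame_with_subtitle.png")            # placeholder frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Binarize (assuming light text on a darker background): text becomes white foreground.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Connected-component analysis: keep components whose size is plausible for characters.
num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
mask = np.zeros_like(binary)
for label in range(1, num_labels):                        # label 0 is the background
    area = stats[label, cv2.CC_STAT_AREA]
    if 20 < area < 5000:                                  # illustrative size filter
        mask[labels == label] = 255

# Apply the mask: background regions are rejected, only text pixels survive.
text_only = cv2.bitwise_and(frame, frame, mask=mask)
cv2.imwrite("text_only.png", text_only)
```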
This post-processing step after Tesseract OCR provides an efficient way to separate plain text from background objects in an image frame.
This guarantees that only the text content is maintained by
methodically examining and processing OCR output images,
thereby improving readability and clarity.
Moreover, removing the background noise increases the accuracy
of later analytic methods.
This method maximizes the extraction of valuable textual content,
making it more effective for a variety of applications
Contextual embeddings of recognized text
BERT's bidirectional transformer design, which considers the entire context of a word within a sentence, is used for contextual embedding.
Using Sub-word tokenization, it effectively handles uncommon or unseen terms while
capturing deep semantic links between words.
Furthermore, BERT can learn intricate linguistic patterns owing to its pretraining on large corpora, which makes it very successful for a variety of natural language processing tasks without relying heavily on manually created features or predefined vocabularies.
The position and context of a word within a sentence are represented by the BERT model.
Using a trained model, BERT converts text data into fixed feature vectors as part of a feature-based process. BERT can generate vector representations that consider the context and position of words within a sentence.
Text identified by Tesseract OCR is pre-processed to remove noise, fix punctuation, and adjust formatting, improving the input quality for BERT.
To make further processing easier, the pre-processed text is then tokenized
into smaller chunks, such as words or sub-words.
Tokenization guarantees that each linguistic component is divided into
unique tokens.
The BERT model was then fed with these tokens to begin the encoding
process.
By considering the bidirectional context of every token in the text, BERT
creates contextual embeddings using a transformer-based architecture.
Through this procedure, BERT can better comprehend the semantics of the
recognized text by capturing complex contextual information.
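A minimal sketch of this tokenization and encoding step, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the OCR text and the pooling choice are placeholders:

```python
# Contextual embedding sketch: tokenize the recognized OCR text and encode it with BERT.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

ocr_text = "recognized subtitle text from the frame"            # placeholder OCR output
inputs = tokenizer(ocr_text, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state        # (1, num_tokens, 768) contextual vectors
sentence_embedding = token_embeddings.mean(dim=1)   # one possible fixed-size representation
```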
The encoder and the transformer layers are the two main parts of the bidirectional transformer-based neural network architecture that constitutes BERT.
By processing the input tokens in a bidirectional fashion, the encoder part of the
BERT can extract contextual information from both the previous and subsequent
tokens.
Furthermore, BERT uses several layers of transformers, each comprising feed-
forward neural networks and self-attention processes.
By performing hierarchical feature extraction, these transformer layers gradually improve the input token representations by considering their contextual relationships.
Finally, this stage returns the contextual embeddings of the input text. These embeddings are representations of each token in the input sequence that take the surrounding context into account.
Feature extraction
Feature extraction is performed with Convolutional Neural Networks (CNNs), which automatically generate hierarchical representations of image attributes from raw pixel data.
CNNs do not require expert feature engineers because they can adaptively learn features from the
data itself.
CNNs can thus effectively represent semantic notions in images.
CNNs, on the other hand, are particularly effective at deriving hierarchical feature representations
from raw pixel data.
This motivated us to use CNN for feature extraction.
We resized the input images to 224 x 224 pixels.
The CNN applies two convolutional layers, the first with 32 filters and the second with 64 filters, each using a set of learnable 3 x 3 kernels to convolve over the input image, with the stride set to 1.
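A minimal Keras sketch of the feature extractor as described on this slide (224 x 224 input, two convolutional layers with 32 and 64 filters, 3 x 3 kernels, stride 1); the activation, padding, and pooling choices are assumptions:

```python
# CNN feature-extractor sketch (Keras); details beyond the stated layer sizes are assumed.
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, kernel_size=(3, 3), strides=1, activation="relu", padding="same"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), strides=1, activation="relu", padding="same"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),                      # one image feature vector per frame
])
cnn.summary()
```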
Sequential Dependency
The objective of the LSTM-based architecture created for this purpose is to
efficiently capture the interaction between the contextual data encoded in
subtitles and the visual elements captured from video frames.
The model uses a dual-input technique to accommodate both the word
embedding sequence and the image feature vector sequence in order to
accomplish this.
These two streams of data are combined at each time step so that the LSTM
can process textual and visual cues simultaneously.
Through the integration of these inputs, LSTM has the ability to identify
complex correlations between textual descriptions and visual material,
thus enabling a thorough comprehension of the link between both types
of information.
Through iterative training, the model enhances its capacity to identify
subtle semantic linkages by capitalizing on the complementary attributes
of both visual and textual data.
A combination of text data converted by BERT and picture feature vectors
from a CNN was used to create numerical representations for the LSTM
input.
LSTM is fed with this concatenated vector, which represents the textual
information as well as the context of the image.
To capture temporal correlations between frames and textual descriptions,
the LSTM is trained to comprehend both the backdrop image and the
sentence in its entirety.
When the video data were supplied without corresponding captions during
the testing phase, the focus shifted mainly to the visual component.
This stage involves running each video frame using a CNN to extract
features and record visual data.
The concatenated features are fed into the LSTM, which uses them together with the comprehension of the combined image-text context that it acquired during training.
Even though there is no textual input during testing, the LSTM applies what it has learned to produce subtitles for every frame based only on the visual data acquired by the CNN.
Layer                        Filters                  Kernel size
First Convolutional Layer    64                       3x3
Max pooling                  -                        2x2
Second Convolutional Layer   128                      3x3
Max pooling                  -                        2x2
Third Convolutional Layer    256                      3x3
Max pooling                  -                        2x2
LSTM                         256 units                -
First Dense Layer            128, ReLU activation     -
Second Dense Layer           Size of vocabulary       -
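A minimal Keras sketch of how the concatenated CNN features and BERT embeddings could feed the LSTM and dense layers listed above; the sequence length, feature dimensions, and vocabulary size are placeholders, and the single-word softmax output is one possible reading of the table:

```python
# Fusion-model sketch (Keras functional API): per-time-step CNN feature vectors are
# concatenated with BERT embeddings and passed through LSTM(256) -> Dense(128) -> softmax.
from tensorflow.keras import layers, models

SEQ_LEN = 20          # assumed number of time steps (frames / tokens)
IMG_FEAT_DIM = 256    # assumed dimensionality of the per-frame CNN feature vector
TEXT_DIM = 768        # BERT-base hidden size
VOCAB_SIZE = 10000    # placeholder vocabulary size

image_feats = layers.Input(shape=(SEQ_LEN, IMG_FEAT_DIM), name="cnn_features")
text_embeds = layers.Input(shape=(SEQ_LEN, TEXT_DIM), name="bert_embeddings")

fused = layers.Concatenate(axis=-1)([image_feats, text_embeds])   # visual + textual cues
hidden = layers.LSTM(256)(fused)                                  # 256 LSTM units, as in the table
hidden = layers.Dense(128, activation="relu")(hidden)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)  # word prediction over the vocabulary

model = models.Model([image_feats, text_embeds], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```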
EXPERIMENTAL RESULTS - Feature Selection
Accuracy vs Execution Time vs Size of Convolutional Layer
Accuracy vs Execution Time vs Size of LSTM
Time Complexity Comparison
Accuracy Comparison
Recall Comparison
Precision Comparison
Conclusion
The proposed method uses several steps to create a strong connection between the video
frames and subtitles.
Through the use of Tesseract for Optical Character Recognition (OCR), BERT for
advanced language modelling, and a Convolutional Neural Network (CNN) for feature
extraction, the model is able to extract textual and visual information from video frames
and subtitles.
Moreover, sequential dependency learning enabled by Encoder and Decoder Long Short-
Term Memory (LSTM) networks improves the comprehension of contextual connections
between the associated text and visual content.
In the testing phase, the model showed promising skills in producing precise subtitles for
specified video frames after undergoing significant experimentation and training.
According to recent observations, the low contrast between text
and background, varying fonts, and diverse backgrounds pose
major obstacles to video subtitle identification.
Our results unambiguously show that the accuracy of subtitle
recognition is significantly increased by integrating OCR,
contextual embedding via BERT, and temporal dependencies via
CNN-LSTM.
The TED Talks and Open Images datasets were used.
Thank You