Enhanced Subtitle Generation in
Videos Leveraging Hybrid BERT-
CNN-LSTM Architecture for
Contextual Understanding
Nagendra Babu Rajaboina
21PHD1097
Dept of SCOPE
Objective of the Work
This work presents an end-to-end pipeline for context-aware video subtitle generation.
Building a reliable approach for subtitle generation is the
focus of this study.
Challenges of the Existing Work
Due to complicated backgrounds, a wide variety of fonts, low contrast
between words and backgrounds, and the lack of understanding of the
text's context, existing algorithms face significant difficulties in video
subtitle detection.
The absence of contextual awareness in processing video subtitles often
leads to errors and misinterpretations.
Consequently, the subtitles may fail to convey the intended meaning of
the text, preventing viewers from fully capturing the essence of the
video.
Introduction
In the digital age, the demand for accessible multimedia content has increased,
particularly in the realm of videos.
Subtitles play a pivotal role in making videos accessible to a broader audience,
including those with hearing impairments and individuals who speak different
languages.
Traditional methods of subtitle generation often rely on manual transcription or
simple speech-to-text algorithms, which can be both time-consuming and prone
to inaccuracies.
To address these challenges, we propose an advanced methodology for
automatic subtitle generation that leverages the power of deep learning.
Our approach focuses on generating contextually accurate subtitles by
analyzing the intricate relationship between the video's visual content and its
corresponding text.
Unlike conventional methods, which primarily depend on audio cues, our system
uses a hybrid model to comprehend both the visual and textual elements of a
video.
This ensures that the generated subtitles are not only accurate but also
contextually relevant, enhancing the viewer's experience.
The methodology we employ begins with raw video frames as input, followed by
the application of Tesseract Optical Character Recognition (OCR) to extract any
existing text within the frames.
This step is crucial for recognizing any subtitles already embedded in the video.
Next, we utilize Bidirectional Encoder Representations from Transformers
(BERT) to achieve contextual embedding of the recognized text.
BERT's ability to understand the context of words within a sentence enhances
the accuracy of the textual representation.
In parallel, a Convolutional Neural Network (CNN) is employed to extract
features from the video frames.
The CNN captures important visual elements, including objects, characters,
and background scenes, which are essential for understanding the video's
context.
The outputs from both BERT and the CNN are then fed into a Long Short-
Term Memory (LSTM) network.
The LSTM is adept at handling temporal dependencies, enabling it to
understand the sequence of events within the video.
By combining these advanced techniques, our system can generate
subtitles that are not only accurate but also contextually aligned with the
video's content.
This hybrid approach marks a significant advancement in the field of
automatic subtitle generation, offering a robust solution for creating high-
quality subtitles in various video formats.
Existing System
Paper: Alshawi, A. A. A., Tanha, J., & Balafar, M. A. (2024). An attention-based convolutional recurrent neural network for scene text recognition. IEEE Access.
Technique: Hybrid encoder-decoder networks, in which CNNs of different architectural depths were used to encode the input images and an RNN was used to decode the related character sequences.
Disadvantage: Lack of deep learning neural network parameter tuning.

Paper: Poulos, J., & Valle, R. (2021). Character-based handwritten text transcription with attention networks. Neural Computing and Applications, 33(16), 10563-10573.
Technique: Combined BiLSTM and CNNs to create feature extractors for the scene image datasets; a contour detection technique was applied to speed up the text detection process.
Disadvantage: Identification takes a little longer.

Paper: Kantipudi, M. P., Kumar, S., & Kumar Jha, A. (2021). Scene text recognition based on bidirectional LSTM and deep neural network. Computational Intelligence and Neuroscience, 2021(1), 2676780.
Technique: An RNN encoder-decoder network was used to handle lengthy text transcription sequences; it used a softmax function, which outperformed the sigmoid function.
Disadvantage: The model was trained to identify character sequences rather than entire sentences.

Paper: Yan, Hongyu, and Xin Xu. "End-to-end video subtitle recognition via a deep residual neural network." Pattern Recognition Letters 131 (2020): 368-375.
Technique: Uses the Connectionist Text Proposal Network (CTPN) for subtitle detection and the Residual Network (ResNet), Gated Recurrent Unit (GRU), and Connectionist Temporal Classification (CTC) for Chinese and English subtitle recognition.
Disadvantage: Tested on only particular datasets.

Paper: Carbune, V., Gonnet, P., Deselaers, T., Rowley, H. A., Daryin, A., Calvo, M., ... & Gervais, P. (2020). Fast multi-language LSTM-based online handwriting recognition. International Journal on Document Analysis and Recognition (IJDAR), 23(2), 89-102.
Technique: Integrated Bézier curves with a customized BiLSTM and CTC loss as the input encoder for handwriting recognition.
Disadvantage: High complexity, as the BiLSTM takes care of all the tasks.

Paper: Aswin, V. B., Javed, M., Parihar, P., Aswanth, K., Druval, C. R., Dagar, A., & Aravinda, C. V. (2021). NLP-driven ensemble-based automatic subtitle generation and semantic video summarization technique. In Advances in Artificial Intelligence and Data Engineering: Select Proceedings of AIDE 2019 (pp. 3-13).
Technique: Speech recognition for subtitle generation and Natural Language Processing (NLP) algorithms for content summarization.
Disadvantage: Limitations such as ensuring precise speech recognition, computational cost, and restricted generalizability across various video formats and languages.
Paper: Yao, C., Bai, X., & Liu, W. (2014). A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing, 23(11), 4737-4749.
Technique: Scene Text Recognition (STR) recognizes a text's individual characters in a picture in the right order; in contrast to object recognition, it typically recognizes a single class of objects.
Disadvantage: STR models are more complex, and hand-crafted features were used.

Paper: Lee, J., Park, S., Baek, J., Oh, S. J., Kim, S., & Lee, H. (2020). On recognizing texts of arbitrary shapes with 2D self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 546-547).
Technique: Transformer-based models, namely the no-recurrence sequence-to-sequence text recognizer (NRTR) and the Self-Attention Text Recognition Network (SATRN).
Disadvantage: Accuracy is limited because these models frequently fail to adequately exploit context, which emphasizes the necessity of specialized methods for dynamic video situations when recognizing subtitles.
Previous research has examined the use of NLP and speech-recognition algorithms for content summarization and subtitle production, but has not fully addressed the issues of accuracy, computational cost, varied text formats and styles, and contextual understanding.
In addition, existing work fails to capture the intended meaning of the text.
This gap highlights the need for further studies to improve the resilience and
usability of automated subtitle-generation systems across a range of multimedia
scenarios.
Research Contribution
A comprehensive framework was formed by combining video pre-processing, Tesseract OCR, contextual comprehension with BERT, image feature extraction with CNN, and temporal dependency modelling with LSTM.
The framework preserves context and captures word dependencies in video subtitles, unlike conventional approaches, which only classify words without understanding their context.
Our methodology analyses the temporal dependencies between words using
LSTM and provides deep contextual awareness using BERT.
Why Tesseract OCR?
Most current text recognition techniques focus on English and number recognition.
In text processing, OCR is used as the first step in extracting text from frames in photos or
videos.
Recognized words or characters are the outputs of the OCR. One of OCR's advantages is
that it can precisely extract individual characters or phrases from images.
One drawback of OCR is that it cannot handle complicated layouts, a wide range of font styles, and varying text sizes.
Tesseract OCR typically performs better than generic OCR in terms of accuracy and
flexibility for diverse font styles and text sizes because it is more specialized and optimized
for text recognition.
However, Tesseract OCR alone has many limitations, one of which is the
lack of document processing by understanding the contextual meaning of
words.
This limitation can be reduced by combining Tesseract OCR with a deep
neural network (DNN).
Working together with a DNN improves both feature extraction and
contextual understanding, which helps the combined system overcome
obstacles that a single Tesseract OCR cannot overcome.
Why BERT?
The most common text to numerical conversion techniques are the bag of words, TF-IDF,
Word2Vec, and Doc2Vec.
BOW represents each word as a distinct feature, ignoring word order and syntax.
To generate a numerical vector representation, the frequency of each word in the text was
counted.
Each dimension represents a distinct word in the vocabulary and the value denotes the
frequency of that word in the document.
The BOW model has a few drawbacks while being straightforward and simple to use. It
completely disregards the text's word order, which results in the loss of important semantic
information.
BOW cannot deal with phrases that are not included in the lexicon. These terms are either
disregarded or interpreted as unknown, leading to data loss.
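As a concrete illustration of these bag-of-words properties, here is a minimal sketch using scikit-learn's CountVectorizer; the example sentences are hypothetical:

```python
# Minimal bag-of-words sketch with scikit-learn; the sentences are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the mat sat on the cat"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary: one dimension per distinct word
print(bow.toarray())  # identical rows: word order is lost, so both sentences look the same
```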
The next most common method is TF-IDF.
Terms that are common inside a document but uncommon throughout the corpus are
given greater weights by TF-IDF, highlighting terms that are significant and unique to
the document.
However, TF-IDF has several drawbacks. First, it handles each term separately and
disregards the semantic connections between terms, which can result in
representations that are not ideal when the context is important.
Moreover, TF-IDF misses the position or order of terms inside a document.
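A comparable minimal TF-IDF sketch, again with illustrative documents, showing how terms rare in the corpus receive higher weights while word order is still ignored:

```python
# Minimal TF-IDF sketch with scikit-learn; the documents are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "subtitles appear at the bottom of a frame",
    "subtitles describe the spoken dialogue",
    "a frame shows a city street at night",
]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

# Terms unique to the second document (e.g. "dialogue") receive higher weights
# than terms shared with other documents (e.g. "subtitles", "the").
for term, w in sorted(zip(tfidf.get_feature_names_out(), weights[1]), key=lambda p: -p[1]):
    if w > 0:
        print(f"{term}: {w:.2f}")
```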
The next method is Word2Vec. This utilizes neural networks to acquire distributed
word representations by analysing their co-occurrence patterns over an extensive
text corpus.
By encoding words as vectors and grouping words with similar meanings in vector
space, this method captures the semantic links between words.
However, Word2Vec has trouble producing embeddings for words that were not in the training corpus, owing to out-of-vocabulary (OOV) issues.
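A minimal Word2Vec sketch, assuming gensim 4.x and a toy corpus, illustrating both the co-occurrence-based similarity and the OOV failure mode described above:

```python
# Minimal Word2Vec sketch with gensim; the tiny corpus is illustrative only.
from gensim.models import Word2Vec

sentences = [["subtitle", "text", "appears", "on", "screen"],
             ["caption", "text", "appears", "below", "frame"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv.most_similar("subtitle", topn=2))   # neighbours learned from co-occurrence
# model.wv["unseen_word"]  # would raise KeyError: the classic out-of-vocabulary failure
```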
An expansion of the well-known word embedding method word2vec is
Doc2Vec.
Doc2Vec is intended to produce fixed-size vectors for complete texts or phrases.
Doc2Vec learns vector representations for complete texts while retaining the
semantic meaning of words within the context of the document.
This is achieved by using word embeddings in conjunction with document-level
data during the training phase.
However, learning meaningful document representations successfully requires a
significant quantity of training data, which may be difficult for smaller datasets.
One more powerful technique for converting text to numerical text is the Bidirectional Encoder
Representations from Transformers (BERT).
Instead of depending on fixed-size word embeddings, as classic models such as bag of words, TF-IDF, Word2Vec, and Doc2Vec do, BERT uses a transformer architecture to extract bidirectional context information from the full input text.
This enables BERT to produce word embeddings that are aware of the context and dynamically adjust
to the words that surround them in a sentence or page.
BERT creates dense contextual embeddings that retain not only the meaning of individual words, but
also their relationships with surrounding words.
BERT can construct contextual embeddings for unseen words depending on their surrounding context,
which makes it more effective than Word2Vec and Doc2Vec in handling out-of-vocabulary (OOV)
terms.
BERT is more resistant to changes in vocabulary and terminology specific to a given field. These
advantages motivated us to use the BERT base to convert text into contextual embedding.
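A small sketch of how BERT's sub-word (WordPiece) tokenization handles words that are not in the vocabulary, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the example phrase is arbitrary:

```python
# Sub-word tokenization sketch with Hugging Face transformers.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A word missing from the vocabulary is split into known sub-word pieces
# (roughly along the lines of 'video', '##con', ...) rather than being dropped,
# so BERT can still build a contextual embedding for it.
print(tokenizer.tokenize("videoconferencing subtitles"))
```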
Why Deep Learning?
A better feature extraction technique is important to obtain the best model.
Deep Learning can automatically learn hierarchical data representations from raw inputs.
In contrast to manually engineered features that require domain knowledge, deep learning
algorithms can automatically extract pertinent features from the data, thereby reducing labor
costs and increasing process efficiency.
Furthermore, complex patterns and relationships in the data can be captured by deep learning
models, even in high-dimensional spaces, which may be difficult using traditional feature
extraction methods.
Furthermore, deep learning models can handle large-scale feature extraction tasks quickly by
utilizing distributed training methods and parallel computing architectures, which enables them
to scale effectively as dataset sizes continue to increase.
These best features motivated us to use deep learning for feature extraction in this study.
Why CNN?
Among the many deep learning techniques, Convolutional Neural Networks (CNN) have many
advantages over other deep learning techniques.
CNNs are designed for the purpose of extracting hierarchical spatial characteristics from pictures.
They can identify complex patterns inside video frames, from simple features, such as edges and
textures, to intricate structures, such as object sections and scene compositions, owing to this
hierarchical representation.
CNNs are ideally suited for tasks such as object identification and recognition in dynamic video
situations owing to this feature.
Ultimately, CNNs are a strong option for feature extraction from video frames.
The output of the CNN, which is a feature map, and the output of BERT, which is a contextual embedding of the words, are given to the LSTM.
Why LSTM?
LSTMs are specifically designed to capture and utilize temporal dependencies in
sequential data.
In the context of subtitle generation, this means they can effectively understand the
sequence and timing of video frames and textual elements, leading to more accurate
and contextually appropriate subtitles.
Other deep learning techniques, like traditional feedforward neural networks, lack this
capability to retain information over long sequences.
Traditional Recurrent Neural Networks (RNNs) often suffer from the vanishing gradient
problem, where gradients diminish exponentially during backpropagation through time,
making it difficult to learn long-range dependencies.
LSTMs overcome this issue with their unique cell state and gating mechanisms (input,
output, and forget gates), allowing them to maintain and propagate gradients effectively
over long sequences.
This results in better learning and performance for tasks requiring understanding of
long-term dependencies, such as subtitle generation.
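For reference, the standard LSTM gate equations behind this mechanism, where \sigma is the sigmoid function and \odot denotes element-wise multiplication:

```latex
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```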
LSTMs can selectively remember or forget information through their gating mechanisms,
enabling them to maintain relevant contextual information over time.
This capability is particularly advantageous for subtitle generation, where the context of
previous frames and recognized text can significantly influence the accuracy of the
generated subtitles.
Other models like Convolutional Neural Networks (CNNs) or standard RNNs may not be
as effective in managing and utilizing this contextual information over long video
sequences.
PROPOSED METHOD
The end-to-end system shows the training model, which computes contextual embeddings of the subtitles extracted from the video using BERT and extracts image features from the video frames using the CNN.
The model combines the outputs of BERT and the CNN using an LSTM. The LSTM learns the relationship between the image and the subtitle, which, later in the testing phase, allows it to produce subtitles appropriate to each video frame.
Subtitle detection:
Subtitle detection is an essential step in the process of recognizing video
subtitles
It makes sense to use the OCR approach for these subtitles. However, in
reality, it is difficult to obtain subtitles of the same font, size, and color.
OCR-based subtitle identification may encounter issues, such as incorrect
text recognition, particularly when dealing with complex fonts or
backgrounds.
On the other hand, Tesseract OCR provides better flexibility for various text
formats and styles, which improves its capacity to recognize subtitles with
accuracy.
It can recognize and understand a wide range of font styles, sizes, and even
unusual text orientations owing to its strong training capabilities.
Tesseract OCR was used to process the video frame and
conduct text recognition.
The algorithm usually moves on to the next frame in the video
sequence if Tesseract OCR is unable to identify any subtitles in
the current video frame.
This process was repeated until the video ended.
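A minimal sketch of this frame-by-frame loop, assuming OpenCV and the pytesseract wrapper; the video path and the handling of detected text are placeholders rather than the exact implementation used in this work:

```python
# Frame-by-frame subtitle detection sketch with OpenCV and pytesseract.
import cv2
import pytesseract

cap = cv2.VideoCapture("input_video.mp4")     # placeholder video path
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:                                # the video has ended
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    text = pytesseract.image_to_string(rgb).strip()
    if text:                                  # subtitle found in this frame
        print(f"frame {frame_idx}: {text}")
    # otherwise simply move on to the next frame in the sequence
    frame_idx += 1
cap.release()
```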
After Tesseract OCR (TOCR) successfully finds subtitles in the first step of the process, the pipeline continues with a "Text Segmentation" or "Foreground Extraction" step to extract only the bare text, without any background, from the TOCR output.
Initially, the TOCR output image is binarized, producing a binary image in which black pixels represent the background and white pixels represent the text.
The connected patches of foreground pixels in the binary image that correspond to
individual characters or text components are then determined using Connected
Component Analysis (CCA).
Finally, the binary image is used to mask the original OCR output image, extracting the foreground regions that correspond to the text and rejecting the background regions.
This leaves only the text, isolated from the background.
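A minimal sketch of this binarization, connected-component analysis, and masking sequence, assuming OpenCV; the Otsu threshold and the component-size filter are illustrative choices, not values from this work:

```python
# Foreground (text) extraction sketch: binarize, run connected-component analysis,
# then mask the original frame so that only text pixels remain.
import cv2
import numpy as np

frame = cv2.imread("frame_with_subtitle.png")            # placeholder frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Binarize (assuming light text on a darker background): text becomes white foreground.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Connected-component analysis: keep components whose size is plausible for characters.
num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
mask = np.zeros_like(binary)
for label in range(1, num_labels):                        # label 0 is the background
    area = stats[label, cv2.CC_STAT_AREA]
    if 20 < area < 5000:                                  # illustrative size filter
        mask[labels == label] = 255

# Apply the mask: background regions are rejected, only text pixels survive.
text_only = cv2.bitwise_and(frame, frame, mask=mask)
cv2.imwrite("text_only.png", text_only)
```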
This post-processing step after Tesseract OCR provides an efficient way to separate plain text from background objects in an image frame.
This guarantees that only the text content is maintained by
methodically examining and processing OCR output images,
thereby improving readability and clarity.
Moreover, removing the background noise increases the accuracy
of later analytic methods.
This method maximizes the extraction of valuable textual content,
making it more effective for a variety of applications
Contextual embeddings of recognized text
BERT's bidirectional transformer design, which considers the entire context of a word within a sentence, is used for contextual embedding.
Using Sub-word tokenization, it effectively handles uncommon or unseen terms while
capturing deep semantic links between words.
Furthermore, BERT can learn intricate linguistic patterns owing to its pretraining on large corpora, which makes it very successful for a variety of natural language processing tasks without relying heavily on manually created features or predefined vocabularies.
The position and context of a word within a sentence are represented by the BERT model.
Using a trained model, BERT converts text data into fixed feature vectors as part of a feature-based process. BERT can generate vector representations that consider the context and position of words within a sentence.
Text identified by Tesseract OCR is pre-processed to remove noise, fix punctuation, and adjust formatting, improving the input quality for BERT.
To make further processing easier, the pre-processed text is then tokenized
into smaller chunks, such as words or sub-words.
Tokenization guarantees that each linguistic component is divided into
unique tokens.
The BERT model was then fed with these tokens to begin the encoding
process.
By considering the bidirectional context of every token in the text, BERT
creates contextual embeddings using a transformer-based architecture.
Through this procedure, BERT can better comprehend the semantics of the
recognized text by capturing complex contextual information.
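A minimal sketch of this tokenization and encoding step, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the OCR text and the pooling choice are placeholders:

```python
# Contextual embedding sketch: tokenize the recognized OCR text and encode it with BERT.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

ocr_text = "recognized subtitle text from the frame"            # placeholder OCR output
inputs = tokenizer(ocr_text, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state        # (1, num_tokens, 768) contextual vectors
sentence_embedding = token_embeddings.mean(dim=1)   # one possible fixed-size representation
```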
The encoder and the transformer layers are the two main parts of the bidirectional transformer-based neural network architecture that constitutes BERT.
By processing the input tokens in a bidirectional fashion, the encoder part of the
BERT can extract contextual information from both the previous and subsequent
tokens.
Furthermore, BERT uses several layers of transformers, each comprising feed-
forward neural networks and self-attention processes.
By performing hierarchical feature extraction, these transformer layers gradually improve the input token representations by considering their contextual relationships.
Finally, this stage returns the contextual embeddings of the input text. These embeddings are representations of each token in the input sequence that take the surrounding context into account.
Feature extraction
Feature extraction is performed with Convolutional Neural Networks (CNNs), which automatically generate hierarchical representations of image attributes from raw pixel data.
CNNs do not require expert feature engineers because they can adaptively learn features from the
data itself.
CNNs can thus effectively represent semantic notions in images.
CNNs, on the other hand, are particularly effective at deriving hierarchical feature representations
from raw pixel data.
This motivated us to use CNN for feature extraction.
We resized the input images to 224 x 224 pixels.
The CNN applies two convolutional layers, the first with 32 filters and the second with 64 filters, each using a set of learnable 3 x 3 kernels to convolve over the input image, with the stride set to 1.
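A minimal Keras sketch of the feature extractor as described on this slide (224 x 224 input, two convolutional layers with 32 and 64 filters, 3 x 3 kernels, stride 1); the activation, padding, and pooling choices are assumptions:

```python
# CNN feature-extractor sketch (Keras); details beyond the stated layer sizes are assumed.
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, kernel_size=(3, 3), strides=1, activation="relu", padding="same"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), strides=1, activation="relu", padding="same"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),                      # one image feature vector per frame
])
cnn.summary()
```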
Sequential Dependency
The objective of the LSTM-based architecture created for this purpose is to
efficiently capture the interaction between the contextual data encoded in
subtitles and the visual elements captured from video frames.
The model uses a dual-input technique to accommodate both the word
embedding sequence and the image feature vector sequence in order to
accomplish this.
These two streams of data are combined at each time step so that the LSTM
can process textual and visual cues simultaneously.
Through the integration of these inputs, LSTM has the ability to identify
complex correlations between textual descriptions and visual material,
thus enabling a thorough comprehension of the link between both types
of information.
Through iterative training, the model enhances its capacity to identify
subtle semantic linkages by capitalizing on the complementary attributes
of both visual and textual data.
A combination of text data converted by BERT and picture feature vectors
from a CNN was used to create numerical representations for the LSTM
input.
LSTM is fed with this concatenated vector, which represents the textual
information as well as the context of the image.
To capture temporal correlations between frames and textual descriptions,
the LSTM is trained to comprehend both the backdrop image and the
sentence in its entirety.
When the video data were supplied without corresponding captions during
the testing phase, the focus shifted mainly to the visual component.
This stage involves running each video frame using a CNN to extract
features and record visual data.
The concatenated features are fed into the LSTM, which uses them together with the comprehension of the combined image-text context that it acquired during training.
Even though there is no textual input during testing, the LSTM applies what it has learned to produce subtitles for every frame based only on the visual data acquired by the CNN.
Layer                        Filters                  Kernel size
First Convolutional Layer    64                       3x3
Max pooling                  -                        2x2
Second Convolutional Layer   128                      3x3
Max pooling                  -                        2x2
Third Convolutional Layer    256                      3x3
Max pooling                  -                        2x2
LSTM                         256 units                -
First Dense Layer            128, ReLU activation     -
Second Dense Layer           Size of vocabulary       -
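A minimal Keras sketch of how the concatenated CNN features and BERT embeddings could feed the LSTM and dense layers listed above; the sequence length, feature dimensions, and vocabulary size are placeholders, and the single-word softmax output is one possible reading of the table:

```python
# Fusion-model sketch (Keras functional API): per-time-step CNN feature vectors are
# concatenated with BERT embeddings and passed through LSTM(256) -> Dense(128) -> softmax.
from tensorflow.keras import layers, models

SEQ_LEN = 20          # assumed number of time steps (frames / tokens)
IMG_FEAT_DIM = 256    # assumed dimensionality of the per-frame CNN feature vector
TEXT_DIM = 768        # BERT-base hidden size
VOCAB_SIZE = 10000    # placeholder vocabulary size

image_feats = layers.Input(shape=(SEQ_LEN, IMG_FEAT_DIM), name="cnn_features")
text_embeds = layers.Input(shape=(SEQ_LEN, TEXT_DIM), name="bert_embeddings")

fused = layers.Concatenate(axis=-1)([image_feats, text_embeds])   # visual + textual cues
hidden = layers.LSTM(256)(fused)                                  # 256 LSTM units, as in the table
hidden = layers.Dense(128, activation="relu")(hidden)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)  # word prediction over the vocabulary

model = models.Model([image_feats, text_embeds], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```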
EXPERIMENTAL RESULTS - Feature Selection
Accuracy vs Execution Time vs Size of Convolutional Layer
Accuracy vs Execution Time vs Size of LSTM
Time Complexity Comparison
Accuracy Comparison
Recall Comparison
Precision Comparison
Conclusion
The proposed method uses several steps to create a strong connection between the video
frames and subtitles.
Through the use of Tesseract for Optical Character Recognition (OCR), BERT for
advanced language modelling, and a Convolutional Neural Network (CNN) for feature
extraction, the model is able to extract textual and visual information from video frames
and subtitles.
Moreover, sequential dependency learning enabled by Encoder and Decoder Long Short-
Term Memory (LSTM) networks improves the comprehension of contextual connections
between the associated text and visual content.
In the testing phase, the model showed promising skills in producing precise subtitles for
specified video frames after undergoing significant experimentation and training.
According to recent observations, the low contrast between text
and background, varying fonts, and diverse backgrounds pose
major obstacles to video subtitle identification.
Our results unambiguously show that the accuracy of subtitle
recognition is significantly increased by integrating OCR,
contextual embedding via BERT, and temporal dependencies via
CNN-LSTM.
The TED Talks and Open Images datasets were used.
Thank You