
Image-to-Text Intelligence: A Deep Learning

Framework for Dynamic Caption Generation


Dr. Bolla Jhansi Vazram1, Saiprathap Tedla2,
Jashwanth Raj Kottapalli2, Riyaz Shaik2,
Gopi Krishna Guntupalli2
1* Department of IT & CSE(AI), Narasaraopeta Engineering College,
Narasaraopet, Palnadu, 522601, Andhra Pradesh, India.
2 Department of IT & CSE(AI), Narasaraopeta Engineering College,
Narasaraopet, Palnadu, 522601, Andhra Pradesh, India.

Contributing authors: [email protected];
[email protected]; [email protected];
[email protected]; [email protected];

Abstract
The growing need for machines to understand and explain pictures makes image
caption generation an important problem in computer vision and natural language
processing research. In this study, the problem considered is that of automatically
generating useful captions for images by identifying objects, their relationships,
and the context of the image. Current methods often have difficulty producing varied,
detailed, and relevant captions, especially in complicated scenes. This study aims to
build a fast image caption generation system that uses CNNs to extract features from
images and LSTM networks to describe them. The system has been trained on the Flickr8k
dataset with transfer learning from the VGG-16 model to improve performance and save
training time. Performance metrics such as BLEU scores are used to check how good the
generated captions are, and the generated captions compare well with the reference
captions. Such studies help to connect visual understanding with textual understanding.
Potential applications include devices supporting visually impaired users, automatic
content creation, map and location analysis, and medical diagnostics. This research
uses recent computer vision and deep learning technology to support intelligent systems
that perceive visual information, and it shows the changes deep learning brings to how
machines communicate about and make sense of vision, fostering further innovation in
artificial intelligence.

Keywords: Image caption generation, Deep learning, Convolutional neural networks,
Long short-term memory, Flickr8k dataset, BLEU score evaluation.

1 Introduction
The task of automatically generating captions for images poses a complex challenge at
the intersection of computer vision and natural language processing. The last few years
have seen significant advancements in this field, driven primarily by progress in deep
learning methodologies [1]. Our project introduces an image caption generation system
built on deep learning models, using Convolutional Neural Networks (CNNs) for image
feature extraction and LSTM networks for caption generation. These models can be trained
on large datasets and have repeatedly been shown to understand and describe images
accurately. For our work [2], a pre-trained VGG-16 model, a well-known CNN architecture
that is highly regarded for visual recognition tasks, is leveraged to extract features
from the image. The extracted features serve as input to the LSTM model, which then
writes descriptive text about the image [3].

The training is executed on the Flickr8k dataset. This dataset contains approximately
8,000 images, each paired with several captions, and thus provides an excellent base for
both training and testing. The approach comprises a series of critical steps: data
pre-processing, feature extraction [4], training the LSTM model, and finally testing it.
Caption cleaning and tokenization are done in the preprocessing stage, and start and end
sequences are added so that the LSTM model can learn sentence structure. The VGG-16 model
extracts visual features from images, which are then matched with captions to train the
LSTM network. Hyperparameter tuning plays an important role in optimizing the performance
of the LSTM model [5]; it involves carefully tuning parameters such as the number of
nodes, the layer structure, learning rates, and dropout rates to raise the accuracy of
the captions. BLEU scores evaluate the generated captions and give a quantitative measure
of how well the model-generated captions align with the reference captions in the
dataset. This evaluation quantifies the system's capability to produce meaningful and
contextually apt descriptions.

The application potential of the system is diverse, ranging from accessibility
enhancement for visually impaired people to more efficient image processing in medical,
geospatial, and advertising applications [6]. At the same time, good model performance
does not eliminate remaining problems such as ambiguous images and the creation of more
detailed descriptions in complicated visual scenes [7, 8]. To further improve the
generalization capability of the system, transfer learning is used, adapting the
pre-trained VGG-16 model to the Flickr8k dataset. In a nutshell, this work contributes to
the developing landscape of automated image interpretation by showing how deep learning
models can serve as a bridge between visual content and textual description.

2 Literature Survey
In the era of machine translation, where deep learning has already produced remarkable
results, caption generation is the task in which machines turn images into coherent
textual descriptions. Early designs relied on handcrafted features for text generation,
and such approaches were ultimately limited in both flexibility and performance. The
decisive change came with deep learning, where CNNs and RNNs were used to process images
and automatically generate their captions. An image caption generator developed by
Amritkar and Jabade [9] used CNNs to extract image features while captions were generated
with RNNs. The model exhibited approximately 65% accuracy on benchmark datasets, yet it
produced fairly generic captions in complex scenarios and was not evaluated with more
advanced metrics. In a similar vein, Amirian et al. [10] offer a discussion and appraisal
of the development from rule-based methods to modern deep learning approaches to image
captioning. Their review highlights the role of huge datasets and transfer learning, with
the surveyed experimental work reaching overall model accuracies of 50-60%. However, the
review reports no new experimental results of its own, and the practical limitations it
notes, such as dealing with ambiguous and unseen objects, remain open. Raypurkar et al.
[11] implemented a caption generator using CNN-LSTM with performance evaluation based on
BLEU scores, which indicated about 55% accuracy. The model performed well in describing
objects and their relations but did not generalize to datasets outside its training
domain; in addition, advanced techniques such as attention mechanisms were not used,
leaving room for improvement. The model designed by Kumar et al. [12] was aimed strictly
at object detection and achieved 62% object detection accuracy. It recognizes different
objects contained in an image but does not provide meaningful captions; the ones it
provides are short and to the point. Katpally and Bansal [13] demonstrated ensemble
learning, in which a number of neural networks were combined into a single ensemble to
improve the accuracy of the generated captions. This model surpasses the traditional
models, reaching 65% accuracy, but it introduced computational overhead that makes
scaling difficult, and overfitting problems were not well addressed. This survey shows
that image caption generation through deep learning still faces challenges, including
complex scene handling, sparse captions, and computational limits. Future improvements to
these systems, with a view to applications in areas such as accessibility, healthcare,
and automated content writing, should integrate advanced techniques such as attention
mechanisms, reinforcement learning, and scalable architectures.

3 Proposed Work
3.1 Data Set
The Flickr8k dataset is available on Kaggle. It comprises 8,000 images collected from
Flickr, covering a wide variety of scenes, objects, and activities. Each image is
accompanied by five textual descriptions written by different annotators, which makes the
dataset rich and diverse for training and evaluating models in both the visual
understanding and natural language generation domains.

The images span categories such as nature, urban scenes, people, animals, and objects,
which ensures a good distribution of contextual diversity. All images are in JPEG format
but vary in resolution. The captions are detailed and describe the key elements or
activities within the images, which encourages models that relate visual inputs to
meaningful text outputs.

All images are resized to 224x224 pixels, because most pre-trained deep learning models,
such as VGG-16, expect this default input size. All captions are standardized by
converting them to lowercase, stripping special characters and digits, tokenizing them,
and attaching start markers (startseq) and end markers (endseq). This yields a vocabulary
and structured caption sequences that can be fed directly into neural network training.

The dataset is split into two portions: 7,200 images (90%) for training and 800 images
(10%) for testing, so that the model's performance can be measured on unseen data. The
diversity and completeness of the Flickr8k dataset make it an important benchmark for
further research on image captioning and multimodal machine learning.
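To make the split concrete, the following is a minimal Python sketch, not the paper's
code; the list of image identifiers and the random seed are illustrative assumptions.

import random

def split_dataset(image_ids, train_fraction=0.9, seed=42):
    """Shuffle image ids and split them into training and test portions."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)   # 7,200 of 8,000 images for training
    return ids[:cut], ids[cut:]            # remaining 800 images for testing

# Example usage (ids would normally come from the Flickr8k captions file):
# train_ids, test_ids = split_dataset(all_image_ids)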

3.2 Preprocess Images


3.2.1 Loading Images
Loading the images is the first step. In this process, the image files are converted to a
format that machine learning algorithms can use. Image data is usually stored in formats
such as JPEG or PNG, so it must be converted into numerical arrays of pixel values.
Libraries commonly used for these operations are Pillow (PIL) and OpenCV because of their
efficiency. Pillow makes working with images in RGB format easy, and OpenCV provides many
features, including support for different color spaces. Loading every image in the same
way keeps the inputs consistent with the later processing steps and reduces problems with
the input data.
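A minimal loading sketch with Pillow and NumPy is shown below; the file path is
illustrative and the helper name is not taken from the original implementation.

import numpy as np
from PIL import Image

def load_image(path):
    """Open an image file and return it as an RGB pixel array."""
    img = Image.open(path).convert("RGB")   # JPEG/PNG decoded to RGB
    return np.asarray(img)                  # shape: (height, width, 3), dtype uint8

# pixels = load_image("Images/example_photo.jpg")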

3.2.2 Resizing Images


Real-world datasets contain images of many different sizes, which poses challenges for
training models. This is solved by resizing all images, smaller or larger, to the input
size required by the pre-trained neural network in use. For example, models such as
ResNet and InceptionV3 require input sizes of 224x224 and 299x299 respectively. Resizing
methods such as bilinear and bicubic interpolation are commonly used because they balance
speed and quality. Keeping the dimensions uniform also makes it possible to batch-process
many images at once, with less likelihood of errors that could disrupt model training and
evaluation.
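The following short sketch shows bilinear resizing with Pillow; it assumes Pillow 9.1 or
later for the Image.Resampling constants, and the 224x224 target matches VGG-16.

from PIL import Image

def resize_image(img, size=(224, 224)):
    """Resize a PIL image to the network's expected input size."""
    return img.resize(size, Image.Resampling.BILINEAR)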

3.2.3 Normalizing pixel values


Another key step is scaling pixel values into a range that deep learning models handle
well. Raw pixel values normally lie in the range 0-255; they are shifted either to the
range [0, 1] or to [-1, 1] to improve model performance. This can be done by dividing the
pixel values by 255 or, depending on the design of the model, by applying other
transformations. Normalization helps the optimization algorithms converge faster and
reduces the risk of numerical problems in large networks. It keeps the input within an
appropriate range, so optimization works more effectively and stably.
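A small normalization sketch follows; both scaling conventions mentioned above are shown,
and the function name is illustrative.

import numpy as np

def normalize(pixels, mode="zero_one"):
    """Scale uint8 pixel values into a range suited to deep learning models."""
    x = pixels.astype(np.float32)
    if mode == "zero_one":
        return x / 255.0              # range [0, 1]
    return x / 127.5 - 1.0            # range [-1, 1]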

3.2.4 Data Augmentation


Data augmentation is typically applied as a pre-processing method so that the model
avoids overfitting and generalizes better. It involves creating new versions of the
original images through transformations such as rotations, flips, crops, and color
variations. Typical examples are a random horizontal flip or a small rotation; brightness
and contrast can also be adjusted to cover lighting variations. Augmentation effectively
expands the size and diversity of the dataset without collecting new data, making the
model more robust to unseen data.
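A sketch of such augmentation using Keras' ImageDataGenerator is given below; the
framework choice and the specific transformation ranges are assumptions for illustration,
not the paper's settings.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,           # small random rotations
    horizontal_flip=True,        # random horizontal flips
    width_shift_range=0.1,       # small random shifts/crops
    height_shift_range=0.1,
    brightness_range=(0.8, 1.2)  # mild lighting variation
)

# batches = augmenter.flow(image_array_batch, batch_size=32)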

3.2.5 Extracting Image Features


Image features are extracted with a pre-trained CNN through transfer learning, which
keeps the computation manageable. The final classification layer of the CNN is removed,
and the output of the second-to-last layer is used as the image representation. These
features capture crucial information such as shapes and textures that are needed for
tasks like image captioning. The extracted features are stored for efficient retrieval
during model training, which reduces redundant computation and streamlines the overall
pipeline.
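A minimal Keras sketch of this step is shown below: the pre-trained VGG-16 classification
layer is dropped and the 4,096-dimensional output of the second-to-last (fc2) layer is
kept as the image representation. The file paths and the features dictionary are
illustrative assumptions.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16()                                        # pre-trained on ImageNet
encoder = Model(inputs=base.inputs, outputs=base.layers[-2].output)

def extract_features(image_path):
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))   # VGG-16 preprocessing
    return encoder.predict(x, verbose=0)[0]           # 4,096-d feature vector

# features = {img_id: extract_features(path) for img_id, path in image_paths.items()}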

3.2.6 Mapping Captions to Images


For image captioning tasks, textual descriptions must be paired with their related
images. This means having a structured mapping in which one or more captions are related
to each image. The mapping is normally kept in formats such as CSV files or dictionaries
and can be accessed easily during training. Maintaining accurate and consistent mappings
is essential so that, during training, the model learns how visual inputs correspond to
textual outputs and can use this correspondence as the basis for generating captions.
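As a sketch, the mapping can be built as a Python dictionary from the Flickr8k captions
file; the file name and the tab-separated "image_id#index<TAB>caption" layout are assumed
here and may differ from the authors' setup.

from collections import defaultdict

def load_caption_mapping(captions_file="Flickr8k.token.txt"):
    """Return a dict mapping each image id to its list of captions."""
    mapping = defaultdict(list)
    with open(captions_file, encoding="utf-8") as f:
        for line in f:
            image_id, caption = line.strip().split("\t")
            image_id = image_id.split("#")[0]   # drop the '#0'..'#4' caption index
            mapping[image_id].append(caption)
    return mapping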

3.2.7 Caption Preparation
The text is cleaned so that it is consistent and easy to process when creating captions
with an image-captioning model. All text is converted to lowercase, and unwanted features
such as punctuation, numbers, and special characters are removed so that only meaningful
words are used in the captions. The cleaned text is then converted into numbers by a
tokenizer, which maps each unique word to a specific integer. This process is called
tokenization, and it helps the model work with the text data. For efficiency, the
vocabulary is limited to the most frequent words, for example the top 5,000 or 10,000
words. Special tokens are added to the captions so the model can recognize where each
sentence begins and ends: every caption receives a <start> token at the beginning and an
<end> token at the end, which helps the model generate complete sentences during
prediction. Because captions vary in length, shorter captions are zero-padded to a common
length so that the data can be processed in batches without problems. Together, these
steps align and standardize the captions, making them ready for training an image
captioning model.
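The sketch below illustrates these steps with the Keras tokenizer; the sample captions,
the 5,000-word vocabulary cap, and the <unk> token are assumptions for illustration.

import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative raw captions; in practice these come from the Flickr8k mapping.
raw_captions = ["A dog runs through the grass.", "Two children play in a park!"]

def clean_caption(text):
    """Lowercase, strip punctuation and digits, and add start/end tokens."""
    text = re.sub(r"[^a-z ]", "", text.lower())
    return "<start> " + " ".join(text.split()) + " <end>"

captions = [clean_caption(c) for c in raw_captions]

tokenizer = Tokenizer(num_words=5000, oov_token="<unk>", filters="")
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")  # zero-padded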

3.3 Model Design

Fig. 1 Workflow

This image captioning model uses a CNN to understand pictures and an LSTM network to
generate words; together they form a system that writes captions for pictures by linking
visual and textual information. The CNN looks for relevant information in an input image.
Such tasks generally rely on pre-trained CNN models, for instance InceptionV3 or ResNet,
which are good at detecting and localizing objects and at identifying shapes and patterns
because they have been pre-trained on massive datasets such as ImageNet. After the CNN's
classification layer is stripped off, the remaining convolutional base captures the
essential features. These features encode important information about the image, such as
shape, texture, and objects, and are fed into the captioning model.

Fig. 2 CNN Architecture

Fig. 3 LSTM Architecture

The extracted image features are then passed into the LSTM network to generate the
caption. An LSTM is a type of recurrent neural network that handles sequences such as
text well because it can remember information over long periods. It takes the important
parts of the image together with the associated word data, the captions, and learns how
they are connected. During training, the LSTM receives the image features as input and
constructs captions by predicting the next word given the image and the words generated
so far. Special tokens, such as <start> and <end>, mark where each caption begins and
ends; they help the model learn sentence structure so that it can produce complete and
readable captions at prediction time. In addition, all sequences are padded to the same
length, so captions of varying lengths can be processed together. The combination of CNN
and LSTM forms an effective image description system: the CNN handles the "what" of the
image by encoding what it sees, while the LSTM handles the "how", learning to describe
the image in simple words. Together they allow the model to create helpful captions that
explain what is in an image. Transfer learning lets the model reuse what it has learned
from images, which speeds up training and improves performance. The approach can also
handle relatively complex sentences, adding naturalness and detail to the captions. This
method is powerful, easily extensible, and widely employed in tasks that automatically
generate image captions.
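To make the architecture concrete, the following is a sketch of a standard CNN-LSTM merge
model in Keras; the framework, layer sizes, vocabulary size, and maximum caption length
are illustrative assumptions rather than the paper's exact configuration.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_length = 5000, 35        # assumed values from caption preparation

# Image feature branch (4,096-d VGG-16 fc2 vector)
image_input = Input(shape=(4096,))
fe = Dropout(0.5)(image_input)
fe = Dense(256, activation="relu")(fe)

# Text sequence branch (partially generated caption)
caption_input = Input(shape=(max_length,))
se = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
se = Dropout(0.5)(se)
se = LSTM(256)(se)

# Decoder: merge both branches and predict the next word over the vocabulary
decoder = add([fe, se])
decoder = Dense(256, activation="relu")(decoder)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")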

4 Result and Discussion


The experiments were carried out on the Flickr8k dataset, which comprises 8,000 images,
each with five descriptive captions. The system uses a pre-trained VGG-16 model for
feature extraction and an LSTM-based decoder for caption generation. The experimental
results are discussed below. The model achieved a training accuracy of 75%, showing that
it was quite effective at learning to create captions from the data. For evaluation, BLEU
scores were used; they are widely used in natural language processing to check how
similar generated captions are to the reference captions. The results are given in
Table 1.

Fig. 4 Accuracy Table

Fig. 5 Training and Validation Loss
Fig. 6 Training and Validation Accuracy

The model did a very good job in captioning, with high unigram accuracy, meaning it is
very good at detecting objects within an image and naming them. As the n-gram order
increased, BLEU scores decreased, because it is harder to capture word context and order
when describing an image. These results could be improved further with an attention
mechanism or a much larger dataset.
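The BLEU evaluation can be sketched as follows with NLTK's corpus_bleu; the use of NLTK
and the example tokens are assumptions for illustration, with each generated caption
compared against the reference captions of its image.

from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "dog", "runs", "in", "the", "grass"],
               ["a", "brown", "dog", "is", "running", "outside"]]]
hypotheses = [["a", "dog", "is", "running", "in", "the", "grass"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
print(f"BLEU-1: {bleu1:.3f}, BLEU-2: {bleu2:.3f}")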

5 Conclusion
This research shows that the proposed image caption generation system works well. It uses
CNNs for feature extraction and LSTMs for caption creation. The model was trained on
Flickr8k, reached a training accuracy of 75%, and was capable of producing useful
captions that align with the reference descriptions. Transfer learning with VGG-16
improved both efficiency and performance. The system was good at object recognition and
captioning, but it still has problems with complex and unclear scenes. Future work can
look into attention mechanisms and bigger datasets to make captions more diverse and to
improve understanding of context. This study shows the promise of deep learning in
connecting visual and textual content for use in accessibility and automated systems.

6 Future Work
This image captioning system has tremendous potential for improvement. Adding an
attention mechanism would let the model focus on the most relevant parts of the image,
which may make the captions considerably more accurate. Expanding the dataset with a
wider variety of images and captions would allow the system to work well in new and
different situations. Transformer-based models such as Vision Transformers (ViTs) or BERT
may also improve performance by capturing long-range dependencies in both images and
text. The system could be applied in real time for purposes such as visual aids for the
visually impaired, automatic content generation for social media, and diagnostic medical
image analysis. As research continues, combining deep learning with other techniques will
keep improving image captioning systems, making them stronger and faster.

References
[1] Aote, Shailendra S. ”Image Caption Generation using Deep Learning Technique.”
Journal of Algebraic Statistics 13, no. 3 (2022): 2260-2267.

[2] Yeshasvi, Mogula, and T. Subetha. ”Image Caption Generator Using Machine
Learning and Deep Neural Networks.” In Advances in Intelligent Computing and
Communication: Proceedings of ICAC 2021, pp. 137-144. Singapore: Springer
Nature Singapore, 2022.

[3] Shinde, Omkar Nitin, Rishikesh Gawde, and Anurag Paradkar. ”Social media
image caption generation using deep learning.” International Journal of Engineer-
ing Development and Research 8, no. 4 (2020): 222-228.

[4] Chaithra, V., DK Charitra Rao, and N. Jagadisha. ”Image caption generator
using deep learning.” International Journal of Engineering Applied Sciences and
Technology 7, no. 2 (2022): 289-293.

[5] Kamal, Abrar Hasin, Md Asifuzzaman Jishan, and Nafees Mansoor. ”Textmage:
The automated bangla caption generator based on deep learning.” In 2020 Inter-
national Conference on Decision Aid Sciences and Application (DASA), pp.
822-826. IEEE, 2020.

[6] He, Xiaodong, and Li Deng. ”Deep learning for image-to-text generation: A
technical overview.” IEEE Signal Processing Magazine 34, no. 6 (2017): 109-116.

[7] Kabra, Palak, Mihir Gharat, Dhiraj Jha, and Shailesh Sangle. ”Image Caption
Generator Using Deep Learning.” Published in International Journal for Research
in Applied Science & Engineering Technology (IJRASET) 10, no. X (2022).

[8] Mitra, Debasree, Pranati Rakshit, Tarak Shaw, Sourav Mandal, Sudipta Ghosh,
and Swapnadip Guha. ”Image Caption Generator Through Deep Learning.” In
International Conference on Communication, Devices and Networking, pp. 395-
403. Singapore: Springer Nature Singapore, 2022.

[9] Amritkar, Chetan, and Vaishali Jabade. ”Image caption generation using deep
learning technique.” In 2018 fourth international conference on computing
communication control and automation (ICCUBEA), pp. 1-4. IEEE, 2018.

[10] Amirian, Soheyla, Khaled Rasheed, Thiab R. Taha, and Hamid R. Arabnia.
”A short review on image caption generation with deep learning.” In Proceed-
ings of the International Conference on Image Processing, Computer Vision, and
Pattern Recognition (IPCV), pp. 10-18. The Steering Committee of The World
Congress in Computer Science, Computer Engineering and Applied Computing
(WorldComp), 2019.

[11] Raypurkar, Manish, Abhishek Supe, Pratik Bhumkar, Pravin Borse, and Shab-
nam Sayyad. ”Deep learning based image caption generator.” International
Research Journal of Engineering and Technology (IRJET) 8, no. 03 (2021).

[12] Kumar, N. Komal, D. Vigneswari, A. Mohan, K. Laxman, and J. Yuvaraj. "Detection
and recognition of objects in image caption generator system: A deep learning
approach." In 2019 5th International Conference on Advanced Computing &
Communication Systems (ICACCS), pp. 107-109. IEEE, 2019.

[13] Katpally, Harshitha, and Ajay Bansal. ”Ensemble learning on deep neural net-
works for image caption generation.” In 2020 IEEE 14th international conference
on semantic computing (ICSC), pp. 61-68. IEEE, 2020.
