Abstract
The growing need for machines to understand and describe images makes image
caption generation an important problem in computer vision and natural language
processing research. This study addresses the task of automatically generating
useful captions for images by identifying objects, their relationships, and the
surrounding context. Existing methods often struggle to produce detailed, varied,
and relevant captions, especially for complex scenes. This study builds an image
caption generation system that uses Convolutional Neural Networks (CNNs) for
image feature extraction and Long Short-Term Memory (LSTM) networks for caption
generation. The model is trained on the Flickr8k dataset, using transfer learning
with a pre-trained VGG-16 model to improve performance and reduce training time.
Performance metrics such as BLEU scores are used to assess the quality of the
generated captions, which compare favourably with the human-written reference
captions. The work contributes to bridging visual understanding and textual
understanding. Potential applications include assistive devices for visually
impaired users, automatic content creation, analysis of maps and locations, and
medical diagnostics. By applying recent computer vision and deep learning
techniques, this research supports intelligent systems that perceive visual
information, and it illustrates how deep learning changes the way machines
communicate about and make sense of vision, fostering further innovation in
artificial intelligence.
Keywords: Image caption generation, Deep learning, Convolutional neural networks,
Long short-term memory, Flickr8k dataset, BLEU score evaluation.
1 Introduction
The task of automatically generating captions for images poses a complex chal-
lenge at the intersection of computer vision and natural language processing. The
last few years have seen significant advances in this field, driven primarily by
progress in deep learning methodologies [1]. This project presents an image caption
generation system built on deep learning models, using Convolutional Neural
Networks (CNNs) for image feature extraction and LSTM networks for caption
generation. Such models can be trained on large datasets and have repeatedly been
shown to understand and describe images with good accuracy. In this work [2], a
pre-trained VGG-16 model, a well-known CNN architecture widely used in visual
recognition tasks, is leveraged to extract features from each image. The extracted
features serve as input to the LSTM model, which then produces descriptive text
for the image [3]. Training is carried out on the Flickr8k dataset, which contains
approximately 8,000 images, each paired with several captions, and thus provides a
solid basis for both training and testing. The approach comprises a series of critical
steps: data pre-processing, feature extraction [4], training the LSTM model, and
finally testing. In the pre-processing stage, captions are cleaned and tokenized, and
start and end sequence markers are added so that the LSTM model can learn
sentence structure. The VGG-16 model extracts visual features from the images,
which are then paired with the captions to train the LSTM network. Hyperparameter
tuning plays an important role in optimizing the performance of the LSTM model
[5]; parameters such as the number of nodes, the layer structure, learning rates, and
dropout rates are carefully adjusted to improve caption accuracy. BLEU scores are
used to evaluate the generated captions, giving a quantitative measure of how well
the model-generated captions align with the reference captions in the dataset. This
evaluation quantifies the system's ability to produce meaningful and contextually
appropriate descriptions. The potential applications of the system are diverse,
ranging from accessibility enhancement for visually impaired people to more
efficient image processing in medical, geospatial, and advertising applications [6].
Even with good model performance, challenges remain, such as resolving ambiguous
images and producing more detailed descriptions of complex visual scenes [7,8]. To
further improve the generalization capability of the system, transfer learning is used
by adapting the pre-trained VGG-16 model to the Flickr8k dataset. In summary,
this work contributes to the developing landscape of automated image interpretation
by showing how deep learning models can bridge visual content and textual
description.
2 Literature Survey
With deep learning already producing remarkable results in machine translation,
caption generation has emerged as the task in which machines turn images into
coherent textual descriptions. Early designs relied on handcrafted features for text
generation, and such approaches were ultimately limited in both flexibility and
performance. The decisive shift came with deep learning, where CNNs and RNNs
were used to process images and automatically generate their captions. An image
caption generator developed by Amritkar and Jabade [9] used CNNs to extract
image features while captions were generated with RNNs. The model achieved
approximately 65% accuracy on benchmark datasets; however, it produced fairly
generic captions with little diversity in complex scenarios and did not evaluate
results with more advanced metrics. Similarly, Amirian et al. [10] offer a thorough
appraisal of the development of image captioning from rule-based methods to
modern deep learning approaches. Their review highlights the role of large datasets
and transfer learning, with the surveyed experimental work achieving overall model
accuracies of 50-60%. However, the review reports no original results of its own, and
the practical limitations it identifies, such as dealing with ambiguous and unseen
objects, remain open. Raypurkar et al. [11] implemented a caption generator using a
CNN-LSTM architecture with performance evaluated by BLEU scores, which
indicated about 55% accuracy. The model performed well in describing objects and
their relations but did not generalize to datasets outside its training domain;
moreover, advanced techniques such as attention mechanisms were not used, leaving
room for improvement. The model designed by Kumar et al. [12] was aimed strictly
at object detection and achieved 62% object detection accuracy. It recognizes the
different objects contained in an image but does not provide meaningful captions;
the captions it does produce are short and literal. Katpally and Bansal [13]
demonstrated ensemble learning, in which several neural networks were combined
into a single ensemble to improve the accuracy of the generated captions. This
model clearly surpassed the traditional baseline, reaching 65% accuracy, but the
ensemble introduced computational overhead that makes scaling difficult, and
overfitting was not well addressed. The current state of the field shows that image
caption generation through deep learning still faces several challenges, including
handling complex scenes, sparse captions, and computational limits. To improve the
applicability of these systems in areas such as accessibility, healthcare, and
automated content writing, future work should integrate advanced techniques such
as attention mechanisms, reinforcement learning, and scalable architectures.
3 Proposed Work
3.1 Data Set
The Flickr8k dataset is available on Kaggle. It comprises 8,000 images collected
from Flickr, covering a wide variety of scenes, objects, and activities. Each image is
paired with textual descriptions contributed by five annotators, reflecting the
diversity illustrated in the image. This makes the dataset rich and varied for
training and subsequently evaluating models in both the visual understanding and
natural language generation domains.
The images span diverse categories such as natural scenes, urban scenes, people,
animals, and objects, which ensures a good distribution of contextual diversity. All
images are in JPEG format but vary in resolution. The captions are detailed,
elaborating on the key elements or activities within each image, which helps train
models that relate visual inputs to meaningful text outputs.
All images are resized to 224x224 pixels, primarily because most pre-trained deep
learning models such as VGG-16 expect a default input size of 224x224. All captions
are standardized by converting them to lowercase, stripping out special characters
and digits, tokenizing them, and attaching a start marker (startseq) and an end
marker (endseq). This yields a vocabulary and caption sequences in the form of
structured data that can be fed directly into neural network training.
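For illustration, a minimal preprocessing sketch along these lines is shown below,
using standard Keras utilities; the helper names (load_image_for_vgg16,
clean_caption) are illustrative assumptions and not taken from the original
implementation.

```python
# Minimal preprocessing sketch: resize images to 224x224 for VGG-16 and
# normalize caption text (illustrative helpers, not the paper's code).
import re
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.vgg16 import preprocess_input

def load_image_for_vgg16(path):
    """Load a JPEG, resize to 224x224 and apply VGG-16 preprocessing."""
    img = load_img(path, target_size=(224, 224))
    arr = img_to_array(img)                  # shape (224, 224, 3)
    return preprocess_input(arr)             # channel preprocessing expected by VGG-16

def clean_caption(caption):
    """Lowercase, drop digits/punctuation, add start/end markers."""
    caption = caption.lower()
    caption = re.sub(r"[^a-z ]", " ", caption)          # strip special characters
    words = [w for w in caption.split() if len(w) > 1]  # drop stray single letters
    return "startseq " + " ".join(words) + " endseq"
```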
The dataset is divided into two portions: 7,200 images (90%) for training and 800
images (10%) for testing, so that the model's performance is evaluated on unseen
data and its robustness can be assessed. The diversity and completeness of the
Flickr8k dataset make it an important benchmark for further research in image
captioning and multimodal machine learning.
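A per-image split of this kind can be sketched as follows; split_dataset is a
hypothetical helper, and splitting by image keeps all five captions of an image on
the same side of the split.

```python
# Illustrative 90/10 train/test split by image ID for Flickr8k.
import random
from typing import Dict, List, Tuple

def split_dataset(captions: Dict[str, List[str]], train_frac: float = 0.9,
                  seed: int = 42) -> Tuple[List[str], List[str]]:
    """Split image IDs into train/test lists (per image, not per caption)."""
    ids = sorted(captions)            # captions maps image id -> its 5 captions
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    cut = int(train_frac * len(ids))  # 7,200 train / 800 test for 8,000 images
    return ids[:cut], ids[cut:]
```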
Resizing all images to the same dimensions also facilitates batch processing of many
images simultaneously, with less likelihood of errors that could disrupt model
training and evaluation.
3.2.7 Caption Preparation
The text is cleaned so that it is consistent and easy to process for creating captions
with an image-captioning model. All text is converted to lowercase, and unwanted
features such as punctuation, numbers, and special characters are removed, so that
only meaningful words are used when creating captions. The cleaned text is then
converted to numbers by a tokenizer, which assigns each unique word a specific
integer. This process, called tokenization, helps the model work with the text data
more effectively. For efficiency, the vocabulary is limited to the most frequently used
words, for example the top 5,000 or 10,000 words. Special tokens are added to the
captions so the model can learn where each sentence begins and ends: every caption
receives a <start> token at the beginning and an <end> token at the end. This
helps the model generate complete sentences during prediction. Because captions
vary in length, shorter captions are zero-padded to a uniform length so that the
data can be processed without problems on the computational units. These steps
normalize the captions and make them ready for training an image captioning model.
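The tokenization and padding step might look as follows in Keras. The vocabulary
limit of 5,000 words is one of the example sizes mentioned above, and the
startseq/endseq markers from Section 3.1 are used in place of <start>/<end> so
that the default Keras text filters do not strip them; both choices are illustrative
assumptions.

```python
# Sketch of caption tokenization and zero-padding with Keras.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    "startseq a dog runs through the grass endseq",
    "startseq two children play on the beach endseq",
]

tokenizer = Tokenizer(num_words=5000, oov_token="<unk>")  # keep top 5,000 words
tokenizer.fit_on_texts(captions)

sequences = tokenizer.texts_to_sequences(captions)         # words -> integer ids
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")  # zero-pad
print(padded.shape)   # (num_captions, max_len)
```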
This image captioning model uses a CNN to understand the picture and an LSTM
network to generate the words; together they form a system that writes captions for
pictures by linking visual and textual information. The CNN extracts relevant
information from the input image. Such tasks generally rely on pre-trained CNNs,
for instance VGG-16, InceptionV3, or ResNet. Because these models are pre-trained
on massive datasets such as ImageNet, they are particularly good at detecting and
localizing objects and at identifying shapes and patterns within images. The
classification layer of the CNN is stripped off, and the remaining convolutional base
captures the essential features. These features summarize important information
about the image, such as shape, texture, and objects, and are fed into the captioning
model.
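As a sketch, the classification head can be removed from a pre-trained VGG-16 and
the fc2 activations kept as image features; using the 4,096-dimensional fc2 layer is a
common convention and an assumption here, since the exact layer is not stated.

```python
# Feature-extraction sketch: keep VGG-16 up to the fc2 layer, drop the softmax.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")                       # full model incl. classifier
encoder = Model(inputs=base.input,
                outputs=base.get_layer("fc2").output)  # strip the prediction layer

def extract_features(image_batch):
    """image_batch: float array of shape (n, 224, 224, 3), already preprocessed."""
    return encoder.predict(image_batch, verbose=0)     # shape (n, 4096)

dummy = np.zeros((1, 224, 224, 3), dtype="float32")
print(extract_features(dummy).shape)                   # (1, 4096)
```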
The extracted image features are then passed into the LSTM network to generate
the caption. An LSTM is a type of Recurrent Neural Network that handles
sequences such as text well because it can retain information over long spans. It
takes the important parts of the image, together with the associated word data from
the captions, and learns how they are connected. During caption training, the
LSTM receives the image features as input and constructs captions by predicting
the word that follows, given the image and the words generated so far. Special
tokens such as <start> and <end> mark where each caption begins and ends; these
help the model learn sentence structure and produce complete, readable captions at
prediction time. In addition, all sequences are padded to equal length so that
captions of varying lengths can be handled together. The result is an effective image
description system in which the CNN is used together with the LSTM: the CNN
handles the "what" of the image by encoding what it sees, while the LSTM handles
the "how", learning to describe the image in plain words. Together they allow the
model to create useful captions that explain what is in an image, transfer what has
been learned from images, speed up training, and improve overall performance.
Moreover, more complicated sentences can be processed, which adds naturalness and
detail to the captions. The approach is powerful, easily extensible, and widely
employed in tasks that automatically generate image captions.
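A minimal version of such a CNN-LSTM decoder, in the widely used "merge" form,
is sketched below; the layer sizes, dropout rates, maximum caption length, and
vocabulary limit are illustrative assumptions rather than the reported
hyperparameters.

```python
# Merge-style decoder sketch: a VGG-16 feature vector and a partial caption are
# combined to predict the next word.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 5000     # assumed vocabulary limit
max_len = 35          # assumed maximum caption length

# Image branch: compress the 4,096-dim VGG-16 feature vector.
img_in = Input(shape=(4096,))
img_feat = Dropout(0.5)(img_in)
img_feat = Dense(256, activation="relu")(img_feat)

# Text branch: embed the partial caption and run it through an LSTM.
txt_in = Input(shape=(max_len,))
txt_feat = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_feat = Dropout(0.5)(txt_feat)
txt_feat = LSTM(256)(txt_feat)

# Merge both branches and predict the next word over the vocabulary.
merged = add([img_feat, txt_feat])
merged = Dense(256, activation="relu")(merged)
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```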
Fig. 5 Training and Validation Loss
Fig. 6 Training and Validation Accuracy
The model performed very well in captioning, with high unigram accuracy, meaning
it is good at detecting objects within an image and naming them. As the n-gram
order increased, the BLEU scores dropped, because capturing word context and
order in an image description is harder. These results could be improved further
with an attention mechanism or a considerably larger dataset.
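BLEU scores of this kind can be computed with NLTK as in the following sketch;
the candidate and reference captions are invented examples, not outputs reported in
this work.

```python
# Sketch of BLEU-1 and BLEU-4 evaluation with NLTK.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "on", "grass"]],
]
candidates = [["a", "dog", "is", "running", "in", "the", "grass"]]

smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, candidates,
                    weights=(1.0, 0, 0, 0), smoothing_function=smooth)
bleu4 = corpus_bleu(references, candidates,
                    weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")  # BLEU-1 is typically higher
```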
5 Conclusion
This research shows that an image caption generation system using CNNs for
feature extraction and LSTMs for caption generation works well. The model was
trained on Flickr8k, yielding a training accuracy of 75%, and was able to produce
useful captions aligned with the reference descriptions. Transfer learning with
VGG-16 improved both efficiency and performance. The system was good at object
recognition and captioning, but it still has problems with complex and unclear
scenes. Future work can look into attention mechanisms and larger datasets to make
captions more diverse and improve understanding of context. This study shows the
promise of deep learning in connecting visual and textual content for use in
accessibility and automated systems.
6 Future Work
This image captioning system has considerable potential for future improvement.
Adding an attention mechanism would let the model focus on the most relevant
parts of the image, which may make the captions more accurate. Expanding the
dataset with a wider variety of images and captions would allow the system to work
well in different and new situations. Transformer-based models such as Vision
Transformers (ViTs) or BERT may also improve performance by capturing
long-range dependencies in both images and text. The system could be applied in
real-time settings such as visual aids for the visually impaired, automatic content
generation for social media, and diagnostic medical image analysis. As research
continues, combining deep learning with other techniques will keep improving image
captioning systems, making them stronger and faster.
References
[1] Aote, Shailendra S. ”Image Caption Generation using Deep Learning Technique.”
Journal of Algebraic Statistics 13, no. 3 (2022): 2260-2267.
[2] Yeshasvi, Mogula, and T. Subetha. ”Image Caption Generator Using Machine
Learning and Deep Neural Networks.” In Advances in Intelligent Computing and
Communication: Proceedings of ICAC 2021, pp. 137-144. Singapore: Springer
Nature Singapore, 2022.
[3] Shinde, Omkar Nitin, Rishikesh Gawde, and Anurag Paradkar. ”Social media
image caption generation using deep learning.” International Journal of Engineer-
ing Development and Research 8, no. 4 (2020): 222-228.
[4] Chaithra, V., DK Charitra Rao, and N. Jagadisha. ”Image caption generator
using deep learning.” International Journal of Engineering Applied Sciences and
Technology 7, no. 2 (2022): 289-293.
[5] Kamal, Abrar Hasin, Md Asifuzzaman Jishan, and Nafees Mansoor. ”Textmage:
The automated bangla caption generator based on deep learning.” In 2020 Inter-
national Conference on Decision Aid Sciences and Application (DASA), pp.
822-826. IEEE, 2020.
[6] He, Xiaodong, and Li Deng. ”Deep learning for image-to-text generation: A
technical overview.” IEEE Signal Processing Magazine 34, no. 6 (2017): 109-116.
[7] Kabra, Palak, Mihir Gharat, Dhiraj Jha, and Shailesh Sangle. ”Image Caption
Generator Using Deep Learning.” International Journal for Research in Applied
Science & Engineering Technology (IJRASET) 10, no. X (2022).
[8] Mitra, Debasree, Pranati Rakshit, Tarak Shaw, Sourav Mandal, Sudipta Ghosh,
and Swapnadip Guha. ”Image Caption Generator Through Deep Learning.” In
International Conference on Communication, Devices and Networking, pp. 395-
403. Singapore: Springer Nature Singapore, 2022.
[9] Amritkar, Chetan, and Vaishali Jabade. ”Image caption generation using deep
learning technique.” In 2018 fourth international conference on computing
communication control and automation (ICCUBEA), pp. 1-4. IEEE, 2018.
[10] Amirian, Soheyla, Khaled Rasheed, Thiab R. Taha, and Hamid R. Arabnia.
”A short review on image caption generation with deep learning.” In Proceed-
ings of the International Conference on Image Processing, Computer Vision, and
Pattern Recognition (IPCV), pp. 10-18. The Steering Committee of The World
Congress in Computer Science, Computer Engineering and Applied Computing
(WorldComp), 2019.
[11] Raypurkar, Manish, Abhishek Supe, Pratik Bhumkar, Pravin Borse, and Shab-
nam Sayyad. ”Deep learning based image caption generator.” International
Research Journal of Engineering and Technology (IRJET) 8, no. 03 (2021).
[13] Katpally, Harshitha, and Ajay Bansal. ”Ensemble learning on deep neural net-
works for image caption generation.” In 2020 IEEE 14th international conference
on semantic computing (ICSC), pp. 61-68. IEEE, 2020.