
International Conference on Communication and Signal Processing, July 28 - 30, 2020, India

A Review on Automatic Image Captioning Techniques
K. C. Nithya and V. Vinod Kumar

Abstract—Recent progress in Automatic Image Captioning (AIC) has shown that it is possible to describe the most salient information conveyed by an image with accurate, well-formed sentences. Image captioning is the automatic creation of a textual description of an image, i.e., describing a picture in natural language. The important steps in image captioning are identifying the different objects in an image, finding their relationships, classifying them, and combining the corresponding words, which by itself does not amount to proper language modelling. Producing proper sentences therefore requires both computer vision and Natural Language Processing (NLP): the classified objects are passed to a language model that creates the captions. Semantic knowledge about the objects in a picture has to be obtained by capturing the characteristics of the image both globally and locally. There are various methods for captioning an image, but supervised learning currently provides the best results. This paper reviews some of these methods.

Index Terms—Normalized-cut, Hybrid engine, CNN, LSTM, Parallel-fusion architecture, Cascade attention

I. INTRODUCTION

AUTOMATIC Image Captioning (AIC) is a rapidly developing field. An image captioning system identifies the objects in an image and models them into proper sentences using NLP. AIC aims to trace the major salient characteristics of an image and describe them with error-free, well-formed sentences. Understanding a picture depends to a great extent on acquiring good image features, and the techniques used for this purpose can be broadly divided into traditional learning based and deep learning based strategies. Initially, captioning attempted only to yield simple descriptions of pictures taken under very constrained conditions. As a difficult and important research field, AIC is attracting more and more attention and is becoming increasingly significant. A captioning system needs to correlate every element in an image with its actions and attributes. In general, AIC tries to provide a simple caption for a picture in any situation; extracting all the characteristics of an image and pairing them with a proper language model is therefore a difficult task. AIC has many applications. It can help blind users to understand images when the generated text is converted into audio, and it can be used for image indexing, which is important for Content Based Image Retrieval (CBIR). Image captioning also serves various other purposes such as web search, social media, and education.

A key challenge in image captioning is the large amount of data required to build a meaningful dataset. Automatic image captioning involves two steps. The first step is to identify the individual objects in a picture using the lines and strokes present in the image and to separate the features into small meaningful parts: various visual regions are obtained from which visual features are extracted, and these are compared with the current database to determine the level of agreement between the components of the picture and the database. The second step uses the key terms to form sentences that caption the picture precisely; to produce proper sentences, the translation from picture to text is performed using NLP. Fig. 1 shows an example of image captioning.

Fig. 1. An example for image captioning.

This paper is organized as follows. Section II presents a literature review of six papers related to image captioning. Section III discusses the evaluation metrics of the different captioning systems described in Section II. Finally, Section IV concludes the paper.

II. LITERATURE REVIEW

This section reviews literature that uses different image captioning techniques. Among the many techniques for AIC, we mainly focus on the following methods. Captioning based on normalized-cut segmentation does not use proper language modelling; it arranges the words corresponding to each object in the image in order and combines them to describe the picture. The hybrid engine uses several detection algorithms together with NLP to generate sentences. Neural-network based captioning is more accurate, and several captioning systems have been developed on top of it. Here we mainly focus on the region and scene-specific context, parallel-fusion, cascade attention, and two-phase learning methods for image captioning.
K. C. Nithya, Dept. of Electronics and Communication Engineering, Govt.
College of Engineering Kannur, Kannur, India ([email protected])
Dr. V. Vinod Kumar, Dept. of Electronics and Communication Engineering,
Govt. College of Engineering Kannur, Kannur, India ([email protected])

Fig. 2. System overview of automatic image annotation. Courtesy: Myint Myint Sein et al. [2]

A. Integrated Normalized-cut and Color-based Segmentation Image Captioning System

The first step is preprocessing: greyscale conversion, noise filtering, and image enhancement. This is followed by segmentation of the image, which is based on the Normalized cut [1] (N-cut) and on color intensity; invariant intensity is used to develop this model. N-cut and color intensity based image captioning focuses on feature extraction. Fig. 2 shows the detailed structure of the integrated normalized-cut and color-based segmentation image captioning system.

1) Normalized image segmentation: Myint Myint Sein et al. [2] presented an integration in which normalized image partitioning, denoted N-cut, is used to extract the global impression of a picture. Image segmentation is treated as a graph partitioning problem, and the normalized cut is proposed as a global criterion for partitioning the graph. The normalized cut measures both the total dissimilarity between the different groups and the total similarity within each cluster. An efficient computational procedure based on a generalized eigenvalue problem can be used to optimize this criterion.
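For completeness, the normalized-cut criterion of Shi and Malik [1], on which this segmentation step is based, can be stated as follows; the formulation below is reproduced from the cited reference rather than from the reviewed system. For a weighted graph $G=(V,E)$ partitioned into disjoint sets $A$ and $B$ with edge weights $w(u,v)$,
\[
\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}, \qquad
\mathrm{cut}(A,B) = \sum_{u \in A,\, v \in B} w(u,v), \quad
\mathrm{assoc}(A,V) = \sum_{u \in A,\, t \in V} w(u,t).
\]
Minimizing this criterion is relaxed to the generalized eigenvalue problem
\[
(D - W)\,y = \lambda D\,y,
\]
where $W$ is the affinity matrix and $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$; the eigenvector associated with the second-smallest eigenvalue yields the partition.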
2) Color-based image segmentation: The color features of the picture elements are considered, under the assumption that homogeneous hues in the picture correspond to separate clusters, so that meaningful elements of the picture can be obtained; each group defines a class of picture elements that share close color properties. A graph-cut technique is used for the segmentation. Since the result depends on the color space used, no single color space provides acceptable results for all types of images.

The next main stage of image captioning is feature extraction. Here, each individual object in the image is segmented and related words are allocated to each item to form training images. A feature is defined as a meaningful object of an image. In this framework, feature extraction depends on the segmentation of the image: individual objects are extracted from the input picture using the color intensity estimates of their eigenvectors. After feature extraction, the individual objects in the picture are identified together with their corresponding words by matching the input objects against the annotated dataset. The identified words are then combined to describe the picture.

B. Automatic Image Annotation using a Hybrid Engine

Fig. 3. Process flow. Courtesy: Kaustubh Shivdikar et al. [3]

Kaustubh Shivdikar and Kshitij Marwah [3] introduce a hybrid engine that combines the Speeded Up Robust Features (SURF) algorithm with a minimum eigenvalue method to detect and classify objects; the results are then passed to a Context Free Grammar (CFG) to create grammatically meaningful phrases. Object detection is performed on both the dataset and the uncaptioned images. Primary feature description uses the SURF algorithm, in which the Hessian matrix [4] is employed for object detection; SURF is used several times in this model. The input uncaptioned image is treated as the scene image and the captioned dataset as the object images. The scene image is iterated over first, followed by successive iterations over the object images, which yields the important key points in both the database and the input image. Once the key features are obtained, matching is performed to identify the presence of any object in the scene: matching feature descriptors are obtained by comparing the features extracted from the object and scene images with SURF. Fig. 3 depicts the hybrid engine based image captioning system. The SURF step is followed by k-means clustering [5] to locate the desired object accurately; k-means assigns data to groups based on relative positions, and clustering is performed on the matched locations. When k-means is applied to the matched descriptors, the matched features have the highest density at the true position of the object, so the densest cluster provides the information about the object's position.

The next stage provides the basis for object boundary detection. A minimum eigenvalue algorithm [6] is used for boundary detection and is also applied to the database; it yields the corners of each object in the dataset. The FREAK ("Fast Retina Keypoint") algorithm is then used to obtain corner descriptors. The FREAK descriptor is based on differences of Gaussians [7] and is again subjected to k-means in order to separate each object in the scene. Each object is then delimited by drawing a polygon with the convex-hull method, and the names of the separated objects are stored in a text file.

After the objects are found, a Context Free Grammar (CFG) is used to produce sentences. The CFG produces meaningful sentences because it contains a defined set of rules for correct grammar and different terminals for nouns, verbs, and so on. The sentences created in this way are grammatically correct but may lack logical accuracy, and therefore have to be checked against a database of text books to remove wrong sentences. The whole procedure is depicted in Fig. 3, after which a sentence or a set of sentences describing the scene accurately is obtained. For a given image, SURF can identify more characteristics than the eigenvalue algorithm; however, when the grey-scale variation decreases, the feature recognition capability of SURF drops, whereas the eigenvalue algorithm remains best for boundary identification. FREAK and SURF are also used to reduce false positive detections of elements in the image and provide high robustness. Performance and robustness are measured by the BLEU and F1 scores, respectively.

A blend of feature detection and natural language processing has long been used for image annotation. The techniques employed include comparing a dataset of pictures and phrases to find the most suitable match, as described by Farhadi et al. [8], the use of significance models, and the use of simultaneous classification and annotation, among others.
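As an illustration of the matching-and-clustering stage described above, a minimal sketch using OpenCV is given below. ORB is used as a freely available stand-in for SURF (which requires the opencv-contrib package), and all function and variable names are illustrative assumptions rather than code from [3].

```python
# Illustrative sketch of the detection-and-clustering stage of a hybrid
# engine (Section II-B): keypoint matching, k-means on matched locations,
# and a convex hull around the densest cluster. ORB is used in place of
# SURF (SURF needs opencv-contrib); all names here are illustrative.
import cv2
import numpy as np

def locate_object(scene_bgr, object_bgr, k=3):
    scene = cv2.cvtColor(scene_bgr, cv2.COLOR_BGR2GRAY)
    obj = cv2.cvtColor(object_bgr, cv2.COLOR_BGR2GRAY)

    detector = cv2.ORB_create(nfeatures=1000)          # stand-in for SURF
    kp_s, des_s = detector.detectAndCompute(scene, None)
    kp_o, des_o = detector.detectAndCompute(obj, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_o, des_s)

    # Locations of matched keypoints in the scene image.
    pts = np.float32([kp_s[m.trainIdx].pt for m in matches])

    # k-means on the matched locations; the densest cluster is taken as
    # the most likely object position.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, _ = cv2.kmeans(pts, k, None, criteria, 10,
                              cv2.KMEANS_RANDOM_CENTERS)
    labels = labels.ravel()
    densest = np.bincount(labels).argmax()
    cluster_pts = pts[labels == densest]

    # Convex hull drawn around the densest cluster approximates the
    # object boundary, as in the reviewed pipeline.
    hull = cv2.convexHull(cluster_pts)
    return hull
```

In the reviewed system the hull vertices would then be associated with an object name from the annotated dataset before being handed to the CFG stage.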
C. Region-based and Scene-Specific Context (SSC) System for Captioning

Kun Fu, Junqi Jin, Runpeng Cui, Fei Sha, and Changshui Zhang [9] developed a captioning framework that exploits parallel structures between sentences and images. In this method, there is a close correspondence between the visual concepts that identify object regions and their corresponding sentences. In addition, the process of generating the next word, given the words already generated, is aligned with the visual perception experience, in which shifting attention among the regions imposes an ordering of visual perception. The alignment encodes the visual scene and its corresponding content in the textual description. The model uses a Convolutional Neural Network (CNN) for feature extraction and a Long Short-Term Memory (LSTM) network for language modelling; the LSTM predicts both where the next visual focus should be and what the next word in the caption should be. An encoder-decoder structure is used for captioning. The technique also introduces a scene-specific context, which captures higher-level semantic information encoded in the picture; language models are used to generate words specific to the type of scene. The scene context draws out visual characteristics from the entire image and biases the LSTM when generating words. A picture is first analyzed and represented with various visual regions from which visual features are extracted. Fig. 4 shows the detailed structure of the encoder-decoder system for AIC. The visual feature vectors are then fed into an LSTM network that predicts both the sequence of attended regions and the sequence of generated words, based on the shifts of visual attention. The neural network model is additionally supervised by a scene vector, which extracts global visual features from the scene and, intuitively, selects a scene-specific language model for generating text.

Fig. 4. The architectural diagram of the image captioning system. Courtesy: Kun Fu et al. [9]

1) Multiple-scale representation of the image: The framework represents an image as a collection of feature vectors computed on localized regions at different scales [10]. This representation generates visually good regions depending on texture, color, shading, etc. [11]. Good visual regions are then selected, which should be semantically meaningful, contextually rich, primitive, and non-compositional.

2) Extraction of visual features: Each image is first resized to 224x224 and then fed into a ResNet network to obtain a CNN feature vector of fixed dimension; the CNN acts as the encoder.

3) Attention-based LSTM decoder: An attention-based LSTM acts as the decoder; LSTMs are used for language modelling. An LSTM unit consists of a cell, an input gate, an output gate, and a forget gate, which provide feedback connections and allow the whole sequence to be processed. Fig. 4 shows the encoder-decoder framework based on CNN-LSTM [12]-[15].

Scene-specific LSTM: This component extracts the overall visual content of the entire image; a supervised technique such as an LSTM is used for predicting the scene vectors. When identifying objects in the picture, the focus is on determining the latent alignment between the detected regions and the words in the training captions [16]-[19]; the motivation is to use these alignments to train a recurrent neural network generator of word sequences, where the training data consist of the aligned regions and the corresponding words. However, when captioning a new picture, the trained generator takes the feature vector computed over the entire test image, as in other comparable frameworks. This model extracts features from image elements using bounding boxes, which is a more straightforward representation. Both their framework and intuition suggest that different parts of a sentence should correspond to different regions of the picture. To this end, the framework has to model how the caption moves between regions, using an attention model to describe the elements.
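The sketch below illustrates the general CNN-encoder/attention-LSTM-decoder pattern described in this section. It is not the implementation of [9]; the layer sizes, module names, and the use of an additive attention score are illustrative assumptions.

```python
# Minimal sketch of a CNN encoder with a soft-attention LSTM decoding step,
# in the spirit of the encoder-decoder systems reviewed in Section II-C.
# This is not the authors' implementation; sizes and names are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, prev_word, state):
        # regions: (B, R, feat_dim) region features from the CNN encoder.
        h, c = state
        # Additive attention over regions, conditioned on the hidden state.
        scores = self.att_score(torch.tanh(
            self.att_feat(regions) + self.att_hid(h).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha * regions).sum(dim=1)                       # (B, feat_dim)
        # One LSTM step: previous word embedding concatenated with context.
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), (h, c)                                   # word logits

# Encoder: a ResNet backbone whose spatial feature map provides the regions.
backbone = models.resnet50()
encoder = nn.Sequential(*list(backbone.children())[:-2])   # (B, 2048, 7, 7)
img = torch.randn(1, 3, 224, 224)
regions = encoder(img).flatten(2).transpose(1, 2)           # (B, 49, 2048)

decoder = AttentionDecoderStep()
state = (torch.zeros(1, 512), torch.zeros(1, 512))
logits, state = decoder(regions, torch.tensor([1]), state)  # one decoding step
```

At each step the attention weights re-rank the regions, which is the mechanism by which the decoder "predicts where the next visual focus should be" as described above.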

D. Parallel-Fusion Architecture

Fig. 5. Parallel-Fusion Architecture. Courtesy: Xiaokang Yang et al. [13]

The parallel-fusion architecture [13] is similar to the encoder-decoder architecture; parallel fusion of LSTM and RNN units improves efficiency. A pre-trained VGG-16 model is used for feature extraction, and the images are resized to a fixed size before being passed to VGG-16 [14]. In parallel fusion, the individual features are first separated and then processed independently. A dictionary is built consisting of the different words present in the dataset, and each word is represented as a one-hot vector [15] whose dimension equals the size of the dictionary. The sizes of the encoded words and of the hidden layers are equal, and sentences and image features are embedded in a common space. After feature extraction, language modelling is performed by a Recurrent Neural Network (RNN).

In the parallel-fusion technique, the hidden units of the RNN model are divided into small parts and combined in a parallel-fusion form; these units can moreover be different kinds of RNN. An RNN may not create long sentences but has a high memory capacity [16]. Image description and caption generation are carried out by the CNN and the RNN, respectively. The fusion mechanism first separates the characteristics of an image, detects the individual elements, and aligns the visual and language information. During forward propagation the source-data features and the hidden layers have the same size, and each hidden layer receives the output of the previous time step. The two hidden units can be combined in different ways, such as RNN-RNN or RNN-LSTM; to obtain the result, the output layers are merged. Fig. 5 shows the parallel-fusion architecture, where 'y' represents the output of the RNN units, 'W' the weight parameters, and 'h' the hidden units [17].
E. Saliency-Enhanced Two-Phase Learning Image Captioning

Fig. 6. Two-phase learning architecture. Courtesy: Lian Zhou et al. [18]

This method was developed by Lian Zhou et al. [18]. Visual and semantic saliency are significant for image captioning, and the main aim of the re-captioning mechanism is to improve captioning by fully utilizing the saliency information of an image. Single-stage saliency captioning exploits only a small amount of saliency and does not use a saliency predictor, so "two-phase learning" is adopted to enhance both the accuracy and the saliency of the captions. Saliency masks and maps are produced by a visual-saliency technique; grammatical and other word properties (such as whether a word is a noun) are provided by semantic saliency; and the degree of saliency of each sample is computed by sample saliency. The mechanism uses an encoder for feature detection and extraction, followed by two-stage decoders that obtain saliency information from the deep features and produce grammatically correct sentences. Fig. 6 shows an outline of the structure for saliency-enhanced re-captioning by means of two-stage learning: a CNN acts as the encoder, LSTM1 is the first-stage decoder, and LSTM2 is the second-stage decoder.

1) Image encoder: A deep CNN [19] serves as the image encoder and extracts feature vectors from the image. The convolutional layers produce fine-grained spatial output features, and each feature vector with its region can be indexed by a number of rows and columns. A ReLU [20] feeds the activation or feature map to the decoder.

2) Image decoder: An LSTM [21] framework with visual attention [22] is used for decoding. Two LSTMs are used, one per stage. The first stage learns the saliency cues and types; every word is represented as a one-hot vector, and the first stage captures visual, semantic, and sample saliency.

3) Visual-saliency technique: This method develops a global saliency mask. It gives a diverse image representation based on the visuals, and the alignment between visual regions and the corresponding words can be learned by visual attention.

4) Semantic-saliency technique: This extracts more discriminative visual information, which helps to predict the most important words (such as nouns) more easily and with higher accuracy, and it develops the sequence of salient words in the order in which they appear in the captions.

5) Sample-saliency technique: This improves robustness by computing the saliency of each sample. The second-stage decoder is an LSTM that combines the three saliency methods; the output of the second LSTM provides high-quality captions and depends on the output vectors of the CNN and the outputs of the three saliency techniques.
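The following schematic sketches only the data flow of the two-stage decoding just described: a first-stage LSTM consumes CNN features, and a second-stage LSTM additionally receives saliency vectors before predicting the caption words. All sizes and the way the saliency signals are summarized as vectors are assumptions for illustration, not the architecture of [18].

```python
# Data-flow sketch of a two-stage (re-captioning) decoder, Section II-E.
import torch
import torch.nn as nn

feat_dim, hid, vocab = 2048, 512, 10000
embed = nn.Embedding(vocab, hid)
stage1 = nn.LSTMCell(hid + feat_dim, hid)             # LSTM1: first-stage decoding
stage2 = nn.LSTMCell(hid + feat_dim + 3 * hid, hid)   # LSTM2: adds saliency inputs
out = nn.Linear(hid, vocab)

img_feat = torch.randn(1, feat_dim)                   # CNN encoder output
word = embed(torch.tensor([1]))
h1, c1 = stage1(torch.cat([word, img_feat], dim=1),
                (torch.zeros(1, hid), torch.zeros(1, hid)))

# Visual, semantic, and sample saliency summarized as three vectors
# (placeholders here); the second stage fuses them with the image feature.
vis_sal, sem_sal, smp_sal = (torch.randn(1, hid) for _ in range(3))
h2, c2 = stage2(torch.cat([h1, img_feat, vis_sal, sem_sal, smp_sal], dim=1),
                (torch.zeros(1, hid), torch.zeros(1, hid)))
logits = out(h2)                                      # second-stage word scores
```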

F. Multiple Feature based Cascade Attention for Captioning

An alignment unit in the encoder-decoder network can sequentially improve the performance of the captioning system. A traditional attention unit ignores the inequality between different kinds of inputs and may therefore fail to exploit some informative characteristics. Cascade-based attention [16] helps to provide this information and reduces the impact of the inequality between input types. In the encoder-decoder structure for AIC, an attention module passes input features to the decoder so as to increase performance continuously; the inputs are processed in different ways, which is more useful than processing them in parallel.

In cascade attention, a CNN is used for feature extraction. The inputs are then grouped according to their characteristics and processed sequentially. The attention layer identifies the nature of each input and arranges the inputs in order. The attention mechanism has several layers in which new clusters of inputs evolve; the new inputs are computed as a weighted sum of the features of the previous layer, so that each layer receives the different types of inputs it needs to work properly. The encoder output provides the features of each region of interest together with a global description of the image; this information is combined and passed to the decoders. The caption sequence is predicted recurrently by the decoders, one word at a time, conditioned on the previously generated words. Two LSTMs [17] are fused, one for language generation (LAN) and one for attention (ATT), to produce accurate sentences. The ATT LSTM captures the global characteristics and the visual sentinel, while the LAN LSTM handles the output of the cascade attention together with the global context. Hidden features are used in the ATT LSTM, update procedures are omitted, and the output caption is produced.
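The sketch below illustrates the cascading idea outlined above in a generic form: each attention layer attends over one group of input features, and the weighted sum it produces becomes the query for the next layer. This is an illustration of the principle, not the exact module of [16]; the feature dimensions and the choice of additive attention are assumptions.

```python
# Generic sketch of cascaded attention layers (Section II-F).
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    def __init__(self, feat_dim, query_dim, hidden=512):
        super().__init__()
        self.wf = nn.Linear(feat_dim, hidden)
        self.wq = nn.Linear(query_dim, hidden)
        self.v = nn.Linear(hidden, 1)

    def forward(self, feats, query):
        # feats: (B, N, feat_dim); query: (B, query_dim)
        e = self.v(torch.tanh(self.wf(feats) + self.wq(query).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)
        return (alpha * feats).sum(dim=1)   # weighted sum -> next layer's query

# Two cascaded layers: region-of-interest features first, then grid features.
roi_feats = torch.randn(1, 36, 1024)   # per-region features (assumed sizes)
grid_feats = torch.randn(1, 49, 2048)  # global grid features
h_decoder = torch.randn(1, 512)        # current decoder hidden state

layer1 = AttentionLayer(feat_dim=1024, query_dim=512)
layer2 = AttentionLayer(feat_dim=2048, query_dim=1024)

ctx1 = layer1(roi_feats, h_decoder)    # (1, 1024)
ctx2 = layer2(grid_feats, ctx1)        # (1, 2048), passed on to the LAN LSTM
```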
III. EVALUATION METRICS

Evaluation metrics are used to assess the performance of a captioning system and to quantify the quality of the generated captions. There are different methods for extracting the characteristics of an image; the evaluation metrics used by the reviewed systems, namely PPL, CIDEr (C) [24], BLEU [25], and METEOR (M) [26], are listed in Table I, and the metric values of each method are shown in Table II.

TABLE I
DIFFERENT METHODS AND THEIR EVALUATION METRICS

Method                            | Dataset            | Evaluation Metrics
Region-based and SSC [9]          | MSCOCO             | BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L, CIDEr-D
Parallel-fusion architecture [13] | Flickr8k           | BLEU-1, METEOR, PPL
Cascade attention [16]            | MSCOCO             | BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, CIDEr, ROUGE-L, SPICE
Two-phase learning model [18]     | Flickr30k, MSCOCO  | BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, CIDEr, ROUGE-L

TABLE II
EVALUATION METRIC VALUES, WHERE BLEU-1, 2, 3, 4 ARE DENOTED B-1, B-2, B-3, B-4 (M = METEOR, C = CIDEr)

Dataset   | Method                            | Category  | B-1   | B-2   | B-3   | B-4   | M     | C
MSCOCO    | Region-based and SSC [9]          | (RA+SS)   | 72.4  | 55.5  | 41.8  | 31.3  | 24.8  | 95.5
Flickr8k  | Parallel-fusion architecture [13] | sRNN-h512 | 66.7  | -     | -     | -     | 16.53 | -
Flickr8k  | Parallel-fusion architecture [13] | Mix6v4    | 64.7  | -     | -     | -     | 18.85 | -
MSCOCO    | Two-stage learning [18]           | VIS SVR   | 71.99 | 52.05 | 37.66 | 27.42 | 23.05 | 79.50
MSCOCO    | Two-stage learning [18]           | VIS HVR   | 72.27 | 52.64 | 38.20 | 27.85 | 23.27 | 79.50
Flickr30k | Two-stage learning [18]           | VIS SVR   | 66.68 | 43.89 | 29.90 | 20.35 | 18.11 | 37.31
Flickr30k | Two-stage learning [18]           | VIS HVR   | 66.67 | 44.52 | 30.46 | 20.85 | 18.39 | 38.05
MSCOCO    | Cascade attention [16]            | Cascade   | 79.4  | 63.7  | 48.9  | 36.9  | 27.9  | 122.7

The scene-specific encoder-decoder [9] structure is evaluated with CIDEr (C), a consensus-based measure that compares the generated caption with the reference captions [24]; in addition, the attention weights obtained from the regions are aggregated, and similar regions and words are sampled from the entire image to examine the alignment.

For the parallel LSTM-RNN fusion [13], the size of the model and the running time are also taken into account when assessing performance. The BLEU score matches that of the LSTM baseline and the PPL is slightly higher than the reference, whereas METEOR increases markedly; the combination of LSTM and sRNN denoted Mix3v7 gives satisfactory results. In the two-stage learning technique [18], the gains in performance are limited compared with the uniform attention pattern, with the exception of the CIDEr score [26], [27]. The baseline model achieves better overall performance than the first-stage model; with one more stage, the saliency-enhanced models achieve better absolute performance on both datasets except for the BLEU score on Flickr30k. The second-phase decoders VIS SVR and VIS HVR use the saliency map and the saliency mask, respectively, to refine the training pictures, and the second-stage models learn to achieve better or practically identical absolute performance on several metrics [28]. Cascade attention is trained end to end; the cascade module reaches a CIDEr score of up to 122.7, which indicates an effective structure.
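As a concrete illustration of how scores of the kind reported in Table II are computed, the snippet below evaluates sentence-level BLEU-1 through BLEU-4 for a single caption using NLTK. Published benchmark numbers are produced with corpus-level toolkits (e.g. the COCO caption evaluation code); this example, with made-up captions, only demonstrates the n-gram weighting behind the BLEU family.

```python
# BLEU-1..BLEU-4 for one generated caption against two reference captions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a horse on the beach".split(),
    "a person rides a brown horse along the shore".split(),
]
candidate = "a man rides a horse on the beach".split()

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))      # uniform weights up to n-grams
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```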
IV. CONCLUSION

Automatic image captioning is an emerging field of recent years. After reviewing a number of papers, we found that various approaches are used, such as N-cut and colour-based segmentation, the hybrid engine, and encoder-decoder frameworks, all of which mainly combine computer vision with natural language processing. Automatic image captioning using neural networks is the more advanced and accurate approach.

Applications of AIC include generating captions for people who suffer from various degrees of visual impairment, the automatic creation of metadata for images (indexing) for use by search engines, general-purpose robot vision systems, and many others. Large datasets are required, and many open-source datasets such as MSCOCO and Flickr are available. Changing the model architecture, for example by incorporating an attention module, and tuning more hyperparameters such as the batch size and the number of layers and units, can further improve an image captioning framework.

REFERENCES

[1] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
[2] May The' Yu and Myint Myint Sein, "Automatic image captioning system using integration of N-cut and color-based segmentation method," in SICE Annual Conference, 2011.
[3] Kaustubh Shivdikar, Ahan Kak, and Kshitij Marwah, "Automatic image annotation using a hybrid engine," in IEEE INDICON, 2015.
[4] H. Bay et al., "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, Jun. 2008.
[5] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, 1965, pp. 281-297.
[6] J. Shi and C. Tomasi, "Good features to track," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, 1994, pp. 593-600.
[7] A. Alahi, "FREAK: Fast Retina Keypoint," in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012, pp. 510-517.
[8] A. Farhadi et al., "Every picture tells a story: Generating sentences from images," in The 11th European Conference on Computer Vision, Heraklion, 2010, pp. 15-29.
[9] Kun Fu, Junqi Jin, Runpeng Cui, Fei Sha, and Changshui Zhang, "Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[10] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain images with multimodal recurrent neural networks," in ICLR, 2015.
[11] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.
[12] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[13] Minsi Wang, Li Song, Xiaokang Yang, and Chuanfei Lu, "A parallel-fusion RNN-LSTM architecture for image caption generation," in ICIP, 2016.
[14] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[15] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164.
[16] Jiahe Shi, Yali Li, and Shengjin Wang, "Cascade attention: Multiple feature based learning for image captioning," in ICIP, 2019.
[17] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[18] Lian Zhou, Yuejie Zhang, Yugang Jiang, Tao Zhang, and Weiguo Fan, "Re-Caption: Saliency-enhanced image captioning through two-phase learning," IEEE Transactions on Image Processing, 2019.
[19] J. Gu et al., "Recent advances in convolutional neural networks," Pattern Recognition, vol. 77, pp. 354-377, 2018.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1106-1114.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[22] K. E. Purushothaman and V. Nagarajan, "Design a low noise amplifier using cascading of resistive shunt feedback with current reuse," International Journal of Advanced and Innovative Research, vol. 4, no. 11, pp. 171-175, Nov. 2015.
[23] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048-2057.
[24] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4566-4575.
[25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311-318.
[26] M. J. Denkowski and A. Lavie, "Meteor Universal: Language specific translation evaluation for any target language," in Proc. Workshop Statist. Mach. Transl., 2014, pp. 376-380.
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[28] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, "Collecting image annotations using Amazon's Mechanical Turk," in Workshop of NAACL, 2010.

