IMAGE CAPTIONING BOT
Made By ~
Aman Bahuguna (18BCS2441)
Deepak Yadav (18BCS2446)
Gyan Ranjan Kumar (18BCS2431)
TABLE OF CONTENT
• Project Description
• Datasets
• Models: RNN + CNN
• Architecture details
• Evaluation problems
• Results
INTRODUCTION
What do you See in the Picture?
Well some of you
might say “A white
dog in a grassy area”,
some may say “White
dog with brown spots”
and yet some others
might say “A dog on
grass and some pink
flowers”.
But, can you write a computer program that takes an image as
input and produces a relevant caption as output?
APPLICATION OF IMAGE CAPTIONING
Probably can be used in the applications where text is used mostly and with the
use of this we can infer a image in form of text.
NLP is used extensively in the market now-a-days. For example, summarizing
or gaining insights from a large corpus of text. In the same way, we can use the
same concept to get insights from images as well.
We can build a 360-degree metastore and make use of it in a wide variety of
business like making user searches more efficient on an e-commerce platform
based on metadata of products, other may be some other things like
recommendations and all.
We can describe like what happen in a given video segment.
Can be used to give something back to mankind for visually impaired people.
and many more.
DATASETS
Flickr8k
8000 images, each annotated with 5 sentences via AMT
Training Set — 6000 images
Dev Set — 1000 images
Test Set — 1000 images
• A child in a pink dress is climbing up a set of stairs in an
entry way
• A girl going into a wooden building .
• A little girl climbing into a wooden playhouse .
• A little girl climbing the stairs to her playhouse .
• A little girl in a pink dress going into a wooden cabin .
Keras is an open-source software library that provides
a Python interface for artificial neural networks. Keras acts as an
interface for the TensorFlow library.
MODELS: RNN + CNN
How to combine image and and sentence
RNN + CNN:
• Encoder-decoder model
• Multimodal layer
Encoder-decoder model: image caption
Multimodal Layer
ENCODER-DECODER MODEL:
MODEL ARCHITECTURE
ARCHITECTURE DETAILS:
WORD EMBEDDINGS
To encode the words in form of vector using Gloves
FULL MODEL DETAILS
OUTPUT GENERATIONS
OUTPUT
THANK YOU