Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

This document introduces a new dataset of labeled Instagram posts that aims to analyze the complex relationship between images and captions in determining author intent. It presents taxonomies for labeling the authorial intent, contextual relationship between image and caption literal meanings, and semiotic relationship between their signified meanings. An initial deep learning model is able to better determine author intent when using both images and captions compared to images alone, demonstrating the value of this multimodal meaning multiplication approach. The dataset provides a new resource for computational study of how paired text and images combine to create rich meanings.

Uploaded by

Sasha-Lee Kriel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views11 pages

Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Uploaded by

Sasha-Lee Kriel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Integrating Text and Image: Determining Multimodal Document Intent in

Instagram Posts

Julia Kruk1∗, Jonah Lubin2∗ , Karan Sikka1 , Xiao Lin1 ,

Dan Jurafsky3 , Ajay Divakaran1†,
∗equal contribution
1
SRI International, Princeton, NJ 2 The University of Chicago, Chicago, Illinois
3
Stanford University, Stanford, CA

Abstract
Computing author intent from multimodal
data like Instagram posts requires modeling a
complex relationship between text and image.
For example, a caption might evoke an ironic
contrast with the image, so neither caption nor
image is a mere transcript of the other. In-
stead they combine—via what has been called
meaning multiplication Bateman (2014)—to
create a new meaning that has a more complex
Figure 1: Image-Caption meaning multiplication: A
relation to the literal meanings of text and im-
change in the caption completely changes the overall
age. Here we introduce a multimodal dataset
meaning of the image-caption pair.
of 1299 Instagram posts labeled for three or-
thogonal taxonomies: the authorial intent be-
hind the image-caption pair, the contextual re-
lationship between the literal meanings of the or captions (Chen et al., 2015; Faghri et al., 2018,
image and caption, and the semiotic relation- inter alia). But prior work on image–text pairs has
ship between the signified meanings of the im- generally been asymmetric, regarding either im-
age and caption. We build a baseline deep age or text as the primary content, and the other
multimodal classifier to validate the taxonomy, as mere complement. Scholars from semiotics as
showing that employing both text and image
well as computer science have pointed out that this
improves intent detection by 9.6% compared
to using only the image modality, demonstrat-
is insufficient; often text and image are not com-
ing the commonality of non-intersective mean- bined by a simple addition or intersection of the
ing multiplication. The gain with multimodal- component meanings (Bateman, 2014; Marsh and
ity is greatest when the image and caption di- Domas White, 2003; Zhang et al., 2018).
verge semiotically. Our dataset offers a new Rather, determining author intent with
resource for the study of the rich meanings that
text+image content requires a richer kind of
result from pairing text and image.
meaning composition that has been called mean-
1 Introduction ing multiplication (Bateman, 2014): the creation
of new meaning through integrating image and
Multimodal social platforms such as Instagram text. Meaning multiplication includes simple
let content creators combine visual and textual meaning intersection or concatenation (a picture
modalities. The resulting widespread use of of a dog with the label “dog”, or the label “Ru-
text+image makes interpreting author intent in fus”). But it also includes more sophisticated
multimodal messages an important task for NLP kinds of composition, such as irony or indirection,
for document understanding. where the text+image integration requires infer-
There are many recent language processing ence that creates a new meaning. For example in
studies of images accompanied by basic text labels Figure 1, a picture of a young woman smoking
∗ is given two different hypothetical captions that
Work done while Julia (from Cornell University) and
Jonah were interns at SRI International. result in different composed meanings. In Pairing
†
Corresponding author, [email protected]. I, the image and text are parallel, with the picture

4622
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing, pages 4622–4632,
Hong Kong, China, November 3–7, 2019. 2019c Association for Computational Linguistics
used to highlight relaxation through smoking. sian tradition focuses on advertisements, in which
Pairing II uses the tension between her image and the text serves as merely another connotative as-
the implications of her actions to highlight the pect to be incorporated into a larger connotative
dangers of smoking. meaning (Heath et al., 1977). Marsh and Do-
Computational models that detect complex re- mas White (2003) offer a taxonomy of the re-
lationships between text and image and how they lationship between image and text by consider-
cue author intent could be significant for many ar- ing image/illustration pairs found in textbooks or
eas, including the computational study of adver- manuals. We draw on their taxonomy, although as
tising, the detection and study of propaganda, and we will see, the connotational aspects of Instagram
our deeper understanding of many other kinds of posts require some additions.
persuasive text, as well as allowing NLP applica- For our model of speaker intent, we draw on
tions to news media to move beyond pure text. the classic concept of illocutionary acts (Austin,
To better understand author intent given such 1962) to develop a new taxonomy of illocutionary
meaning multiplication, we create three novel tax- acts focused on the kinds of intentions that tend
onomies related to the relationship between text to occur on social media. For example, we rarely
and image and their combination/multiplication see commissive posts on Instagram and Facebook
in Instagram posts, designed by modifying exist- because of the focus on information sharing and
ing taxonomies (Bateman, 2014; Marsh and Do- constructions of self-image.
mas White, 2003) from semiotics, rhetoric, and
Computational approaches to multi-modal doc-
media studies. Our taxonomies measure the au-
ument understanding have focused on key prob-
thorial intent behind the image-caption pair and
lems such as image captioning (Chen et al., 2015;
two kinds of text-image relations: the contextual
Faghri et al., 2018), visual question answering
relationship between the literal meanings of the
(Goyal et al., 2017; Zellers et al., 2018; Hudson
image and caption, and the semiotic relationship
and Manning, 2019), or extracting the literal or
between the signified meanings of the image and
connotative meaning of a post (Soleymani et al.,
caption. We then introduce a new dataset, MDID
2017). More recent work has explored the role
(Multimodal Document Intent Dataset), with 1299
of image as context for interaction and pragmat-
Instagram posts covering a variety of topics, anno-
ics, either in dialog (Mostafazadeh et al., 2016,
tated with labels from our three taxonomies.
2017), or as a prompt for users to generate de-
Finally, we build a deep neural network model scriptions (Bisk et al., 2019). Another important
for annotating Instagram posts with the labels direction has looked at an image’s perlocutionary
from each taxonomy, and show that combining force (how it is perceived by its audience), includ-
text and image leads to better classification, es- ing aspects such as memorability (Khosla et al.,
pecially when the caption and the image diverge. 2015), saliency (Bylinskii et al., 2018), popular-
While our goal here is to establish a computational ity (Khosla et al., 2014) and virality (Deza and
framework for investigating multimodal meaning Parikh, 2015; Alameda-Pineda et al., 2017).
multiplication, in other pilot work we have be-
gun to consider some applications, such as us- Some prior work has focused on intention. Joo
ing intent for social media event detection and for et al. (2014) and Huang and Kovashka (2016)
user engagement prediction. Both these directions study prediction of intent behind politician por-
highlight the importance of the intent and semiotic traits in the news. Hussain et al. (2017) study the
structure of a social media posting in determining understanding of image and video advertisements,
its influence on the social network as a whole. predicting topic, sentiment, and intent. Alikhani
et al. (2019) introduce a corpus of the coher-
2 Prior Work ence relationships between recipe text and im-
ages. Our work builds on Siddiquie et al. (2015),
A wide variety of work in multiple fields has ex- who focused on a single type of intent (detect-
plored the relationship between text and image ing politically persuasive video on the internet)
and extracting meaning, although often assigning and even more closely on Zhang et al. (2018),
a subordinate role to either text or images, rather who study visual rhetoric as interaction between
than the symmetric relationship in media such as the image and the text slogan in advertisements.
Instagram posts. The earliest work in the Barthe- They categorize image-text relationships into par-

4623
allel equivalent (image-text deliver same point at
equal strength), parallel non-equivalent (image-
text deliver the same point at different levels) and
non-parallel (text or image alone is insufficient in
point delivery). They also identify the novel issue
of understanding the complex, non-literal ways in
which text and image interacts. Weiland et al.
(2018) study the non-literal meaning conveyed by
image-caption pairs and draw on a knowledge-
base to generate the gist of the image-caption pair.

3 Taxonomies
As Berger (1972) points out in discussing the rela-
tionship between one image and its caption:

It is hard to define exactly how the

words have changed the image but un-
doubtedly they have. (p. 28).

We propose three taxonomies in an attempt to an-

swer Berger’s implicit question, two (contextual
and semiotic) to capture different aspects of the re- Figure 2: Examples of multimodal document intent:
lationship between the image and the caption, and advocative, provocative, expressive and promotive con-
one to capture speaker intent. tent

3.1 Intent Taxonomy

3. exhibitionist: create a self-image reflecting
The proposed intent taxonomy is a generalization the person, state etc. for the user using selfies,
and elaboration of existing rhetorical categories pictures of belongings (e.g. pets, clothes) etc.
pertaining to illocution, that targets multimodal
social networks like Instagram. We developed a 4. expressive: express emotion, attachment, or
set of eight illocutionary intents from our exam- admiration at an external entity or group.
ination and clustering of a large body of repre- 5. informative: relay information regarding a
sentative Instagram content, informed by previ- subject or event using factual language.
ous studies of intent in Instagram posts. There 6. entertainment: entertain using art, humor,
is some overlap between categories; to bound the memes, etc.
burden on the annotators, however, we asked them
7. provocative/discrimination: directly attack
to identify intent for the image-caption pairing as
an individual or group.
a whole and not for the individual components
For example drawing on Goffman’s idea of the 8. provocative/controversial: be shocking.
presentation of self (Goffman, 1978), Mahoney
et al. (2016) in their study of Scottish political In- 3.2 The Contextual Taxonomy
stagram posts define acts like Presentation of Self, The contextual relationship taxonomy captures the
which, following Hogan (2010) we refer to as ex- relationship between the literal meanings of the
hibition, or Personal Political Expression, which image and text. We draw on the three top-level cat-
we generalize to advocative. Following are our fi- egories of the Marsh and Domas White (2003) tax-
nal eight labels; Figure 2 shows some examples. onomy, which distinguished images that are min-
imally related to the text, highly related to the
1. advocative: advocate for a figure, idea,
text, and related but going beyond it. These three
movement, etc.
classes— reflecting Marsh et al.’s primary interest
2. promotive: promote events, products, orga- in illustration—frame the image only as subordi-
nizations etc. nate to the text. We slightly generalize the three

4624
Intent
Category # Samples
Provocative 84 Semiotic Contextual Relationship
Informative 119 Category # Samples Category # Samples
Advocative 97 Divergent 115 Minimal 372
Entertainment 310 Additive 277 Close 585
Expositive 237 Parallel 712 Transcendent 147
Expressive 95
Promotive 162

Table 1: Counts of different labels in the Multimodal Document Intent Dataset (MDID).

top-level categories taxonomy of Marsh and Do- of pride in one’s pets that is not directly reflected
mas White (2003) to make them symmetric for the in either modality taken by itself. First, it forces
Instagram domain: the reader to step back and consider what is being
signified by the image and the caption, in effect of-
Minimal Relationship: The literal meanings of
fering a meta-comment on the text-image relation.
the caption and image overlap very little. For
Second, there is a tension between what is signi-
example, a selfie of a person at a waterfall
fied (a family and a litter of young animals respec-
with the caption “selfie”. While such a terse
tively) that results in a richer idiomatic meaning.
caption does nevertheless convey a lot of in-
formation, it still leaves out details such as Our third taxonomy therefore captures the re-
the location, description of the scene, etc. lationship between what is signified by the re-
that are found in typical loquacious Insta- spective modalities, their semiotics. We draw on
gram captions. the earlier 3-way distinction of Kloepfer (1977)
as modeled by Bateman (2014) and the two-way
Close Relationship: The literal meanings of the
(parallel vs. non-parallel) distinction of Zhang
caption and the image overlap considerably.
et al. (2018) to classify the semiotic relationship
For example, a selfie of a person at a crowded
of image/text pairs as divergent, parallel and addi-
waterfall, with the caption “Selfie at Hemlock
tive. A divergent relationship occurs when the im-
falls on a crowded sunny day”.
age and text semiotics pull in opposite directions,
Transcendent Relationship: The literal meaning creating a gap between the meanings suggested by
of one modality picks up and expands on the the image and text. A parallel relationship occurs
literal meaning of the other. For example, when the image and text independently contribute
a selfie of a person at a crowded waterfall to the same meaning. An additive relationship oc-
with the caption “Selfie at Hemlock Falls on curs when the image and text semiotics amplify or
a sunny and crowded day. Hemlock falls is modify each other.
a popular picnic spot. There are hiking and
The semiotic classification is not always homol-
biking trails, and a great restaurant 3 miles
ogous to the contextual. For example, an image
down the road ...”.
of a mother feeding her baby with a caption “My
Note that while the labels “minimal” and ”close” new small business needs a lot of tender loving
could be thought of as lying on a continuous scale care”, would have a minimal contextual relation-
indicating semantic overlap, the label “transcen- ship. Yet because both signify loving care and
dent” indicates an expansion of the meaning that the image intensifies the caption’s sentiment, the
cannot be captured by such a continuous scale. semiotic relationship is additive. Or a lavish for-
mal farewell scene at an airport with the caption
3.3 The Semiotic Taxonomy “Parting is such sweet sorrow”, has a close con-
The contextual taxonomy described above does textual relationship because of the high overlap in
not deal with the more complex forms of “mean- literal meaning, but the semiotics would be addi-
ing multiplication” illustrated in Figure 1. For ex- tive, not parallel, since the image shows only the
ample, an image of three frolicking puppies with leave-taking, while the caption suggests love (or
the caption “My happy family,” sends a message ironic lack thereof) for the person leaving.

4625
night this photo was taken. ICP IV is an aesthetic
photo of a young woman, paired with a caption
that has no semantic elements in common with the
photo. The caption may be a prose excerpt, the
author’s reflection on what the image made them
think or feel, or perhaps just a pairing of pleas-
ant visual stimulus with pleasant literary material.
This divergent relationship is often found in pho-
tography, artistic and other entertainment posts.
ICP V uses one of the most common divergent re-
lationships, in which exhibitionist visual material
is paired with reflections or motivational captions.
ICP V is thus similar to ICP III, but without the
inside jokes/hidden meanings common to ICP III.
ICP VI is an exhibitionist post that seems to be
common recently among public figures on Insta-
gram. The image appears to be a classic selfie or
often a professionally taken image of the individ-
ual, but the caption refers to that person’s opinions
or agenda(s). This relationship is divergent—there
are no common semantic elements in the image
and caption—but the pair paints a picture of the
individual’s current state or future plans.
Figure 3: The top three images exemplify the semiotic
categories. Images I-VI show instances of divergent 4 The MDID Dataset
semiotic relationships.
Our dataset, MDID (the Multimodal Document In-
tent Dataset) consists of 1299 public Instagram
Figure 3 further illustrates the proposed semi- posts that we collected with the goal of develop-
otic classification. The first three image-caption ing a rich and diverse set of posts for each of the
pairs (ICP’s) exemplify the three semiotic relation- eight illocutionary types in our intent taxonomy.
ships. To give further insights into the rich com- For each intent we collected at least 16 hashtags
plexity of the divergent category, the six ICP’s be- or users likely to yield a high proportion of posts
low showcase the kinds of divergent relationships that could be labeled by that heading.
we observed most frequently on Instagram. For the advocative intent, we selected mostly
ICP I exploits the tension between the refer- hashtags advocating and spanning political or so-
ence to retirement expressed in the caption and cial ideology such as #pride and #maga. For the
the youth projected by the two young women in promotive intent we relied on the #ad tag that In-
the image to convey irony and thus humor in what stagram has recently begun requiring for spon-
is perhaps a birthday greeting or announcement. sored posts. For exhibitionist intent we used tags
Many ironic and humorous posts exhibit divergent that focused on the self as the most important as-
semiotic relationships. ICP II has the structure of a pect of the post such as #selfie and #ootd (outfit
classic Instagram meme, where the focus is on the of the day). The expressive posts were retrieved
image, and the caption is completely unrelated to via tags that actively expressed a stance or an af-
the image. This is also exhibited in the divergent fective intent, such as #lovehim or #merrychrist-
“Good Morning” caption in the top row. ICP III mas. Informative posts were taken from infor-
is an example of a divergent semiotic relationship mative accounts such as news websites. Enter-
within an exhibitionist post. A popular commu- tainment posts drew on an eclectic group of tags
nicative practice on Instagram is to combine self- such as #meme, #earthporn, #fatalframes. Finally,
ies with a caption that is some sort of inside joke. provocative posts were extracted via tags that ei-
The inside joke in ICP III is a lyric from a song ther expressed a controversial or provocative mes-
a group of friends found funny and discussed the sage or that would draw people into being influ-

4626
enced or provoked by the post (#redpill, #antifa, 6 Experiments
#eattherich, #snowflake).
We evaluate our models on predicting intent, semi-
Data Labeling: Data was pre-processed (for ex- otic relationships, and image-text relationships
ample to convert all albums to single image- from Instagram posts, using image only, text only,
caption pairs). We developed a simple annotation and both modalities.
toolkit that displayed an image–caption pair and
asked the annotator to confirm whether the pair 6.1 Dataset, Evaluation and Implementation
was relevant (contains both an image and text in We use the 1299-sample MDID dataset (sec-
English) and if so to identify the post’s intent (ad- tion 4). We only use corresponding image and
vocative, promotive, exhibitionist, expressive, in- text information for each post and do not use other
formative, entertainment, provocative), contextual meta-data to preserve the focus on image-caption
relationship (minimal, close, transcendent), and joint meaning. We perform basic pre-processing
semiotic relationship (divergent, parallel, addi- on the captions such as removing stopwords and
tive). Two of the authors collaborated on the label- non-alphanumeric characters. We do not perform
ers manual and then labeled the data by consensus, any pre-processing for images.
and any label on which the authors disagreed af- Due to the small dataset, we perform 5-fold
ter discussion was removed. Dataset statistics are cross-validation for our experiments reporting av-
shown in Table 1; see https://www.ksikka. erage performance across all splits. We report
com/document_intent.html for the data. classification accuracy (ACC) and also area un-
der the ROC curve (AUC) (since AUC is more
5 Computational Model robust to class-skew), using macro-average across
We train and test a deep convolutional neural net- all classes (Jeni et al., 2013; Stager et al., 2006).
work (DCNN) model on the dataset, both to of- We use a pre-trained ResNet-18 model as the
fer a baseline model for users of the dataset, and image encoder. For word token based embed-
to further explore our hypothesis about meaning dings we use 300 dimensional vectors trained from
multiplication. scratch. For ELMo we use a publicly available
Our model can take as input either image (Img), API 1 and use a pre-trained model with two layers
text (Txt) or both (Img + Txt), and consists of resulting in a 2048 dimensional input. We use a bi-
modality specific encoders, a fusion layer, and a directional GRU as the RNN model with 256 di-
class prediction layer. We use the ResNet-18 net- mensional hidden layers. We set the dimensional-
work pre-trained on ImageNet as the image en- ity of the common embedding space in the fusion
coder (He et al., 2016). For encoding captions, layer to 128. In case there is a single modality,
we use a standard pipeline that employs a RNN the fusion layer only projects features from that
model on word embeddings. We experiment with modality. We train with the Adam optimizer with
both word2vec type (word token-based) embed- a learning rate of 0.00005, which is decayed by
dings trained from scratch (Mikolov et al., 2013) 0.1 after every 15 epochs. We report results with
and pre-trained character-based contextual embed- the best model selected based on performance on
dings (ELMo) (Peters et al., 2018). For our pur- a mini validation set.
pose ELMo character embeddings are more useful
6.2 Quantitative Results
since they increase robustness to noisy and often
misspelled Instagram captions. For the combined We show results in Table 2. For the intent taxon-
model, we implement a simple fusion strategy that omy images are more informative than (word2vec)
first linearly projects encoded vectors from both text (76% for Img vs 72.7% for Txt-emb) but with
the modalities in the same embedding space and ELMo text outperforms just using images (82.6%
then adds the two vectors. Although naive, this for Txt-ELMo, 76.0% for Img). ELMo similarly
strategy has been shown to be effective at a vari- improves performance on the contextual taxon-
ety of tasks such as Visual Question Answering omy but not the semiotic taxonomy.
(Nguyen and Okatani, 2018) and image-caption For the semiotic taxonomy, ELMo and
matching (Ahuja et al., 2018). We then use the word2vec embeddings perform similarly, (67.8%
fused vector to predict class-wise scores using a 1
https://github.com/allenai/allennlp/
fully connected layer. blob/master/tutorials/how_to/elmo.md

4627
Method Intent Semiotic Contextual
ACC AU C ACC AU C ACC AU C
Chance 28.1 50.0 64.5 50.0 53.0 50.0
Img 42.9 (±0.0) 76.0 (±0.5) 61.5 (±0.0) 59.8 (±3.0) 52.5 (±0.0) 62.5 (±1.3)
Txt-emb 42.9 (±0.0) 72.7 (±1.5) 58.9 (±0.0) 67.8 (±1.7) 60.7 (±0.5) 74.9 (±3.0)
Txt-ELMo 52.7 (±0.0) 82.6 (±1.2) 61.7 (±0.0) 66.5 (±1.9) 65.4 (±0.0) 78.5 (±2.1)
Img + Txt-emb 48.1 (±0.0) 80.8 (±1.2) 60.4 (±0.0) 69.7 (±1.8) 60.8 (±0.0) 76.0 (±2.5)
Img + Txt-ELMo 56.7 (±0.0) 85.6 (±1.3) 61.8 (±0.0) 67.8 (±1.8) 63.6 (±0.5) 79.0 (±1.4)

Table 2: Table showing results with various DCNN models– image-only (Img), text-only (Txt-emb and Txt-
ELMo), and combined model (Img + Txt-emb and Img + Txt-ELMo). Here emb is the model using standard word
(token) based embeddings, while ELMo is the pre-trained ELMo based word embeddings (Peters et al., 2018). The
numbers in Table2 are standard deviations across 5 folds.

for Txt-emb vs. 66.5% for Txt-ELMo), suggest- 6.3 Discussion

ing that individual words are sufficient for the
semiotic labeling task, and the presence of the In general, using both text and image is helpful,
sentence context (as in ELMo) is not needed. a fact that is unsurprising since combinations of
text and image are known to increase performance
Combining visual and textual modalities helps
on tasks such as predicting post popularity or user
across the board. For example, for intent taxon-
personality (Hessel et al., 2017; Wendlandt et al.,
omy the joint model Img + Txt-ELMo achieves
2017). Most telling, however, were the differences
85.6% compared to 82.6% for Txt-ELMo. Images
in this helpfulness across items. In the semiotic
seem to help even more when using a word embed-
category the greatest gain came when the text-
ding based text model (80.8% for Img + Txt-emb
image semiotics were “divergent”. By contrast,
vs. 72.7% for Txt-emb). Joint models also im-
multimodal models help less when the image and
prove over single-modality on labeling the image-
text are additive, and help the least when the image
text relationship and the semiotic taxonomy. We
and text are parallel and provide less novel infor-
show class-wise performances with the single- and
mation. Similarly, with contextual relationships,
multi-modality models in Table 3. It is particularly
multimodal analysis helps the most with the “min-
interesting that in the semiotic taxonomy, multi-
imal” category (1.6%). This further supports the
modality helps the most with divergent semiotics
idea that on social media such as Instagram, the
(gain of 4.4% compared to the image-only model).
relation between image and text can be richly di-
vergent and thereby form new meanings.
The category confusion matrix in Figure 4 pro-
vides further insights. The least confused category
is informative. Informative posts are least similar
to the rest of Instagram, since they consist of de-
tached, objective posts, with little use of first per-
son pronouns like “I” or “me.” Promotive posts are
also relatively easy to detect, since they are for-
mally informative, telling the viewer the advan-
tages and logistics of an item or event, with the ad-
dition of a persuasive intent reflecting the poster’s
personal opinions (“I love this watch”.). We found
that the entertainment label is often misapplied;
perhaps to some extent all posts have a goal of en-
Figure 4: Confusion between intent classes for the in- tertaining, and any analysis must account for this
tent classification task. The confusion matrix was ob- filter of entertainment. The Exhibitionist intent
tained using the Img + Txt-ELMo model and the results tends to be predicted well, likely due to its visual
are averaged over the 5 splits.
and textual signifiers of individuality (e.g. selfies
are almost always exhibitionist, as are captions

4628
Semiotic
Class Img Txt-emb Img+
Intent Txt-ELMo
Class Img Txt-ELMo Img+ Divergent 69.8 72.7 77.1
Txt-ELMo Additive 55.0 66.7 68.2
Provocative 85.5 84.1 90.0 Parallel 54.5 64.3 64.0
Informative 77.0 93.9 92.8 Mean 59.8 67.8 69.7
Advocative 84.8 82.4 87.4
Entertainment 69.0 78.6 80.5 Contextual
Exhibitionist 81.7 78.7 84.9 Class Img Txt-ELMo Img+
Expressive 57.9 72.0 73.2 Txt-ELMo
Promotive 76.3 88.5 90.1 Minimal 60.9 79.7 81.3
Mean 76.0 82.6 85.6 Close 60.5 73.8 74.6
Transcendent 66.1 82.0 81.2
Mean 62.5 78.5 79.0

Table 3: Class-wise results (AUC) for the three taxonomies with different DCNN models on the MDID dataset.
Except for the semiotic taxonomy we used ELMo text representations (based on the performance in Table 2).

like “I love my new hair”). There is a great deal together signify the concept of spending winter at
of confusion, however, between the expressive and home with pets before the fireplace. The contex-
exhibitionist categories, since the only distinction tual relationship is classified as transcendental; the
lies in whether the post is about a general topic or caption indeed goes well beyond the image.
about the poster, and between the provocative and
advocative categories, perhaps because both often The top-right image-caption pair (Image II) is
seek to prove points in a similar way. classified as entertainment; the image caption pair
With the contextual and semiotic taxonomies, works as an ironic reference to dancing (“yeet”)
some good results are obtained with text alone. In grandparents, who are actually reading, in lan-
the “transcendent” contextual case, it is not neces- guage used usually by young people that a typ-
sarily surprising that using text alone enables 82% ical grandparent would never use. The semiotic
AUC because whenever a caption is really long or relationship is classified as divergent and the con-
has many adverbs, adjectives or abstract concepts, textual relationship is classified as minimal; there
it is highly likely to be transcendent. In the “diver- is semantic and semiotic divergence of the image-
gent” semiotics case, we were surprised that text caption pair caused by the juxtaposition of youth-
alone would predict divergence with 72.7% AUC. ful references with older people.
Examining these cases showed that many of them
had lexical cues suggesting irony or sarcasm, al- To further understand the role of meaning mul-
lowing the system to infer that the image will di- tiplication, we consider the change in intent and
verge in keeping with the irony. There is, how- semiotic relationships when the same image of the
ever, a consistent improvement when both modal- British Royal Family is matched with two differ-
ities are used for both taxonomies. ent captions in the bottom row of Figure 5 (Image
IV). In both cases the semiotic relationship is par-
6.4 Sample Outputs allel, perhaps due to the match between the multi-
We show some sample successful outputs of the figure portrait setting and the word family. But
(multimodal) model in Figure 5, in which the high- the other two dimensions show differences. When
est probability class in each of the three dimen- the caption is “the royal family” our system clas-
sion corresponds to our gold labels. The top-left sifies the intent as entertainment; presumably such
Image-caption pair (Image I) is classified as ex- pictures and caption pairs often appear on Insta-
hibitionist, closely followed by expressive; it is a gram intending to entertain. But when the caption
picture of someone’s home with a caption describ- is “my happy family” the intent is classified as ex-
ing a domestic experience. The semiotic relation- pressive, perhaps due to the family pride expressed
ship is classified as additive; the image and caption in the caption.

4629
Figure 5: Sample successful output predictions for the three taxonomies, showing ranked classes and predicted
probabilities. In images IV the same image when paired with a different caption gives rise to a different intent.

7 Conclusion helpful in cases where the image and text diverged

semiotically points out the importance of these
We have proposed a model to capture the complex complex relations, and our taxonomies, dataset,
meaning multiplication relationship between im- and tools should provide impetus for the commu-
age and text in multimodal Instagram posts. Our nity to further develop more complex models of
three new taxonomies, adapted from the media and this important relationship.
semiotic literature, allow the literal, semiotic, and
illocutionary relationship between text and image 8 Acknowledgment
to be coded. Of course our new dataset and the
baseline classifier models are just a preliminary ef- This project is sponsored by the Office of
fort, and future work will need to examine larger Naval Research (ONR) under the contract number
datasets, consider additional data such as hashtags, N00014-17C-1008 and by DARPA Communicat-
richer classification schemes, and more sophisti- ing with Computers (CwC) program under ARO
cated classifiers. Some of these may be domain- prime contract number W911NF- 15-1-0462. Dis-
specific. For example, Alikhani et al. (2019) show claimer: The views, opinions, and/or findings ex-
how to develop rich coherence relations that model pressed are those of the author(s) and should not
the contextual relationship between recipe text and be interpreted as representing the official views or
accompanying images (specific versions of Elabo- policies of the Department of Defense or the U.S.
ration or Exemplification such as “Shows a tool Government. We would like to thank the reviewers
used in the step but not mentioned in the text”). for their incisive critique. We would like to thank
Expanding our taxonomies with richer sets like Prof. Joyce Chai. Ajay Divakaran would like to
these is an important goal. Nonetheless, the fact dedicate this paper to his mother Mrs. Bharathi
that we found multimodal classification to be most Divakaran.

4630
References Bernie Hogan. 2010. The presentation of self in the age
of social media: Distinguishing performances and
Karuna Ahuja, Karan Sikka, Anirban Roy, and Ajay exhibitions online. Bulletin of Science, Technology
Divakaran. 2018. Understanding visual ads by & Society, 30(6):377–386.
aligning symbols and objects using co-attention.
arXiv preprint arXiv:1807.01448. Xinyue Huang and Adriana Kovashka. 2016. Inferring
visual persuasion via body language, setting, and
Xavier Alameda-Pineda, Andrea Pilzer, Dan Xu, Nicu deep features. In CVPRW, pages 73–79.
Sebe, and Elisa Ricci. 2017. Viraliency: Pooling
local virality. In CVPR. Drew A. Hudson and Christopher D. Manning. 2019.
Gqa: A new dataset for real-world visual reasoning
Malihe Alikhani, Sreyasi Nag Chowdhury, Gerard and compositional question answering. In Proceed-
de Melo, and Matthew Stone. 2019. CITE: A cor- ings of the IEEE Conference on Computer Vision
pus of image–text discourse relations. In NAACL. and Pattern Recognition, pages 6700–6709.
John Langshaw Austin. 1962. How to Do Things with Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang,
Words. Clarendon Press. Keren Ye, Christopher Thomas, Zuha Agha, Nathan
Ong, and Adriana Kovashka. 2017. Automatic un-
John Bateman. 2014. Text and image: A critical intro-
derstanding of image and video advertisements. In
duction to the visual/verbal divide. Routledge.
CVPR, pages 1100–1110. IEEE.
John Berger. 1972. Ways of Seeing. Penguin.
László A Jeni, Jeffrey F Cohn, and Fernando
Yonatan Bisk, Jan Buys, Karl Pichotta, and Yejin Choi. De La Torre. 2013. Facing imbalanced data–
2019. Benchmarking hierarchical script knowledge. recommendations for the use of performance met-
In Proceedings of NAACL 2019, pages 4077–4085. rics. In ACII, pages 245–251. IEEE.

Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Tor- Jungseock Joo, Weixin Li, Francis F Steen, and Song-
ralba, and Frédo Durand. 2018. What do differ- Chun Zhu. 2014. Visual persuasion: Inferring com-
ent evaluation metrics tell us about saliency models? municative intents of images. In CVPR, pages 216–
PAMI. 223.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr- Aditya Khosla, Atish Das Sarma, and Raffay Hamid.
ishna Vedantam, Saurabh Gupta, Piotr Dollár, and 2014. What makes an image popular? In WWW,
C Lawrence Zitnick. 2015. Microsoft coco captions: pages 867–876. ACM.
Data collection and evaluation server. arXiv preprint
arXiv:1504.00325. Aditya Khosla, Akhil S Raju, Antonio Torralba, and
Aude Oliva. 2015. Understanding and predicting
Arturo Deza and Devi Parikh. 2015. Understanding image memorability at a large scale. In ICCV, pages
image virality. In CVPR, pages 1818–1826. 2390–2398.

Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and R Kloepfer. 1977. Komplementaritat von Sprache
Sanja Fidler. 2018. Vse++: Improving visual- und Bild, Am Beispiel von Comic, Karikatur, und
semantic embeddings with hard negatives. In Reklame. In R. Posner and H.-P. Reinecke, edi-
BMVC. tors, Zeichenprozesse, Semiotische Forschung in den
Einzelwissenschaften, pages 129–145. Athenaum.
Erving Goffman. 1978. The presentation of self in ev-
eryday life. Anchor Books, New York, NY. Jamie Mahoney, Tom Feltwell, Obinna Ajuruchi, and
Shaun Lawson. 2016. Constructing the visual on-
Yash Goyal, Tejas Khot, Douglas Summers-Stay, line political self: an analysis of Instagram use by
Dhruv Batra, and Devi Parikh. 2017. Making the the Scottish Electorate. In CHI, pages 3339–3351.
v in vqa matter: Elevating the role of image under- ACM.
standing in visual question answering. In CVPR, 2,
page 3. Emily E Marsh and Marilyn Domas White. 2003. A
taxonomy of relationships between images and text.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Journal of Documentation, 59(6):647–672.
Sun. 2016. Deep residual learning for image recog-
nition. In CVPR, pages 770–778. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-
rado, and Jeff Dean. 2013. Distributed representa-
Stephen Heath et al. 1977. Image-music-text. Fontana, tions of words and phrases and their compositional-
London. ity. In NIPS, pages 3111–3119.

Jack Hessel, Lillian Lee, and David Mimno. 2017. Nasrin Mostafazadeh, Chris Brockett, Bill Dolan,
Cats and captions vs. creators and the clock: Com- Michel Galley, Jianfeng Gao, Georgios Sp-
paring multimodal content to context in predicting ithourakis, and Lucy Vanderwende. 2017. Image-
relative popularity. In WWW, pages 927–936. grounded conversations: Multimodal context for

4631
natural question and response generation. In Pro-
ceedings of the Eighth International Joint Confer-
ence on Natural Language Processing, pages 462–
472.
Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Mar-
garet Mitchell, Xiaodong He, and Lucy Vander-
wende. 2016. Generating natural questions about
an image. In Proceedings of the ACL 2016, pages
1802–1813.
Duy-Kien Nguyen and Takayuki Okatani. 2018. Im-
proved fusion of visual and language representations
by dense symmetric co-attention for visual question
answering. arXiv preprint arXiv:1804.00775.
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word rep-
resentations. arXiv preprint arXiv:1802.05365.
Behjat Siddiquie, David Chisholm, and Ajay Di-
vakaran. 2015. Exploiting multimodal affect and
semantics to identify politically persuasive web
videos. In ICMI, pages 203–210.
Mohammad Soleymani, David Garcia, Brendan Jou,
Björn Schuller, Shih-Fu Chang, and Maja Pantic.
2017. A survey of multimodal sentiment analysis.
IVC, 65:3–14.
Mathias Stager, Paul Lukowicz, and Gerhard Troster.
2006. Dealing with class skew in context recogni-
tion. In ICDCSW, pages 58–58. IEEE.
Lydia Weiland, Ioana Hulpus, Simone Paolo Ponzetto,
Wolfang Effelsberg, and Laura Dietz. 2018.
Knowledge-rich image gist understanding beyond
literal meaning. arXiv preprint 1904.08709.
Laura Wendlandt, Rada Mihalcea, Ryan L. Boyd, and
James W. Pennebaker. 2017. Multimodal analysis
and prediction of latent user dimensions. In Interna-
tional Conference on Social Informatics, pages 323–
340.
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin
Choi. 2018. From recognition to cognition: Vi-
sual commonsense reasoning. arXiv preprint
arXiv:1811.10830.
Mingda Zhang, Rebecca Hwa, and Adriana Kovashka.
2018. Equal but not the same: Understanding the
implicit relationship between persuasive images and
text. arXiv preprint arXiv:1807.08205.