
Knowledge-Based Systems 294 (2024) 111778

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

What do they ‘‘meme’’? A metaphor-aware multi-modal multi-task framework for fine-grained meme understanding

Bingbing Wang a,b, Shijue Huang a,b, Bin Liang d, Geng Tu a,b, Min Yang e, Ruifeng Xu a,b,c,∗

a Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong, 518055, China
b Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, Guangdong, 518055, China
c Peng Cheng Laboratory, Shenzhen, Guangdong, China
d The Chinese University of Hong Kong, Hong Kong, China
e Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, 518055, China

ARTICLE INFO

Keywords: Meme understanding; Metaphorical information; Inter-modality attention; Intra-modality attention; Multi-task learning

ABSTRACT

Fine-grained meme understanding aims to explore and comprehend the meanings of memes from multiple perspectives by performing various tasks, such as sentiment analysis, intention detection, and offensiveness detection. Existing approaches primarily focus on simple multi-modality fusion and individual task analysis. However, there remain several limitations that need to be addressed: (1) the neglect of incongruous features within and across modalities, and (2) the lack of consideration for correlations among different tasks. To this end, we leverage metaphorical information as text modality and propose a Metaphor-aware Multi-modal Multi-task Framework (M3F) for fine-grained meme understanding. Specifically, we create inter-modality attention enlightened by the Transformer to capture inter-modality interaction between text and image. Moreover, intra-modality attention is applied to model the contradiction between the text and metaphorical information. To learn the implicit interaction among different tasks, we introduce a multi-interactive decoder that exploits gating networks to establish the relationship between various subtasks. Experimental results on the MET-Meme dataset show that the proposed framework outperforms the state-of-the-art baselines in fine-grained meme understanding.

1. Introduction

Memes are cultural influences transmitted through the Internet, particularly on popular social media platforms like Twitter and Facebook. Fine-grained meme understanding is the task of understanding memes from the perspective of multiple subtasks. In most cases, an ordinary sentence or a picture may not convey a specific emotional meaning, but their combination brings about meaning. Therefore, it is significant to consider both modalities to understand the meaning of the meme. As memes are often subtle and their true underlying meaning may be expressed in an implicit way, fine-grained meme understanding becomes a challenging task.

Several multi-modal meme datasets have been created for further study [1,2]. In recent years, the popularity of memes has sparked the interest of linguists, who demonstrated that memes construe metaphors that are found in real life not only in languages but also via thoughts and actions [3,4]. Moreover, the study of [5] delineated two conceptual metaphor mappings, encompassing the source and target domains. To deeply understand the metaphor information in memes, Xu et al. [6] constructed a multi-modal metaphor meme dataset, MET-Meme, in two languages, which has addressed the regrettable lack of metaphor information. Xu et al. [6] also proposed four fine-grained subtasks consisting of sentiment analysis, intention detection, metaphorical detection, and offensiveness detection.

Despite the promising progress made by the above works, these approaches primarily rely on the concatenation or addition of features from multiple modalities, addressing each subtask independently and neglecting two noteworthy aspects: incongruous characteristics and closely related labels for each meme. For instance, as shown in Fig. 1(a), the image depicts a man being attracted to another woman, while his girlfriend is displeased by this. However, the text in the image is ‘‘APPLE FANS; IPHONE 10; IPHONE 11’’, which is unrelated to the image content, highlighting cross-modal inconsistency. Furthermore, metaphorical information and the text within the image both belong to the text modality; the source domain mentions ‘‘Man; Woman; Woman’’, while the image text reads ‘‘APPLE FANS; IPHONE 10; IPHONE 11’’. These appear unrelated, illustrating inconsistency within the same modality. As demonstrated above, the incongruity between and within modalities in memes may lead to misunderstandings of the meme’s message, requiring the combination of image, text, as well as metaphorical information for a complete understanding of its meaning.

∗ Corresponding author at: Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong, 518055, China.
E-mail address: [email protected] (R. Xu).

https://doi.org/10.1016/j.knosys.2024.111778
Received 12 September 2023; Received in revised form 9 March 2024; Accepted 7 April 2024; Available online 10 April 2024
0950-7051/© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Fig. 1. Examples of metaphor information in the multi-modal meme.

Further analysis reveals that merely examining the text in the meme depicted in Fig. 1(a) does not convey the sentiment, but the image confirms its expressive nature with a sentiment of love, hence making it non-offensive. Similarly, in Fig. 1(b), knowing that the meme carries a sentiment of hate increases the likelihood of it being offensive. We can observe that the sentiment, intention, and offensiveness of memes are closely intertwined. Through effective representation, training, and evaluation, these relationships can be modeled to enhance the model’s performance collaboratively.

In this way, this paper investigates the congruity between different as well as the same modalities and explores their potential using multi-task learning to enhance various tasks in fine-grained meme understanding, including sentiment analysis, intention detection, offensiveness detection, and metaphor recognition. Specifically, we propose a novel Metaphor-aware Multi-modal Multi-task Framework (M3F).1 Through multi-task learning, we can simultaneously consider and model the complex relationships among the tasks and facilitate collaborative learning and knowledge transfer within the model, further improving its performance. By sharing underlying representations and feature learning processes, relevant information between different tasks can be mutually reinforced, allowing the model to better utilize data throughout the entire training process, thereby enhancing overall effectiveness. To reach this goal, we devise an inter-modality attention for modeling the interaction between image and text. Additionally, intra-modality attention is designed to learn the congruity of text and metaphorical information. To learn the implicit interaction among different tasks, we present a multi-interactive decoder (MID) to establish the relationship between various tasks. Concretely, VGG16 and Multilingual BERT are exploited to generate image features, text embedding, source embedding, and target embedding from the image, the text, and the source and target domains of the metaphorical information. For inter-modality attention, we propose a cross-attention Transformer that leverages the image embedding as the query and the text features as the key and value to elicit the inter-modality representation. Furthermore, to effectively utilize intra-modality attention between text and metaphorical information and obtain the intra-modality representation, we employ the text embedding as the query and consider the source domain and target domain as the key and value. Notably, using target and source domain features for metaphor recognition is not advisable due to the potential leakage of metaphorical information, which could result in inaccurate metaphor recognition outcomes. Therefore, we use the inter-modality representation for metaphor recognition and additionally employ a dynamic replacement strategy to guide the source and target domains for subsequent subtasks. Extensive experiments conducted on the MET-Meme dataset verify the superiority of our framework and show that our proposed M3F method significantly outperforms recent state-of-the-art solutions.

To summarize, the main contributions of our work are as follows:

• The fine-grained meme understanding work is approached from a novel perspective that investigates the interplay between metaphor cues and explores the potential relationships within and across modalities and subtasks.
• A novel Metaphor-aware Multi-modal Multi-task Framework (M3F) is proposed to improve the utilization of meme information and metaphor integration at both inter-modal and intra-modal levels, as well as capture the commonness across tasks.
• Performance evaluation on the MET-Meme benchmark dataset demonstrates the robustness and superiority of the proposed framework compared to the state-of-the-art baselines in fine-grained meme understanding.

2. Method

In this section, we describe the proposed M3F method in detail. As illustrated in Fig. 2, M3F primarily consists of four components: (1) Feature extraction, which applies VGG16 and Multilingual BERT to extract features from the images, texts, source, and target domains. (2) Inter-modality attention, which constructs the relationship between images and texts, and obtains the result of metaphor recognition to dynamically guide the source and target domains. (3) Intra-modality attention, which captures the congruity among texts and metaphorical information. (4) Multi-interactive decoder, which learns the implicit interaction among tasks.

2.1. Task definition

Formally, supposing there is an example comprising an image V, a corresponding text T, a source domain M_s, and a target domain M_g, the goal of multi-modal meme analysis is to predict the metaphor categories y_MR, sentiment categories y_SA, intention categories y_ID, and offensiveness categories y_OD. It should be noted that the source and target domains refer to metaphorical information, usually in the form of text. As shown in Fig. 1, the source domain forms the foundation of the metaphor, while the target domain represents the concept or idea being metaphorically expressed.

2.2. Feature extraction

Given an input example for each sample x = (x_v, x_t, x_s, x_g), x_v represents the image input and x_t denotes the text input with L_t tokens, while x_s with L_s tokens and x_g with L_g tokens represent the source domain and target domain in the metaphor text, respectively. Four encoders are used for feature extraction: three are Multilingual BERT [7] for extracting text features, and the other is VGG16 [8] for extracting image features. The text representations x_t, x_s, x_g are input into the Multilingual BERT encoder, which produces hidden representations e_t ∈ R^{N×d_t}, e_s ∈ R^{N×d_s}, and e_g ∈ R^{N×d_g} for all the tokens in the text modality, where N represents the number of tokens, and d_t, d_s, and d_g represent the respective dimensions of the hidden representations:

e_t = \mathrm{M\text{-}BERT}([CLS]\, x_t\, [SEP])_{1:L_t},
e_s = \mathrm{M\text{-}BERT}([CLS]\, x_s\, [SEP])_{1:L_s},      (1)
e_g = \mathrm{M\text{-}BERT}([CLS]\, x_g\, [SEP])_{1:L_g},

1 The code of this work is released at: github.com/Vincy2King/M3F-MEME.

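As a concrete illustration of the text-side encoding in Eq. (1), the sketch below feeds a meme text, a source-domain phrase, and a target-domain phrase through Multilingual BERT. It is a minimal example assuming the Hugging Face `transformers` package; the helper name `encode_text` and the placeholder target-domain string are illustrative, not taken from the released code.

```python
# Minimal sketch of Eq. (1): token-level hidden states from Multilingual BERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
m_bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def encode_text(sentences):
    """Return last-layer hidden states; [CLS]/[SEP] are added by the tokenizer."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = m_bert(**batch)
    return out.last_hidden_state  # shape: (batch, tokens, 768)

e_t = encode_text(["APPLE FANS; IPHONE 10; IPHONE 11"])  # text inside the meme (Fig. 1(a))
e_s = encode_text(["Man; Woman; Woman"])                  # source domain
e_g = encode_text(["a new phone release"])                # target domain (placeholder string)
```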

Fig. 2. The architecture of the proposed M3F method. The metaphor recognition labels are obtained from the inter-modality representation, while the sentiment analysis, intention detection, and offensiveness detection labels are output by the multi-interactive decoder. This design prevents the source and target domain data from influencing the metaphor recognition results.

Fig. 3. The architecture of the inter-modality attention and intra-modality attention.

We employ M-BERT as the Multilingual BERT encoder. For processing meme images and extracting visual features, we utilize VGG16 [8], a pre-trained convolutional neural network-based classifier. Following the architecture specified in [6], our implementation of VGG16 consists of 5 max-pooling layers and 16 weighted layers. Meme images are resized to dimensions of 224 × 224 × 3 and fed into VGG16 to obtain image representations in R^{N×4096}, where N represents the number of memes:

e_v = \mathrm{VGG16}(x_v).      (2)

Then, the output representations are mapped to the same d_h-dimensional hidden vector space using four projection layers p_v(·), p_t(·), p_s(·), p_g(·), which produce h_v, h_t, h_s, h_g ∈ R^{N×d_h}. Each projection layer is implemented as a dense layer with a ReLU activation function.
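The following sketch mirrors the visual branch of Eq. (2) and the projection layers just described, assuming `torchvision`'s VGG16. The 4096-dimensional feature is taken from the penultimate classifier layer; the shared hidden size `d_h = 768` is an assumption for illustration, not a value reported in the paper.

```python
# Sketch of the visual encoder (Eq. (2)) and one projection layer p_v(.).
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VisualEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = vgg16()                                   # 16 weighted layers, 5 max-pooling layers
        self.features, self.avgpool = vgg.features, vgg.avgpool
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:-1])  # drop the 1000-way head -> 4096-d

    def forward(self, images):                          # images: (N, 3, 224, 224)
        x = self.avgpool(self.features(images))
        return self.fc(torch.flatten(x, 1))             # (N, 4096)

d_h = 768                                               # assumed shared hidden dimension
project_v = nn.Sequential(nn.Linear(4096, d_h), nn.ReLU())  # p_v(.); p_t, p_s, p_g are built the same way

h_v = project_v(VisualEncoder()(torch.randn(2, 3, 224, 224)))  # (2, d_h)
```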
2.3. Inter-modality attention

Recent studies have empirically and theoretically demonstrated that multi-modal fusion may encounter challenges when dealing with low-quality multi-modal data [9], such as imbalanced datasets. Unlike other multi-modal tasks, in the context of metaphor recognition both image and text modalities carry rich metaphorical information, and their correlations are often indirect and subtle. Therefore, the core idea of inter-modality attention is to establish direct connections between image and text features to capture and leverage this complementary cross-modal information. In this section, we describe how to construct the inter-modality attention for each multi-modal meme between the two modalities of text and image.

Given the input features of this module, h_v, h_t ∈ R^{N×d_h}, the query Q, key K, and value V are computed from h_v and h_t with weights W_Q, W_K, W_V ∈ R^{d_h×d_h}. As shown in Fig. 3(a), our inter-modality attention accepts the image features h_v as queries and the text embeddings h_t as keys and values. This allows the model to prioritize incongruous text contents guided by the image features. Concretely, the i-th head of the inter-modality attention can be expressed in the following form:

\mathrm{Att}_i(h_v, h_t) = \sigma\!\left(\frac{[W_i^Q h_v][W_i^K h_t]^\top}{\sqrt{d_k}}\right)[W_i^V h_t],      (3)

where d_k = d_h / m, Att_i(h_v, h_t) ∈ R^{N×d_k}, {W_i^Q, W_i^K, W_i^V} ∈ R^{d_k×d_h} are learnable parameters, and σ denotes a softmax function. The outputs of the m heads are then concatenated and passed through a linear transformation to produce the final output:

\mathrm{M\text{-}Att}(h_v, h_t) = [\mathrm{Att}_1(h_v, h_t), \ldots, \mathrm{Att}_m(h_v, h_t)]\, W^o,      (4)

where W^o ∈ R^{d_h×d_h} is a learnable parameter. Afterward, a feedforward network with linear-based activation functions is applied. The output of the inter-modality attention is then processed as follows:

H_{inter} = \mathrm{FN}(\mathrm{M\text{-}Att}(h_v, h_t)),      (5)

where FN is the feedforward network and H_inter denotes the inter-modality representation.

It is important to highlight that both the source and target domains contain metaphorical information. However, utilizing these features for metaphor recognition may result in the leakage of metaphorical information and inaccurate outcomes. Consequently, we use the inter-modality representation for metaphor recognition and propose a strategy to dynamically modify the source and target domains.

Metaphor Recognition. The inter-modality representation is first exploited to perform metaphor recognition, thereby detecting whether the meme contains metaphorical information. This procedure consists of a linear layer for dimension reduction and a softmax function for the probability distribution of each category. We utilize the standard gradient descent algorithm to train the model by minimizing the cross-entropy loss:

\hat{y}_{MR} = \sigma(W_{MR} H_{inter} + b_{MR}),      (6)

\min_{\Theta} \mathcal{L}_{MR} = -\sum_{j=1}^{N} y^{j}_{MR} \log \hat{y}^{j}_{MR} + \lambda_{MR} \lVert \Theta_{MR} \rVert^2,      (7)

where W_{MR} ∈ R^{d_h×d_h} is a learnable parameter and b_{MR} is the bias trained along with the model, and σ indicates the softmax function. y_MR and ŷ_MR are the ground-truth and estimated label distributions of the metaphor recognition task, Θ_MR represents all the model’s learnable parameters, and λ_MR denotes the coefficient of L2-regularization.

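A compact sketch of the cross-attention in Eqs. (3)–(5) is given below: image features act as queries and text features as keys and values, followed by a feed-forward layer. It relies on PyTorch's `nn.MultiheadAttention`; the head count, hidden size, and feed-forward design are assumptions for illustration rather than the paper's exact configuration.

```python
# Hedged sketch of inter-modality attention: query = image, key/value = text (Eqs. (3)-(5)).
import torch
import torch.nn as nn

class InterModalityAttention(nn.Module):
    def __init__(self, d_h=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_h, num_heads=n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h))

    def forward(self, h_v, h_t):
        attended, _ = self.attn(query=h_v, key=h_t, value=h_t)  # multi-head cross-attention, Eqs. (3)-(4)
        return self.ffn(attended)                               # H_inter, Eq. (5)

h_v = torch.randn(2, 1, 768)    # image features treated as a length-1 sequence
h_t = torch.randn(2, 32, 768)   # token-level text features
H_inter = InterModalityAttention()(h_v, h_t)   # shape (2, 1, 768)
```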

Dynamic Replacement. The metaphor recognition results primarily focus on determining whether a meme contains a metaphorical expression or not. As a result, the labels assigned to the memes are categorized as either metaphorical or literal. The metaphorical label indicates the presence of a metaphorical occurrence in the meme, while the literal label signifies the absence of metaphorical elements. When the result is the literal label, the source and target domains are replaced by ‘‘[CLS] [SEP]’’.
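The sketch below combines the metaphor-recognition head of Eq. (6) with the dynamic replacement rule just described. The mapping of the literal class to index 0 and the helper names are illustrative assumptions.

```python
# Sketch: metaphor head (Eq. (6)) plus dynamic replacement of source/target text.
import torch
import torch.nn as nn

class MetaphorHead(nn.Module):
    def __init__(self, d_h=768, n_classes=2):
        super().__init__()
        self.linear = nn.Linear(d_h, n_classes)          # dimension reduction

    def forward(self, H_inter):                          # H_inter: (N, d_h), e.g. pooled
        return torch.softmax(self.linear(H_inter), dim=-1)

def dynamic_replacement(pred_probs, source_texts, target_texts, literal_index=0):
    """If a meme is predicted literal, replace its source/target text with '[CLS] [SEP]'."""
    labels = pred_probs.argmax(dim=-1)
    new_source, new_target = [], []
    for lab, s, g in zip(labels.tolist(), source_texts, target_texts):
        if lab == literal_index:
            new_source.append("[CLS] [SEP]")
            new_target.append("[CLS] [SEP]")
        else:
            new_source.append(s)
            new_target.append(g)
    return new_source, new_target
```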
2.4. Intra-modality attention

In the context of meme analysis, the inclusion of metaphorical information from both the source and target domains is crucial. It serves as a union of textual modality information and text within meme images, aiding in understanding and interpreting the underlying meanings of memes. As depicted in Fig. 1(a), if the model can infer the metaphorical information associated with the text ‘‘APPLE FANS; IPHONE 10; IPHONE 11’’ and relate it to ‘‘Man; Woman; Woman’’, it would grasp the expectation of an iPhone fan towards the iPhone 11. Similarly, by understanding the metaphorical information behind ‘‘Lungs’’ as the source domain and ‘‘Orange’’ as the target domain, the model can further discern the implied distinction between the lungs of a vegetarian (represented by slices of orange) and a non-vegetarian (depicted by an anatomical illustration). Furthermore, the importance of intra-modality attention lies in its ability to capture the internal correlations within the text modality. The core principle of intra-modality attention is to dynamically allocate attention weights to different parts or elements within the same modality.

Similar to inter-modality attention, we adopt Q, K, and V to represent the query, key, and value in this module. In this case, Q is computed based on h_t, while K and V are calculated using h_s and h_g. Given the weights W_Q, W_K, W_V ∈ R^{d_h×d_h}, then Q = h_t W_Q, K = h_s W_K, V = h_g W_V. The output of the n heads is determined as below:

\mathrm{M\text{-}Att}(h_t, h_s, h_g) = [\mathrm{Att}_1(h_t, h_s, h_g), \ldots, \mathrm{Att}_n(h_t, h_s, h_g)]\, W^j,      (8)

\mathrm{Att}_i(h_t, h_s, h_g) = \sigma\!\left(\frac{[W_i^Q h_t][W_i^K h_s]^\top}{\sqrt{d_l}}\right)[W_i^V h_g],      (9)

where W^j ∈ R^{d_h×d_h} is a learnable parameter, d_l = d_h / n, Att_i(h_t, h_s, h_g) ∈ R^{N×d_l}, {W_i^Q, W_i^K, W_i^V} ∈ R^{d_l×d_h} are learnable parameters, and σ denotes the softmax function. Afterward, a feedforward network with linear-based activation functions is employed to obtain the intra-modality representation as:

H_{intra} = \mathrm{FN}(\mathrm{M\text{-}Att}(h_t, h_s, h_g)),      (10)

where FN is the feedforward network and H_intra denotes the intra-modality representation.
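For contrast with the inter-modality module, a single-head sketch of Eqs. (8)–(10) is shown below: the meme text provides the queries, the source domain the keys, and the target domain the values. One head is shown for brevity (the multi-head extension mirrors the sketch above), and the example assumes source and target sequences of equal length so that the weighted sum is well defined.

```python
# Single-head sketch of intra-modality attention (Eqs. (8)-(10)).
import torch
import torch.nn as nn

class IntraModalityAttention(nn.Module):
    def __init__(self, d_h=768):
        super().__init__()
        self.w_q = nn.Linear(d_h, d_h, bias=False)   # Q from text
        self.w_k = nn.Linear(d_h, d_h, bias=False)   # K from source domain
        self.w_v = nn.Linear(d_h, d_h, bias=False)   # V from target domain
        self.ffn = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h))

    def forward(self, h_t, h_s, h_g):
        q, k, v = self.w_q(h_t), self.w_k(h_s), self.w_v(h_g)
        scores = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)  # Eq. (9)
        return self.ffn(scores @ v)                                                   # H_intra, Eq. (10)

H_intra = IntraModalityAttention()(torch.randn(2, 32, 768),   # text tokens
                                   torch.randn(2, 8, 768),    # source-domain tokens
                                   torch.randn(2, 8, 768))    # target-domain tokens
```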
2.5. Multi-interactive decoder

After obtaining the inter-modality representation H_inter and intra-modality representation H_intra, we concatenate them with the image features h_v, the text embedding h_t, as well as the source and target embeddings h_s, h_g to produce the multi-modal representation H_mul, which plays an essential role in the following steps. While employing H_mul may offer a viable approach for directly predicting the three tasks, it fails to comprehensively capture the unique characteristics inherent to each task. Inspired by [10], we propose a multi-interactive decoder, depicted in Fig. 4, which comprises three steps.

Step 1. To begin, we generate seven distinct representations by employing seven individual linear layers on H_mul: a tri-task representation denoted as H_{s,i,o}, three bi-task representations, namely H_{s,i}, H_{s,o}, and H_{i,o}, and three task-specific representations H_s, H_i, and H_o. Here, s represents the sentiment analysis task, and i and o correspond to the intention detection and offensiveness detection tasks, respectively.

Step 2. To facilitate knowledge transfer among the three tasks, a gating layer that contains a linear transformation and a self-attention mechanism is introduced. We utilize seven separate gating layers to obtain the updated tri-task, bi-task, and task-specific representations. The updates of the tri-task, bi-task, and task-specific representations aim to learn shared representations that facilitate knowledge transfer across the three tasks, capture the correlation between two tasks, and develop task-specific representations, respectively.

To better aggregate the tri-task, bi-task, and task-specific information, we first utilize linear layers to project the concatenation of the corresponding representations from Step 1:

E_{s,i,o} = W_{s,i,o}(H_{s,i,o} ⊕ H_{s,i} ⊕ H_{s,o} ⊕ H_{i,o} ⊕ H_s ⊕ H_i ⊕ H_o) + b_{s,i,o},
E_{s,i} = W_{s,i}(H_{s,i} ⊕ H_s ⊕ H_i) + b_{s,i},
E_{s,o} = W_{s,o}(H_{s,o} ⊕ H_s ⊕ H_o) + b_{s,o},
E_{i,o} = W_{i,o}(H_{i,o} ⊕ H_i ⊕ H_o) + b_{i,o},      (11)
E_s = W_s(H_{s,i,o} ⊕ H_{s,i} ⊕ H_{s,o} ⊕ H_s) + b_s,
E_i = W_i(H_{s,i,o} ⊕ H_{s,i} ⊕ H_{i,o} ⊕ H_i) + b_i,
E_o = W_o(H_{s,i,o} ⊕ H_{s,o} ⊕ H_{i,o} ⊕ H_o) + b_o,

where ⊕ is the concatenation operation, W_{s,i,o}, W_{s,i}, W_{s,o}, W_{i,o}, W_s, W_i, W_o are trainable parameters, and b_{s,i,o}, b_{s,i}, b_{s,o}, b_{i,o}, b_s, b_i, b_o are biases.

We then apply self-attention to each E representation. Specifically, each concatenated representation undergoes projections to obtain the corresponding K, Q, and V values. Dot products are then computed between key–query pairs, followed by scaling to ensure stable training. Subsequently, a softmax operation is applied to normalize the results. Finally, a weighted sum is calculated using the output from the V projection:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.      (12)

In detail, the updated representations can be obtained by:

E'_{s,i,o} = E_{s,i,o}(1 + \mathrm{Attention}(E_{s,i,o})),
E'_{s,i} = E_{s,i}(1 + \mathrm{Attention}(E_{s,i})),
E'_{s,o} = E_{s,o}(1 + \mathrm{Attention}(E_{s,o})),
E'_{i,o} = E_{i,o}(1 + \mathrm{Attention}(E_{i,o})),      (13)
E'_s = E_s(1 + \mathrm{Attention}(E_s)),
E'_i = E_i(1 + \mathrm{Attention}(E_i)),
E'_o = E_o(1 + \mathrm{Attention}(E_o)).

Step 3. To further aggregate the corresponding information in a comprehensive manner, we concatenate all relevant representations to obtain U_s, U_i, U_o:

U_s = E'_{s,i,o} ⊕ E'_{s,i} ⊕ E'_{s,o} ⊕ E'_s,
U_i = E'_{s,i,o} ⊕ E'_{s,i} ⊕ E'_{i,o} ⊕ E'_i,      (14)
U_o = E'_{s,i,o} ⊕ E'_{s,o} ⊕ E'_{i,o} ⊕ E'_o.
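To make the gating scheme of Eqs. (11)–(14) concrete, the sketch below implements one gating layer (linear projection followed by a self-attention gate E' = E(1 + Att(E))) and the Step 3 aggregation for the sentiment branch. Dimensions, head count, and sequence layout are assumptions; only the structure follows the equations above.

```python
# Sketch of a gating layer and Step 3 aggregation in the multi-interactive decoder.
import torch
import torch.nn as nn

class GatingLayer(nn.Module):
    def __init__(self, d_in, d_h=768, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(d_in, d_h)                               # linear projection, Eq. (11)
        self.attn = nn.MultiheadAttention(d_h, n_heads, batch_first=True)

    def forward(self, parts):                                          # parts: list of (N, L, d_h) tensors
        e = self.proj(torch.cat(parts, dim=-1))
        a, _ = self.attn(e, e, e)                                      # scaled dot-product self-attention, Eq. (12)
        return e * (1 + a)                                             # gated update E' = E (1 + Att(E)), Eq. (13)

d_h, N, L = 768, 2, 16
H = {k: torch.randn(N, L, d_h) for k in ["sio", "si", "so", "io", "s", "i", "o"]}  # Step 1 outputs

gate_sio = GatingLayer(7 * d_h)                 # tri-task gate uses all seven representations
gate_si, gate_so = GatingLayer(3 * d_h), GatingLayer(3 * d_h)
gate_s = GatingLayer(4 * d_h)                   # task-specific gate for sentiment

E_sio = gate_sio(list(H.values()))
E_si = gate_si([H["si"], H["s"], H["i"]])
E_so = gate_so([H["so"], H["s"], H["o"]])
E_s = gate_s([H["sio"], H["si"], H["so"], H["s"]])
U_s = torch.cat([E_sio, E_si, E_so, E_s], dim=-1)   # Eq. (14); U_i and U_o are formed analogously
```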
3. Multi-task prediction

The outputs U_s, U_i, U_o are forwarded through the softmax function to yield the sentiment, intention, and offensiveness categories ŷ_SA, ŷ_ID, ŷ_OD.


Fig. 4. The architecture of the multi-interactive decoder, where each gating layer consists of a linear transformation and a self-attention mechanism. ⊕ indicates the concatenation operation.

We minimize the cross-entropy loss for training each task:

\min_{\Theta} \mathcal{L}_{SA} = -\sum_{j=1}^{N} y^{j}_{SA} \log \hat{y}^{j}_{SA} + \lambda_{SA} \lVert \Theta_{SA} \rVert^2,

\min_{\Theta} \mathcal{L}_{ID} = -\sum_{j=1}^{N} y^{j}_{ID} \log \hat{y}^{j}_{ID} + \lambda_{ID} \lVert \Theta_{ID} \rVert^2,      (15)

\min_{\Theta} \mathcal{L}_{OD} = -\sum_{j=1}^{N} y^{j}_{OD} \log \hat{y}^{j}_{OD} + \lambda_{OD} \lVert \Theta_{OD} \rVert^2,

\mathcal{L} = w_{SA} \mathcal{L}_{SA} + w_{ID} \mathcal{L}_{ID} + w_{OD} \mathcal{L}_{OD},

where y_k and ŷ_k are the ground-truth and estimated label distributions of the sentiment analysis, intention detection, and offensiveness detection tasks, respectively. Θ_k denotes all trainable parameters of the model and λ_k represents the coefficient of L2-regularization, where k ∈ {SA, ID, OD}. w_SA, w_ID, w_OD are weights adjusted during the learning process.
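A minimal sketch of the weighted multi-task objective in Eq. (15) is shown below. `nn.CrossEntropyLoss` already combines log-softmax and negative log-likelihood, and the explicit L2 terms are delegated to the optimizer's weight decay as a simplification; the class counts in the toy usage are placeholders, while the default weights follow the best English setting reported in Section 4.8.

```python
# Sketch of the weighted multi-task loss of Eq. (15).
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def multitask_loss(logits_sa, logits_id, logits_od, y_sa, y_id, y_od,
                   w_sa=0.4, w_id=0.2, w_od=0.4):
    return (w_sa * ce(logits_sa, y_sa)
            + w_id * ce(logits_id, y_id)
            + w_od * ce(logits_od, y_od))

# toy usage; the numbers of classes per task are placeholders, not dataset statistics
loss = multitask_loss(torch.randn(8, 7), torch.randn(8, 20), torch.randn(8, 2),
                      torch.randint(0, 7, (8,)), torch.randint(0, 20, (8,)),
                      torch.randint(0, 2, (8,)))
```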

4. Experiment

4.1. Dataset and metrics

There is a scarcity of datasets available for downstream learning tasks on metaphorical memes. Existing datasets for detecting propaganda [11], offensive content [12], and hate speech [1] are not specifically tailored for multi-modal metaphor research. Additionally, MetaCLUE [13], a recently published dataset of multi-modal metaphors, primarily features image-dominating metaphors, rendering it unsuitable for our task. However, two datasets, MET-Meme [6] and MultiMET [14], emerged as suitable candidates for investigating metaphorical memes. Among these, MET-Meme stood out due to its additional annotations towards metaphor understanding, including source and target domains, alongside baseline accuracy for downstream machine learning tasks. Notably, MET-Meme also provides supplementary labels for offensiveness. Hence, we selected MET-Meme as the dataset of choice for our study.

On the MET-Meme dataset, the metaphorical label indicates the presence of metaphor in a meme, while the literal label represents the absence of metaphor. Table 1 presents the basic statistics. The Chinese meme dataset consists of 6045 text-image pairs, covering six distinct categories including scenery, animals, animations, dolls, films, and humans. The English meme dataset is sourced from MEMOTION [15] and Google search, comprising 4000 text-image pairs. In total, the dataset comprises 10,045 text-image pairs. Following [6], accuracy (Acc), weighted precision (Pre), and recall (Rec) are selected as metrics to assess the performance of the classification tasks.

Table 1
MET-Meme dataset statistics.

Samples          Bilingual   English   Chinese
Average Words    9           12        7
Total Words      90,191      47,347    42,844
Metaphorical     3,441       1,114     2,327
Literal          6,604       2,886     3,718
Total Samples    10,045      4,000     6,045

4.2. Experimental setting

For the textual encoder, we apply a pre-trained Multilingual BERT (bert-base-multilingual-cased) with a learning rate of 2e−5. For the visual encoder, we train VGG16 with a learning rate of 1e−3 from scratch. The chosen optimizer is Adam [16], and the parameters of Multilingual BERT and the other model parameters are optimized separately. Following [6], we randomly divided the dataset into training, validation, and test sets with a split of 60%, 20%, and 20%. The implementation of our framework is carried out using PyTorch [17].
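The sketch below reflects the two-learning-rate setup described in Section 4.2, assuming standard PyTorch parameter groups: Multilingual BERT parameters use 2e−5 while the remaining parameters (VGG16 and the task-specific layers) use 1e−3. The attribute name `model.text_encoder` is an assumption, not taken from the released code.

```python
# Sketch of the optimizer setup: separate learning rates for BERT and the rest of the model.
import torch

def build_optimizer(model):
    bert_params = list(model.text_encoder.parameters())   # assumed attribute name
    bert_ids = {id(p) for p in bert_params}
    other_params = [p for p in model.parameters() if id(p) not in bert_ids]
    return torch.optim.Adam([
        {"params": bert_params, "lr": 2e-5},
        {"params": other_params, "lr": 1e-3},
    ])
```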
4.3. Baselines

In accordance with the methodology proposed by [6], we undertake four distinct tasks: sentiment analysis, intention detection, offensiveness detection, and metaphor recognition. A series of corresponding strong baselines is then adopted: (1) Uni-modal models that only use information from the image modality, namely VGG16 [8], DenseNet-161 [18], and ResNet-50 [19]. (2) Multi-modal models that use both the text and image modalities, including Multi-BERT_EfficientNet [20], Multi-BERT_ViT [21], Multi-BERT_PiT [22], MET (the model proposed by [6]), and MET_o (our reproduction of the MET model using the dataset split specific to our experimental setup). As the original dataset did not come with a partitioned training, testing, and validation set, we had to use randomly partitioned data to replicate the MET model. Meanwhile, for a fair comparison with [6], we also present the results of the two fusion methods, element-wise addition (add) and concatenation (cat), used in their work.


Table 2
Experimental results (%) on the MET-Meme dataset of four tasks. Results with † are retrieved from [6]. Bold indicates the best-performing model in each column. Results with ∗ indicate significance tests of our M3F over the other baseline models (p-value < 0.05), and ♯ indicates the performance of MET_o over the MET method. Fusion methods consist of element-wise add (add) and concatenation (cat).
Method English Chinese English Chinese
Acc Pre Rec Acc Pre Rec Acc Pre Rec Acc Pre Rec
Sentiment Analysis Intention Detection
VGG16 20.57 20.84 24.22 29.94 26.04 29.20 37.19 38.71 38.15 47.48 49.21 47.81
DenseNet-161 21.88 21.71 25.65 29.45 27.50 29.36 38.10 39.31 37.89 47.23 39.24 47.06
ResNet-50 21.74 18.63 21.35 29.36 27.50 29.28 39.19 37.12 40.10 47.15 39.23 47.06
Multi-BERT_EfficientNet 28.52 24.52 29.04 33.50 35.29 33.42 43.10 41.54 42.19 51.03 43.06 51.03
Multi-BERT_ViT 24.43 23.41 23.96 33.25 27.33 32.84 41.28 40.13 40.62 50.62 41.32 50.62
Multi-BERT_PiT 25.00 27.82 28.12 33.66 33.58 33.09 42.23 41.09 41.02 50.21 50.00 50.04
MET_add 24.65† 24.52† 25.26† 32.50† 32.62† 33.50† 40.32† 40.39† 41.28† 52.93† 52.68† 54.01†
MET𝑜 _add 25.91♯ 25.20♯ 25.52♯ 33.91♯ 33.44♯ 34.16♯ 41.02♯ 41.15♯ 41.54♯ 50.87 52.89♯ 51.28
MET_cat 27.68† 28.41† 29.82† 33.42† 34.33† 33.91† 38.56† 39.19† 39.84† 51.58† 51.48† 52.85†
MET𝑜 _cat 27.86♯ 28.64♯ 29.56 34.66♯ 35.26♯ 34.16♯ 40.10♯ 40.76♯ 40.11♯ 51.28 51.55♯ 53.43♯
ours_add 30.47∗ 33.45∗ 30.34∗ 39.95∗ 41.80∗ 39.87∗ 44.40∗ 41.89∗ 44.32∗ 55.25∗ 54.57∗ 55.00∗
ours_cat 29.82∗ 34.18∗ 30.73∗ 37.22∗ 39.55∗ 37.97 ∗ 44.10∗ 44.56∗ 43.53∗ 53.52∗ 54.72∗ 54.52∗
Offensiveness Detection Metaphor Recognition
VGG16 67.10 63.42 72.53 70.07 64.11 72.07 78.39 79.73 79.95 67.00 67.24 67.82
DenseNet-161 69.66 62.07 69.98 71.43 70.82 75.43 80.08 80.23 80.47 67.16 67.91 67.99
ResNet-50 69.21 64.62 72.57 73.24 69.62 75.74 80.34 81.22 80.86 67.74 67.63 67.66
Multi-BERT_EfficientNet 73.78 67.98 74.56 78.15 72.11 79.98 82.46 84.39 83.11 74.19 71.26 74.28
Multi-BERT_ViT 71.22 62.96 72.66 76.92 67.31 78.74 81.90 82.01 82.46 73.28 72.55 73.70
Multi-BERT_PiT 72.79 66.69 74.26 77.17 70.16 79.05 82.07 83.05 82.98 75.10 73.15 74.28
MET_add 68.39† 66.21† 72.14† 76.01† 74.76† 78.16† 81.33† 81.49† 82.29† 74.04† 74.51† 74.96†
MET𝑜 _add 69.53♯ 66.92♯ 73.18♯ 77.42♯ 74.94♯ 78.99♯ 81.64♯ 81.60♯ 83.20♯ 75.02♯ 72.81 75.67♯
MET_cat 67.25† 66.15† 74.48† 73.19† 71.59† 79.49† 82.39† 82.69† 83.33† 72.90† 72.80† 73.30†
MET𝑜 _cat 63.16 66.33♯ 74.61♯ 74.69♯ 72.05♯ 73.61 82.03 82.71♯ 83.85♯ 73.78♯ 73.46♯ 73.61♯
ours_add 76.17∗ 69.45∗ 76.19∗ 80.81∗ 76.00∗ 80.73∗ 83.98∗ 85.86∗ 84.38∗ 77.01∗ 72.94 82.68∗
ours_cat 74.09∗ 69.59∗ 76.15∗ 80.07∗ 76.20∗ 80.62∗ 83.20∗ 85.97∗ 85.81∗ 76.18∗ 73.02 80.00∗

Moreover, we adopt VGG16 for image feature extraction for a fair comparison with [6]. To explore the fusion for obtaining the multi-modal representation, we also examine two fusion operations: addition of all the features (add) and concatenation of all the features (cat).

4.4. Main result

We report the main experimental results of the multi-modal meme tasks on the test set of the MET-Meme benchmark dataset. Table 2 demonstrates the results for the four tasks, sentiment analysis, intention detection, offensiveness detection, and metaphor recognition, in English and Chinese. In terms of accuracy, precision, and recall on all tasks, our proposed M3F approach consistently achieves the highest performance. These results highlight the effectiveness of our model, which leverages inter-modality and intra-modality attention mechanisms along with a multi-interactive decoder. The improved performance across various metrics further emphasizes the advantages of our approach in capturing and leveraging the synergies between modalities and tasks.

On the other hand, comparing the results between the MET and MET_o methods, we can conclude that our dataset partitioning is similar to that of [6], and it performs slightly better. Furthermore, the significance tests of M3F over the baseline models demonstrate the effectiveness of our method, presenting a statistically significant improvement on most evaluation metrics with p-value < 0.05. Although the effects of the cat and add methods differ across tasks, they are all better than the baseline models. For example, in terms of accuracy in English, sentiment analysis showed the highest improvement of 9.90% and the lowest improvement of 1.30%; intention detection exhibited the highest enhancement of 7.21% and the lowest improvement of 1.00%; offensiveness detection displayed the highest boost of 13.01% and the lowest improvement of 0.31%. Compared with the other tasks, metaphor recognition shows relatively minor performance gains. This is because, different from the sentiment analysis, intention detection, and offensiveness detection tasks, the result of metaphor recognition is directly affected by the inter-modality attention, and metaphorical information needs to be reflected in both the image and text modalities. That is to say, the inter-modality attention allows the model to focus on the relevant cues and context across modalities, leading to improved performance in metaphor recognition.

Interestingly, we also noticed that the three methods Multi-BERT_EfficientNet, Multi-BERT_ViT, and Multi-BERT_PiT are comparable to the performance of MET. This may be because the image extraction method used by the MET method is a more traditional VGG or ResNet, so it is inferior to some of the latest vision-language models. Our M3F model using VGG16 as the image encoder still performs better than theirs, which reflects the superiority of our model.

4.5. Ablation study

To assess how different components affect performance, we perform an ablation study of the proposed M3F on three tasks using the concatenation method and report the accuracy results in Table 3. ‘‘ours’’ is our proposed method, ‘‘w/o Inter’’ represents not using the inter-modality representation, ‘‘w/o Intra’’ denotes not employing the intra-modality representation, and ‘‘w/o MID’’ indicates that we directly exploit the multi-modal representation to obtain the final results without utilizing the multi-interactive decoder. We can observe that the removal of the inter-modality representation sharply reduces the performance, which verifies that inter-modality attention is significant and effective in learning incongruity features between different modalities. Furthermore, the removal of intra-modality attention also leads to considerably poorer performance. This indicates that, compared with the simple way of directly concatenating or adding metaphorical information with text features, intra-modality attention enhances the model’s ability to capture finer incongruities between textual and metaphorical information. It is also worth noting that the removal of the multi-interactive decoder degrades the performance. The results suggest that the multi-interactive decoder plays an important role in capturing the interaction of the various tasks.


Table 3
Experimental results of ablation study. Acc, Pre, and Rec represent Accuracy, Precision, and Recall.
Method English Chinese English Chinese
Acc Pre Rec Acc Pre Rec Acc Pre Rec Acc Pre Rec
Sentiment Analysis Intention Detection
ours 29.82 34.18 30.73 37.22 39.55 37.97 44.10 44.56 43.53 53.52 54.72 54.52
-w/o Inter 29.04 30.60 29.95 37.06 38.30 37.14 43.49 43.08 42.58 52.77 53.82 53.02
-w/o Intra 28.65 33.34 28.65 36.15 38.65 36.02 43.10 42.25 42.58 52.36 53.53 53.11
-w/o MID 28.91 31.09 29.95 36.15 38.18 36.06 43.23 42.75 42.45 51.94 53.37 53.02
Offensiveness Detection
English Chinese
Acc Pre Rec Acc Pre Rec
ours 74.09 69.59 76.15 80.07 76.20 80.62
-w/o Inter 73.44 68.50 74.22 79.07 75.24 78.25
-w/o Intra 73.44 67.53 74.48 78.33 72.34 78.76
-w/o MID 73.18 67.11 74.09 77.17 71.82 78.14

Table 4
Experimental results (%) on the MET-Meme dataset of different backbones.
Backbone English Chinese English Chinese

Acc Pre Rec Acc Pre Rec Acc Pre Rec Acc Pre Rec

Sentiment Analysis Intention Detection

VGG16 29.82 34.18 30.73 37.22 39.55 37.97 44.10 44.56 43.53 53.52 54.72 54.52
DenseNet-161 30.73 28.72 30.21 40.69 43.12 40.61 44.27 45.49 44.60 54.76 59.07 54.01
ResNet-50 31.25 28.77 31.03 42.43 43.28 40.44 44.40 43.38 44.12 56.16 57.52 55.17
EfficientNet 31.12 28.68 31.76 40.45 41.71 40.53 43.23 43.35 43.10 55.42 60.07 55.25
ViT 30.86 29.71 30.21 40.53 41.43 40.67 43.49 45.11 44.27 56.91 61.49 56.41
PiT 30.08 30.97 30.23 40.86 42.52 40.83 44.14 42.22 44.35 56.16 58.82 56.07

Offensiveness Detection Metaphor Recognition

VGG16 74.09 69.59 76.15 80.07 76.20 80.62 83.20 85.97 85.81 76.18 73.02 80.00
DenseNet-161 75.26 68.22 75.13 81.14 76.55 81.06 82.68 84.68 83.79 76.34 72.23 79.59
ResNet-50 75.65 68.18 75.39 80.48 76.30 80.73 84.24 86.05 84.71 77.42 71.99 82.06
EfficientNet 75.00 68.64 75.68 81.22 76.99 81.24 83.33 84.65 83.85 77.58 72.10 81.22
ViT 75.78 69.84 76.04 81.56 77.03 81.47 83.13 85.51 83.98 78.08 74.02 82.47
PiT 74.87 70.37 75.61 80.81 76.89 80.95 83.46 85.29 83.77 77.83 73.22 83.30

4.6. Qualitative evaluation of backbones

This section analyzes different backbones for image feature extraction, including VGG16 [8], DenseNet [18], ResNet50 [19], Vision Transformer (ViT) [21], and Pooling-based Vision Transformer (PiT) [22]. The results are presented in Table 4. While different backbones do have a certain impact on the final performance, overall, models based on PiT demonstrate the best performance. Moreover, it is worth mentioning that regardless of the backbone employed, the performance surpasses the current state-of-the-art methods (as shown in Table 2), illustrating the feasibility of our proposed M3F method. Additionally, we observed that transformer-based architectures, including ViT and PiT, outperform the others. This is attributed to the attention mechanism of the Transformer architecture, which allows the model to capture global context information within images.

4.7. Case study

To qualitatively examine how M3F investigates incongruent information across different modalities, we offer an attention visualization of three representative examples that require the integration of both text and image data. The outcomes are depicted in Fig. 5.

It is important to emphasize that the primary focus lies in discerning incongruent information. Corresponding to the multi-modal meme tasks consisting of sentiment analysis, intention detection, offensiveness detection, and metaphor recognition, capturing the contradictory expressions between different modalities can contribute substantially to enhanced performance. In this way, we investigate inter- and intra-modality attention to effectively utilize incongruity information. As illustrated in Fig. 5, our model tends to focus on the inconsistent regions between the image and the text. The attention mechanisms for images and text can complement each other, as illustrated in Fig. 5(b), where the text focuses on the ‘‘unicorn’’ while it is overlooked in the meme image. Similarly, in Fig. 5(a), ‘‘spiderman’’ is attended to in vision but receives less attention in the text. Furthermore, as shown in Fig. 5, the model emphasizes the central portions of the ‘‘cat’’, while the text modality places the highest attention on the term ‘‘potluck’’. Through inter-modality attention, our model captures incongruity dependencies between the two modalities, enabling it to make accurate predictions for such instances. Furthermore, we can notice that the word ‘‘me’’ in the text modality refers to the ‘‘cat’’ in the image modality. Assisted by inter-modality attention, our model can establish a connection between these modalities, so as to learn this relationship for better performance.

4.8. Effect of hyperparameters

Table 5 presents the performance variation of our model on the English and Chinese datasets when adjusting the weights (w_SA, w_ID, w_OD) for the different tasks (sentiment analysis, intention detection, and offensiveness detection). These weight adjustments reflect the model’s emphasis on each sub-task during multi-task learning. Experimental results indicate that by appropriately adjusting these weights, significant improvements can be achieved in the model’s performance on specific tasks.


Fig. 5. Attention visualization of a representative example.

Table 5
Accuracy (%) under different task weights (w1 = w_SA, w2 = w_ID, w3 = w_OD).
English
w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD
0.1 0.1 0.8 22.25 26.50 76.38 0.2 0.5 0.3 26.62 32.62 70.88 0.4 0.4 0.2 28.25 34.12 73.75
0.1 0.2 0.7 28.25 37.25 76.62 0.2 0.6 0.2 26.38 35.75 69.62 0.4 0.5 0.1 20.38 32.50 58.13
0.1 0.3 0.6 27.50 33.38 65.38 0.2 0.7 0.1 25.62 35.50 65.75 0.5 0.1 0.4 23.62 30.12 60.12
0.1 0.4 0.5 26.80 34.38 58.25 0.3 0.1 0.6 26.38 37.75 68.25 0.5 0.2 0.3 27.62 36.50 67.00
0.1 0.5 0.4 25.62 33.38 61.62 0.3 0.2 0.5 29.12 37.25 66.50 0.5 0.3 0.2 25.25 33.50 65.50
0.1 0.6 0.3 26.25 32.88 64.00 0.3 0.3 0.4 22.12 27.62 56.12 0.5 0.4 0.1 17.12 31.13 48.00
0.1 0.7 0.2 25.37 33.75 67.25 0.3 0.4 0.3 24.75 29.88 49.12 0.6 0.1 0.3 16.88 23.50 48.38
0.1 0.8 0.1 20.00 34.25 68.75 0.3 0.5 0.2 24.00 38.00 59.00 0.6 0.2 0.2 22.25 26.00 76.62
0.2 0.1 0.7 25.87 34.12 54.75 0.3 0.6 0.1 18.12 37.38 41.25 0.6 0.3 0.1 26.00 33.12 66.88
0.2 0.2 0.6 24.50 29.88 60.62 0.4 0.1 0.5 28.88 36.12 74.50 0.7 0.1 0.2 22.75 31.13 69.62
0.2 0.3 0.5 27.62 34.00 71.50 0.4 0.2 0.4 29.82 44.10 74.09 0.7 0.2 0.1 23.00 28.12 56.62
0.2 0.4 0.4 26.50 34.75 69.75 0.4 0.3 0.3 23.00 30.00 70.38 0.8 0.1 0.1 26.12 28.25 62.38
Chinese
w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD
0.1 0.1 0.8 34.32 32.84 76.10 0.2 0.5 0.3 34.98 43.09 65.18 0.4 0.4 0.2 34.40 42.10 59.39
0.1 0.2 0.7 32.25 24.90 70.39 0.2 0.6 0.2 36.05 31.51 71.88 0.4 0.5 0.1 34.23 41.36 58.56
0.1 0.3 0.6 34.40 46.24 76.92 0.2 0.7 0.1 34.07 41.03 57.40 0.5 0.1 0.4 33.41 38.71 63.61
0.1 0.4 0.5 35.81 45.99 74.94 0.3 0.1 0.6 35.39 52.15 79.24 0.5 0.2 0.3 35.97 43.34 70.97
0.1 0.5 0.4 34.98 43.42 71.96 0.3 0.2 0.5 37.22 44.56 80.07 0.5 0.3 0.2 34.40 42.51 74.52
0.1 0.6 0.3 31.75 44.25 73.28 0.3 0.3 0.4 34.07 35.98 61.29 0.5 0.4 0.1 34.15 41.27 58.73
0.1 0.7 0.2 34.07 45.49 69.07 0.3 0.4 0.3 35.72 52.31 76.67 0.6 0.1 0.3 36.39 44.58 67.91
0.1 0.8 0.1 34.81 42.02 79.49 0.3 0.5 0.2 32.33 42.76 76.43 0.6 0.2 0.2 36.05 46.57 76.34
0.2 0.1 0.7 31.67 52.15 78.74 0.3 0.6 0.1 34.73 46.15 67.91 0.6 0.3 0.1 37.05 46.32 79.40
0.2 0.2 0.6 35.89 47.48 75.52 0.4 0.1 0.5 36.39 43.76 78.91 0.7 0.1 0.2 34.81 43.67 68.98
0.2 0.3 0.5 35.14 43.59 66.83 0.4 0.2 0.4 35.56 52.23 79.24 0.7 0.2 0.1 35.97 43.67 65.59
0.2 0.4 0.4 36.47 48.30 80.23 0.4 0.3 0.3 34.57 43.51 79.90 0.8 0.1 0.1 35.89 52.39 63.94

For instance, on the English dataset, when the weights for the sentiment analysis, intention detection, and offensiveness detection tasks are set to 0.4, 0.2, and 0.4 respectively, the model achieves an accuracy of 29.82% for sentiment analysis, 44.10% for intention detection, and 74.09% for offensiveness detection. Similarly, on the Chinese dataset, by adjusting the weights, the model achieves accuracies of 37.22%, 44.56%, and 80.07% for the three tasks respectively. These results underscore the importance of weight adjustment in optimizing model performance within a multi-task learning framework.

4.9. Error analysis

We also perform error analysis on the wrongly predicted samples in the experimental results. Examples are shown in Fig. 6, and we find that the errors can be summarized into two aspects. On the one hand, the information provided in an image and its text may be insufficient to assist the various tasks, as shown in the image of example (a). On the other hand, as demonstrated


in the image of example (b), the text on the images may bring disturbance to the image feature extraction. Based on this observation, we further implement an experiment in which the text on the images is removed to improve the performance of the various tasks. Following [23], we first detect the word regions in the image using the Character-Region Awareness For Text detection (CRAFT) method. Then, we mask the words and delete them with the help of an automated object-removal inpainter [24]. We present the image after text removal on the right side of example (b).
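As a rough illustration of this text-removal step, the sketch below rasterizes detected word boxes into a mask and fills the masked pixels with OpenCV inpainting. The `word_boxes` input is assumed to come from a CRAFT-style text detector, and the OpenCV inpainting call is a stand-in for the automated object-removal inpainter [24] used in the paper.

```python
# Hedged sketch of removing detected text regions from a meme image.
import cv2
import numpy as np

def remove_text(image_bgr, word_boxes):
    """image_bgr: HxWx3 uint8 array; word_boxes: list of (x1, y1, x2, y2) from a text detector."""
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    for x1, y1, x2, y2 in word_boxes:
        cv2.rectangle(mask, (x1, y1), (x2, y2), 255, -1)   # fill the word region in the mask
    return cv2.inpaint(image_bgr, mask, 3, cv2.INPAINT_TELEA)
```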
Fig. 6. Examples for error analysis.

Table 6
Experimental results when removing the text on the image in our model. SA represents the sentiment analysis task, ID denotes the intention detection task, and OD and MR denote the offensiveness detection and metaphor recognition tasks. ↑ indicates improved performance.

Method            SA       ID       OD       MR
English
this paper        29.82    44.10    74.09    83.20
+ removing text   30.08 ↑  44.66 ↑  74.87 ↑  83.98 ↑
Chinese
this paper        37.22    44.56    80.07    85.97
+ removing text   41.03 ↑  55.17 ↑  79.74 ↓  85.23 ↑

Table 6 illustrates the accuracy results of our model adopting the concatenation method on the test set for the four tasks and the two languages, and shows that our model achieves a significant improvement in most cases when removing the text on the images. It is worth noting that after removing the text in the image, the performance on the Chinese dataset increases more than that on the English dataset. This may be because more text appears in the images of the Chinese dataset, which introduces a certain amount of noise. At the same time, the increase on the metaphor recognition task is very small in both the English and Chinese datasets. The most likely reason is that the metaphor information in the image is difficult for the model to learn.

5. Related works

5.1. Meme datasets

In recent years, several datasets specifically designed for meme understanding have been created. A notable advancement in this field was the release of the MEMOTION dataset by [15], which served as a sentiment analysis challenge specifically focused on memes. Subsequently, several meme datasets were collected to identify negative information or sentimental expressions. Kiela et al. [1] organized the Hateful Memes Challenge, which stands as the pioneering competition explicitly designed to assess the multi-modal understanding and reasoning capabilities of models. The challenge incorporates clear evaluation metrics and directly addresses real-world applicability. Kiela et al. [1] introduced a challenging dataset and benchmark specifically focused on detecting hate speech in multi-modal memes, fostering further research and development in this area. Suryawanshi et al. [25] utilized memes related to the 2016 U.S. presidential election to create the MultiOFF dataset, a multi-modal dataset designed for offensive content detection. This dataset serves as a valuable resource for studying offensive memes and developing effective detection models. Pramanick et al. [26] introduced two large-scale datasets and proposed a multi-modal deep neural network that systematically analyzes the local and global aspects of input memes, incorporating the background context. Their work significantly contributes to the advancement of multi-modal meme analysis, facilitating a deeper understanding of the content and context of memes. Kirk et al. [2] collected a diverse multi-modal dataset of wild hateful memes from Pinterest. This dataset encompasses both hateful and non-hateful memes, enabling the evaluation of models in an out-of-sample setting to gauge their performance on novel instances. Gasparini et al. [27] introduced a comprehensive multi-modal dataset of hateful memes, comprising 800 samples sourced from popular social media platforms such as Twitter, Facebook, Reddit, and meme-oriented websites. This dataset serves as a benchmark for evaluating the detection and analysis of hateful content in memes.

5.2. Multi-modal meme detection

Multi-modal learning integrates image, text, or other media forms to enhance various tasks, including Named Entity Recognition (NER) [28–30] and image captioning [31], among others. Due to the predominant presentation of memes in both image and text modalities, existing meme analysis works are predominantly implemented using multi-modal approaches. Early methods often relied on feature-based techniques, primarily focusing on learning joint representations of multi-modal content through feature processing. In contrast, current approaches often utilize prompt-based methods for hateful meme detection. Additionally, there has been a recent surge of interest in leveraging diverse aspects of memes, such as sentiment and sarcasm, with some works adopting multi-task learning to incorporate these various aspects into the model.

Feature-based Methods. Sandulescu [32] conducted initial experiments utilizing both single-stream architectures such as VLP [33], UNITER [34], and VL-BERT [35], and dual-stream models like LXMERT [36], comparing their performance against established baselines. Subsequently, they introduced a novel bidirectional cross-attention mechanism to integrate inferred caption information with meme captions, resulting in improved classifier performance for detecting hateful memes. Zhang et al. [37] presented the Complementary Vision and Linguistic (CVL) network, which uses visual and linguistic embeddings as input to predict hateful memes. In detail, they employ a pre-trained classifier and object detector to extract contextual features and regions-of-interest from the input, followed by position representation fusion for the visual embedding. The linguistic embedding comprises sentence word embedding, position embedding, and the corresponding spaCy embedding. Addressing the disparity between visual and textual elements in memes, Zhou et al. [38] introduced a novel approach using image captions as a bridge between image content and Optical Character Recognition (OCR) sentences. The study focused on the development of a Triplet-Relation Network (TRN) to improve multi-modal relationship modeling between visual regions and sentences.

Prompt-based Methods. Cao et al. [39] introduced PromptHate, a multi-modal framework that constructs simple prompts and presents several in-context examples to exploit the implicit knowledge in pre-trained language models, aiming to perform hateful meme classification. Cao et al. [40] devised a method for detecting hateful memes by harnessing the capabilities of a frozen pre-trained vision-language model (PVLM) to complement PromptHate’s uni-modal approach. It employs a series of probing questions to prompt the PVLM for information concerning commonly targeted vulnerable subjects in hateful content. Ji et al. [41] reframed harmful meme analysis as an auto-filling task and identified harmful memes in a prompt-based manner. Initially, they consolidate multi-modal data into a single modality by producing captions and attributes for visual data. Subsequently, they integrate the textual data into a pre-trained language model.


Multi-task Methods. Chauhan et al. [42] developed the Inter- CRediT authorship contribution statement
task Relationship Module (iTRM) to elucidate the synergistic between
tasks and the Inter-class Relationship Module (iCRM) to establish and Bingbing Wang: Writing – review & editing, Writing – original
strengthen connections among disparate task classes. Ultimately, rep- draft, Methodology, Investigation, Formal analysis, Data curation, Con-
resentations from these two Modules are cooperated across all the ceptualization. Shijue Huang: Writing – review & editing, Methodol-
tasks. Ma et al. [43] undertook a primary task alongside two uni-modal ogy, Conceptualization. Bin Liang: Writing – review & editing, Writing
auxiliary tasks. Notably, they introduced a self-supervised generation – original draft. Geng Tu: Visualization, Software. Min Yang: Writ-
strategy within the auxiliary tasks to automatically generate uni-modal ing – review & editing, Methodology. Ruifeng Xu: Writing – review
auxiliary labels. And a self-supervised approach is designed to enhance & editing, Writing – original draft, Project administration, Funding
the efficiency and effectiveness of the auxiliary tasks by eliminating the acquisition.
need for manually annotated labels. The work of [43] proposed a multi-
task learning approach for multi-modal meme detection. The method Declaration of competing interest
incorporates a primary multi-modal task and two additional uni-modal
auxiliary tasks to enhance the performance of the detection process.
In contrast to these approaches, which may overlook the significance of metaphorical information, our study places a strong emphasis on metaphor recognition in multi-modal meme detection. In this paper, we focus on four sub-tasks: sentiment analysis, intention detection, offensiveness detection, and metaphor recognition.

5.3. Metaphorical information

In the field of NLP, there has been growing interest in exploring different approaches to meme understanding, particularly in relation to metaphorical information. Early studies on metaphor predominantly relied on manually constructed knowledge. Jang et al. [44] introduced an effective method for inducing and applying metaphor frame templates, an important advance towards metaphor detection that provides a framework for recognizing and analyzing metaphorical expressions in text. Tsvetkov et al. [45] demonstrated that literal and metaphorical syntactic constructions can be reliably distinguished by considering the lexical-semantic features of the words involved in the construction. Other researchers have used distributional clustering and unsupervised approaches [46,47]. In recent years, deep learning models for metaphor detection have attracted increasing attention; however, multi-modal metaphor detection has been addressed in only a few studies.

Metaphor means reasoning about one thing in terms of another [48]. Because metaphor is a linguistic phenomenon, most previous works detected metaphorical information in a text-only setting [45,49]. Recently, some works have started to study metaphorical information in multi-modal scenarios. For example, Shutova et al. [50] presented the first metaphor identification method that simultaneously draws knowledge from linguistic and visual data. Lakoff and Johnson [3] demonstrated that metaphors are found not only in language but also in thought and action. Xu et al. [6] presented a large-scale multi-modal metaphor dataset with manual fine-grained annotation, including metaphorical information.
6. Conclusion

In this paper, we explore a novel approach for fine-grained meme understanding that models the incongruity among metaphorical information, images, and texts, and leverages their potential for enhancing various tasks. More concretely, we present a Metaphor-aware Multi-modal Multi-task Framework (M3F) that takes metaphorical information into consideration. Furthermore, we construct inter-modality attention to capture the interaction between text and image, and intra-modality attention to model the congruity between text and metaphorical information. Additionally, to better learn the implicit interaction across various tasks simultaneously, we design a multi-interactive decoder that exploits gating networks to establish the correlations among the different tasks. Our experimental results on a widely recognized benchmark dataset demonstrate the clear superiority of the proposed method over state-of-the-art baseline models.
models. 3214–3225.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China (62176076), the Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies (2022B1212010005), Guangdong Provincial Natural Science Foundation (2023A1515012922), and Shenzhen Foundational Research Funding JCYJ20220818102415032.

References

[1] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The hateful memes challenge: Detecting hate speech in multimodal memes, Advances in Neural Information Processing Systems 33 (2020) 2611–2624.
[2] H.R. Kirk, Y. Jun, P. Rauba, G. Wachtel, R. Li, X. Bai, N. Broestl, M. Doff-Sotta, A. Shtedritski, Y.M. Asano, Memes in the wild: Assessing the generalizability of the hateful memes challenge dataset, 2021, arXiv preprint arXiv:2107.04313.
[3] G. Lakoff, M. Johnson, Metaphors We Live By, University of Chicago Press, 2008.
[4] S.M. Anurudu, I.M. Obi, Decoding the metaphor of internet meme: A study of satirical tweets on black friday sales in Nigeria, Afrrev. Laligens 6 (1) (2017) 91–100.
[5] Z. Kovecses, Metaphor: A Practical Introduction, Oxford University Press, 2010.
[6] B. Xu, T. Li, J. Zheng, M. Naseriparsa, Z. Zhao, H. Lin, F. Xia, MET-Meme: A multimodal meme dataset rich in metaphors, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 2887–2899.
[7] Z. Wang, S. Mayhew, D. Roth, et al., Cross-lingual ability of multilingual BERT: An empirical study, 2019, arXiv.
[8] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv.
[9] Q. Zhang, H. Wu, C. Zhang, Q. Hu, H. Fu, J.T. Zhou, X. Peng, Provable dynamic fusion for low-quality multimodal data, 2023, arXiv preprint arXiv:2306.02050.
[10] Y. Zhang, J. Wang, Y. Liu, L. Rong, Q. Zheng, D. Song, P. Tiwari, J. Qin, A multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations, Inf. Fusion 93 (2023) 282–301.
[11] D. Dimitrov, B.B. Ali, S. Shaar, F. Alam, F. Silvestri, H. Firooz, P. Nakov, G. Da San Martino, Detecting propaganda techniques in memes, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6603–6617.
[12] S. Suryawanshi, B.R. Chakravarthi, M. Arcan, P. Buitelaar, Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, 2020, pp. 32–41.
[13] A.R. Akula, B. Driscoll, P. Narayana, S. Changpinyo, Z. Jia, S. Damle, G. Pruthi, S. Basu, L. Guibas, W.T. Freeman, et al., MetaCLUE: Towards comprehensive visual metaphors research, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23201–23211.
[14] D. Zhang, M. Zhang, H. Zhang, L. Yang, H. Lin, MultiMET: A multimodal dataset for metaphor understanding, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3214–3225.
[15] C. Sharma, D. Bhageria, W. Scott, S. Pykl, A. Das, T. Chakraborty, V. Pulabaigari, B. Gamback, SemEval-2020 task 8: Memotion analysis – the visuo-lingual metaphor!, 2020, arXiv.
[16] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019).
[18] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[20] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: ICML, PMLR, 2019, pp. 6105–6114.
[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16 × 16 words: Transformers for image recognition at scale, 2020, arXiv.
[22] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, S.J. Oh, Rethinking spatial dimensions of vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11936–11945.
[23] Y. Baek, B. Lee, D. Han, S. Yun, H. Lee, Character region awareness for text detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9365–9374.
[24] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, M. Ebrahimi, EdgeConnect: Structure guided image inpainting using edge prediction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[25] S. Suryawanshi, B.R. Chakravarthi, M. Arcan, P. Buitelaar, Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text, in: Proceedings of the Second Workshop on TRAC, ELRA, Marseille, France, 2020, pp. 32–41.
[26] S. Pramanick, S. Sharma, D. Dimitrov, M.S. Akhtar, P. Nakov, T. Chakraborty, MOMENTA: A multimodal framework for detecting harmful memes and their targets, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 2021, pp. 4439–4455.
[27] F. Gasparini, G. Rizzi, A. Saibene, E. Fersini, Benchmark dataset of memes with text transcriptions for automatic detection of multi-modal misogynistic content, Data Brief 44 (2022) 108526.
[28] J. Wang, Y. Yang, K. Liu, Z. Zhu, X. Liu, M3S: Scene graph driven multi-granularity multi-task learning for multi-modal NER, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2022) 111–120.
[29] F. Chen, J. Liu, K. Ji, W. Ren, J. Wang, J. Chen, Learning implicit entity-object relations by bidirectional generative alignment for multimodal NER, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4555–4563.
[30] J. Wu, C. Gong, Z. Cao, G. Fu, MCG-MNER: A multi-granularity cross-modality generative framework for multimodal NER with instruction, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 3209–3218.
[31] I. Laina, C. Rupprecht, N. Navab, Towards unsupervised image captioning with shared multimodal embeddings, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7414–7424.
[32] V. Sandulescu, Detecting hateful memes using a multimodal deep ensemble, 2020, arXiv preprint arXiv:2012.13235.
[33] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, Y. Artzi, A corpus for reasoning about natural language grounded in photographs, 2018, arXiv preprint arXiv:1811.00491.
[34] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, UNITER: Universal image-text representation learning, in: European Conference on Computer Vision, Springer, 2020, pp. 104–120.
[35] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, VL-BERT: Pre-training of generic visual-linguistic representations, 2019, arXiv preprint arXiv:1908.08530.
[36] H. Tan, M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, 2019, pp. 5100–5111.
[37] W. Zhang, G. Liu, Z. Li, F. Zhu, Hateful memes detection via complementary visual and linguistic networks, 2020, arXiv preprint arXiv:2012.04977.
[38] Y. Zhou, Z. Chen, H. Yang, Multimodal learning for hateful memes detection, in: 2021 IEEE International Conference on Multimedia & Expo Workshops, ICMEW, IEEE, 2021, pp. 1–6.
[39] R. Cao, R.K.-W. Lee, W.-H. Chong, J. Jiang, Prompting for multimodal hateful meme classification, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 321–332.
[40] R. Cao, M.S. Hee, A. Kuek, W.-H. Chong, R.K.-W. Lee, J. Jiang, Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5244–5252.
[41] J. Ji, W. Ren, U. Naseem, Identifying creative harmful memes via prompt based approach, in: Proceedings of the ACM Web Conference 2023, 2023, pp. 3868–3872.
[42] D.S. Chauhan, S. Dhanush, A. Ekbal, P. Bhattacharyya, All-in-one: A deep attentive multi-task learning framework for humour, sarcasm, offensive, motivation, and sentiment on memes, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020, pp. 281–290.
[43] Z. Ma, S. Yao, L. Wu, S. Gao, Y. Zhang, Hateful memes detection based on multi-task learning, Mathematics 10 (23) (2022) 4525.
[44] H. Jang, K. Maki, E. Hovy, C. Rose, Finding structure in figurative language: Metaphor detection with topic-based frames, in: Proceedings of the 18th Annual SIGDIAL Meeting on Discourse and Dialogue, 2017, pp. 320–330.
[45] Y. Tsvetkov, L. Boytsov, A. Gershman, E. Nyberg, C. Dyer, Metaphor detection with cross-lingual model transfer, in: Proceedings of the 52nd Annual Meeting of the ACL (Volume 1: Long Papers), 2014, pp. 248–258.
[46] E. Shutova, L. Sun, E.D. Gutiérrez, P. Lichtenstein, S. Narayanan, Multilingual metaphor processing: Experiments with semi-supervised and unsupervised learning, Comput. Linguist. 43 (1) (2017) 71–123.
[47] R. Mao, C. Lin, F. Guerin, Word embedding and WordNet based metaphor identification and interpretation, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1222–1231.
[48] G. Lakoff, M. Johnson, Conceptual metaphor in everyday language, in: Shaping Entrepreneurship Research, 1980.
[49] G. Gao, E. Choi, Y. Choi, L. Zettlemoyer, Neural metaphor detection in context, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, 2018, pp. 607–613.
[50] E. Shutova, D. Kiela, J. Maillard, Black holes and white rabbits: Metaphor identification with visual features, in: Proceedings of the 2016 Conference of NAACL: Human Language Technologies, ACL, San Diego, California, 2016, pp. 160–170.
