Knowledge-Based Systems
Abstract

Fine-grained meme understanding aims to explore and comprehend the meanings of memes from multiple perspectives by performing various tasks, such as sentiment analysis, intention detection, and offensiveness detection. Existing approaches primarily focus on simple multi-modality fusion and individual task analysis. However, there remain several limitations that need to be addressed: (1) the neglect of incongruous features within and across modalities, and (2) the lack of consideration for correlations among different tasks. To this end, we leverage metaphorical information as text modality and propose a Metaphor-aware Multi-modal Multi-task Framework (M3F) for fine-grained meme understanding. Specifically, we create inter-modality attention enlightened by the Transformer to capture the inter-modality interaction between text and image. Moreover, intra-modality attention is applied to model the contradiction between the text and metaphorical information. To learn the implicit interaction among different tasks, we introduce a multi-interactive decoder that exploits gating networks to establish the relationship between various subtasks. Experimental results on the MET-Meme dataset show that the proposed framework outperforms the state-of-the-art baselines in fine-grained meme understanding.

Keywords: Meme understanding; Metaphorical information; Inter-modality attention; Intra-modality attention; Multi-task learning
1. Introduction

Memes are cultural influences transmitted through the Internet, particularly on popular social media platforms like Twitter and Facebook. Fine-grained meme understanding is the task of understanding memes from the perspective of multiple subtasks. In most cases, an ordinary sentence or a picture alone may not convey a specific emotional meaning, but their combination does. Therefore, it is significant to consider both modalities to understand the meaning of a meme. As memes are often subtle and their true underlying meaning may be expressed in an implicit way, fine-grained meme understanding becomes a challenging task.

Several multi-modal meme datasets have been created for further study [1,2]. In recent years, the popularity of memes has also sparked the interest of linguists. They demonstrated that memes construe metaphors that are found in real life, not only in language but also in thoughts and actions [3,4]. Moreover, in the study of [5], two conceptual metaphor mappings were delineated, encompassing the source and target domains. To deeply understand the metaphor information in memes, Xu et al. [6] constructed a multi-modal metaphor meme dataset, MET-Meme, in two languages, which addresses the regrettable lack of metaphor information. Xu et al. [6] also proposed four fine-grained subtasks consisting of sentiment analysis, intention detection, metaphor detection, and offensiveness detection.

Despite the promising progress made by the above works, these approaches primarily rely on the concatenation or addition of features from multiple modalities and address each subtask independently, neglecting two noteworthy aspects of memes: incongruous characteristics and closely related labels. For instance, as shown in Fig. 1(a), the image depicts a man being attracted to another woman, while his girlfriend is displeased by this. However, the text in the image is "APPLE FANS; IPHONE 10; IPHONE 11", which is unrelated to the image content, highlighting cross-modal inconsistency. Furthermore, metaphorical information and the text within the image both belong to the text modality; the source domain mentions "Man; Woman; Woman", while the image text reads "APPLE FANS; IPHONE 10; IPHONE 11". These appear unrelated, illustrating inconsistency within the same modality.
As demonstrated above, the incongruity between and within modalities in memes may lead to misunderstandings of the meme's message, requiring the combination of image, text, as well as metaphorical information for a complete understanding of its meaning. Further analysis reveals that merely examining the text in the meme depicted in Fig. 1(a) does not convey the sentiment, but the image confirms its expressive nature with a sentiment of love, hence making it non-offensive. Similarly, in Fig. 1(b), knowing that the meme carries a sentiment of hate increases the likelihood of it being offensive. We can observe that the sentiment, intention, and offensiveness of memes are closely intertwined. Through effective representation, training, and evaluation, these relationships can be modeled to enhance the model's performance collaboratively.
In this way, this paper investigates the congruity between different as well as the same modalities and explores their potential using multi-task learning to enhance various tasks in fine-grained meme understanding, including sentiment analysis, intention detection, offensiveness detection, and metaphor recognition. Specifically, we propose a novel Metaphor-aware Multi-modal Multi-task Framework (M3F).¹ Through multi-task learning, we can simultaneously consider and model the complex relationships among the tasks and facilitate collaborative learning and knowledge transfer within the model, further improving its performance. By sharing underlying representations and feature learning processes, relevant information between different tasks can be mutually reinforced, allowing the model to better utilize data throughout the entire training process, thereby enhancing overall effectiveness. To reach this goal, we devise an inter-modality attention for modeling the interaction between image and text. Additionally, intra-modality attention is designed to learn the congruity of text and metaphorical information. To learn the implicit interaction among different tasks, we present a multi-interactive decoder (MID) to establish the relationship between various tasks. Concretely, VGG16 and Multilingual BERT are exploited to generate image features, text embedding, source embedding, and target embedding from the image, the text, as well as the source and target domains in the metaphorical information. For inter-modality attention, we propose a cross-attention Transformer that leverages the image embedding as the query and the text features as the key and value to elicit the inter-modality representation. Furthermore, to effectively utilize intra-modality attention between text and metaphorical information and obtain the intra-modality representation, we employ the text embedding as the query and consider the source domain and target domain as the key and value. Notably, using target and source domain features for metaphor recognition is not advisable due to the potential leakage of metaphorical information, which could result in inaccurate metaphor recognition outcomes. Therefore, we use the inter-modality representation for metaphor recognition and additionally employ a dynamic replacement strategy to guide the source and target domains for subsequent subtasks. Extensive experiments conducted on the MET-Meme dataset verify the superiority of our framework and show that our proposed M3F method significantly outperforms recent state-of-the-art solutions.

¹ The code of this work is released at: github.com/Vincy2King/M3F-MEME

To summarize, the main contributions of our work are as follows:

• The fine-grained meme understanding work is approached from a novel perspective that investigates the interplay between metaphor cues and explores the potential relationships within and across modalities and subtasks.
• A novel Metaphor-aware Multi-modal Multi-task Framework (M3F) is proposed to improve the utilization of meme information and metaphor integration at both inter-modal and intra-modal levels, as well as capture the commonness across tasks.
• Performance evaluation on the MET-Meme benchmark dataset demonstrates the robustness and superiority of the proposed framework compared to the state-of-the-art baselines in fine-grained meme understanding.

2. Method

In this section, we describe the proposed M3F method in detail. As illustrated in Fig. 2, M3F primarily consists of four components: (1) Feature extraction, which applies VGG16 and Multilingual BERT to extract features from the images, texts, source, and target domains. (2) Inter-modality attention, which constructs the relationship between images and texts and obtains the result of metaphor recognition to dynamically guide the source and target domains. (3) Intra-modality attention, which captures the congruity among texts and metaphorical information. (4) Multi-interactive decoder, which learns the implicit interaction among tasks.

2.1. Task definition

Formally, supposing there is an example comprising an image $V$, a corresponding text $T$, a source domain $M_s$, and a target domain $M_g$, the goal of multi-modal meme analysis is to predict the metaphor category $y_{MR}$, sentiment category $y_{SA}$, intention category $y_{ID}$, and offensiveness category $y_{OD}$. It should be noted that the source and target domains refer to metaphorical information, usually in the form of text. As shown in Fig. 1, the source domain forms the foundation of the metaphor, while the target domain represents the concept or idea being metaphorically expressed.
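For concreteness, the input and output structure described above can be captured by a small container type. The sketch below is illustrative only; the field and label names are our own shorthand and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemeSample:
    """One MET-Meme style example: an image, its embedded text,
    and the metaphorical source/target domains as extra text fields."""
    image_path: str          # x_v: path to the meme image V
    text: str                # x_t: text T appearing in the meme
    source_domain: str       # x_s: metaphor source domain M_s
    target_domain: str       # x_g: metaphor target domain M_g
    # gold labels for the four subtasks (None at inference time)
    metaphor: Optional[int] = None       # y_MR
    sentiment: Optional[int] = None      # y_SA
    intention: Optional[int] = None      # y_ID
    offensiveness: Optional[int] = None  # y_OD
```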
2.2. Feature extraction

Given an input example $x = (x^v, x^t, x^s, x^g)$ for each sample, $x^v$ represents the image input and $x^t$ denotes the text input with $L_t$ tokens; $x^s$ with $L_s$ tokens and $x^g$ with $L_g$ tokens represent the source domain and target domain in the metaphor text, respectively. Four encoders are used for feature extraction, three of which are Multilingual BERT [7] for extracting text features, and the other is VGG16 [8] for extracting image features. The text representations $x^t$, $x^s$, $x^g$ are input into the Multilingual BERT encoder, which produces hidden representations $e^t \in \mathbb{R}^{N \times d_t}$, $e^s \in \mathbb{R}^{N \times d_s}$, and $e^g \in \mathbb{R}^{N \times d_g}$ for all the tokens in the text modality, where $N$ represents the number of tokens, and $d_t$, $d_s$, and $d_g$ represent the respective dimensions of the hidden representations:

$e^t = \text{M-BERT}([CLS]\, x^t\, [SEP])_{1:L_t}$,
$e^s = \text{M-BERT}([CLS]\, x^s\, [SEP])_{1:L_s}$,   (1)
$e^g = \text{M-BERT}([CLS]\, x^g\, [SEP])_{1:L_g}$.
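As a concrete illustration of this step, the snippet below extracts the three text embeddings with a multilingual BERT checkpoint and image features with VGG16, using Hugging Face transformers and torchvision. The checkpoint name (bert-base-multilingual-cased) and the choice to take VGG16's flattened convolutional features are our assumptions; this is a minimal sketch of the described pipeline, not the authors' released code.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from torchvision import models, transforms
from PIL import Image

# One multilingual BERT shared across text, source and target domains
# (assumed checkpoint; the paper only specifies "Multilingual BERT").
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
m_bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def encode_text(sentence: str) -> torch.Tensor:
    """Token-level hidden states, cf. Eq. (1): e = M-BERT([CLS] x [SEP])."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = m_bert(**inputs).last_hidden_state  # shape (1, N, 768)
    return hidden.squeeze(0)

# Image encoder: VGG16 convolutional features (the exact layer used is an assumption).
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg_features = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_image(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg_features(img).squeeze(0)  # flattened conv features

e_t = encode_text("APPLE FANS; IPHONE 10; IPHONE 11")  # text inside the meme of Fig. 1(a)
e_s = encode_text("Man; Woman; Woman")                 # source domain of Fig. 1(a)
e_g = encode_text("target-domain phrase")              # placeholder; actual phrase not given above
```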
Fig. 2. The architecture of the proposed M3F method. The metaphor recognition labels are obtained from the inter-modality representation, while the sentiment analysis, intention detection, and offensiveness labels are output by the multi-interactive decoder. This design prevents the source domain and target domain data from influencing the results of metaphor recognition.
Fig. 3. The architecture of the inter-modality attention and intra-modality attention.

where $d_k = d_h/m$, $\mathrm{Att}_i(h^v, h^t) \in \mathbb{R}^{N \times d_k}$, and $\{W_i^Q, W_i^K, W_i^V\} \in \mathbb{R}^{d_k \times d_h}$ are learnable parameters, and $\sigma$ denotes the softmax function. The outputs of the $m$ heads are then concatenated and passed through a linear transformation to produce the final output.
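The inter-modality attention equations themselves fall on a page that is not reproduced here, but the mechanism described in the text (the image embedding as query, the text features as key and value, followed by a feedforward network) is standard multi-head cross-attention. The sketch below is a minimal PyTorch rendering of that description, not the authors' implementation; the activation and normalization choices are assumptions. The same module shape is reused later for intra-modality attention by swapping in the text embedding as query and the source/target-domain embeddings as key and value.

```python
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    """Multi-head cross-attention with a small feedforward head,
    a sketch of the inter-/intra-modality attention described in Section 2."""
    def __init__(self, d_h: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_h, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.norm = nn.LayerNorm(d_h)

    def forward(self, query, key, value):
        # query: (B, N_q, d_h); key/value: (B, N_kv, d_h)
        attended, _ = self.attn(query, key, value)
        return self.norm(self.ffn(attended))

d_h = 768
h_v = torch.randn(2, 49, d_h)  # projected image regions (toy shapes)
h_t = torch.randn(2, 32, d_h)  # text token features
h_s = torch.randn(2, 8, d_h)   # source-domain tokens (padded to a common length here)
h_g = torch.randn(2, 8, d_h)   # target-domain tokens

# Inter-modality: image features as query, text features as key and value.
inter_attn = CrossModalityAttention(d_h)
H_inter = inter_attn(h_v, h_t, h_t)

# Intra-modality (Section 2.4): text as query, source domain as key, target domain as value.
intra_attn = CrossModalityAttention(d_h)  # separate weights from the inter-modality module
H_intra = intra_attn(h_t, h_s, h_g)
```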
where $W_{MR} \in \mathbb{R}^{d_h \times d_h}$ is a learnable parameter and $b_{MR}$ is a bias trained along with the model. $\sigma$ indicates the softmax function. $y_{MR}$ and $\hat{y}_{MR}$ are the ground truth and the estimated label distribution of the metaphor recognition task, respectively. $\Theta_{MR}$ represents all the model's learnable parameters, and $\lambda_{MR}$ denotes the coefficient of L2-regularization.

Dynamic Replacement. The metaphor recognition result primarily indicates whether a meme contains a metaphorical expression or not. As a result, the labels assigned to the memes are categorized as either metaphorical or literal. The metaphorical label indicates the presence of a metaphorical occurrence in the meme, while the literal label signifies the absence of metaphorical elements. When the result is the literal label, the source and target domains are replaced by "[CLS] [SEP]".
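The replacement rule is simple enough to state in a few lines of code. The following is a hedged illustration of the strategy as described; the label convention (0 = literal, 1 = metaphorical) is an assumption.

```python
def dynamic_replacement(pred_metaphor_label, source, target, literal_label=0):
    """If the meme is predicted as literal, blank out the metaphorical
    source/target domains so they cannot mislead the downstream subtasks.
    Assumes label 0 = literal, 1 = metaphorical."""
    if pred_metaphor_label == literal_label:
        return "[CLS] [SEP]", "[CLS] [SEP]"
    return source, target

# Example: a literal prediction leaves no metaphor text for the later subtasks.
src, tgt = dynamic_replacement(0, "Man; Woman; Woman", "target-domain phrase")
```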
2.4. Intra-modality attention

In the context of meme analysis, the inclusion of metaphorical information from both the source and target domains is crucial. It serves as a union of textual modality information and the text within meme images, aiding in understanding and interpreting the underlying meanings of memes. As depicted in Fig. 1(a), if the model can infer the metaphorical information associated with the text "APPLE FANS; IPHONE 10; IPHONE 11" and relate it to "Man; Woman; Woman", it would grasp the expectation of an iPhone fan towards the iPhone 11. Similarly, by understanding the metaphorical information behind "Lungs" as the source domain and "Orange" as the target domain, the model can further discern the implied distinction between the lungs of a vegetarian (represented by slices of orange) and a non-vegetarian (depicted by an anatomical illustration). Furthermore, the importance of intra-modality attention lies in its ability to capture the internal correlations within the text modality. The core principle of intra-modality attention is to dynamically allocate attention weights to different parts or elements within the same modality.

Similar to inter-modality attention, we adopt $Q$, $K$, and $V$ to represent the query, key, and value in this module. In this case, $Q$ is computed based on $h^t$, while $K$ and $V$ are calculated using $h^s$ and $h^g$. Given the weights $W_Q, W_K, W_V \in \mathbb{R}^{d_h \times d_h}$, then $Q = h^t W_Q$, $K = h^s W_K$, $V = h^g W_V$. The output of $n$ heads is determined as below:

$\text{M-Att}(h^t, h^s, h^g) = [\text{Att}_1(h^t, h^s, h^g), \ldots, \text{Att}_n(h^t, h^s, h^g)]\, W^j$,   (8)

$\text{Att}_i(h^t, h^s, h^g) = \sigma\left(\frac{[W_i^Q h^t][W_i^K h^s]}{\sqrt{d_l}}\right)[W_i^V h^g]$,   (9)

where $W^j \in \mathbb{R}^{d_h \times d_h}$ is a learnable parameter, $d_l = d_h/n$, $\text{Att}_i(h^t, h^s, h^g) \in \mathbb{R}^{N \times d_l}$, and $\{W_i^Q, W_i^K, W_i^V\} \in \mathbb{R}^{d_l \times d_h}$ are learnable parameters. $\sigma$ denotes the softmax function. Afterward, a feedforward network with linear-based activation functions is employed to obtain the intra-modality representation as:

$H_{intra} = \text{FN}(\text{M-Att}(h^t, h^s, h^g))$,   (10)

where FN is the feedforward network and $H_{intra}$ denotes the intra-modality representation.

2.5. Multi-interactive decoder

After obtaining the inter-modality representation $H_{inter}$ and the intra-modality representation $H_{intra}$, we concatenate them with the image features $h^v$, the text embedding $h^t$, as well as the source and target embeddings $h^s$, $h^g$ to produce the multi-modal representation $H_{mul}$, which plays an essential role in the following steps. While employing $H_{mul}$ may offer a viable approach for directly predicting the three tasks, it fails to comprehensively capture the unique characteristics inherent to each task. Inspired by [10], we propose a multi-interactive decoder depicted in Fig. 4, which comprises three steps.
Step 1. To begin, we generate seven distinct representations by feeding $H_{mul}$ through seven individual linear layers: a tri-task representation denoted as $H_{s,i,o}$, three bi-task representations, namely $H_{s,i}$, $H_{s,o}$, and $H_{i,o}$, and three task-specific representations, $H_s$, $H_i$, and $H_o$. Here, $s$ represents the sentiment analysis task, and $i$ and $o$ correspond to the intention detection and offensiveness detection tasks, respectively.

Step 2. To facilitate knowledge transfer among the three tasks, a gating layer that contains a linear transformation and a self-attention mechanism is introduced. We utilize seven separate gating layers to obtain the updated representations of the tri-task, bi-task, and task-specific representations. The updates of the tri-task, bi-task, and task-specific representations aim to learn the shared representations that facilitate knowledge transfer across the three tasks, capture the correlation between two tasks, and develop task-specific representations, respectively.

To better aggregate the tri-task, bi-task, and task-specific information, we first utilize linear layers to project the concatenation of the corresponding representations from Step 1:

$E_{s,i,o} = W_{s,i,o}(H_{s,i,o} \oplus H_{s,i} \oplus H_{s,o} \oplus H_{i,o} \oplus H_s \oplus H_i \oplus H_o) + b_{s,i,o}$,
$E_{s,i} = W_{s,i}(H_{s,i} \oplus H_s \oplus H_i) + b_{s,i}$,
$E_{s,o} = W_{s,o}(H_{s,o} \oplus H_s \oplus H_o) + b_{s,o}$,
$E_{i,o} = W_{i,o}(H_{i,o} \oplus H_i \oplus H_o) + b_{i,o}$,   (11)
$E_s = W_s(H_{s,i,o} \oplus H_{s,i} \oplus H_{s,o} \oplus H_s) + b_s$,
$E_i = W_i(H_{s,i,o} \oplus H_{s,i} \oplus H_{i,o} \oplus H_i) + b_i$,
$E_o = W_o(H_{s,i,o} \oplus H_{s,o} \oplus H_{i,o} \oplus H_o) + b_o$,

where $\oplus$ is the concatenation operation, $W_{s,i,o}, W_{s,i}, W_{s,o}, W_{i,o}, W_s, W_i, W_o$ are trainable parameters, and $b_{s,i,o}, b_{s,i}, b_{s,o}, b_{i,o}, b_s, b_i, b_o$ are biases.

We then apply self-attention to each $E$ representation. Specifically, each concatenated representation undergoes linear projections to obtain the corresponding $K$, $Q$, and $V$ values. Dot products are then computed between key–query pairs, followed by scaling to ensure stable training. Subsequently, a softmax operation is applied to normalize the results. Finally, a weighted sum is calculated using the output from the $V$ projection:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.   (12)

In detail, the updated representations can be obtained by:

$E'_{s,i,o} = E_{s,i,o}(1 + \text{Attention}(E_{s,i,o}))$,
$E'_{s,i} = E_{s,i}(1 + \text{Attention}(E_{s,i}))$,
$E'_{s,o} = E_{s,o}(1 + \text{Attention}(E_{s,o}))$,
$E'_{i,o} = E_{i,o}(1 + \text{Attention}(E_{i,o}))$,   (13)
$E'_s = E_s(1 + \text{Attention}(E_s))$,
$E'_i = E_i(1 + \text{Attention}(E_i))$,
$E'_o = E_o(1 + \text{Attention}(E_o))$.

Step 3. To further aggregate the corresponding information in a comprehensive manner, we concatenate all relevant representations to obtain $U_s$, $U_i$, $U_o$:

$U_s = E'_{s,i,o} \oplus E'_{s,i} \oplus E'_{s,o} \oplus E'_s$,
$U_i = E'_{s,i,o} \oplus E'_{s,i} \oplus E'_{i,o} \oplus E'_i$,   (14)
$U_o = E'_{s,i,o} \oplus E'_{s,o} \oplus E'_{i,o} \oplus E'_o$.

3. Multi-task prediction

The outputs $U_s$, $U_i$, $U_o$ are forwarded through the softmax function to yield the sentiment, intention, and offensiveness categories.
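To make Steps 1–3 concrete, the sketch below renders the decoder in PyTorch. The hidden sizes, the single-head self-attention inside each gate, the mean pooling before concatenation, and the classifier heads are our assumptions based on the description above and Eqs. (11)–(14); it is an illustrative sketch rather than the released implementation.

```python
import torch
import torch.nn as nn

class MultiInteractiveDecoder(nn.Module):
    """Sketch of the multi-interactive decoder: seven task(-group) projections,
    seven gating layers (linear + self-attention, Eqs. (11)-(13)), and
    concatenation into U_s, U_i, U_o (Eq. (14)) with softmax heads."""
    GROUPS = ["sio", "si", "so", "io", "s", "i", "o"]

    def __init__(self, d_mul, d_h, n_sent, n_int, n_off):
        super().__init__()
        # Step 1: seven linear layers over the multi-modal representation H_mul.
        self.proj = nn.ModuleDict({g: nn.Linear(d_mul, d_h) for g in self.GROUPS})
        # Which group representations are concatenated for each E in Eq. (11).
        self.members = {"sio": ["sio", "si", "so", "io", "s", "i", "o"],
                        "si": ["si", "s", "i"], "so": ["so", "s", "o"], "io": ["io", "i", "o"],
                        "s": ["sio", "si", "so", "s"], "i": ["sio", "si", "io", "i"],
                        "o": ["sio", "so", "io", "o"]}
        # Step 2: per-group gating = linear over concatenated groups + self-attention.
        self.gate_lin = nn.ModuleDict(
            {g: nn.Linear(len(m) * d_h, d_h) for g, m in self.members.items()})
        self.gate_attn = nn.ModuleDict(
            {g: nn.MultiheadAttention(d_h, num_heads=1, batch_first=True) for g in self.GROUPS})
        # Step 3 + prediction heads over U_s, U_i, U_o (each a concat of four groups).
        self.head_s = nn.Linear(4 * d_h, n_sent)
        self.head_i = nn.Linear(4 * d_h, n_int)
        self.head_o = nn.Linear(4 * d_h, n_off)

    def forward(self, h_mul):                      # h_mul: (B, L, d_mul)
        H = {g: self.proj[g](h_mul) for g in self.GROUPS}
        E, E_upd = {}, {}
        for g in self.GROUPS:                      # Eq. (11): project concatenated groups
            E[g] = self.gate_lin[g](torch.cat([H[m] for m in self.members[g]], dim=-1))
        for g in self.GROUPS:                      # Eqs. (12)-(13): E' = E * (1 + SelfAttn(E))
            attn_out, _ = self.gate_attn[g](E[g], E[g], E[g])
            E_upd[g] = E[g] * (1 + attn_out)
        pooled = {g: E_upd[g].mean(dim=1) for g in self.GROUPS}  # pooling is an assumption
        U_s = torch.cat([pooled[g] for g in ["sio", "si", "so", "s"]], dim=-1)  # Eq. (14)
        U_i = torch.cat([pooled[g] for g in ["sio", "si", "io", "i"]], dim=-1)
        U_o = torch.cat([pooled[g] for g in ["sio", "so", "io", "o"]], dim=-1)
        return (self.head_s(U_s).softmax(-1),
                self.head_i(U_i).softmax(-1),
                self.head_o(U_o).softmax(-1))
```

In practice, $H_{mul}$ here would stand for the concatenation of $H_{inter}$, $H_{intra}$, $h^v$, $h^t$, $h^s$, and $h^g$ described in Section 2.5.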
Fig. 4. The architecture of the multi-interactive decoder, where each gating layer consists of a linear transformation and a self-attention mechanism. ⊕ indicates the concatenation
operation.
Table 2
Experimental results (%) on the MET-Meme dataset for the four tasks. Results with † are retrieved from [6]. Bold indicates the best-performing model in each column. Results with ∗ indicate the significance tests of our M3F over other baseline models (with p-value < 0.05), and results with ♯ mark where MET𝑜 improves over the MET method. Fusion methods consist of element-wise add (add) and concatenation (cat).
                  Sentiment Analysis                              Intention Detection
Method            English                 Chinese                 English                 Chinese
                  Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec
VGG16 20.57 20.84 24.22 29.94 26.04 29.20 37.19 38.71 38.15 47.48 49.21 47.81
DenseNet-161 21.88 21.71 25.65 29.45 27.50 29.36 38.10 39.31 37.89 47.23 39.24 47.06
ResNet-50 21.74 18.63 21.35 29.36 27.50 29.28 39.19 37.12 40.10 47.15 39.23 47.06
Multi-BERT_EfficientNet 28.52 24.52 29.04 33.50 35.29 33.42 43.10 41.54 42.19 51.03 43.06 51.03
Multi-BERT_ViT 24.43 23.41 23.96 33.25 27.33 32.84 41.28 40.13 40.62 50.62 41.32 50.62
Multi-BERT_PiT 25.00 27.82 28.12 33.66 33.58 33.09 42.23 41.09 41.02 50.21 50.00 50.04
MET_add 24.65† 24.52† 25.26† 32.50† 32.62† 33.50† 40.32† 40.39† 41.28† 52.93† 52.68† 54.01†
MET𝑜 _add 25.91♯ 25.20♯ 25.52♯ 33.91♯ 33.44♯ 34.16♯ 41.02♯ 41.15♯ 41.54♯ 50.87 52.89♯ 51.28
MET_cat 27.68† 28.41† 29.82† 33.42† 34.33† 33.91† 38.56† 39.19† 39.84† 51.58† 51.48† 52.85†
MET𝑜 _cat 27.86♯ 28.64♯ 29.56 34.66♯ 35.26♯ 34.16♯ 40.10♯ 40.76♯ 40.11♯ 51.28 51.55♯ 53.43♯
ours_add 30.47∗ 33.45∗ 30.34∗ 39.95∗ 41.80∗ 39.87∗ 44.40∗ 41.89∗ 44.32∗ 55.25∗ 54.57∗ 55.00∗
ours_cat 29.82∗ 34.18∗ 30.73∗ 37.22∗ 39.55∗ 37.97 ∗ 44.10∗ 44.56∗ 43.53∗ 53.52∗ 54.72∗ 54.52∗
                  Offensiveness Detection                         Metaphor Recognition
                  English                 Chinese                 English                 Chinese
                  Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec
VGG16 67.10 63.42 72.53 70.07 64.11 72.07 78.39 79.73 79.95 67.00 67.24 67.82
DenseNet-161 69.66 62.07 69.98 71.43 70.82 75.43 80.08 80.23 80.47 67.16 67.91 67.99
ResNet-50 69.21 64.62 72.57 73.24 69.62 75.74 80.34 81.22 80.86 67.74 67.63 67.66
Multi-BERT_EfficientNet 73.78 67.98 74.56 78.15 72.11 79.98 82.46 84.39 83.11 74.19 71.26 74.28
Multi-BERT_ViT 71.22 62.96 72.66 76.92 67.31 78.74 81.90 82.01 82.46 73.28 72.55 73.70
Multi-BERT_PiT 72.79 66.69 74.26 77.17 70.16 79.05 82.07 83.05 82.98 75.10 73.15 74.28
MET_add 68.39† 66.21† 72.14† 76.01† 74.76† 78.16† 81.33† 81.49† 82.29† 74.04† 74.51† 74.96†
MET𝑜 _add 69.53♯ 66.92♯ 73.18♯ 77.42♯ 74.94♯ 78.99♯ 81.64♯ 81.60♯ 83.20♯ 75.02♯ 72.81 75.67♯
MET_cat 67.25† 66.15† 74.48† 73.19† 71.59† 79.49† 82.39† 82.69† 83.33† 72.90† 72.80† 73.30†
MET𝑜 _cat 63.16 66.33♯ 74.61♯ 74.69♯ 72.05♯ 73.61 82.03 82.71♯ 83.85♯ 73.78♯ 73.46♯ 73.61♯
ours_add 76.17∗ 69.45∗ 76.19∗ 80.81∗ 76.00∗ 80.73∗ 83.98∗ 85.86∗ 84.38∗ 77.01∗ 72.94 82.68∗
ours_cat 74.09∗ 69.59∗ 76.15∗ 80.07∗ 76.20∗ 80.62∗ 83.20∗ 85.97∗ 85.81∗ 76.18∗ 73.02 80.00∗
also present the results of two fusion methods, element-wise addition (add) and concatenation (cat), in their work. Moreover, we adopt VGG16 for image feature extraction for a fair comparison with [6]. To explore the fusion for obtaining the multi-modal representation, we also examine two fusion operations: addition of all the features (add) and concatenation of all the features (cat).
4.4. Main result

We report the main experimental results of multi-modal meme understanding on the test set of the MET-Meme benchmark dataset. Table 2 presents the results for the following four tasks in English and Chinese: sentiment analysis, intention detection, offensiveness detection, and metaphor recognition. In terms of accuracy, precision, and recall, our proposed M3F approach consistently achieves the highest performance on all four tasks. These results highlight the effectiveness of our model, which leverages inter-modality and intra-modality attention mechanisms along with a multi-interactive decoder. The improved performance across various metrics further emphasizes the advantages of our approach in capturing and leveraging the synergies between modalities and tasks.

On the other hand, comparing the results between the MET and MET𝑜 methods, we can conclude that our dataset partitioning is similar to that of [6] and performs slightly better. Furthermore, the significance tests of M3F over the baseline models demonstrate the effectiveness of our method, presenting a statistically significant improvement on most evaluation metrics with p-value < 0.05. Although the effects of the cat and add methods differ across tasks, both are better than the baseline models. For example, in terms of accuracy in English, sentiment analysis showed the highest improvement of 9.90% and the lowest improvement of 1.30%, intention detection exhibited the highest enhancement of 7.21% and the lowest improvement of 1.00%, and offensiveness detection displayed the highest boost of 13.01% and the lowest improvement of 0.31%. Furthermore, compared with the other tasks, metaphor recognition shows relatively minor performance gains, because, different from the sentiment analysis, intention detection, and offensiveness detection tasks, the result of metaphor recognition is directly affected by inter-modality attention. Metaphorical information needs to be reflected in the two modalities of image and text. That is to say, the inter-modality attention allows the model to focus on the relevant cues and context across modalities, leading to improved performance in metaphor recognition.

Interestingly, we also notice that the three methods Multi-BERT_EfficientNet, Multi-BERT_ViT, and Multi-BERT_PiT are comparable in performance to MET. This may be because the image extraction method used by the MET method is the more traditional VGG or ResNet, so it is inferior to some of the latest vision-language models. Our M3F model, using VGG16 as the image encoder, still performs better than theirs, which reflects the superiority of our model.

4.5. Ablation study

To assess how different components affect performance, we perform an ablation study of the proposed M3F on three tasks using the concatenation method and report the results in Table 3. "ours" is our proposed method, "w/o Inter" represents not using the inter-modality representation, and "w/o Intra" denotes not employing the intra-modality representation. "w/o MID" indicates that we directly exploit the multi-modal representation to obtain the final results without utilizing the multi-interactive decoder. We can observe that the removal of the inter-modality representation sharply reduces the performance, which verifies that inter-modality attention is significant and effective in learning incongruity features between different modalities. Furthermore, the removal of intra-modality attention also leads to considerably poorer performance. This indicates that, compared with the simple way of directly concatenating or adding metaphorical information with text features, intra-modality attention enhances the model's ability to capture finer incongruities between textual and metaphorical information. It is also worth noting that the removal of the multi-interactive decoder degrades the performance. The results suggest that the multi-interactive decoder plays an important role in capturing the interaction among the various tasks.
Table 3
Experimental results of ablation study. Acc, Pre, and Rec represent Accuracy, Precision, and Recall.
                  Sentiment Analysis                              Intention Detection
Method            English                 Chinese                 English                 Chinese
                  Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec
ours 29.82 34.18 30.73 37.22 39.55 37.97 44.10 44.56 43.53 53.52 54.72 54.52
-w/o Inter 29.04 30.60 29.95 37.06 38.30 37.14 43.49 43.08 42.58 52.77 53.82 53.02
-w/o Intra 28.65 33.34 28.65 36.15 38.65 36.02 43.10 42.25 42.58 52.36 53.53 53.11
-w/o MID 28.91 31.09 29.95 36.15 38.18 36.06 43.23 42.75 42.45 51.94 53.37 53.02
Offensiveness Detection
English Chinese
Acc Pre Rec Acc Pre Rec
ours 74.09 69.59 76.15 80.07 76.20 80.62
-w/o Inter 73.44 68.50 74.22 79.07 75.24 78.25
-w/o Intra 73.44 67.53 74.48 78.33 72.34 78.76
-w/o MID 73.18 67.11 74.09 77.17 71.82 78.14
Table 4
Experimental results (%) on the MET-Meme dataset with different image backbones.
                  Sentiment Analysis                              Intention Detection
Backbone          English                 Chinese                 English                 Chinese
                  Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec
VGG16 29.82 34.18 30.73 37.22 39.55 37.97 44.10 44.56 43.53 53.52 54.72 54.52
DenseNet-161 30.73 28.72 30.21 40.69 43.12 40.61 44.27 45.49 44.60 54.76 59.07 54.01
ResNet-50 31.25 28.77 31.03 42.43 43.28 40.44 44.40 43.38 44.12 56.16 57.52 55.17
EfficientNet 31.12 28.68 31.76 40.45 41.71 40.53 43.23 43.35 43.10 55.42 60.07 55.25
ViT 30.86 29.71 30.21 40.53 41.43 40.67 43.49 45.11 44.27 56.91 61.49 56.41
PiT 30.08 30.97 30.23 40.86 42.52 40.83 44.14 42.22 44.35 56.16 58.82 56.07
                  Offensiveness Detection                         Metaphor Recognition
                  English                 Chinese                 English                 Chinese
                  Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec     Acc     Pre     Rec
VGG16 74.09 69.59 76.15 80.07 76.20 80.62 83.20 85.97 85.81 76.18 73.02 80.00
DenseNet-161 75.26 68.22 75.13 81.14 76.55 81.06 82.68 84.68 83.79 76.34 72.23 79.59
ResNet-50 75.65 68.18 75.39 80.48 76.30 80.73 84.24 86.05 84.71 77.42 71.99 82.06
EfficientNet 75.00 68.64 75.68 81.22 76.99 81.24 83.33 84.65 83.85 77.58 72.10 81.22
ViT 75.78 69.84 76.04 81.56 77.03 81.47 83.13 85.51 83.98 78.08 74.02 82.47
PiT 74.87 70.37 75.61 80.81 76.89 80.95 83.46 85.29 83.77 77.83 73.22 83.30
4.6. Qualitative evaluation of backbones

This section analyzes different backbones for image feature extraction, including VGG16 [8], DenseNet [18], ResNet-50 [19], EfficientNet [20], Vision Transformer (ViT) [21], and Pooling-based Vision Transformer (PiT) [22]. The results are presented in Table 4. While different backbones do have a certain impact on the final performance, overall, models based on PiT demonstrate the best performance. Moreover, it is worth mentioning that regardless of the backbone employed, the performance surpasses the current state-of-the-art methods (as shown in Table 2), thus illustrating the feasibility of our proposed M3F method. Additionally, we observe that transformer-based architectures, including ViT and PiT, outperform the others. This is attributed to the attention mechanism of the Transformer architecture, which allows the model to capture global context information within images.

4.7. Case study

To qualitatively examine how M3F investigates incongruent information across different modalities, we offer an attention visualization of three representative examples that require the integration of both text and image data. The outcomes are depicted in Fig. 5.

It is important to emphasize that the primary focus lies in discerning incongruent information. Corresponding to the multi-modal meme tasks consisting of sentiment analysis, intention detection, offensiveness detection, and metaphor recognition, capturing the contradictory expressions between different modalities contributes substantially to enhanced performance. In this way, we investigate inter- and intra-modality attention to effectively utilize incongruity information. As illustrated in Fig. 5, our model tends to focus on the inconsistent regions between the image and the text. The attention mechanisms for images and text can complement each other, as illustrated in Fig. 5(b), where the text focuses on the "unicorn" while it is overlooked in the meme image. Similarly, in Fig. 5(a), "spiderman" is attended to in vision but receives less attention in the text. Furthermore, as shown in Fig. 5, the model emphasizes the central portions of the "cat", while the text modality places the highest attention on the term "potluck". Through inter-modality attention, our model captures incongruity dependencies between the two modalities, enabling it to make accurate predictions for such instances. Furthermore, we can notice that the word "me" in the text modality refers to the "cat" in the image modality. Assisted by inter-modality attention, our model can establish a connection between these modalities, so as to learn the relationship for better performance.

4.8. Effect of hyperparameters

Table 5 presents the performance variation of our model on the English and Chinese datasets when adjusting the weights ($w_{SA}$, $w_{ID}$, $w_{OD}$) for the sentiment analysis, intention detection, and offensiveness detection tasks. These weight adjustments reflect the model's emphasis on each sub-task during multi-task learning. Experimental results indicate that by appropriately adjusting these weights, significant improvements can be achieved in the model's performance on specific tasks.
Table 5
Accuracy (%) under different task weights, where w1, w2, and w3 are the weights for sentiment analysis (SA), intention detection (ID), and offensiveness detection (OD), respectively.
English
w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD
0.1 0.1 0.8 22.25 26.50 76.38 0.2 0.5 0.3 26.62 32.62 70.88 0.4 0.4 0.2 28.25 34.12 73.75
0.1 0.2 0.7 28.25 37.25 76.62 0.2 0.6 0.2 26.38 35.75 69.62 0.4 0.5 0.1 20.38 32.50 58.13
0.1 0.3 0.6 27.50 33.38 65.38 0.2 0.7 0.1 25.62 35.50 65.75 0.5 0.1 0.4 23.62 30.12 60.12
0.1 0.4 0.5 26.80 34.38 58.25 0.3 0.1 0.6 26.38 37.75 68.25 0.5 0.2 0.3 27.62 36.50 67.00
0.1 0.5 0.4 25.62 33.38 61.62 0.3 0.2 0.5 29.12 37.25 66.50 0.5 0.3 0.2 25.25 33.50 65.50
0.1 0.6 0.3 26.25 32.88 64.00 0.3 0.3 0.4 22.12 27.62 56.12 0.5 0.4 0.1 17.12 31.13 48.00
0.1 0.7 0.2 25.37 33.75 67.25 0.3 0.4 0.3 24.75 29.88 49.12 0.6 0.1 0.3 16.88 23.50 48.38
0.1 0.8 0.1 20.00 34.25 68.75 0.3 0.5 0.2 24.00 38.00 59.00 0.6 0.2 0.2 22.25 26.00 76.62
0.2 0.1 0.7 25.87 34.12 54.75 0.3 0.6 0.1 18.12 37.38 41.25 0.6 0.3 0.1 26.00 33.12 66.88
0.2 0.2 0.6 24.50 29.88 60.62 0.4 0.1 0.5 28.88 36.12 74.50 0.7 0.1 0.2 22.75 31.13 69.62
0.2 0.3 0.5 27.62 34.00 71.50 0.4 0.2 0.4 29.82 44.10 74.09 0.7 0.2 0.1 23.00 28.12 56.62
0.2 0.4 0.4 26.50 34.75 69.75 0.4 0.3 0.3 23.00 30.00 70.38 0.8 0.1 0.1 26.12 28.25 62.38
Chinese
w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD w1 w2 w3 SA ID OD
0.1 0.1 0.8 34.32 32.84 76.10 0.2 0.5 0.3 34.98 43.09 65.18 0.4 0.4 0.2 34.40 42.10 59.39
0.1 0.2 0.7 32.25 24.90 70.39 0.2 0.6 0.2 36.05 31.51 71.88 0.4 0.5 0.1 34.23 41.36 58.56
0.1 0.3 0.6 34.40 46.24 76.92 0.2 0.7 0.1 34.07 41.03 57.40 0.5 0.1 0.4 33.41 38.71 63.61
0.1 0.4 0.5 35.81 45.99 74.94 0.3 0.1 0.6 35.39 52.15 79.24 0.5 0.2 0.3 35.97 43.34 70.97
0.1 0.5 0.4 34.98 43.42 71.96 0.3 0.2 0.5 37.22 44.56 80.07 0.5 0.3 0.2 34.40 42.51 74.52
0.1 0.6 0.3 31.75 44.25 73.28 0.3 0.3 0.4 34.07 35.98 61.29 0.5 0.4 0.1 34.15 41.27 58.73
0.1 0.7 0.2 34.07 45.49 69.07 0.3 0.4 0.3 35.72 52.31 76.67 0.6 0.1 0.3 36.39 44.58 67.91
0.1 0.8 0.1 34.81 42.02 79.49 0.3 0.5 0.2 32.33 42.76 76.43 0.6 0.2 0.2 36.05 46.57 76.34
0.2 0.1 0.7 31.67 52.15 78.74 0.3 0.6 0.1 34.73 46.15 67.91 0.6 0.3 0.1 37.05 46.32 79.40
0.2 0.2 0.6 35.89 47.48 75.52 0.4 0.1 0.5 36.39 43.76 78.91 0.7 0.1 0.2 34.81 43.67 68.98
0.2 0.3 0.5 35.14 43.59 66.83 0.4 0.2 0.4 35.56 52.23 79.24 0.7 0.2 0.1 35.97 43.67 65.59
0.2 0.4 0.4 36.47 48.30 80.23 0.4 0.3 0.3 34.57 43.51 79.90 0.8 0.1 0.1 35.89 52.39 63.94
For instance, on the English dataset, when the weights for the sentiment analysis, intention detection, and offensiveness detection tasks are set to 0.4, 0.2, and 0.4, respectively, the model achieves an accuracy of 29.82% for sentiment analysis, 44.10% for intention detection, and 74.09% for offensiveness detection. Similarly, on the Chinese dataset, by adjusting the weights, the model achieves accuracies of 37.22%, 44.56%, and 80.07% for the three tasks, respectively. These results underscore the importance of weight adjustment in optimizing model performance within a multi-task learning framework.
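The weighted multi-task objective that these weights control is not reproduced on the pages above, so the following is only a hedged sketch of how such weights are typically combined with per-task cross-entropy losses; the treatment of the metaphor recognition term as unweighted is purely an assumption.

```python
import torch.nn.functional as F

def multitask_loss(logits_sa, logits_id, logits_od, logits_mr,
                   y_sa, y_id, y_od, y_mr,
                   w_sa=0.4, w_id=0.2, w_od=0.4):
    """Weighted sum of per-task cross-entropy losses; the weighting scheme and
    the handling of the metaphor recognition term are assumptions."""
    return (w_sa * F.cross_entropy(logits_sa, y_sa)
            + w_id * F.cross_entropy(logits_id, y_id)
            + w_od * F.cross_entropy(logits_od, y_od)
            + F.cross_entropy(logits_mr, y_mr))
```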
4.9. Error analysis

We also perform error analysis on the experimental results for the wrongly predicted samples. Examples are shown in Fig. 6, and we find that the errors can be summarized into two aspects. On the one hand, the information provided in an image and its text can be insufficient to assist the various tasks, as shown in the image of example (a). On the other hand, as demonstrated
Multi-task Methods. Chauhan et al. [42] developed the Inter-task Relationship Module (iTRM) to elucidate the synergy between tasks and the Inter-class Relationship Module (iCRM) to establish and strengthen connections among disparate task classes. Ultimately, representations from these two modules are combined across all the tasks. Ma et al. [43] undertook a primary task alongside two uni-modal auxiliary tasks. Notably, they introduced a self-supervised generation strategy within the auxiliary tasks to automatically generate uni-modal auxiliary labels, and this self-supervised approach is designed to enhance the efficiency and effectiveness of the auxiliary tasks by eliminating the need for manually annotated labels. The work of [43] thus proposed a multi-task learning approach for multi-modal meme detection that incorporates a primary multi-modal task and two additional uni-modal auxiliary tasks to enhance the performance of the detection process.

In contrast to these approaches, which may overlook the significance of metaphorical information, our study places a strong emphasis on metaphor recognition in multi-modal meme detection. In this paper, we focus on four sub-tasks: sentiment analysis, intention detection, offensiveness detection, and metaphor recognition.

5.3. Metaphorical information

In the field of NLP, there has been a growing interest in exploring different approaches to meme understanding, particularly in relation to metaphorical information. Early studies on metaphor predominantly relied on manually constructed knowledge. Jang et al. [44] introduced a novel and highly effective method for inducing and applying metaphor frame templates. This approach served as a crucial advancement towards metaphor detection, providing a framework to recognize and analyze metaphorical expressions in text data. Tsvetkov et al. [45] demonstrated the feasibility of reliably distinguishing between literal and metaphorical syntactic constructions. Through their research, they showcased that it is possible to discern whether a given syntactic construction should be interpreted literally or metaphorically by considering the lexical-semantic features of the words involved in the construction. Some researchers have also used distributional clustering and unsupervised approaches [46,47]. In recent years, there has been an increasing exploration of deep learning models for metaphor detection. However, limited research has been conducted on multi-modal metaphor detection, with only a few studies.

Metaphor means reasoning about one thing in terms of another [48]. Because metaphor is a linguistic phenomenon, most previous works detected metaphorical information in a text-only setting [45,49]. Recently, some works have started to study metaphorical information in multi-modal scenarios. For example, [50] presented the first metaphor identification method that simultaneously draws knowledge from linguistic and visual data. Lakoff and Johnson [3] demonstrated that metaphors are found not only in languages but also via thoughts and actions. Xu et al. [6] presented a large-scale multi-modal metaphor dataset with manual fine-grained annotation including metaphorical information.

6. Conclusion

In this paper, we explore a novel approach for fine-grained meme understanding by aligning the incongruity among metaphorical information, images, and texts, and leveraging their potential for enhancing various tasks. More concretely, we present a Metaphor-aware Multi-modal Multi-task Framework (M3F) that takes metaphorical information into consideration. Furthermore, we construct inter-modality attention to capture the interaction between text and image and create intra-modality attention to model the congruity between text and metaphorical information. Additionally, to better learn the implicit interaction across various tasks simultaneously, we also design a multi-interactive decoder that exploits gating networks to establish the relationships among the various subtasks. Our experimental results on a widely recognized benchmark dataset demonstrate a clear superiority of our proposed method over state-of-the-art baseline models.

CRediT authorship contribution statement

Bingbing Wang: Writing – review & editing, Writing – original draft, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Shijue Huang: Writing – review & editing, Methodology, Conceptualization. Bin Liang: Writing – review & editing, Writing – original draft. Geng Tu: Visualization, Software. Min Yang: Writing – review & editing, Methodology. Ruifeng Xu: Writing – review & editing, Writing – original draft, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China (62176076), the Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies (2022B1212010005), the Guangdong Provincial Natural Science Foundation (2023A1515012922), and Shenzhen Foundational Research Funding JCYJ20220818102415032.

References

[1] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The hateful memes challenge: Detecting hate speech in multimodal memes, Advances in Neural Information Processing Systems 33 (2020) 2611–2624.
[2] H.R. Kirk, Y. Jun, P. Rauba, G. Wachtel, R. Li, X. Bai, N. Broestl, M. Doff-Sotta, A. Shtedritski, Y.M. Asano, Memes in the wild: Assessing the generalizability of the hateful memes challenge dataset, 2021, arXiv preprint arXiv:2107.04313.
[3] G. Lakoff, M. Johnson, Metaphors We Live By, University of Chicago Press, 2008.
[4] S.M. Anurudu, I.M. Obi, Decoding the metaphor of internet meme: A study of satirical tweets on black friday sales in Nigeria, Afrrev. Laligens 6 (1) (2017) 91–100.
[5] Z. Kovecses, Metaphor: A Practical Introduction, Oxford University Press, 2010.
[6] B. Xu, T. Li, J. Zheng, M. Naseriparsa, Z. Zhao, H. Lin, F. Xia, MET-Meme: A multimodal meme dataset rich in metaphors, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 2887–2899.
[7] Z. Wang, S. Mayhew, D. Roth, et al., Cross-lingual ability of multilingual BERT: An empirical study, 2019, arXiv.
[8] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv.
[9] Q. Zhang, H. Wu, C. Zhang, Q. Hu, H. Fu, J.T. Zhou, X. Peng, Provable dynamic fusion for low-quality multimodal data, 2023, arXiv preprint arXiv:2306.02050.
[10] Y. Zhang, J. Wang, Y. Liu, L. Rong, Q. Zheng, D. Song, P. Tiwari, J. Qin, A multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations, Inf. Fusion 93 (2023) 282–301.
[11] D. Dimitrov, B.B. Ali, S. Shaar, F. Alam, F. Silvestri, H. Firooz, P. Nakov, G. Da San Martino, Detecting propaganda techniques in memes, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6603–6617.
[12] S. Suryawanshi, B.R. Chakravarthi, M. Arcan, P. Buitelaar, Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, 2020, pp. 32–41.
[13] A.R. Akula, B. Driscoll, P. Narayana, S. Changpinyo, Z. Jia, S. Damle, G. Pruthi, S. Basu, L. Guibas, W.T. Freeman, et al., MetaCLUE: Towards comprehensive visual metaphors research, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23201–23211.
[14] D. Zhang, M. Zhang, H. Zhang, L. Yang, H. Lin, MultiMET: A multimodal dataset for metaphor understanding, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3214–3225.
[15] C. Sharma, D. Bhageria, W. Scott, S. Pykl, A. Das, T. Chakraborty, V. Pulabaigari, B. Gamback, SemEval-2020 task 8: Memotion analysis – the visuo-lingual metaphor!, 2020, arXiv.
[16] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019).
[18] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[20] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: ICML, PMLR, 2019, pp. 6105–6114.
[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16 × 16 words: Transformers for image recognition at scale, 2020, arXiv.
[22] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, S.J. Oh, Rethinking spatial dimensions of vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11936–11945.
[23] Y. Baek, B. Lee, D. Han, S. Yun, H. Lee, Character region awareness for text detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9365–9374.
[24] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, M. Ebrahimi, EdgeConnect: Structure guided image inpainting using edge prediction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[25] S. Suryawanshi, B.R. Chakravarthi, M. Arcan, P. Buitelaar, Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text, in: Proceedings of the Second Workshop on TRAC, ELRA, Marseille, France, 2020, pp. 32–41.
[26] S. Pramanick, S. Sharma, D. Dimitrov, M.S. Akhtar, P. Nakov, T. Chakraborty, MOMENTA: A multimodal framework for detecting harmful memes and their targets, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 2021, pp. 4439–4455.
[27] F. Gasparini, G. Rizzi, A. Saibene, E. Fersini, Benchmark dataset of memes with text transcriptions for automatic detection of multi-modal misogynistic content, Data Brief 44 (2022) 108526.
[28] J. Wang, Y. Yang, K. Liu, Z. Zhu, X. Liu, M3S: Scene graph driven multi-granularity multi-task learning for multi-modal NER, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2022) 111–120.
[29] F. Chen, J. Liu, K. Ji, W. Ren, J. Wang, J. Chen, Learning implicit entity-object relations by bidirectional generative alignment for multimodal NER, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4555–4563.
[30] J. Wu, C. Gong, Z. Cao, G. Fu, MCG-MNER: A multi-granularity cross-modality generative framework for multimodal NER with instruction, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 3209–3218.
[31] I. Laina, C. Rupprecht, N. Navab, Towards unsupervised image captioning with shared multimodal embeddings, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7414–7424.
[32] V. Sandulescu, Detecting hateful memes using a multimodal deep ensemble, 2020, arXiv preprint arXiv:2012.13235.
[33] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, Y. Artzi, A corpus for reasoning about natural language grounded in photographs, 2018, arXiv preprint arXiv:1811.00491.
[34] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, UNITER: Universal image-text representation learning, in: European Conference on Computer Vision, Springer, 2020, pp. 104–120.
[35] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, VL-BERT: Pre-training of generic visual-linguistic representations, 2019, arXiv preprint arXiv:1908.08530.
[36] H. Tan, M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, 2019, pp. 5100–5111.
[37] W. Zhang, G. Liu, Z. Li, F. Zhu, Hateful memes detection via complementary visual and linguistic networks, 2020, arXiv preprint arXiv:2012.04977.
[38] Y. Zhou, Z. Chen, H. Yang, Multimodal learning for hateful memes detection, in: 2021 IEEE International Conference on Multimedia & Expo Workshops, ICMEW, IEEE, 2021, pp. 1–6.
[39] R. Cao, R.K.-W. Lee, W.-H. Chong, J. Jiang, Prompting for multimodal hateful meme classification, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 321–332.
[40] R. Cao, M.S. Hee, A. Kuek, W.-H. Chong, R.K.-W. Lee, J. Jiang, Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5244–5252.
[41] J. Ji, W. Ren, U. Naseem, Identifying creative harmful memes via prompt based approach, in: Proceedings of the ACM Web Conference 2023, 2023, pp. 3868–3872.
[42] D.S. Chauhan, S. Dhanush, A. Ekbal, P. Bhattacharyya, All-in-one: A deep attentive multi-task learning framework for humour, sarcasm, offensive, motivation, and sentiment on memes, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020, pp. 281–290.
[43] Z. Ma, S. Yao, L. Wu, S. Gao, Y. Zhang, Hateful memes detection based on multi-task learning, Mathematics 10 (23) (2022) 4525.
[44] H. Jang, K. Maki, E. Hovy, C. Rose, Finding structure in figurative language: Metaphor detection with topic-based frames, in: Proceedings of the 18th Annual SIGDIAL Meeting on Discourse and Dialogue, 2017, pp. 320–330.
[45] Y. Tsvetkov, L. Boytsov, A. Gershman, E. Nyberg, C. Dyer, Metaphor detection with cross-lingual model transfer, in: Proceedings of the 52nd Annual Meeting of the ACL (Volume 1: Long Papers), 2014, pp. 248–258.
[46] E. Shutova, L. Sun, E.D. Gutiérrez, P. Lichtenstein, S. Narayanan, Multilingual metaphor processing: Experiments with semi-supervised and unsupervised learning, Comput. Linguist. 43 (1) (2017) 71–123.
[47] R. Mao, C. Lin, F. Guerin, Word embedding and WordNet based metaphor identification and interpretation, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1222–1231.
[48] G. Lakoff, M. Johnson, Conceptual metaphor in everyday language, in: Shaping Entrepreneurship Research, 1980.
[49] G. Gao, E. Choi, Y. Choi, L. Zettlemoyer, Neural metaphor detection in context, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, 2018, pp. 607–613.
[50] E. Shutova, D. Kiela, J. Maillard, Black holes and white rabbits: Metaphor identification with visual features, in: Proceedings of the 2016 Conference of NAACL: Human Language Technologies, ACL, San Diego, California, 2016, pp. 160–170.