Retrieval-Augmented Dynamic Prompt Tuning For Incomplete Multimodal Learning
Jian Lang1*, Zhangtao Cheng1*, Ting Zhong1,2, Fan Zhou1,2†
1 University of Electronic Science and Technology of China, Chengdu, Sichuan, China
2 Kash Institute of Electronics and Information Industry, Kashgar, Xinjiang, China
jian lang@[Link], [Link]@[Link], zhongting@[Link], [Link]@[Link]
* These authors contributed equally.
† Corresponding author.
Copyright © 2025, Association for the Advancement of Artificial Intelligence ([Link]). All rights reserved.

Abstract

Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing-modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and the prompt-based baselines is available at [Link].

Figure 1: Prior prompt-based methods vs. our RAGPT in tackling incomplete multimodal learning.

Introduction

Multimodal learning has emerged as a critical paradigm in both research and industry, demonstrating broad application potential in areas such as healthcare assistance (Ghosh et al. 2024) and malicious content detection (Kiela et al. 2020). However, most successful methods typically assume that complete modalities are available during both the training and inference phases. In reality, factors such as malfunctioning sensors and privacy concerns often make it infeasible to collect complete modalities (Ma et al. 2021). As a result, the challenge of incomplete modalities significantly impacts the reliability, accuracy, and safety of multimodal models in practical applications (Woo et al. 2023; Cheng et al. 2024a).

To address this challenge, researchers have developed various robust multimodal methods that are broadly categorized into three groups: (1) joint learning methods (Wang et al. 2023; Yao et al. 2024), (2) cross-modal generation methods (Ma et al. 2021; Woo et al. 2023), and (3) prompt-based methods (Lee et al. 2023; Jang, Wang, and Kim 2024). Joint learning methods heavily rely on the selection of similarity measures and require filling missing-modality inputs with masking values, resulting in the loss of critical information and the introduction of noise into the models (Wang et al. 2024). Cross-modal generation methods inevitably face modality heterogeneity issues and incur limited reconstruction quality. Recently, prompt-based methods have gained significant attention due to the rise of powerful pre-trained MultiModal Transformers (MMTs). These methods leverage prompt-tuning techniques to effectively transfer the capabilities of MMTs pre-trained on complete multimodal datasets to tasks involving missing modalities, achieving remarkable performance and making them a dominant trend in incomplete multimodal learning.

However, for incomplete modalities, prompt-based methods typically use the available modalities as the only cue to fulfill task-specific objectives through prompt learning (see Fig. 1). Despite their progress, these methods often struggle in severe missing-modality scenarios due to several unresolved issues inherent in their design: (1) The remaining modalities typically provide restricted modal information, which fails to effectively address specific tasks when the missing modality contains crucial modal cues. (2) Modality-incomplete inputs are often filled with dummy values (e.g., empty strings/pixels for texts/images), which may introduce noise, leading to degraded performance (Ma et al. 2022). (3) The prompt tokens are shared across all inputs and are therefore instance-agnostic. Such static prompt-tuning is not well-suited for real multimodal instances, as instances with different types of missing modalities belong to distinct distributions. Additionally, static prompts typically provide limited knowledge for both missing- and full-modality instances. These observations motivate us to design a universal prompt-tuning strategy that enhances the pre-trained MMT's robustness to incomplete modalities.

To address these issues, we draw inspiration from the human ability to learn through observation, which involves mastering skills by observing relevant subjects rather than attempting to memorize every subject (Hodges et al. 2007). As shown in Fig. 1, we leverage this cognitive principle to address the challenge of missing modalities. Our core idea is to retrieve relevant multimodal content and utilize it as prompts to enhance the robustness of the pre-trained MMT in both missing- and full-modality scenarios. Intuitively, for instances with missing modalities, appending multimodal content from similar instances can provide contextual knowledge relevant to the missing modality and improve task-specific predictions.

To this end, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework that adaptively enhances the robustness of the pre-trained MMT in both missing- and full-modality scenarios. Fundamentally, we reformulate incomplete modality learning in a principled retrieve-and-prompt manner and maintain a model-agnostic design that facilitates seamless integration with various prompt-based models. RAGPT includes three modules: the multi-channel retriever, the missing modality generator, and the context-aware prompter. During retrieval, we propose a universal multi-channel retrieval strategy that disentangles multimodal representations into unimodal components, facilitating the retrieval of similar samples based on within-modality similarities for missing- and full-modality scenarios. Next, the missing modality generator comprises a learnable filter that approximates the missing information. Beyond traditional reconstruction techniques, which suffer from modality gaps during cross-modal generation, this generator realizes intra-modal reconstruction by leveraging information from the retrieved samples that belong to the same modality as the missing one. Moreover, this design enriches the missing-modality representation, ensuring alignment with the complete-modality input format used by the pre-trained MMT during pre-training. Finally, the context-aware prompter identifies the semantic correlations between the target and retrieved instances, producing dynamic multimodal prompts tailored to different inputs. These prompts facilitate the adaptive refinement of modality features in both missing- and full-modality scenarios, thereby enhancing the robustness of the pre-trained models. We insert these modules into the pre-trained MMTs to achieve a more accurate representation for both missing- and full-modality data. Following are our main contributions:

• To our best knowledge, this is the first retrieval-augmented paradigm for incomplete modalities. We reveal that prior prompt-based methods suffer from issues related to dummy padding and static prompts, which drastically degrade performance in severe missing-modality cases.
• To address these issues, we propose RAGPT, pioneering a retrieval-augmented dynamic prompt-tuning framework that bridges target and relevant instances, recovers the missing modality, and generates dynamic prompts to enhance the MMT's robustness in diverse missing-modality situations.
• We conduct extensive experiments on three real-world datasets to evaluate RAGPT against 9 competitive baselines, and the results confirm RAGPT's effectiveness in addressing missing-modality issues.

Related Work

Missing-Modality in Multimodal Learning. Researchers have developed various robust methods for incomplete multimodal learning, which can be divided into three groups. (1) Joint learning methods (Zhao, Li, and Jin 2021; Wang et al. 2023; Yao et al. 2024) focus on distilling complex correlations from complete modalities to tackle missing-modality data. However, these methods require filling modality-incomplete inputs with masking values, which may cause unexpected behavior and introduce additional noise during inference. (2) Cross-modal generation methods (Lee et al. 2019; Yuan et al. 2021) primarily reconstruct the missing content from the remaining modalities; for example, several works (Ma et al. 2021; Woo et al. 2023) directly employ VAEs to generate the missing modality based only on the available modalities. Consequently, these methods inevitably face modality heterogeneity problems. (3) Prompt-based methods (Lee et al. 2023; Jang, Wang, and Kim 2024) represent a recent trend in this field, introducing learnable prompts to help pre-trained MMTs address incomplete modalities.

However, prompt-based methods are constrained by dummy imputation and static prompting strategies, resulting in performance bottlenecks. In contrast, our RAGPT captures contextual knowledge from retrieved instances to recover the missing content and generates dynamic prompts to enhance the MMT's robustness to missing modalities.

Prompt Learning. Prompt learning (Liu et al. 2023) adds a small number of learnable prompt parameters to the input of pre-trained transformers, facilitating the adjustment of pre-trained models to downstream tasks. It has been successfully applied to various domains, such as visual identity (Khattak et al. 2023; Lee et al. 2023) and social network analysis (Zhou et al. 2021; Xu et al. 2021; Zhong et al. 2024; Cheng et al. 2024b, 2023). Following the success of prompt learning in NLP tasks (Li and Liang 2021), recent works have attempted to explore its application in multimodal learning (Zhou et al. 2022a). For instance, MaPLe (Khattak et al. 2023) introduces a soft prompt appended to the hidden representations of MMTs, resulting in significant improvements in few-shot image recognition.
Figure 2: Overall framework of RAGPT. (1) The multi-channel retriever identifies similar instances through a within-modality retrieval strategy; (2) the context-aware prompter captures contextual knowledge from relevant instances and generates dynamic prompts; (3) the knowledge-augmented prompt-tuning process first recovers the missing content by using a missing-modality generator and then performs dynamic prompt-tuning on the pre-trained MMT for final prediction.
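To make the data flow in Fig. 2 concrete, the sketch below outlines one plausible way the three modules could be wired together at inference time. All component and attribute names (retriever, generator, prompter, instance.text, and so on) are hypothetical stand-ins for the modules described in the following sections, not the released implementation.

```python
def ragpt_forward(instance, memory, retriever, generator, prompter, mmt, classifier, k=5):
    """High-level RAGPT flow for one (possibly modality-incomplete) instance.

    retriever / generator / prompter / mmt / classifier are placeholders for the
    multi-channel retriever, missing modality generator, context-aware prompter,
    frozen multimodal transformer, and label-augmented classifier.
    """
    retrieved = retriever(instance, memory, k=k)            # (1) within-modality top-K retrieval
    if instance.text is None:                               # (2) recover a missing modality from
        instance.text_emb = generator(retrieved.text_embs)  #     retrieved same-modality content
    elif instance.image is None:
        instance.image_emb = generator(retrieved.image_embs)
    prompts = prompter(instance, retrieved)                 # (3) dynamic text-/vision-/label-level prompts
    features = mmt(instance, prompts)                       # frozen MMT, prompts attached at layer b
    return classifier(features)                             # label-augmented prediction
```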
For incomplete multimodal learning, MAPs (Lee et al. 2023) and MSPs (Jang, Wang, and Kim 2024) design various prompts to fine-tune pre-trained MMTs, enabling them to adapt effectively to missing-modality scenarios. However, these prompts are instance-agnostic and provide limited information for both missing- and full-modality data. In contrast, the context-aware prompter in RAGPT captures rich contextual knowledge from relevant instances, alleviating the drawbacks associated with instance-agnostic prompts.

Methodology

Problem Definition: In this paper, we consider a multimodal dataset incorporating two modalities. Formally, we define D = {D^f, D^m} to represent the multimodal dataset. Here, D^f = {(x_i^1, x_i^2, y_i)}_{i=1}^{N^f} represents the modality-complete subset, where y_i is the class label of the i-th instance, x_i^1 and x_i^2 denote the two modalities (e.g., texts and images), and N^f is the total number of instances in D^f. Conversely, D^m = {(x_i^1, ∅, y_i) ∨ (∅, x_i^2, y_i)}_{i=1}^{N^m} is the modality-incomplete subset, where ∅ indicates a missing modality and N^m is the number of missing-modality instances in D^m. The objective of the task is to enhance model robustness in cases where modalities are missing during both the training and testing phases.

Fig. 2 presents the key components of RAGPT and their relationships. The following sections delve into the specifics of each component and their respective implementations.

Multi-Channel Retriever

In this section, we design a unified multi-channel retriever that identifies similar modal content for queries within their respective modalities by using within-modality similarities.

Memory Construction To store high-quality semantic information as prior knowledge, we define the memory B, which encodes multimodal instances as a collection of (image, text, label) triplets.
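As one way to realize this memory, the sketch below encodes every training triplet with frozen CLIP encoders from the Hugging Face transformers library. The paper only specifies CLIP-style text and vision encoders, so the checkpoint name and tooling used here are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def build_memory(images, texts, labels):
    """Encode every (image, text, label) training triplet into the memory bank B."""
    text_in = processor(text=list(texts), return_tensors="pt", padding=True, truncation=True)
    image_in = processor(images=list(images), return_tensors="pt")
    return {
        "text_emb": clip.get_text_features(**text_in),     # (N, d_t), used as E_r^t in Eq. 1
        "image_emb": clip.get_image_features(**image_in),  # (N, d_v), used as the vision keys
        "triplets": list(zip(images, texts, labels)),      # raw content returned at lookup time
    }
```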
Multi-Channel Retrieval To adapt to diverse missing- and full-modality scenarios, we develop a Multi-Channel Retriever (MCR) that effectively retrieves relevant instances through a unified retrieval architecture. Specifically, for the text-missing channel, the MCR employs the image representation as a query to identify the top-K similar images and incorporates the associated texts to create multimodal instances. For complete modalities, the MCR utilizes both the image and the text to search relevant texts and images, respectively, thereby creating multimodal instances.

Specifically, in the text-level branch, the MCR first tokenizes the text x_i^1 in the target instance T_i into n word tokens and then projects them into word embeddings W_i ∈ R^{n×d_t}, where d_t is the dimension of the word embedding. Next, the embedding W_i is fed into a pre-trained textual encoder (e.g., the CLIP textual encoder (Radford et al. 2021)) Ψ_t(·) to obtain the text representation E_i^t = Ψ_t(W_i) ∈ R^{d_t}. Subsequently, the MCR utilizes the text query E_i^t to calculate similarity scores with the text representations E_r^t from the memory B, enabling the identification of the top-K textually similar instances C_i^R:

C_i^R = Top-K_{r∈B} ( (E_i^t)^⊤ E_r^t / (‖E_i^t‖ · ‖E_r^t‖) ).    (1)

For the vision content, the MCR first divides the image x_i^2 into m non-overlapping patches and then projects them into a sequence of patch tokens V_i ∈ R^{m×d_v}. Next, these tokens V_i are input into a pre-trained vision encoder (e.g., the CLIP vision encoder (Radford et al. 2021)) Ψ_v(·) to obtain the vision query E_i^v ∈ R^{d_v}. Finally, the retrieval process for searching the top-K vision content is the same as defined in Eq. 1.

After retrieval, the top-K instances C_i^R = {c_i^{r_1}, ..., c_i^{r_K}} can be readily obtained. Each retrieved instance c_i^{r_k} contains an (image, text, label) triplet. The retrieved top-K instances provide auxiliary context, guiding the recovery of missing content in the target instance and improving task-specific predictions.
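A minimal sketch of the top-K within-modality lookup in Eq. 1, assuming the memory embeddings are pre-computed and stacked into a single tensor; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def within_modality_topk(query_emb: torch.Tensor,
                         memory_emb: torch.Tensor,
                         k: int = 5) -> torch.Tensor:
    """Indices of the top-K memory entries by cosine similarity (Eq. 1).

    query_emb:  (d,) embedding of the modality available in the target instance.
    memory_emb: (N, d) embeddings of the same modality for all instances in the memory B.
    """
    q = F.normalize(query_emb, dim=-1)   # E_i / ||E_i||
    m = F.normalize(memory_emb, dim=-1)  # E_r / ||E_r||
    return (m @ q).topk(k).indices       # positions of the K most similar instances

# For a text-missing instance, retrieve with the image channel and read back the
# stored (image, text, label) triplets at the returned indices:
# idx = within_modality_topk(image_query, memory["image_emb"], k=5)
# retrieved = [memory["triplets"][i] for i in idx.tolist()]
```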
Context-Aware Prompter

To explicitly capture expressive contextual information and enhance the robustness of pre-trained MMTs against missing-modality issues, we design a Context-Aware Prompter (CAP) that constructs text-, vision-, and label-level dynamic prompts from the retrieved instances C_i^R. For text-level prompts, the CAP fuses the reference textual features in C_i^R and aligns them with the textual embedding of T_i through a simple network. Specifically, the CAP first tokenizes and projects the texts x_i^1 and {x_i^{1,r_k}}_{k=1}^K into word embeddings W_i ∈ R^{n×d_t} and W_i^R = {W_i^{r_k}}_{k=1}^K ∈ R^{K×n×d_t}. Subsequently, the word embedding W_i is used as the query to interact with the retrieved text features {W_i^{r_k}}_{k=1}^K via a cross-attention block to facilitate comprehension of the context, thereby generating the text-level comprehensive representation P̃_i^t ∈ R^{n×d_t}:

P̃_i^t = Att( f_t^Q(W_i), f_t^K(W_i^R), f_t^V(W_i^R) ),    (2)
Att(Q, K, V) = Softmax( QK^⊤ / √d ) V,    (3)

where f_t^Q(·), f_t^K(·), and f_t^V(·) denote the query, key, and value projection functions, respectively. For vision-level prompts, the CAP uses the same process to interact the vision patch tokens V_i ∈ R^{m×d_v} with the retrieved patch tokens V_i^R ∈ R^{K×m×d_v} and obtain the vision-level representation P̃_i^v ∈ R^{m×d_v}. Then, the CAP employs an adaptive pooling strategy to obtain the final context-aware prompts P_i^t ∈ R^{l×d_t} and P_i^v ∈ R^{l×d_v}, where l is the prompt length. For label-level prompts, the CAP maintains a label embedding matrix P̃_i^l ∈ R^{C×d} to encode the C class labels, where d is an adjustable dimension. Given the retrieved labels, the CAP performs a look-up operation on the embedding matrix P̃_i^l and obtains each label embedding. Next, the CAP averages the K label embeddings and generates the label-level prompt P_i^l ∈ R^d.
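The text-level branch could be sketched as below: a single cross-attention block (Eqs. 2-3) in which the target word embeddings query the retrieved ones, followed by adaptive pooling to prompt length l. Flattening the K retrieved sequences into one key/value sequence is an assumption made here for brevity, and the module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class ContextAwarePrompter(nn.Module):
    """Cross-attention over retrieved token features, pooled to prompt length l."""
    def __init__(self, dim: int, prompt_len: int = 2, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(prompt_len)

    def forward(self,
                target_tokens: torch.Tensor,     # (B, n, d) target word embeddings W_i
                retrieved_tokens: torch.Tensor   # (B, K, n, d) retrieved embeddings W_i^R
                ) -> torch.Tensor:               # (B, l, d) dynamic prompt P_i^t
        B, K, n, d = retrieved_tokens.shape
        context = retrieved_tokens.reshape(B, K * n, d)   # flatten K instances into one sequence
        fused, _ = self.attn(query=target_tokens, key=context, value=context)  # Eq. 2
        return self.pool(fused.transpose(1, 2)).transpose(1, 2)  # adaptive pooling to length l

# prompter = ContextAwarePrompter(dim=768, prompt_len=2)
# p_text = prompter(W_i, W_i_R)   # (B, 2, 768) text-level prompt
```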
Knowledge-Augmented Prompt-Tuning

In this process, we first utilize the retrieved modal information to approximate the missing content through a missing modality generator. Next, we perform dynamic prompt-tuning on the pre-trained MMT (e.g., ViLT (Kim, Son, and Kim 2021)) to enhance task-specific inference.

Missing Modality Generator Existing reconstruction methods (Ma et al. 2021) address missing-modality issues by recovering the missing content from the available modalities. However, these methods often overlook the modality heterogeneity issue and rely on complex generative structures. Based on these observations, we propose a Missing Modality Generator (MMG) that recovers the missing modality through an "intra-modal reconstruction". The MMG leverages retrieved content of the same modality as the missing one and incorporates a learnable filter layer to approximate the missing modality in a simpler but effective manner. Specifically, given a text-missing instance T_i, the MMG employs a non-parametric strategy that averages all text embeddings W_i^R = {W_i^{r_k}}_{k=1}^K from the retrieved instances C_i^R, thereby obtaining the textual representation W̄_i ∈ R^{n×d_t} to approximate the missing modality.

Considering potential noise in the comprehensive textual representation W̄_i, the MMG introduces a simple learnable filter block (i.e., an MLP-based filter (Zhou et al. 2022b)) to efficiently refine the textual features W̄_i by removing noise. Specifically, the MMG applies the Fast Fourier Transform (FFT) along the textual dimension, which transforms the text context representation W̄_i into the frequency domain:

Z_i = F(W̄_i) ∈ C^{n×d_t},    (4)

where F(·) denotes the one-dimensional FFT and Z_i is the spectrum of W̄_i. The MMG then modulates the spectrum by element-wise multiplication with a learnable filter W ∈ C^{n×d_t}:

Z̃_i = W ⊙ Z_i,    (5)

where ⊙ denotes element-wise multiplication. Finally, the MMG applies the inverse FFT to transform the modulated spectrum Z̃_i back into the time domain:

W̃_i = F^{-1}(Z̃_i) ∈ R^{n×d_t},    (6)

where F^{-1}(·) is the inverse one-dimensional FFT, converting the complex tensor back into a real-valued tensor. To further stabilize training and enhance the embedding, the MMG incorporates a skip connection, layer normalization, and dropout:

Ŵ_i = LayerNorm(W̃_i + Dropout(W̃_i)).    (7)

Finally, the recovered representation Ŵ_i is used as the embedding for the missing modality and is subsequently fed into the pre-trained MMT. The same process is applied to scenarios involving missing images to obtain the corresponding vision patch embedding V̂_i.
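A sketch of the learnable frequency-domain filter in Eqs. 4-7. It assumes the averaged retrieved embedding has shape (batch, n, d_t) and uses a real FFT over the token dimension, which halves the stored spectrum; this is a common parameterization of such filters but not necessarily the one used in the released code.

```python
import torch
import torch.nn as nn

class LearnableFilter(nn.Module):
    """Refine the averaged retrieved embedding via a learnable frequency-domain filter."""
    def __init__(self, seq_len: int, dim: int, dropout: float = 0.1):
        super().__init__()
        # complex filter W of shape (seq_len//2 + 1, dim), stored as a real/imaginary pair
        self.weight = nn.Parameter(torch.randn(seq_len // 2 + 1, dim, 2) * 0.02)
        self.norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, w_bar: torch.Tensor) -> torch.Tensor:        # (B, n, d)
        z = torch.fft.rfft(w_bar, dim=1)                           # Eq. 4: FFT along the token axis
        z_mod = z * torch.view_as_complex(self.weight)             # Eq. 5: element-wise modulation
        w_tilde = torch.fft.irfft(z_mod, n=w_bar.size(1), dim=1)   # Eq. 6: inverse FFT
        return self.norm(w_tilde + self.dropout(w_tilde))          # Eq. 7: skip + LayerNorm

# w_bar = retrieved_text_embs.mean(dim=1)                 # average the K retrieved text embeddings
# w_hat = LearnableFilter(seq_len=40, dim=768)(w_bar)     # recovered text embedding W_hat
```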
Dynamic Prompt-Tuning Given a pre-trained MMT f_θ with N consecutive Multi-head Self-Attention (MSA) layers, we denote the input representation of the b-th MSA layer as h^b ∈ R^{L×d}, b = 1, 2, ..., N, with input length L and embedding dimension d. For full-modality data, we utilize the embedding layer of the pre-trained model f_θ(·) to obtain the corresponding text embedding E^t and image embedding E^v. In the case of missing modality, we employ the generated word embedding Ŵ and vision patch embedding V̂ to fill the corresponding missing modality. h^1 is the concatenation of the text embedding E^t and the image embedding E^v. The context-aware prompts P^t, P^v, and P^l are then attached to the embedding features along the sequence-length dimension to form the extended features h_p^b = [P^t, P^v, P^l, h^b]. These extended features h_p^b are fed into the MMT starting from the b-th layer and continue to propagate through the remaining layers. The final output h_p^N represents the comprehensive modal representation after the N-th layer. Rather than adding prompts at each MSA layer, which can result in considerable overhead, we selectively insert the prompts only into the specific b-th layer.
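Schematically, the prompt insertion at a single layer b can be written as below, where `layers` stands in for the frozen ViLT encoder blocks (a hypothetical handle), each treated as a callable on (B, L, d) tensors.

```python
import torch

def prompt_tuned_forward(h1: torch.Tensor, layers, prompts: torch.Tensor, b: int = 2):
    """Run frozen MMT layers, injecting [P_t, P_v, P_l] before the b-th layer.

    h1:      (B, L, d) concatenated text/image embeddings (recovered ones if a modality is missing).
    layers:  sequence of frozen transformer blocks, each mapping (B, L, d) -> (B, L, d).
    prompts: (B, l_total, d) concatenation of text-, vision-, and label-level prompts.
    """
    h = h1
    for i, layer in enumerate(layers, start=1):
        if i == b:                               # attach prompts only at layer b
            h = torch.cat([prompts, h], dim=1)   # h_p^b = [P_t, P_v, P_l, h^b]
        h = layer(h)
    return h                                     # h_p^N: final comprehensive representation
```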
Label-Augmented Prediction To further leverage the contextual information in the label-level prompts, we design a label-augmented classifier that computes the similarity between the output representation of the MMT and the label matrix P̃^l. Specifically, for the final prediction, we feed the output representation h_p^N into the pooler layer to obtain the representation Z ∈ R^{d×1}. Next, we calculate the probabilities ŷ ∈ R^{C×1} for the C classes: ŷ = Softmax(P̃^l Z). During training, we freeze all parameters in the MMT and optimize the model using the cross-entropy loss.
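A sketch of the label-augmented classifier, assuming P̃^l is stored as a learnable (C × d) matrix and the pooler is a linear-plus-tanh head over the first output token (as in ViLT-style models); the names are illustrative.

```python
import torch
import torch.nn as nn

class LabelAugmentedClassifier(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.pooler = nn.Linear(dim, dim)                                        # pooler layer
        self.label_matrix = nn.Parameter(torch.randn(num_classes, dim) * 0.02)  # P~_l (C, d)

    def forward(self, h_final: torch.Tensor) -> torch.Tensor:   # (B, L, d) output of the MMT
        z = torch.tanh(self.pooler(h_final[:, 0]))               # (B, d) pooled representation Z
        logits = z @ self.label_matrix.t()                       # similarity with each class label
        return logits.softmax(dim=-1)                            # y_hat = Softmax(P~_l Z)
```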
Dataset          MM-IMDb                                    HateMemes                    Food101
Missing Type     Text          Image         Both           Text     Image    Both       Text    Image   Both
Method           F1-M   F1-S   F1-M   F1-S   F1-M   F1-S    AUROC    AUROC    AUROC      ACC     ACC     ACC
SMIL 38.32 38.55 27.57 35.27 35.12 31.87 50.32 58.50 54.63 51.83 49.86 46.77
TFR-Net 37.70 38.82 38.14 39.45 37.24 38.11 51.18 55.57 52.12 65.91 67.58 63.41
AcMAE 47.47 46.73 43.82 42.20 44.05 43.75 55.74 59.66 57.25 69.28 73.75 71.15
IF-MMIN 39.63 38.10 31.95 26.89 31.98 29.33 57.62 53.44 55.19 66.76 64.36 68.53
ShaSpec 44.04 42.05 44.23 42.53 44.06 42.13 58.75 60.30 60.96 60.99 74.87 70.02
DrFuse 47.05 45.22 43.58 42.19 48.83 47.15 57.60 60.66 55.84 66.30 75.09 68.23
CorrKD 44.82 45.27 39.48 39.11 41.20 40.51 58.74 55.59 57.91 61.37 66.83 62.87
MAPs 46.12 45.47 44.86 43.19 45.48 44.30 58.62 60.16 58.89 67.02 75.62 72.52
MSPs 49.16 48.81 44.62 43.06 48.28 46.71 59.60 60.05 59.08 71.74 79.09 74.46
RAGPT 55.16 55.00 46.44 45.12 50.89 50.22 64.10 62.57 63.47 75.53 81.98 76.94
Improv. (%) 12.21↑ 12.68↑ 3.52↑ 4.47↑ 4.22↑ 6.51↑ 7.55↑ 3.15↑ 4.12↑ 5.28↑ 3.65↑ 3.33↑
p-val. 8.93e−6 1.73e−5 5.94e−5 9.68e−6 6.43e−6 2.92e−5 1.24e−6 3.44e−5 1.03e−5 1.63e−6 3.24e−6 8.50e−5
Table 2: Performance comparison on three datasets with a 70% missing rate across various missing-modality scenarios. The best
results are in bold font and the second underlined. Higher values of F1-M, F1-S, AUROC, and ACC indicate better performance.
Dataset      # Image   # Text    # Train   # Val   # Test
MM-IMDb      25,959    25,959    15,552    2,608   7,799
HateMemes    10,000    10,000    8,500     500     1,500
Food101      90,688    90,688    67,972    -       22,716

Table 1: Statistics of the three multimodal downstream datasets.
Experiments

Experimental Settings

A summary of the experimental settings is provided in this section, covering the datasets, baselines, evaluation metrics, setting of the missing pattern, and implementation details.

Datasets Following previous work (Lee et al. 2023; Jang, Wang, and Kim 2024), we evaluate RAGPT on three downstream tasks: (1) MM-IMDb (Arevalo et al. 2017), primarily used for movie genre classification involving both image and text modalities; (2) Food101 (Wang et al. 2015), which focuses on image classification incorporating both image and text; and (3) HateMemes (Kiela et al. 2020), aimed at identifying hate speech in memes using image and text modalities. Detailed statistics of the datasets are presented in Table 1. The dataset splits are consistent with the original papers.

Baselines We compare RAGPT with 9 competitive baselines, which fall into three categories: (1) cross-modal generation methods: SMIL (Ma et al. 2021), TFR-Net (Yuan et al. 2021), and AcMAE (Woo et al. 2023); (2) joint learning methods: IF-MMIN (Zuo et al. 2023), ShaSpec (Wang et al. 2023), DrFuse (Yao et al. 2024), and CorrKD (Li et al. 2024); and (3) prompt-based methods: MAPs (Lee et al. 2023) and MSPs (Jang, Wang, and Kim 2024).

Evaluation Metrics Following prior works (Lee et al. 2023; Jang, Wang, and Kim 2024), we adopt dataset-specific metrics for evaluation: F1-Micro (F1-M) and F1-Sample (F1-S) for the MM-IMDb dataset, AUROC for the HateMemes dataset, and classification accuracy (ACC) for the Food101 dataset.

Setting of Missing Pattern We define the missing rate η% as the proportion of modality-incomplete data relative to the entire dataset. For each dataset, there are three possible missing-modality cases: text missing, image missing, and both modalities missing. Text (or image) missing with a missing rate of η% indicates that η% of the instances contain only the remaining modality, while (1-η)% of the instances contain both modalities. Missing both modalities with a missing rate of η% indicates that η/2% of the instances consist solely of images, η/2% consist solely of text, and (1-η)% are complete, containing both modalities.
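For illustration, the "both modalities missing" pattern could be materialized as per-instance tags as in the hypothetical helper below (not part of the released code):

```python
import random

def assign_missing_pattern(num_instances: int, eta: float, seed: int = 0):
    """Mark eta/2 of instances as image-only, eta/2 as text-only, and the rest as complete."""
    rng = random.Random(seed)
    idx = list(range(num_instances))
    rng.shuffle(idx)
    n_missing = int(num_instances * eta)
    image_only = set(idx[: n_missing // 2])            # text is missing
    text_only = set(idx[n_missing // 2 : n_missing])   # image is missing
    return [
        "image_only" if i in image_only else "text_only" if i in text_only else "complete"
        for i in range(num_instances)
    ]

# With eta = 0.7: roughly 35% image-only, 35% text-only, and 30% complete instances.
```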
Implementation Details Following prior works (Lee et al. 2023; Jang, Wang, and Kim 2024), we utilize the pre-trained ViLT (Kim, Son, and Kim 2021) as our MMT backbone. The memory B for each dataset is constructed from the corresponding training set. The length l of the context-aware prompts is set to 2, the number of retrieved instances K is chosen from {1, 3, 5, 7, 9}, and the prompt insertion layer b is set to 2. We use the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 1×10^-3 and 20 epochs in total to optimize the parameters. All experiments are conducted on an NVIDIA RTX 3090 GPU.

Overall Performance

To verify the superiority of RAGPT, we compare it with 9 competitive baselines on three datasets under a missing rate of η% = 70% (Table 2). From these results, we have the following observations.

First, RAGPT consistently outperforms all strong baselines on the three datasets under various modal conditions and metrics. Moreover, we retrain RAGPT and the best-performing baseline five times to calculate the p-value. Notably, RAGPT achieves improvements of 12.21% and 12.68% in the F1-M and F1-S metrics, respectively, on the MM-IMDb dataset with missing text. These results validate our design of exploiting expressive knowledge from retrieved instances to enhance both missing- and complete-modality data. Meanwhile, the missing modality generator and the context-aware prompter distill expressive contextual information from the retrieved instances to approximate the missing content and generate dynamic prompts, respectively, thereby improving model robustness for incomplete modalities.
                              MM-IMDb          HateMemes   Food101
Module       Variant          F1-M     F1-S    AUROC       ACC
RAGPT        All              55.16    55.00   64.10       75.53
Retriever    CM Retriever     52.37    51.70   61.87       74.24
Retriever    w/o Retriever    49.25    48.36   60.29       73.60
Generator    Padding          51.14    51.63   61.30       72.78
Generator    w/o Filter       54.15    52.99   60.67       74.07
Prompter     Static Prompt    54.38    53.14   62.65       74.40
Prompter     w/o Label        53.41    53.45   62.01       74.34
Prompter     w/o Prompter     51.49    50.43   60.94       72.65

Table 3: Ablation study of RAGPT under 70% text missing.

Figure 3: Hyper-parameter analysis of K under three modality-missing scenarios: (a) effect of K on MM-IMDb; (b) effect of K on HateMemes.
Figure 4: Examples of the Top-2 retrieved instances for two modality-incomplete target instances. The first target instance is image-only while the second is text-only. Red text highlights similar content.

Second, cross-modal generation and joint learning methods demonstrate inferior performance, primarily due to the uncertainty introduced by random placeholders and the challenges of modality heterogeneity in reconstruction, which create significant performance bottlenecks. Moreover, prompt-based methods also exhibit limited effectiveness in missing-modality scenarios, as they rely on dummy imputation and static prompting strategies, further restricting their potential and resulting in performance stagnation.

Ablation Study

We conduct various ablation experiments to evaluate the impact of each component within RAGPT under a 70% text-missing case and summarize the results in Table 3.

Effect of Multi-Channel Retriever To analyze the impact of the retriever in RAGPT, we design two variants: (1) CM Retriever, which replaces the multi-channel retriever with a cross-modal retriever, and (2) w/o Retriever, which removes the retriever entirely. The results confirm the presence of the modality gap problem in cross-modal retrieval, which renders the retrieved instances irrelevant to the target images. Furthermore, this finding reinforces our design of multi-channel retrieval, which retrieves relevant instances by calculating within-modality similarities, thereby enhancing both missing- and complete-modality data.

Effect of Missing Modality Generator To evaluate the impact of the missing modality generator, we design two variants: (1) Padding, which uses random values to fill in the missing modality, and (2) w/o Filter, which removes the filter block entirely. We observe that dummy padding results in a decline in performance. This finding supports our assertion that dummy padding contributes to performance bottlenecks in prompt-based methods. Additionally, the removal of the filter layer leads to a significant performance drop, underscoring the importance of the filter layer in RAGPT for effectively mitigating noise.

Effect of Context-Aware Prompter To analyze the context-aware prompts, we design three variants: (1) Static Prompt, which replaces the context-aware prompts with static prompts; (2) w/o Label, which removes the label enhancement; and (3) w/o Prompter, which eliminates the text-, vision-, and label-level prompts entirely. All three variants result in poorer performance, validating that static prompts offer limited relevant cues for addressing incomplete multimodal learning.

Hyper-Parameter Analysis

Fig. 3(a) and 3(b) present the sensitivity analysis of RAGPT's hyper-parameter K on the MM-IMDb and HateMemes datasets. The results demonstrate that the performance of RAGPT is improved by retrieving relevant instances. However, incorporating a larger number of instances may result in a decline in performance due to the introduction of noise (i.e., irrelevant instances). Consequently, we adopt K = 3 under the image-missing case on the MM-IMDb dataset and K = 5 under the other scenarios.

Retrieval Quality Presentation

To further analyze the efficacy of our proposed multi-channel retriever, we randomly select two instances with incomplete modalities from the Food101 dataset. Fig. 4 visualizes the Top-2 similar retrieved instances, demonstrating a strong semantic correlation between the retrieved and target instances in both the image and text modalities. The high retrieval relevance indicates our multi-channel retriever's ability to effectively identify relevant modal information.

Model Generalizability

To investigate the model's generalizability, we design two experiments with varying missing rates in the training set