
Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning

Jian Lang1*, Zhangtao Cheng1*, Ting Zhong1,2, Fan Zhou1,2†
1 University of Electronic Science and Technology of China, Chengdu, Sichuan, China
2 Kash Institute of Electronics and Information Industry, Kashgar, Xinjiang, China
jian lang@[Link], [Link]@[Link], zhongting@[Link], [Link]@[Link]

arXiv:2501.01120v1 [[Link]] 2 Jan 2025

* These authors contributed equally.
† Corresponding author.
Copyright © 2025, Association for the Advancement of Artificial Intelligence ([Link]). All rights reserved.

Abstract

Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at [Link]

Figure 1: Prior prompt-based methods vs. our RAGPT in tackling incomplete multimodal learning.

Introduction

Multimodal learning has emerged as a critical paradigm in both research and industry, demonstrating broad application potential in areas such as healthcare assistance (Ghosh et al. 2024) and malicious content detection (Kiela et al. 2020). However, most successful methods typically assume that all modalities are complete during both the training and inference phases. In reality, factors such as malfunctioning sensors and privacy concerns often make it infeasible to collect complete modalities (Ma et al. 2021). As a result, the challenge of incomplete modalities significantly impacts the reliability, accuracy, and safety of multimodal models in practical applications (Woo et al. 2023; Cheng et al. 2024a).

To address this challenge, researchers have developed various robust multimodal methods that are broadly categorized into three groups: (1) Joint learning methods (Wang et al. 2023; Yao et al. 2024), (2) Cross-modal generation methods (Ma et al. 2021; Woo et al. 2023), and (3) Prompt-based methods (Lee et al. 2023; Jang, Wang, and Kim 2024). Joint learning methods heavily rely on the selection of similarity measures and require filling missing-modality inputs with masking values, resulting in the loss of critical information and the introduction of noise into the models (Wang et al. 2024). Cross-modal generation methods inevitably face modality heterogeneity issues and incur limited reconstruction quality. Recently, prompt-based methods have gained significant attention due to the rise of powerful pre-trained MultiModal Transformers (MMTs). These methods leverage prompt-tuning techniques to effectively transfer the capabilities of MMTs pre-trained on complete multimodal datasets to tasks involving missing modalities, achieving remarkable performance and making them a dominant trend in incomplete multimodal learning.

However, for incomplete modalities, prompt-based methods typically use the available modalities as the only cue to fulfill task-specific objectives through prompt learning (see Fig. 1). Despite their progress, these methods often struggle in severe missing-modality scenarios due to several unresolved issues inherent in their design:
(1) Remaining modalities typically provide restricted modal information, which fails to effectively address specific tasks when the missing modality contains crucial modal cues. (2) Modal-incomplete inputs are often filled with dummy values (e.g., empty strings/pixels for texts/images), which may introduce noise, leading to degraded performance (Ma et al. 2022). (3) The prompt tokens are shared across any inputs and therefore are instance-agnostic. Thus, this static prompt-tuning is not well-suited for real multimodal instances, as instances with different types of missing modalities belong to distinct distributions. Additionally, static prompts typically provide limited knowledge for both missing- and full-modality instances. Therefore, these observations motivate us to design a universal prompt-tuning strategy to enhance the pre-trained MMT's robustness for incomplete modalities.

To address these issues, we draw inspiration from the human ability to learn through observation, which involves mastering skills by observing relevant subjects rather than attempting to memorize every subject (Hodges et al. 2007). As shown in Fig. 1, we leverage this cognitive principle to address the challenge of missing modalities. Our core idea is to retrieve relevant multimodal contents and utilize them as prompts to enhance the robustness of pre-trained MMT in both missing- and full-modality scenarios. Intuitively, for instances with missing modalities, appending multimodal content from similar instances can provide contextual knowledge relevant to the missing modality and improve task-specific predictions.

To this end, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework to adaptively enhance the robustness of pre-trained MMT in both missing- and full-modality scenarios. Fundamentally, we reformulate incomplete modality learning in a principled retrieve-and-prompt manner and maintain a model-agnostic design that facilitates seamless integration with various prompt-based models. RAGPT includes three modules: multi-channel retriever, missing modality generator, and context-aware prompter. During retrieval, we propose a universal multi-channel retrieval strategy that disentangles multimodal representations into unimodal components, facilitating the retrieval of similar samples based on within-modality similarities for missing- and full-modality scenarios.

Next, the missing modality generator comprises a learnable filter to approximate the missing information. Beyond traditional reconstruction techniques, which suffer from modality gaps during the cross-modal generation, this generator realizes intra-modal reconstruction by leveraging information from the retrieved samples that belong to the same modality as the missing one to recover the missing content. Moreover, this design enriches the missing-modality representation, ensuring alignment with the complete-modality input format of pre-trained MMTs during the pre-training phase. Finally, the context-aware prompter identifies the semantic correlations between the target and retrieved instances, producing dynamic multimodal prompts tailored to different inputs. These prompts facilitate the adaptive refinement of modality features in both missing- and full-modality scenarios, thereby enhancing the robustness of the pre-trained models. We insert these modules into the pre-trained MMTs to achieve a more accurate representation for both missing- and full-modality data. Following are our main contributions:

• To our best knowledge, this is the first retrieval-augmented paradigm for incomplete modalities. We reveal that prior prompt-based methods suffer from issues related to dummy padding and static prompts, which drastically degrade performance in severe missing-modality cases.
• To address these issues, we propose RAGPT, pioneering a retrieval-augmented dynamic prompt-tuning framework that bridges target and relevant instances, recovers missing modality, and generates dynamic prompts to enhance the MMT's robustness in diverse missing-modality situations.
• We conduct extensive experiments on three real-world datasets to evaluate RAGPT in comparison with 9 competitive baselines and the results confirm RAGPT's effectiveness in addressing missing-modality issues.

Related Work

Missing-Modality in Multimodal Learning. Researchers have developed various robust methods for incomplete multimodal learning, which can be divided into three groups: (1) Joint learning methods (Zhao, Li, and Jin 2021; Wang et al. 2023; Yao et al. 2024) focus on distilling complex correlations from complete modalities to tackle missing-modality data. However, these methods require filling modality-incomplete inputs with masking values, which may cause unexpected behavior and introduce additional noise in the inference process. (2) Cross-modal generation methods (Lee et al. 2019; Yuan et al. 2021) primarily reconstruct the missing content by using remaining modalities. Researchers (Ma et al. 2021; Woo et al. 2023) directly employ VAE to generate the missing-modality based only on available modalities. Consequently, these methods inevitably face modality heterogeneity problems. (3) Prompt-based methods (Lee et al. 2023; Jang, Wang, and Kim 2024) represent a recent trend in this field, which introduces learnable prompts to help pre-trained MMTs address incomplete modalities.

However, prompt-based methods are constrained by the dummy imputation and static prompting strategy, resulting in performance bottlenecks. In contrast, our RAGPT captures contextual knowledge from retrieved instances to recover the missing content and generate dynamic prompts to enhance the MMT's robustness for missing modalities.

Prompt Learning. Prompt learning (Liu et al. 2023) utilizes a small number of learnable prompt parameters added to the input of pre-trained transformers, facilitating adjustments to the pre-trained models for alignment with downstream tasks. It has been successfully applied to various domains, such as visual identity (Khattak et al. 2023; Lee et al. 2023) and social network analysis (Zhou et al. 2021; Xu et al. 2021; Zhong et al. 2024; Cheng et al. 2024b, 2023). Following the success of prompt learning in NLP tasks (Li and Liang 2021), recent works have attempted to explore its application in multimodal learning (Zhou et al. 2022a). For instance, MaPLe (Khattak et al. 2023) introduces a soft prompt appended to the hidden representations of MMTs, resulting in significant improvements in few-shot image recognition. For incomplete multimodal learning,
Figure 2: Overall framework of RAGPT. (1) The multi-channel retriever identifies similar instances through a within-modality retrieval strategy; (2) The context-aware prompter captures contextual knowledge from relevant instances and generates dynamic prompts; (3) The knowledge-augmented prompt-tuning process first recovers the missing content by using a missing-modality generator and then performs dynamic prompt-tuning on the pre-trained MMT for final prediction.

MAPs (Lee et al. 2023) and MSPs (Jang, Wang, and Kim 2024) design various prompts to fine-tune pre-trained MMTs, enabling them to adapt effectively to missing-modality scenarios. However, these prompts are instance-agnostic and provide limited information for both missing- and full-modality data. In contrast, the context-aware prompter in RAGPT captures rich contextual knowledge from relevant instances, alleviating the drawbacks associated with instance-agnostic prompts.

Methodology

Problem Definition: In this paper, we consider a multimodal dataset incorporating two modalities. Formally, we define D = {D^f, D^m} to represent the multimodal dataset. Here, D^f = {(x_i^1, x_i^2, y_i)}_{i=1}^{N^f} represents the modality-complete subset, where y_i is the class label of the i-th instance. x_i^1 and x_i^2 denote two modalities (e.g., texts and images). N^f is the total number of instances in the subset D^f. Conversely, D^m = {(x_i^1, , y_i) ∨ ( , x_i^2, y_i)}_{i=1}^{N^m} is a modality-incomplete subset, where " " indicates a missing modality and N^m is the number of missing-modality data in D^m. The objective of the task is to enhance model robustness in cases where modalities are missing during both training and testing phases.
each component and their respective implementations. sion encoder (Radford et al. 2021)) Ψv (·) to obtain vision
query Evi ∈ Rdv . Finally, the retrieval process for search-
Multi-Channel Retriever ing top-K vision content is the same as defined in Eq. 1.
In this section, we design a unified multi-channel retriever After retrieval, the top-K instances CiR = {cri 1 , · · · , cri K }
to identify similar modal content for queries within their re- can be readily obtained. Each retrieved instance cri k contains
spective modalities by using within-modality similarities. the (image, text, label) triplet. The retrieved top-K instances
Memory Construction To store high-quality semantic in- provide auxiliary context, guiding the recovery of missing
formation as prior knowledge, we define the memory B, content in the target instance and improving task-specific
which encodes multimodal instances using a collection of predictions.
(image, text, label) triples.
Multi-Channel Retrieval To adapt diverse missing- and Context-Aware Prompter
full-modality scenarios, we develop a Multi-Channel Re- To explicitly capture expressive contextual information and
triever (MCR) that effectively retrieves relevant instances enhance robustness of pre-trained MMTs against missing-
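The within-modality scoring of Eq. 1 amounts to cosine similarity followed by a top-K selection. The sketch below is a simplified illustration under our own assumptions (memory embeddings precomputed with a frozen CLIP-style encoder and held as dense tensors), not the released implementation.

```python
import torch
import torch.nn.functional as F

class MultiChannelRetriever:
    """Top-K within-modality retrieval over a memory B of (image, text, label) triples.

    mem_text / mem_image: [M, d] tensors holding frozen-encoder representations of
    the M memory instances; triples[j] is the raw (image, text, label) entry j.
    """
    def __init__(self, mem_text, mem_image, triples):
        self.mem = {"text": F.normalize(mem_text, dim=-1),
                    "image": F.normalize(mem_image, dim=-1)}
        self.triples = triples

    def retrieve(self, query, channel, k=5):
        """query: [d] representation of an available modality; channel: 'text' or 'image'."""
        q = F.normalize(query, dim=-1)
        scores = self.mem[channel] @ q                  # cosine similarities, as in Eq. (1)
        _, idx = torch.topk(scores, k)                  # indices of the top-K neighbors
        return [self.triples[i] for i in idx.tolist()]  # C_i^R: K (image, text, label) triples

# For a text-missing instance, the image representation queries the image channel, and the
# texts attached to the retrieved neighbors supply context for the missing text:
# neighbors = retriever.retrieve(image_query, channel="image", k=5)
```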
Context-Aware Prompter

To explicitly capture expressive contextual information and enhance robustness of pre-trained MMTs against missing-modality issues, we design a Context-Aware Prompter (CAP) that constructs text-, vision-, and label-level dynamic prompts from the retrieved instances C_i^R. For text-level prompts, the CAP fuses the reference textual features in C_i^R and aligns textual embedding in T_i through a simple network. Specifically, the CAP first tokenizes and projects the texts x_i^1 and {x_i^{1,r_k}}_{k=1}^K into word embeddings W_i ∈ R^{n×d_t} and W_i^R = {W_i^{r_k}}_{k=1}^K ∈ R^{K×n×d_t}. Subsequently, the word embedding W_i is used as the query to interact with the retrieved text features {W_i^{r_k}}_{k=1}^K via a cross-attention block to facilitate comprehension of context, thereby generating the text-level comprehensive representation P̃_i^t ∈ R^{n×d_t}:

P̃_i^t = Att( f_t^Q(W_i), f_t^K(W_i^R), f_t^V(W_i^R) ),   (2)

Att(Q, K, V) = Softmax( QK^⊤ / √d ) V,   (3)

where f_t^Q(·), f_t^K(·), f_t^V(·) denote the query, key, and value projection functions, respectively. For vision-level prompts, the CAP uses the same process to interact the vision patch tokens V_i ∈ R^{m×d_v} with the retrieved patch tokens V_i^R ∈ R^{K×m×d_v} to obtain the vision-level representation P̃_i^v ∈ R^{m×d_v}. Then, the CAP employs an adaptive pooling strategy to obtain the final context-aware prompts P_i^t ∈ R^{l×d_t} and P_i^v ∈ R^{l×d_v}, where l is the prompt length. For label-level prompts, the CAP yields a label embedding matrix P̃_i^l ∈ R^{C×d} to encode C class labels, where d is an adjustable dimension. Given retrieved labels, the CAP performs a look-up operation on embedding matrix P̃_i^l and obtains each label embedding. Next, the CAP averages K label embeddings and generates label-level prompts P_i^l ∈ R^d.
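As an illustration of the text-level branch (Eqs. 2-3 followed by adaptive pooling to length l), the sketch below uses standard PyTorch modules; the single-head nn.MultiheadAttention stands in for the projection functions f_t^Q, f_t^K, f_t^V, and all sizes and names are our assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextPrompter(nn.Module):
    """Cross-attention between target word embeddings and retrieved text embeddings,
    pooled down to a fixed prompt length l (text-level branch of the CAP)."""
    def __init__(self, d_t, prompt_len=2):
        super().__init__()
        # single-head attention plays the role of f_t^Q, f_t^K, f_t^V in Eqs. (2)-(3)
        self.attn = nn.MultiheadAttention(d_t, num_heads=1, batch_first=True)
        self.prompt_len = prompt_len

    def forward(self, w_target, w_retrieved):
        # w_target:    [B, n, d_t]     word embeddings W_i of the target instance
        # w_retrieved: [B, K, n, d_t]  word embeddings of the K retrieved texts
        B, K, n, d = w_retrieved.shape
        ctx = w_retrieved.reshape(B, K * n, d)       # retrieved tokens serve as keys/values
        p_tilde, _ = self.attn(w_target, ctx, ctx)   # text-level representation in R^{n x d_t}
        # adaptive pooling over the token axis yields the prompt P_i^t in R^{l x d_t}
        pooled = F.adaptive_avg_pool1d(p_tilde.transpose(1, 2), self.prompt_len)
        return pooled.transpose(1, 2)

# Label-level prompts can be sketched analogously: look up the K retrieved labels in a
# C x d embedding table and average them, e.g. p_label = label_table(labels).mean(dim=1).
```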
Knowledge-Augmented Prompt-Tuning

In this process, we first utilize the retrieved modal information to approximate the missing content through a missing modality generator. Next, we perform dynamic prompt-tuning on the pre-trained MMT (e.g., ViLT (Kim, Son, and Kim 2021)) to enhance task-specific inference.

Missing Modality Generator Existing reconstruction methods (Ma et al. 2021) address missing-modality issues by recovering missing content through available modalities. However, these methods often overlook the modal heterogeneity issue and rely on complex generative structures. Based on these observations, we propose a Missing Modality Generator (MMG) that recovers the missing modality through an "intra-modal reconstruction". The MMG leverages retrieved content of the same modality as the missing one and incorporates a learnable filter layer to effectively approximate the missing modality in a simpler but effective manner. Specifically, given the text-missing instance T_i, the MMG employs a non-parametric strategy to average all text embeddings W_i^R = {W_i^{r_k}}_{k=1}^K from the retrieved instances C_i^R, thereby obtaining the textual representation W̄_i ∈ R^{n×d_t} to approximate the missing modality.

Considering potential noise in the comprehensive textual representation W̄_i, the MMG introduces a simple learnable filter block (i.e., MLP-based filter (Zhou et al. 2022b)) to efficiently refine the textual features W̄_i by removing noise. Specifically, the MMG employs the Fast Fourier Transform (FFT) along the textual dimension. This operation transforms the text context representation W̄_i into the frequency domain:

Z_i = F(W̄_i) ∈ C^{n×d_t},   (4)

where F(·) denotes the one-dimensional FFT, and Z_i is the spectrum of W̄_i. The MMG then modulates the spectrum by element-wise multiplication with a learnable filter W ∈ C^{n×d_t}:

Z̃_i = W ⊙ Z_i,   (5)

where ⊙ denotes the element-wise multiplication. Finally, the MMG applies the inverse FFT operation to convert the modulated spectrum Z̃_i back into the time domain:

W̃_i = F^{-1}(Z̃_i) ∈ R^{n×d_t},   (6)

where F^{-1}(·) is the inverse one-dimensional FFT, converting the complex tensor back to a real-valued tensor. To further stabilize training and enhance the embedding, the MMG incorporates a skip connection, layer normalization, and dropout:

Ŵ_i = LayerNorm(W̃_i + Dropout(W̃_i)).   (7)

Finally, the recovered representation Ŵ_i is used as the embedding for the missing modality and is subsequently fed into the pre-trained MMT. Additionally, the aforementioned process is applied to scenarios involving missing images to obtain the corresponding vision patch embedding V̂_i.
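Eqs. 4-7 map directly onto torch.fft. The following sketch is our reading of the generator (the complex filter W is stored as a real/imaginary parameter pair and exposed via torch.view_as_complex); it is an illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MissingModalityGenerator(nn.Module):
    """Intra-modal reconstruction: average the retrieved same-modality embeddings,
    then refine them with a learnable frequency-domain filter (Eqs. 4-7)."""
    def __init__(self, n_tokens, dim, dropout=0.1):
        super().__init__()
        # learnable complex filter W in C^{n x d}, stored as (real, imag) parts
        self.filter = nn.Parameter(torch.randn(n_tokens, dim, 2) * 0.02)
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, retrieved_emb):
        # retrieved_emb: [B, K, n, d] embeddings of the K retrieved same-modality items
        w_bar = retrieved_emb.mean(dim=1)               # non-parametric average of the K items
        z = torch.fft.fft(w_bar, dim=1)                 # Eq. (4): FFT along the token axis
        z_mod = z * torch.view_as_complex(self.filter)  # Eq. (5): spectrum modulation
        w_tilde = torch.fft.ifft(z_mod, dim=1).real     # Eq. (6): back to the time domain
        return self.norm(w_tilde + self.drop(w_tilde))  # Eq. (7): skip + dropout + LayerNorm
```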
Dynamic Prompt-Tuning Given a pre-trained MMT f_θ with N consecutive Multi-head Self-Attention (MSA) layers, we denote the input representation of the b-th MSA layer as h^b ∈ R^{L×d}, b = 1, 2, ..., N, with input length L and embedding dimension d. For full-modality data, we utilize the embedding layer of the pre-trained model f_θ(·) to obtain the corresponding text embedding E^t and image embedding E^v. In the case of missing-modality, we employ the generated word embedding Ŵ and vision patch embedding V̂ to fill the corresponding missing modality. h^1 is the concatenation of text embedding E^t and image embedding E^v. The context-aware prompts P^t, P^v, and P^l are then attached to the embedding features along the sequence-length dimension to form the extended features h_p^b = [P^t, P^v, P^l, h^b]. These extended features h_p^b are fed into the MMT starting from the b-th layer and continue to propagate through the remaining layers. The final output h_p^N represents the comprehensive modal representation after the N-th layer. Rather than adding prompts at each MSA layer, which can result in considerable overhead, we selectively insert the prompts into the specific b-th layer.

Label-Augmented Prediction To further leverage the contextual information in label-level prompts, we design a label-augmented classifier by computing the similarity between the output representation of the MMT and the label matrix P̃^l. Specifically, for the final prediction, we feed the output representation h_p^N into the pooler layer to obtain the representation Z ∈ R^{d×1}. Next, we calculate the probabilities ŷ ∈ R^{C×1} for C classes: ŷ = softmax(P̃^l ∗ Z). During training, we freeze all parameters in the MMT and optimize the model using cross-entropy loss.
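To summarize the tuning procedure end to end, the sketch below prepends the prompts at layer b of a generic frozen transformer and scores the pooled output against the label matrix. Attribute names such as mmt.layers and the pooler callable are assumptions for illustration; ViLT's actual interface differs.

```python
import torch

def forward_with_prompts(mmt, h1, prompts, insert_layer_b):
    """Run a frozen MMT whose MSA blocks are exposed as mmt.layers[0..N-1],
    attaching the context-aware prompts [P^t, P^v, P^l] only at layer b."""
    h = h1                                      # h^1: concatenated text/image embeddings
    for idx, layer in enumerate(mmt.layers, start=1):
        if idx == insert_layer_b:
            h = torch.cat([prompts, h], dim=1)  # h_p^b = [P^t, P^v, P^l, h^b]
        h = layer(h)
    return h                                    # h_p^N

def label_augmented_logits(h_pN, pooler, label_matrix):
    """Label-augmented prediction: similarity between the pooled output Z and the label matrix."""
    z = pooler(h_pN)                 # Z: [B, d]
    return z @ label_matrix.t()      # [B, C] logits; softmax gives y_hat over C classes

# Training loop sketch: freeze every MMT parameter and optimize only the prompter, the
# generator filter, and the label matrix with cross-entropy on these logits.
```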
                         MM-IMDb                                    HateMemes                      Food101
Missing Type    Text            Image           Both               Text      Image     Both        Text      Image     Both
Method          F1-M    F1-S    F1-M    F1-S    F1-M    F1-S       AUROC     AUROC     AUROC       ACC       ACC       ACC
SMIL            38.32   38.55   27.57   35.27   35.12   31.87      50.32     58.50     54.63       51.83     49.86     46.77
TFR-Net         37.70   38.82   38.14   39.45   37.24   38.11      51.18     55.57     52.12       65.91     67.58     63.41
AcMAE           47.47   46.73   43.82   42.20   44.05   43.75      55.74     59.66     57.25       69.28     73.75     71.15
IF-MMIN         39.63   38.10   31.95   26.89   31.98   29.33      57.62     53.44     55.19       66.76     64.36     68.53
ShaSpec         44.04   42.05   44.23   42.53   44.06   42.13      58.75     60.30     60.96       60.99     74.87     70.02
DrFuse          47.05   45.22   43.58   42.19   48.83   47.15      57.60     60.66     55.84       66.30     75.09     68.23
CorrKD          44.82   45.27   39.48   39.11   41.20   40.51      58.74     55.59     57.91       61.37     66.83     62.87
MAPs            46.12   45.47   44.86   43.19   45.48   44.30      58.62     60.16     58.89       67.02     75.62     72.52
MSPs            49.16   48.81   44.62   43.06   48.28   46.71      59.60     60.05     59.08       71.74     79.09     74.46
RAGPT           55.16   55.00   46.44   45.12   50.89   50.22      64.10     62.57     63.47       75.53     81.98     76.94
Improv. (%)     12.21↑  12.68↑  3.52↑   4.47↑   4.22↑   6.51↑      7.55↑     3.15↑     4.12↑       5.28↑     3.65↑     3.33↑
p-val.          8.93e-6 1.73e-5 5.94e-5 9.68e-6 6.43e-6 2.92e-5    1.24e-6   3.44e-5   1.03e-5     1.63e-6   3.24e-6   8.50e-5

Table 2: Performance comparison on three datasets with a 70% missing rate across various missing-modality scenarios. The best results are in bold font and the second underlined. Higher values of F1-M, F1-S, AUROC, and ACC indicate better performance.

Experiments

Experimental Settings
A summary of the experimental settings is provided in this section, covering datasets, baselines, evaluation metrics, the setting of the missing pattern, and implementation details.

Datasets Following previous work (Lee et al. 2023; Jang, Wang, and Kim 2024), we evaluate our RAGPT on three downstream tasks. (1) MM-IMDb (Arevalo et al. 2017), primarily used for movie genre classification involving both image and text modalities. (2) Food101 (Wang et al. 2015), which focuses on image classification that incorporates both image and text. (3) HateMemes (Kiela et al. 2020), aimed to identify hate speech in memes using image and text modalities. Detailed statistics of the datasets are presented in Table 1. The dataset splits are consistent with the original paper.

Dataset      # Image   # Text   # Train   # Val   # Test
MM-IMDb      25,959    25,959   15,552    2,608   7,799
HateMemes    10,000    10,000   8,500     500     1,500
Food101      90,688    90,688   67,972    -       22,716

Table 1: Statistics of three multimodal downstream datasets.

Baselines We compare our RAGPT with 9 competitive baselines, which are classified into three categories: (1) Cross-modal generation methods: SMIL (Ma et al. 2021), TFR-Net (Yuan et al. 2021), and AcMAE (Woo et al. 2023). (2) Joint learning methods: IF-MMIN (Zuo et al. 2023), ShaSpec (Wang et al. 2023), DrFuse (Yao et al. 2024), and CorrKD (Li et al. 2024). (3) Prompt-based methods: MAPs (Lee et al. 2023) and MSPs (Jang, Wang, and Kim 2024).

Evaluation Metrics Following prior works (Lee et al. 2023; Jang, Wang, and Kim 2024), we adopt appropriate dataset-specific metrics for evaluation: F1-Micro (F1-M) and F1-Sample (F1-S) for the MM-IMDb dataset, AUROC for the HateMemes dataset, and classification accuracy (ACC) for the Food101 dataset.

Setting of Missing Pattern We define the missing rate η% as the proportion of modality-incomplete data relative to the entire dataset. For each dataset, there are three possible cases of missing-modality: text missing, image missing, and both modalities missing. Text/image missing with a missing rate of η% indicates that there are η% instances consisting of texts/images and (1−η)% instances that contain both modalities. Missing both modalities with a missing rate of η% indicates that there are (η/2)% instances consisting solely of images, (η/2)% instances consisting solely of text, and (1−η)% instances that are complete, containing both modalities.
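The missing-rate protocol can be reproduced with a small helper. The sketch below is our own utility (reusing the optional text/image fields illustrated earlier), not part of the released code; it drops modalities in place until the target rate η is reached.

```python
import random

def apply_missing_pattern(instances, eta, case="both", seed=0):
    """Simulate a missing rate eta (e.g., 0.7). 'text'/'image' removes that modality
    from eta of the instances; 'both' removes text from eta/2 and image from eta/2."""
    rng = random.Random(seed)
    order = list(range(len(instances)))
    rng.shuffle(order)
    n_missing = int(eta * len(instances))
    for j, i in enumerate(order[:n_missing]):
        if case == "text" or (case == "both" and j < n_missing // 2):
            instances[i].text = None    # becomes an image-only instance
        else:
            instances[i].image = None   # becomes a text-only instance
    return instances
```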
Implementation Details Following prior works (Lee et al. 2023; Jang, Wang, and Kim 2024), we utilize the pre-trained ViLT (Kim, Son, and Kim 2021) as our MMT backbone. The memory B for each dataset is constructed with the corresponding training set. The length l of the context-aware prompts is set to 2, the number of retrieved instances K is chosen from {1, 3, 5, 7, 9}, and the prompt insertion layer b is set to 2. We utilize the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 1×10^-3 and a total of 20 epochs for optimizing the parameters. All experiments are conducted with an NVIDIA RTX 3090 GPU.

Overall Performance
To verify the superiority of RAGPT, we compare it with 9 competitive baselines on three datasets under a missing rate of η% = 70%. From these results, we have the following observations:

First, our RAGPT consistently outperforms all strong baselines on three datasets under various modal conditions and metrics. Moreover, we retrain RAGPT and the best-performing baseline five times to calculate the p-value. Notably, RAGPT achieves improvements of 12.21% and 12.68% in the F1-M and F1-S metrics, respectively, on the MM-IMDb dataset with missing text. These results validate our design of exploiting expressive knowledge from
retrieved instances to enhance both missing and complete modality data. Meanwhile, the missing modality generator and context-aware prompter distill expressive contextual information from retrieved instances to approximate missing content and generate dynamic prompts, respectively, thereby improving model robustness for incomplete modalities.

Second, cross-modal generation and joint learning methods demonstrate inferior performance, primarily due to the uncertainty introduced by random placeholders and the challenges of modality heterogeneity in reconstruction, which create significant performance bottlenecks. Moreover, prompt-based methods also exhibit limited effectiveness in missing-modality scenarios, as they rely on dummy imputations and static prompting strategies, further restricting their potential and resulting in performance stagnation.

Ablation Study
We conduct various ablation experiments to evaluate the impact of each component within RAGPT under a 70% text missing case and summarize the results in Table 3.

Module      Variant          MM-IMDb          HateMemes   Food101
                             F1-M     F1-S    AUROC       ACC
RAGPT       All              55.16    55.00   64.10       75.53
Retriever   CM Retriever     52.37    51.70   61.87       74.24
            w/o Retriever    49.25    48.36   60.29       73.60
Generator   Padding          51.14    51.63   61.30       72.78
            w/o Filter       54.15    52.99   60.67       74.07
Prompter    Static Prompt    54.38    53.14   62.65       74.40
            w/o Label        53.41    53.45   62.01       74.34
            w/o Prompter     51.49    50.43   60.94       72.65

Table 3: Ablation study of RAGPT under 70% text missing.

Effect of Multi-Channel Retriever To analyze the impact of the retriever in RAGPT, we designed two variants: (1) CM Retriever: replacing the multi-channel retriever with a cross-modal retriever, and (2) w/o Retriever: removing the retriever entirely. These results confirm the presence of the modal gap problem in cross-modal retrieval, which renders the retrieved instances irrelevant to the target images. Furthermore, this finding reinforces our design of the multi-channel retrieval that retrieves relevant instances by calculating within-modality similarities, thereby enhancing both missing and complete modality data.

Effect of Missing Modality Generator To evaluate the impact of the missing modality generator, we designed variant models: (1) Padding: using random values to fill in the missing modality, and (2) w/o Filter: removing the filter block entirely. We observe that dummy padding results in a decline in performance. This finding supports our assertion that dummy padding contributes to performance bottlenecks in prompt-based methods. Additionally, the removal of the filter layer leads to a significant performance drop, underscoring the importance of the filter layer in RAGPT for effectively mitigating noise.

Effect of Context-Aware Prompter To analyze the context-aware prompts, we design variants: (1) Static Prompt: replacing context-aware prompts with static prompts; (2) w/o Label: removing label enhancement; and (3) w/o Prompter: eliminating text-, vision-, and label-prompts entirely. The three variants result in poorer performance, validating that static prompts offer limited relevant cues for addressing incomplete multimodal learning.

Hyper-Parameter Analysis
Fig. 3(a) and 3(b) present the sensitivity analysis of RAGPT's hyper-parameter K on the MM-IMDb and HateMemes datasets. The results demonstrate that the performance of RAGPT is improved by retrieving relevant instances. However, incorporating a larger number of instances may result in a decline in performance due to the introduction of noise (i.e., irrelevant instances). Consequently, we adopt K = 3 under the image missing case on the MM-IMDb dataset and K = 5 under other scenarios.

Figure 3: Hyper-parameter analysis of K under three modality-missing scenarios. (a) Effect of K on MM-IMDb; (b) Effect of K on HateMemes.

Retrieval Quality Presentation
To further analyze the efficacy of our proposed multi-channel retriever, we randomly select two instances with incomplete modalities from the Food101 dataset. Fig. 4 visualizes the Top-2 similar retrieved instances, demonstrating a strong semantic correlation between the retrieved and target instances in both image and text modalities. The high quality of retrieval relevance indicates our multi-channel retriever's ability to effectively identify relevant modal information.

Figure 4: Examples of Top-2 retrieved instances for two modality-incomplete target instances. The first target instance is image-only while the second one is text-only. Red texts highlight similar content.

Model Generalizability
To investigate the model's generalizability, we design two experiments with varying missing rates in the training set
and evaluate their performance on a test set with a 90% missing rate. Compared with four strong baselines (ShaSpec, DrFuse, MAPs, and MSPs), Fig. 5(a) shows the results for the missing-text case, while Fig. 5(b) presents the results for scenarios of missing both modalities. We observe that our RAGPT outperforms all baselines across all missing rates, demonstrating superior performance for missing-modality. These results highlight RAGPT's generalizability, which can be attributed to the ability of exploring crucial cues from relevant contexts.

Figure 5: Generalization analysis on the HateMemes dataset across various missing rates in terms of AUROC. (a) Text Missing; (b) Both Missing.

Robustness to Different Missing Rates
We conduct an experiment to analyze the model's robustness to varying missing rates. Fig. 6 illustrates the results comparing RAGPT with four strong baselines (ShaSpec, DrFuse, MAPs, and MSPs) on the HateMemes dataset. We observe that the performance of all baselines deteriorates markedly as the missing rate increases. In contrast, RAGPT demonstrates only a slight performance decrease as the missing rate increases. This result highlights the valuable components of RAGPT for effectively mitigating the impact of missing data. Specifically, RAGPT leverages expressive knowledge from retrieved instances to approximate missing modalities through the missing modality generator. Additionally, RAGPT generates context-aware prompts that enhance the performance of the pre-trained MMTs.

Figure 6: Robustness analysis on the HateMemes dataset across various missing rates in terms of AUROC. (a) Text Missing; (b) Both Missing.

Model Scalability
To further validate RAGPT's scalability, we integrate key modules (multi-channel retriever, missing modality generator, and context-aware prompter) into two prompt-based baselines (MAPs and MSPs). In Fig. 7, we observe a significantly slower rate of performance decline in the two baselines as the missing rate increases. This finding indicates that our modules significantly enhance the robustness of these baselines for incomplete modalities. It also validates the effectiveness of our design in extracting informative multimodal cues from relevant instances and prompting pre-trained MMTs.

Figure 7: Effect of integrating key modules in RAGPT for baselines on the HateMemes dataset in terms of AUROC. (a) Text Missing on MAPs; (b) Text Missing on MSPs.

Model Prediction Visualization
Fig. 8 illustrates the t-SNE (Van der Maaten and Hinton 2008) visualization of the embedding distributions for three genres (i.e., Sport, Film-Noir, and Western) in the MM-IMDb test set under a 90% text missing rate. We observe that while the baseline MSPs learns distinguishable features, the learned features remain intertwined. In contrast, the representations of the three genres learned by our RAGPT are more discriminative, exhibiting larger segregated areas among instances with different labels.

Figure 8: t-SNE visualization of RAGPT and MSPs on the MM-IMDb dataset under a 90% text missing rate. (a) RAGPT; (b) MSPs.

Conclusion
In this work, we proposed RAGPT, a novel retrieval-augmented dynamic prompt-tuning framework to address the missing-modality issue. This model-agnostic framework includes three key components: (1) the multi-channel retriever, (2) the missing modality generator, and (3) the context-aware prompter, to effectively inject valuable contextual knowledge into the pre-trained MMT, thereby enhancing its robustness in the missing-modality scenario. Extensive experiments conducted on three real-world datasets demonstrate the superiority of RAGPT in tackling incomplete modality learning.
Acknowledgments
This work was supported by National Natural Science Foundation of China (Grant No.62176043, No.62072077, and No.U22A2097), and Kashgar Science and Technology Bureau (Grant No.KS2023025).

References
Arevalo, J.; Solorio, T.; Montes-y Gómez, M.; and González, F. A. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992.
Cheng, Z.; Ye, W.; Liu, L.; Tai, W.; and Zhou, F. 2023. Enhancing Information Diffusion Prediction with Self-Supervised Disentangled User and Cascade Representations. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), 3808–3812.
Cheng, Z.; Zhang, J.; Xu, X.; Trajcevski, G.; Zhong, T.; and Zhou, F. 2024a. Retrieval-augmented hypergraph for multimodal social media popularity prediction. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 445–455.
Cheng, Z.; Zhou, F.; Xu, X.; Zhang, K.; Trajcevski, G.; Zhong, T.; and Philip, S. Y. 2024b. Information Cascade Popularity Prediction via Probabilistic Diffusion. IEEE Transactions on Knowledge and Data Engineering (TKDE).
Ghosh, A.; Acharya, A.; Jain, R.; Saha, S.; Chadha, A.; and Sinha, S. 2024. Clipsyntel: clip and llm synergy for multimodal question summarization in healthcare. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 38, 22031–22039.
Hodges, N. J.; Williams, A. M.; Hayes, S. J.; and Breslin, G. 2007. What is modelled during observational learning? Journal of Sports Sciences, 25(5): 531–545.
Jang, J.; Wang, Y.; and Kim, C. 2024. Towards Robust Multimodal Prompting with Missing Modalities. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8070–8074.
Khattak, M. U.; Rasheed, H.; Maaz, M.; Khan, S.; and Khan, F. S. 2023. Maple: Multi-modal prompt learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19113–19122.
Kiela, D.; Firooz, H.; Mohan, A.; Goswami, V.; Singh, A.; Ringshia, P.; and Testuggine, D. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems (NeurIPS), 33: 2611–2624.
Kim, W.; Son, B.; and Kim, I. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning (ICML), 5583–5594. PMLR.
Lee, H.-C.; Lin, C.-Y.; Hsu, P.-C.; and Hsu, W. H. 2019. Audio feature generation for missing modality problem in video action recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3956–3960.
Lee, Y.-L.; Tsai, Y.-H.; Chiu, W.-C.; and Lee, C.-Y. 2023. Multimodal prompting with missing modalities for visual recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14943–14952.
Li, M.; Yang, D.; Zhao, X.; Wang, S.; Wang, Y.; Yang, K.; Sun, M.; Kou, D.; Qian, Z.; and Zhang, L. 2024. Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12458–12468.
Li, X. L.; and Liang, P. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 4582–4597.
Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; and Neubig, G. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9): 1–35.
Loshchilov, I.; and Hutter, F. 2017. Decoupled Weight Decay Regularization. arXiv e-prints, arXiv–1711.
Ma, M.; Ren, J.; Zhao, L.; Testuggine, D.; and Peng, X. 2022. Are multimodal transformers robust to missing modality? In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18177–18186.
Ma, M.; Ren, J.; Zhao, L.; Tulyakov, S.; Wu, C.; and Peng, X. 2021. Smil: Multimodal learning with severely missing modality. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 35, 2302–2310.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 8748–8763. PMLR.
Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
Wang, H.; Chen, Y.; Ma, C.; Avery, J.; Hull, L.; and Carneiro, G. 2023. Multi-modal learning with missing modality via shared-specific feature modelling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15878–15887.
Wang, H.; Luo, S.; Hu, G.; and Zhang, J. 2024. Gradient-Guided Modality Decoupling for Missing-Modality Robustness. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 38, 15483–15491.
Wang, X.; Kumar, D.; Thome, N.; Cord, M.; and Precioso, F. 2015. Recipe recognition with large multimodal food dataset. In IEEE International Conference on Multimedia & Expo Workshops (ICME), 1–6.
Woo, S.; Lee, S.; Park, Y.; Nugroho, M. A.; and Kim, C. 2023. Towards good practices for missing modality robust action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 37, 2776–2784.
Xu, X.; Zhou, F.; Zhang, K.; Liu, S.; and Trajcevski, G. 2021. Casflow: Exploring hierarchical structures and propagation uncertainty for cascade prediction. IEEE Transactions on Knowledge and Data Engineering (TKDE), 35(4): 3484–3499.
Yao, W.; Yin, K.; Cheung, W. K.; Liu, J.; and Qin, J. 2024. DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 38, 16416–16424.
Yuan, Z.; Li, W.; Xu, H.; and Yu, W. 2021. Transformer-based feature reconstruction network for robust multimodal sentiment analysis. In Proceedings of the ACM International Conference on Multimedia (MM), 4400–4407.
Zhao, J.; Li, R.; and Jin, Q. 2021. Missing modality imagination network for emotion recognition with uncertain missing modalities. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2608–2618.
Zhong, T.; Lang, J.; Zhang, Y.; Cheng, Z.; Zhang, K.; and Zhou, F. 2024. Predicting Micro-video Popularity via Multimodal Retrieval Augmentation. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), 9–16.
Zhou, F.; Xu, X.; Trajcevski, G.; and Zhang, K. 2021. A survey of information cascade analysis: Models, predictions, and recent advances. ACM Computing Surveys (CSUR), 54(2): 1–36.
Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022a. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337–2348.
Zhou, K.; Yu, H.; Zhao, W. X.; and Wen, J.-R. 2022b. Filter-enhanced MLP is all you need for sequential recommendation. In Proceedings of the ACM Web Conference (WWW), 2388–2399.
Zuo, H.; Liu, R.; Zhao, J.; Gao, G.; and Li, H. 2023. Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5.
