
MaPLe: Multi-modal Prompt Learning

Muhammad Uzair Khattak1 Hanoona Rasheed1 Muhammad Maaz1


Salman Khan1,2 Fahad Shahbaz Khan1,3
1 Mohamed bin Zayed University of AI   2 Australian National University   3 Linköping University
arXiv:2210.03117v3 [cs.CV] 1 Apr 2023

Abstract

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships and allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.

1. Introduction

Foundational vision-language (V-L) models such as CLIP (Contrastive Language-Image Pretraining) [32] have shown excellent generalization ability to downstream tasks. Such models are trained to align the language and vision modalities on web-scale data, e.g., 400 million text-image pairs in CLIP. These models can reason about open-vocabulary visual concepts, thanks to the rich supervision provided by natural language. During inference, hand-engineered text prompts are used, e.g., ‘a photo of a <category>’, as a query for the text encoder. The output text embeddings are matched with the visual embeddings from an image encoder to predict the output class. Designing high-quality contextual prompts has been proven to enhance the performance of CLIP and other V-L models [17, 42].

Despite the effectiveness of CLIP towards generalization to new concepts, its massive scale and the scarcity of training data (e.g., in the few-shot setting) make it infeasible to fine-tune the full model for downstream tasks. Such fine-tuning can also forget the useful knowledge acquired in the large-scale pretraining phase and can pose a risk of overfitting to the downstream task. To address the above challenges, existing works propose language prompt learning to avoid manually adjusting the prompt templates and to provide a mechanism for adapting the model while keeping the original weights frozen [14, 25, 29, 48, 49]. Inspired by Natural Language Processing (NLP), these approaches only explore prompt learning for the text encoder in CLIP (Fig. 1:a), while adaptation choices for the equally important image encoder of CLIP remain an unexplored topic in the literature.

Our motivation derives from the multi-modal nature of CLIP, where a text and an image encoder co-exist and both contribute towards properly aligning the V-L modalities. We argue that any prompting technique should adapt the model completely; therefore, learning prompts only for the text encoder in CLIP is not sufficient to model the adaptations needed for the image encoder. To this end, we set out to achieve completeness in the prompting approach and propose Multi-modal Prompt Learning (MaPLe) to adequately fine-tune the text and image encoder representations such that their optimal alignment can be achieved on the downstream tasks (Fig. 1:b). Our extensive experiments on three key representative settings, including base-to-novel generalization, cross-dataset evaluation, and domain generalization, demonstrate the strength of MaPLe. On base-to-novel generalization, our proposed MaPLe outperforms existing prompt learning approaches across 11 diverse image recognition datasets (Fig. 1:c) and achieves an absolute average gain of 3.45% on novel classes and 2.72% on harmonic mean over the state-of-the-art method Co-CoOp [48].
Figure 1. Comparison of MaPLe with standard prompt learning methods. (a) Existing prompt tuning methods (uni-modal): prompts are learned only in a single branch of CLIP (language or vision) to fine-tune its representations. (b) Multi-modal Prompt Learning (MaPLe) introduces branch-aware hierarchical prompts that adapt both language and vision branches simultaneously for improved generalization. (c) Performance comparison on base-to-novel generalization: MaPLe surpasses state-of-the-art methods on 11 diverse image recognition datasets for the novel class generalization task.

Further, MaPLe demonstrates favorable generalization ability and robustness in cross-dataset transfer and domain generalization settings, leading to consistent improvements compared to existing approaches. Owing to its streamlined architectural design, MaPLe exhibits improved efficiency during both training and inference without much overhead, as compared to Co-CoOp, which lacks efficiency due to its image-instance-conditioned design. In summary, the main contributions of this work include:

• We propose multi-modal prompt learning in CLIP to favourably align its vision-language representations. To the best of our knowledge, this is the first multi-modal prompting approach for fine-tuning CLIP.

• To link prompts learned in the text and image encoders, we propose a coupling function to explicitly condition vision prompts on their language counterparts. It acts as a bridge between the two modalities and allows mutual propagation of gradients to promote synergy.

• Our multi-modal prompts are learned across multiple transformer blocks in both vision and language branches to progressively learn the synergistic behaviour of both modalities. This deep prompting strategy allows modeling the contextual relationships independently, thus providing more flexibility to align the vision-language representations.

2. Related Work

Vision Language Models: The combined use of language supervision with natural images has been found to be of great interest in the computer vision community. In contrast to models learned with only image supervision, these vision-language (V-L) models encode rich multimodal representations. Recently, V-L models like CLIP [32], ALIGN [15], LiT [45], FILIP [41] and Florence [43] have demonstrated exceptional performance on a wide spectrum of tasks including few-shot and zero-shot visual recognition. These models learn joint image-language representations in a self-supervised manner using abundantly available data from the web. For example, CLIP and ALIGN respectively use ∼400M and ∼1B image-text pairs to train a multi-modal network. Although these pre-trained V-L models learn generalized representations, efficiently adapting them to downstream tasks is still a challenging problem. Many works have demonstrated better performance on downstream tasks by using tailored methods to adapt V-L models for few-shot image recognition [9, 19, 46], object detection [8, 10, 27, 34, 44, 50], and segmentation [5, 22, 26, 33]. In this work, we propose a novel multi-modal prompt learning technique to effectively adapt CLIP for few-shot and zero-shot visual recognition tasks.

Prompt Learning: Instructions in the form of a sentence, known as a text prompt, are usually given to the language branch of a V-L model, allowing it to better understand the task. Prompts can be handcrafted for a downstream task or learned automatically during the fine-tuning stage. The latter is referred to as ‘Prompt Learning’, which was first used in NLP [21, 23, 24], followed by its adaptation in V-L [48, 49, 51] and vision-only [16, 38, 39, 47] models. Similar to [16], our design also uses deep ‘vision’ prompting. However, ours is the first multi-modal prompting design, while [16] is uni-modal.
Figure 2. Overview of our proposed MaPLe (Multi-modal Prompt Learning) framework for prompt learning in V-L models. MaPLe tunes both the vision and language branches, where only the context prompts are learned while the rest of the model is frozen. MaPLe conditions the vision prompts on the language prompts via a V-L coupling function F to induce mutual synergy between the two modalities. Our framework uses deep contextual prompting where separate context prompts are learned across multiple transformer blocks.
Prompt Learning in Vision-Language Models: Full fine-tuning and linear probing [9] are two typical approaches to adapt a V-L model (i.e., CLIP) to downstream tasks. Complete fine-tuning degrades the previously learned joint V-L representation, while linear probing limits the zero-shot capability of CLIP. To this end, inspired by prompt learning in NLP, many works have proposed to adapt V-L models by learning prompt tokens in end-to-end training. CoOp [49] fine-tunes CLIP for few-shot transfer by optimizing a continuous set of prompt vectors at its language branch. Co-CoOp [48] highlights the inferior performance of CoOp on novel classes and solves the generalization issue by explicitly conditioning prompts on image instances. [25] proposes to optimize multiple sets of prompts by learning a distribution over prompts. [18] adapts CLIP by learning prompts for video understanding tasks. [1] performs visual prompt tuning on CLIP by prompting the vision branch. We note that the existing methods follow independent uni-modal solutions and learn prompts either in the language or in the vision branch of CLIP, thus adapting CLIP partially. In this paper, we explore an important question: given the multimodal nature of CLIP, is complete prompting (i.e., in both language and vision branches) better suited to adapt CLIP? Our work is the first to answer this question by investigating the effectiveness of multi-modal prompt learning in order to improve alignment between vision and language representations.

3. Method

Our approach concerns fine-tuning a pre-trained multi-modal CLIP for better generalization to downstream tasks through context optimization via prompting. Fig. 2 shows the overall architecture of our proposed MaPLe (Multi-modal Prompt Learning) framework. Unlike previous approaches [48, 49], which learn context prompts only at the language branch, MaPLe proposes a joint prompting approach where the context prompts are learned in both the vision and language branches. Specifically, we append learnable context tokens in the language branch and explicitly condition the vision prompts on the language prompts via a coupling function to establish interaction between them. To learn hierarchical contextual representations, we introduce deep prompting in both branches through separate learnable context prompts across different transformer blocks. During fine-tuning, only the context prompts along with their coupling function are learned, while the rest of the model is frozen. Below, we first outline the pre-trained CLIP architecture and then present our proposed fine-tuning approach.

3.1. Revisiting CLIP

We build our approach on a pre-trained vision-language (V-L) model, CLIP, which consists of a text and a vision encoder. Consistent with existing prompting methods [48, 49], we use a vision transformer (ViT) [6] based CLIP model. CLIP encodes an image $I \in \mathbb{R}^{H \times W \times 3}$ and a corresponding text description as explained below.
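As a concrete reference for the encoding and zero-shot classification steps formalized in the rest of this subsection, the following is a minimal inference sketch using the public CLIP package (its standard clip.load, clip.tokenize, encode_image and encode_text calls); the class names and image path are illustrative placeholders, not part of the paper.

```python
# Minimal zero-shot CLIP inference sketch (illustrative, not the MaPLe code).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["cat", "dog", "car"]                       # hypothetical label set
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # illustrative path

with torch.no_grad():
    x = model.encode_image(image)                         # image embedding
    z = model.encode_text(text)                           # one text embedding per class
    x = x / x.norm(dim=-1, keepdim=True)                  # cosine similarity via
    z = z / z.norm(dim=-1, keepdim=True)                  # L2 normalization
    logits = model.logit_scale.exp() * x @ z.t()          # logit_scale plays the role of 1/τ
    probs = logits.softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```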
Figure 3. t-SNE plots of image embeddings of the uni-modal prompting method Co-CoOp and of MaPLe on 3 diverse image recognition datasets: (a) DTD (texture classification), Co-CoOp base 77.0 / novel 56.0 vs. MaPLe base 80.4 / novel 59.2; (b) EuroSAT (satellite imagery recognition), Co-CoOp 87.5 / 60.1 vs. MaPLe 94.1 / 73.2; (c) UCF101 (action recognition), Co-CoOp 82.3 / 73.5 vs. MaPLe 83.0 / 78.7. MaPLe shows better separability in both base and novel classes.

Encoding Image: The image encoder V, with K transformer layers $\{\mathcal{V}_i\}_{i=1}^{K}$, splits the image I into M fixed-size patches which are projected into patch embeddings $E_0 \in \mathbb{R}^{M \times d_v}$. Patch embeddings $E_i$ are input to the $(i+1)$-th transformer block ($\mathcal{V}_{i+1}$) along with a learnable class (CLS) token $c_i$, and sequentially processed through the K transformer blocks:

$$[c_i, E_i] = \mathcal{V}_i([c_{i-1}, E_{i-1}]), \qquad i = 1, 2, \ldots, K.$$

To obtain the final image representation x, the class token $c_K$ of the last transformer layer ($\mathcal{V}_K$) is projected to a common V-L latent embedding space via ImageProj:

$$x = \text{ImageProj}(c_K), \qquad x \in \mathbb{R}^{d_{vl}}.$$

Encoding Text: The CLIP text encoder generates feature representations for a text description by tokenizing the words and projecting them to word embeddings $W_0 = [w_0^1, w_0^2, \ldots, w_0^N] \in \mathbb{R}^{N \times d_l}$. At each stage, $W_i$ is input to the $(i+1)$-th transformer layer of the text encoding branch ($\mathcal{L}_{i+1}$):

$$[W_i] = \mathcal{L}_i(W_{i-1}), \qquad i = 1, 2, \ldots, K.$$

The final text representation z is obtained by projecting the text embedding corresponding to the last token of the last transformer block $\mathcal{L}_K$ to the common V-L latent embedding space via TextProj:

$$z = \text{TextProj}(w_K^N), \qquad z \in \mathbb{R}^{d_{vl}}.$$

Zero-shot Classification: For zero-shot classification, text prompts are hand-crafted with class labels $y \in \{1, 2, \ldots, C\}$ (e.g., ‘a photo of a <category>’) for the C classes. The prediction ŷ for image I is the class whose text embedding has the highest cosine similarity score sim(·) with x, computed with a temperature parameter τ:

$$p(\hat{y} \mid x) = \frac{\exp(\text{sim}(x, z_{\hat{y}})/\tau)}{\sum_{i=1}^{C} \exp(\text{sim}(x, z_i)/\tau)}.$$

3.2. MaPLe: Multi-modal Prompt Learning

To efficiently fine-tune CLIP for downstream image recognition tasks, we explore the potential of multi-modal prompt tuning. We reason that prior works, which have predominantly explored uni-modal approaches, are less suitable as they do not offer the flexibility to dynamically adapt both the language and vision representation spaces. Thus, to achieve completeness in prompting, we underline the importance of a multi-modal prompting approach. In Fig. 3, we visualize and compare the image embeddings of MaPLe with the recent state-of-the-art work Co-CoOp. Note that the image embeddings of CLIP, CoOp and Co-CoOp will be identical as they do not learn prompts in the vision branch. The visualization shows that the image embeddings of MaPLe are more separable, indicating that learning vision prompts in addition to language prompts leads to better adaptation of CLIP.

In addition to multi-modal prompting, we find that it is essential to learn prompts in the deeper transformer layers to progressively model stage-wise feature representations. To this end, we propose to introduce learnable tokens in the first J (where J < K) layers of both the vision and language branches. These multi-modal hierarchical prompts utilize the knowledge embedded in the CLIP model to effectively learn task-relevant contextual representations (see Fig. 4).

3.2.1 Deep Language Prompting

To learn the language context prompts, we introduce b learnable tokens $\{P^i \in \mathbb{R}^{d_l}\}_{i=1}^{b}$ in the language branch of CLIP. The input embeddings now follow the form $[P^1, P^2, \ldots, P^b, W_0]$, where $W_0 = [w^1, w^2, \ldots, w^N]$ corresponds to the fixed input tokens. New learnable tokens are further introduced in each transformer block of the language encoder ($\mathcal{L}_i$) up to a specific depth J:

$$[\,\_\,, W_i] = \mathcal{L}_i([P_{i-1}, W_{i-1}]), \qquad i = 1, 2, \ldots, J. \qquad (1)$$

Here, [·, ·] refers to the concatenation operation.
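A minimal PyTorch sketch of Eq. (1), assuming access to the frozen text-encoder layers as a plain list and omitting batching and attention masks; all names are illustrative, not the released implementation.

```python
# Deep language prompting sketch: fresh learnable prompts are re-inserted in the
# first J layers; deeper layers simply propagate the prompted sequence (Eq. 2).
import torch
import torch.nn as nn

d_l, b, J = 512, 2, 9   # token dim, prompt length, prompt depth (assumed values)

# One set of learnable language prompts P_0 ... P_{J-1}.
lang_prompts = nn.ParameterList(
    [nn.Parameter(torch.randn(b, d_l) * 0.02) for _ in range(J)]
)

def encode_text_with_deep_prompts(W0, text_layers):
    """W0: (N, d_l) fixed word embeddings; text_layers: frozen layers L_1 ... L_K."""
    x = torch.cat([lang_prompts[0], W0], dim=0)              # [P_0, W_0]
    for i, layer in enumerate(text_layers, start=1):
        x = layer(x)
        if i < J:                                            # Eq. (1): discard propagated
            x = torch.cat([lang_prompts[i], x[b:]], dim=0)   # prompt slots, insert fresh P_i
        # for i >= J the prompts are processed unchanged by the remaining layers
    return x                                                 # last-token embedding feeds TextProj
```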
After the J-th transformer layer, the subsequent layers process the previous layer’s prompts, and the final text representation z is computed:

$$[P_j, W_j] = \mathcal{L}_j([P_{j-1}, W_{j-1}]), \qquad j = J+1, \ldots, K, \qquad (2)$$

$$z = \text{TextProj}(w_K^N). \qquad (3)$$

When J = 1, the learnable tokens P are only applied at the input of the first transformer layer, and this deep language prompting technique degenerates to CoOp [49].

3.2.2 Deep Vision Prompting

Similar to deep language prompting, we introduce b learnable tokens $\{\tilde{P}^i \in \mathbb{R}^{d_v}\}_{i=1}^{b}$ in the vision branch of CLIP alongside the input image tokens. New learnable tokens are further introduced in deeper transformer layers of the image encoder (V) up to depth J:

$$[c_i, E_i, \_\,] = \mathcal{V}_i([c_{i-1}, E_{i-1}, \tilde{P}_{i-1}]), \qquad i = 1, 2, \ldots, J,$$
$$[c_j, E_j, \tilde{P}_j] = \mathcal{V}_j([c_{j-1}, E_{j-1}, \tilde{P}_{j-1}]), \qquad j = J+1, \ldots, K,$$
$$x = \text{ImageProj}(c_K).$$

Our deep prompting provides the flexibility to learn prompts across different feature hierarchies within the ViT architecture. We find that sharing prompts across stages is better than independent prompts, as features are more correlated due to successive transformer block processing. Thus, the later stages do not provide independently-learned complementary prompts as compared to the early stages.

3.2.3 Vision Language Prompt Coupling

We reason that in prompt tuning it is essential to take a multi-modal approach and simultaneously adapt both the vision and language branches of CLIP in order to achieve completeness in context optimization. A simple approach would be to naively combine deep vision and language prompting, where both the language prompts P and the vision prompts P̃ are learned during the same training schedule. We name this design ‘Independent V-L Prompting’. Although this approach satisfies the requirement of completeness in prompting, it lacks synergy between the vision and language branches, as the two branches do not interact while learning the task-relevant context prompts.

To this end, we propose a branch-aware multi-modal prompting which tunes the vision and language branches of CLIP together by sharing prompts across both modalities. Language prompt tokens are introduced in the language branch up to the J-th transformer block, similar to deep language prompting as illustrated in Eqs. 1-3. To ensure mutual synergy between V-L prompts, the vision prompts P̃ are obtained by projecting the language prompts P via a vision-to-language projection, which we refer to as the V-L coupling function F(·), such that $\tilde{P}_k = \mathcal{F}_k(P_k)$. The coupling function is implemented as a linear layer which maps $d_l$-dimensional inputs to $d_v$. This acts as a bridge between the two modalities, thus encouraging mutual propagation of gradients:

$$[c_i, E_i, \_\,] = \mathcal{V}_i([c_{i-1}, E_{i-1}, \mathcal{F}_{i-1}(P_{i-1})]), \qquad i = 1, \ldots, J,$$
$$[c_j, E_j, \tilde{P}_j] = \mathcal{V}_j([c_{j-1}, E_{j-1}, \tilde{P}_{j-1}]), \qquad j = J+1, \ldots, K,$$
$$x = \text{ImageProj}(c_K).$$

Unlike independent V-L prompting, explicit conditioning of P̃ on P helps learn prompts in a shared embedding space between the two branches, thus improving mutual synergy.

4. Experiments

4.1. Benchmark setting

Generalization from Base-to-Novel Classes: We evaluate the generalizability of MaPLe and follow a zero-shot setting where the datasets are split into base and novel classes. The model is trained only on the base classes in a few-shot setting and evaluated on both base and novel categories.

Cross-dataset Evaluation: To validate the potential of our approach in cross-dataset transfer, we evaluate our ImageNet-trained model directly on other datasets. Consistent with Co-CoOp, our model is trained on all 1000 ImageNet classes in a few-shot manner.

Domain Generalization: We evaluate the robustness of our method on out-of-distribution datasets. Similar to cross-dataset evaluation, we test our ImageNet-trained model directly on four other ImageNet datasets that contain various types of domain shifts.

Datasets: For generalization from base-to-novel classes and cross-dataset evaluation, we follow [48, 49] and evaluate the performance of our method on 11 image classification datasets which cover a wide range of recognition tasks. This includes two generic-objects datasets, ImageNet [4] and Caltech101 [7]; five fine-grained datasets, OxfordPets [31], StanfordCars [20], Flowers102 [30], Food101 [2], and FGVCAircraft [28]; a scene recognition dataset, SUN397 [40]; an action recognition dataset, UCF101 [36]; a texture dataset, DTD [3]; and a satellite-image dataset, EuroSAT [11]. For domain generalization, we use ImageNet as the source dataset and its four variants as target datasets: ImageNetV2 [35], ImageNet-Sketch [37], ImageNet-A [13] and ImageNet-R [12].

Implementation Details: We use a few-shot training strategy in all experiments with 16 shots, which are randomly sampled for each class. We apply prompt tuning on a pre-trained ViT-B/16 CLIP model where $d_l = 512$, $d_v = 768$ and $d_{vl} = 512$. For MaPLe, we set the prompt depth J to 9 and the language and vision prompt lengths to 2. All models are trained for 5 epochs with a batch size of 4 and a learning rate of 0.0035 via the SGD optimizer on a single NVIDIA A100 GPU. We report base and novel class accuracies and their harmonic mean (HM) averaged over 3 runs.
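To make the branch-aware coupling of Sec. 3.2.3 concrete under the hyper-parameters listed above (d_l = 512, d_v = 768, prompt length b = 2, depth J = 9), the following is a minimal sketch of the learnable state of MaPLe; module and variable names are illustrative assumptions, not the released implementation.

```python
# Learnable multi-modal prompts with a per-layer V-L coupling function (sketch).
import torch
import torch.nn as nn

d_l, d_v, b, J = 512, 768, 2, 9

class MultiModalPrompts(nn.Module):
    def __init__(self):
        super().__init__()
        # Language prompts P_0 ... P_{J-1}; in the paper, P_0 is initialized from the
        # word embeddings of "a photo of a", the rest randomly (random here for brevity).
        self.P = nn.ParameterList(
            [nn.Parameter(torch.randn(b, d_l) * 0.02) for _ in range(J)]
        )
        # V-L coupling functions F_k: a single linear layer mapping d_l -> d_v.
        self.F = nn.ModuleList([nn.Linear(d_l, d_v) for _ in range(J)])

    def forward(self):
        lang = list(self.P)                                  # fed to the text branch
        vis = [f(p) for f, p in zip(self.F, lang)]           # P̃_k = F_k(P_k), fed to the image branch
        return lang, vis

prompts = MultiModalPrompts()
lang, vis = prompts()
print(lang[0].shape, vis[0].shape)  # torch.Size([2, 512]) torch.Size([2, 768])
```

Only these prompts and coupling layers would be optimized; the CLIP backbone stays frozen, so gradients from the vision branch reach the language prompts through F.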
Method                          Base Acc.   Novel Acc.   HM      GFLOPS
1: MaPLe shallow (J = 1)        80.10       73.52        76.67   167.1
2: Deep vision prompting        80.24       73.43        76.68   18.0
3: Deep language prompting      81.72       73.81        77.56   166.8
4: Independent V-L prompting    82.15       74.07        77.90   167.0
5: MaPLe (Ours)                 82.28       75.14        78.55   167.0

Table 1. Comparison of MaPLe with different prompting designs in base-to-novel generalization. Results are averaged over 11 datasets. HM refers to harmonic mean.

We initialize the language prompts of the first layer, P0, with the pre-trained CLIP word embeddings of the template ‘a photo of a <category>’, while for the subsequent layers they are randomly initialized from a normal distribution. For training MaPLe on all 1000 classes of ImageNet as a source model, the prompt depth J is set to 3 and the model is trained for 2 epochs with a learning rate of 0.0026. Hyper-parameters for deep language prompting, deep vision prompting, and independent V-L prompting are detailed in Appendix A. The hyper-parameters are fixed across all datasets.

4.2. Prompting CLIP via Vision-Language Prompts

Prompting Variants: We first evaluate the performance of different possible prompting design choices as an ablation for our proposed branch-aware multi-modal prompting, MaPLe. These variants include shallow MaPLe, deep language prompting, deep vision prompting and independent V-L prompting. In Table 1, we present the results averaged over the 11 image recognition datasets. Shallow MaPLe (row-1) provides consistent improvements over CoOp and Co-CoOp in terms of generalization. Deep language prompting (row-3) shows improvements over deep vision prompting (row-2), indicating that prompts learned at the language branch provide better adaptation of CLIP. Although separately combining the above two approaches (row-4) further improves the performance, it struggles to achieve comprehensive benefits from the language and vision branches. We hypothesize that this is due to the lack of synergy between the learned vision and language prompts, as they do not interact with each other during training. Meanwhile, MaPLe tied with deep prompting (row-5) combines the benefits of prompting in both branches by enforcing interactions through explicit conditioning of vision prompts on the language prompts. It provides improvements on both novel and base class accuracies, which leads to the best HM of 78.55%. We explore other possible design choices and present the ablations in Appendix B.

4.3. Base-to-Novel Generalization

Generalization to Unseen Classes: Table 3 presents the performance of MaPLe in the base-to-novel generalization setting on 11 recognition datasets. We compare its performance with zero-shot CLIP and recent prompt learning works including CoOp [49] and Co-CoOp [48]. In the case of CLIP, we use hand-crafted prompts that are specifically designed for each dataset.

In comparison with the state-of-the-art Co-CoOp, MaPLe shows improved performance on both base and novel categories on all 11 datasets, with the exception of a marginal reduction only on the base class performance of Caltech101. With mutual synergy from the branch-aware multi-modal prompting, MaPLe better generalizes to novel categories on all 11 datasets in comparison with Co-CoOp, and obtains an overall gain from 71.69% to 75.14%. When taking into account both the base and novel classes, MaPLe shows an absolute average gain of 2.72% over Co-CoOp. In comparison with CLIP on novel classes, Co-CoOp improves only on 4/11 datasets, dropping the average novel accuracy from 74.22% to 71.69%. MaPLe is a strong competitor which improves accuracy over CLIP on novel classes on 6/11 datasets, with an average gain from 74.22% to 75.14%.

Generalization and Performance on Base Classes: Co-CoOp solves the poor generalization problem of CoOp by conditioning prompts on image instances and shows significant gains on novel categories. However, on base classes it improves over CoOp only on 3/11 datasets, with an average drop in performance from 82.69% to 80.47%. Meanwhile, the completeness in prompting helps MaPLe improve over CoOp on base classes on 6/11 datasets, maintaining the average base accuracy at around 82.28%, in addition to its improvement in generalization to novel classes.

We find that the training strategies of Co-CoOp can be used to substantially boost the generalization performance of vanilla CoOp (6.8% gain on novel classes). We therefore compare our method with CoOp†, which trains CoOp in the Co-CoOp setting (refer to Appendix A for more details).

Method     Base    Novel   HM
CoOp       82.69   63.22   71.66
Co-CoOp    80.47   71.69   75.83
CoOp†      80.85   70.02   75.04
MaPLe      82.28   75.14   78.55

Table 2. Generalization comparison of MaPLe with CoOp†.

Compared to CoOp†, the vanilla CoOp model seems to overfit on base classes. When compared to CoOp†, which attains an average base accuracy of 80.85%, MaPLe shows an improvement of 1.43% with an average base accuracy of 82.28% (Table 2).

4.4. Cross-Dataset Evaluation

We test the cross-dataset generalization ability of MaPLe by learning multi-modal prompts on all 1000 ImageNet classes and then transferring them directly to the remaining 10 datasets. Table 4 shows the performance comparison between MaPLe, CoOp and Co-CoOp.
Dataset                        CLIP                    CoOp                    Co-CoOp                 MaPLe                   Δ (MaPLe − Co-CoOp)
(a) Average over 11 datasets   69.34 / 74.22 / 71.70   82.69 / 63.22 / 71.66   80.47 / 71.69 / 75.83   82.28 / 75.14 / 78.55   +1.81 / +3.45 / +2.72
(b) ImageNet                   72.43 / 68.14 / 70.22   76.47 / 67.88 / 71.92   75.98 / 70.43 / 73.10   76.66 / 70.54 / 73.47   +0.68 / +0.11 / +0.37
(c) Caltech101                 96.84 / 94.00 / 95.40   98.00 / 89.81 / 93.73   97.96 / 93.81 / 95.84   97.74 / 94.36 / 96.02   -0.22 / +0.55 / +0.18
(d) OxfordPets                 91.17 / 97.26 / 94.12   93.67 / 95.29 / 94.47   95.20 / 97.69 / 96.43   95.43 / 97.76 / 96.58   +0.23 / +0.07 / +0.15
(e) StanfordCars               63.37 / 74.89 / 68.65   78.12 / 60.40 / 68.13   70.49 / 73.59 / 72.01   72.94 / 74.00 / 73.47   +2.45 / +0.41 / +1.46
(f) Flowers102                 72.08 / 77.80 / 74.83   97.60 / 59.67 / 74.06   94.87 / 71.75 / 81.71   95.92 / 72.46 / 82.56   +1.05 / +0.71 / +0.85
(g) Food101                    90.10 / 91.22 / 90.66   88.33 / 82.26 / 85.19   90.70 / 91.29 / 90.99   90.71 / 92.05 / 91.38   +0.01 / +0.76 / +0.39
(h) FGVCAircraft               27.19 / 36.29 / 31.09   40.44 / 22.30 / 28.75   33.41 / 23.71 / 27.74   37.44 / 35.61 / 36.50   +4.03 / +11.90 / +8.76
(i) SUN397                     69.36 / 75.35 / 72.23   80.60 / 65.89 / 72.51   79.74 / 76.86 / 78.27   80.82 / 78.70 / 79.75   +1.08 / +1.84 / +1.48
(j) DTD                        53.24 / 59.90 / 56.37   79.44 / 41.18 / 54.24   77.01 / 56.00 / 64.85   80.36 / 59.18 / 68.16   +3.35 / +3.18 / +3.31
(k) EuroSAT                    56.48 / 64.05 / 60.03   92.19 / 54.74 / 68.69   87.49 / 60.04 / 71.21   94.07 / 73.23 / 82.35   +6.58 / +13.19 / +11.14
(l) UCF101                     70.53 / 77.50 / 73.85   84.69 / 56.05 / 67.46   82.33 / 73.45 / 77.64   83.00 / 78.66 / 80.77   +0.67 / +5.21 / +3.13

Table 3. Comparison with state-of-the-art methods on base-to-novel generalization. MaPLe learns multi-modal prompts and demonstrates strong generalization results over existing methods on 11 recognition datasets. Each cell reports Base / Novel / HM accuracy; absolute gains over Co-CoOp are given in the last column.
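For reference, the HM column throughout Tables 1–3 is the harmonic mean of the base and novel accuracies; for example, for MaPLe’s averages in row (a):

$$\mathrm{HM} = \frac{2 \times 82.28 \times 75.14}{82.28 + 75.14} \approx 78.55.$$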

Method     ImageNet (source)   Caltech101   OxfordPets   StanfordCars   Flowers102   Food101   Aircraft   SUN397   DTD     EuroSAT   UCF101   Average (targets)
CoOp       71.51               93.70        89.14        64.51          68.71        85.30     18.47      64.15    41.92   46.39     66.55    63.88
Co-CoOp    71.02               94.43        90.14        65.32          71.88        86.06     22.94      67.36    45.73   45.37     68.21    65.74
MaPLe      70.72               93.53        90.49        65.57          72.23        86.20     24.74      67.01    46.49   48.06     68.69    66.30

Table 4. Comparison of MaPLe with existing approaches on cross-dataset evaluation. Models are trained on the ImageNet source dataset and evaluated directly on the 10 target datasets. Overall, MaPLe achieves competitive performance, providing the highest average accuracy and indicating better generalization.

On the ImageNet source dataset, MaPLe achieves performance comparable to the competing approaches but demonstrates much stronger generalization performance by surpassing CoOp on 9/10 and Co-CoOp on 8/10 datasets. Overall, MaPLe shows competitive performance, leading to the highest averaged accuracy of 66.30%. This suggests that the use of branch-aware V-L prompting in MaPLe facilitates better generalization.

4.5. Domain Generalization

We show that MaPLe generalizes favourably to out-of-distribution datasets as compared to CoOp and Co-CoOp. We evaluate the direct transferability of the ImageNet-trained model to various out-of-domain datasets, and observe that it consistently improves over all the existing approaches, as indicated in Table 5. This indicates that utilizing multi-modal branch-aware prompting helps MaPLe in enhancing the generalization and robustness of V-L models like CLIP.

           Source     Target
Method     ImageNet   ImageNetV2   ImageNet-S   ImageNet-A   ImageNet-R
CLIP       66.73      60.83        46.15        47.77        73.96
CoOp       71.51      64.20        47.99        49.71        75.21
Co-CoOp    71.02      64.07        48.75        50.63        76.18
MaPLe      70.72      64.07        49.15        50.90        76.98

Table 5. Comparison of MaPLe with existing approaches in the domain generalization setting. MaPLe shows consistent improvements on all target datasets.

4.6. Ablation Experiments

Prompt Depth: In Fig. 4 (left), we illustrate the effect of prompt depth J for MaPLe and ablate on the depth of the language and vision branches individually. In general, the performance improves as prompt depth increases. We note that performance sensitivity increases when randomly initialized prompts are inserted in the deeper layers of a frozen model, where the model feature space is already mature. A similar trend is also reported by [16]. As earlier methods utilize shallow language prompting (J = 1), we compare our method with deep language prompting. Overall, MaPLe achieves better performance than deep language prompting and achieves maximum performance at a depth of 9.

Prompt Length: Fig. 4 (right) shows the effect of prompt length for MaPLe. As the prompt length increases, the performance on base classes is generally maintained, while the novel class accuracy decreases. This indicates over-fitting, which inherently hurts the generalization to novel classes.

Figure 4. Ablation on prompt depth (left) and prompt length (right) in MaPLe. We report average results on the held-out validation sets of all datasets.

Effectiveness of Multi-modal Prompting: Fig. 5 shows the analysis of per-class accuracy for selected datasets in the order of increasing domain shift. It indicates that the performance gains of MaPLe in comparison to Co-CoOp vary across different datasets. MaPLe provides significant gains over Co-CoOp for datasets that have large distribution shifts from the pretraining dataset of CLIP, and for vision concepts that are usually rare and less generic. Further detailed analysis is provided in Appendix C.

Figure 5. Percentage of classes where MaPLe shows improved performance over Co-CoOp, which increases as the dataset domain shift from generic categories increases (→).

Prompting Complexity: Table 6 shows the computational complexity of MaPLe in comparison with other approaches. Although MaPLe utilizes multi-modal prompts, its overall FLOPS (floating point operations) exceed those of CoOp and Co-CoOp by only 0.1%. The independent V-L prompting also provides a comparable FLOP count. In terms of inference speed, Co-CoOp is significantly slower and its FPS (frames per second) remains constant as the batch size increases. In contrast, MaPLe has no such overhead and provides much better inference and training speeds. Further, MaPLe provides better convergence as it requires only half the training epochs of Co-CoOp (5 vs. 10 epochs). MaPLe adds about 2.85% training parameters on top of CLIP. To study whether the performance gain is mainly attributed to more parameters, we experiment with MaPLe†, which uses a unified V-L coupling function for all layer prompts. MaPLe†, with about 9× fewer parameters than MaPLe, also improves over existing methods. We also ablate by comparing MaPLe with a heavier Co-CoOp in Appendix D.

Method            Params    Params % CLIP   FPS (BS=1)   FPS (BS=4)   FPS (BS=100)   HM
CoOp              2048      0.002           13.8         55.3         1353.0         71.66
Co-CoOp           35360     0.03            64.6         114.7        15.1           75.83
Independent V-L   31488     0.02            62.5         239.4        1383.8         77.90
MaPLe             3.55 M    2.85            60.2         239.0        1365.1         78.55
MaPLe†            0.41 M    0.33            60.2         238.0        1365.0         78.11

Table 6. Comparison of computational complexity among different prompting methods. MaPLe† is a MaPLe version which utilizes a common V-L coupling function for all layers.

5. Conclusion

Adaptation of large-scale V-L models, e.g., CLIP [32], to downstream tasks is a challenging problem due to the large number of tunable parameters and the limited size of downstream datasets. Prompt learning is an efficient and scalable technique to tailor V-L models to novel downstream tasks. To this end, the current prompt learning approaches consider only the vision or only the language side for prompting. Our work shows that it is critical to perform prompting for both the vision and language branches to appropriately adapt V-L models to downstream tasks. Further, we propose a strategy to ensure synergy between the vision-language modalities by explicitly conditioning the vision prompts on the textual prompts across different transformer stages. Our approach improves generalization towards novel categories, cross-dataset transfer and datasets with domain shifts.
References

[1] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv preprint arXiv:2203.17274, 2022.
[2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In The European Conference on Computer Vision, pages 446–461. Springer, 2014.
[3] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 2022.
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[7] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
[8] Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. PromptDet: Towards open-vocabulary detection using uncurated images. In The European Conference on Computer Vision, 2022.
[9] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
[10] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[11] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[12] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021.
[13] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021.
[14] Tony Huang, Jack Chu, and Fangyun Wei. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022.
[15] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[16] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In The European Conference on Computer Vision, 2022.
[17] Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, and Xiang Ren. A good prompt is worth millions of parameters? Low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484, 2021.
[18] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In The European Conference on Computer Vision, 2021.
[19] Konwoo Kim, Michael Laskin, Igor Mordatch, and Deepak Pathak. How to adapt your large-scale vision-and-language model, 2022.
[20] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 554–561, 2013.
[21] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing, 2021.
[22] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022.
[23] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021.
[24] Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021.
[25] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022.
[26] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7086–7096, 2022.
[27] Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Ming-Hsuan Yang. Class-agnostic object detection with multi-modal transformer. In The European Conference on Computer Vision. Springer, 2022.
[28] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[29] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. 2022.
[30] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[31] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[33] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. DenseCLIP: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091, 2022.
[34] Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, and Fahad Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. In Advances in Neural Information Processing Systems, 2022.
[35] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019.
[36] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[37] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, volume 32, 2019.
[38] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In The European Conference on Computer Vision, 2022.
[39] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022.
[40] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
[41] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. FILIP: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
[42] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
[43] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
[44] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary DETR with conditional matching. In The European Conference on Computer Vision, 2022.
[45] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.
[46] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. In The European Conference on Computer Vision, 2022.
[47] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Neural prompt search. arXiv preprint arXiv:2206.04673, 2022.
[48] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
[49] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, pages 1–12, 2022.
[50] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In The European Conference on Computer Vision, 2022.
[51] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. arXiv preprint arXiv:2205.14865, 2022.
additional details for the main paper and further experimen-
tal analysis. This section follows the contents in the follow-
ing order.
• Additional implementation details (Appendix A)
• Alternate prompting design choices (Appendix B)
• Understanding multi-modal prompts (Appendix C)
• Comparison of MaPLe with heavier Co-CoOp (Ap-
pendix D)

A. Additional Implementation details


In this section, we provide further hyper-parameter details
of the proposed approaches presented in the main paper. Ta-
ble 7 shows the hyper-parameters chosen for vision, lan-
guage and independent V-L prompting techniques. We use
a learning rate of 0.0025 for language and vision prompting,
and 0.0035 for independent V-L prompting.

Method Prompt Depth (K) V-tokens (P̃ ) T-tokens (P )


Language prompting 12 0 4
Vision prompting 12 4 0
I-V-L prompting 12 2 2

Table 7. Hyper-parameter settings for deep prompting variants.


I-V-L refers to independent V-L prompting. Here K represents
the depth of prompts. Number of prompt tokens in vision and
language branches are denoted as P̃ and P respectively.

CoOp in Co-CoOp setting: The CoOp approach trained


in Co-CoOp setting (denoted by CoOp†) uses training con-
figurations of CoCoOp. Similar to Co-CoOp training,
CoOp† trains the standard CoOp method for 10 epochs in-
stead of default 200 epochs. We use a batch size of 4 with a
learning rate of 0.0035.

B. Alternate Design Choices


Prompt Initialization: Table 8 shows the effect of prompt
initialization on MaPLe. Best performance is achieved
when the learnable prompts in the first layer are initialized
with the prompt ‘a photo of a <category>’ and rest of the
layers are initialized randomly (row-3). Initializing prompts
with a similar template in all layers leads to lower perfor-
mance suggesting that this is redundant as these prompts
learn hierarchically different contextual concepts in differ-
ent layers (row-1). However, complete random initializa-
tion of prompts provides competitive performance (row-2).
For implementation, if the number of learnable prompts
M = #P are less than the total tokens of initial prompt
template, we convert the former M word embeddings of
template with learnable prompts and consider the rest of Prompt Proj. Base Novel HM
word embeddings of prompt template as fixed and use all P̃ → P 81.37 73.25 77.10
token embeddings (learnable prompts + fixed word tokens) P → P̃ 82.28 75.14 78.55
as input to text encoder.
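A minimal sketch of this first-layer initialization using the public CLIP package; the slicing convention (skipping the start-of-text token) and variable names are illustrative assumptions, not the released code.

```python
# Initialize the first-layer language prompts from the template word embeddings (sketch).
import torch
import torch.nn as nn
import clip

clip_model, _ = clip.load("ViT-B/16", device="cpu")
template_tokens = clip.tokenize("a photo of a")               # (1, 77) token ids
with torch.no_grad():
    embed = clip_model.token_embedding(template_tokens)[0]    # (77, d_l) word embeddings

M = 2  # number of learnable prompts (#P), as used in the main paper
# The first M template word embeddings (after the SOT token) become the learnable prompts;
P0 = nn.Parameter(embed[1:1 + M].clone())
# the remaining template word embeddings ("of a") stay as fixed context tokens.
fixed_ctx = embed[1 + M:1 + 4].clone()
```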
Method                                          Base    Novel   HM
1: MaPLe: All layers: ‘a photo of a’            81.90   74.22   77.88
2: MaPLe: Random initialization                 82.27   75.10   78.52
3: MaPLe: Only first layer: ‘a photo of a’      82.28   75.14   78.55

Table 8. Ablation on prompt initialization. In general, the performance of MaPLe is affected by the choice of prompt initialization.

Direction of prompt projection: As discussed in Section 3.2.3, MaPLe explicitly conditions the vision prompts P̃ on the language prompts P (P → P̃) using a V-L coupling function F. Here, we provide an analysis of the alternative design choice where P is conditioned on P̃ (P̃ → P). Table 9 shows that our approach (P → P̃) is the better choice, which can be explained by the lower information loss in such a design, since the dimension size dv of the vision prompts is greater than the dimension size dl of the language prompts.

Prompt Proj.   Base    Novel   HM
P̃ → P         81.37   73.25   77.10
P → P̃         82.28   75.14   78.55

Table 9. Projecting from P to P̃ provides the best results.

Exploring other prompting designs: We provide an analysis of other possible multi-modal prompting design choices in comparison to MaPLe. As learnable prompts in different transformer layers do not interact with each other, we explore a progressive prompting approach where the prompts at each block are conditioned on the prompts from the previous block via a linear projection, which are then added to the deep prompts initialized at every corresponding layer. We apply this approach to independent V-L prompting (row-1) and MaPLe (row-2). To analyze whether independent V-L prompting and MaPLe provide complementary gains, we also explore a design choice combining them together (row-3) in the same model. The results in Table 10 indicate that MaPLe provides the best performance compared to the other design choices.

Method                               Base    Novel   HM
1: I-V-L + Progressive prompting     81.20   74.92   77.93
2: MaPLe + Progressive prompting     81.45   75.04   78.11
3: MaPLe + I-V-L prompting           82.27   74.05   77.94
4: MaPLe                             82.28   75.14   78.55

Table 10. Analysis of alternative design choices for V-L prompting. Overall, MaPLe proves to be the best variant among the alternate prompting-related design choices.

C. Understanding Multi-modal Prompts

Our experimental results in Section 4.3 indicate that the performance gains of MaPLe in comparison to Co-CoOp vary significantly across different datasets. For some datasets, like ImageNet and Caltech101, the gains are less than 1%, while on other datasets like EuroSAT, FGVCAircraft and DTD, MaPLe shows significant improvements of up to +13% over Co-CoOp. To better understand in which cases MaPLe is most effective, we dissect the individual dataset performances and perform an exhaustive per-class analysis. Consistent with earlier work [1], we conjecture that the CLIP pretraining dataset has been curated in a way that maximizes its zero-shot performance on ImageNet-1k, so ImageNet can be used as a proxy for the CLIP pretraining dataset. Further, datasets like EuroSAT (satellite images) and DTD (textures) have a larger distributional gap from ImageNet [1]. Fig. 5 shows a per-class analysis for selected datasets in the order of increasing diversity (distribution gap w.r.t. the CLIP pretraining dataset, i.e., generic objects). The overall trend indicates that MaPLe is more effective than Co-CoOp as the diversity of the dataset increases. We conjecture that this is because fine-tuning or prompting bridges the gap between the distributions of the downstream and pretraining datasets and thus improves the performance. However, the effectiveness would therefore be less substantial for datasets with little distribution shift. This intriguing property has also been validated for visual prompting in the literature [1]. MaPLe provides completeness in prompting by learning both vision and language prompts to effectively steer CLIP; this makes it more adaptive than Co-CoOp and improves results on datasets with larger distribution shifts.

Additionally, we note that MaPLe benefits categories which would have been rarely seen by CLIP during its pretraining (a 400 million image-caption dataset obtained from internet images). We observe that MaPLe provides significant gains over Co-CoOp for vision concepts that tend to be rare and less generic, e.g., satellite images. In contrast, MaPLe performs competitively to Co-CoOp on frequent and more generic categories, e.g., forest, river, dog, etc. Multi-modal prompts allow MaPLe to better adapt CLIP for visual concepts that occur rarely, as compared to existing uni-modal prompting techniques. In Table 12, we highlight a category-wise comparison between MaPLe and Co-CoOp for some selected datasets.

Text embeddings analysis: As all samples within a category are represented using a single text embedding, we take a quantitative approach in Table 11 to analyze the text embeddings of CoOp and MaPLe. We show the pairwise cosine similarity and normalized l2 distance metrics averaged across text embeddings. We observe that MaPLe shows better separability among the categories.
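A minimal sketch of how these two metrics can be computed from a matrix of per-class text embeddings; reading “normalized l2 distance” as the l2 distance between l2-normalized embeddings is our assumption.

```python
# Average pairwise cosine similarity and l2 distance of per-class text embeddings (sketch).
import torch
import torch.nn.functional as F

def avg_pairwise_stats(z):
    """z: (C, d) tensor with one text embedding per category."""
    z = F.normalize(z, dim=-1)                 # l2-normalize each embedding
    cos = z @ z.t()                            # (C, C) pairwise cosine similarities
    l2 = torch.cdist(z, z)                     # (C, C) pairwise l2 distances
    C = z.shape[0]
    off = ~torch.eye(C, dtype=torch.bool)      # exclude self-pairs from the averages
    return cos[off].mean().item(), l2[off].mean().item()

# Lower average cosine similarity / higher average l2 distance -> better separability.
```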
           l2 distance (↑)              Cosine similarity (↓)
Method     DTD     UCF     EuroSAT      DTD     UCF     EuroSAT
CoOp       0.87    0.85    0.57         0.62    0.63    0.83
MaPLe      0.93    0.87    0.78         0.57    0.62    0.69

Table 11. Average cosine similarity and l2 distance of text embeddings. MaPLe shows better separability among the text categories.

Dataset                          MaPLe is better than Co-CoOp             Co-CoOp is better than MaPLe
Caltech101 (Generic Objects)     Brontosaurus, Gerenuk, Sea Horse         Elephant, Ceiling Fan, Cellphone
EuroSAT (Satellite Image)        Annual Crop Land, Permanent Crop Land    -
UCF101 (Action Recognition)      Handstand Walking, Playing Daf           Walking With Dog, Horse Riding

Table 12. Analyzing the nature of categories where MaPLe performs better than Co-CoOp. Co-CoOp performs favourably well on generic categories, while MaPLe provides benefits on classes that are typically rare.

D. Comparing MaPLe with Heavier Co-CoOp


The multi-modal deep prompting architecture of MaPLe, along with its V-L coupling function F, has more learnable parameters than CoOp and Co-CoOp. To verify that the performance gain is not due to the increased parameter count, we compare Co-CoOp with MaPLe shallow (J = 1), which utilizes prompts only at the first layer of the vision and language branches of CLIP. Further, we also experiment with a heavier Co-CoOp, in which we retrain a version of Co-CoOp that matches the parameter count of MaPLe (J = 9) by stacking multiple additional layers in its Meta-Net block. Table 13 indicates the effectiveness of multi-modal prompting in MaPLe (for both J = 1 and J = 9) over the heavier Co-CoOp. In addition to that, we experiment with MaPLe†, which uses a unified V-L coupling function for all layer prompts. MaPLe†, with about 9× fewer parameters than MaPLe, also improves over existing methods. This shows that the difference in the number of parameters is not the cause of the gain in our case, and that the proposed multi-modal prompting design choice makes the difference.

Method                    Base    Novel   HM
Co-CoOp                   80.47   71.69   75.83
Heavier Co-CoOp           80.14   72.02   75.86
MaPLe shallow (J = 1)     80.10   73.52   76.67
MaPLe† (J = 9)            82.29   74.34   78.11
MaPLe (J = 9)             82.28   75.14   78.55

Table 13. Comparison of MaPLe with a heavier Co-CoOp model. We retrain a heavier version of Co-CoOp which is comparable with MaPLe in terms of total parameter count. MaPLe† is a MaPLe version which utilizes a common V-L coupling function for all layers.
