Figure 1. Comparison of MaPLe with standard prompt learning methods. (a) Existing methods adopt uni-modal prompting techniques to fine-tune CLIP representations, as prompts are learned only in a single branch of CLIP (language or vision). (b) MaPLe introduces branch-aware hierarchical prompts that adapt both the language and vision branches simultaneously for improved generalization. (c) MaPLe surpasses state-of-the-art methods on 11 diverse image recognition datasets on the novel class generalization task.
over the state-of-the-art method Co-CoOp [48]. Further, MaPLe demonstrates favorable generalization ability and robustness in cross-dataset transfer and domain generalization settings, leading to consistent improvements compared to existing approaches. Owing to its streamlined architectural design, MaPLe exhibits improved efficiency during both training and inference without much overhead, compared to Co-CoOp, which is inefficient due to its image-instance-conditioned design. In summary, the main contributions of this work include:

• We propose multi-modal prompt learning in CLIP to favourably align its vision-language representations. To the best of our knowledge, this is the first multi-modal prompting approach for fine-tuning CLIP.

• To link prompts learned in the text and image encoders, we propose a coupling function to explicitly condition vision prompts on their language counterparts. It acts as a bridge between the two modalities and allows mutual propagation of gradients to promote synergy.

• Our multi-modal prompts are learned across multiple transformer blocks in both the vision and language branches to progressively learn the synergistic behaviour of both modalities. This deep prompting strategy allows modeling the contextual relationships independently, thus providing more flexibility to align the vision-language representations.

2. Related Work

Vision Language Models: The combined use of language supervision with natural images has attracted great interest in the computer vision community. In contrast to models learned with only image supervision, these vision-language (V-L) models encode rich multimodal representations. Recently, V-L models like CLIP [32], ALIGN [15], LiT [45], FILIP [41] and Florence [43] have demonstrated exceptional performance on a wide spectrum of tasks, including few-shot and zero-shot visual recognition. These models learn joint image-language representations in a self-supervised manner using abundantly available data from the web. For example, CLIP and ALIGN use ∼400M and ∼1B image-text pairs, respectively, to train a multi-modal network. Although these pre-trained V-L models learn generalized representations, efficiently adapting them to downstream tasks is still a challenging problem. Many works have demonstrated better performance on downstream tasks by using tailored methods to adapt V-L models for few-shot image recognition [9, 19, 46], object detection [8, 10, 27, 34, 44, 50], and segmentation [5, 22, 26, 33]. In this work, we propose a novel multi-modal prompt learning technique to effectively adapt CLIP for few-shot and zero-shot visual recognition tasks.

Prompt Learning: Instructions in the form of a sentence, known as a text prompt, are usually given to the language branch of a V-L model, allowing it to better understand the task. Prompts can be handcrafted for a downstream task or learned automatically during the fine-tuning stage. The latter is referred to as 'Prompt Learning', which was first used in NLP [21, 23, 24] and later adapted to V-L [48, 49, 51] and vision-only [16, 38, 39, 47] models.
Similar to [16], our design also uses deep 'vision' prompting; however, ours is the first multi-modal prompting design, while [16] is uni-modal.

Figure 2. Overview of our proposed MaPLe (Multi-modal Prompt Learning) framework for prompt learning in V-L models. MaPLe tunes both the vision and language branches, where only the context prompts are learned while the rest of the model is frozen. MaPLe conditions the vision prompts on the language prompts via a V-L coupling function F to induce mutual synergy between the two modalities. Our framework uses deep contextual prompting, where separate context prompts are learned across multiple transformer blocks.
Prompt Learning in Vision-Language Models: Full fine-tuning and linear probing [9] are two typical approaches to adapt a V-L model (i.e., CLIP) to downstream tasks. Complete fine-tuning degrades the previously learned joint V-L representation, while linear probing limits the zero-shot capability of CLIP. To this end, inspired by prompt learning in NLP, many works have proposed to adapt V-L models by learning prompt tokens in an end-to-end manner. CoOp [49] fine-tunes CLIP for few-shot transfer by optimizing a continuous set of prompt vectors in its language branch. Co-CoOp [48] highlights the inferior performance of CoOp on novel classes and solves the generalization issue by explicitly conditioning prompts on image instances. [25] proposes to optimize multiple sets of prompts by learning a distribution over prompts. [18] adapts CLIP by learning prompts for video understanding tasks. [1] performs visual prompt tuning on CLIP by prompting the vision branch. We note that existing methods follow independent uni-modal solutions and learn prompts either in the language or in the vision branch of CLIP, thus adapting CLIP only partially. In this paper, we explore an important question: given the multimodal nature of CLIP, is complete prompting (i.e., in both the language and vision branches) better suited to adapt CLIP? Our work is the first to answer this question by investigating the effectiveness of multi-modal prompt learning in order to improve the alignment between vision and language representations.

3. Method

Our approach concerns fine-tuning a pre-trained multi-modal CLIP for better generalization to downstream tasks through context optimization via prompting. Fig. 2 shows the overall architecture of our proposed MaPLe (Multi-modal Prompt Learning) framework. Unlike previous approaches [48, 49], which learn context prompts only in the language branch, MaPLe proposes a joint prompting approach where the context prompts are learned in both the vision and language branches. Specifically, we append learnable context tokens in the language branch and explicitly condition the vision prompts on the language prompts via a coupling function to establish interaction between them. To learn hierarchical contextual representations, we introduce deep prompting in both branches through separate learnable context prompts across different transformer blocks. During fine-tuning, only the context prompts along with their coupling function are learned, while the rest of the model is frozen. Below, we first outline the pre-trained CLIP architecture and then present our proposed fine-tuning approach.

3.1. Revisiting CLIP

We build our approach on a pre-trained vision-language (V-L) model, CLIP, which consists of a text encoder and a vision encoder. Consistent with existing prompting methods [48, 49], we use a vision transformer (ViT) [6] based CLIP model. CLIP encodes an image I ∈ R^{H×W×3} and a corresponding text description as explained below.
Figure 3. t-SNE plots of image embeddings from the uni-modal prompting method Co-CoOp and from MaPLe on 3 diverse image recognition datasets: (a) DTD (texture classification), (b) EuroSAT (satellite imagery recognition), and (c) UCF101 (action recognition). MaPLe shows better separability in both base and novel classes. Per-panel accuracies (base/novel): DTD — Co-CoOp 77.0/56.0, MaPLe 80.4/59.2; EuroSAT — Co-CoOp 87.5/60.1, MaPLe 94.1/73.2; UCF101 — Co-CoOp 82.3/73.5, MaPLe 83.0/78.7.
Encoding Image: The image encoder V, with K transformer layers {V_i}_{i=1}^{K}, splits the image I into M fixed-size patches which are projected into patch embeddings E_0 ∈ R^{M×d_v}. Patch embeddings E_i are input to the (i+1)-th transformer block (V_{i+1}) along with a learnable class (CLS) token c_i and sequentially processed through the K transformer blocks,

    [c_i, E_i] = V_i([c_{i-1}, E_{i-1}]),    i = 1, 2, ..., K.

To obtain the final image representation x, the class token c_K of the last transformer layer (V_K) is projected to a common V-L latent embedding space via ImageProj,

    x = ImageProj(c_K),    x ∈ R^{d_vl}.

Encoding Text: The CLIP text encoder generates feature representations for a text description by tokenizing the words and projecting them to word embeddings W_0 = [w_0^1, w_0^2, ..., w_0^N] ∈ R^{N×d_l}. At each stage, W_i is input to the (i+1)-th transformer layer of the text encoding branch (L_{i+1}),

    [W_i] = L_i(W_{i-1}),    i = 1, 2, ..., K.

The final text representation z is obtained by projecting the text embedding corresponding to the last token of the last transformer block L_K to the common V-L latent embedding space via TextProj,

    z = TextProj(w_K^N),    z ∈ R^{d_vl}.

Zero-shot Classification: For zero-shot classification, text prompts are hand-crafted with class labels y ∈ {1, 2, ..., C} (e.g., 'a photo of a <category>') for C classes. The prediction ŷ for image I is the class with the highest cosine similarity score sim(·); the class probabilities are computed with a temperature parameter τ as

    p(ŷ | x) = exp(sim(x, z_ŷ)/τ) / Σ_{i=1}^{C} exp(sim(x, z_i)/τ).
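For illustration, this inference rule can be written as a short PyTorch sketch (ours, not the released implementation), assuming the image feature x and the per-class text features z_i have already been produced by the frozen encoders; the function name zero_shot_probs and the default temperature value are hypothetical.

    import torch
    import torch.nn.functional as F

    def zero_shot_probs(x: torch.Tensor, z: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
        """Class probabilities p(y|x) from CLIP-style features.
        x: (d_vl,) image feature x = ImageProj(c_K)
        z: (C, d_vl) text features z_i, one per hand-crafted class prompt
        tau: temperature parameter (in practice taken from the pre-trained model)
        """
        x = F.normalize(x, dim=-1)   # unit-norm features: dot product = cosine similarity
        z = F.normalize(z, dim=-1)
        sim = z @ x                  # (C,) cosine similarities sim(x, z_i)
        return torch.softmax(sim / tau, dim=-1)

    # toy usage: d_vl = 512, C = 10 classes; the prediction is the argmax of the probabilities
    probs = zero_shot_probs(torch.randn(512), torch.randn(10, 512))
    pred = probs.argmax().item()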
3.2. MaPLe: Multi-modal Prompt Learning

To efficiently fine-tune CLIP for downstream image recognition tasks, we explore the potential of multi-modal prompt tuning. We reason that prior works, which have predominantly explored uni-modal approaches, are less suitable as they do not offer the flexibility to dynamically adapt both the language and vision representation spaces. Thus, to achieve completeness in prompting, we underline the importance of a multi-modal prompting approach. In Fig. 3, we visualize and compare the image embeddings of MaPLe with the recent state-of-the-art work Co-CoOp. Note that the image embeddings of CLIP, CoOp and Co-CoOp are identical, as these methods do not learn prompts in the vision branch. The visualization shows that the image embeddings of MaPLe are more separable, indicating that learning vision prompts in addition to language prompts leads to better adaptation of CLIP.

In addition to multi-modal prompting, we find that it is essential to learn prompts in the deeper transformer layers to progressively model stage-wise feature representations. To this end, we propose to introduce learnable tokens in the first J (where J < K) layers of both the vision and language branches. These multi-modal hierarchical prompts utilize the knowledge embedded in the CLIP model to effectively learn task-relevant contextual representations (see Fig. 4).

3.2.1 Deep Language Prompting

To learn the language context prompts, we introduce b learnable tokens {P^i ∈ R^{d_l}}_{i=1}^{b} in the language branch of CLIP. The input embeddings now take the form [P^1, P^2, ..., P^b, W_0], where W_0 = [w^1, w^2, ..., w^N] corresponds to the fixed input tokens. New learnable tokens are further introduced in each transformer block of the language encoder (L_i) up to a specific depth J,

    [_, W_i] = L_i([P_{i-1}, W_{i-1}]),    i = 1, 2, ..., J.    (1)

Here [·, ·] refers to the concatenation operation. After the J-th transformer layer, the subsequent layers process the previous layer's prompts, and the final text representation z is computed as

    [P_j, W_j] = L_j([P_{j-1}, W_{j-1}]),    j = J+1, ..., K,    (2)

    z = TextProj(w_K^N).    (3)

When J = 1, the learnable tokens P are only applied at the input of the first transformer layer, and this deep language prompting technique degenerates to CoOp [49].
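To make the indexing in Eqs. (1)-(3) concrete, the following is a minimal, self-contained PyTorch sketch of deep language prompting under our own simplifying assumptions: layers is a generic list of (frozen) transformer blocks operating on (sequence_length, d_l) tensors and text_proj stands in for TextProj; it is an illustration, not CLIP's actual text encoder.

    import torch
    import torch.nn as nn

    class DeepLanguagePrompting(nn.Module):
        """Sketch of Eqs. (1)-(3): b learnable prompt tokens are prepended to the word
        embeddings and replaced with fresh learnable tokens at every layer up to depth J;
        beyond depth J the prompt outputs of the previous layer are propagated."""

        def __init__(self, layers: nn.ModuleList, text_proj: nn.Module,
                     b: int = 2, J: int = 9, d_l: int = 512):
            super().__init__()
            self.layers, self.text_proj, self.b, self.J = layers, text_proj, b, J
            # one set of b learnable d_l-dimensional tokens per prompted layer (P_0 ... P_{J-1})
            self.prompts = nn.ParameterList(
                [nn.Parameter(torch.randn(b, d_l) * 0.02) for _ in range(J)])

        def forward(self, W0: torch.Tensor) -> torch.Tensor:
            # W0: (N, d_l) fixed word embeddings of the tokenized text
            W, P = W0, None
            for i, layer in enumerate(self.layers):
                if i < self.J:                           # Eq. (1): inject fresh learnable prompts
                    out = layer(torch.cat([self.prompts[i], W], dim=0))
                else:                                    # Eq. (2): carry previous-layer prompts
                    out = layer(torch.cat([P, W], dim=0))
                P, W = out[:self.b], out[self.b:]
            return self.text_proj(W[-1])                 # Eq. (3): project the last token

    # toy usage with identity blocks standing in for the frozen transformer layers
    blocks = nn.ModuleList([nn.Identity() for _ in range(12)])
    dlp = DeepLanguagePrompting(blocks, text_proj=nn.Linear(512, 512))
    z = dlp(torch.randn(77, 512))                        # final text representation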
3.2.2 Deep Vision Prompting

Similar to deep language prompting, we introduce b learnable tokens {P̃^i ∈ R^{d_v}}_{i=1}^{b} in the vision branch of CLIP alongside the input image tokens. New learnable tokens are further introduced in deeper transformer layers of the image encoder (V) up to depth J,

    [c_i, E_i, _] = V_i([c_{i-1}, E_{i-1}, P̃_{i-1}]),    i = 1, 2, ..., J,

    [c_j, E_j, P̃_j] = V_j([c_{j-1}, E_{j-1}, P̃_{j-1}]),    j = J+1, ..., K,

    x = ImageProj(c_K).

Our deep prompting provides the flexibility to learn prompts across different feature hierarchies within the ViT architecture. We find that sharing prompts across stages performs better than independent prompts, as features are more correlated due to successive transformer block processing; the later stages therefore do not provide independently learned, complementary prompts compared to the early stages.

3.2.3 Vision-Language Prompt Coupling

We reason that in prompt tuning it is essential to take a multi-modal approach and simultaneously adapt both the vision and language branches of CLIP in order to achieve completeness in context optimization. A simple approach would be to naively combine deep vision and language prompting, where both the language prompts P and the vision prompts P̃ are learned during the same training schedule. We name this design 'Independent V-L Prompting'. Although it satisfies the requirement of completeness in prompting, this design lacks synergy between the vision and language branches, as the two branches do not interact while learning the task-relevant context prompts.

To this end, we propose a branch-aware multi-modal prompting which tunes the vision and language branches of CLIP together by sharing prompts across both modalities. Language prompt tokens are introduced in the language branch up to the J-th transformer block, similar to deep language prompting, as illustrated in Eqs. 1-3. To ensure mutual synergy between V-L prompts, the vision prompts P̃ are obtained by projecting the language prompts P via a vision-to-language projection, which we refer to as the V-L coupling function F(·), such that P̃_k = F_k(P_k). The coupling function is implemented as a linear layer which maps d_l-dimensional inputs to d_v. This acts as a bridge between the two modalities, thus encouraging mutual propagation of gradients.

    [c_i, E_i, _] = V_i([c_{i-1}, E_{i-1}, F_{i-1}(P_{i-1})]),    i = 1, ..., J,

    [c_j, E_j, P̃_j] = V_j([c_{j-1}, E_{j-1}, P̃_{j-1}]),    j = J+1, ..., K,

    x = ImageProj(c_K).

Unlike independent V-L prompting, explicit conditioning of P̃ on P helps learn prompts in a shared embedding space between the two branches, thus improving mutual synergy.
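A minimal sketch of the coupling design follows (class and method names are ours, not the released code): each prompted layer k holds language prompts P_k and a linear coupler F_k that produces the corresponding vision prompts P̃_k = F_k(P_k), so gradients from both the vision and language branches flow into the shared language prompts.

    import torch
    import torch.nn as nn

    class VLPromptCoupling(nn.Module):
        """Language prompts plus per-layer V-L coupling functions F_k: R^{d_l} -> R^{d_v}.
        Only these tensors (and the couplers) are trainable; CLIP itself stays frozen."""

        def __init__(self, b: int = 2, J: int = 9, d_l: int = 512, d_v: int = 768):
            super().__init__()
            self.lang_prompts = nn.ParameterList(
                [nn.Parameter(torch.randn(b, d_l) * 0.02) for _ in range(J)])
            self.couplers = nn.ModuleList([nn.Linear(d_l, d_v) for _ in range(J)])

        def prompts_for_layer(self, k: int):
            """Return (P_k, P~_k) to be injected into the k-th text / image blocks."""
            P = self.lang_prompts[k]
            return P, self.couplers[k](P)          # P~_k = F_k(P_k)

    # usage: fetch the coupled prompt pair for the first prompted layer
    coupling = VLPromptCoupling()
    P0, P0_tilde = coupling.prompts_for_layer(0)   # shapes (2, 512) and (2, 768)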
4. Experiments

4.1. Benchmark setting

Generalization from Base-to-Novel Classes: To evaluate the generalizability of MaPLe, we follow a zero-shot setting where the datasets are split into base and novel classes. The model is trained only on the base classes in a few-shot setting and evaluated on both base and novel categories.

Cross-dataset Evaluation: To validate the potential of our approach in cross-dataset transfer, we evaluate our ImageNet-trained model directly on other datasets. Consistent with Co-CoOp, our model is trained on all 1000 ImageNet classes in a few-shot manner.

Domain Generalization: We evaluate the robustness of our method on out-of-distribution datasets. Similar to the cross-dataset evaluation, we test our ImageNet-trained model directly on four other ImageNet datasets that contain various types of domain shift.

Datasets: For generalization from base-to-novel classes and cross-dataset evaluation, we follow [48, 49] and evaluate the performance of our method on 11 image classification datasets covering a wide range of recognition tasks. These include two generic-object datasets, ImageNet [4] and Caltech101 [7]; five fine-grained datasets, OxfordPets [31], StanfordCars [20], Flowers102 [30], Food101 [2], and FGVCAircraft [28]; a scene recognition dataset, SUN397 [40]; an action recognition dataset, UCF101 [36]; a texture dataset, DTD [3]; and a satellite-image dataset, EuroSAT [11]. For domain generalization, we use ImageNet as the source dataset and its four variants as target datasets: ImageNetV2 [35], ImageNet-Sketch [37], ImageNet-A [13] and ImageNet-R [12].

Implementation Details: We use a few-shot training strategy in all experiments with 16 shots randomly sampled for each class. We apply prompt tuning on a pre-trained ViT-B/16 CLIP model where d_l = 512, d_v = 768 and d_vl = 512. For MaPLe, we set the prompt depth J to 9 and the language and vision prompt lengths to 2. All models are trained for 5 epochs with a batch size of 4 and a learning rate of 0.0035 via the SGD optimizer on a single NVIDIA A100 GPU. We report base and novel class accuracies and their harmonic mean (HM) averaged over 3 runs. We initialize the language prompts of the first layer, P_0, with the pre-trained CLIP word embeddings of the template 'a photo of a <category>', while for the subsequent layers they are randomly initialized from a normal distribution. For training MaPLe on all 1000 classes of ImageNet as a source model, the prompt depth J is set to 3 and the model is trained for 2 epochs with a learning rate of 0.0026. Hyper-parameters for deep language prompting, deep vision prompting, and independent V-L prompting are detailed in Appendix A. The hyper-parameters are fixed across all datasets.
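For reference, the reported settings can be gathered into a single configuration; the dictionary below is a hypothetical summary in our own naming, not part of the released code.

    # Hypothetical configuration collecting the reported hyper-parameters.
    MAPLE_CONFIG = {
        "backbone": "ViT-B/16",          # pre-trained CLIP model
        "d_l": 512, "d_v": 768, "d_vl": 512,
        "prompt_depth_J": 9,             # 3 when training the ImageNet source model
        "prompt_length": 2,              # language and vision prompt lengths
        "shots": 16,
        "epochs": 5,                     # 2 for the ImageNet source model
        "batch_size": 4,
        "lr": 0.0035,                    # 0.0026 for the ImageNet source model
        "optimizer": "SGD",
        "runs": 3,                       # base/novel accuracies and HM averaged over 3 runs
    }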
Method | Base Acc. | Novel Acc. | HM | GFLOPS
1: MaPLe shallow (J = 1) | 80.10 | 73.52 | 76.67 | 167.1
2: Deep vision prompting | 80.24 | 73.43 | 76.68 | 18.0
3: Deep language prompting | 81.72 | 73.81 | 77.56 | 166.8
4: Independent V-L prompting | 82.15 | 74.07 | 77.90 | 167.0
5: MaPLe (Ours) | 82.28 | 75.14 | 78.55 | 167.0

Table 1. Comparison of MaPLe with different prompting designs in base-to-novel generalization. Results are averaged over 11 datasets. HM refers to harmonic mean.

4.2. Prompting CLIP via Vision-Language Prompts

Prompting Variants: We first evaluate the performance of different possible prompting design choices as an ablation for our proposed branch-aware multi-modal prompting, MaPLe. These variants include shallow MaPLe, deep language prompting, deep vision prompting and independent V-L prompting. In Table 1, we present the results averaged over the 11 image recognition datasets. Shallow MaPLe (row 1) provides consistent improvements over CoOp and Co-CoOp in terms of generalization. Deep language prompting (row 3) shows improvements over deep vision prompting (row 2), indicating that prompts learned in the language branch provide better adaptation of CLIP. Although separately combining the two approaches (row 4) further improves performance, it struggles to achieve comprehensive benefits from the language and vision branches. We hypothesize that this is due to the lack of synergy between the learned vision and language prompts, as they do not interact with each other during training. Meanwhile, MaPLe combined with deep prompting (row 5) brings together the benefits of prompting in both branches by enforcing interactions through explicit conditioning of the vision prompts on the language prompts. It provides improvements in both novel and base class accuracies, leading to the best HM of 78.55%. We explore other possible design choices and present the ablations in Appendix B.
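As a quick sanity check on Table 1, the HM column is the harmonic mean of the base and novel accuracies; for example, row 5 gives HM(82.28, 75.14) ≈ 78.55:

    def harmonic_mean(base: float, novel: float) -> float:
        """Harmonic mean of base and novel accuracy, as reported in the HM columns."""
        return 2 * base * novel / (base + novel)

    print(round(harmonic_mean(82.28, 75.14), 2))   # 78.55  (MaPLe, Table 1 row 5)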
4.3. Base-to-Novel Generalization

Generalization to Unseen Classes: Table 3 presents the performance of MaPLe in the base-to-novel generalization setting on 11 recognition datasets. We compare its performance with zero-shot CLIP and recent prompt learning works including CoOp [49] and Co-CoOp [48]. In the case of CLIP, we use hand-crafted prompts that are specifically designed for each dataset.

In comparison with the state-of-the-art Co-CoOp, MaPLe shows improved performance on both base and novel categories on all 11 datasets, with the exception of a marginal reduction on only the base class performance of Caltech101. With mutual synergy from the branch-aware multi-modal prompting, MaPLe generalizes better to novel categories than Co-CoOp on all 11 datasets and obtains an overall gain from 71.69% to 75.14%. When taking into account both the base and novel classes, MaPLe shows an absolute average gain of 2.72% over Co-CoOp. In comparison with CLIP on novel classes, Co-CoOp improves on only 4/11 datasets, dropping the average novel accuracy from 74.22% to 71.69%. MaPLe is a strong competitor which improves accuracy over CLIP on novel classes on 6/11 datasets, with an average gain from 74.22% to 75.14%.

Generalization and Performance on Base Classes: Co-CoOp solves the poor generalization problem of CoOp by conditioning prompts on image instances and shows significant gains on novel categories. However, on base classes it improves over CoOp on only 3/11 datasets, with an average drop in performance from 82.69% to 80.47%. Meanwhile, the completeness in prompting helps MaPLe improve over CoOp on base classes on 6/11 datasets, maintaining the average base accuracy at around 82.28%, in addition to its improvement in generalization to novel classes.

We find that the training strategies of Co-CoOp can be used to substantially boost the generalization performance of vanilla CoOp (a 6.8% gain on novel classes). We therefore compare our method with CoOp†, which trains CoOp in the Co-CoOp setting (refer to Appendix A for more details).

Method | Base | Novel | HM
CoOp | 82.69 | 63.22 | 71.66
Co-CoOp | 80.47 | 71.69 | 75.83
CoOp† | 80.85 | 70.02 | 75.04
MaPLe | 82.28 | 75.14 | 78.55

Table 2. Generalization comparison of MaPLe with CoOp†.

Compared to CoOp†, the vanilla CoOp model seems to overfit on base classes. When compared to CoOp†, which attains an average base accuracy of 80.85%, MaPLe shows an improvement of 1.43% with an average base accuracy of 82.28% (Table 2).

4.4. Cross-Dataset Evaluation

We test the cross-dataset generalization ability of MaPLe by learning multi-modal prompts on all 1000 ImageNet classes and then transferring them directly to the remaining 10 datasets. Table 4 shows the performance comparison between MaPLe, CoOp and Co-CoOp. On the ImageNet source dataset, MaPLe achieves performance comparable to the competing approaches but demonstrates a much stronger
Method | ImageNet (source) | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | Aircraft | SUN397 | DTD | EuroSAT | UCF101 | Average (targets)
CoOp | 71.51 | 93.70 | 89.14 | 64.51 | 68.71 | 85.30 | 18.47 | 64.15 | 41.92 | 46.39 | 66.55 | 63.88
Co-CoOp | 71.02 | 94.43 | 90.14 | 65.32 | 71.88 | 86.06 | 22.94 | 67.36 | 45.73 | 45.37 | 68.21 | 65.74
MaPLe | 70.72 | 93.53 | 90.49 | 65.57 | 72.23 | 86.20 | 24.74 | 67.01 | 46.49 | 48.06 | 68.69 | 66.30

Table 4. Comparison of MaPLe with existing approaches on cross-dataset evaluation. Overall, MaPLe shows competitive performance, providing the highest average accuracy and indicating better generalization.
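As a sanity check on Table 4, the last column is the mean over the 10 target datasets (the ImageNet source column excluded); recomputing it for Co-CoOp and MaPLe reproduces 65.74 and 66.30:

    cocoop_targets = [94.43, 90.14, 65.32, 71.88, 86.06, 22.94, 67.36, 45.73, 45.37, 68.21]
    maple_targets  = [93.53, 90.49, 65.57, 72.23, 86.20, 24.74, 67.01, 46.49, 48.06, 68.69]
    print(round(sum(cocoop_targets) / len(cocoop_targets), 2))  # 65.74
    print(round(sum(maple_targets) / len(maple_targets), 2))    # 66.30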