Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Shaunak Halbe*1, Junjiao Tian1, K J Joseph2, James Seale Smith1, Katherine Stevo1, Vineeth N Balasubramanian3, Zsolt Kira1
1 Georgia Institute of Technology, 2 Adobe Research, 3 Indian Institute of Technology, Hyderabad

arXiv:2412.04429v1 [cs.CV] 5 Dec 2024

Abstract

Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code available at https://github.com/shaunak27/grain-clip.

1. Introduction

Traditionally, image classification has operated under the closed-set assumption, where models are evaluated on a fixed set of classes that were seen during training. However, in the real and open world, models need to account for test conditions where the number of classes is unknown during training and can include classes that were not seen. Vision-language models (VLMs) like CLIP [33] offer a solution in this space, owing to their open-vocabulary nature. These models undergo extensive pretraining on large datasets containing paired image-text data and learn to encode images and texts in a shared latent space where semantically similar representations are mapped close together. For zero-shot classification, CLIP leverages the names of all classes within the test dataset (referred to as the vocabulary) as the candidate set, and determines the most probable image-classname pairing by computing the similarity between their latent representations. This vocabulary of classes is unconstrained, enabling the inclusion of any concept, regardless of its presence in the training set. This facilitates classification from an open set of concepts.

Despite this, CLIP's zero-shot capabilities are still limited by a few critical challenges. Firstly, in practice, CLIP often struggles to differentiate between fine-grained categories, a limitation highlighted by its under-performance on Fine-Grained Visual Classification (FGVC) datasets [24, 42]. Secondly, while known for its open-vocabulary potential, it can still perform poorly for some domains not well-represented in the training distribution, especially if the vocabulary used has confounding categories during testing. Using a vocabulary that exceeds the scope of the test dataset significantly diminishes the performance of CLIP even for common datasets like ImageNet [13]. This decline is again largely attributed to CLIP's challenges in differentiating between semantically similar, fine-grained concepts. Additionally, CLIP's inability to recognize novel concepts, such as Apple Vision Pro, that were not present during its training phase further restricts its capability to function as a genuinely open-vocabulary model.

Recent works [25, 32] aim to address these challenges by incorporating extra information in the form of class descriptions generated by Large Language Models (LLMs) at test time. These approaches leverage the "visual" knowledge embedded in LLMs to augment the textual representations used in zero-shot classification.

* Correspondence to [email protected]
As an example, the class French Bulldog would be expanded to "A French Bulldog, which has small and pointy ears." These methods provide some improvements over standard CLIP models, though they leave room for further advancements.

We hypothesize that the reason for the limited gains from injecting these descriptions lies in the poor alignment between image and description embeddings learned by CLIP. As a result, we aim to verify this hypothesis and propose a method to overcome these challenges. Specifically, we posit that the misalignment between images and descriptions stems from CLIP's training structure, which focuses solely on the global objective of matching entire images to their overarching captions, neglecting the rich information that image regions and textual descriptions share with each other. Our observations align with recent research indicating that CLIP tends to overlook fine-grained visual details during pretraining, leading to subpar performance on tasks requiring localization [35], object attributes [48], or physical reasoning [30].

In this work, we propose GRAIN: Grounding and contrastive alignment of descriptions, a novel objective for contrastive vision-language pretraining that learns representations more conducive to zero-shot visual recognition. This is achieved through fine-grained correspondence between image regions and detailed text descriptions. As a first step towards our approach, given that pretraining datasets (Conceptual Captions [39], LAION [38], etc.) only contain images with noisy captions but without detailed descriptions, we employ an instruction-tuned Multimodal Large Language Model (MLLM) to generate descriptions and identify salient attributes from the images in these datasets. Following this, we acquire region-level annotations that correspond to these descriptions using an off-the-shelf Open-vocabulary Object Detector (OVD). We then propose a method that learns to jointly ground text descriptions into specific image regions along with aligning image and caption representations at a global level. This strategy aims to learn representations that encode both coarse-grained (global) and fine-grained (local) information. To achieve this, we introduce a query transformer architecture for encoding images and a text encoder for processing captions and descriptions. The architecture and objectives of our model are specifically crafted to learn object/region-aware image representations that are valuable for zero-shot tasks, as we demonstrate in the subsequent sections. Finally, to evaluate our model's ability to recognize novel concepts, we curate and manually label a new image classification dataset, Products-2023, and benchmark upon it.

Figure 1. Overview of our two-stage annotation process: (1) prompting LLaVA for image descriptions and (2) acquiring corresponding region annotations from OWLv2.

To summarize, our main contributions are as follows:
• We hypothesize and show that CLIP pre-training lacks fine-grained aligned representations, leading to poor zero-shot performance in some domains.
• We propose GRAIN, a novel pre-training architecture and objective designed to simultaneously learn local and global correspondences, obtained via weak supervision from Multimodal LLMs and open-vocabulary detectors.
• To drive this pre-training, we introduce an automated annotation engine to source a fine-grained supervision signal.
• We demonstrate significant gains across a range of tasks, including image classification and retrieval, specifically improving over the state-of-the-art by up to 9% in absolute top-1 accuracy for zero-shot classification and up to 25% across cross-modal retrieval tasks.
• Acknowledging the lack of novel image classification datasets, we collect and manually label a dataset, Products-2023, for benchmarking, which we plan to release.
• Additionally, we aim to release our pre-trained model weights along with the large-scale annotations to aid future research.

2. Related Works

Contrastive Language-Image Pretraining. Follow-up works on CLIP [34] and ALIGN [15] focus on improving the quality of learned representations by further introducing self-supervision or cross-modal alignment objectives [14, 28, 49]. Relevant to our focus, FILIP [46] introduces a cross-modal late interaction mechanism that explores token-wise maximum similarity between image and text tokens to improve fine-grained alignment. Recently, SPARC [1] proposes a sparse similarity metric between image patches and text tokens to learn fine-grained representations. While our paper shares motivation with these works, we address the fact that web-based captioning datasets [38, 39] contain noisy captions that lack descriptive information, thereby limiting the gains achievable from such elaborate objectives. Instead, we source rich text descriptions and region annotations and design a pre-training objective to learn from them. This allows us to effectively use complementary information at test-time (in the form of LLM-generated descriptions) to recognize fine-grained or novel entities.
Figure 2. Architecture overview. Our method, GRAIN, aligns image representations to text captions at a global level while localizing salient image regions and aligning them to text descriptions at the local level.

Improving CLIP using Generative Models. Recent works have explored the use of LLMs towards improving the downstream performance of CLIP. Menon et al. [25] and CuPL [32] focus on the task of zero-shot classification, and prompt GPT-3 [2] at test-time to generate class descriptions. These descriptions are integrated into the classification prompts to achieve gains in terms of accuracy and interpretability. Different from these, LaCLIP [11] and VeCLIP [17] use LLMs to rephrase captions from pretraining datasets and observe noticeable gains on downstream tasks by training on these captions. In this paper, we propose to leverage synthetic annotations, in the form of image regions and descriptions generated by an MLLM and an open-world detector, to drive a novel pretraining strategy.

3. Approach

We propose GRAIN, a novel pretraining approach that simultaneously learns local and global correspondences between image and text representations. Motivated by the observation that CLIP representations lack sufficient fine-grained understanding, we introduce a transformer-based architecture, inspired by DETR [4], to infuse the rich context from sub-image regions into learned visual representations. Alongside encoding the image into a semantic representation, our model predicts bounding boxes for salient image regions containing discriminative information. These localizations are then aligned with detailed textual descriptions. To supervise this fine-grained objective, we first generate annotations at scale by leveraging Multimodal Large Language Models (MLLMs) and Open-vocabulary Object Detectors (OVDs). In this section, we first elaborate on our automated annotation process and then proceed to discuss our architecture and training methodology.

3.1. Weak Supervision from MLLMs and OVDs

We utilize the 3M and 12M versions of the Conceptual Captions [39] dataset (CC3M, CC12M) to train our model. These datasets contain images sourced from the internet, each paired with corresponding alt-texts (or captions). Our approach requires region-level supervision that is not provided by any existing dataset at scale. Specifically, we find that the captions associated with these images are often noisy, lack detail, and may not fully capture the dense visual context. To learn fine-grained correspondence between the two modalities, we propose focusing on regions within the image and their descriptions in text as supervision for training our model. For generating descriptions and locating their corresponding regions, we leverage an instruction-tuned Multimodal Large Language Model, LLaVA [21]. We select LLaVA for its superior captioning capabilities and accessibility due to its openness; however, our approach is fundamentally compatible with any multimodal LLM. For our annotation purposes, we select the LLaVA v1.6 model, which integrates a pretrained Vision Transformer Large (ViT-L) [10] as the visual encoder with the Vicuna-13B LLM [7]. It is worth noting that we only use LLaVA to describe regions/components of the image at a high level and not to pinpoint specific fine-grained categories. A common problem with instruction-tuned models like LLaVA is their tendency to hallucinate, which causes the model to output sentences that are not well-grounded in the image.
Figure 3. Contrastively align predicted regions with descriptions.

Figure 4. For zero-shot image classification, the image output embedding is compared with text embeddings of classnames enriched with descriptions.

To address this, we propose a two-stage approach, as illustrated in Figure 1, to elicit accurate descriptions from LLaVA while minimizing hallucination.

Specifically, the two-stage prompting approach is as follows: in the first stage, we ask LLaVA to identify the primary visual subject in the image using a simple, fixed prompt: "What is the primary visual subject in this image? Answer in 2-3 words at most." By doing this for every image, we collect the main focus of each image. The generations from this prompt typically capture the prominent object, scene, or concept at a high level. Next, we construct specific prompts for each image by asking LLaVA to describe the identified subject: "What are some distinguishing visual features of this {subject}? Answer as a concise list of features". We observe that the generations from this two-stage pipeline are more faithful to the visual context and less susceptible to hallucinations. We present a qualitative analysis on this in the Appendix. This procedure provides us with a list of descriptions for each image. Additionally, we ask LLaVA to generate a short one-line description of the image by prompting it with "Describe this image in one line". This description gives a high-level overview of the visual context, and it is utilized as text-level data augmentation during training. From this point forward, we refer to this description as the MLLM-caption, and the one from the pretraining dataset as the original caption.
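The sketch below illustrates this two-stage prompting pipeline. `mllm_generate(image, prompt)` is a hypothetical wrapper around a LLaVA-style instruction-tuned MLLM (the exact interface depends on the checkpoint and library); the prompts are the ones stated above, while the output parsing is an assumption of this sketch.

```python
# Minimal sketch of the two-stage annotation prompting described above.
# `mllm_generate(image, prompt) -> str` is a hypothetical MLLM wrapper.

def annotate_image(image, mllm_generate):
    # Stage 1: identify the primary visual subject with a fixed prompt.
    subject = mllm_generate(
        image,
        "What is the primary visual subject in this image? "
        "Answer in 2-3 words at most.",
    ).strip()

    # Stage 2: ask only about distinguishing features of that subject,
    # which keeps generations grounded and reduces hallucination.
    features = mllm_generate(
        image,
        f"What are some distinguishing visual features of this {subject}? "
        "Answer as a concise list of features",
    )
    descriptions = [ln.strip("-• ").strip() for ln in features.splitlines() if ln.strip()]

    # One-line overview used as the MLLM-caption (text-level augmentation).
    mllm_caption = mllm_generate(image, "Describe this image in one line").strip()

    return {"subject": subject, "descriptions": descriptions, "mllm_caption": mllm_caption}
```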
Next, we are tasked with localizing these generated descriptions within the image to obtain the necessary supervision for training our grounding module. We leverage the OWLv2 Open-vocabulary Detector [26] to localize these descriptions within the image. For each description, we extract the core attribute being referred to and pass it to the open-world detector for localization. The detector generates several candidate proposals, from which we select detections based on a confidence threshold value. We set this threshold to a relatively high value to ensure high-quality detections. Subsequently, we eliminate redundant bounding box predictions using non-maximum suppression, retaining only the box with the highest confidence score for each region and discarding others with significant overlap.

This procedure enables us to acquire descriptions, bounding boxes, and MLLM-captions, which are subsequently utilized to train our model, as detailed in the upcoming section. To our knowledge, we are the first to obtain such fine-grained annotations on a large scale. The overall annotation process took around 600 GPU hours for CC3M and ∼2200 GPU hours for CC12M using NVIDIA A40s. We aim to release this dataset to benefit future research in this direction.
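A sketch of this proposal-filtering step is shown below: keep only confident detections and suppress overlapping boxes with NMS. The 0.3 confidence threshold follows the value given in the implementation details; the NMS IoU threshold of 0.5 is an assumed value rather than one taken from the paper.

```python
import torch
from torchvision.ops import nms


def filter_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      conf_thresh: float = 0.3, iou_thresh: float = 0.5) -> torch.Tensor:
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,). Returns indices of kept boxes."""
    idx = (scores >= conf_thresh).nonzero(as_tuple=True)[0]   # drop low-confidence proposals
    if idx.numel() == 0:
        return idx
    # Non-maximum suppression: keep the highest-scoring box per region,
    # discarding others with significant overlap.
    kept = nms(boxes[idx], scores[idx], iou_thresh)
    return idx[kept]
```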
Table 1. Zero-shot transfer evaluation of different models. We highlight the best performance of each setting in bold. We see that GRAIN improves performance under both pretraining datasets, outperforming CLIP by up to 9% in absolute top-1 accuracy. CLIP* is a version of CLIP with the same number of parameters as our method for fair comparison.
Columns: Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Average, Flowers, DTD, CUB, Cars, Pets.

LLaVA + CLIP: 89.69 57.72 55.24 15.90 35.37 47.16 75.03 24.69 6.22 29.43 52.80 44.48 35.20

CC3M
CLIP [34]: 48.86 18.70 28.44 0.68 9.23 6.94 41.02 8.48 2.51 17.85 8.73 17.40 14.01
Menon&Vondrick [25]: 49.35 17.93 29.74 0.60 10.43 7.05 43.89 7.67 2.84 19.12 9.64 18.02 14.12
CuPL [32]: 50.16 18.98 29.66 0.71 9.89 8.22 43.95 8.84 2.91 19.73 10.51 18.51 14.14
CLIP*: 46.99 18.49 29.76 0.52 8.40 6.62 42.56 8.29 3.36 18.70 10.01 17.62 14.04
CLIP* + Menon&Vondrick [25]: 49.37 17.98 29.94 0.62 10.55 7.14 44.02 8.38 3.51 19.23 10.24 18.27 14.16
CLIP* + CuPL [32]: 50.24 18.86 30.12 0.74 10.14 8.06 43.78 8.95 3.32 19.56 10.77 18.59 14.14
GRAIN (Ours): 65.86 35.20 38.07 1.34 17.24 14.15 65.20 13.24 5.47 24.96 16.18 27.00 23.34

CC12M
CLIP [34]: 71.24 36.66 48.84 4.57 19.28 42.06 70.09 20.51 7.63 31.84 40.94 35.79 34.66
Menon&Vondrick [25]: 72.68 37.08 48.59 5.12 18.45 41.38 72.29 21.15 8.27 31.36 41.20 36.14 34.32
CuPL [32]: 72.85 37.37 49.06 4.88 18.71 41.58 71.17 22.82 7.94 30.28 40.89 36.15 34.65
CLIP*: 70.07 35.63 50.42 4.31 18.35 39.40 74.24 21.04 7.96 32.03 41.36 35.89 33.51
CLIP* + Menon&Vondrick [25]: 72.74 37.44 51.20 5.31 18.47 41.74 74.44 21.22 8.32 32.72 41.92 36.87 34.50
CLIP* + CuPL [32]: 72.77 37.85 51.08 5.12 18.98 41.14 74.22 22.68 8.05 32.34 41.65 36.90 34.77
GRAIN (Ours): 81.40 46.23 55.26 8.42 25.68 48.76 81.49 26.27 10.28 36.76 45.39 42.36 41.46

3.2. Model Architecture

We adopt a dual-encoding approach similar to CLIP for processing image and text modalities, leveraging contrastive learning to align these representations. For visual representations, we utilize an encoder-decoder network architecture. Notably, all components of our architecture are trained from scratch without any pretrained initialization. In our vision encoder, we adopt a standard vision transformer (ViT) that divides the input image into HW/P^2 patches, where (H, W) is the input image resolution and P denotes the patch size. The output tokens corresponding to each input patch are fed into our transformer decoder as shown in Figure 2. Both text descriptions and captions are processed by a text transformer which utilizes the same architecture employed in CLIP.

Transformer Decoder. Inspired by DETR [4], we implement a transformer decoder that takes as input a small number of learnable position embeddings called queries and attends to the encoder output. We use two types of queries as input to this model. First, we have n_q region queries, whose corresponding outputs are used to predict bounding boxes. Additionally, we use a single image query to learn the overall image context. The transformer model transforms these input queries through self-attention between region and image queries and cross-attention with the encoder output to form output embeddings. The embeddings corresponding to the region queries are utilized for bounding box prediction and serve as semantic representations for local regions, while the embedding corresponding to the image query captures the overall image representation needed for contrastive learning alongside captions. This image query output is passed through a projection layer before contrastive alignment with the text captions. The bounding box prediction module is exclusively used during training to learn region-aware image features and is inactive during evaluation.

Bounding-Box Prediction. The region output embeddings are fed into a multi-layer perceptron for bounding box prediction. The input size of this MLP is equal to the embedding dimension d and the output size is set to 4, corresponding to the four bounding box coordinates. These MLP weights are shared across all queries.

Semantic Representations. Each region output embedding is additionally passed through a projection layer to map it into the shared semantic space. The resulting semantic representations are utilized for contrastive alignment with text descriptions. This region-description alignment procedure is illustrated in Figure 3.
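The following is a condensed sketch of this query-based decoder, assuming a ViT encoder that returns patch tokens of dimension d. Hyperparameters follow the paper where stated (n_q = 10 region queries, 6 decoder layers, 4 box coordinates, 768-d encoder features projected to 512); the remaining details (number of heads, sigmoid-normalized boxes) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GrainDecoder(nn.Module):
    def __init__(self, d: int = 768, d_proj: int = 512, n_q: int = 10, n_layers: int = 6):
        super().__init__()
        # n_q region queries plus one image query, all learnable.
        self.queries = nn.Parameter(torch.randn(n_q + 1, d) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Shared 2-layer MLP regressing 4 box coordinates per region query.
        self.box_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))
        # Separate projections into the shared semantic space.
        self.region_proj = nn.Linear(d, d_proj)
        self.image_proj = nn.Linear(d, d_proj)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, HW/P^2, d) from the ViT encoder.
        B = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(tgt=q, memory=patch_tokens)        # self- and cross-attention
        region_out, image_out = out[:, :-1], out[:, -1]       # split region vs. image query
        boxes = self.box_head(region_out).sigmoid()           # normalized boxes (assumed, DETR-style)
        region_emb = self.region_proj(region_out)             # aligned with descriptions
        image_emb = self.image_proj(image_out)                # aligned with captions
        return boxes, region_emb, image_emb
```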
3.3. Training Objectives

Our approach simultaneously optimizes for three objectives: localizing salient regions within the image, contrastively aligning text descriptions to these salient image region representations, and globally aligning images with captions.

Image-Caption Alignment (L_ic). We adopt the symmetric cross-entropy loss from CLIP to maximize the similarity between correct image-caption pairings while contrasting against incorrect pairings within the batch. As with CLIP, we use the [EOS] token from the last layer of the text transformer and the output embedding corresponding to the image query as feature representations for L_ic.

Bounding Box Loss (L_box). Our model predicts n_q bounding boxes per image corresponding to the region queries. n_q is set to be greater than or equal to the maximum number of objects per image in the training set. Given the variable number of objects per image, we employ the Hungarian matching algorithm to establish a bipartite matching between predicted and ground-truth boxes. For the matched boxes, we implement the bounding box loss derived from DETR, which combines the scale-invariant IoU loss and the L1 loss between the bounding box coordinates. Overall, the bounding box loss is defined as L_box(b_i, b̂_σ(i)) = L_iou(b_i, b̂_σ(i)) + ||b_i − b̂_σ(i)||_1.
Region-Description Alignment (L_rd). We use an InfoNCE loss [40] to learn alignment between output region embeddings and descriptions. Here, the descriptions corresponding to ground-truth bounding boxes serve as supervision. We leverage the matched indices between predicted outputs and ground-truth boxes obtained via the Hungarian matching algorithm in the last step to determine ground-truth descriptions for each predicted region output embedding. These matched ground truths are considered positive pairings, while all other pairings within the batch are treated as negatives for InfoNCE. Optimizing for this loss enables our model to learn fine-grained associations between rich textual descriptions and salient image regions that contain discriminative visual features. Overall, the final objective function is an equally weighted combination of the three components:

L_total = L_ic + L_box + L_rd    (1)
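The sketch below pulls the three terms together for illustration. The region-level terms are shown for a single image with G ground-truth boxes/descriptions (negatives restricted to that image for brevity, whereas the paper uses all pairings in the batch), while the image-caption term is over the batch. Box format (x1, y1, x2, y2), GIoU as the scale-invariant IoU stand-in, and a fixed temperature are simplifying assumptions rather than exact implementation details.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou


def grain_losses(pred_boxes, region_emb, gt_boxes, desc_emb,
                 image_emb, caption_emb, temperature=0.07):
    # --- Hungarian matching between predicted and ground-truth boxes ---
    cost = torch.cdist(pred_boxes, gt_boxes, p=1) - generalized_box_iou(pred_boxes, gt_boxes)
    pi, gi = linear_sum_assignment(cost.detach().cpu().numpy())
    pred_idx = torch.as_tensor(pi, device=pred_boxes.device)
    gt_idx = torch.as_tensor(gi, device=pred_boxes.device)

    # --- Bounding box loss: L1 + (1 - IoU) on matched pairs (DETR-style) ---
    mp, mg = pred_boxes[pred_idx], gt_boxes[gt_idx]
    l_box = F.l1_loss(mp, mg) + (1 - torch.diag(generalized_box_iou(mp, mg))).mean()

    # --- Region-description alignment: InfoNCE with matched descriptions as positives ---
    r = F.normalize(region_emb[pred_idx], dim=-1)          # matched region embeddings
    t = F.normalize(desc_emb[gt_idx], dim=-1)              # their ground-truth descriptions
    logits = r @ t.t() / temperature                        # unmatched pairings act as negatives
    tgt = torch.arange(len(pred_idx), device=logits.device)
    l_rd = 0.5 * (F.cross_entropy(logits, tgt) + F.cross_entropy(logits.t(), tgt))

    # --- Image-caption alignment: CLIP-style symmetric cross-entropy over the batch ---
    i = F.normalize(image_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    ic = i @ c.t() / temperature
    lab = torch.arange(i.size(0), device=ic.device)
    l_ic = 0.5 * (F.cross_entropy(ic, lab) + F.cross_entropy(ic.t(), lab))

    return l_ic + l_box + l_rd                              # equally weighted total, Eq. (1)
```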


Table 2. Results (Recall@k) on zero-shot image-to-text and text-to-image retrieval tasks on MS-COCO and Flickr30k.

Data | Model | MS-COCO I2T R@1/R@5/R@10 | MS-COCO T2I R@1/R@5/R@10 | Flickr30k I2T R@1/R@5/R@10 | Flickr30k T2I R@1/R@5/R@10
CC3M | CLIP | 15.79 / 38.26 / 50.70 | 13.58 / 33.76 / 46.04 | 27.00 / 53.80 / 66.30 | 21.78 / 44.26 / 55.10
CC3M | GRAIN | 38.26 / 65.96 / 77.03 | 28.81 / 55.86 / 69.00 | 59.90 / 81.80 / 88.40 | 42.82 / 68.21 / 76.54
CC3M | Δ | +22.47 / +27.70 / +26.33 | +15.23 / +22.10 / +22.96 | +32.90 / +28.00 / +22.10 | +21.04 / +23.95 / +21.44
CC12M | CLIP | 41.32 / 69.40 / 80.04 | 30.02 / 57.32 / 69.65 | 59.60 / 84.70 / 89.90 | 43.63 / 68.75 / 76.77
CC12M | GRAIN | 58.30 / 83.07 / 89.67 | 42.66 / 70.77 / 80.83 | 78.00 / 94.60 / 97.80 | 59.36 / 80.01 / 85.59
CC12M | Δ | +16.98 / +13.67 / +9.63 | +12.64 / +13.45 / +11.18 | +18.40 / +9.90 / +7.90 | +15.73 / +11.26 / +8.82

3.4. Inference

At inference time, our model behaves similarly to CLIP, conducting zero-shot classification/retrieval by computing image-text similarities. The image output embedding from our decoder serves as the feature representation for the image. Through self- and cross-attention mechanisms, this feature is informed about the fine-grained regions that are characteristic of the given image. The localization modules are inactive during inference; however, they can be used to provide valuable insights for interpreting the model's predictions. For zero-shot image classification (Tables 1, 5), we enhance class names by appending their descriptions, as illustrated in Figure 4. These descriptions are sourced from an LLM, similar to [25, 32]. Leveraging the rich image-text correspondences learned during training, our model effectively uses these descriptions to recognize fine-grained and novel categories.
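A minimal sketch of this description-enriched zero-shot classification is given below. `encode_image` / `encode_text` stand for the trained GRAIN encoders (image-query output and text transformer), `class_descriptions` maps each class name to its LLM-generated descriptions, and the prompt template itself is illustrative rather than the exact one used.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image, classnames, class_descriptions, encode_image, encode_text):
    img = F.normalize(encode_image(image), dim=-1)              # (d,)
    class_embs = []
    for name in classnames:
        prompts = [f"{name}, which has {desc}" for desc in class_descriptions[name]]
        txt = F.normalize(encode_text(prompts), dim=-1)          # (num_descriptions, d)
        class_embs.append(txt.mean(dim=0))                       # average a class's prompt embeddings
    class_embs = F.normalize(torch.stack(class_embs), dim=-1)    # (num_classes, d)
    scores = class_embs @ img                                    # cosine similarity per class
    return classnames[scores.argmax().item()]
```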
4. Experiments

The goal of our method is to learn fine-grained vision-language representations that can aid zero-shot visual recognition. By recognizing and addressing the alignment discrepancy between CLIP's representations of image regions and the rich textual context, our method learns visual representations that are aware of the salient regions in the image and their associations with corresponding textual descriptions. Although the focus of our method is on visual recognition, we observe that our learned representations are of high quality through experiments on cross-modal retrieval benchmarks. We compare against CLIP as our primary baseline, along with recent works like Menon & Vondrick [25] and CuPL [32] that also leverage complementary information from foundation models to improve upon CLIP. We train all CLIP-based baselines from scratch under the same training conditions and evaluate all approaches with a zero-shot evaluation protocol.

4.1. Experimental Setup

Model Architectures. For all models, we employ the ViT-B/16 [10] architecture for the vision encoders and the Transformer base model [41] for text encoders, as described in CLIP [34]. We include results for additional ViT sizes in the Appendix. Our approach, GRAIN, additionally utilizes a query decoder with 6 transformer decoder layers. We set the number of queries n_q to 10. The outputs from the decoder are processed by projection layers to obtain features in the semantic space, and a 2-layer MLP for predicting bounding boxes. In addition to these comparisons, we evaluate our approach against the substantially larger LLaVA v1.6 model, which includes a ViT-L/14 paired with the Vicuna-13B LLM. For this model, we utilize a pretrained checkpoint from Hugging Face [44].

Pretraining Setup. All models are pre-trained on two distinct image-text datasets that vary in scale: Conceptual Captions 3M (CC3M) and Conceptual Captions 12M (CC12M) [39]. Training for all models is conducted using the AdamW optimizer [22] across 35 epochs, using a cosine learning rate schedule and weight decay regularization. We use a batch size of 1024 for CC3M experiments and 2048 for CC12M. Training GRAIN for CC3M on an 8-GPU NVIDIA H100 DGX machine takes about 16 hours, and training for CC12M on 2 × 8 H100 machines takes 36 hours. While training GRAIN, we randomly choose between the original caption and the MLLM-generated caption as the text supervision.
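A sketch of this optimization setup (AdamW with weight decay and a cosine learning-rate schedule over 35 epochs) is shown below; the learning rate, weight-decay value, and per-step scheduling are illustrative assumptions.

```python
import torch


def build_optimizer(model, steps_per_epoch: int, epochs: int = 35):
    # AdamW with decoupled weight decay; cosine decay over the full training run.
    opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.2)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * steps_per_epoch)
    return opt, sched
```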
Table 3. We report top-1 accuracy (%) for zero-shot attribute-based classification. This is a challenging task, as indicated by the results.
Columns: Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Average, Flowers, DTD, CUB, Cars, Pets.

CC3M
CLIP: 24.20 7.30 13.65 0.75 6.86 3.43 24.68 1.90 1.79 8.93 5.04 8.97 4.53
GRAIN (Ours): 46.06 18.20 20.02 0.95 14.57 4.87 45.82 2.34 1.72 13.06 7.63 15.93 7.87
Δ: +21.86 +10.90 +6.37 +0.20 +7.71 +1.44 +21.14 +0.44 -0.07 +4.13 +2.59 +6.96 +3.34

CC12M
CLIP: 43.71 16.05 23.06 1.67 11.33 7.02 40.61 4.08 2.29 14.78 12.74 16.12 9.41
GRAIN (Ours): 67.39 26.29 32.46 4.21 17.61 12.38 59.09 3.66 2.72 20.39 18.29 24.04 14.53
Δ: +23.68 +10.24 +9.40 +2.54 +6.28 +5.36 +18.48 -0.42 +0.43 +5.61 +5.55 +7.92 +5.12

Baselines. To ensure fair evaluation, all baselines were trained under conditions similar to GRAIN. The introduction of the decoder architecture in our model results in a 22% increase in parameter count compared to CLIP. For a fairer comparison, we report numbers for CLIP by leveraging the same architecture as GRAIN but with the localization modules turned off. This baseline is reported as CLIP* throughout the paper. Additionally, we report the performance of the LLaVA v1.6 model to benchmark our model's performance against a state-of-the-art MLLM. Open-ended MLLMs like LLaVA are known to struggle with fine-grained visual recognition [51]. Hence, we propose a new inference strategy to evaluate LLaVA on classification tasks, providing a stronger baseline. Specifically, we first prompt LLaVA to predict a category for an image. Due to its open-ended nature, we cannot directly determine if the generated answer matches the ground truth. To address this, we use a pretrained CLIP text encoder to map LLaVA's generated answer to the closest category within the dataset's vocabulary. This mapped category is then used as the prediction to compute the top-1 accuracy. We refer to this baseline as LLaVA + CLIP in Table 1, representing a stronger and improved baseline over LLaVA alone. Despite LLaVA possessing orders of magnitude more parameters and being trained on billion-scale datasets, our method manages to surpass its performance, which shows that our improvements emerge from careful modeling decisions rather than a simple increase in data volume or model size.
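The sketch below illustrates the mapping step of the LLaVA + CLIP baseline: LLaVA's free-form answer is assigned to the closest class name in the dataset vocabulary with a pretrained CLIP text encoder. `clip_encode_text` is a stand-in for that encoder, and the prompt formatting is illustrative.

```python
import torch
import torch.nn.functional as F


def map_answer_to_class(answer: str, vocabulary: list, clip_encode_text) -> str:
    texts = [answer] + [f"a photo of a {c}" for c in vocabulary]
    emb = F.normalize(clip_encode_text(texts), dim=-1)    # (1 + |V|, d)
    sims = emb[0] @ emb[1:].t()                           # answer vs. every class name
    return vocabulary[sims.argmax().item()]               # prediction used for top-1 accuracy
```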
4.2. Zero-shot image classification

We perform zero-shot classification and evaluate all models on ImageNet and 11 additional datasets encompassing common and fine-grained sets. We measure the top-1 accuracy and report results in Table 1. Our approach, GRAIN, consistently outperforms the current state-of-the-art across all settings and datasets. Specifically, GRAIN improves the zero-shot performance by as much as 9% in absolute accuracy on ImageNet and achieves similar improvements averaged across all other datasets. Notably, our method surpasses existing benchmarks by significant margins across both fine and coarse-grained datasets, with our most substantial improvement reaching up to 22% absolute accuracy on the Caltech-101 [12] dataset within the CC3M training setting.

4.3. Cross-modal retrieval

We evaluate the pre-trained models on the task of cross-modal retrieval under the zero-shot setting. Specifically, we focus on the Image-to-Text (I2T) and Text-to-Image (T2I) retrieval tasks using the MS-COCO and Flickr30k datasets in Table 2. Our evaluations are conducted on the standard test sets for both datasets, and we report performance metrics in terms of Recall@k for k values of 1, 5, and 10. Compared to CLIP, our method achieves superior performance, with gains of up to 33%. On average, we observe improvements of 23.8% with CC3M-trained models and 12.46% with models trained on CC12M.
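For reference, a sketch of the Recall@k protocol used above, assuming pre-computed, L2-normalized embeddings and one ground-truth text index per image (MS-COCO and Flickr30k provide several captions per image; handling that only changes how `gt` is built).

```python
import torch


def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor,
                gt: torch.Tensor, k: int) -> float:
    """image_emb: (N, d), text_emb: (M, d), gt[i] = index of a correct text for image i."""
    sim = image_emb @ text_emb.t()                 # (N, M) cosine similarities
    topk = sim.topk(k, dim=1).indices              # top-k retrieved texts per image
    hits = (topk == gt.unsqueeze(1)).any(dim=1)    # correct text retrieved within top-k
    return hits.float().mean().item()              # image-to-text R@k; swap arguments for T2I
```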
4.4. Zero-shot attribute-based classification

To measure image-description alignment, we design an experiment to classify images by leveraging only descriptions/attributes. This is a challenging task, as image classification is being performed devoid of class names. Toward this end, we first prompted GPT-3 using class names from the downstream dataset's vocabulary to obtain descriptions. Next, instead of the traditional approach of encoding class names and computing similarities with images, we encoded the description corresponding to the class name (omitting the class name itself) to obtain the text representation and computed similarities with images. The class corresponding to the text representation that scored the maximum similarity with the test image is considered the prediction for that image. We compute top-1 accuracy as usual and report it for all datasets in Table 3. From Table 3, we observe that our model is able to achieve strong improvements over CLIP, demonstrating closer image-description alignment. On average, we achieve an improvement of 6-7% over CLIP, showcasing better alignment.
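A brief sketch of this attribute-only protocol is shown below: the text side encodes only the GPT-3 descriptions, with the class name omitted. `encode_image` / `encode_text` again stand for the trained encoders.

```python
import torch
import torch.nn.functional as F


def attribute_only_classify(image, classnames, class_descriptions, encode_image, encode_text):
    img = F.normalize(encode_image(image), dim=-1)
    class_embs = torch.stack([
        F.normalize(encode_text(class_descriptions[name]), dim=-1).mean(dim=0)
        for name in classnames                         # descriptions only, no class name
    ])
    scores = F.normalize(class_embs, dim=-1) @ img
    return classnames[scores.argmax().item()]
```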
Table 4. Ablation studies on our CC3M-trained model, reporting top-1 accuracy (%).
Columns: Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Average, Flowers, DTD, CUB, Cars, Pets.

GRAIN: 65.86 35.20 38.07 1.34 17.24 14.15 65.20 13.24 5.47 24.96 16.18 27.00 23.34
– Region-description loss: 58.21 27.07 35.28 1.01 14.20 9.18 58.86 9.13 3.52 22.31 13.05 22.89 18.73
– Box loss: 57.06 26.17 34.38 0.93 14.67 8.87 56.91 8.31 3.20 21.35 13.12 22.27 17.54
– MLLM-caption: 47.24 19.92 28.51 0.70 8.78 7.04 43.95 8.20 2.99 20.06 9.01 17.85 14.56
– Menon&Vondrick [25]: 46.99 18.49 29.76 0.52 8.40 6.62 42.56 8.29 3.36 18.70 10.01 17.62 14.04

4.5. Recognizing Novel Examples

It is desirable for open-vocabulary models to generalize to novel, unseen examples at test-time without requiring re-training. Zero-shot learning methods often utilize auxiliary information, such as attributes, for classifying unknown entities. Hence, our approach aims to recognize these concepts by leveraging LLM-generated descriptions.

In this experiment, we aim to test our model's ability to recognize novel entities that were absent from the training distribution. Toward this end, we collect 1500 images of products launched after 2023, manually filter these images for quality control, and label them into 27 novel categories to form a new benchmark dataset. We call this the Products-2023 dataset. These concepts are absent from our model's training distribution, making them novel. We provide additional details on this dataset in the Appendix. We evaluate our model along with CLIP and LLaVA on this dataset in Table 5, which demonstrates superior results achieved by our approach against CLIP and even against the much larger LLaVA model, confirming the efficacy of our approach in recognizing novel samples.

Table 5. Accuracy (%) on Products-2023.
CLIP 33.65
LLaVA 42.08
GRAIN 45.24

4.6. Ablations

To assess the importance of the different components in GRAIN, we conduct four ablation experiments. We restrict these to models trained on CC3M due to computational constraints. The outcomes of these ablations are reported as top-1 accuracy in Table 4.

Ablating the region-description alignment loss. This component is pivotal to our framework, as removing it causes a significant accuracy decline of 5% on all datasets on average. This considerable decrease underscores the vital role of this loss in establishing fine-grained correspondences between salient image regions and their descriptions.

Ablating the localization loss. Further removing the bounding box prediction losses from our training regime leads to a modest performance drop. This loss is instrumental in identifying and predicting salient regions within the image, and, in conjunction with the alignment loss, is crucial to developing fine-grained visual understanding.

Ablating the role of MLLM-caption during training. We employ captions generated by LLaVA as a form of text-level data augmentation during training, alternating between these and the original image captions. The MLLM-generated caption provides a high-level visual summary of the image, proving to be significant for training, as indicated by a 3% decrease in performance upon its removal.

Ablating the role of test-time descriptions. In line with the approach of Menon & Vondrick [25], we utilize descriptions generated by GPT-3 to enrich class names during zero-shot classification. Excluding these augmented descriptions results in a minor performance reduction, suggesting that while beneficial, our model's performance is not reliant on these test-time descriptions.
5. Conclusion

In this paper, we propose a new pre-training method for contrastive vision-language models. Specifically, we hypothesize that many of the current limitations of CLIP stem from its image-level contrastive pre-training, which neglects fine-grained alignment. As a result, we propose to leverage Multimodal Large Language Models (LLaVA) and Open-Vocabulary Object Detectors (OWLv2) to automatically generate weak supervision to drive a more fine-grained pre-training process. We demonstrate superior performance across 11 different classification datasets, including ones containing fine-grained and novel examples, as well as additional tasks such as cross-modal retrieval. Our results show significant improvement over the state-of-the-art, including by up to 9% in absolute top-1 accuracy for zero-shot classification and 25% on retrieval. Our method can even outperform LLaVA, which has over 13B parameters (compared to our ∼170M) and was trained on billions of data points.

References

[1] Ioana Bica, Anastasija Ilić, Matthias Bauer, et al. Improving fine-grained understanding in image-text pre-training, 2024.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners, 2020.
[3] Kaylee Burns, Zach Witzel, Jubayer Ibn Hamid, Tianhe Yu, Chelsea Finn, and Karol Hausman. What makes pre-trained visual representations successful for robust manipulation?, 2023.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020.
[5] Brian Chen, Nina Shvetsova, Andrew Rouditchenko, et al. What, when, and where? Self-supervised spatio-temporal grounding in untrimmed multi-action videos from narrated instructions, 2023.
[6] Yen-Chun Chen, Linjie Li, Licheng Yu, et al. Uniter: Universal image-text representation learning, 2020.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[8] Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, and Elisa Ricci. Vocabulary-free image classification, 2023.
[9] Runpei Dong, Chunrui Han, Yuang Peng, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. In NeurIPS, 2023.
[12] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
[13] Hexiang Hu, Yi Luan, Yang Chen, et al. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. International Conference on Computer Vision, 2023.
[14] Shih-Cheng Huang, Liyue Shen, Matthew P. Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3922–3931, 2021.
[15] Chao Jia, Yinfei Yang, Ye Xia, et al. Scaling up visual and vision-language representation learning with noisy text supervision, 2021.
[16] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr – modulated detection for end-to-end multi-modal understanding, 2021.
[17] Zhengfeng Lai, Haotian Zhang, Bowen Zhang, et al. Veclip: Improving clip training via visual-enriched captions, 2024.
[18] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958, 2009.
[19] Jie Lei, Tamara L. Berg, and Mohit Bansal. Qvhighlights: Detecting moments and highlights in videos via natural language queries, 2021.
[20] Xiujun Li, Xi Yin, Chunyuan Li, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020.
[21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[23] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019.
[24] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft, 2013.
[25] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In The Eleventh International Conference on Learning Representations, 2023.
[26] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems, 36, 2024.
[27] M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, and Horst Bischof. Tap: Targeted prompting for task adaptive generation of textual training instances for visual classification, 2023.
[28] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training, 2021.
[29] OpenAI. Gpt-4v(ision) system card. OpenAI, 2023.
[30] Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280, Dublin, Ireland, 2022. Association for Computational Linguistics.
[31] Devi Parikh and Kristen Grauman. Relative attributes. In 2011 International Conference on Computer Vision, pages 503–510, 2011.
[32] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691–15701, 2023.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision.
[34] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[35] Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, and Jonathon Shlens. Perceptual grouping in contrastive vision-language models, 2023.
[36] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, et al. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
[37] Zhiyuan Ren, Yiyang Su, and Xiaoming Liu. Chatgpt-powered hierarchical comparisons for image classification, 2023.
[38] Christoph Schuhmann, Romain Beaumont, Richard Vencu, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[39] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[40] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[42] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset (CUB). Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[43] Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, et al. Object-aware video-language pre-training for retrieval, 2022.
[44] Thomas Wolf, Lysandre Debut, Victor Sanh, et al. Huggingface's transformers: State-of-the-art natural language processing, 2020.
[45] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5542–5551, 2018.
[46] Lewei Yao, Runhui Huang, Lu Hou, et al. Filip: Fine-grained interactive language-image pre-training, 2021.
[47] Jiacheng Ye, Jiahui Gao, Qintong Li, et al. Zerogen: Efficient zero-shot learning via dataset generation, 2022.
[48] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?, 2023.
[49] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023.
[50] Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. Helping hands: An object-aware ego-centric video recognition model, 2023.
[51] Yuhui Zhang, Alyssa Unell, Xiaohan Wang, et al. Why are visually-grounded language models bad at image classification?, 2024.
[52] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Appendix

A. Additional Details on Related Work

The scope of our work spans several domains, including Vision-Language Representation Learning, Fine-grained Visual Recognition, Visual Grounding and Open-world/Zero-shot Learning. Since the related works section in the main paper cannot adequately cover all of these areas, we provide a more comprehensive summary in this supplementary material:

Contrastive Language-Image Pretraining. Methods like CLIP [34] and ALIGN [15] leverage large internet-scraped datasets of image-text pairs to learn a joint representation by contrastively aligning the two modalities. The objective of these methods is to pull together image and text representations that are semantically similar and push apart dissimilar pairs. These works employ a dual-encoder approach, separately encoding representations for images and text. These learned representations are effective for various downstream vision and language tasks. Follow-up works [28, 49] in this area focus on improving downstream performance by incorporating self-supervision or using other objective functions during pretraining. However, aligning representations at a global (whole image or caption) level is known to only learn coarse-grained features and discard fine-grained visual information. Acknowledging this problem, FILIP [46] introduces a cross-modal late interaction mechanism that utilizes a token-wise maximum similarity between image and text tokens to drive the contrastive objective. In the medical domain, GLORIA [14] proposes an attention-based framework that uses text tokens to attend to sub-image regions and learns local and global representations. Concurrent to our work, SPARC [1] proposes using a sparse similarity metric between image patches and text tokens to learn fine-grained alignment. Our paper shares a motivation with these works in terms of aiming to learn fine-grained representations. However, unlike these methods, we address the fact that image-caption datasets like Conceptual Captions [39] or LAION [38] contain noisy captions that lack descriptive information, thereby limiting the gains that such fine-grained region-token matching objectives can achieve. Secondly, our approach focuses on learning visual representations that are able to leverage complementary information at test-time (in the form of LLM-generated descriptions as proposed by [25, 32]) to recognize fine-grained or novel entities. Finally, in principle, these methods are orthogonal to our contributions and can be coupled with our method.

Zero-shot Learning with CLIP. In image classification, zero-shot learning methods aim to recognize novel entities that were not seen during training. Relevant to our work, Menon & Vondrick [25] leverage category descriptions generated from a Large Language Model (LLM) as auxiliary information to augment the zero-shot performance of CLIP. On similar lines, CuPL [32] and Ren et al. [37] use LLMs to generate descriptions in the form of long, cohesive sentences or via nuanced, hierarchy-aware comparisons. TAP [27] learns a text classifier mapping descriptions to categories during training, which is used to map from images to categories at test-time. Different from these works, our method aims to improve alignment between images and descriptions, which would further bolster the efficacy of using descriptions at test-time.

Object-aware Vision-Language Pretraining. Encouraging object-oriented representations within a vision-language pretraining objective [3, 5, 6, 20, 23, 43, 50] has been shown to facilitate learning of robust models that can positively impact downstream performance across a variety of tasks in vision-language, video understanding and embodied AI. Many of these approaches follow the DETR line of works [4, 16, 19] that introduce a query-transformer backbone for detection and grounding. We take inspiration from these works to develop our architecture for encoding visual information. However, our approach only uses the grounding task as an auxiliary objective to distill information from local regions to global representations. We leverage our synthetic descriptions to supervise this grounding module, which is then disabled during evaluation as detailed earlier.

Universal Visual Recognition. Recent works [8, 13] introduce the problem of universal visual recognition or vocabulary-free image classification, where the motivation is to free models like CLIP from a constrained vocabulary, thereby allowing classification from an unrestricted set of concepts. Corroborating our claims, these works observe limitations of CLIP toward recognizing novel examples and fine-grained entities. These works formalize this problem and introduce retrieval-based methods as an initial step towards a solution.

Multimodal Large Language Models. MLLMs like LLaVA [21], GPT-4V [29], and Mini-GPT4 [52] integrate image tokens into LLMs, leveraging their powerful reasoning capabilities. MLLMs have been found useful in tasks such as scene understanding [36], story-telling [9], etc., where a comprehensive understanding of the images and text is required. We leverage their ability for visual comprehension to generate a set of descriptions for an input image that are used to supervise our fine-grained losses during training.

Zero-shot Learning for Images. Zero-shot learning (ZSL) is a challenging problem that requires methods to recognize object categories not seen during training. Various approaches [18, 31] have proposed using side information like attributes, hierarchical representations, etc. to learn a generalizable mapping. More recent efforts [45, 47] explore the use of generative models to synthesize useful features for unseen categories. Our method aligns more closely with the former, as we learn a fine-grained correspondence conducive to zero-shot classification by leveraging descriptions as side information.
Table A. Zero-shot top-1 accuracy (%) of different methods using the ViT-L/14 backbone.
Columns: Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Average, Flowers, DTD, CUB, Cars, Pets.

CC12M
LLaVA + CLIP: 89.69 57.72 55.24 15.90 35.37 47.16 75.03 24.69 6.22 29.43 52.80 44.48 35.20
CLIP [34]: 73.36 38.06 49.96 4.59 21.84 43.98 71.79 22.01 7.72 33.16 42.25 37.15 36.72
Menon&Vondrick [25]: 73.74 38.48 50.05 5.22 22.04 44.33 72.56 22.10 8.28 33.78 43.04 37.60 36.84
CuPL [32]: 73.53 38.55 50.46 5.14 21.96 43.28 73.08 23.12 8.65 32.48 42.96 37.56 37.05
GRAIN (Ours): 81.62 44.98 55.82 9.12 27.66 52.98 82.05 28.18 12.73 37.34 46.92 43.58 42.68

B. Implementation Details

All baselines reported in the main paper (except LLaVA) utilize a ViT-B/16 architecture as the vision encoder. In Table A, we report results using the ViT-L/14 architecture trained on CC12M. For encoding text, we utilize a 12-layer transformer network as used with CLIP [34]. The outputs from the vision encoder are 768-dimensional, which are then projected to 512. The output embeddings obtained from the decoder are also passed through separate projection layers. The projection layer is shared between all region output embeddings, and a separate projection layer is used for the image output embedding. Similarly, the text-encoder output is projected to the same 512-dimensional size. Additionally, a two-layer MLP with output size 4 is used to regress the bounding boxes conditioned on the region output embeddings. The supervision for bounding boxes is obtained through the OWLv2 detector, which operates at a 960 × 960 image resolution; these boxes are downscaled to the 224 × 224 input resolution of our model. While generating these bounding box annotations from OWLv2, we use a confidence threshold value of 0.3.
from OWLv2, we use a confidence threshold value of 0.3. text encoders of LLAVA have been separately pretrained
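To make the projection layout above concrete, the following is a minimal PyTorch sketch of the heads described in this section. The module names, the decoder interface, and the normalized (cx, cy, w, h) box parameterization are illustrative assumptions, not our released implementation.

```python
import torch.nn as nn

class GrainHeads(nn.Module):
    """Illustrative sketch of the projection and box-regression heads (Sec. B)."""

    def __init__(self, vision_dim=768, text_dim=768, embed_dim=512):
        super().__init__()
        # Global image embedding (768-d from the ViT) projected to the shared 512-d space.
        self.image_proj = nn.Linear(vision_dim, embed_dim)
        # A single projection shared by all region output embeddings of the decoder.
        self.region_proj = nn.Linear(vision_dim, embed_dim)
        # Text-encoder output projected to the same 512-d space
        # (text_dim depends on the text transformer width).
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Two-layer MLP with output size 4, regressing a box from each region embedding.
        self.box_head = nn.Sequential(
            nn.Linear(vision_dim, vision_dim),
            nn.ReLU(),
            nn.Linear(vision_dim, 4),  # assumed (cx, cy, w, h) in normalized coordinates
        )

    def forward(self, image_emb, region_embs, text_emb):
        # image_emb: (B, 768), region_embs: (B, R, 768), text_emb: (B, text_dim)
        z_img = self.image_proj(image_emb)            # (B, 512) global image embedding
        z_reg = self.region_proj(region_embs)         # (B, R, 512) region embeddings
        z_txt = self.text_proj(text_emb)              # (B, 512) caption/description embedding
        boxes = self.box_head(region_embs).sigmoid()  # (B, R, 4) predicted boxes
        return z_img, z_reg, z_txt, boxes
```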
C. Additional Results and Baseline Analysis

In Table A, we report the performance of GRAIN and competing baselines when using the ViT-L/14 transformer backbone. Our approach consistently outperforms all baselines under various Vision Transformer backbones.

CLIP. We train the same ViT variant on the same Conceptual Captions (CC3M and CC12M) datasets as our GRAIN model. For zero-shot testing on all reported datasets, we use the handcrafted prompts specific to each dataset, as introduced in the official codebase [34]. These hand-engineered prompts improve the zero-shot performance of CLIP beyond the vanilla "A photo of {classname}" style prompts.

CLIP*. With the introduction of the decoder and bounding-box modules, our method, GRAIN, uses ∼22% more parameters than CLIP. For a fairer comparison in terms of parameter count, we also report the performance of CLIP using the same architecture as ours but with the localization modules turned off. We refer to this baseline as CLIP*.

Menon & Vondrick. We leverage the official codebase [25] to report performance for this baseline. In the main paper, we implement this baseline on top of the CLIP method, as per the norm.

CuPL. Similarly, we implement the CuPL baseline using the official code [32] and report performance with CLIP and CLIP* in the main paper and in Table A, respectively. CuPL shows a similar trend of improving over the CLIP baselines while trailing behind our method.

LLaVA + CLIP. We use a pretrained LLaVA v1.6 checkpoint from Hugging Face [44] that is composed of a ViT-L/14 vision encoder and a Vicuna-13B LLM. The vision and language components of LLaVA have been separately pretrained on billion-scale datasets and conjoined through a projection layer, and LLaVA has been further trained through multiple stages on a specialized set of ∼150k instructions. Being a generative model, LLaVA is asked to predict a category for an image using prompts specific to each dataset, as described in Table B. Next, we use a pretrained CLIP text encoder to map the answer generated by LLaVA to the closest category in the vocabulary of the dataset being evaluated. We use this mapped category as the prediction to compute top-1 accuracy as usual. We call this method LLaVA + CLIP. As Table A shows, our approach is able to reach and even surpass LLaVA's performance on several datasets despite having orders of magnitude fewer parameters and far less training data.
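For clarity, the LLaVA + CLIP baseline can be summarized with the schematic below. Here `llava_answer` and `clip_text_embed` are hypothetical helpers standing in for the LLaVA v1.6 generation call and the CLIP text encoder; the prompt argument is one of the per-dataset templates in Table B.

```python
import torch.nn.functional as F

def classify_with_llava_plus_clip(image, prompt, class_names, llava_answer, clip_text_embed):
    """LLaVA + CLIP baseline (schematic): generate a free-form answer with LLaVA,
    then snap it to the closest class name using CLIP text embeddings."""
    # 1) Ask LLaVA to fill in the dataset-specific prompt,
    #    e.g. "Fill in the blank: this is a photo of a {}".
    answer = llava_answer(image, prompt)                               # free-form text

    # 2) Embed the answer and every class name with the CLIP text encoder.
    answer_emb = F.normalize(clip_text_embed([answer]), dim=-1)        # (1, D)
    class_embs = F.normalize(clip_text_embed(class_names), dim=-1)     # (C, D)

    # 3) The prediction is the vocabulary entry most similar to LLaVA's answer.
    sims = answer_emb @ class_embs.T                                   # (1, C) cosine similarities
    return class_names[sims.argmax().item()]
```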

Figure A. Attention maps show more effective object localization by our model compared to CLIP.

Figure B. Visualization of top-5 predictions of our model on novel entities alongside [25]. Our method consistently identifies the ground-truth class as the top prediction.

FILIP. Although FILIP [46] shares a similar motivation to our method, we note that the approach taken by FILIP is orthogonal to ours. FILIP employs a cross-modal late-interaction mechanism to learn associations between image patches and caption tokens without using any side information. In contrast, our approach leverages complementary information in the form of descriptions and their corresponding localizations to learn fine-grained alignments. Being orthogonal to our contributions, the late-interaction mechanism from FILIP can, in principle, be coupled with our approach. Secondly, a concurrent work [1] finds FILIP's results challenging to reproduce due to high training instability and, in practice, observes FILIP to substantially underperform even the zero-shot performance of CLIP on classification tasks. For these reasons, we refrain from comparing our method to FILIP.
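To make the contrast explicit, the snippet below sketches FILIP-style late interaction: each image patch is matched to its most similar caption token and the per-patch maxima are averaged. This is a schematic of FILIP's mechanism (the text-to-image direction is symmetric), not part of our method.

```python
import torch.nn.functional as F

def late_interaction_similarity(patch_embs, token_embs):
    """FILIP-style image-to-text similarity (schematic).

    patch_embs: (P, D) image patch embeddings
    token_embs: (T, D) caption token embeddings
    """
    patch_embs = F.normalize(patch_embs, dim=-1)
    token_embs = F.normalize(token_embs, dim=-1)
    sim = patch_embs @ token_embs.T            # (P, T) patch-token cosine similarities
    # Each patch attends to its best-matching token; the image-to-text score
    # is the mean of these per-patch maxima.
    return sim.max(dim=1).values.mean()
```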
D. Qualitative Analysis

Visualizing Attention Maps. We visualize the attention maps of the penultimate encoding layer for our GRAIN model and for CLIP in Figure A. Our model effectively focuses on the object regions in the image, which stems from our localization and alignment objectives.

Recognizing Novel Classes. In this experiment, we focus on recognizing newly popular entities, namely the Apple Vision Pro and Tesla Cybertruck, which emerged after datasets like Conceptual Captions were constructed. First, we add these two classnames to the ImageNet-1K vocabulary. Next, to simulate a real-world open-vocabulary scenario, we also include three related but distinct categories for each novel entity, making this a challenging task. Specifically, for the Apple Vision Pro, we add competing Virtual Reality (VR) headsets such as the Meta Quest 2, Microsoft Hololens, and Google Glass. For the Tesla Cybertruck, we include other pickup trucks like the Rivian R1T, Ford F-150, and Toyota Tundra. We then utilize GPT-3 (language only) to generate descriptions for the concepts in this extended vocabulary. Following the inference-time procedure discussed in the main paper, we present the top-5 predictions made by both our model and [25] in Figure B. Our findings indicate that our model consistently identifies the correct class names with high confidence, whereas the baseline includes them in the top-5 but fails to rank them as the top choice. This highlights our model's ability to recognize novel concepts by leveraging the learned image-description alignment.
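A minimal sketch of this protocol is given below. Here `generate_descriptions` stands in for the GPT-3 prompting of [25], `encode_image`/`encode_text` stand in for the trained encoders, and scoring a class by the mean similarity of the image to its descriptions is an assumption meant to mirror the inference procedure in the main paper.

```python
import torch

def zero_shot_predict(image, vocabulary, encode_image, encode_text,
                      generate_descriptions, k=5):
    """Open-vocabulary prediction over an extended class list (schematic)."""
    img_emb = encode_image(image)                                    # (D,), assumed L2-normalized

    scores = []
    for class_name in vocabulary:                                    # e.g. ImageNet-1K classes
        descs = generate_descriptions(class_name)                    #  + "Tesla Cybertruck", "Rivian R1T", ...
        desc_embs = encode_text(descs)                               # (N, D), assumed L2-normalized
        # Score a class by the mean similarity of the image to its descriptions.
        scores.append((img_emb @ desc_embs.T).mean())

    scores = torch.stack(scores)
    topk = scores.topk(k)
    return [(vocabulary[i], scores[i].item()) for i in topk.indices.tolist()]
```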
Description Grounding. To showcase the efficacy of our grounding module, we present visualizations of its predictions in Figure C on images from the ImageNet dataset; we include additional visualizations in the Appendix. These visualizations include LLM-generated descriptions and the corresponding bounding boxes predicted by our model, with each matched pair coded by color. We also include descriptions belonging to the class that are not matched to a bounding box.

E. Pretraining Dataset Details

We train all models on the CC3M and CC12M datasets. As explained earlier, to obtain description and localization annotations, we prompt LLaVA in two stages: we first extract the primary visual subject of the image and then gather image descriptions by asking LLaVA to focus on the identified visual subject. We obtain bounding boxes corresponding to each description using OWLv2 [26], an off-the-shelf open-vocabulary object detector, and filter the predicted boxes using a confidence threshold of 0.3 to discard noisy predictions.
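The box-annotation step can be summarized with the sketch below. `owlv2_detect` is a hypothetical wrapper around the off-the-shelf OWLv2 detector [26]; the 960 → 224 rescaling and the 0.3 confidence threshold follow the values stated above, while the choice of keeping the single highest-scoring box per description is an assumption for illustration.

```python
def boxes_for_descriptions(image, descriptions, owlv2_detect,
                           detector_res=960, model_res=224, conf_thresh=0.3):
    """Match each MLLM-generated description to a bounding box (schematic).

    owlv2_detect(image, text_queries) is assumed to return, per query, a list of
    (box, score) pairs with boxes in the detector's 960x960 pixel coordinates.
    """
    scale = model_res / detector_res  # rescale boxes to the 224x224 training resolution
    annotations = {}
    for desc in descriptions:
        candidates = owlv2_detect(image, [desc])
        # Keep only confident detections; descriptions without a confident box stay unmatched.
        kept = [([c * scale for c in box], score)
                for box, score in candidates if score >= conf_thresh]
        if kept:
            # Use the highest-scoring box as the localization for this description.
            annotations[desc] = max(kept, key=lambda pair: pair[1])[0]
    return annotations
```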

Figure C. Localization and region-description matching predictions made by our model on images from ImageNet.

F. Products-2023 Dataset Details

To evaluate our approach's ability to recognize novel samples, we manually curated a dataset comprising 1,500 images spanning 27 distinct categories. These images were carefully filtered and labeled through a manual process. Specifically, we compiled a list of products launched after 2023, scraped corresponding images, and performed manual filtering and labeling. Since the pretraining datasets used in our setup (CC3M, CC12M) were finalized prior to 2023, this dataset represents novel examples. The full list of categories is: [Apple Vision Pro, Grimace Shake (drink from McDonald's), Starry (drink from Pepsi), Playstation Portal, Apple Watch Ultra 2, Apple Watch Series 9, Samsung Galaxy Watch 6, Xiaomi Smart Band 8, Kia K4, Rivian R2, Honda Ye GT, Ferrari 12Cilindri, Renault 5 E-Tech, Toyota Tundra, Ford F-150 Lightning, Tesla Cybertruck, Xiaomi SU7, Lamborghini Revuelto, Hyundai Mufasa, Wordle, Asus ROG Ally, Meta Quest Pro, Microsoft Hololens, Google Glass, Prime (drink), iPhone 15, Google Pixel 8]. We intend to release this dataset with the final version of this paper.

G. Sample Annotations

In Figure D, we illustrate the annotations obtained using our two-stage LLaVA prompting followed by bounding box prediction with OWLv2. We randomly select images and captions (original caption) from the CC3M dataset and present the corresponding MLLM caption, primary visual subject, and descriptions generated by our annotation pipeline. The descriptions are color-coded by their associated bounding box. Overall, our annotation pipeline is effective at identifying the primary visual subject, which is the most prominent object or concept in the image, and at generating descriptions and corresponding localizations by focusing on this subject. The first five rows show cases where the pipeline successfully localized at least one description, whereas the last row demonstrates a case where no description could be localized due to the vague nature of the image, making the descriptions difficult to localize.

H. Two-stage versus Single-stage Annotation

In this work, we employ a two-stage annotation pipeline to elicit descriptions from LLaVA. Specifically, in the first stage, we prompt LLaVA to identify the primary visual subject in the image, and then generate descriptions for this subject. We observe that this approach leads to descriptions that are more specific and focused on the constituent regions of the image that make up the subject. In Figure E, we compare the descriptions generated by this strategy with a single-stage pipeline that directly prompts LLaVA to generate descriptions without first identifying the subject. We randomly pick samples from the CC12M dataset to illustrate the difference. Contrasting the two setups, we can see that the two-stage approach produces more specific descriptions that are well grounded in the image, whereas the one-stage approach either outputs overly generic descriptions or tends to hallucinate (see Rows 1 and 2). This issue is more pronounced for complex scenes involving unusual or fine-grained objects.
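The two prompting stages can be summarized as follows. `mllm_generate` is a hypothetical wrapper around LLaVA, and the prompt wording is illustrative rather than the exact templates used for annotation.

```python
def two_stage_descriptions(image, mllm_generate, num_descriptions=5):
    """Two-stage annotation (schematic): find the primary subject, then describe it."""
    # Stage 1: identify the primary visual subject of the image.
    subject = mllm_generate(
        image,
        "What is the primary visual subject of this image? Answer with a short phrase.",
    )
    # Stage 2: generate descriptions focused on that subject and its constituent regions.
    descriptions = mllm_generate(
        image,
        f"Focusing on the {subject}, list {num_descriptions} short visual descriptions "
        "of its distinctive parts and attributes, one per line.",
    )
    return subject, descriptions.splitlines()

# A single-stage variant simply asks for descriptions directly, which we observe to be
# more generic and more prone to hallucination (Figure E).
```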

I. Failure modes in our grounding module

In the main paper, we showcased examples where the grounding module of our approach successfully localized descriptions in images. In this section, we highlight failure cases where the model is unable to correctly localize descriptions within the image. Following the same setup as the main paper, we use images from ImageNet and descriptions generated from an LLM by following Menon & Vondrick's [25] strategy of prompting GPT-3 (language only) with category names. It is important to note that since these descriptions were generated using only the category name and without access to images, some descriptions might not be visible in every image. We expect our approach to localize descriptions that are present in an image and not localize those that are absent. While our approach effectively grounds descriptions on average, we illustrate failure cases in Figure F.

Figure D. Sample annotations generated using our two-stage LLaVA prompting scheme followed by OWLv2 localization.

Table B. Prompts to LLaVA for the zero-shot visual recognition task in Table A.

Dataset     Prompt
DTD         Fill in the blank: this is a photo of a {} texture
Pets        What animal is in the image? Be specific about the breed. Fill in the blank: this is a photo of a {}
Places365   What place is this in the image? Fill in the blank: this is a photo of a {}
Food101     What food is in the image? Fill in the blank: this is a photo of a {}
Cars        What type of car is in the image? Be specific about the make and year. Fill in the blank: this is a photo of a {}
Others      Fill in the blank: this is a photo of a {}

Row 1 includes partially successful cases, where the model localizes descriptions but the bounding boxes are either slightly off the mark or do not cover all instances of that description in the image.

Row 2 includes examples where the model either cannot localize a single description in the image or incorrectly associates a description with another region of the image (the description "typically orange or brown" refers to the basketball but was incorrectly assigned to the player's jersey, which has a similar color).

Row 3 includes cases of hallucination, where the model localizes descriptions that are not present in the image.
J. Limitations and Broader Impact
Limitations. Our method achieves substantial gains over
CLIP and other baselines on zero-shot transfer tasks such
as image classification, attribute-based image classification,
and cross-modal retrieval. These improvements can be at-
tributed to the fine-grained region-to-description associa-
tions learned by our model during the training process.
However, learning these correspondences requires annota-
tions in the form of descriptions and bounding box local-
izations, which are computationally expensive to obtain.
As mentioned earlier, our annotation scheme demands significant GPU resources and can take many hours for large datasets. Additionally, since we do not filter or curate these
annotations, it might result in some misaligned or inaccu-
rate descriptions or captions, which might not provide the
correct signal during the learning process. Future work
could explore the use of efficient models to generate anno-
tations as well as a filtering mechanism to ensure all gener-
ated text and bounding boxes are correctly aligned with the
semantic content of the image.
Broader Impact. We propose a strategy to learn fine-grained image-text correspondences without requiring additional human annotations. Our approach leverages weak supervision from Multimodal Large Language Models (MLLMs) to train a region-aware model that strongly outperforms CLIP across several tasks and datasets. Despite having significantly fewer parameters and lower training costs, our approach matches and sometimes even outperforms LLaVA, a state-of-the-art MLLM, on zero-shot visual recognition. Although obtaining these annotations is computationally expensive, once acquired, our approach can be viewed as enabling the training of smaller models with small-scale datasets to achieve performance equivalent to a large model trained on extensive data, potentially making Vision and Language Model (VLM) training more accessible. Further integration of our approach with retrieval-based systems and multimodal LLMs is an interesting future direction.

Figure E. Qualitative comparison between one-stage (middle) and two-stage (right) LLaVA-based annotation schemes.

Figure F. Visualization of failure modes from our grounding module on ImageNet-1K.

