Grounding Descriptions in Images Informs Zero-Shot Visual Recognition
Abstract

Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code available at https://github.com/shaunak27/grain-clip.

* Correspondence to [email protected]

1. Introduction

Traditionally, image classification has operated under the closed-set assumption, where models are evaluated on a fixed set of classes that were seen during training. However, in the real and open world, models need to account for test conditions where the number of classes is unknown during training and can include classes that were not seen. Vision-language models (VLMs) like CLIP [33] offer a solution in this space, owing to their open-vocabulary nature. These models undergo extensive pretraining on large datasets containing paired image-text data and learn to encode images and texts in a shared latent space where semantically similar representations are mapped close together. For zero-shot classification, CLIP leverages the names of all classes within the test dataset—referred to as the vocabulary—as the candidate set, and determines the most probable image-classname pairing by computing the similarity between their latent representations. This vocabulary of classes is unconstrained, enabling the inclusion of any concept, regardless of its presence in the training set. This facilitates classification from an open set of concepts.

Despite this, CLIP's zero-shot capabilities are still limited by a few critical challenges. Firstly, in practice, CLIP often struggles to differentiate between fine-grained categories, a limitation highlighted by its under-performance on Fine-Grained Visual Classification (FGVC) datasets [24, 42]. Secondly, while known for its open-vocabulary potential, it can still perform poorly for some domains not well-represented in the training distribution, especially if the vocabulary used has confounding categories during testing. Using a vocabulary that exceeds the scope of the test dataset significantly diminishes the performance of CLIP even for common datasets like ImageNet [13]. This decline is again largely attributed to CLIP's challenges in differentiating between semantically similar, fine-grained concepts. Additionally, CLIP's inability to recognize novel concepts, such as Apple Vision Pro, that were not present during its training phase further restricts its capability to function as a genuinely open-vocabulary model.

Recent works [25, 32] aim to address these challenges by incorporating extra information in the form of class descriptions generated by Large Language Models (LLMs) at test time. These approaches leverage the "visual" knowledge embedded in LLMs to augment the textual repre-
sentations used in zero-shot classification. As an ex-
ample, the class French Bulldog would be expanded
to A French Bulldog, which has small and
pointy ears. These methods provide some improve-
ments over standard CLIP models, though they leave room
for further advancements.
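To make this mechanism concrete, the sketch below shows description-augmented zero-shot classification in the spirit of [25, 32]. It is a minimal illustration under our own assumptions rather than the exact implementation of those works: it uses the OpenAI clip package, a hypothetical descriptions dictionary mapping each class name to LLM-generated descriptor strings, and an illustrative prompt template.

```python
# Minimal sketch of description-augmented zero-shot classification
# (in the spirit of [25, 32]); `descriptions` maps a class name to a list
# of LLM-generated descriptor strings such as "small and pointy ears".
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def class_embedding(classname, descriptions):
    # Encode prompts like "French Bulldog, which has small and pointy ears"
    # and average them into a single class representation.
    prompts = [f"{classname}, which has {d}" for d in descriptions[classname]]
    feats = model.encode_text(clip.tokenize(prompts, truncate=True).to(device)).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

@torch.no_grad()
def predict(pil_image, classnames, descriptions):
    # Pick the class whose description-augmented text embedding is most
    # similar to the image embedding.
    img = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device)).float()
    img = img / img.norm(dim=-1, keepdim=True)
    cls = torch.stack([class_embedding(c, descriptions) for c in classnames])
    cls = cls / cls.norm(dim=-1, keepdim=True)
    return classnames[(img @ cls.t()).argmax().item()]
```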
We hypothesize that the reason for achieving limited
gains from injecting these descriptions lies in the poor
alignment between image and description embeddings
learned by CLIP. As a result, we aim to verify this hypoth-
esis and propose a method to overcome these challenges.
Specifically, we posit that the misalignment between images and descriptions stems from CLIP's training structure, which focuses solely on the global objective of matching entire images to their overarching captions, neglecting the rich information that image regions and textual descriptions share with each other. Our observations align with recent research indicating that CLIP tends to overlook fine-grained visual details during pretraining, leading to subpar performance on tasks requiring localization [35], object attributes [48], or physical reasoning [30].

In this work, we propose GRAIN: Grounding and contrastive alignment of descriptions, a novel objective for contrastive vision-language pretraining that learns representations more conducive to zero-shot visual recognition. This is achieved through fine-grained correspondence between image regions and detailed text descriptions. As a first step towards our approach, given that pretraining datasets (Conceptual Captions [39], LAION [38], etc.) only contain images with noisy captions but without detailed descriptions, we employ an instruction-tuned Multimodal Large Language Model (MLLM) to generate descriptions and identify salient attributes from the images in these datasets. Following this, we acquire region-level annotations that correspond to these descriptions using an off-the-shelf Open-vocabulary Object Detector (OVD). We then propose a method that learns to jointly ground text descriptions into specific image regions along with aligning image and caption representations at a global level. This strategy aims to learn representations that encode both coarse-grained (global) and fine-grained (local) information. To achieve this, we introduce a query transformer architecture for encoding images and a text encoder for processing captions and descriptions. The architecture and objectives of our model are specifically crafted to learn object/region-aware image representations that are valuable for zero-shot tasks, as we demonstrate in the subsequent sections. Finally, to evaluate our model's ability to recognize novel concepts, we curate and manually label a new image classification dataset, Products-2023, and benchmark upon it.

Figure 1. Overview of our two-stage annotation process: (1) prompting LLaVA for image descriptions and (2) acquiring corresponding region annotations from OWLv2.

To summarize, our main contributions are as follows:
• We hypothesize and show that CLIP pre-training lacks fine-grained aligned representations, leading to poor zero-shot performance in some domains.
• We propose GRAIN, a novel pre-training architecture and objective designed to simultaneously learn local and global correspondences, obtained via weak supervision from Multimodal LLMs and open-vocabulary detectors.
• To drive this pre-training, we introduce an automated annotation engine to source a fine-grained supervision signal.
• We demonstrate significant gains across a range of tasks, including image classification and retrieval; specifically, we improve over the state of the art by up to 9% in absolute top-1 accuracy for zero-shot classification and by up to 25% across cross-modal retrieval tasks.
• Acknowledging the lack of novel image classification datasets, we collect and manually label a dataset, Products-2023, for benchmarking, which we plan to release.
• Additionally, we aim to release our pre-trained model weights along with the large-scale annotations to aid future research.

2. Related Works

Contrastive Language-Image Pretraining. Follow-up works on CLIP [34] and ALIGN [15] focus on improving the quality of learned representations by further introducing self-supervision or cross-modal alignment objectives [14, 28, 49]. Relevant to our focus, FILIP [46] introduces a cross-modal late interaction mechanism that explores token-wise maximum similarity between image and text tokens to improve fine-grained alignment. Recently, SPARC [1] proposes a sparse similarity metric between image patches and text tokens to learn fine-grained representations. While our paper shares motivation with these works, we address the fact that web-based captioning datasets [38, 39] contain noisy captions that lack descriptive information, thereby limiting the gains achievable from such elaborate objectives. Instead, we source rich text descriptions and region annotations and design a pre-training objective to learn from them. This allows us to effectively use complementary information at test time (in the form of LLM-generated descriptions) to recognize fine-grained or novel entities.

Improving CLIP using Generative Models. Recent works have explored the use of LLMs towards improving the downstream performance of CLIP. Menon et al. [25] and CuPL [32] focus on the task of zero-shot classification and prompt GPT-3 [2] at test time to generate class descriptions. These descriptions are integrated into the classification prompts to achieve gains in terms of accuracy and interpretability. Different from these, LaCLIP [11] and VeCLIP [17] use LLMs to rephrase captions from pretraining datasets and observe noticeable gains on downstream tasks by training on these captions. In this paper, we propose to leverage synthetic annotations in the form of image regions and descriptions generated by an MLLM and an open-world detector to drive a novel pretraining strategy.

Figure 2. Architecture overview. Our method, GRAIN, aligns image representations to text captions at a global level while localizing salient image regions and aligning them to text descriptions at the local level.

3. Approach

We propose GRAIN, a novel pretraining approach that simultaneously learns local and global correspondences between image and text representations. Motivated by the observation that CLIP representations lack sufficient fine-grained understanding, we introduce a transformer-based architecture inspired by DETR [4] to infuse the rich context from sub-image regions into learned visual representations. Alongside encoding the image into a semantic representation, our model predicts bounding boxes for salient image regions containing discriminative information. These localizations are then aligned with detailed textual descriptions. To supervise this fine-grained objective, we first generate annotations at scale by leveraging Multimodal Large Language Models (MLLMs) and Open-vocabulary Object Detectors (OVDs). In this section, we first elaborate on our automated annotation process and then proceed to discuss our architecture and training methodology.

3.1. Weak Supervision from MLLMs and OVDs

We utilize the 3M and 12M versions of the Conceptual Captions [39] datasets (CC3M, CC12M) to train our model. These datasets contain images sourced from the internet, each paired with corresponding alt-texts (or captions). Our approach requires region-level supervision that is not provided by any existing dataset at scale. Specifically, we find that the captions associated with these images are often noisy, lack detail, and may not fully capture the dense visual context. To learn fine-grained correspondence between the two modalities, we propose focusing on regions within the image and their descriptions in text as supervision for training our model. For generating descriptions and locating their corresponding regions, we leverage an instruction-tuned Multimodal Large Language Model, LLaVA [21]. We select LLaVA for its superior captioning capabilities and accessibility due to its openness; however, our approach is fundamentally compatible with any multimodal LLM. For our annotation purposes, we select the LLaVA v1.6 model, which integrates a pretrained Vision Transformer Large (ViT-L) [10] as the visual encoder with the Vicuna-13B LLM [7]. It is worth noting that we only use LLaVA to describe regions/components of the image at a high level and not to pinpoint specific fine-grained categories. A common problem with instruction-tuned models like LLaVA is their tendency to hallucinate, which causes the model to
Figure 3. Contrastively align predicted regions with descriptions.
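A minimal sketch of the step illustrated in Figure 3 is given below, under our own naming assumptions (the released implementation may differ): predicted region boxes are Hungarian-matched to the OWLv2 ground-truth boxes, and the matched description then serves as the positive for an InfoNCE loss over all descriptions in the batch.

```python
# Sketch of the region-description alignment illustrated in Figure 3.
# Variable names and the L1 matching cost are our assumptions, not the
# released implementation.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def region_description_loss(pred_boxes, region_embs, gt_boxes, desc_embs, tau=0.07):
    # pred_boxes:  (Q, 4) predicted boxes for Q region queries
    # region_embs: (Q, D) projected region embeddings
    # gt_boxes:    (G, 4) OWLv2 boxes of this image's G descriptions
    # desc_embs:   (N, D) embeddings of all descriptions in the batch,
    #              whose first G rows belong to this image
    # 1) Hungarian matching between predicted and ground-truth boxes.
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)                # (Q, G)
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())

    # 2) InfoNCE: the matched description is the positive; every other
    #    description in the batch acts as a negative.
    regions = F.normalize(region_embs[q_idx], dim=-1)            # (G, D)
    descs = F.normalize(desc_embs, dim=-1)                       # (N, D)
    logits = regions @ descs.t() / tau                           # (G, N)
    targets = torch.as_tensor(g_idx, device=logits.device).long()
    return F.cross_entropy(logits, targets)
```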
Table 1. Zero-shot transfer evaluation of different models. We highlight the best performance of each setting in bold. We see that GRAIN
improves performance under both pretraining datasets, outperforming CLIP by up to 9% in absolute top-1 accuracy. CLIP* is a version of
CLIP with the same number of parameters as our method for fair comparison.
Columns: Data, Model, then per-dataset top-1 accuracy (Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Flowers, DTD, CUB, Cars, Pets) and the average.
LLaVA + CLIP 89.69 57.72 55.24 15.90 35.37 47.16 75.03 24.69 6.22 29.43 52.80 44.48 35.20
CLIP[34] 48.86 18.70 28.44 0.68 9.23 6.94 41.02 8.48 2.51 17.85 8.73 17.40 14.01
Menon&Vondrick[25] 49.35 17.93 29.74 0.60 10.43 7.05 43.89 7.67 2.84 19.12 9.64 18.02 14.12
CuPL[32] 50.16 18.98 29.66 0.71 9.89 8.22 43.95 8.84 2.91 19.73 10.51 18.51 14.14
CC3M
CLIP* 46.99 18.49 29.76 0.52 8.40 6.62 42.56 8.29 3.36 18.70 10.01 17.62 14.04
CLIP* + Menon&Vondrick[25] 49.37 17.98 29.94 0.62 10.55 7.14 44.02 8.38 3.51 19.23 10.24 18.27 14.16
CLIP* + CuPL[32] 50.24 18.86 30.12 0.74 10.14 8.06 43.78 8.95 3.32 19.56 10.77 18.59 14.14
GRAIN (Ours) 65.86 35.20 38.07 1.34 17.24 14.15 65.20 13.24 5.47 24.96 16.18 27.00 23.34
CLIP [34] 71.24 36.66 48.84 4.57 19.28 42.06 70.09 20.51 7.63 31.84 40.94 35.79 34.66
Menon&Vondrick [25] 72.68 37.08 48.59 5.12 18.45 41.38 72.29 21.15 8.27 31.36 41.20 36.14 34.32
CuPL [32] 72.85 37.37 49.06 4.88 18.71 41.58 71.17 22.82 7.94 30.28 40.89 36.15 34.65
CC12M
CLIP* 70.07 35.63 50.42 4.31 18.35 39.40 74.24 21.04 7.96 32.03 41.36 35.89 33.51
CLIP* + Menon&Vondrick [25] 72.74 37.44 51.20 5.31 18.47 41.74 74.44 21.22 8.32 32.72 41.92 36.87 34.50
CLIP* + CuPL [32] 72.77 37.85 51.08 5.12 18.98 41.14 74.22 22.68 8.05 32.34 41.65 36.90 34.77
GRAIN (Ours) 81.40 46.23 55.26 8.42 25.68 48.76 81.49 26.27 10.28 36.76 45.39 42.36 41.46
Table 2. Results (Recall@k) on zero-shot image-to-text and text-to-image retrieval tasks on MS-COCO and Flickr30k.
MS-COCO Flickr30k
Data Model Image-to-Text Text-to-Image Image-to-Text Text-to-Image
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
CLIP 15.79 38.26 50.70 13.58 33.76 46.04 27.00 53.80 66.30 21.78 44.26 55.10
CC3M
GRAIN 38.26 65.96 77.03 28.81 55.86 69.00 59.90 81.80 88.40 42.82 68.21 76.54
∆ +22.47 +27.70 +26.33 +15.23 +22.10 +22.96 +32.90 +28.00 +22.10 +21.04 +23.95 +21.44
CLIP 41.32 69.40 80.04 30.02 57.32 69.65 59.60 84.70 89.90 43.63 68.75 76.77
CC12M
GRAIN 58.30 83.07 89.67 42.66 70.77 80.83 78.00 94.60 97.80 59.36 80.01 85.59
∆ +16.98 +13.67 +9.63 +12.64 +13.45 +11.18 +18.40 +9.90 +7.90 +15.73 +11.26 +8.82
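For reference, the Recall@k values in Table 2 follow the standard protocol; a minimal sketch for image-to-text retrieval is shown below (text-to-image is the transpose analogue). It assumes a precomputed similarity matrix and, for MS-COCO/Flickr30k, a set of ground-truth caption indices per image.

```python
# Minimal sketch of image-to-text Recall@k from a similarity matrix.
# sims: (num_images, num_texts); gt[i]: indices of the captions that
# belong to image i (five per image for MS-COCO and Flickr30k).
import torch

def recall_at_k(sims: torch.Tensor, gt: list, k: int) -> float:
    topk = sims.topk(k, dim=1).indices
    hits = [bool(set(topk[i].tolist()) & set(gt[i])) for i in range(sims.size(0))]
    return 100.0 * sum(hits) / len(hits)
```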
outputs and ground truth boxes obtained via the Hungarian Matching algorithm in the last step to determine ground-truth descriptions for each predicted region output embedding. These matched ground truths are considered positive pairings, while all other pairings within the batch are treated as negatives for InfoNCE. Optimizing for this loss enables our model to learn fine-grained associations between rich textual descriptions and salient image regions that contain discriminative visual features. Overall, the final objective function is an equally weighted combination of three components.
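Based on the losses described above (the global image-caption contrastive term, the region-description alignment term, and the bounding-box prediction losses), and with symbols we introduce here for readability, the equally weighted objective can be sketched as:

```latex
% Equal-weight sum of the three training losses; symbol names are ours.
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{global}} \;+\; \mathcal{L}_{\mathrm{region\text{-}desc}} \;+\; \mathcal{L}_{\mathrm{box}}
```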
image and their associations with corresponding textual descriptions. Although the focus of our method is on visual recognition, we observe that our learned representations are of high quality through experiments on cross-modal retrieval benchmarks. We compare against CLIP as our primary baseline, along with recent works like Menon & Vondrick [25] and CuPL [32] that also leverage complementary information from foundation models to improve upon CLIP. We train all CLIP-based baselines from scratch under the same training conditions and evaluate all approaches with a zero-shot evaluation protocol.
Table 3. We report top-1 accuracy (%) for zero-shot attribute-based classification. This is a challenging task as indicated by the results.
Columns: Data, Model, then per-dataset top-1 accuracy (Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Flowers, DTD, CUB, Cars, Pets) and the average.
CLIP 24.20 7.30 13.65 0.75 6.86 3.43 24.68 1.90 1.79 8.93 5.04 8.97 4.53
CC3M GRAIN (Ours) 46.06 18.20 20.02 0.95 14.57 4.87 45.82 2.34 1.72 13.06 7.63 15.93 7.87
∆ +21.86 +10.90 +6.37 +0.20 +7.71 +1.44 +21.14 +0.44 -0.07 +4.13 +2.59 +6.96 +3.34
CLIP 43.71 16.05 23.06 1.67 11.33 7.02 40.61 4.08 2.29 14.78 12.74 16.12 9.41
CC12M GRAIN (Ours) 67.39 26.29 32.46 4.21 17.61 12.38 59.09 3.66 2.72 20.39 18.29 24.04 14.53
∆ +23.68 +10.24 +9.40 +2.54 +6.28 +5.36 +18.48 -0.42 +0.43 +5.61 +5.55 +7.92 +5.12
trained under conditions similar to GRAIN. The introduction of the decoder architecture in our model results in a 22% increase in parameter count compared to CLIP. For a fairer comparison, we report numbers for CLIP by leveraging the same architecture as GRAIN but with the localization modules turned off. This baseline is reported as CLIP* throughout the paper. Additionally, we report the performance of the LLaVA v1.6 model to benchmark our model's performance against a state-of-the-art MLLM. Open-ended MLLMs like LLaVA are known to struggle with fine-grained visual recognition [51]. Hence, we propose a new inference strategy to evaluate LLaVA on classification tasks, providing a stronger baseline. Specifically, we first prompt LLaVA to predict a category for an image. Due to its open-ended nature, we cannot directly determine if the generated answer matches the ground truth. To address this, we use a pretrained CLIP text encoder to map LLaVA's generated answer to the closest category within the dataset's vocabulary. This mapped category is then used as the prediction to compute the top-1 accuracy. We refer to this baseline as LLaVA + CLIP in Table 1, representing a stronger and improved baseline over LLaVA alone. Despite LLaVA possessing orders of magnitude more parameters and being trained on billion-scale datasets, our method manages to surpass its performance, which shows that our improvements emerge from careful modeling decisions rather than a simple increase in data volume or model size.
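A rough sketch of this inference strategy is given below. Generating the LLaVA answer itself uses dataset-specific prompts (see Table B in the Appendix); the CLIP checkpoint and code here are illustrative assumptions, not the exact evaluation script.

```python
# Sketch of the LLaVA + CLIP baseline: LLaVA's free-form answer is mapped to
# the closest class name in the dataset vocabulary with a CLIP text encoder.
# `llava_answer` is assumed to be generated beforehand (see Table B).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def map_answer_to_class(llava_answer: str, classnames: list) -> str:
    ans = model.encode_text(clip.tokenize([llava_answer], truncate=True).to(device)).float()
    cls = model.encode_text(clip.tokenize(classnames, truncate=True).to(device)).float()
    ans = ans / ans.norm(dim=-1, keepdim=True)
    cls = cls / cls.norm(dim=-1, keepdim=True)
    return classnames[(ans @ cls.t()).argmax().item()]
```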
4.2. Zero-shot image classification

We perform zero-shot classification and evaluate all models on ImageNet and 11 additional datasets encompassing common and fine-grained sets. We measure the top-1 accuracy and report results in Table 1. Our approach, GRAIN, consistently outperforms the current state-of-the-art across all settings and datasets. Specifically, GRAIN improves the zero-shot performance by as much as 9% in absolute accuracy on ImageNet and achieves similar improvements averaged across all other datasets. Notably, our method surpasses existing benchmarks by significant margins across both fine- and coarse-grained datasets, with our most substantial improvement reaching up to 22% absolute accuracy on the Caltech-101 [12] dataset within the CC3M training setting.

4.3. Cross-modal retrieval

We evaluate the pre-trained models on the task of cross-modal retrieval under the zero-shot setting. Specifically, we focus on the Image-to-Text (I2T) and Text-to-Image (T2I) retrieval tasks using the MS-COCO and Flickr30k datasets in Table 2. Our evaluations are conducted on the standard test sets for both datasets, and we report performance metrics in terms of Recall@k for k values of 1, 5, and 10. Compared to CLIP, our method achieves superior performance, with gains of up to 33%. On average, we observe improvements of 23.8% with CC3M-trained models and 12.46% with models trained on CC12M.

4.4. Zero-shot attribute-based classification

To measure image-description alignment, we design an experiment to classify images by leveraging only descriptions/attributes. This is a challenging task, as image classification is being performed devoid of class names. Toward this end, we first prompted GPT-3 using class names from the downstream dataset's vocabulary to obtain descriptions. Next, instead of the traditional approach of encoding class names and computing similarities with images, we encoded the description corresponding to the class name (omitting the class name itself) to obtain the text representation and computed similarities with images. The class corresponding to the text representation that scored the maximum similarity with the test image is considered the prediction for that image. We compute top-1 accuracy as usual and report it for all datasets in Table 3. From Table 3, we observe that our model is able to achieve strong improvements
Table 4. Ablation studies on our CC3M trained model reporting top-1 accuracy (%)
Columns: Setting, then per-dataset top-1 accuracy (Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Flowers, DTD, CUB, Cars, Pets) and the average.
GRAIN 65.86 35.20 38.07 1.34 17.24 14.15 65.20 13.24 5.47 24.96 16.18 27.00 23.34
– Region-description loss 58.21 27.07 35.28 1.01 14.20 9.18 58.86 9.13 3.52 22.31 13.05 22.89 18.73
– Box loss 57.06 26.17 34.38 0.93 14.67 8.87 56.91 8.31 3.20 21.35 13.12 22.27 17.54
– MLLM-caption 47.24 19.92 28.51 0.70 8.78 7.04 43.95 8.20 2.99 20.06 9.01 17.85 14.56
– Menon&Vondrick [25] 46.99 18.49 29.76 0.52 8.40 6.62 42.56 8.29 3.36 18.70 10.01 17.62 14.04
over CLIP, demonstrating closer image-description alignment. On average, we achieve an improvement of 6-7% over CLIP.
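A sketch of this attribute-only protocol is shown below; how the per-class text representation is aggregated from multiple descriptions is our assumption (we average them here), and the checkpoint is illustrative.

```python
# Sketch of attribute-based zero-shot classification: only GPT-3 descriptions
# are encoded, with the class name itself omitted from every prompt.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def predict_from_attributes(pil_image, classnames, descriptions):
    img = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device)).float()
    img = img / img.norm(dim=-1, keepdim=True)
    class_feats = []
    for c in classnames:
        toks = clip.tokenize(descriptions[c], truncate=True).to(device)  # no class name
        txt = model.encode_text(toks).float()
        txt = txt / txt.norm(dim=-1, keepdim=True)
        class_feats.append(txt.mean(dim=0))                              # one vector per class
    cls = torch.stack(class_feats)
    cls = cls / cls.norm(dim=-1, keepdim=True)
    return classnames[(img @ cls.t()).argmax().item()]
```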
4.5. Recognizing Novel Examples

It is desirable for open-vocabulary models to generalize to novel, unseen examples at test time without requiring re-training. Zero-shot learning methods often utilize auxiliary information, such as attributes, for classifying unknown entities. Hence, our approach aims to recognize these concepts by leveraging LLM-generated descriptions. In this experiment, we aim to test our model's ability to recognize novel entities that were absent from the training distribution. Toward this end, we collect 1500 images of products launched after 2023, manually filter these images for quality control, and label them into 27 novel categories to form a new benchmark dataset. We call this the Products-2023 dataset. These concepts are absent from our model's training distribution, making them novel. We provide additional details on this dataset in the Appendix. We evaluate our model along with CLIP and LLaVA on this dataset in Table 5, which demonstrates superior results achieved by our approach against CLIP and even against the much larger LLaVA model, confirming the efficacy of our approach in recognizing novel samples.

Table 5. Accuracy (%) on Products-2023.
CLIP 33.65
LLaVA 42.08
GRAIN 45.24

4.6. Ablations

To assess the importance of the different components in GRAIN, we conduct four ablation experiments. We restrict to models trained on CC3M due to computational constraints. The outcomes of these ablations are reported as top-1 accuracy in Table 4.

Ablating the region-description alignment loss. This component is pivotal to our framework, as removing it causes a significant accuracy decline of 5% on all datasets on average. This considerable decrease underscores the vital role of this loss in establishing fine-grained correspondences between salient image regions and their descriptions.

Ablating the localization loss. Further removing the bounding box prediction losses from our training regime leads to a modest performance drop. This loss is instrumental in identifying and predicting salient regions within the image and, in conjunction with the alignment loss, is crucial to developing fine-grained visual understanding.

Ablating the role of the MLLM caption during training. We employ captions generated by LLaVA as a form of text-level data augmentation during training, alternating between these and the original image captions. The MLLM-generated caption provides a high-level visual summary of the image, proving to be significant for training, as indicated by a 3% decrease in performance upon its removal.

Ablating the role of test-time descriptions. In line with the approach of Menon & Vondrick [25], we utilize descriptions generated by GPT-3 to enrich class names during zero-shot classification. Excluding these augmented descriptions results in a minor performance reduction, suggesting that while beneficial, our model's performance is not reliant on these test-time descriptions.

5. Conclusion

In this paper, we propose a new pre-training method for contrastive vision-language models. Specifically, we hypothesize that many of the current limitations of CLIP stem from its image-level contrastive pre-training, which neglects fine-grained alignment. As a result, we propose to leverage Multimodal Large Language Models (LLaVA) and Open-Vocabulary Object Detectors (OWLv2) to automatically generate weak supervision to drive a more fine-grained pre-training process. We demonstrate superior performance across 11 different classification datasets, including ones containing fine-grained and novel examples, as well as additional tasks such as cross-modal retrieval. Our results show
significant improvement over the state of the art, including by up to 9% in absolute top-1 accuracy for zero-shot classification and 25% on retrieval. Our method can even outperform LLaVA, which has over 13B parameters (compared to our ∼170M) and was trained on billions of data points.

References

[1] Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, and Jovana Mitrović. Improving fine-grained understanding in image-text pre-training, 2024.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[3] Kaylee Burns, Zach Witzel, Jubayer Ibn Hamid, Tianhe Yu, Chelsea Finn, and Karol Hausman. What makes pre-trained visual representations successful for robust manipulation?, 2023.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020.
[5] Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, and Hilde Kuehne. What, when, and where? – self-supervised spatio-temporal grounding in untrimmed multi-action videos from narrated instructions, 2023.
[6] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning, 2020.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[8] Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, and Elisa Ricci. Vocabulary-free image classification, 2023.
[9] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. In NeurIPS, 2023.
[12] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
[13] Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. International Conference on Computer Vision, 2023.
[14] Shih-Cheng Huang, Liyue Shen, Matthew P. Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3922–3931, 2021.
[15] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision, 2021.
[16] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr – modulated detection for end-to-end multi-modal understanding, 2021.
[17] Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng Cao. Veclip: Improving clip training via visual-enriched captions, 2024.
[18] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958, 2009.
[19] Jie Lei, Tamara L. Berg, and Mohit Bansal. Qvhighlights: Detecting moments and highlights in videos via natural language queries, 2021.
[20] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020.
[21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[23] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019.
[24] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft, 2013.
[25] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In The Eleventh International Conference on Learning Representations, 2023.
[26] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems, 36, 2024.
[27] M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, and Horst Bischof. Tap: Targeted prompting for task adaptive generation of textual training instances for visual classification, 2023.
[28] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training, 2021.
[29] OpenAI. Gpt-4v(ision) system card. OpenAI, 2023.
[30] Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280, Dublin, Ireland, 2022. Association for Computational Linguistics.
[31] Devi Parikh and Kristen Grauman. Relative attributes. In 2011 International Conference on Computer Vision, pages 503–510, 2011.
[32] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691–15701, 2023.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision.
[34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[35] Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, and Jonathon Shlens. Perceptual grouping in contrastive vision-language models, 2023.
[36] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Erix Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
[37] Zhiyuan Ren, Yiyang Su, and Xiaoming Liu. Chatgpt-powered hierarchical comparisons for image classification, 2023.
[38] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[39] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[40] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[42] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Cub. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[43] Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Object-aware video-language pre-training for retrieval, 2022.
[44] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface's transformers: State-of-the-art natural language processing, 2020.
[45] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5542–5551, 2018.
[46] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training, 2021.
[47] Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, and Lingpeng Kong. Zerogen: Efficient zero-shot learning via dataset generation, 2022.
[48] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?, 2023.
[49] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023.
[50] Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. Helping hands: An object-aware ego-centric video recognition model, 2023.
[51] Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. Why are visually-grounded language models bad at image classification?, 2024.
[52] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Appendix

A. Additional Details on Related Work

The scope of our work spans several domains, including Vision-Language Representation Learning, Fine-grained Visual Recognition, Visual Grounding, and Open-world/Zero-shot Learning. Since the related works section in the main paper cannot adequately cover all of these areas, we provide a more comprehensive summary in this supplementary material.

Contrastive Language-Image Pretraining. Methods like CLIP [34] and ALIGN [15] leverage large internet-scraped datasets of image-text pairs to learn a joint representation by contrastively aligning the two modalities. The objective of these methods is to pull together image and text representations that are semantically similar and push apart dissimilar pairs. These works employ a dual-encoder approach, separately encoding representations for images and text. The learned representations are effective for various downstream vision and language tasks. Follow-up works [28, 49] in this area focus on improving downstream performance by incorporating self-supervision or using other objective functions during pretraining. However, aligning representations at a global (whole image or caption) level is known to only learn coarse-grained features and discard fine-grained visual information. Acknowledging this problem, FILIP [46] introduces a cross-modal late interaction mechanism that utilizes a token-wise maximum similarity between image and text tokens to drive the contrastive objective. In the medical domain, GLoRIA [14] proposes an attention-based framework that uses text tokens to attend to sub-image regions and learns local and global representations. Concurrent to our work, SPARC [1] proposes using a sparse similarity metric between image patches and text tokens to learn fine-grained alignment. Our paper shares a motivation with these works in terms of aiming to learn fine-grained representations. However, unlike these methods, we address the fact that image-caption datasets like Conceptual Captions [39] or LAION [38] contain noisy captions that lack descriptive information, thereby limiting the gains that such fine-grained region-token matching objectives can achieve. Secondly, our approach focuses on learning visual representations that are able to leverage complementary information at test time (in the form of LLM-generated descriptions as proposed by [25, 32]) to recognize fine-grained or novel entities. Finally, in principle, these methods are orthogonal to our contributions and can be coupled with our method.

Zero-shot Learning with CLIP. In image classification, zero-shot learning methods aim to recognize novel entities that were not seen during training. Relevant to our work, Menon & Vondrick [25] leverage category descriptions generated from a Large Language Model (LLM) as auxiliary information to augment the zero-shot performance of CLIP. On similar lines, CuPL [32] and Ren et al. [37] use LLMs to generate descriptions in the form of long, cohesive sentences or via nuanced, hierarchy-aware comparisons. TAP [27] learns a text classifier mapping descriptions to categories during training, which is used to map from images to categories at test time. Different from these works, our method aims to improve alignment between images and descriptions, which would further bolster the efficacy of using descriptions at test time.

Object-aware Vision-Language Pretraining. Encouraging object-oriented representations within a vision-language pretraining objective [3, 5, 6, 20, 23, 43, 50] has been shown to facilitate learning of robust models that can positively impact downstream performance across a variety of tasks in vision-language, video understanding, and embodied AI. Many of these approaches follow the DETR line of works [4, 16, 19] that introduce a query-transformer backbone for detection and grounding. We take inspiration from these works to develop our architecture for encoding visual information. However, our approach only uses the grounding task as an auxiliary objective to distill information from local regions to global representations. We leverage our synthetic descriptions to supervise this grounding module, which is then disabled during evaluation as detailed earlier.

Universal Visual Recognition. Recent works [8, 13] introduce the problem of universal visual recognition or vocabulary-free image classification, where the motivation is to free models like CLIP from a constrained vocabulary, thereby allowing classification from an unrestricted set of concepts. Corroborating our claims, these works observe limitations of CLIP in recognizing novel examples and fine-grained entities. They formalize this problem and introduce retrieval-based methods as an initial step towards a solution.

Multimodal Large Language Models. MLLMs like LLaVA [21], GPT-4V [29], and MiniGPT-4 [52] integrate image tokens into LLMs, leveraging their powerful reasoning capabilities. MLLMs have been found useful in tasks such as scene understanding [36], story-telling [9], etc., where a comprehensive understanding of the images and text is required. We leverage their ability for visual comprehension to generate a set of descriptions for an input image that are used to supervise our fine-grained losses during training.

Zero-shot Learning for Images. Zero-shot learning (ZSL) is a challenging problem that requires methods to recognize object categories not seen during training. Various approaches [18, 31] have proposed using side information like attributes, hierarchical representations, etc., to learn a generalizable mapping. More recent efforts [45, 47]
Table A. Zero-shot top-1 accuracy (%) of different methods using the ViT-L/14 backbone.
Columns: Data, Model, then per-dataset top-1 accuracy (Caltech-101, CIFAR-100, CIFAR-10, ImageNet, Places365, Food101, SUN397, Flowers, DTD, CUB, Cars, Pets) and the average.
LLaVA + CLIP 89.69 57.72 55.24 15.90 35.37 47.16 75.03 24.69 6.22 29.43 52.80 44.48 35.20
CLIP [34] 73.36 38.06 49.96 4.59 21.84 43.98 71.79 22.01 7.72 33.16 42.25 37.15 36.72
Menon&Vondrick [25] 73.74 38.48 50.05 5.22 22.04 44.33 72.56 22.10 8.28 33.78 43.04 37.60 36.84
CC12M
CuPL [32] 73.53 38.55 50.46 5.14 21.96 43.28 73.08 23.12 8.65 32.48 42.96 37.56 37.05
GRAIN (Ours) 81.62 44.98 55.82 9.12 27.66 52.98 82.05 28.18 12.73 37.34 46.92 43.58 42.68
explore the use of generative models to synthesize useful features for unseen categories. Our method aligns more closely with the former, as we learn a fine-grained correspondence conducive to zero-shot classification by leveraging descriptions as side information.

B. Implementation Details

All baselines reported in the main paper (except LLaVA) utilize a ViT-B/16 architecture as the vision encoder. In Table A, we report results using the ViT-L/14 architecture trained on CC12M. For encoding text, we utilize a 12-layer transformer network as used with CLIP [34]. The outputs from the vision encoder are 768-dimensional and are projected to 512 dimensions. The output embeddings obtained from the decoder are also passed through projection layers: one projection layer is shared between all region output embeddings, and a separate projection layer is used for the image output embedding. Similarly, the text-encoder output is projected to the same 512-dimensional size. Additionally, a two-layer MLP with output size 4 regresses the bounding boxes, conditioned on the region output embeddings. The supervision for bounding boxes is obtained through the OWLv2 detector, which originally operates on 960 × 960 resolution images; these box annotations are downscaled to 224 × 224 to match the input resolution of our model. While generating these bounding box annotations from OWLv2, we use a confidence threshold value of 0.3.
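The projection and box-regression heads described above can be sketched as follows; the dimensions follow the text, while the module names, the sigmoid on the box output, and the hidden width of the box MLP are our assumptions.

```python
# Sketch of the projection / box-regression heads described above.
# Dimensions follow the text; module names and the sigmoid are assumptions.
import torch
import torch.nn as nn

class ProjectionHeads(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(vis_dim, embed_dim)   # global image embedding
        self.region_proj = nn.Linear(vis_dim, embed_dim)  # shared by all region queries
        self.text_proj = nn.Linear(txt_dim, embed_dim)    # captions and descriptions
        self.box_head = nn.Sequential(                    # two-layer MLP -> (cx, cy, w, h)
            nn.Linear(vis_dim, vis_dim),
            nn.ReLU(),
            nn.Linear(vis_dim, 4),
            nn.Sigmoid(),                                 # normalized box coordinates
        )

    def forward(self, image_emb, region_embs, text_emb):
        # image_emb: (B, vis_dim); region_embs: (B, Q, vis_dim); text_emb: (B, txt_dim)
        return (
            self.image_proj(image_emb),
            self.region_proj(region_embs),
            self.text_proj(text_emb),
            self.box_head(region_embs),
        )
```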
C. Additional Results and Baseline Analysis

In Table A, we report the performance of GRAIN and competing baselines when using the ViT-L/14 transformer backbone. It shows that our approach is able to consistently outperform all baselines under various Vision Transformer backbones.

CLIP. We train the same ViT variant on the same Conceptual Captions (CC3M and CC12M) datasets as our GRAIN model. For performing zero-shot testing on all reported datasets, we use the handcrafted prompts specific to each dataset as introduced in the official codebase [34]. These hand-engineered prompts improve the zero-shot performance of CLIP beyond the vanilla A photo of {classname} style prompts.

CLIP*. With the introduction of the decoder and bounding-box modules, our method, GRAIN, uses ∼22% more parameters compared to CLIP. For a fairer comparison in terms of the number of parameters, we report performance for CLIP by using the same architecture as ours, but with the localization modules turned off. We refer to this baseline as CLIP*.

Menon & Vondrick. We leverage the official codebase [25] to report performance for this baseline. In the main paper, we implement this baseline on top of the CLIP method as per the norm.

CuPL. Similarly, we implement the CuPL baseline leveraging the official code [32] and report performance with CLIP and CLIP* in the main paper and in Table A, respectively. CuPL shows a similar trend of improving over CLIP baselines but trailing behind our method.

LLaVA + CLIP. We use a pretrained LLaVA v1.6 checkpoint from HuggingFace [44] that is composed of a ViT-L/14 vision encoder and a Vicuna-13B LLM. The vision and text encoders of LLaVA have been separately pretrained on billion-scale datasets and conjoined through a projection layer. LLaVA has been trained through multiple stages on a specialized set of ∼150k instructions. Being a generative model, LLaVA is asked to predict a category for an image using prompts specific to each dataset as described in Table B. Next, we use a pretrained CLIP text encoder to map the answer generated by LLaVA to the closest category in the vocabulary of the dataset being evaluated on. We use this mapped category as the prediction to compute the top-1 accuracy as usual. We call this method LLaVA + CLIP. Observing Table A, our approach is able to reach
Figure A. Attention maps show more effective object localization by our model compared to CLIP.
Figure B. Visualization of top-5 predictions of our model on novel entities alongside [25]. Our method consistently identifies the ground truth class as the top prediction.
Figure C. Localization and region-description matching predictions made by our model on images from ImageNet.
Figure D. Sample annotations generated using our two-stage LLaVA prompting scheme followed by OWLv2 localization.
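A rough sketch of this two-stage scheme is shown below. The HuggingFace checkpoint names, the prompt, and the line-based parsing are illustrative assumptions and not the exact pipeline used for the released annotations; the 0.3 confidence threshold for OWLv2 follows the implementation details above.

```python
# Rough sketch of the two-stage annotation engine: (1) prompt LLaVA-1.6 for
# region-level descriptions, (2) ground each description with OWLv2.
# Checkpoints, prompt, and parsing are illustrative, not the exact pipeline.
import torch
from PIL import Image
from transformers import (LlavaNextForConditionalGeneration, LlavaNextProcessor,
                          Owlv2ForObjectDetection, Owlv2Processor)

llava_id = "llava-hf/llava-v1.6-vicuna-13b-hf"
owl_id = "google/owlv2-base-patch16-ensemble"
llava_proc = LlavaNextProcessor.from_pretrained(llava_id)
llava = LlavaNextForConditionalGeneration.from_pretrained(
    llava_id, torch_dtype=torch.float16, device_map="auto")
owl_proc = Owlv2Processor.from_pretrained(owl_id)
owl = Owlv2ForObjectDetection.from_pretrained(owl_id)

def annotate(image: Image.Image):
    # Stage 1: ask LLaVA for high-level descriptions of salient regions.
    prompt = ("USER: <image>\nList the salient objects or regions in this image, "
              "one short description per line. ASSISTANT:")
    inputs = llava_proc(images=image, text=prompt, return_tensors="pt").to(llava.device)
    out = llava.generate(**inputs, max_new_tokens=256)
    answer = llava_proc.decode(out[0], skip_special_tokens=True).split("ASSISTANT:")[-1]
    descriptions = [d.strip("-• ").strip() for d in answer.splitlines() if d.strip()]

    # Stage 2: localize each description with OWLv2 (confidence threshold 0.3).
    owl_inputs = owl_proc(text=[descriptions], images=image, return_tensors="pt")
    with torch.no_grad():
        owl_out = owl(**owl_inputs)
    sizes = torch.tensor([image.size[::-1]])
    det = owl_proc.post_process_object_detection(owl_out, threshold=0.3, target_sizes=sizes)[0]
    return [(descriptions[i], box.tolist())
            for i, box in zip(det["labels"].tolist(), det["boxes"])]
```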
Table B. Prompts to LLaVA for the zero-shot visual recognition task in Table A.
Dataset Prompt
DTD Fill in the blank: this is a photo of a {} texture
Pets What animal is in the image? Be specific about the breed. Fill in the blank: this is a
photo of a {}
Places365 What place is this in the image? Fill in the blank: this is a photo of a {}
Food101 What food is in the image? Fill in the blank: this is a photo of a {}
Cars What type of car is in the image? Be specific about the make and year. Fill in the
blank: this is a photo of a {}
Others Fill in the blank: this is a photo of a {}
failure cases in Figure F.

Row 1 includes partially successful cases, where the model localizes descriptions but the bounding boxes are either slightly off the mark or do not cover all instances of that description in the image.

Row 2 includes examples where the model either cannot localize a single description in the image or incorrectly associates the description with another region in the image (the description typically orange or brown refers to the basketball but was incorrectly assigned to the jersey of the player, which has a similar color).

Row 3 includes cases of hallucination, where the model localizes descriptions that are not present in the image.

J. Limitations and Broader Impact

Limitations. Our method achieves substantial gains over CLIP and other baselines on zero-shot transfer tasks such as image classification, attribute-based image classification, and cross-modal retrieval. These improvements can be attributed to the fine-grained region-to-description associations learned by our model during the training process. However, learning these correspondences requires annotations in the form of descriptions and bounding box localizations, which are computationally expensive to obtain. As mentioned earlier, our annotation scheme demands significant GPU resources and can take long hours for large datasets. Additionally, since we do not filter or curate these annotations, some descriptions or captions may be misaligned or inaccurate and thus may not provide the correct signal during the learning process. Future work could explore the use of efficient models to generate annotations as well as a filtering mechanism to ensure all generated text and bounding boxes are correctly aligned with the semantic content of the image.

Broader Impact. We propose a strategy to learn fine-grained image-text correspondences without requiring additional human annotations. Our approach leverages weak supervision from Multimodal Large Language Models (MLLMs) to train a region-aware model that strongly outperforms CLIP across several tasks and datasets. Despite having significantly fewer parameters and lower training costs, our approach matches and sometimes even outperforms LLaVA, a state-of-the-art MLLM, on zero-shot visual recognition. Although obtaining these annotations is computationally expensive, once acquired, our approach can be viewed as enabling the training of smaller models with small-scale datasets to achieve performance equivalent to a large model trained on extensive data, potentially making Vision-Language Model (VLM) training more accessible. Further integration of our approach with retrieval-based systems and multimodal LLMs is an interesting future direction.
Figure E. Qualitative comparison between one-stage (middle) and two-stage (right) LLaVA-based annotation schemes.
Figure F. Visualization of failure modes from our grounding module on ImageNet-1K.